hive - HDFS上的数据导入到Hive中 - 《大数据相关知识》

1.首先下载测试数据，数据也可以创建
2. 数据类型与字段名称
3. 先把数据存放到HDFS上
4. 创建movie_table和rating_table
5. 执行
6. 查看
7. 生成新表(行为表)
8. 把Hive表数据导入到本地
9. 把Hive表数据导入到HDFS上
10. 把本地数据导入到Hive表中
11. 把HDFS上的数导入到HIve表中
12.导入hive表中，需要分区的

1.首先下载测试数据，数据也可以创建

http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

2. 数据类型与字段名称

movies.csv（电影元数据）
movieId,title,genres
ratings.csv（用户打分数据）
userId,movieId,rating,timestamp

3. 先把数据存放到HDFS上

hdfs dfs -mkdir /hive_operate
hdfs dfs -mkdir /hive_operate/movie_table
hdfs dfs -mkdir /hive_operate/rating_table
hdfs dfs -put movies.csv /hive_operate/movie_table
hdfs dfs -put ratings.csv /hive_operate/rating_table

4. 创建movie_table和rating_table

$ cat create_movie_table.sql 
create external table movie_table  -- 创建外表
(
  movieId STRING,
  title STRING,
  genres STRING
)
row format delimited fields terminated by ','
stored as textfile
location '/hive_operate/movie_table'; -- 指定hdfs的目录结构
$ cat create_rating_table.sql
create external table rating_table
(
  userId STRING,
    movieId STRING,
    rating STRING,
    ts STRING
)
row format delimited fields terminated by ','
stored as textfile
location '/hive_operate/rating_table';
其中字段名为timestamp为hive的保留字段，执行的时候会报错，需用反引号或者修改字段名，
我这边修改的字段名

5. 执行

在命令行执行该语句

6. 查看

hive> show tables;
OK
movie_table
rating_table
hive> select * from rating_table limit 10;
OK
1    31    2.5    1260759144
1    1029    3.0    1260759179
1    1061    3.0    1260759182
1    1129    2.0    1260759185
1    1172    4.0    1260759205
1    1263    2.0    1260759151
1    1287    2.0    1260759187
1    1293    2.0    1260759148
1    1339    3.5    1260759125
1    1343    2.0    1260759131

7. 生成新表(行为表)

create table behavior_table as
select B.userid, A.movieid, B.rating, A.title
from movie_table A
join rating_table B
on A.movieid == B.movieid;

8. 把Hive表数据导入到本地

导入到本地中，语句中多了local

table->local file
insert overwrite local directory '/root/hive_test/1.txt' select * from behavior_table;

9. 把Hive表数据导入到HDFS上

table->hdfs file
insert overwrite directory '/root/hive_test/1.txt' select * from behavior_table;

10. 把本地数据导入到Hive表中

命令行中由LOCAL字段信息

local file -> table
LOAD DATA LOCAL INPATH '/root/hive_test/a.txt' OVERWRITE INTO TABLE behavior_table;

11. 把HDFS上的数导入到HIve表中

hdfs file -> table
LOAD DATA INPATH '/a.txt' OVERWRITE INTO TABLE behavior_table;

12.导入hive表中，需要分区的

LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/classmem_Misli.txt'  -- 文件名
INTO TABLE classmem  -- 表名
partition (teacher = 'Mis.li')  -- 分区名