1.首先下载测试数据,数据也可以创建

  1. http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

2. 数据类型与字段名称

  1. movies.csv(电影元数据)
  2. movieId,title,genres
  3. ratings.csv(用户打分数据)
  4. userId,movieId,rating,timestamp

3. 先把数据存放到HDFS上

  1. hdfs dfs -mkdir /hive_operate
  2. hdfs dfs -mkdir /hive_operate/movie_table
  3. hdfs dfs -mkdir /hive_operate/rating_table
  4. hdfs dfs -put movies.csv /hive_operate/movie_table
  5. hdfs dfs -put ratings.csv /hive_operate/rating_table

4. 创建movie_table和rating_table

  1. $ cat create_movie_table.sql
  2. create external table movie_table -- 创建外表
  3. (
  4. movieId STRING,
  5. title STRING,
  6. genres STRING
  7. )
  8. row format delimited fields terminated by ','
  9. stored as textfile
  10. location '/hive_operate/movie_table'; -- 指定hdfs的目录结构
  11. $ cat create_rating_table.sql
  12. create external table rating_table
  13. (
  14. userId STRING,
  15. movieId STRING,
  16. rating STRING,
  17. ts STRING
  18. )
  19. row format delimited fields terminated by ','
  20. stored as textfile
  21. location '/hive_operate/rating_table';
  22. 其中字段名为timestamphive的保留字段,执行的时候会报错,需用反引号或者修改字段名,
  23. 我这边修改的字段名

5. 执行

  1. 在命令行执行该语句

6. 查看

  1. hive> show tables;
  2. OK
  3. movie_table
  4. rating_table
  5. hive> select * from rating_table limit 10;
  6. OK
  7. 1 31 2.5 1260759144
  8. 1 1029 3.0 1260759179
  9. 1 1061 3.0 1260759182
  10. 1 1129 2.0 1260759185
  11. 1 1172 4.0 1260759205
  12. 1 1263 2.0 1260759151
  13. 1 1287 2.0 1260759187
  14. 1 1293 2.0 1260759148
  15. 1 1339 3.5 1260759125
  16. 1 1343 2.0 1260759131

7. 生成新表(行为表)

  1. create table behavior_table as
  2. select B.userid, A.movieid, B.rating, A.title
  3. from movie_table A
  4. join rating_table B
  5. on A.movieid == B.movieid;

8. 把Hive表数据导入到本地

导入到本地中,语句中多了local

  1. table->local file
  2. insert overwrite local directory '/root/hive_test/1.txt' select * from behavior_table;

9. 把Hive表数据导入到HDFS上

  1. table->hdfs file
  2. insert overwrite directory '/root/hive_test/1.txt' select * from behavior_table;

10. 把本地数据导入到Hive表中

命令行中由LOCAL字段信息

  1. local file -> table
  2. LOAD DATA LOCAL INPATH '/root/hive_test/a.txt' OVERWRITE INTO TABLE behavior_table;

11. 把HDFS上的数导入到HIve表中

  1. hdfs file -> table
  2. LOAD DATA INPATH '/a.txt' OVERWRITE INTO TABLE behavior_table;

12.导入hive表中,需要分区的

  1. LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/classmem_Misli.txt' -- 文件名
  2. INTO TABLE classmem -- 表名
  3. partition (teacher = 'Mis.li') -- 分区名