Environment startup

Prerequisites: node02 has apache-hive-2.3.7-bin.tar.gz and the mysql-connector-java driver installed; Hadoop HDFS and MapReduce must be started before Hive.

  node02~04: ./zkServer.sh start
  node01:    hadoop-daemon.sh start journalnode
  node01:    hadoop-daemon.sh start namenode
  node02:    hdfs namenode -bootstrapStandby
  node01:    start-dfs.sh
  node01:    start-yarn.sh
  node03~04: yarn-daemon.sh start resourcemanager
  node03:    hive --service metastore
  node04:    hive

I: DDL

1: Database operations

  -- show all databases
  show databases;
  -- switch to a database
  use database_name;
  /* Create a database
  CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];
  */
  create database test;
  /*
  Drop a database
  DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
  */
  drop database database_name;

2: Explanation

Note: when you enter the Hive CLI and start writing SQL without selecting a database, all tables live in the default database, whose tables are stored directly under Hive's default warehouse path on HDFS. When you create a database, Hive creates a database_name.db directory under that default path, and all of that database's tables are stored inside database_name.db.
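
A minimal sketch of the resulting HDFS layout, assuming the stock warehouse path (hive.metastore.warehouse.dir defaults to /user/hive/warehouse); t1 is an illustrative table name:

  use test;
  create table t1(id int);   -- illustrative table
  -- database directory: /user/hive/warehouse/test.db
  -- table data files:   /user/hive/warehouse/test.db/t1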

3: Basic operations (CREATE TABLE syntax)

  /*
  Creating a table
  Basic syntax:
  CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...) -- (Note: Available in Hive 0.10.0 and later)
  ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
  [STORED AS DIRECTORIES]]
  [
  [ROW FORMAT row_format]
  [STORED AS file_format]
  | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)] -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement]; -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

  CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

  Complex data types:
  data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type -- (Note: Available in Hive 0.7.0 and later)

  Primitive data types:
  primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale) -- (Note: Available in Hive 0.13.0 and later)
  | DATE -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR -- (Note: Available in Hive 0.12.0 and later)
  | CHAR -- (Note: Available in Hive 0.13.0 and later)

  array_type
  : ARRAY < data_type >

  map_type
  : MAP < primitive_type, data_type >

  struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

  union_type
  : UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)

  Row format specification:
  row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
  [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
  [NULL DEFINED AS char] -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

  File formats:
  file_format:
  : SEQUENCEFILE
  | TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE -- (Note: Available in Hive 0.6.0 and later)
  | ORC -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET -- (Note: Available in Hive 0.13.0 and later)
  | AVRO -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

  Table constraints:
  constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
  [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE]
  */

4: Creating a Hive table with custom delimiters

  create table psn2
  (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
  )
  row format delimited
  fields terminated by ','
  collection items terminated by '-'
  map keys terminated by ':';

  -- sample data file:
  vi /root/data/data
  1,小明1,lol-book-movie,beijing:mashibing-shanghai:pudong
  2,小明2,lol-book-movie,beijing:mashibing-shanghai:pudong
  3,小明3,lol-book-movie,beijing:mashibing-shanghai:pudong

  load data local inpath '/root/data/data' into table psn2;
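
Once loaded, the collection columns can be addressed directly; a quick sanity check (the array index and map key below are illustrative):

  select id, name, likes[0], address['beijing'] from psn2;
  -- likes[0] returns the first array element ('lol');
  -- address['beijing'] returns the value mapped to that key ('mashibing')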

5: Creating a Hive table with the default delimiters (^A, ^B, ^C)

  create table psn3
  (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
  )
  row format delimited
  fields terminated by '\001'
  collection items terminated by '\002'
  map keys terminated by '\003';
  -- or equivalently: \001, \002 and \003 are Hive's defaults, so the row format clause can be omitted
  create table psn3
  (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
  );

  -- sample data file (in vi, type ^A as Ctrl+V Ctrl+A, ^B as Ctrl+V Ctrl+B, and so on):
  vi /root/data/data3
  1^A小明1^Alol^Bbook^Bmovie^Abeijing^Cmashibing^Bshanghai^Cpudong

  load data local inpath '/root/data/data3' into table psn3;

6: Creating an external Hive table (requires the external and location keywords)

  /*
  The tables created so far (psn, psn2, psn3) are Hive managed (internal) tables, while psn4 below is an external table.
  Differences between managed and external tables:
  1. A managed table stores its data under Hive's default warehouse directory; an external table must specify its own directory at creation time.
  2. Dropping a managed table deletes both the metadata and the data; dropping an external table deletes only the metadata and leaves the data in place.
  Use cases:
  Managed tables: the table is created first and data is loaded into it afterwards; well suited for intermediate result tables.
  External tables: the table can be created before or after the data exists. In essence an external table maps an HDFS directory onto a Hive table, so it suits raw-data storage: a careless drop will not delete the underlying data.
  */
  create external table psn4
  (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
  )
  row format delimited
  fields terminated by ','
  collection items terminated by '-'
  map keys terminated by ':'
  location '/data';
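
A short sketch of the drop difference described above (behavior only; do not run this against data you still need):

  drop table psn2;  -- managed: removes the metadata AND the files under the warehouse directory
  drop table psn4;  -- external: removes only the metadata; the files under /data remain on HDFS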

7: Creating a single-partition table

  /*
  Hive partitioned tables:
  By default Hive stores all of a table's data under one HDFS directory, so retrieving the rows that match some condition means scanning the whole table, which is IO-heavy and slow. Partitioning applies divide and conquer: rows sharing certain column values are placed in their own subdirectory, and a query that filters on those values only needs to read that directory instead of scanning everything.
  */
  create table psn5
  (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
  )
  partitioned by(gender string)
  row format delimited
  fields terminated by ','
  collection items terminated by '-'
  map keys terminated by ':';

  load data local inpath '/root/data/data' into table psn5 partition(gender='man');
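
With the partition in place, filtering on the partition column prunes the scan to the matching directory (the path shown assumes the default warehouse location):

  select * from psn5 where gender='man';
  -- only .../psn5/gender=man is read, not the whole table directory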


8: Creating a multi-partition table

  /*
  Notes:
  1. Once a partitioned table is created, each partition column becomes a directory level on HDFS, so partitions appear as nested directories.
  2. When inserting into a multi-partition table you cannot supply only one partition column; every partition column must be given a value.
  3. The order of the values in the PARTITION clause does not matter; they are matched to the partition columns by name (see the sketch after this block).
  */
  create table psn6
  (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
  )
  partitioned by(gender string,age int)
  row format delimited
  fields terminated by ','
  collection items terminated by '-'
  map keys terminated by ':';

  load data local inpath '/root/data/data' into table psn6 partition(gender='man',age=12);
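
Point 3 in action: partition values match by name, so reversing their order in the PARTITION clause loads into the same layout:

  load data local inpath '/root/data/data' into table psn6 partition(age=12,gender='man');
  -- equivalent to partition(gender='man',age=12); the HDFS layout stays gender=man/age=12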


9: Adding and dropping partitions with ALTER

  -- add the partition age=11 under gender='nv' (HDFS: gender=nv/age=11)
  alter table psn6 add partition(gender='nv',age=11);
  -- supplying only one of the two partition columns fails:
  -- FAILED: ValidationFailureSemanticException partition spec {gender=nv} doesn't contain all (2) partition columns
  alter table psn6 add partition(gender='nv');
  -- drop every partition with age=12; one drop spec can match several partitions:
  -- Dropped the partition gender=man/age=12
  -- Dropped the partition gender=nv/age=12
  alter table psn6 drop partition(age=12);

10: Creating files on HDFS first, then repairing partitions in Hive

  /*
  Repairing partitions:
  With an external table you can upload data to an HDFS directory first and then create the table to map onto it. If, while uploading, you laid the data out in partition-style directories, the newly created table will return no rows: the partition metadata was never written to the metastore (MySQL). Repairing the partitions syncs that metadata into MySQL, after which the data becomes queryable. Concretely:
  */
  -- create the directories on HDFS and upload the data file
  hdfs dfs -mkdir /msb
  hdfs dfs -mkdir /msb/age=10
  hdfs dfs -mkdir /msb/age=20
  hdfs dfs -put /root/data/data /msb/age=10
  hdfs dfs -put /root/data/data /msb/age=20
  -- create the external table
  create external table psn7
  (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
  )
  partitioned by(age int)
  row format delimited
  fields terminated by ','
  collection items terminated by '-'
  map keys terminated by ':'
  location '/msb';
  -- query: no rows yet
  select * from psn7;
  -- repair the partitions
  msck repair table psn7;
  -- query: rows are returned
  select * from psn7;
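
An equivalent manual repair for the same layout, registering one partition at a time instead of running msck:

  alter table psn7 add partition(age=10) location '/msb/age=10';
  alter table psn7 add partition(age=20) location '/msb/age=20';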

II: DML

1: Inserting data

1): Loading files into tables

  create external table psn8
  (
  id int,
  name string,
  likes array<string>,
  address map<string,string>
  )
  partitioned by(age int)
  row format delimited
  fields terminated by ','
  collection items terminated by '-'
  map keys terminated by ':'
  location '/msb';
  /*
  Load a data file into a table.
  Syntax:
  LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
  LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)
  */
  -- load local data into Hive
  load data local inpath '/root/data/data' into table psn; -- /root/data/data is a local Linux path
  -- load a data file that is already on HDFS
  load data inpath '/data/data' into table psn; -- /data/data is an HDFS path
  /*
  Notes:
  1. LOAD performs no transformation of the data whatsoever.
  2. Loading from the local Linux filesystem copies the file.
  3. Loading from HDFS moves the file.
  4. LOAD also works against partitioned tables, but the partition column values must be supplied.
  */
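
The OVERWRITE variant from the syntax above replaces the table's existing files instead of appending:

  load data local inpath '/root/data/data' overwrite into table psn2;
  -- psn2 now contains only the rows from this file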

2): Inserting data into Hive Tables from queries

  create table psn9
  (
  id int,
  name string
  )
  row format delimited
  fields terminated by ','
  collection items terminated by '-'
  map keys terminated by ':';
  create table psn10
  (
  id int
  );
  /*
  Insert the result of a query into a table.
  Syntax:
  Standard syntax:
  INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
  INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
  Hive extension (multiple inserts):
  FROM from_statement
  INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
  [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
  [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
  FROM from_statement
  INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
  [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
  [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
  Hive extension (dynamic partition inserts):
  INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
  INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
  */
  -- Note: with this method the result tables must be created in advance
  -- insert query results into a result table
  INSERT OVERWRITE TABLE psn9 SELECT id,name FROM psn;
  -- multi-insert: scan psn once and feed different column sets to two tables
  from psn
  insert overwrite table psn9
  select id,name
  insert into table psn10
  select id;
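
A sketch of the dynamic-partition variant listed above. Here psn_staging is a hypothetical source table whose last two columns supply the partition values; the two set statements are required before a nonstrict dynamic insert:

  set hive.exec.dynamic.partition=true;
  set hive.exec.dynamic.partition.mode=nonstrict;
  -- psn_staging is hypothetical: (id, name, likes, address, gender, age)
  insert into table psn6 partition(gender, age)
  select id, name, likes, address, gender, age from psn_staging;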

3): Writing data into the filesystem from queries

  /*
  Write the result of a query out to the filesystem.
  Syntax:
  Standard syntax:
  INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...
  Hive extension (multiple inserts):
  FROM from_statement
  INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
  [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
  row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
  [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
  [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
  */
  -- WARNING: never point the directory at the root path; OVERWRITE would wipe every file under it
  -- export query results to a directory on HDFS
  insert overwrite directory '/result' select * from psn;
  -- export query results to a directory on the local filesystem
  insert overwrite local directory '/result' select * from psn;
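
By default the exported files use the ^A separator; adding a row format (Hive 0.11.0+) makes them human-readable. The target directory here is illustrative:

  insert overwrite local directory '/root/result'
  row format delimited fields terminated by ','
  select * from psn;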

4): Inserting values into tables from SQL

  /*
  Insert data the way a traditional relational database does. This is inefficient in Hive compared with bulk loads.
  Syntax:
  Standard Syntax:
  INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
  Where values_row is:
  ( value [, value ...] )
  where a value is either null or any valid SQL literal
  */
  -- insert a row
  insert into psn values(1,'zhangsan');
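
Per the syntax above, values_row can repeat, so several rows can go into one statement, which is cheaper than one statement per row; psn9 from the earlier example has matching (id, name) columns:

  insert into psn9 values (1,'zhangsan'), (2,'lisi');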

2: Updating and deleting data

The official documentation does state that Hive supports UPDATE and DELETE, but both require transaction support, and Hive places many restrictions on transactions (the example below uses a bucketed ORC table with 'transactional'='true').
For that reason, updates and deletes are rarely used in everyday Hive work. If you want to test them, add the following configuration to Hive's hive-site.xml:

  <property>
      <name>hive.support.concurrency</name>
      <value>true</value>
  </property>
  <property>
      <name>hive.enforce.bucketing</name>
      <value>true</value>
  </property>
  <property>
      <name>hive.exec.dynamic.partition.mode</name>
      <value>nonstrict</value>
  </property>
  <property>
      <name>hive.txn.manager</name>
      <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  </property>
  <property>
      <name>hive.compactor.initiator.on</name>
      <value>true</value>
  </property>
  <property>
      <name>hive.compactor.worker.threads</name>
      <value>1</value>
  </property>

  -- example statements: the transactional table must be bucketed and stored as ORC
  create table test_transaction (user_id int, name string) clustered by (user_id) into 3 buckets stored as orc TBLPROPERTIES ('transactional'='true');
  create table test_insert_test(id int, name string) row format delimited fields terminated by ',';
  insert into test_transaction select * from test_insert_test;
  update test_transaction set name='jerrick_up' where user_id=1;
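
Under the same transactional setup, DELETE works against the same table:

  delete from test_transaction where user_id=1;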