- 1. DDL
  - 1.2 Table operations
  - 1.3 Managed tables vs. external tables (interview question)
  - 1.4 Partitioned tables
- Multi-table joins
- Window function examples: customers who bought in April 2017; monthly purchase totals; running cost per customer by date; previous purchase date (LAG); earliest 20% of orders (NTILE)
## 1. DDL
### 1.1 Database operations
#### 1.1.1 CREATE DATABASE syntax
```sql
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
```
Common commands:

```sql
create database if not exists db_hive;
show databases like 'db_hive*';
desc database db_hive;
desc database extended db_hive2;   -- show detailed database info
alter database db_hive set dbproperties('createtime'='20170830');
drop database db_hive2;
drop database db_hive cascade;     -- force-drop a non-empty database
ALTER TABLE table_name RENAME TO new_table_name;
```
### 1.2 Table operations
#### 1.2.1 CREATE TABLE syntax
```sql
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
 [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]

-- Clause notes:
-- EXTERNAL       create an external table
-- COMMENT        add comments to the table and its columns
-- PARTITIONED BY create a partitioned table
-- CLUSTERED BY   create a bucketed table
-- SORTED BY      rarely used; additionally sorts one or more columns within each bucket
-- STORED AS      specify the storage file format
-- LOCATION       specify where the table is stored on HDFS
-- AS             followed by a query; create the table from the query result
-- LIKE           copy an existing table's schema without copying its data

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type      -- (Note: Available in Hive 0.7.0 and later)

array_type  : ARRAY < data_type >
map_type    : MAP < primitive_type, data_type >
struct_type : STRUCT < col_name : data_type [COMMENT col_comment], ... >
union_type  : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]]
    [COLLECTION ITEMS TERMINATED BY char]
    [MAP KEYS TERMINATED BY char]
    [LINES TERMINATED BY char]
    [NULL DEFINED AS char]
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, ...)]
```
### 1.3 Managed tables vs. external tables (interview question)
- Tables are created as managed (internal) tables by default. When you drop a managed table, Hive also deletes the table's data. External tables do not have this problem: because the table is external, Hive does not assume it fully owns the data, so dropping the table deletes only the table's metadata and leaves the data files untouched.
Check a table's type with `desc formatted tableName;`. You can convert between managed and external tables with `alter table student2 set tblproperties('EXTERNAL'='TRUE');` (or `'FALSE'`). Note that `'TRUE'`/`'FALSE'` are case-sensitive!
### 1.4 Partitioned tables
#### 1.4.1 Concept
A partitioned table corresponds to a separate directory on the HDFS filesystem; that directory holds all of the partition's data files. A partition in Hive is simply a subdirectory: a large dataset is split into smaller ones according to business needs. At query time, expressions in the WHERE clause select only the required partitions, which makes queries much more efficient.
#### 1.4.2 Common partition operations
(sample data: dept_19.txt)

```sql
create table dept_partition(
  dname string,
  loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');

select * from dept_partition where month='201709';   -- single partition
select * from dept_partition where month='201709'
union
select * from dept_partition where month='201708';   -- multiple partitions

alter table dept_partition add partition(month='201706');    -- add a partition
alter table dept_partition drop partition (month='201704');  -- drop a partition
alter table dept_partition drop partition (month='201705'), partition (month='201706');

show partitions dept_partition;   -- list the table's partitions
desc formatted dept_partition;    -- show the partitioned table's structure

-- create a two-level partitioned table
create table dept_partition2(
  deptno int, dname string, loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';

load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='201709', day='13');

select * from dept_partition2 where month='201709' and day='13';  -- query partition data
```
#### 1.4.3 Linking uploaded data to a partitioned table

1. Upload the data, then repair the table:
   a. dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;
   b. dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;
   c. msck repair table dept_partition2;  -- run the repair
2. Upload the data, then add the partition:
   a. dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=11;
   b. dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;
   c. alter table dept_partition2 add partition(month='201709', day='11');  -- add the partition
3. Create the directory, then load data into the partition:
   a. dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=10;
   b. load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709', day='10');

### 1.5 Altering tables

1. Change a column:
   ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
2. Add or replace columns:
   ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
   Note: ADD appends one new column after all existing columns (but before the partition columns); REPLACE replaces all columns in the table.
3. Examples:
```sql
desc dept_partition;                                        -- show the table structure
alter table dept_partition add columns(deptdesc string);    -- add a column
alter table dept_partition change column deptdesc desc int; -- change a column
alter table dept_partition replace columns(deptno string, dname string, loc string);  -- replace all columns
drop table dept_partition;                                  -- drop the table
```
## 2. DML
### 2.1 Loading data with LOAD

```sql
load data [local] inpath '/opt/module/datas/student.txt' [overwrite]
into table student [partition (partcol1=val1, ...)];
-- local:      load from the local filesystem; without it, load from HDFS
-- inpath:     path of the data to load
-- overwrite:  overwrite existing data in the table; otherwise append
-- into table: the table to load into
-- student:    the concrete table name
-- partition:  load into the specified partition
```
### 2.2 Inserting query results (INSERT)
```sql
create table student(id int, name string)
partitioned by (month string)
row format delimited fields terminated by '\t';

insert into table student partition(month='201709') values(1,'wangwu'),(2,'zhaoliu');

insert overwrite table student partition(month='201708')
select id, name from student where month='201709';

-- insert into:      appends data to the table or partition; existing data is kept
-- insert overwrite: overwrites existing data in the table or partition

-- multi-insert (one source scan, several target partitions)
from student
insert overwrite table student partition(month='201707')
select id, name where month='201709'
insert overwrite table student partition(month='201706')
select id, name where month='201709';

-- create a table from a query result
create table if not exists student3 as select id, name from student;
```
### 2.3 Creating a table from a query (AS SELECT)

```sql
create table if not exists student3 as select id, name from student;
```
### 2.4 Specifying the data path with LOCATION at table creation

```sql
create external table if not exists student5(
  id int, name string
)
row format delimited fields terminated by '\t'
location '/student';
```
### 2.5 Importing data into a Hive table (IMPORT)

```sql
import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';
```
### 2.6 Exporting data

```sql
-- 1. INSERT ... DIRECTORY: with local, export to the local Linux filesystem;
--    without local, export to HDFS
insert overwrite local directory '/data/hive/export/student'
[ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t']  -- optional formatting
select * from student;

-- 2. export to the local filesystem with a Hadoop command
dfs -get /user/hive/warehouse/student/month=201709/000000_0 /opt/module/export/student3.txt;

-- 3. hive -f/-e: run a statement or script and redirect the output to a file
--    bin/hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt

-- 4. EXPORT to HDFS
export table default.student to '/user/hive/warehouse/export/student';
```
### 2.7 LIKE and RLIKE

- Use LIKE to select values matching a pattern.
- The pattern can contain characters or digits:
  - % matches zero or more characters (any number of characters).
  - _ matches exactly one character.
- RLIKE is a Hive extension of this feature: it lets you specify the match condition with Java regular expressions, a much more powerful language.
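The two LIKE wildcards are easy to demonstrate outside Hive. The sketch below (a hypothetical helper, not part of Hive) translates a LIKE pattern into the equivalent anchored regex — essentially the kind of expression RLIKE accepts directly:

```python
import re

def like_match(value: str, pattern: str) -> bool:
    # % matches any run of characters (including none), _ matches exactly
    # one character; everything else is matched literally.
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None
```

For example, `like_match("smith", "s%")` is true while `like_match("smith", "s_")` is false, since `_` demands exactly one remaining character.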
### 2.8 Joins

(sample data: dept.txt, emp.txt, location.txt)

```sql
select emp.*, dept.dname
from (select deptno, avg(sal) avg_sal from emp group by deptno) emp
left join dept on dept.deptno = emp.deptno;
```
Multi-table join:

```sql
SELECT e.ename, d.dname, l.loc_name
FROM emp e
JOIN dept d ON d.deptno = e.deptno
JOIN location l ON d.loc = l.loc;
```
### 2.9 Sorting

1. ORDER BY: global ordering, executed by a single Reducer.
2. SORT BY (partial ordering): ORDER BY is very inefficient on large datasets. In many cases a global ordering is not needed, and **sort by** can be used instead.
3. DISTRIBUTE BY: sometimes we need to control which reducer a particular row goes to, usually in preparation for a subsequent aggregation. The **distribute by** clause does this. It is similar to a custom partitioner in MapReduce and is typically combined with sort by.
   Notes:
   - DISTRIBUTE BY routes rows by taking the hash code of the distribution column modulo the number of reducers; rows with the same remainder go to the same partition.
   - Hive requires the DISTRIBUTE BY clause to be written before SORT BY.
4. When the distribute by and sort by columns are the same, cluster by can be used instead.

### 2.10 Bucketing

(sample data: student.txt)

```sql
-- 1. create a bucketed table
create table stu_buck(id int, name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';

-- check that it worked: "Num Buckets: 4"
desc formatted stu_buck;

-- the following load alone still does NOT bucket the data
load data local inpath '/opt/module/datas/student.txt' into table stu_buck;

set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;

-- to actually bucket the data, it must be written through an MR job
insert into table stu_buck select id, name from stu;

-- query the data in stu_buck
select * from stu_buck tablesample(bucket 1 out of 4 on id);

-- TABLESAMPLE(BUCKET x OUT OF y):
-- y must be a multiple or a factor of the table's total bucket count; Hive
-- derives the sampling ratio from y. E.g. with 4 buckets in total, y=2 samples
-- 4/2 = 2 buckets of data, and y=8 samples 4/8 = 1/2 of a bucket.
-- x is the bucket to start from; when taking several buckets, each subsequent
-- bucket number is the previous one plus y. E.g. with 4 buckets in total,
-- tablesample(bucket 1 out of 2) samples 4/2 = 2 buckets: bucket 1 (x) and
-- bucket 3 (x+y). Note: x must be <= y, otherwise Hive raises an error.
```
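The bucket-routing and TABLESAMPLE arithmetic described above can be sketched in Python (hypothetical helper names; assumes an int clustering column, for which Hive's hash is the value itself):

```python
NUM_BUCKETS = 4  # matches the stu_buck table above

def bucket_of(key: int, total: int = NUM_BUCKETS) -> int:
    # A row goes to bucket hash(clustering column) % bucket count;
    # buckets here are numbered from 0.
    return key % total

def sampled_buckets(x: int, y: int, total: int = NUM_BUCKETS) -> list:
    # TABLESAMPLE(BUCKET x OUT OF y): start at bucket x, then take every
    # y-th bucket: x, x + y, x + 2y, ... (1-based, as in the Hive syntax).
    assert x <= y, "x must be <= y"
    return list(range(x, total + 1, y))
```

So with 4 buckets, `sampled_buckets(1, 2)` picks buckets 1 and 3, exactly the example in the comments above.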
### 2.11 Window functions
#### 2.11.1 Function reference
- OVER(): specifies the window of data the analytic function works over; the window may change from row to row.
- CURRENT ROW: the current row.
- n PRECEDING: the n rows before the current row.
- n FOLLOWING: the n rows after the current row.
- UNBOUNDED: the boundary; UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition.
- LAG(col, n, default_val): the value of col n rows back.
- LEAD(col, n, default_val): the value of col n rows ahead.
- NTILE(n): distributes the rows of an ordered partition into n numbered groups; groups are numbered starting from 1, and for each row NTILE returns the number of the group the row belongs to. Note: n must be of type int.
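The LAG and NTILE semantics in the list above can be sketched over a plain ordered Python list (hypothetical helpers, not Hive APIs):

```python
def lag(col, n=1, default=None):
    # LAG(col, n, default): the value n rows back in the ordered sequence,
    # or default when no such row exists.
    return [col[i - n] if i - n >= 0 else default for i in range(len(col))]

def ntile(num_rows, n):
    # NTILE(n): split num_rows ordered rows into n groups numbered from 1;
    # earlier groups receive the extra rows when the split is uneven.
    size, extra = divmod(num_rows, n)
    groups = []
    for g in range(1, n + 1):
        groups.extend([g] * (size + (1 if g <= extra else 0)))
    return groups
```

For example, `ntile(7, 3)` yields `[1, 1, 1, 2, 2, 3, 3]`: the first group absorbs the leftover row, just as NTILE does over an uneven partition.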
```sql
create table business(
  name string,
  orderdate string,
  cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```
1. Customers who bought in April 2017, plus the total count of such customers:

```sql
select name, count(*) over()
from business
where substring(orderdate, 1, 7) = '2017-04'
group by name;
```

2. Each customer's purchase details plus the monthly purchase total:

```sql
select *, sum(cost) over(partition by month(orderdate))
from business;
```

3. Same scenario, but accumulate each customer's cost by date:

```sql
select *, sum(cost) over(partition by name order by orderdate
                         rows between unbounded preceding and current row)
from business;

-- Frame examples:
-- current row aggregated with the previous row:
--   over(partition by name order by orderdate rows between 1 PRECEDING and current row)
-- current row with the previous row and the next row:
--   over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING)
-- current row and all following rows:
--   over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING)
```

4. Each customer's previous purchase date:

```sql
select name, orderdate, cost,
       lag(orderdate, 1, '1900-01-01') over(partition by name order by orderdate) as time1
from business;
```

5. Orders in the earliest 20% of the time range:

```sql
select * from (
  select name, orderdate, cost,
         ntile(5) over(order by orderdate) sorted
  from business
) t
where sorted = 1;
```
### 2.12 Other functions

1. Null-field substitution: select id, nvl(name, "默认") from stu;
2. CASE WHEN: select ..., sum(case sex when '女' then 1 else 0 end) female_count from tableName;
3. Rows to columns:
   - CONCAT(string A/col, string B/col, ...): returns the concatenation of the input strings; accepts any number of inputs.
   - CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments; empty strings and NULLs are skipped.
   - COLLECT_SET(col): accepts only primitive types; deduplicates and aggregates the values of a column into an array-typed field.
4. Columns to rows:
   EXPLODE(col): splits a complex array or map column into multiple rows.
   LATERAL VIEW usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
   Explanation: used together with UDTFs such as split and explode; it splits one column into multiple rows, and the split rows can then be aggregated.
   Examples:

```sql
-- rows to columns
select concat_ws('|', collect_set(res.name)) name, res.base
from (
  select person_info.name,
         concat(person_info.blood_type, person_info.constellation) base
  from person_info person_info
) res
group by res.base;

-- columns to rows
select movie, category_name
from movie_info
lateral view explode(category) table_tmp as category_name;
```
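The LATERAL VIEW + EXPLODE pattern above can be sketched in Python (a hypothetical helper over dict rows, not a Hive API): one output row is emitted per array element, with the other fields copied through.

```python
def lateral_view_explode(rows, col, alias):
    # EXPLODE + LATERAL VIEW semantics: emit one output row per element of
    # the array column, keeping the other fields of the source row.
    out = []
    for row in rows:
        for item in row[col]:
            new_row = {k: v for k, v in row.items() if k != col}
            new_row[alias] = item
            out.append(new_row)
    return out
```

A movie row with `category = ["war", "action"]` thus becomes two rows, one per category, which is what allows the exploded values to be grouped and aggregated afterwards.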
### 2.13 Rank
#### 2.13.1 Function reference
- RANK(): ties receive the same rank and the total count is preserved (ranks after a tie are skipped).
- DENSE_RANK(): ties receive the same rank and no ranks are skipped (so the highest rank can be smaller than the row count).
- ROW_NUMBER(): numbers rows strictly in order, with no repeats.
```sql
create table score(
  name string,
  subject string,
  score int
)
row format delimited fields terminated by "\t";

load data local inpath '/opt/module/datas/score.txt' into table score;

-- rank within each subject
select name, subject, score,
       rank()       over(partition by subject order by score desc) rp,
       dense_rank() over(partition by subject order by score desc) drp,
       row_number() over(partition by subject order by score desc) rmp
from score;
```
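How the three ranking functions differ on ties can be shown on a small descending score list (a hypothetical helper mirroring the query above, not a Hive API):

```python
def rank_variants(scores):
    # Input sorted descending. Returns (score, rank, dense_rank, row_number)
    # per row: RANK repeats on ties and skips the following ranks,
    # DENSE_RANK repeats without skipping, ROW_NUMBER never repeats.
    out, rank, dense = [], 0, 0
    for i, s in enumerate(scores):
        if i == 0 or s != scores[i - 1]:
            rank = i + 1
            dense += 1
        out.append((s, rank, dense, i + 1))
    return out
```

For `[90, 90, 80]` this gives ranks 1, 1, 3 (RANK), 1, 1, 2 (DENSE_RANK), and 1, 2, 3 (ROW_NUMBER) — the same values the `rp`/`drp`/`rmp` columns would show.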
Reference:
DDL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
