## 1. DDL
### 1.1 Database operations
#### 1.1.1 CREATE DATABASE syntax
```sql
CREATE DATABASE [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
```
Common commands:
```sql
create database if not exists db_hive;
show databases like 'db_hive*';
desc database db_hive;
desc database extended db_hive2;  #show extended database details
alter database db_hive set dbproperties('createtime'='20170830');
drop database db_hive2;
drop database db_hive cascade;  #cascade force-drops a non-empty database
ALTER TABLE table_name RENAME TO new_table_name;
```
### 1.2 Table operations
#### 1.2.1 CREATE TABLE syntax
```sql
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]
#Field notes:
#EXTERNAL: creates an external table (see 1.3)
#COMMENT: adds a comment to the table or to a column
#PARTITIONED BY: creates a partitioned table
#CLUSTERED BY: creates a bucketed table
#SORTED BY: rarely used; sorts the rows inside each bucket by one or more columns
#STORED AS: specifies the storage file format
#LOCATION: specifies where the table is stored on HDFS
#AS: followed by a query; creates the table from the query result
#LIKE: copies an existing table's structure without copying its data
data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type -- (Note: Available in Hive 0.7.0 and later)
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ...>
union_type
: UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)
row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
```
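The grammar above is easiest to read through a concrete example. A minimal sketch (the table and its fields are illustrative, not taken from these notes) combining the complex types with a DELIMITED row format:
```sql
create table if not exists person(
  name string,
  friends array<string>,                       -- raw text like: bingbing_lili
  children map<string, int>,                   -- raw text like: xiao song:18_xiaoxiao song:19
  address struct<street:string, city:string>   -- raw text like: hui long guan_beijing
)
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ':'
lines terminated by '\n';
```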
### 1.3 Managed vs. external tables (a common interview question)
- Tables are managed (internal) tables by default. When you drop a managed table, Hive deletes the table's data as well. External tables do not have this problem: Hive does not assume it fully owns an external table's data, so dropping the table removes only the table's metadata while the data itself is left untouched.
- Check a table's type with `desc formatted tableName;`. A table can be switched between managed and external with `alter table student2 set tblproperties('EXTERNAL'='TRUE'|'FALSE');` (the values are case-sensitive and must be uppercase!).
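A quick sketch of the conversion round-trip (student2 is a hypothetical two-column table used only for illustration):
```sql
create table if not exists student2(id int, name string);
desc formatted student2;                                    -- Table Type: MANAGED_TABLE
alter table student2 set tblproperties('EXTERNAL'='TRUE');  -- becomes EXTERNAL_TABLE; 'TRUE' must be uppercase
alter table student2 set tblproperties('EXTERNAL'='FALSE'); -- back to MANAGED_TABLE
```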
### 1.4 Partitioned tables
#### 1.4.1 Concept
A partitioned table corresponds to a separate directory on the HDFS filesystem, and that directory holds all of the partition's data files. A partition in Hive is simply a subdirectory: a large dataset is split into smaller ones according to business needs. At query time you select only the partitions you need via an expression in the WHERE clause, which makes such queries far more efficient.
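For intuition, each partition is just a subdirectory under the table's HDFS directory. Assuming the default warehouse path used elsewhere in these notes, the dept_partition table created below would be laid out roughly like this:
```sql
dfs -ls /user/hive/warehouse/dept_partition;
-- /user/hive/warehouse/dept_partition/month=201708
-- /user/hive/warehouse/dept_partition/month=201709
```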
#### 1.4.2 Common partition operations
```sql
create table dept_partition(
  deptno int, dname string, loc string
)
partitioned by (month string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition partition(month='201709');
select * from dept_partition where month='201709';  #single partition
select * from dept_partition where month='201709'
union
select * from dept_partition where month='201708';  #multiple partitions
alter table dept_partition add partition(month='201706');  #add a partition
alter table dept_partition drop partition(month='201704');  #drop a partition
alter table dept_partition drop partition(month='201705'), partition(month='201706');  #drop several
show partitions dept_partition;  #list the table's partitions
desc formatted dept_partition;   #show the partitioned table's structure
#two-level partitioning
create table dept_partition2(
  deptno int, dname string, loc string
)
partitioned by (month string, day string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/module/datas/dept.txt' into table default.dept_partition2 partition(month='201709', day='13');
select * from dept_partition2 where month='201709' and day='13';  #query partition data
```
<a name="iRaXF"></a>
#### 1.4.3 Associating uploaded data with a partitioned table
1. Upload the data, then repair the table:
   a. dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=12;
   b. dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=12;
   c. msck repair table dept_partition2;  #run the repair
2. Upload the data, then add the partition:
   a. dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=11;
   b. dfs -put /opt/module/datas/dept.txt /user/hive/warehouse/dept_partition2/month=201709/day=11;
   c. alter table dept_partition2 add partition(month='201709', day='11');  #add the partition
3. Create the directory, then load the data into the partition:
   a. dfs -mkdir -p /user/hive/warehouse/dept_partition2/month=201709/day=10;
   b. load data local inpath '/opt/module/datas/dept.txt' into table dept_partition2 partition(month='201709', day='10');
<a name="HdJoG"></a>
### 1.5 Modifying tables
1. Change a column:
   ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
2. Add or replace columns:
   ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
   Note: ADD appends a new column after the existing columns (but before the partition columns), while REPLACE replaces all of the table's columns.
3. Examples:
```sql
desc dept_partition;  #show the table structure
alter table dept_partition add columns(deptdesc string);  #add a column
alter table dept_partition change column deptdesc desc int;  #change a column
alter table dept_partition replace columns(deptno string, dname string, loc string);  #replace all columns
drop table dept_partition;  #drop the table
```
<a name="oJd4R"></a>
## 2. DML
<a name="jmirA"></a>
### 2.1 Loading data with LOAD
```sql
load data [local] inpath '/opt/module/datas/student.txt' [overwrite] into table student [partition (partcol1=val1,…)];
#local: load from the local filesystem into the Hive table; without it, load from HDFS
#inpath: the path of the data to load
#overwrite: overwrite the data already in the table; without it, append
#into table: the table to load into
#student: the specific table name
#partition: load into the specified partition
```
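Two concrete invocations of the options above (the HDFS source path in the second command is illustrative):
```sql
-- append from the local filesystem of the Hive client machine
load data local inpath '/opt/module/datas/student.txt' into table student;
-- overwrite from HDFS; note that the source file is moved, not copied, into the table's directory
load data inpath '/user/hive/student.txt' overwrite into table student;
```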
### 2.2 Inserting data with INSERT
```sql
create table student(id int, name string) partitioned by (month string) row format delimited fields terminated by '\t';
insert into table student partition(month='201709') values(1,'wangwu'),(2,'zhaoliu');
insert overwrite table student partition(month='201708')
select id, name from student where month='201709';
#insert into: appends data to the table or partition; existing data is kept
#insert overwrite: overwrites the data already in the table or partition
#multi-insert mode: one source query writing into several partitions
from student
insert overwrite table student partition(month='201707')
select id, name where month='201709'
insert overwrite table student partition(month='201706')
select id, name where month='201709';
```
### 2.3 Creating a table from a query (CTAS)
```sql
create table if not exists student3 as select id, name from student;
```
### 2.4 Specifying the data path with LOCATION at table creation
```sql
create external table if not exists student5( id int, name string )
row format delimited fields terminated by '\t'
location '/student';
```
### 2.5 Importing data into a Hive table with IMPORT (the source path must contain data written by EXPORT, see 2.6)
```sql
import table student2 partition(month='201709') from '/user/hive/warehouse/export/student';
```
### 2.6 Exporting data
```sql
#1. with local, export to the local Linux filesystem; without local, export to HDFS
insert overwrite local directory '/data/hive/export/student'
[row format delimited fields terminated by '\t']  #optional formatting
select * from student;
#2. export to local via a Hadoop command
dfs -get /user/hive/warehouse/student/month=201709/000000_0 /opt/module/export/student3.txt;
#3. run a statement or script with hive -e/-f and redirect stdout to a file
bin/hive -e 'select * from default.student;' > /opt/module/datas/export/student4.txt
#4. EXPORT to HDFS
export table default.student to '/user/hive/warehouse/export/student';
```
### 2.7 LIKE and RLIKE
- Use LIKE to select values matching a pattern.
- The pattern can contain characters or numbers:
  - % matches zero or more characters (any number of characters).
  - _ matches exactly one character.
- RLIKE is a Hive extension of this feature: it specifies the match condition with the much more powerful Java regular expression language.
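A few illustrative queries against the emp table used in the next section (an ename column is assumed):
```sql
select * from emp where ename like 'A%';   -- names starting with A
select * from emp where ename like '_A%';  -- names whose second character is A
select * from emp where ename rlike '[A]'; -- names containing A anywhere, via a Java regex
```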
### 2.8 Joins
```sql
#join an aggregated subquery against dept
select emp.*, dept.dname
from (select deptno, avg(sal) avg_sal from emp group by deptno) emp
left join dept on dept.deptno = emp.deptno;
#multi-table join
SELECT e.ename, d.dname, l.loc_name
FROM emp e
JOIN dept d ON d.deptno = e.deptno
JOIN location l ON d.loc = l.loc;
```
<a name="dzMlD"></a>
### 2.9 Sorting
1. ORDER BY: global sort; only one reducer.
2. SORT BY (local sort): order by is very inefficient on large datasets, and in many cases a global order is not needed; then you can use **sort by**, which sorts within each reducer only.
3. DISTRIBUTE BY: sometimes we need to control which reducer a particular row goes to, usually for a subsequent aggregation; the **distribute by** clause does this. It works like a custom partitioner in MR and is combined with sort by. Notes:
   - distribute by routes rows by taking the hash of the distribute column modulo the number of reducers; rows with the same remainder land in the same partition.
   - Hive requires the DISTRIBUTE BY clause to be written before SORT BY.
4. When the distribute by and sort by columns are the same, you can use CLUSTER BY instead, as in the sketch below.
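A minimal sketch against the emp table from section 2.8 (sal and deptno columns assumed):
```sql
set mapreduce.job.reduces=3;                         -- sort by/distribute by only show an effect with several reducers
select * from emp order by sal desc;                 -- global order through one reducer
select * from emp sort by deptno desc;               -- ordered within each reducer only
select * from emp distribute by deptno sort by sal;  -- route rows by deptno, then sort inside each reducer
select * from emp cluster by deptno;                 -- distribute by + sort by on the same column; ascending only
```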
<a name="BGvPI"></a>
### 2.10 Bucketing
Sample data: [student.txt](https://www.yuque.com/attachments/yuque/0/2020/txt/668614/1598777642320-8e4a2c3b-6798-4b9d-baf5-07ebf590d3cb.txt)
```sql
#1. create a bucketed table
create table stu_buck(id int, name string)
clustered by(id) into 4 buckets
row format delimited fields terminated by '\t';
#check that it worked: desc formatted shows "Num Buckets: 4"
desc formatted stu_buck;
#the load below alone does NOT produce bucketed files
load data local inpath '/opt/module/datas/student.txt' into table stu_buck;
set hive.enforce.bucketing=true;
set mapreduce.job.reduces=-1;
#to bucket the data properly it must be written through an MR job:
insert into table stu_buck select id, name from stu;
#sample the stu_buck table
select * from stu_buck tablesample(bucket 1 out of 4 on id);
#TABLESAMPLE(BUCKET x OUT OF y):
#y must be a multiple or a factor of the table's total bucket count; it determines the sampling fraction.
#e.g. with 4 buckets in total, y=2 samples 4/2 = 2 buckets' worth of data, and y=8 samples 4/8 = 1/2 a bucket.
#x is the bucket to start sampling from; each further sampled bucket is the previous bucket number plus y.
#e.g. with 4 buckets, tablesample(bucket 1 out of 2) samples 4/2 = 2 buckets: bucket 1 (x) and bucket 3 (x+y).
#note: x must be less than or equal to y, otherwise Hive raises an error.
```
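The y=2 case just described, as a runnable query:
```sql
-- 4 buckets, y=2: samples 2 buckets, namely bucket 1 (x) and bucket 3 (x+y)
select * from stu_buck tablesample(bucket 1 out of 2 on id);
```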
### 2.11 Window functions
#### 2.11.1 Description
- OVER(): specifies the window of data the analytic function operates on; the window size may change from row to row.
- CURRENT ROW: the current row
- n PRECEDING: the n rows before the current row
- n FOLLOWING: the n rows after the current row
- UNBOUNDED: the boundary; UNBOUNDED PRECEDING means from the start of the partition, UNBOUNDED FOLLOWING means to the end of the partition
- LAG(col, n, default_val): the value of col n rows before the current row
- LEAD(col, n, default_val): the value of col n rows after the current row
- NTILE(n): distributes the rows of an ordered partition into n numbered groups, numbered from 1; for each row, NTILE returns the number of the group the row belongs to. Note: n must be an int.
```sql
create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
#1. customers who purchased in April 2017, with the total number of such customers
select name, count(*) over()
from business
where substring(orderdate,1,7)='2017-04'
group by name;
#2. each customer's purchase details plus the monthly purchase total
select *, sum(cost) over(partition by month(orderdate)) from business;
#3. same scenario, with each customer's cost accumulated by date
select *, sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row) from business;
#window frame examples:
#current row aggregated with the preceding row
over(partition by name order by orderdate rows between 1 PRECEDING and current row)
#current row with one row before and one row after
over(partition by name order by orderdate rows between 1 PRECEDING and 1 FOLLOWING)
#current row and all following rows
over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING)
#4. each customer's previous purchase date
select name, orderdate, cost,
       lag(orderdate,1,'1900-01-01') over(partition by name order by orderdate) as time1
from business;
#5. orders in the earliest 20% of order dates
select * from (
  select name, orderdate, cost, ntile(5) over(order by orderdate) sorted
  from business
) t where sorted = 1;
```
<a name="c3hYg"></a>
### 2.12其他
<a name="m46Xg"></a>
### 函数
1. NULL substitution: `select id, nvl(name, "默认") from stu;` returns the default value "默认" when name is NULL.
2. CASE WHEN, e.g. counting rows per group:
```sql
select dept_id, sum(case sex when '女' then 1 else 0 end) female_count from tableName group by dept_id;
```
3. Rows to a column (concatenation):
   - CONCAT(string A/col, string B/col, ...): returns the concatenation of the input strings; any number of inputs is supported.
   - CONCAT_WS(separator, str1, str2, ...): a special form of CONCAT(); the first argument is the separator placed between the remaining arguments, and empty strings and NULLs are skipped.
   - COLLECT_SET(col): accepts primitive types only; deduplicates the column's values and aggregates them into an array-typed field.
4. A column to rows:
   - EXPLODE(col): splits a complex array or map column into multiple rows.
   - LATERAL VIEW usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias. It is used together with UDTFs such as split and explode: it expands one column into multiple rows, and the expanded rows can then be aggregated.
Examples:
```sql
# rows to a column
select concat_ws('|', collect_set(res.name)) name, res.base
from (select person_info.name, concat(person_info.blood_type, person_info.constellation) base
      from person_info person_info) res
group by res.base;
# a column to rows
select movie, category_name
from movie_info lateral view explode(category) table_tmp as category_name;
```
### 2.13 RANK
#### 2.13.1 Description
- RANK(): ties get the same rank and leave gaps, so the total count is preserved
- DENSE_RANK(): ties get the same rank without gaps, so the maximum rank can shrink
- ROW_NUMBER(): numbers the rows consecutively in order
```sql
create table score(
name string,
subject string,
score int)
row format delimited fields terminated by "\t";
load data local inpath '/opt/module/datas/score.txt' into table score;
#查询排名
select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;
```
References:
DDL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL