判断
- if(3 is null,true,false)
- isnull(1)
- nvl(null, default_value)
- coalesce(v1, v2, …) 返回第一个不为 null
时间函数
- date_format(‘2021-05-12’,’yyyy/MM/dd’) 格式化时间
- from_unixtime(1620807786,’yyyy-MM-dd HH:mm:ss’) 时间戳转时间
- unix_timestamp(‘2021-05-12 16:23:06’,’yyyy-MM-dd HH:mm:ss’) — 时间转间戳
- select from_unixtime(unix_timestamp(‘20210512’,’yyyyMMdd’),’yyyy-MM-dd’)
- select date_format(‘2021-05-12’,’yyyyMMdd’)
- select next_day(date_sub(‘2021-05-12’,7),’SUN’) 上周星期天
字符串操作
- CASE WHEN sex = ‘1’ THEN ‘男’ WHEN sex = ‘0’ THEN ‘女’ ELSE ‘未知’ END
- CASE sex WHEN ‘1’ THEN ‘男’ WHEN ‘0’ THEN ‘女’ ELSE ‘未知’ END
- split(Column str,String pattern) 切分
- substring(Column str,int pos,int len)
- concat(Column… exprs) 拼接字符串
- translate(Column src,String matchingString,String replaceString)
- regexp_extract(Column e,String exp,int groupIdx) 抽取
- regexp_replace(Column e,String pattern,String replacement) 替换
- get_json_object(Column e,’$.owner’)
- reverse(‘abc’) 反转字符串
数组
- round 四舍五入
- floor 向下取整数
- ceil 向上取整
行转列 EXPLODE
SELECT moviename, typeFROM t_movieLATERAL VIEW EXPLODE(SPLIT(t_movie.movietype, "|")) ex AS type-- POSEXPLODE 获取到下标 idSELECT moviename, id,typeFROM t_movieLATERAL VIEW POSEXPLODE(SPLIT(t_movie.movietype, "|")) ex AS id,type
列转行 group by
select max(sno), name
, concat_ws(',', collect_set(DEPART)) as DEPART
from students_info
group by name
窗口函数 OVER
分析函数
- sum, avg, max, min 例如:
SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) - row_number()按照值排序时产生一个自增编号,不会重复(如:1、2、3、4、5、6)
- rank() 按照值排序时产生一个自增编号,值相等时会重复,会产生空位(如:1、2、3、3、3、6)
- dense_rank() 按照值排序时产生一个自增编号,值相等时会重复,不会产生空位(如:1、2、3、3、3、4)
- lag(列名,往前的行数,[行数为null时的默认值,不指定为null]),可以计算用户上次购买时间,或者用户下次购买时间。
- lead(列名,往后的行数,[行数为null时的默认值,不指定为null])
- ntile(分区个数) 把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,ntile返回此行所属的组的编号
- CUME_DIST cume_dist() = rank/count 小于等于当前值的行数/分组内总行数,结果是一个比率
- percent_rank() = rank-1/count-1 分组内当前行的RANK值-1/分组内总行数-1
- first_value, last_value
- grouping sets 在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行 UNION ALL
- cube:和 grouping set 一样,不过是根据GROUP BY的维度的所有组合进行聚合
- rollup:是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合
http://lxw1234.com/archives/2015/04/193.htm#/
-- 主要区别在于遇见重复值
SELECT cookieid
, createtime
, pv
, RANK() OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn1
, DENSE_RANK() OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn2
, ROW_NUMBER() OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn3
, PERCENT_RANK() OVER (PARTITION BY dept ORDER BY sal) AS rn2
FROM cookie2;
-- pv 数最多的前 1/3 的天,分层三组,取 rn = 1 的
SELECT cookieid
, createtime
, pv
, NTILE(3) OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM cookie2;
-- 统计小于等于当前薪水的人数,所占总人数的比例
SELECT dept
, userid
, sal
, ROUND(CUME_DIST() OVER (ORDER BY sal), 2) AS rn1
, ROUND(CUME_DIST() OVER (PARTITION BY dept ORDER BY sal), 2) AS rn2
FROM cookie3;
SELECT cookieid
, createtime
, url
, ROW_NUMBER() OVER (PARTITION BY cookieid ORDER BY createtime) AS rn
, LEAD(createtime, 1, '1970-01-01 00:00:00') OVER (PARTITION BY cookieid ORDER BY createtime) AS next_1_time
, LEAD(createtime, 2) OVER (PARTITION BY cookieid ORDER BY createtime) AS next_2_time
, LAG(createtime, 1, '1970-01-01 00:00:00') OVER (PARTITION BY cookieid ORDER BY createtime) AS last_1_time
, LAG(createtime, 2) OVER (PARTITION BY cookieid ORDER BY createtime) AS last_2_time
, FIRST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime) AS first1 -- 分组的第一个值
, LAST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime) AS last1 -- 分组的最后一个值
FROM cookie4;
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM cookie5
GROUP BY month, day GROUPING SETS (month, day)
ORDER BY GROUPING__ID;
-- GROUPING__ID,表示结果属于哪一个分组集合
-- 等价于
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID
FROM cookie5
GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID
FROM cookie5
GROUP BY day;
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM cookie5
GROUP BY month, day
WITH CUBE
ORDER BY GROUPING__ID;
-- CUBE 的子集可以实现这样的上钻过程:月天的UV->月的UV->总UV
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM cookie5
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;
窗口函数
OVER(// 窗口函数
PARTITION BY AAA //分组字段
ORDER BY BBB //排序字段
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) //WINDOW子句,指定窗口范围
窗口范围
- PRECEDING:往前
- FOLLOWING:往后
- CURRENT ROW:当前行
- UNBOUNDED:起点(一般结合PRECEDING,FOLLOWING使用)
- UNBOUNDED PRECEDING 表示该窗口最前面的行(起点)
- UNBOUNDED FOLLOWING:表示该窗口最后面的行(终点)
比如说:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示从起点到当前行)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING(表示往前2行到往后1行)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW(表示往前2行到当前行)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING(表示当前行到终点
数据类型
基础数据类型:
TINYINT,SMALLINT,INT,BIGINT,BOOLEAN,FLOAT,DOUBLE,STRING,BINARY,TIMESTAMP,DECIMAL,CHAR,VARCHAR,DATE。
复杂数据类型:
包括ARRAY,MAP,STRUCT,UNION
复杂类型的导入
- ROW FORMAT DELIMITED是指明后面的关键词是列和元素分隔符的
- FIELDS TERMINATED BY 是字段分隔符
- COLLECTION ITEMS TERMINATED BY是元素分隔符(Array中的各元素、Struct中的各元素、Map中 的key-value对之间)
- MAP KEYS TERMINATED BY是Map中key与value的分隔符
- LINES TERMINATED BY是行之间的分隔符
- STORED AS TEXTFILE指数据文件上传之后保存的格式
CREATE TABLE student
(
name STRING,
favors ARRAY<STRING>,
scores MAP<STRING, FLOAT>,
address STRUCT<province:STRING, city:STRING, phone:STRING, zip:INT>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ';'
MAP KEYS TERMINATED BY ':';
-- 导入 map xuzheng-yuwen:70,shuxue:65,yingyu:81
CREATE TABLE table_map
(
name string,
scores map<string, int>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
-- 导入struct 1-english,80,3.3
CREATE TABLE table_struct
(
id int,
course struct<name:string, score:int, jidian:float>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-'
COLLECTION ITEMS TERMINATED BY ',';
-- 导入array huangbo-beijing,shanghai,tianjin,hangzhou
CREATE TABLE table_array
(
name string,
work_locations array<string>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-' COLLECTION ITEMS TERMINATED BY ',';
