判断

  • if(3 is null,true,false)
  • isnull(1)
  • nvl(null, default_value)
  • coalesce(v1, v2, …) 返回第一个不为 null

时间函数

  • date_format(‘2021-05-12’,’yyyy/MM/dd’) 格式化时间
  • from_unixtime(1620807786,’yyyy-MM-dd HH:mm:ss’) 时间戳转时间
  • unix_timestamp(‘2021-05-12 16:23:06’,’yyyy-MM-dd HH:mm:ss’) — 时间转间戳
  • select from_unixtime(unix_timestamp(‘20210512’,’yyyyMMdd’),’yyyy-MM-dd’)
  • select date_format(‘2021-05-12’,’yyyyMMdd’)
  • select next_day(date_sub(‘2021-05-12’,7),’SUN’) 上周星期天

字符串操作

  • CASE WHEN sex = ‘1’ THEN ‘男’ WHEN sex = ‘0’ THEN ‘女’ ELSE ‘未知’ END
  • CASE sex WHEN ‘1’ THEN ‘男’ WHEN ‘0’ THEN ‘女’ ELSE ‘未知’ END
  • split(Column str,String pattern) 切分
  • substring(Column str,int pos,int len)
  • concat(Column… exprs) 拼接字符串
  • translate(Column src,String matchingString,String replaceString)
  • regexp_extract(Column e,String exp,int groupIdx) 抽取
  • regexp_replace(Column e,String pattern,String replacement) 替换
  • get_json_object(Column e,’$.owner’)
  • reverse(‘abc’) 反转字符串

数组

  • round 四舍五入
  • floor 向下取整数
  • ceil 向上取整

行转列 EXPLODE

  1. SELECT moviename, type
  2. FROM t_movie
  3. LATERAL VIEW EXPLODE(SPLIT(t_movie.movietype, "|")) ex AS type
  4. -- POSEXPLODE 获取到下标 id
  5. SELECT moviename, id,type
  6. FROM t_movie
  7. LATERAL VIEW POSEXPLODE(SPLIT(t_movie.movietype, "|")) ex AS id,type

列转行 group by

select max(sno), name
   , concat_ws(',', collect_set(DEPART)) as DEPART 
from students_info
group by name

窗口函数 OVER

窗口函数
Cube, Grouping and Rollup

分析函数

  • sum, avg, max, min 例如:SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime)
  • row_number()按照值排序时产生一个自增编号,不会重复(如:1、2、3、4、5、6)
  • rank() 按照值排序时产生一个自增编号,值相等时会重复,会产生空位(如:1、2、3、3、3、6)
  • dense_rank() 按照值排序时产生一个自增编号,值相等时会重复,不会产生空位(如:1、2、3、3、3、4)
  • lag(列名,往前的行数,[行数为null时的默认值,不指定为null]),可以计算用户上次购买时间,或者用户下次购买时间。
  • lead(列名,往后的行数,[行数为null时的默认值,不指定为null])
  • ntile(分区个数) 把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,ntile返回此行所属的组的编号
  • CUME_DIST cume_dist() = rank/count 小于等于当前值的行数/分组内总行数,结果是一个比率
  • percent_rank() = rank-1/count-1 分组内当前行的RANK值-1/分组内总行数-1
  • first_value, last_value
  • grouping sets 在一个GROUP BY查询中,根据不同的维度组合进行聚合,等价于将不同维度的GROUP BY结果集进行 UNION ALL
  • cube:和 grouping set 一样,不过是根据GROUP BY的维度的所有组合进行聚合
  • rollup:是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合

http://lxw1234.com/archives/2015/04/193.htm#/

-- 主要区别在于遇见重复值
SELECT cookieid
     , createtime
     , pv
     , RANK() OVER (PARTITION BY cookieid ORDER BY pv DESC)       AS rn1
     , DENSE_RANK() OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn2
     , ROW_NUMBER() OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn3
     , PERCENT_RANK() OVER (PARTITION BY dept ORDER BY sal)       AS rn2
FROM cookie2;
-- pv 数最多的前 1/3 的天,分层三组,取 rn = 1 的
SELECT cookieid
     , createtime
     , pv
     , NTILE(3) OVER (PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM cookie2;
-- 统计小于等于当前薪水的人数,所占总人数的比例
SELECT dept
     , userid
     , sal
     , ROUND(CUME_DIST() OVER (ORDER BY sal), 2)                   AS rn1
     , ROUND(CUME_DIST() OVER (PARTITION BY dept ORDER BY sal), 2) AS rn2
FROM cookie3;
SELECT cookieid
     , createtime
     , url
     , ROW_NUMBER() OVER (PARTITION BY cookieid ORDER BY createtime)                               AS rn
     , LEAD(createtime, 1, '1970-01-01 00:00:00') OVER (PARTITION BY cookieid ORDER BY createtime) AS next_1_time
     , LEAD(createtime, 2) OVER (PARTITION BY cookieid ORDER BY createtime)                        AS next_2_time
     , LAG(createtime, 1, '1970-01-01 00:00:00') OVER (PARTITION BY cookieid ORDER BY createtime)  AS last_1_time
     , LAG(createtime, 2) OVER (PARTITION BY cookieid ORDER BY createtime)                         AS last_2_time
     , FIRST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime)                           AS first1 -- 分组的第一个值
     , LAST_VALUE(url) OVER (PARTITION BY cookieid ORDER BY createtime)                            AS last1  -- 分组的最后一个值
FROM cookie4;
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM cookie5
GROUP BY month, day GROUPING SETS (month, day)
ORDER BY GROUPING__ID;
-- GROUPING__ID,表示结果属于哪一个分组集合
-- 等价于
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID
FROM cookie5
GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID
FROM cookie5
GROUP BY day;
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM cookie5
GROUP BY month, day
WITH CUBE
ORDER BY GROUPING__ID;
-- CUBE 的子集可以实现这样的上钻过程:月天的UV->月的UV->总UV
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, GROUPING__ID
FROM cookie5
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;

窗口函数

OVER(// 窗口函数
    PARTITION BY AAA //分组字段
    ORDER BY BBB //排序字段
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) //WINDOW子句,指定窗口范围

窗口范围

  • PRECEDING:往前
  • FOLLOWING:往后
  • CURRENT ROW:当前行
  • UNBOUNDED:起点(一般结合PRECEDING,FOLLOWING使用)
  • UNBOUNDED PRECEDING 表示该窗口最前面的行(起点)
  • UNBOUNDED FOLLOWING:表示该窗口最后面的行(终点)

比如说:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW(表示从起点到当前行)
ROWS BETWEEN 2 PRECEDING AND 1 FOLLOWING(表示往前2行到往后1行)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW(表示往前2行到当前行)
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING(表示当前行到终点

数据类型

基础数据类型:

TINYINT,SMALLINT,INT,BIGINT,BOOLEAN,FLOAT,DOUBLE,STRING,BINARY,TIMESTAMP,DECIMAL,CHAR,VARCHAR,DATE。

复杂数据类型:

包括ARRAY,MAP,STRUCT,UNION

复杂类型的导入

  • ROW FORMAT DELIMITED是指明后面的关键词是列和元素分隔符的
  • FIELDS TERMINATED BY 是字段分隔符
  • COLLECTION ITEMS TERMINATED BY是元素分隔符(Array中的各元素、Struct中的各元素、Map中 的key-value对之间)
  • MAP KEYS TERMINATED BY是Map中key与value的分隔符
  • LINES TERMINATED BY是行之间的分隔符
  • STORED AS TEXTFILE指数据文件上传之后保存的格式
CREATE TABLE student
(
    name    STRING,
    favors  ARRAY<STRING>,
    scores  MAP<STRING, FLOAT>,
    address STRUCT<province:STRING, city:STRING, phone:STRING, zip:INT>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
    COLLECTION ITEMS TERMINATED BY ';' 
    MAP KEYS TERMINATED BY ':';

-- 导入 map xuzheng-yuwen:70,shuxue:65,yingyu:81
CREATE TABLE table_map
(
    name   string,
    scores map<string, int>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-'
    COLLECTION ITEMS TERMINATED BY ','
    MAP KEYS TERMINATED BY ':';
-- 导入struct 1-english,80,3.3
CREATE TABLE table_struct
(
    id     int,
    course struct<name:string, score:int, jidian:float>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-'
    COLLECTION ITEMS TERMINATED BY ',';
-- 导入array huangbo-beijing,shanghai,tianjin,hangzhou
CREATE TABLE table_array
(
    name           string,
    work_locations array<string>
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-' COLLECTION ITEMS TERMINATED BY ',';