工作中用到了几个hive开窗函数,便想把hive开窗函数系统梳理一遍。
开窗函数
普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。因此,普通的聚合函数每组(Group by)只返回一个值,而开窗函数则可为窗口中的每行都返回一个值。简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。开窗函数一般分为两类,聚合开窗函数和排序开窗函数。
测试数据
-- 建表create table student_scores(id int,studentId int,language int,math int,english int,classId string,departmentId string);-- 写入数据insert into table student_scores values(1,111,68,69,90,'class1','department1'),(2,112,73,80,96,'class1','department1'),(3,113,90,74,75,'class1','department1'),(4,114,89,94,93,'class1','department1'),(5,115,99,93,89,'class1','department1'),(6,121,96,74,79,'class2','department1'),(7,122,89,86,85,'class2','department1'),(8,123,70,78,61,'class2','department1'),(9,124,76,70,76,'class2','department1'),(10,211,89,93,60,'class1','department2'),(11,212,76,83,75,'class1','department2'),(12,213,71,94,90,'class1','department2'),(13,214,94,94,66,'class1','department2'),(14,215,84,82,73,'class1','department2'),(15,216,85,74,93,'class1','department2'),(16,221,77,99,61,'class2','department2'),(17,222,80,78,96,'class2','department2'),(18,223,79,74,96,'class2','department2'),(19,224,75,80,78,'class2','department2'),(20,225,82,85,63,'class2','department2');
聚合开窗函数
count开窗函数
-- count 开窗函数select studentId,math,departmentId,classId,-- 以符合条件的所有行作为窗口count(math) over() as count1,-- 以按classId分组的所有行作为窗口count(math) over(partition by classId) as count2,-- 以按classId分组、按math排序的所有行作为窗口count(math) over(partition by classId order by math) as count3,-- 以按classId分组、按math排序、按 当前行+往前1行+往后2行的行作为窗口count(math) over(partition by classId order by math rows between 1 preceding and 2 following) as count4from student_scores where departmentId='department1';结果studentid math departmentid classid count1 count2 count3 count4111 69 department1 class1 9 5 1 3113 74 department1 class1 9 5 2 4112 80 department1 class1 9 5 3 4115 93 department1 class1 9 5 4 3114 94 department1 class1 9 5 5 2124 70 department1 class2 9 4 1 3121 74 department1 class2 9 4 2 4123 78 department1 class2 9 4 3 3122 86 department1 class2 9 4 4 2结果解释:studentid=115,count1为所有的行数9,count2为分区class1中的行数5,count3为分区class1中math值<=93的行数4,count4为分区class1中math值向前+1行向后+2行(实际只有1行)的总行数3。
sum开窗函数
-- sum开窗函数select studentId,math,departmentId,classId,-- 以符合条件的所有行作为窗口sum(math) over() as sum1,-- 以按classId分组的所有行作为窗口sum(math) over(partition by classId) as sum2,-- 以按classId分组、按math排序后、按到当前行(含当前行)的所有行作为窗口sum(math) over(partition by classId order by math) as sum3,-- 以按classId分组、按math排序后、按当前行+往前1行+往后2行的行作为窗口sum(math) over(partition by classId order by math rows between 1 preceding and 2 following) as sum4from student_scores where departmentId='department1';结果studentid math departmentid classid sum1 sum2 sum3 sum4111 69 department1 class1 718 410 69 223113 74 department1 class1 718 410 143 316112 80 department1 class1 718 410 223 341115 93 department1 class1 718 410 316 267114 94 department1 class1 718 410 410 187124 70 department1 class2 718 308 70 222121 74 department1 class2 718 308 144 308123 78 department1 class2 718 308 222 238122 86 department1 class2 718 308 308 164结果解释:同count开窗函数
min开窗函数
-- min 开窗函数select studentId,math,departmentId,classId,-- 以符合条件的所有行作为窗口min(math) over() as min1,-- 以按classId分组的所有行作为窗口min(math) over(partition by classId) as min2,-- 以按classId分组、按math排序后、按到当前行(含当前行)的所有行作为窗口min(math) over(partition by classId order by math) as min3,-- 以按classId分组、按math排序后、按当前行+往前1行+往后2行的行作为窗口min(math) over(partition by classId order by math rows between 1 preceding and 2 following) as min4from student_scores where departmentId='department1';结果studentid math departmentid classid min1 min2 min3 min4111 69 department1 class1 69 69 69 69113 74 department1 class1 69 69 69 69112 80 department1 class1 69 69 69 74115 93 department1 class1 69 69 69 80114 94 department1 class1 69 69 69 93124 70 department1 class2 69 70 70 70121 74 department1 class2 69 70 70 70123 78 department1 class2 69 70 70 74122 86 department1 class2 69 70 70 78结果解释:同count开窗函数
max开窗函数
-- max 开窗函数select studentId,math,departmentId,classId,-- 以符合条件的所有行作为窗口max(math) over() as max1,-- 以按classId分组的所有行作为窗口max(math) over(partition by classId) as max2,-- 以按classId分组、按math排序后、按到当前行(含当前行)的所有行作为窗口max(math) over(partition by classId order by math) as max3,-- 以按classId分组、按math排序后、按当前行+往前1行+往后2行的行作为窗口max(math) over(partition by classId order by math rows between 1 preceding and 2 following) as max4from student_scores where departmentId='department1';结果studentid math departmentid classid max1 max2 max3 max4111 69 department1 class1 94 94 69 80113 74 department1 class1 94 94 74 93112 80 department1 class1 94 94 80 94115 93 department1 class1 94 94 93 94114 94 department1 class1 94 94 94 94124 70 department1 class2 94 86 70 78121 74 department1 class2 94 86 74 86123 78 department1 class2 94 86 78 86122 86 department1 class2 94 86 86 86结果解释:同count开窗函数
avg开窗函数
-- avg 开窗函数select studentId,math,departmentId,classId,-- 以符合条件的所有行作为窗口avg(math) over() as avg1,-- 以按classId分组的所有行作为窗口avg(math) over(partition by classId) as avg2,-- 以按classId分组、按math排序后、按到当前行(含当前行)的所有行作为窗口avg(math) over(partition by classId order by math) as avg3,-- 以按classId分组、按math排序后、按当前行+往前1行+往后2行的行作为窗口avg(math) over(partition by classId order by math rows between 1 preceding and 2 following) as avg4from student_scores where departmentId='department1';结果studentid math departmentid classid avg1 avg2 avg3 avg4111 69 department1 class1 79.77777777777777 82.0 69.0 74.33333333333333113 74 department1 class1 79.77777777777777 82.0 71.5 79.0112 80 department1 class1 79.77777777777777 82.0 74.33333333333333 85.25115 93 department1 class1 79.77777777777777 82.0 79.0 89.0114 94 department1 class1 79.77777777777777 82.0 82.0 93.5124 70 department1 class2 79.77777777777777 77.0 70.0 74.0121 74 department1 class2 79.77777777777777 77.0 72.0 77.0123 78 department1 class2 79.77777777777777 77.0 74.0 79.33333333333333122 86 department1 class2 79.77777777777777 77.0 77.0 82.0结果解释:同count开窗函数
first_value开窗函数
返回分区中的第一个值。
-- first_value 开窗函数select studentId,math,departmentId,classId,-- 以符合条件的所有行作为窗口first_value(math) over() as first_value1,-- 以按classId分组的所有行作为窗口first_value(math) over(partition by classId) as first_value2,-- 以按classId分组、按math排序后、按到当前行(含当前行)的所有行作为窗口first_value(math) over(partition by classId order by math) as first_value3,-- 以按classId分组、按math排序后、按当前行+往前1行+往后2行的行作为窗口first_value(math) over(partition by classId order by math rows between 1 preceding and 2 following) as first_value4from student_scores where departmentId='department1';结果studentid math departmentid classid first_value1 first_value2 first_value3 first_value4111 69 department1 class1 69 69 69 69113 74 department1 class1 69 69 69 69112 80 department1 class1 69 69 69 74115 93 department1 class1 69 69 69 80114 94 department1 class1 69 69 69 93124 70 department1 class2 69 74 70 70121 74 department1 class2 69 74 70 70123 78 department1 class2 69 74 70 74122 86 department1 class2 69 74 70 78结果解释:studentid=124 first_value1:第一个值是69,first_value2:classId=class1分区 math的第一个值是69。
last_value开窗函数
返回分区中的第一个值。
-- last_value 开窗函数select studentId,math,departmentId,classId,-- 以符合条件的所有行作为窗口last_value(math) over() as last_value1,-- 以按classId分组的所有行作为窗口last_value(math) over(partition by classId) as last_value2,-- 以按classId分组、按math排序后、按到当前行(含当前行)的所有行作为窗口last_value(math) over(partition by classId order by math) as last_value3,-- 以按classId分组、按math排序后、按当前行+往前1行+往后2行的行作为窗口last_value(math) over(partition by classId order by math rows between 1 preceding and 2 following) as last_value4from student_scores where departmentId='department1';结果studentid math departmentid classid last_value1 last_value2 last_value3 last_value4111 69 department1 class1 70 93 69 80113 74 department1 class1 70 93 74 93112 80 department1 class1 70 93 80 94115 93 department1 class1 70 93 93 94114 94 department1 class1 70 93 94 94124 70 department1 class2 70 70 70 78121 74 department1 class2 70 70 74 86123 78 department1 class2 70 70 78 86122 86 department1 class2 70 70 86 86
lag开窗函数
lag(col,n,default) 用于统计窗口内往上第n个值。
col:列名
n:往上第n行
default:往上第n行为NULL时候,取默认值,不指定则取NULL
-- lag 开窗函数select studentId,math,departmentId,classId,--窗口内 往上取第二个 取不到时赋默认值60lag(math,2,60) over(partition by classId order by math) as lag1,--窗口内 往上取第二个 取不到时赋默认值NULLlag(math,2) over(partition by classId order by math) as lag2from student_scores where departmentId='department1';结果studentid math departmentid classid lag1 lag2111 69 department1 class1 60 NULL113 74 department1 class1 60 NULL112 80 department1 class1 69 69115 93 department1 class1 74 74114 94 department1 class1 80 80124 70 department1 class2 60 NULL121 74 department1 class2 60 NULL123 78 department1 class2 70 70122 86 department1 class2 74 74结果解释:第3行 lag1:窗口内(69 74 80) 当前行80 向上取第二个值为69倒数第3行 lag2:窗口内(70 74) 当前行74 向上取第二个值为NULL
lead开窗函数
lead(col,n,default) 用于统计窗口内往下第n个值。
col:列名
n:往下第n行
default:往下第n行为NULL时候,取默认值,不指定则取NULL
-- lead开窗函数select studentId,math,departmentId,classId,--窗口内 往下取第二个 取不到时赋默认值60lead(math,2,60) over(partition by classId order by math) as lead1,--窗口内 往下取第二个 取不到时赋默认值NULLlead(math,2) over(partition by classId order by math) as lead2from student_scores where departmentId='department1';结果studentid math departmentid classid lead1 lead2111 69 department1 class1 80 80113 74 department1 class1 93 93112 80 department1 class1 94 94115 93 department1 class1 60 NULL114 94 department1 class1 60 NULL124 70 department1 class2 78 78121 74 department1 class2 86 86123 78 department1 class2 60 NULL122 86 department1 class2 60 NULL结果解释:第4行lead1 窗口内向下第二个值为空,赋值60
cume_dist开窗函数
计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:
小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。
-- cume_dist 开窗函数select studentId,math,departmentId,classId,-- 统计小于等于当前分数的人数占总人数的比例cume_dist() over(order by math) as cume_dist1,-- 统计大于等于当前分数的人数占总人数的比例cume_dist() over(order by math desc) as cume_dist2,-- 统计分区内小于等于当前分数的人数占总人数的比例cume_dist() over(partition by classId order by math) as cume_dist3from student_scores where departmentId='department1';结果studentid math departmentid classid cume_dist1 cume_dist2 cume_dist3111 69 department1 class1 0.1111111111111111 1.0 0.2113 74 department1 class1 0.4444444444444444 0.7777777777777778 0.4112 80 department1 class1 0.6666666666666666 0.4444444444444444 0.6115 93 department1 class1 0.8888888888888888 0.2222222222222222 0.8114 94 department1 class1 1.0 0.1111111111111111 1.0124 70 department1 class2 0.2222222222222222 0.8888888888888888 0.25121 74 department1 class2 0.4444444444444444 0.7777777777777778 0.5123 78 department1 class2 0.5555555555555556 0.5555555555555556 0.75122 86 department1 class2 0.7777777777777778 0.3333333333333333 1.0结果解释:第三行:cume_dist1=小于等于80的人数为6/总人数9=0.6666666666666666cume_dist2=大于等于80的人数为4/总人数9=0.4444444444444444cume_dist3=分区内小于等于80的人数为3/分区内总人数5=0.6
排序开窗函数
rank开窗函数
rank 开窗函数基于 over 子句中的 order by 确定一组值中一个值的排名。如果存在partition by ,则为每个分区组中的每个值排名。排名可能不是连续的。例如,如果两个行的排名为 1,则下一个排名为 3。
-- rank 开窗函数select *,-- 对全部学生按数学分数排序rank() over(order by math) as rank1,-- 对院系 按数学分数排序rank() over(partition by departmentId order by math) as rank2,-- 对每个院系每个班级 按数学分数排序rank() over(partition by departmentId,classId order by math) as rank3from student_scores;结果id studentid language math english classid departmentid rank1 rank2 rank31 111 68 69 90 class1 department1 1 1 13 113 90 74 75 class1 department1 3 3 22 112 73 80 96 class1 department1 9 6 35 115 99 93 89 class1 department1 15 8 44 114 89 94 93 class1 department1 17 9 59 124 76 70 76 class2 department1 2 2 16 121 96 74 79 class2 department1 3 3 28 123 70 78 61 class2 department1 7 5 37 122 89 86 85 class2 department1 14 7 415 216 85 74 93 class1 department2 3 1 114 215 84 82 73 class1 department2 11 5 211 212 76 83 75 class1 department2 12 6 310 211 89 93 60 class1 department2 15 8 412 213 71 94 90 class1 department2 17 9 513 214 94 94 66 class1 department2 17 9 518 223 79 74 96 class2 department2 3 1 117 222 80 78 96 class2 department2 7 3 219 224 75 80 78 class2 department2 9 4 320 225 82 85 63 class2 department2 13 7 416 221 77 99 61 class2 department2 20 11 5
dense_rank开窗函数
dense_rank与rank有一点不同,当排名一样的时候,接下来的行是连续的。如两个行的排名为 1,则下一个排名为 2。
-- dense_rank 开窗函数select *,-- 对全部学生按数学分数排序dense_rank() over(order by math) as dense_rank1,-- 对院系 按数学分数排序dense_rank() over(partition by departmentId order by math) as dense_rank2,-- 对每个院系每个班级 按数学分数排序dense_rank() over(partition by departmentId,classId order by math) as dense_rank3from student_scores;结果:id studentid language math english classid departmentid dense_rank1 dense_rank2 dense_rank31 111 68 69 90 class1 department1 1 1 13 113 90 74 75 class1 department1 3 3 22 112 73 80 96 class1 department1 5 5 35 115 99 93 89 class1 department1 10 7 44 114 89 94 93 class1 department1 11 8 59 124 76 70 76 class2 department1 2 2 16 121 96 74 79 class2 department1 3 3 28 123 70 78 61 class2 department1 4 4 37 122 89 86 85 class2 department1 9 6 415 216 85 74 93 class1 department2 3 1 114 215 84 82 73 class1 department2 6 4 211 212 76 83 75 class1 department2 7 5 310 211 89 93 60 class1 department2 10 7 412 213 71 94 90 class1 department2 11 8 513 214 94 94 66 class1 department2 11 8 518 223 79 74 96 class2 department2 3 1 117 222 80 78 96 class2 department2 4 2 219 224 75 80 78 class2 department2 5 3 320 225 82 85 63 class2 department2 8 6 416 221 77 99 61 class2 department2 12 9 5
ntile开窗函数
将分区中已排序的行划分为大小尽可能相等的指定数量的排名的组,并返回给定行所在的组的排名。
-- ntile 开窗函数select *,-- 对分区内的数据分成两组ntile(2) over(partition by departmentid order by math) as ntile1,-- 对分区内的数据分成三组ntile(3) over(partition by departmentid order by math) as ntile2from student_scores;结果id studentid language math english classid departmentid ntile1 ntile21 111 68 69 90 class1 department1 1 19 124 76 70 76 class2 department1 1 16 121 96 74 79 class2 department1 1 13 113 90 74 75 class1 department1 1 28 123 70 78 61 class2 department1 1 22 112 73 80 96 class1 department1 2 27 122 89 86 85 class2 department1 2 35 115 99 93 89 class1 department1 2 34 114 89 94 93 class1 department1 2 318 223 79 74 96 class2 department2 1 115 216 85 74 93 class1 department2 1 117 222 80 78 96 class2 department2 1 119 224 75 80 78 class2 department2 1 114 215 84 82 73 class1 department2 1 211 212 76 83 75 class1 department2 1 220 225 82 85 63 class2 department2 2 210 211 89 93 60 class1 department2 2 212 213 71 94 90 class1 department2 2 313 214 94 94 66 class1 department2 2 316 221 77 99 61 class2 department2 2 3结果解释:第8行ntile1:对分区的数据均匀分成2组后,当前行的组排名为2ntile2:对分区的数据均匀分成3组后,当前行的组排名为3
row_number开窗函数
从1开始对分区内的数据排序。
-- row_number 开窗函数select studentid,departmentid,classid,math,-- 对分区departmentid,classid内的数据按math排序row_number() over(partition by departmentid,classid order by math) as row_numberfrom student_scores;结果studentid departmentid classid math row_number111 department1 class1 69 1113 department1 class1 74 2112 department1 class1 80 3115 department1 class1 93 4114 department1 class1 94 5124 department1 class2 70 1121 department1 class2 74 2123 department1 class2 78 3122 department1 class2 86 4216 department2 class1 74 1215 department2 class1 82 2212 department2 class1 83 3211 department2 class1 93 4213 department2 class1 94 5214 department2 class1 94 6223 department2 class2 74 1222 department2 class2 78 2224 department2 class2 80 3225 department2 class2 85 4221 department2 class2 99 5结果解释:同一分区,相同值,不同序。如studentid=213 studentid=214 值都为94 排序为5,6。
percent_rank开窗函数
计算给定行的百分比排名。可以用来计算超过了百分之多少的人。如360小助手开机速度超过了百分之多少的人。
(当前行的rank值-1)/(分组内的总行数-1)
-- percent_rank 开窗函数select studentid,departmentid,classid,math,row_number() over(partition by departmentid,classid order by math) as row_number,percent_rank() over(partition by departmentid,classid order by math) as percent_rankfrom student_scores;结果studentid departmentid classid math row_number percent_rank111 department1 class1 69 1 0.0113 department1 class1 74 2 0.25112 department1 class1 80 3 0.5115 department1 class1 93 4 0.75114 department1 class1 94 5 1.0124 department1 class2 70 1 0.0121 department1 class2 74 2 0.3333333333333333123 department1 class2 78 3 0.6666666666666666122 department1 class2 86 4 1.0216 department2 class1 74 1 0.0215 department2 class1 82 2 0.2212 department2 class1 83 3 0.4211 department2 class1 93 4 0.6213 department2 class1 94 5 0.8214 department2 class1 94 6 0.8223 department2 class2 74 1 0.0222 department2 class2 78 2 0.25224 department2 class2 80 3 0.5225 department2 class2 85 4 0.75221 department2 class2 99 5 1.0结果解释:studentid=115,percent_rank=(4-1)/(5-1)=0.75studentid=123,percent_rank=(3-1)/(4-1)=0.6666666666666666
