在SQL处理中,窗口函数都是最后一步执行,而且仅位于Order by字句之前
**
初始数据
我们准备一张order表,字段分别为name,orderdate,cost.数据内容如下:
jack,2015-01-01,10
tony,2015-01-02,15
jack,2015-02-03,23
tony,2015-01-04,29
jack,2015-01-05,46
jack,2015-04-06,42
tony,2015-01-07,50
jack,2015-01-08,55
mart,2015-04-08,62
mart,2015-04-09,68
neil,2015-05-10,12
mart,2015-04-11,75
neil,2015-06-12,80
mart,2015-04-13,94
建表语句
create table if not exists t_window(
name string,
orderdate string,
cost int
)comment '窗口表'
row format delimited fields terminated by ','
stored as textfile;
加载数据
load data local inpath '/home/admin/data/over.txt' into table t_window;
案例演示
1
select name,orderdate,cost,
sum(cost) over(partition by month(orderdate))
from t_window
jack 2015-01-01 10 205
jack 2015-01-08 55 205
tony 2015-01-07 50 205
jack 2015-01-05 46 205
tony 2015-01-04 29 205
tony 2015-01-02 15 205
jack 2015-02-03 23 23
mart 2015-04-13 94 341
jack 2015-04-06 42 341
mart 2015-04-11 75 341
mart 2015-04-09 68 341
mart 2015-04-08 62 341
neil 2015-05-10 12 12
neil 2015-06-12 80 80
解释:按月分区,窗口大小等于分区大小。205=10+55+50+46+29+15 (即2015-01数据总和)
2
select name,orderdate,cost,
sum(cost) over() as sample1
from t_window;
mart 2015-04-13 94 661
neil 2015-06-12 80 661
mart 2015-04-11 75 661
neil 2015-05-10 12 661
mart 2015-04-09 68 661
mart 2015-04-08 62 661
jack 2015-01-08 55 661
tony 2015-01-07 50 661
jack 2015-04-06 42 661
jack 2015-01-05 46 661
tony 2015-01-04 29 661
jack 2015-02-03 23 661
tony 2015-01-02 15 661
jack 2015-01-01 10 661
解释: 窗口大小是整个数据集。661等于所有行的cost相加。
3
select name,orderdate,cost,
sum(cost) over(partition by name) as sample2
from t_window;
jack 2015-01-05 46 176
jack 2015-01-08 55 176
jack 2015-01-01 10 176
jack 2015-04-06 42 176
jack 2015-02-03 23 176
mart 2015-04-13 94 299
mart 2015-04-11 75 299
mart 2015-04-09 68 299
mart 2015-04-08 62 299
neil 2015-05-10 12 92
neil 2015-06-12 80 92
tony 2015-01-04 29 94
tony 2015-01-02 15 94
tony 2015-01-07 50 94
解释:原理同1. 只不过是按name分区。176=46+55+10+42+23
4
select name,orderdate,cost,
sum(cost) over(partition by name order by orderdate) as sample3
from t_window;
jack 2015-01-01 10 10
jack 2015-01-05 46 56
jack 2015-01-08 55 111
jack 2015-02-03 23 134
jack 2015-04-06 42 176
mart 2015-04-08 62 62
mart 2015-04-09 68 130
mart 2015-04-11 75 205
mart 2015-04-13 94 299
neil 2015-05-10 12 12
neil 2015-06-12 80 92
tony 2015-01-02 15 15
tony 2015-01-04 29 44
tony 2015-01-07 50 94
解释:窗口大小=分区头部到当前行。比如第三行的111来由是 (10+46+55)
5
window子句:
- PRECEDING:往前
- FOLLOWING:往后
- CURRENT ROW:当前行
- UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
select name,orderdate,cost,
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4
from t_window;
jack 2015-01-01 10 10
jack 2015-01-05 46 56
jack 2015-01-08 55 111
jack 2015-02-03 23 134
jack 2015-04-06 42 176
mart 2015-04-08 62 62
mart 2015-04-09 68 130
mart 2015-04-11 75 205
mart 2015-04-13 94 299
neil 2015-05-10 12 12
neil 2015-06-12 80 92
tony 2015-01-02 15 15
tony 2015-01-04 29 44
tony 2015-01-07 50 94
解释:窗口大小=分区头部到当前行。比如第三行的111来由是 (10+46+55)。与4是一样的。
6
select name,orderdate,cost,
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5
from t_window;
jack 2015-01-01 10 10
jack 2015-01-05 46 56
jack 2015-01-08 55 101
jack 2015-02-03 23 78
jack 2015-04-06 42 65
mart 2015-04-08 62 62
mart 2015-04-09 68 130
mart 2015-04-11 75 143
mart 2015-04-13 94 169
neil 2015-05-10 12 12
neil 2015-06-12 80 92
tony 2015-01-02 15 15
tony 2015-01-04 29 44
tony 2015-01-07 50 79
解释:窗口大小=前一行到当前行。按name分区。第三行的101来源(46+55)。mark第一行为什么是62?因为它的并不存在前一行(它是一个单独的分区)
7
select name,orderdate,cost,
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7
from t_window;
jack 2015-01-01 10 176
jack 2015-01-05 46 166
jack 2015-01-08 55 120
jack 2015-02-03 23 65
jack 2015-04-06 42 42
mart 2015-04-08 62 299
mart 2015-04-09 68 237
mart 2015-04-11 75 169
mart 2015-04-13 94 94
neil 2015-05-10 12 92
neil 2015-06-12 80 80
tony 2015-01-02 15 94
tony 2015-01-04 29 79
tony 2015-01-07 50 50
解释:窗口大小=当前行到分区尾部。比如第4行 jack 2015-02-03 23 65
中的65=23+42
8 row_number
select name,orderdate,cost,
sum(cost) over(partition by name order by cost) as sum_cost,
row_number() over(partition by name order by cost) as rn
from t_window;
jack 2015-01-01 10 10 1
jack 2015-02-03 23 33 2
jack 2015-04-06 42 75 3
jack 2015-01-05 46 121 4
jack 2015-01-08 55 176 5
mart 2015-04-08 62 62 1
mart 2015-04-09 68 130 2
mart 2015-04-11 75 205 3
mart 2015-04-13 94 299 4
neil 2015-05-10 12 12 1
neil 2015-06-12 80 92 2
tony 2015-01-02 15 15 1
tony 2015-01-04 29 44 2
tony 2015-01-07 50 94 3
解释:row_number 表示在分区窗口中的行号
9 rank
为了测试rank,我们需要增加一条数据 jack,2015-02-04,23
此时源数据如下:
jack,2015-01-01,10
tony,2015-01-02,15
jack,2015-02-03,23
jack,2015-02-04,23
tony,2015-01-04,29
jack,2015-01-05,46
jack,2015-04-06,42
tony,2015-01-07,50
jack,2015-01-08,55
mart,2015-04-08,62
mart,2015-04-09,68
neil,2015-05-10,12
mart,2015-04-11,75
neil,2015-06-12,80
mart,2015-04-13,94
重新导入hive: load data local inpath '/home/admin/data/over1.txt' overwrite into table t_window;
select name,orderdate,cost,
rank() over(partition by name order by cost) as rank
from t_window;
jack 2015-01-01 10 1
jack 2015-02-03 23 2
jack 2015-02-04 23 2
jack 2015-04-06 42 4
jack 2015-01-05 46 5
jack 2015-01-08 55 6
mart 2015-04-08 62 1
mart 2015-04-09 68 2
mart 2015-04-11 75 3
mart 2015-04-13 94 4
neil 2015-05-10 12 1
neil 2015-06-12 80 2
tony 2015-01-02 15 1
tony 2015-01-04 29 2
tony 2015-01-07 50 3
解释:根据cost大小进行排序。注意jack 2015-02-03 和2015-02-04 cost都是23。rank值是相同的(2)。但是第三名却不见了。
10 DENSE_RANK
select name,orderdate,cost,
DENSE_RANK() over(partition by name order by cost) as drank
from t_window;
jack 2015-01-01 10 1
jack 2015-02-03 23 2
jack 2015-02-04 23 2
jack 2015-04-06 42 3
jack 2015-01-05 46 4
jack 2015-01-08 55 5
mart 2015-04-08 62 1
mart 2015-04-09 68 2
mart 2015-04-11 75 3
mart 2015-04-13 94 4
neil 2015-05-10 12 1
neil 2015-06-12 80 2
tony 2015-01-02 15 1
tony 2015-01-04 29 2
tony 2015-01-07 50 3
解释:DENSE_RANK和rank的区别就是排名相等时,不会留下空位。注意,drank 2后就是3.但是rank 2后是4