在SQL处理中,窗口函数都是最后一步执行,而且仅位于Order by字句之前
**

初始数据

我们准备一张order表,字段分别为name,orderdate,cost.数据内容如下:

  1. jack,2015-01-01,10
  2. tony,2015-01-02,15
  3. jack,2015-02-03,23
  4. tony,2015-01-04,29
  5. jack,2015-01-05,46
  6. jack,2015-04-06,42
  7. tony,2015-01-07,50
  8. jack,2015-01-08,55
  9. mart,2015-04-08,62
  10. mart,2015-04-09,68
  11. neil,2015-05-10,12
  12. mart,2015-04-11,75
  13. neil,2015-06-12,80
  14. mart,2015-04-13,94

建表语句

create table if not exists t_window(
    name string,
    orderdate string,
    cost int
)comment '窗口表'
row format delimited fields terminated by ','
stored as textfile;

加载数据

load data local inpath '/home/admin/data/over.txt' into table t_window;

案例演示

1

select name,orderdate,cost,
sum(cost) over(partition by month(orderdate))
from t_window
jack    2015-01-01    10    205
jack    2015-01-08    55    205
tony    2015-01-07    50    205
jack    2015-01-05    46    205
tony    2015-01-04    29    205
tony    2015-01-02    15    205
jack    2015-02-03    23    23
mart    2015-04-13    94    341
jack    2015-04-06    42    341
mart    2015-04-11    75    341
mart    2015-04-09    68    341
mart    2015-04-08    62    341
neil    2015-05-10    12    12
neil    2015-06-12    80    80

解释:按月分区,窗口大小等于分区大小。205=10+55+50+46+29+15 (即2015-01数据总和)

2

select name,orderdate,cost,
sum(cost) over() as sample1
from t_window;
mart    2015-04-13    94    661
neil    2015-06-12    80    661
mart    2015-04-11    75    661
neil    2015-05-10    12    661
mart    2015-04-09    68    661
mart    2015-04-08    62    661
jack    2015-01-08    55    661
tony    2015-01-07    50    661
jack    2015-04-06    42    661
jack    2015-01-05    46    661
tony    2015-01-04    29    661
jack    2015-02-03    23    661
tony    2015-01-02    15    661
jack    2015-01-01    10    661

解释: 窗口大小是整个数据集。661等于所有行的cost相加。

3

select name,orderdate,cost,
sum(cost) over(partition by name) as sample2
from t_window;
jack    2015-01-05    46    176
jack    2015-01-08    55    176
jack    2015-01-01    10    176
jack    2015-04-06    42    176
jack    2015-02-03    23    176
mart    2015-04-13    94    299
mart    2015-04-11    75    299
mart    2015-04-09    68    299
mart    2015-04-08    62    299
neil    2015-05-10    12    92
neil    2015-06-12    80    92
tony    2015-01-04    29    94
tony    2015-01-02    15    94
tony    2015-01-07    50    94

解释:原理同1. 只不过是按name分区。176=46+55+10+42+23

4

select name,orderdate,cost,
sum(cost) over(partition by name order by orderdate) as sample3
from t_window;

jack    2015-01-01    10    10
jack    2015-01-05    46    56
jack    2015-01-08    55    111
jack    2015-02-03    23    134
jack    2015-04-06    42    176
mart    2015-04-08    62    62
mart    2015-04-09    68    130
mart    2015-04-11    75    205
mart    2015-04-13    94    299
neil    2015-05-10    12    12
neil    2015-06-12    80    92
tony    2015-01-02    15    15
tony    2015-01-04    29    44
tony    2015-01-07    50    94

解释:窗口大小=分区头部到当前行。比如第三行的111来由是 (10+46+55)


5

window子句:
- PRECEDING:往前
- FOLLOWING:往后
- CURRENT ROW:当前行
- UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

select name,orderdate,cost,
sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row )  as sample4
from t_window;

jack    2015-01-01    10    10
jack    2015-01-05    46    56
jack    2015-01-08    55    111
jack    2015-02-03    23    134
jack    2015-04-06    42    176
mart    2015-04-08    62    62
mart    2015-04-09    68    130
mart    2015-04-11    75    205
mart    2015-04-13    94    299
neil    2015-05-10    12    12
neil    2015-06-12    80    92
tony    2015-01-02    15    15
tony    2015-01-04    29    44
tony    2015-01-07    50    94

解释:窗口大小=分区头部到当前行。比如第三行的111来由是 (10+46+55)。与4是一样的。

6

select name,orderdate,cost,
sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING   and current row) as sample5
from t_window;

jack    2015-01-01    10    10
jack    2015-01-05    46    56
jack    2015-01-08    55    101
jack    2015-02-03    23    78
jack    2015-04-06    42    65
mart    2015-04-08    62    62
mart    2015-04-09    68    130
mart    2015-04-11    75    143
mart    2015-04-13    94    169
neil    2015-05-10    12    12
neil    2015-06-12    80    92
tony    2015-01-02    15    15
tony    2015-01-04    29    44
tony    2015-01-07    50    79

解释:窗口大小=前一行到当前行。按name分区。第三行的101来源(46+55)。mark第一行为什么是62?因为它的并不存在前一行(它是一个单独的分区)

7

select name,orderdate,cost,
sum(cost) over(partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7
from t_window;

jack    2015-01-01    10    176 
jack    2015-01-05    46    166
jack    2015-01-08    55    120
jack    2015-02-03    23    65
jack    2015-04-06    42    42
mart    2015-04-08    62    299
mart    2015-04-09    68    237
mart    2015-04-11    75    169
mart    2015-04-13    94    94
neil    2015-05-10    12    92
neil    2015-06-12    80    80
tony    2015-01-02    15    94
tony    2015-01-04    29    79
tony    2015-01-07    50    50

解释:窗口大小=当前行到分区尾部。比如第4行 jack 2015-02-03 23 65 中的65=23+42


8 row_number

select name,orderdate,cost,
sum(cost) over(partition by name order by cost) as sum_cost,
row_number() over(partition by name order by cost) as rn
from t_window;

jack    2015-01-01    10    10    1
jack    2015-02-03    23    33    2
jack    2015-04-06    42    75    3
jack    2015-01-05    46    121    4
jack    2015-01-08    55    176    5
mart    2015-04-08    62    62    1
mart    2015-04-09    68    130    2
mart    2015-04-11    75    205    3
mart    2015-04-13    94    299    4
neil    2015-05-10    12    12    1
neil    2015-06-12    80    92    2
tony    2015-01-02    15    15    1
tony    2015-01-04    29    44    2
tony    2015-01-07    50    94    3

解释:row_number 表示在分区窗口中的行号

9 rank

为了测试rank,我们需要增加一条数据 jack,2015-02-04,23
此时源数据如下:

jack,2015-01-01,10
tony,2015-01-02,15
jack,2015-02-03,23
jack,2015-02-04,23
tony,2015-01-04,29
jack,2015-01-05,46
jack,2015-04-06,42
tony,2015-01-07,50
jack,2015-01-08,55
mart,2015-04-08,62
mart,2015-04-09,68
neil,2015-05-10,12
mart,2015-04-11,75
neil,2015-06-12,80
mart,2015-04-13,94

重新导入hive: load data local inpath '/home/admin/data/over1.txt' overwrite into table t_window;

select name,orderdate,cost,
rank() over(partition by name order by cost) as rank
from t_window;

jack    2015-01-01    10    1
jack    2015-02-03    23    2
jack    2015-02-04    23    2
jack    2015-04-06    42    4
jack    2015-01-05    46    5
jack    2015-01-08    55    6
mart    2015-04-08    62    1
mart    2015-04-09    68    2
mart    2015-04-11    75    3
mart    2015-04-13    94    4
neil    2015-05-10    12    1
neil    2015-06-12    80    2
tony    2015-01-02    15    1
tony    2015-01-04    29    2
tony    2015-01-07    50    3

解释:根据cost大小进行排序。注意jack 2015-02-03 和2015-02-04 cost都是23。rank值是相同的(2)。但是第三名却不见了。

10 DENSE_RANK

select name,orderdate,cost,
DENSE_RANK() over(partition by name order by cost) as drank
from t_window;

jack    2015-01-01    10    1
jack    2015-02-03    23    2
jack    2015-02-04    23    2
jack    2015-04-06    42    3
jack    2015-01-05    46    4
jack    2015-01-08    55    5
mart    2015-04-08    62    1
mart    2015-04-09    68    2
mart    2015-04-11    75    3
mart    2015-04-13    94    4
neil    2015-05-10    12    1
neil    2015-06-12    80    2
tony    2015-01-02    15    1
tony    2015-01-04    29    2
tony    2015-01-07    50    3

解释:DENSE_RANK和rank的区别就是排名相等时,不会留下空位。注意,drank 2后就是3.但是rank 2后是4