基础 - 3. SQL 高级特性 - 《PostgreSQL》

1.复杂查询CTE
2.递归查询CTE
3.批量插入
4.returning 返回修改的数据
5.upsert解决冲突
6.数据抽样
- 6.1 system 抽样方式:
- 6.2 bernoull抽样方式：

[https://blog.csdn.net/weixin_41191813/article/details/118523192]

介绍使用公用表达式可以让语句更加清晰简练. 除此之外，根据微软对CTE好处的描述，可以归结为四点:

可以定义递归公用表表达式(CTE)
当不需要将结果集作为视图被多个地方引用时，CTE可以使其更加简洁
GROUP BY语句可以直接作用于子查询所得的标量列
可以在一个语句中多次引用公用表表达式(CTE)

其他：窗口函数联机分析处理： rank max avg 等函数： https://zhuanlan.zhihu.com/p/143561453

1.复杂查询CTE

① 定义一个简单的cte：
postgres=# with t as (select generate_series(1,3)) select * from t;
 generate_series 
-----------------
               1
               2
               3

2.递归查询CTE

① 使用recursive属性引用自己的输出，从而实现递归，例如从1加到100：
postgres=# with recursive t (x) as (select 1 union select x+1 from t where x<100) select sum(x) from t;
 sum  
------
 5050
(1 row)
② 相关案例：
#说明：当给定一个id时能给它完整的地名，例如当id=7时，地名时：中国辽宁沈阳和平区，当id=5时，地名是：中国辽宁大连，建表语句如下：
create table test_ared(id int4,name varchar(32),fatherid int4);
insert into test_ared values (1,'中国',0);
insert into test_ared values (2,'辽宁',1);
insert into test_ared values (3,'山东',1);
insert into test_ared values (4,'沈阳',2);
insert into test_ared values (5,'大连',2);
insert into test_ared values (6,'济南',3);
insert into test_ared values (7,'和平区',4);
insert into test_ared values (8,'沈河区',4);
#使用postgresql的with查询索引ID为7以及以上的所有父节点：
with recursive a as (select test_ared.* from test_ared where id=7 union all select test_ared.* from test_ared,a where test_ared.id=a.fatherid) select * from a order by a.id;
#将输出结果name字段合并成“中国辽宁沈阳和平区”，这里使用string_agg函数：
with recursive a as (select test_ared.* from test_ared where id=7 union all select test_ared.* from test_ared,a where test_ared.id=a.fatherid) select string_agg(name,'') from (select name from a order by id)n; 
#以上是查询当前节点的所有父节点，现在查询当前节点以及其下的子节点，如查找沈阳市及管辖的区：
with recursive a as (select test_ared.* from test_ared where id=4 union all select test_ared.* from test_ared,a where test_ared.fatherid=a.id) select * from a order by a.id;

3.批量插入

①    使用insert into…select…进行批量插入：
·    通过表数据批量插入：
#例如创建一张表结构和user_ini相同的表并插入user_ini表的全量数据：
create table user_ini(user_id int8,user_name text);
insert into user_ini(user_id,user_name) values(1,'张三'),(2,'李四');
#创建表结构相同的表，并进行数据批量插入：
create table tb1_batch1 (user_id int4,user_name text);
insert into tb1_batch1(user_id,user_name) select user_id,user_name from user_ini;
·    通过generate_series函数实现批量插入数据：
#创建表结构：
create table tb1_batch2(id int4,info text);
#进行批量数据的插入：
insert into tb1_batch2(id,info) select generate_series(1,5),'batch2';
②    使用insert into values(),()…()方式进行批量插入：
#创建一张表：
create table tb1_batch3(id int4,info text);
#进行插入数据：
insert into tb1_batch3(id,info) values (1,'a'),(2,'b'),(3,'c');
③    通过cppy或\copy元命令进行插入数据：
#说明：通过copy拷贝数据效率会高一些：
#创建一张表，并插入数据：
create table tb1_batch4(id int4,info text);
insert into tb1_batch4(id,info) select n,n||'_test' from generate_series(1,10)n;
#导出数据：
copy tb1_batch4 to '/tmp/tb1_batch4';
#导入数据：
truncate tb1_batch4;
copy tb1_batch4 from '/tmp/tb1_batch4';

4.returning 返回修改的数据

①    returning返回插入的数据：
#创建一张表：
create table test_r1(id serial,flag char(11));
#插入数据，此时发现可以返回插入的数据：
postgres=# insert into test_r1(flag) values('b') returning *;
 id |    flag     
----+-------------
  2 | b          
(1 row)
insert 1
②     returning返回更新后的数据：
# 例如更新一行数据：
postgres=# update test_r1 set flag='c' where id=2 returning *;
 id |    flag     
----+-------------
  2 | c          
(1 row)
update 1
③     returning返回删除的数据：
# 例如删除一行数据：
postgres=# delete from test_r1 where id=1 returning *;
 id |    flag     
----+-------------
  1 | a          
(1 row)
delete 1

5.upsert解决冲突

①    解决插入冲突问题：
# 创建一张表：
create table test (user_name text primary key, login_cnt int4,last_login_time timestamp(0) without time zone);
# 插入一行数据：
insert into test (user_name,login_cnt) values ('francs',1);
# 再次重复插入，此时会发现会报错，因为设置了主键：
insert into test (user_name,login_cnt) values ('matiler',1),('francs',1);
ERROR:  duplicate key value violates unique constraint "test_pkey"
DETAIL:  Key (user_name)=(francs) already exists.
# 解决报错问题：
# on conflict(user_name)字段，如果出现冲突，那么则执行do update set之后的动作，并且记录时间：
insert into test (user_name,login_cnt) values ('matiler',1),('francs',1) on conflict(user_name) do update set login_cnt=test.login_cnt+excluded.login_cnt,last_login_time=now();   
#也就可以定义数据冲突后，什么也不做，只将可以插入的行进行插入，这时就需要指定do nothing属性：
insert into test(user_name,login_cnt) values ('matiler',1),('francs',1) on conflict(user_name) do nothing;

6.数据抽样

6.1 system 抽样方式:

数据抽样（TABLESAMPLE）在数据处理方面经常用到，特别是当表数据量比较大时，随机查询表中一定数量记录的操作很常见

PostgreSQL早在9.5版时就已经提供了 TABLESAMPLE数据抽样功能，9.5版前通常通过ORDER BY random（）方式实现数据抽样，这种方式虽然在功能上满足随机返回指定行数据，但性能很低，如下所示：

mydb=> SELECT * FROM user_ini ORDER BY random() LIMIT 1; 
          id | user_id | user_name | create_time 
-------------+---------+-----------+------------------------------- 
      500449 |  768810 |    2TY6P4 | 2021-08-05 15:59:32.294761+08 
(1 row) 
mydb=> SELECT * FROM user_ini ORDER BY random() LIMIT 1; 
          id | user_id | user_name | create_time 
-------------+---------+-----------+------------------------------- 
      324823 |  740720 |    07SKCU | 2021-08-05 15:59:29.913984+08 (1 row)

表user_ini数据量为100万，从100万随机取一条上述SQL的执行时间为367ms，这种方法进行了全表扫描和排序，效率非常低，当表数据量大时，性能几乎无法接受。

9.5版本以后PostgreSQL支持TABLESAMPLE数据抽样，语法： SELECT … FROM table_name TABLESAMPLE sampling_method ( argument [, …])[ REPEATABLE ( seed )]

6.2 bernoull抽样方式：

BERNOULLI抽样方式随机抽取表的数据行，并返回指定百分比数据，BERNOULLI抽样方式基于数据行级别，理论上被抽样表的每行记录被检索的概率是一样的，因此 BERNOULLI抽样方式抽取的数据相比SYSTEM抽样方式具有更好的随机性，但性能上相比SYSTEM抽样方式低很多。

BERNOULLI抽样方式，同样基于test_sample测试表。设置抽样方式为BERNOULLI，抽样因子为0.01，如下所示：

mydb=> EXPLAIN ANALYZE SELECT * FROM test_sample TABLESAMPLE BERNOULLI (0.01); 
                                QUERY PLAN 
-------------------------------------------------------------------------------- 
Sample Scan on test_sample (cost=0.00..14020.50 rows=150 width=45) (actual time=0.025..22.541 rows=152 loops=1) 
    Sampling: bernoulli ('0.01'::real) 
Planning time: 0.063 ms 
Execution time: 22.569 ms 
(4 rows)