org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
replication slot 名字重复
standby 连接数超过 max_wal_senders 配置的数量
连接器使用 CREATE PUBLICATION FOR ALL TABLES; 创建发布，但由于缺少权限而失败
Flink CDC 查询的时候报 read-only transaction 错误
查表的时候没有权限
FATAL: remaining connection slots are reserved for non-replication superuser connections
ERROR: replication slot “users” is active for PID 20795
ERROR: all replication slots are in use
在扫描表的快照期间无法执行 checkpoint
PostgreSQL 主节点 wal_level = replica，从节点 wal_level = logical

org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy

报错：

org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy

解决方法：

开启 checkpoint，每隔3秒做一次 checkpoint

-- Flink SQL                   
Flink SQL> SET execution.checkpointing.interval = 3s;

replication slot 名字重复

报错：

Causedby: org.postgresql.util.PSQLException: ERROR: replication slot "flink" already exists

解决方法：

查看 replication slot

postgres=# select * from pg_replication_slots;
 slot_name |  plugin  | slot_type |  datoid  | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 flink     | pgoutput | logical   | 15590180 | mydb     | f         | f      |            |      |     30228717 | B7/6FFD7BE8 | B7/6FFD7C20
 users     | pgoutput | logical   |  3554061 | account  | f         | t      |      20795 |      |     30402293 | B7/9E0296E0 | B7/9E029718
(2 rows)

通过 Connector 的 slot.name 参数配置 slot 名称，如果不设置默认名字为 flink

'slot.name' = 'test'

或者将名字为 flink 的 slot 删除即可

postgres=# select pg_drop_replication_slot('flink');
 pg_drop_replication_slot
--------------------------
(1 row)
postgres=# select * from pg_replication_slots;
 slot_name |  plugin  | slot_type |  datoid  | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 users     | pgoutput | logical   |  3554061 | account  | f         | t      |      20795 |      |     30402293 | B7/9E0296E0 | B7/9E029718
(1 rows)

standby 连接数超过 max_wal_senders 配置的数量

报错：

number of requested standby connections exceeds max_wal_senders (currently 8)

解决方法：

将 postgresql.conf 中的 max_wal_senders 改大

vim /pitrix/postgresql-10.4/data/postgresql.conf
max_wal_senders = 16

然后重启 postgresql

systemctl restart postgresql

连接器使用 CREATE PUBLICATION FOR ALL TABLES; 创建发布，但由于缺少权限而失败

报错：
```
must be superuser to create FOR ALL TABLES publication
```
在使用 pgoutput 插件时，建议将过滤 publication.autocreate.mode 配置为 filtered。如果使用 all_tables（publication.autocreate.mode 的默认值）并且未找到该发布，则连接器会尝试使用 CREATE PUBLICATION FOR ALL TABLES; 创建一个，但由于缺少权限而失败。
解决方法：

将 Connector 的 debezium.publication.autocreate.mode 参数配置为 filtered

'debezium.publication.autocreate.mode' = 'filtered'

Flink CDC 查询的时候报 read-only transaction 错误

报错：

cannot execute CREATE PUBLICATION in a read-only transaction

解决方法：

关闭 read-only transaction

# 关闭系统级别的只读模式，数据库不需要重启也永久生效。
postgres=# alter system set default_transaction_read_only=off;  
# 关闭 Session 级别的只读模式，退出 SQL 交互窗口后失效。
postgres=# set session default_transaction_read_only=off;
# 关闭指定登陆数据库的用户的只读模式，数据库不需要重启也永久生效。
postgres=# alter user debezium set default_transaction_read_only=off;
# 关闭指定数据库的只读模式
postgres=# alter database account set default_transaction_read_only = off;

然后重载配置

postgres=# select pg_reload_conf();  
 pg_reload_conf   
----------------  
 t  
(1 row)  
postgres=# show default_transaction_read_only;  
 default_transaction_read_only   
-------------------------------  
 on  
(1 row)

注意⚠️ 必须连接主库，否则一样会报错 cannot execute CREATE PUBLICATION in a read-only transaction

参考文档

查表的时候没有权限

报错：
```
Permission denied for relation
```
解决方法：

给用户授予对应表的权限

GRANT ALL PRIVILEGES ON TABLE public.users TO debezium;

参考文档

FATAL: remaining connection slots are reserved for non-replication superuser connections

报错：

FATAL: remaining connection slots are reserved for non-replication superuser connections

解决方法：

通过这个命令终止所有的空闲连接

# 显示系统最大连接数
show max_connections;
# 当前总共正在使用的连接数
select count(1) from pg_stat_activity;
# 显示系统保留的用户数
show superuser_reserved_connections;
# 终止所有的空闲连接
select pg_terminate_backend(pid) from pg_stat_activity;

或者将 postgresql.conf 中的 max_connections 改大一点，shared_buffers 的值也可能需要增加

vim /pitrix/postgresql-10.4/data/postgresql.conf
max_connections = 2048

参考文档

ERROR: replication slot “users” is active for PID 20795

报错：

ERROR: replication slot "users" is active for PID 20795

解决方法

查看 replication slot

postgres=# select * from pg_replication_slots;
 slot_name |  plugin  | slot_type |  datoid  | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 flink     | pgoutput | logical   | 15590180 | mydb     | f         | f      |            |      |     30228717 | B7/6FFD7BE8 | B7/6FFD7C20
 users     | pgoutput | logical   |  3554061 | account  | f         | t      |      20795 |      |     30402293 | B7/9E0296E0 | B7/9E029718
 test      | pgoutput | logical   | 20147811 | test     | f         | f      |            |      |     30248941 | B7/755BCFB0 | B7/755BCFE8
(3 rows)

将在使用名字为 users 的 slot 的程序停掉即可

ERROR: all replication slots are in use

Hint: Free one or increase max_replication_slots.

报错：

ERROR: all replication slots are in use
Hint: Free one or increase max_replication_slots.

解决方法：

释放 slot

postgres=# select * from pg_replication_slots;
 slot_name |  plugin  | slot_type |  datoid  | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 users     | pgoutput | logical   |  3554061 | account  | f         | t      |      20795 |      |     30402293 | B7/9E0296E0 | B7/9E029718
(1 rows)
postgres=# select pg_drop_replication_slot('flink');
 pg_drop_replication_slot
--------------------------
(1 row)

或者将 postgresql.conf 中的 max_replication_slots 改大一点

vim /pitrix/postgresql-10.4/data/postgresql.conf
max_replication_slots = 2048

在扫描表的快照期间无法执行 checkpoint

报错：

Exceeded checkpoint tolerable failure threshold

解决方法：

Postgres CDC 在扫描全表数据时，是一次性读取完，没有可用于恢复的位点，所以无法在全表扫描阶段去执行 Checkpoint。
如果不执行 Checkpoint，则 Postgres CDC 的源表会让执行中的 Checkpoint 一直等待，甚至到 Checkpoint 超时（如果表超级大，扫描耗时非常长）。超时的 Checkpoint 会被认为是失败的 Checkpoint。而在 Flink 默认配置下，失败的 Checkpoint 会引发 Flink 作业 Failover。

因此，建议在表超大时，为了避免因为 Checkpoint 超时而导致作业失败，需要配置如下作业参数

SET execution.checkpointing.interval = 10min;
SET execution.checkpointing.tolerable-failed-checkpoints = 100;
SET restart-strategy = fixed-delay;
SET restart-strategy.fixed-delay.attempts = 2147483647;

参数	说明
execution.checkpointing.interval	Checkpoint 的时间间隔。单位是 Duration 类型，例如 10min 或 30s。
execution.checkpointing.tolerable-failed-checkpoints	容忍 Checkpoint 失败的次数。该参数的取值与 Checkpoint 调度间隔时间的乘积就是允许的快照读取时间。如果表特别大，这个可以配置大一些。
restart-strategy	重启策略，参数取值如下： - fixed-delay：固定延迟重启策略。 - failure-rate：故障率重启策略。 - exponential-delay：指数延迟重启策略。

详情请参见 Restart Strategies。 | | restart-strategy.fixed-delay.attempts | 固定延迟重启策略下，尝试重启的最大次数。 |

PostgreSQL 主节点 wal_level = replica，从节点 wal_level = logical

报错：

Flink CDC 连接主节点报错

org.postgresql.util.PSQLException: ERROR: logical decoding requires wal_level >= logical

Flink CDC 连接从节点报错

org.postgresql.util.PSQLException: ERROR: logical decoding cannot be used while in recovery

主节点查看从节点情况，此时从节点正在异步同步

select * from pg_stat_replication ;
  pid  | usesysid | usename  | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_la
g | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+----------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+---------
--+-----------+------------+---------------+------------
 16342 |       10 | postgres | walreceiver      | 172.31.60.108 |                 |       54272 | 2022-07-13 14:22:59.320299+08 |              | streaming | B9/AF8C88 | B9/AF8C88 | B9/AF8C88 | B9/AF8C88  |
  |           |            |             0 | async
(1 row)

解决方案：
主节点修改为 wal_level = logical

vim /pitrix/postgresql-10.4/data/postgresql.conf
wal_level = logical