- org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
- replication slot 名字重复
- standby 连接数超过 max_wal_senders 配置的数量
- 连接器使用 CREATE PUBLICATION
FOR ALL TABLES; 创建发布,但由于缺少权限而失败 - Flink CDC 查询的时候报 read-only transaction 错误
- 查表的时候没有权限
- FATAL: remaining connection slots are reserved for non-replication superuser connections
- ERROR: replication slot “users” is active for PID 20795
- ERROR: all replication slots are in use
- 在扫描表的快照期间无法执行 checkpoint
- PostgreSQL 主节点 wal_level = replica,从节点 wal_level = logical
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
报错:
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
解决方法:
开启 checkpoint,每隔3秒做一次 checkpoint
-- Flink SQL
Flink SQL> SET execution.checkpointing.interval = 3s;
replication slot 名字重复
报错:
Causedby: org.postgresql.util.PSQLException: ERROR: replication slot "flink" already exists
解决方法:
查看 replication slot
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
flink | pgoutput | logical | 15590180 | mydb | f | f | | | 30228717 | B7/6FFD7BE8 | B7/6FFD7C20
users | pgoutput | logical | 3554061 | account | f | t | 20795 | | 30402293 | B7/9E0296E0 | B7/9E029718
(2 rows)
通过 Connector 的 slot.name 参数配置 slot 名称,如果不设置默认名字为 flink
'slot.name' = 'test'
或者将名字为 flink 的 slot 删除即可
postgres=# select pg_drop_replication_slot('flink');
pg_drop_replication_slot
--------------------------
(1 row)
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
users | pgoutput | logical | 3554061 | account | f | t | 20795 | | 30402293 | B7/9E0296E0 | B7/9E029718
(1 rows)
standby 连接数超过 max_wal_senders 配置的数量
报错:
number of requested standby connections exceeds max_wal_senders (currently 8)
解决方法:
将 postgresql.conf 中的 max_wal_senders 改大
vim /pitrix/postgresql-10.4/data/postgresql.conf
max_wal_senders = 16
然后重启 postgresql
systemctl restart postgresql
连接器使用 CREATE PUBLICATION FOR ALL TABLES; 创建发布,但由于缺少权限而失败
报错:
must be superuser to create FOR ALL TABLES publication
在使用 pgoutput 插件时,建议将过滤 publication.autocreate.mode 配置为 filtered。 如果使用 all_tables(publication.autocreate.mode 的默认值)并且未找到该发布,则连接器会尝试使用 CREATE PUBLICATION
FOR ALL TABLES; 创建一个,但由于缺少权限而失败。 解决方法:
将 Connector 的 debezium.publication.autocreate.mode 参数配置为 filtered
'debezium.publication.autocreate.mode' = 'filtered'
Flink CDC 查询的时候报 read-only transaction 错误
报错:
cannot execute CREATE PUBLICATION in a read-only transaction
解决方法:
关闭 read-only transaction
# 关闭系统级别的只读模式,数据库不需要重启也永久生效。
postgres=# alter system set default_transaction_read_only=off;
# 关闭 Session 级别的只读模式,退出 SQL 交互窗口后失效。
postgres=# set session default_transaction_read_only=off;
# 关闭指定登陆数据库的用户的只读模式,数据库不需要重启也永久生效。
postgres=# alter user debezium set default_transaction_read_only=off;
# 关闭指定数据库的只读模式
postgres=# alter database account set default_transaction_read_only = off;
然后重载配置
postgres=# select pg_reload_conf();
pg_reload_conf
----------------
t
(1 row)
postgres=# show default_transaction_read_only;
default_transaction_read_only
-------------------------------
on
(1 row)
注意⚠️ 必须连接主库,否则一样会报错 cannot execute CREATE PUBLICATION in a read-only transaction
参考文档
查表的时候没有权限
报错:
Permission denied for relation
解决方法:
给用户授予对应表的权限
GRANT ALL PRIVILEGES ON TABLE public.users TO debezium;
参考文档
FATAL: remaining connection slots are reserved for non-replication superuser connections
报错:
FATAL: remaining connection slots are reserved for non-replication superuser connections
解决方法:
通过这个命令终止所有的空闲连接
# 显示系统最大连接数
show max_connections;
# 当前总共正在使用的连接数
select count(1) from pg_stat_activity;
# 显示系统保留的用户数
show superuser_reserved_connections;
# 终止所有的空闲连接
select pg_terminate_backend(pid) from pg_stat_activity;
或者将 postgresql.conf 中的 max_connections 改大一点,shared_buffers 的值也可能需要增加
vim /pitrix/postgresql-10.4/data/postgresql.conf
max_connections = 2048
参考文档
ERROR: replication slot “users” is active for PID 20795
报错:
ERROR: replication slot "users" is active for PID 20795
解决方法
查看 replication slot
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
flink | pgoutput | logical | 15590180 | mydb | f | f | | | 30228717 | B7/6FFD7BE8 | B7/6FFD7C20
users | pgoutput | logical | 3554061 | account | f | t | 20795 | | 30402293 | B7/9E0296E0 | B7/9E029718
test | pgoutput | logical | 20147811 | test | f | f | | | 30248941 | B7/755BCFB0 | B7/755BCFE8
(3 rows)
将在使用名字为 users 的 slot 的程序停掉即可
ERROR: all replication slots are in use
Hint: Free one or increase max_replication_slots.
报错:
ERROR: all replication slots are in use
Hint: Free one or increase max_replication_slots.
解决方法:
释放 slot
postgres=# select * from pg_replication_slots;
slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
users | pgoutput | logical | 3554061 | account | f | t | 20795 | | 30402293 | B7/9E0296E0 | B7/9E029718
(1 rows)
postgres=# select pg_drop_replication_slot('flink');
pg_drop_replication_slot
--------------------------
(1 row)
或者将 postgresql.conf 中的 max_replication_slots 改大一点
vim /pitrix/postgresql-10.4/data/postgresql.conf
max_replication_slots = 2048
在扫描表的快照期间无法执行 checkpoint
报错:
Exceeded checkpoint tolerable failure threshold
解决方法:
Postgres CDC 在扫描全表数据时,是一次性读取完,没有可用于恢复的位点,所以无法在全表扫描阶段去执行 Checkpoint。
如果不执行 Checkpoint,则 Postgres CDC 的源表会让执行中的 Checkpoint 一直等待,甚至到 Checkpoint 超时(如果表超级大,扫描耗时非常长)。超时的 Checkpoint 会被认为是失败的 Checkpoint。而在 Flink 默认配置下,失败的 Checkpoint 会引发 Flink 作业 Failover。
因此,建议在表超大时,为了避免因为 Checkpoint 超时而导致作业失败,需要配置如下作业参数
SET execution.checkpointing.interval = 10min;
SET execution.checkpointing.tolerable-failed-checkpoints = 100;
SET restart-strategy = fixed-delay;
SET restart-strategy.fixed-delay.attempts = 2147483647;
参数 | 说明 |
---|---|
execution.checkpointing.interval | Checkpoint 的时间间隔。 单位是 Duration 类型,例如 10min 或 30s。 |
execution.checkpointing.tolerable-failed-checkpoints | 容忍 Checkpoint 失败的次数。 该参数的取值与 Checkpoint 调度间隔时间的乘积就是允许的快照读取时间。如果表特别大,这个可以配置大一些。 |
restart-strategy | 重启策略,参数取值如下: - fixed-delay:固定延迟重启策略。 - failure-rate:故障率重启策略。 - exponential-delay:指数延迟重启策略。 |
详情请参见 Restart Strategies。 | | restart-strategy.fixed-delay.attempts | 固定延迟重启策略下,尝试重启的最大次数。 |
PostgreSQL 主节点 wal_level = replica,从节点 wal_level = logical
报错:
Flink CDC 连接主节点报错
org.postgresql.util.PSQLException: ERROR: logical decoding requires wal_level >= logical
Flink CDC 连接从节点报错
org.postgresql.util.PSQLException: ERROR: logical decoding cannot be used while in recovery
主节点查看从节点情况,此时从节点正在异步同步
select * from pg_stat_replication ;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_la
g | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+----------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+---------
--+-----------+------------+---------------+------------
16342 | 10 | postgres | walreceiver | 172.31.60.108 | | 54272 | 2022-07-13 14:22:59.320299+08 | | streaming | B9/AF8C88 | B9/AF8C88 | B9/AF8C88 | B9/AF8C88 |
| | | 0 | async
(1 row)
解决方案:
主节点修改为 wal_level = logical
vim /pitrix/postgresql-10.4/data/postgresql.conf
wal_level = logical
- Flink CDC 连接从节点报错
org.postgresql.util.PSQLException: ERROR: logical decoding cannot be used while in recovery
PostgreSQL CDC只支持连接主库