org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy

  • 报错:

    1. org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
  • 解决方法:

开启 checkpoint,每隔3秒做一次 checkpoint

  1. -- Flink SQL
  2. Flink SQL> SET execution.checkpointing.interval = 3s;

replication slot 名字重复

  • 报错:

    1. Causedby: org.postgresql.util.PSQLException: ERROR: replication slot "flink" already exists
  • 解决方法:

查看 replication slot

  1. postgres=# select * from pg_replication_slots;
  2. slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
  3. -----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
  4. flink | pgoutput | logical | 15590180 | mydb | f | f | | | 30228717 | B7/6FFD7BE8 | B7/6FFD7C20
  5. users | pgoutput | logical | 3554061 | account | f | t | 20795 | | 30402293 | B7/9E0296E0 | B7/9E029718
  6. (2 rows)

通过 Connector 的 slot.name 参数配置 slot 名称,如果不设置默认名字为 flink

  1. 'slot.name' = 'test'

或者将名字为 flink 的 slot 删除即可

  1. postgres=# select pg_drop_replication_slot('flink');
  2. pg_drop_replication_slot
  3. --------------------------
  4. (1 row)
  5. postgres=# select * from pg_replication_slots;
  6. slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
  7. -----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
  8. users | pgoutput | logical | 3554061 | account | f | t | 20795 | | 30402293 | B7/9E0296E0 | B7/9E029718
  9. (1 rows)

standby 连接数超过 max_wal_senders 配置的数量

  • 报错:

    1. number of requested standby connections exceeds max_wal_senders (currently 8)
  • 解决方法:

将 postgresql.conf 中的 max_wal_senders 改大

  1. vim /pitrix/postgresql-10.4/data/postgresql.conf
  2. max_wal_senders = 16

然后重启 postgresql

  1. systemctl restart postgresql

连接器使用 CREATE PUBLICATION FOR ALL TABLES; 创建发布,但由于缺少权限而失败

  • 报错:

    1. must be superuser to create FOR ALL TABLES publication

    在使用 pgoutput 插件时,建议将过滤 publication.autocreate.mode 配置为 filtered。 如果使用 all_tables(publication.autocreate.mode 的默认值)并且未找到该发布,则连接器会尝试使用 CREATE PUBLICATION FOR ALL TABLES; 创建一个,但由于缺少权限而失败。

  • 解决方法:

将 Connector 的 debezium.publication.autocreate.mode 参数配置为 filtered

  1. 'debezium.publication.autocreate.mode' = 'filtered'

Flink CDC 查询的时候报 read-only transaction 错误

  • 报错:

    1. cannot execute CREATE PUBLICATION in a read-only transaction
  • 解决方法:

关闭 read-only transaction

  1. # 关闭系统级别的只读模式,数据库不需要重启也永久生效。
  2. postgres=# alter system set default_transaction_read_only=off;
  3. # 关闭 Session 级别的只读模式,退出 SQL 交互窗口后失效。
  4. postgres=# set session default_transaction_read_only=off;
  5. # 关闭指定登陆数据库的用户的只读模式,数据库不需要重启也永久生效。
  6. postgres=# alter user debezium set default_transaction_read_only=off;
  7. # 关闭指定数据库的只读模式
  8. postgres=# alter database account set default_transaction_read_only = off;

然后重载配置

  1. postgres=# select pg_reload_conf();
  2. pg_reload_conf
  3. ----------------
  4. t
  5. (1 row)
  6. postgres=# show default_transaction_read_only;
  7. default_transaction_read_only
  8. -------------------------------
  9. on
  10. (1 row)

注意⚠️ 必须连接主库,否则一样会报错 cannot execute CREATE PUBLICATION in a read-only transaction

参考文档

查表的时候没有权限

  • 报错:

    1. Permission denied for relation
  • 解决方法:

给用户授予对应表的权限

  1. GRANT ALL PRIVILEGES ON TABLE public.users TO debezium;

参考文档

FATAL: remaining connection slots are reserved for non-replication superuser connections

  • 报错:

    1. FATAL: remaining connection slots are reserved for non-replication superuser connections
  • 解决方法:

通过这个命令终止所有的空闲连接

  1. # 显示系统最大连接数
  2. show max_connections;
  3. # 当前总共正在使用的连接数
  4. select count(1) from pg_stat_activity;
  5. # 显示系统保留的用户数
  6. show superuser_reserved_connections;
  7. # 终止所有的空闲连接
  8. select pg_terminate_backend(pid) from pg_stat_activity;

或者将 postgresql.conf 中的 max_connections 改大一点,shared_buffers 的值也可能需要增加

  1. vim /pitrix/postgresql-10.4/data/postgresql.conf
  2. max_connections = 2048

参考文档

ERROR: replication slot “users” is active for PID 20795

  • 报错:

    1. ERROR: replication slot "users" is active for PID 20795
  • 解决方法

查看 replication slot

  1. postgres=# select * from pg_replication_slots;
  2. slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
  3. -----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
  4. flink | pgoutput | logical | 15590180 | mydb | f | f | | | 30228717 | B7/6FFD7BE8 | B7/6FFD7C20
  5. users | pgoutput | logical | 3554061 | account | f | t | 20795 | | 30402293 | B7/9E0296E0 | B7/9E029718
  6. test | pgoutput | logical | 20147811 | test | f | f | | | 30248941 | B7/755BCFB0 | B7/755BCFE8
  7. (3 rows)

将在使用名字为 users 的 slot 的程序停掉即可

ERROR: all replication slots are in use

Hint: Free one or increase max_replication_slots.

  • 报错:

    1. ERROR: all replication slots are in use
    2. Hint: Free one or increase max_replication_slots.
  • 解决方法:

释放 slot

  1. postgres=# select * from pg_replication_slots;
  2. slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
  3. -----------+----------+-----------+----------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
  4. users | pgoutput | logical | 3554061 | account | f | t | 20795 | | 30402293 | B7/9E0296E0 | B7/9E029718
  5. (1 rows)
  6. postgres=# select pg_drop_replication_slot('flink');
  7. pg_drop_replication_slot
  8. --------------------------
  9. (1 row)

或者将 postgresql.conf 中的 max_replication_slots 改大一点

  1. vim /pitrix/postgresql-10.4/data/postgresql.conf
  2. max_replication_slots = 2048

在扫描表的快照期间无法执行 checkpoint

  • 报错:

    1. Exceeded checkpoint tolerable failure threshold
  • 解决方法:

Postgres CDC 在扫描全表数据时,是一次性读取完,没有可用于恢复的位点,所以无法在全表扫描阶段去执行 Checkpoint。
如果不执行 Checkpoint,则 Postgres CDC 的源表会让执行中的 Checkpoint 一直等待,甚至到 Checkpoint 超时(如果表超级大,扫描耗时非常长)。超时的 Checkpoint 会被认为是失败的 Checkpoint。而在 Flink 默认配置下,失败的 Checkpoint 会引发 Flink 作业 Failover。

因此,建议在表超大时,为了避免因为 Checkpoint 超时而导致作业失败,需要配置如下作业参数

  1. SET execution.checkpointing.interval = 10min;
  2. SET execution.checkpointing.tolerable-failed-checkpoints = 100;
  3. SET restart-strategy = fixed-delay;
  4. SET restart-strategy.fixed-delay.attempts = 2147483647;
参数 说明
execution.checkpointing.interval Checkpoint 的时间间隔。
单位是 Duration 类型,例如 10min 或 30s。
execution.checkpointing.tolerable-failed-checkpoints 容忍 Checkpoint 失败的次数。
该参数的取值与 Checkpoint 调度间隔时间的乘积就是允许的快照读取时间。如果表特别大,这个可以配置大一些。
restart-strategy 重启策略,参数取值如下:
- fixed-delay:固定延迟重启策略。
- failure-rate:故障率重启策略。
- exponential-delay:指数延迟重启策略。

详情请参见 Restart Strategies。 | | restart-strategy.fixed-delay.attempts | 固定延迟重启策略下,尝试重启的最大次数。 |

PostgreSQL 主节点 wal_level = replica,从节点 wal_level = logical

  • 报错:

    • Flink CDC 连接主节点报错

      1. org.postgresql.util.PSQLException: ERROR: logical decoding requires wal_level >= logical
    • Flink CDC 连接从节点报错

      1. org.postgresql.util.PSQLException: ERROR: logical decoding cannot be used while in recovery

主节点查看从节点情况,此时从节点正在异步同步

  1. select * from pg_stat_replication ;
  2. pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_la
  3. g | flush_lag | replay_lag | sync_priority | sync_state
  4. -------+----------+----------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+---------
  5. --+-----------+------------+---------------+------------
  6. 16342 | 10 | postgres | walreceiver | 172.31.60.108 | | 54272 | 2022-07-13 14:22:59.320299+08 | | streaming | B9/AF8C88 | B9/AF8C88 | B9/AF8C88 | B9/AF8C88 |
  7. | | | 0 | async
  8. (1 row)

解决方案:
主节点修改为 wal_level = logical

  1. vim /pitrix/postgresql-10.4/data/postgresql.conf
  2. wal_level = logical
  • Flink CDC 连接从节点报错
    1. org.postgresql.util.PSQLException: ERROR: logical decoding cannot be used while in recovery

PostgreSQL CDC只支持连接主库