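ZooKeeper (Source Code Analysis)

Chapter 1: Algorithm Fundamentals

Question: how does ZooKeeper guarantee data consistency? This is a hard problem for distributed system frameworks in general.

1.1 The Byzantine Generals Problem

ZooKeeper source code analysis - Figure 1

1.2 The Paxos Algorithm

ZooKeeper source code analysis - Figure 2

Description of the Paxos algorithm:

ZooKeeper source code analysis - Figure 3

ZooKeeper source code analysis - Figure 4

Below we walk through three scenarios based on the description above. To keep the flow simple, no Learner is configured.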

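Case 1:

ZooKeeper source code analysis - Figure 5

Case 2:

ZooKeeper source code analysis - Figure 6

Drawback of Paxos: under complex network conditions, a distributed system that uses Paxos may take a long time to converge and may even fall into livelock.

Case 3:

ZooKeeper source code analysis - Figure 7

This situation arises because the system has more than one Proposer: several Proposers compete for the Acceptors, so agreement keeps being delayed. To address this, an improved variant of Paxos was proposed: elect one node in the system as the Leader and allow only the Leader to issue proposals. A Paxos round then has a single Proposer, livelock cannot occur, and only Case 1 from the examples above can arise.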
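The contention between Proposers can be reproduced with a minimal single-decree Paxos acceptor. This is only an illustrative sketch: the class `PaxosAcceptorSketch` and its methods are hypothetical names, not ZooKeeper code, and the sketch models just the acceptor's promise/accept rule (a full acceptor would also return its previously accepted proposal in the prepare response).

```java
// Minimal single-decree Paxos acceptor (hypothetical sketch, not ZooKeeper code).
final class PaxosAcceptorSketch {
    private long promisedN = -1;   // highest proposal number promised so far
    private long acceptedN = -1;   // number of the last accepted proposal, -1 if none
    private String acceptedValue;  // value of the last accepted proposal, null if none

    // Phase 1 (prepare): promise only if n is higher than every number seen so far.
    synchronized boolean prepare(long n) {
        if (n > promisedN) {
            promisedN = n;
            return true;
        }
        return false;
    }

    // Phase 2 (accept): accept unless a higher-numbered prepare was promised meanwhile.
    synchronized boolean accept(long n, String value) {
        if (n >= promisedN) {
            promisedN = n;
            acceptedN = n;
            acceptedValue = value;
            return true;
        }
        return false;
    }

    synchronized long acceptedN() {
        return acceptedN;
    }

    synchronized String acceptedValue() {
        return acceptedValue;
    }

    public static void main(String[] args) {
        PaxosAcceptorSketch acceptor = new PaxosAcceptorSketch();
        // Two proposers keep pre-empting each other: the pattern behind Case 3.
        System.out.println(acceptor.prepare(1));      // proposer A prepares
        System.out.println(acceptor.prepare(2));      // proposer B pre-empts A
        System.out.println(acceptor.accept(1, "A"));  // A's accept is now rejected
        System.out.println(acceptor.prepare(3));      // A retries with a higher number
        System.out.println(acceptor.accept(2, "B"));  // B's accept is rejected in turn
    }
}
```

With a single Leader acting as the only Proposer, the pre-empting `prepare` calls never happen, so the accept that follows a successful prepare is not rejected by a competing proposal.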

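1.3 The ZAB Protocol

1.3.1 What Is ZAB

Zab borrows from Paxos. It is an atomic broadcast protocol with crash recovery support, designed specifically for ZooKeeper. Based on this protocol, ZooKeeper lets a single server (the Leader) handle external write transaction requests; the Leader then synchronizes the data to the Follower nodes. In other words, only one Leader in ZooKeeper can issue proposals.

1.3.2 Contents of the Zab Protocol

The Zab protocol has two basic modes: message broadcast and crash recovery.

Message broadcast:

ZooKeeper source code analysis - Figure 8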
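The figure itself is not reproduced in this text. As a rough sketch of the majority rule that Zab-style broadcast relies on (the Leader commits a proposal once more than half of the servers have acknowledged it), the following tracks ACKs for one proposal. The class `ProposalAckSketch` and its method are hypothetical and do not correspond to ZooKeeper's own classes.

```java
import java.util.HashSet;
import java.util.Set;

// Tracks ACKs for a single proposal (hypothetical sketch, not ZooKeeper code).
final class ProposalAckSketch {
    private final int ensembleSize;                          // total number of voting servers
    private final Set<Long> ackedServerIds = new HashSet<>(); // servers that have acknowledged

    ProposalAckSketch(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    // Records an ACK from a server; a repeated ACK from the same server counts once.
    // Returns true once a strict majority of the ensemble has acknowledged.
    boolean ack(long serverId) {
        ackedServerIds.add(serverId);
        return ackedServerIds.size() > ensembleSize / 2;
    }
}
```

In a five-server ensemble the third distinct ACK is the one that makes the proposal committable, which is why the cluster keeps working while a minority of servers is down.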

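Crash recovery: failure assumptions

ZooKeeper source code analysis - Figure 9

Crash recovery: Leader election

ZooKeeper source code analysis - Figure 10

Crash recovery: data recovery

ZooKeeper source code analysis - Figure 11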

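1.4 CAP

The CAP theorem states that a distributed system cannot satisfy all three of the following properties at the same time:

⚫ Consistency (C)

⚫ Availability (A)

⚫ Partition tolerance (P)

At most two of these three requirements can be met at once. Because P is mandatory in a distributed system, the choice usually comes down to CP or AP.

1) Consistency (C)

In a distributed environment, consistency refers to whether data stays consistent across multiple replicas. Under the consistency requirement, after a system in a consistent state performs an update, its data must still be in a consistent state.

2) Availability (A)

Availability means that the service a system provides must always be usable: every user request must return a result within a bounded time.

3) Partition tolerance (P)

When a network partition occurs, a distributed system must still be able to provide a service that satisfies consistency and availability, unless the entire network has failed.

ZooKeeper guarantees CP

(1) ZooKeeper cannot guarantee the availability of every service request. (Note: under extreme conditions ZooKeeper may drop some requests, and the client program has to retry to get a result.) In that sense ZooKeeper does not guarantee service availability.

(2) The whole cluster is unavailable while a Leader election is in progress.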

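Chapter 2: Source Code Walkthrough

2.1 Supporting Source Code

2.1.1 Persistence Source Code

The Leader and the Followers each keep one copy of the data in memory and one on disk, so the in-memory data has to be persisted to disk.

The classes in the org.apache.zookeeper.server.persistence package contain the persistence-related code.

ZooKeeper source code analysis - Figure 12

1) Snapshots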

    public interface SnapShot {
        // Deserialize a snapshot into the DataTree and the session map
        long deserialize(DataTree dt, Map<Long, Integer> sessions)
            throws IOException;
        // Serialize the DataTree and the session map into a snapshot file
        void serialize(DataTree dt, Map<Long, Integer> sessions,
                       File name)
            throws IOException;
        /**
         * find the most recent snapshot file
         */
        File findMostRecentSnapshot() throws IOException;
        // Release resources
        void close() throws IOException;
    }

2) Transaction log

    public interface TxnLog {
        // Set the server stats
        void setServerStats(ServerStats serverStats);
        // Roll over to a new log file
        void rollLog() throws IOException;
        // Append a transaction
        boolean append(TxnHeader hdr, Record r) throws IOException;
        // Read transactions starting from the given zxid
        TxnIterator read(long zxid) throws IOException;
        // Get the last logged zxid
        long getLastLoggedZxid() throws IOException;
        // Truncate the log to the given zxid
        boolean truncate(long zxid) throws IOException;
        // Get the DbId
        long getDbId() throws IOException;
        // Commit
        void commit() throws IOException;
        // Elapsed time of the log sync
        long getTxnLogSyncElapsedTime();
        // Close the log
        void close() throws IOException;
        // Iterator interface for reading the log
        public interface TxnIterator {
            // Get the transaction header
            TxnHeader getHeader();
            // Get the transaction record
            Record getTxn();
            // Advance to the next record
            boolean next() throws IOException;
            // Release resources
            void close() throws IOException;
            // Get the storage size
            long getStorageSize() throws IOException;
        }
    }
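To make the roles of `append`, `read`, `getLastLoggedZxid` and `truncate` concrete, here is a small in-memory stand-in. It is a sketch under stated assumptions, not ZooKeeper's file-based implementation: the class `InMemoryTxnLogSketch` is hypothetical, transactions are plain strings, `read` is assumed to return entries at or after the given zxid, and `truncate` is assumed to drop entries after the given zxid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory stand-in for a transaction log keyed by zxid (hypothetical sketch).
final class InMemoryTxnLogSketch {
    private final TreeMap<Long, String> entries = new TreeMap<>();

    // Appends a transaction; rejects a zxid that is not greater than the last one logged.
    boolean append(long zxid, String txn) {
        if (!entries.isEmpty() && zxid <= entries.lastKey()) {
            return false;
        }
        entries.put(zxid, txn);
        return true;
    }

    // Returns, in zxid order, all transactions whose zxid is >= the given zxid.
    List<String> read(long zxid) {
        return new ArrayList<>(entries.tailMap(zxid, true).values());
    }

    // Returns the highest zxid logged so far, or -1 if the log is empty.
    long getLastLoggedZxid() {
        return entries.isEmpty() ? -1L : entries.lastKey();
    }

    // Drops every transaction whose zxid is greater than the given zxid.
    boolean truncate(long zxid) {
        entries.tailMap(zxid, false).clear();
        return true;
    }
}
```

Keeping the entries ordered by zxid is what lets a lagging server ask for "everything from zxid N onwards", or discard a suffix of its log that the rest of the cluster never committed.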

3) Core class for persistence

ZooKeeper source code analysis - Figure 13

2.1.2 Serialization Source Code

The zookeeper-jute module contains ZooKeeper's serialization code.

ZooKeeper source code analysis - Figure 14

1) Serialization and deserialization methods

    public interface Record {
        // Serialization method
        public void serialize(OutputArchive archive, String tag)
            throws IOException;
        // Deserialization method
        public void deserialize(InputArchive archive, String tag)
            throws IOException;
    }

2) Iteration

    public interface Index {
        // Whether the iteration is finished
        public boolean done();
        // Advance to the next element
        public void incr();
    }

3) Data types supported for serialization

    /**
     * Interface that all the serializers have to implement.
     *
     */
    public interface OutputArchive {
        public void writeByte(byte b, String tag) throws IOException;
        public void writeBool(boolean b, String tag) throws IOException;
        public void writeInt(int i, String tag) throws IOException;
        public void writeLong(long l, String tag) throws IOException;
        public void writeFloat(float f, String tag) throws IOException;
        public void writeDouble(double d, String tag) throws IOException;
        public void writeString(String s, String tag) throws IOException;
        public void writeBuffer(byte buf[], String tag) throws IOException;
        public void writeRecord(Record r, String tag) throws IOException;
        public void startRecord(Record r, String tag) throws IOException;
        public void endRecord(Record r, String tag) throws IOException;
        public void startVector(List<?> v, String tag) throws IOException;
        public void endVector(List<?> v, String tag) throws IOException;
        public void startMap(TreeMap<?,?> v, String tag) throws IOException;
        public void endMap(TreeMap<?,?> v, String tag) throws IOException;
    }

4) Data types supported for deserialization

    /**
     * Interface that all the Deserializers have to implement.
     *
     */
    public interface InputArchive {
        public byte readByte(String tag) throws IOException;
        public boolean readBool(String tag) throws IOException;
        public int readInt(String tag) throws IOException;
        public long readLong(String tag) throws IOException;
        public float readFloat(String tag) throws IOException;
        public double readDouble(String tag) throws IOException;
        public String readString(String tag) throws IOException;
        public byte[] readBuffer(String tag) throws IOException;
        public void readRecord(Record r, String tag) throws IOException;
        public void startRecord(String tag) throws IOException;
        public void endRecord(String tag) throws IOException;
        public Index startVector(String tag) throws IOException;
        public void endVector(String tag) throws IOException;
        public Index startMap(String tag) throws IOException;
        public void endMap(String tag) throws IOException;
    }
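The pairing of a write method with a matching read method can be illustrated with a round trip built only on the JDK's `DataOutputStream` and `DataInputStream`. This is a sketch, not the jute API: `ArchiveRoundTripSketch` and `SessionRecord` are hypothetical names, and the length-prefixed string encoding is an assumption chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Archive-style round trip using only the JDK (hypothetical sketch, not the jute API).
final class ArchiveRoundTripSketch {
    // A hypothetical record with one long field and one string field.
    static final class SessionRecord {
        long sessionId;
        String owner;

        // Fields are written in a fixed order...
        void serialize(DataOutputStream out) throws IOException {
            out.writeLong(sessionId);
            byte[] bytes = owner.getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);  // length prefix, then the raw bytes
            out.write(bytes);
        }

        // ...and must be read back in exactly the same order.
        void deserialize(DataInputStream in) throws IOException {
            sessionId = in.readLong();
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            owner = new String(bytes, StandardCharsets.UTF_8);
        }
    }

    // Serializes a record into a byte array.
    static byte[] toBytes(SessionRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.serialize(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuilds a record from a byte array produced by toBytes.
    static SessionRecord fromBytes(byte[] data) {
        SessionRecord record = new SessionRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            record.deserialize(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }
}
```

The same symmetry shows up in the two interfaces above: each `writeXxx` method in OutputArchive has a `readXxx` counterpart in InputArchive, and a record's `deserialize` has to consume fields in the order its `serialize` produced them.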

2.2 ZK Server Initialization Source Code

ZooKeeper source code analysis - Figure 15

2.2.1 ZK Server Startup Script

1) The ZooKeeper service is started with the command zkServer.sh start

zkServer.sh

    #!/usr/bin/env bash
    # use POSIX interface, symlink is followed automatically
    ZOOBIN="${BASH_SOURCE-$0}"
    ZOOBIN="$(dirname "${ZOOBIN}")"
    ZOOBINDIR="$(cd "${ZOOBIN}"; pwd)"
    if [ -e "$ZOOBIN/../libexec/zkEnv.sh" ]; then
      . "$ZOOBINDIR"/../libexec/zkEnv.sh
    else
      # Source zkEnv.sh to pick up its environment variables (ZOOCFG="zoo.cfg")
      . "$ZOOBINDIR"/zkEnv.sh
    fi
    # See the following page for extensive details on setting
    # up the JVM to accept JMX remote management:
    # http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
    # by default we allow local JMX connections
    if [ "x$JMXLOCALONLY" = "x" ]
    then
      JMXLOCALONLY=false
    fi
    if [ "x$JMXDISABLE" = "x" ] || [ "$JMXDISABLE" = 'false' ]
    then
      echo "ZooKeeper JMX enabled by default" >&2
      if [ "x$JMXPORT" = "x" ]
      then
        # for some reason these two options are necessary on jdk6 on Ubuntu
        # accord to the docs they are not necessary, but otw jconsole cannot
        # do a local attach
        ZOOMAIN="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=$JMXLOCALONLY org.apache.zookeeper.server.quorum.QuorumPeerMain"
      else
        if [ "x$JMXAUTH" = "x" ]
        then
          JMXAUTH=false
        fi
        if [ "x$JMXSSL" = "x" ]
        then
          JMXSSL=false
        fi
        if [ "x$JMXLOG4J" = "x" ]
        then
          JMXLOG4J=true
        fi
        echo "ZooKeeper remote JMX Port set to $JMXPORT" >&2
        echo "ZooKeeper remote JMX authenticate set to $JMXAUTH" >&2
        echo "ZooKeeper remote JMX ssl set to $JMXSSL" >&2
        echo "ZooKeeper remote JMX log4j set to $JMXLOG4J" >&2
        ZOOMAIN="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=$JMXPORT -Dcom.sun.management.jmxremote.authenticate=$JMXAUTH -Dcom.sun.management.jmxremote.ssl=$JMXSSL -Dzookeeper.jmx.log4j.disable=$JMXLOG4J org.apache.zookeeper.server.quorum.QuorumPeerMain"
      fi
    else
      echo "JMX disabled by user request" >&2
      ZOOMAIN="org.apache.zookeeper.server.quorum.QuorumPeerMain"
    fi
    if [ "x$SERVER_JVMFLAGS" != "x" ]
    then
      JVMFLAGS="$SERVER_JVMFLAGS $JVMFLAGS"
    fi
    case $1 in
    start)
      echo -n "Starting zookeeper ... "
      if [ -f "$ZOOPIDFILE" ]; then
        if kill -0 `cat "$ZOOPIDFILE"` > /dev/null 2>&1; then
          echo $command already running as process `cat "$ZOOPIDFILE"`
          exit 1
        fi
      fi
      nohup "$JAVA" $ZOO_DATADIR_AUTOCREATE "-Dzookeeper.log.dir=${ZOO_LOG_DIR}" \
        "-Dzookeeper.log.file=${ZOO_LOG_FILE}" "-Dzookeeper.root.logger=${ZOO_LOG4J_PROP}" \
        -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError='kill -9 %p' \
        -cp "$CLASSPATH" $JVMFLAGS $ZOOMAIN "$ZOOCFG" > "$_ZOO_DAEMON_OUT" 2>&1 < /dev/null &
      ;;
    stop)
      echo -n "Stopping zookeeper ... "
      if [ ! -f "$ZOOPIDFILE" ]
      then
        echo "no zookeeper to stop (could not find file $ZOOPIDFILE)"
      else
        $KILL $(cat "$ZOOPIDFILE")
        rm "$ZOOPIDFILE"
        sleep 1
        echo STOPPED
      fi
      exit 0
      ;;
    restart)
      shift
      "$0" stop ${@}
      sleep 3
      "$0" start ${@}
      ;;
    status)
      ;;
    *)
      echo "Usage: $0 [--config <conf-dir>] {start|start-foreground|stop|restart|status|printcmd}" >&2
    esac

2) What zkServer.sh start actually executes

    nohup "$JAVA"
    + a series of JVM arguments
    + $ZOOMAIN (org.apache.zookeeper.server.quorum.QuorumPeerMain)
    + "$ZOOCFG" (zkEnv.sh sets ZOOCFG="zoo.cfg")

3) The program entry point is therefore the QuorumPeerMain.java class

2.2.2 ZK Server Startup Entry Point

1) Press Ctrl + N and search for QuorumPeerMain

QuorumPeerMain.java

    public static void main(String[] args) {
        // Create a zk node
        QuorumPeerMain main = new QuorumPeerMain();
        try {
            // Initialize the node and run it; args corresponds to zoo.cfg from the launch arguments
            main.initializeAndRun(args);
        } catch (IllegalArgumentException e) {
            ... ...
        }
        LOG.info("Exiting normally");
        System.exit(0);
    }

2) initializeAndRun

    protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
        // Holds the zk configuration
        QuorumPeerConfig config = new QuorumPeerConfig();
        if (args.length == 1) {
            // 1 Parse the arguments: zoo.cfg and myid
            config.parse(args[0]);
        }
        // 2 Start the scheduled task that deletes expired snapshots (disabled by default)
        // Start and schedule the purge task
        DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config
            .getDataDir(), config.getDataLogDir(), config
            .getSnapRetainCount(), config.getPurgeInterval());
        purgeMgr.start();
        if (args.length == 1 && config.isDistributed()) {
            // 3 Start in cluster mode
            runFromConfig(config);
        } else {
            LOG.warn("Either no config or no quorum defined in config, running "
                + " in standalone mode");
            // there is only server in the quorum -- run as standalone
            ZooKeeperServerMain.main(args);
        }
    }

2.2.3 Parsing the Arguments: zoo.cfg and myid

QuorumPeerConfig.java

    public void parse(String path) throws ConfigException {
        LOG.info("Reading configuration from: " + path);
        try {
            // Validate the file path and check that the file exists
            File configFile = (new VerifyingFileFactory.Builder(LOG)
                .warnForRelativePath()
                .failForNonExistingPath()
                .build()).create(path);
            Properties cfg = new Properties();
            FileInputStream in = new FileInputStream(configFile);
            try {
                // Load the configuration file
                cfg.load(in);
                configFileStr = path;
            } finally {
                in.close();
            }
            // Parse the configuration file
            parseProperties(cfg);
        } catch (IOException e) {
            throw new ConfigException("Error processing " + path, e);
        } catch (IllegalArgumentException e) {
            throw new ConfigException("Error processing " + path, e);
        }
        ... ...
    }

QuorumPeerConfig.java

    public void parseProperties(Properties zkProp)
            throws IOException, ConfigException {
        int clientPort = 0;
        int secureClientPort = 0;
        String clientPortAddress = null;
        String secureClientPortAddress = null;
        VerifyingFileFactory vff = new
            VerifyingFileFactory.Builder(LOG).warnForRelativePath().build();
        // Read the property values from zoo.cfg and assign them to the QuorumPeerConfig object
        for (Entry<Object, Object> entry : zkProp.entrySet()) {
            String key = entry.getKey().toString().trim();
            String value = entry.getValue().toString().trim();
            if (key.equals("dataDir")) {
                dataDir = vff.create(value);
            } else if (key.equals("dataLogDir")) {
                dataLogDir = vff.create(value);
            } else if (key.equals("clientPort")) {
                clientPort = Integer.parseInt(value);
            } else if (key.equals("localSessionsEnabled")) {
                localSessionsEnabled = Boolean.parseBoolean(value);
            } else if (key.equals("localSessionsUpgradingEnabled")) {
                localSessionsUpgradingEnabled = Boolean.parseBoolean(value);
            } else if (key.equals("clientPortAddress")) {
                clientPortAddress = value.trim();
            } else if (key.equals("secureClientPort")) {
                secureClientPort = Integer.parseInt(value);
            } else if (key.equals("secureClientPortAddress")){
                secureClientPortAddress = value.trim();
            } else if (key.equals("tickTime")) {
                tickTime = Integer.parseInt(value);
            } else if (key.equals("maxClientCnxns")) {
                maxClientCnxns = Integer.parseInt(value);
            } else if (key.equals("minSessionTimeout")) {
                minSessionTimeout = Integer.parseInt(value);
            }
            ... ...
        }
        ... ...
        if (dynamicConfigFileStr == null) {
            setupQuorumPeerConfig(zkProp, true);
            if (isDistributed() && isReconfigEnabled()) {
                // we don't backup static config for standalone mode.
                // we also don't backup if reconfig feature is disabled.
                backupOldConfig();
            }
        }
    }
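The load-then-dispatch pattern used by parse and parseProperties can be tried in isolation. The sketch below is a simplified stand-in, not QuorumPeerConfig itself: the class `ZooCfgSketch` is hypothetical, it reads the configuration from a string instead of a file, and it handles only three of the keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;

// Simplified key-by-key config parsing (hypothetical sketch, not QuorumPeerConfig).
final class ZooCfgSketch {
    String dataDir;
    int clientPort;
    int tickTime;

    // Loads key=value pairs from the given text and copies known keys into fields.
    static ZooCfgSketch parse(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        ZooCfgSketch cfg = new ZooCfgSketch();
        for (Map.Entry<Object, Object> entry : props.entrySet()) {
            // Properties.load keeps trailing whitespace in values, hence the trim.
            String key = entry.getKey().toString().trim();
            String value = entry.getValue().toString().trim();
            if (key.equals("dataDir")) {
                cfg.dataDir = value;
            } else if (key.equals("clientPort")) {
                cfg.clientPort = Integer.parseInt(value);
            } else if (key.equals("tickTime")) {
                cfg.tickTime = Integer.parseInt(value);
            }
            // Any other key is ignored in this sketch.
        }
        return cfg;
    }

    public static void main(String[] args) {
        ZooCfgSketch cfg = parse("tickTime=2000\ndataDir=/tmp/zookeeper\nclientPort=2181\n");
        System.out.println(cfg.dataDir + " " + cfg.clientPort + " " + cfg.tickTime);
    }
}
```

A malformed number fails fast with a NumberFormatException, which is an IllegalArgumentException; that is the kind of error the parse method above wraps in a ConfigException.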

QuorumPeerConfig.java

    void setupQuorumPeerConfig(Properties prop, boolean configBackwardCompatibilityMode)
            throws IOException, ConfigException {
        quorumVerifier = parseDynamicConfig(prop, electionAlg, true, configBackwardCompatibilityMode);
        setupMyId();
        setupClientPort();
        setupPeerType();
        checkValidity();
    }

QuorumPeerConfig.java

    private void setupMyId() throws IOException {
        File myIdFile = new File(dataDir, "myid");
        // standalone server doesn't need myid file.
        if (!myIdFile.isFile()) {
            return;
        }
        BufferedReader br = new BufferedReader(new FileReader(myIdFile));
        String myIdString;
        try {
            myIdString = br.readLine();
        } finally {
            br.close();
        }
        try {
            // Parse the id in the myid file and assign it to serverId
            serverId = Long.parseLong(myIdString);
            MDC.put("myid", myIdString);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("serverid " + myIdString + " is not a number");
        }
    }
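The myid lookup can be exercised on its own with a temporary directory. This is a sketch that mirrors the logic of setupMyId under a few assumptions: `MyIdSketch`, `NO_ID` and `demo` are hypothetical names, the value is returned instead of being stored in a field, and the first line is trimmed before parsing (the original code does not trim).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Reads a server id from <dataDir>/myid (hypothetical sketch mirroring setupMyId).
final class MyIdSketch {
    static final long NO_ID = -1L;  // hypothetical marker for "no myid file present"

    // Returns the id stored in <dataDir>/myid, or NO_ID when the file does not exist.
    static long readMyId(Path dataDir) {
        Path myIdFile = dataDir.resolve("myid");
        if (!Files.isRegularFile(myIdFile)) {
            return NO_ID;  // a standalone server does not need a myid file
        }
        String first;
        try {
            List<String> lines = Files.readAllLines(myIdFile, StandardCharsets.UTF_8);
            first = lines.isEmpty() ? "" : lines.get(0).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            return Long.parseLong(first);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("serverid " + first + " is not a number", e);
        }
    }

    // Writes the given content to a myid file in a fresh temp directory and reads it back.
    static long demo(String myIdContent) {
        try {
            Path dir = Files.createTempDirectory("zk-sketch");
            Files.write(dir.resolve("myid"), myIdContent.getBytes(StandardCharsets.UTF_8));
            return readMyId(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `MyIdSketch.demo("3\n")` returns 3, while a myid file containing non-numeric text raises an IllegalArgumentException, matching the error path in setupMyId.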

2.2.4 Deleting Expired Snapshots

A scheduled task can be started to delete expired snapshots. This feature is disabled by default.

QuorumPeerMain.java

    protected void initializeAndRun(String[] args) throws ConfigException, IOException, AdminServerException {
        // Holds the zk configuration
        QuorumPeerConfig config = new QuorumPeerConfig();
        if (args.length == 1) {
            // 1 Parse the arguments: zoo.cfg and myid
            config.parse(args[0]);
        }
        // 2 Start the scheduled task that deletes expired snapshots (disabled by default)
        // config.getSnapRetainCount() = 3: the minimum number of snapshots to keep
        // config.getPurgeInterval() = 0: the default of 0 means the task is disabled
        // Start and schedule the purge task
        DatadirCleanupManager purgeMgr = new DatadirCleanupManager(config.getDataDir(), config.getDataLogDir(), config.getSnapRetainCount(), config.getPurgeInterval());
        purgeMgr.start();
        if (args.length == 1 && config.isDistributed()) {
            // 3 Start in cluster mode
            runFromConfig(config);
        } else {
            LOG.warn("Either no config or no quorum defined in config, running "
                + " in standalone mode");
            // there is only server in the quorum -- run as standalone
            ZooKeeperServerMain.main(args);
        }
    }

QuorumPeerConfig.java

    protected int snapRetainCount = 3;
    protected int purgeInterval = 0;

DatadirCleanupManager.java

    public void start() {
        if (PurgeTaskStatus.STARTED == purgeTaskStatus) {
            LOG.warn("Purge task is already running.");
            return;
        }
        // By default purgeInterval = 0, so the task is disabled and start() returns here
        // Don't schedule the purge task with zero or negative purge interval.
        if (purgeInterval <= 0) {
            LOG.info("Purge task is not scheduled.");
            return;
        }
        // Create a timer
        timer = new Timer("PurgeTask", true);
        // Create the snapshot cleanup task
        TimerTask task = new PurgeTask(dataLogDir, snapDir, snapRetainCount);
        // If purgeInterval is 1, the task checks once per hour for expired snapshots and deletes any it finds
        timer.scheduleAtFixedRate(task, 0, TimeUnit.HOURS.toMillis(purgeInterval));
        purgeTaskStatus = PurgeTaskStatus.STARTED;
    }
    static class PurgeTask extends TimerTask {
        private File logsDir;
        private File snapsDir;
        private int snapRetainCount;
        public PurgeTask(File dataDir, File snapDir, int count) {
            logsDir = dataDir;
            snapsDir = snapDir;
            snapRetainCount = count;
        }
        @Override
        public void run() {
            LOG.info("Purge task started.");
            try {
                // Clean up the expired data
                PurgeTxnLog.purge(logsDir, snapsDir, snapRetainCount);
            } catch (Exception e) {
                LOG.error("Error occurred while purging.", e);
            }
            LOG.info("Purge task completed.");
        }
    }

PurgeTxnLog.java

    public static void purge(File dataDir, File snapDir, int num) throws IOException {
        if (num < 3) {
            throw new IllegalArgumentException(COUNT_ERR_MSG);
        }
        FileTxnSnapLog txnLog = new FileTxnSnapLog(dataDir, snapDir);
        List<File> snaps = txnLog.findNRecentSnapshots(num);
        int numSnaps = snaps.size();
        if (numSnaps > 0) {
            purgeOlderSnapshots(txnLog, snaps.get(numSnaps - 1));
        }
    }
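The retention rule (keep the most recent snapshots, never fewer than three, and remove the rest) can be sketched on file names alone. This is an illustration, not PurgeTxnLog itself: `SnapshotRetentionSketch` is a hypothetical class, it assumes snapshot files are named `snapshot.` followed by the zxid in hexadecimal, and it only selects names without touching the file system or the transaction logs.

```java
import java.util.ArrayList;
import java.util.List;

// Chooses which snapshot files to delete (hypothetical sketch, not PurgeTxnLog).
final class SnapshotRetentionSketch {
    // Parses the hexadecimal zxid suffix of a name such as "snapshot.1a".
    static long zxidOf(String fileName) {
        return Long.parseLong(fileName.substring(fileName.indexOf('.') + 1), 16);
    }

    // Returns the names to delete, newest first, keeping the `retain` most recent snapshots.
    static List<String> toDelete(List<String> snapshotNames, int retain) {
        if (retain < 3) {
            throw new IllegalArgumentException("retain count must be at least 3");
        }
        List<String> sorted = new ArrayList<>(snapshotNames);
        // Sort by zxid in descending order so the newest snapshots come first.
        sorted.sort((a, b) -> Long.compare(zxidOf(b), zxidOf(a)));
        if (sorted.size() <= retain) {
            return new ArrayList<>();  // nothing is old enough to delete
        }
        return new ArrayList<>(sorted.subList(retain, sorted.size()));
    }
}
```

Sorting by the parsed zxid rather than by the file name matters because the suffix is hexadecimal: `snapshot.a` (zxid 10) is newer than `snapshot.3`.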

2.2.5 Initializing the Communication Component