failover过程

完整日志如下(GTID模式)

  1. Mon Jun 10 11:25:51 2019 - [warning] Got error on MySQL select ping: 1053 (Server shutdown in progress)
  2. Mon Jun 10 11:25:51 2019 - [info] Executing SSH check script: exit 0
  3. Mon Jun 10 11:25:51 2019 - [warning] HealthCheck: SSH to db1 is NOT reachable.
  4. Mon Jun 10 11:25:52 2019 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.100.64.119' (111))
  5. Mon Jun 10 11:25:52 2019 - [warning] Connection failed 2 time(s)..
  6. Mon Jun 10 11:25:53 2019 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.100.64.119' (111))
  7. Mon Jun 10 11:25:53 2019 - [warning] Connection failed 3 time(s)..
  8. Mon Jun 10 11:25:54 2019 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.100.64.119' (111))
  9. Mon Jun 10 11:25:54 2019 - [warning] Connection failed 4 time(s)..
  10. Mon Jun 10 11:25:54 2019 - [warning] Master is not reachable from health checker!
  11. Mon Jun 10 11:25:54 2019 - [warning] Master db1(10.100.64.119:3306) is not reachable!
  12. Mon Jun 10 11:25:54 2019 - [warning] SSH is NOT reachable.
  13. Mon Jun 10 11:25:54 2019 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/masterha_default.conf and /etc/masterha/app1.conf again, and trying to connect to all servers to check server status..
  14. Mon Jun 10 11:25:54 2019 - [info] Reading default configuration from /etc/masterha/masterha_default.conf..
  15. Mon Jun 10 11:25:54 2019 - [info] Reading application default configuration from /etc/masterha/app1.conf..
  16. Mon Jun 10 11:25:54 2019 - [info] Reading server configuration from /etc/masterha/app1.conf..
  17. Mon Jun 10 11:25:55 2019 - [info] GTID failover mode = 1
  18. Mon Jun 10 11:25:55 2019 - [info] Dead Servers:
  19. Mon Jun 10 11:25:55 2019 - [info] db1(10.100.64.119:3306)
  20. Mon Jun 10 11:25:55 2019 - [info] Alive Servers:
  21. Mon Jun 10 11:25:55 2019 - [info] db2(10.100.64.120:3306)
  22. Mon Jun 10 11:25:55 2019 - [info] db3(10.100.64.121:3306)
  23. Mon Jun 10 11:25:55 2019 - [info] Alive Slaves:
  24. Mon Jun 10 11:25:55 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  25. Mon Jun 10 11:25:55 2019 - [info] GTID ON
  26. Mon Jun 10 11:25:55 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  27. Mon Jun 10 11:25:55 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  28. Mon Jun 10 11:25:55 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  29. Mon Jun 10 11:25:55 2019 - [info] GTID ON
  30. Mon Jun 10 11:25:55 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  31. Mon Jun 10 11:25:55 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  32. Mon Jun 10 11:25:55 2019 - [info] Checking slave configurations..
  33. Mon Jun 10 11:25:55 2019 - [info] read_only=1 is not set on slave db2(10.100.64.120:3306).
  34. Mon Jun 10 11:25:55 2019 - [info] read_only=1 is not set on slave db3(10.100.64.121:3306).
  35. Mon Jun 10 11:25:55 2019 - [info] Checking replication filtering settings..
  36. Mon Jun 10 11:25:55 2019 - [info] Replication filtering check ok.
  37. Mon Jun 10 11:25:55 2019 - [info] Master is down!
  38. Mon Jun 10 11:25:55 2019 - [info] Terminating monitoring script.
  39. Mon Jun 10 11:25:55 2019 - [info] Got exit code 20 (Master dead).
  40. Mon Jun 10 11:25:55 2019 - [info] MHA::MasterFailover version 0.58.
  41. Mon Jun 10 11:25:55 2019 - [info] Starting master failover.
  42. Mon Jun 10 11:25:55 2019 - [info]
  43. Mon Jun 10 11:25:55 2019 - [info] * Phase 1: Configuration Check Phase..
  44. Mon Jun 10 11:25:55 2019 - [info]
  45. Mon Jun 10 11:25:56 2019 - [info] GTID failover mode = 1
  46. Mon Jun 10 11:25:56 2019 - [info] Dead Servers:
  47. Mon Jun 10 11:25:56 2019 - [info] db1(10.100.64.119:3306)
  48. Mon Jun 10 11:25:56 2019 - [info] Checking master reachability via MySQL(double check)...
  49. Mon Jun 10 11:25:56 2019 - [info] ok.
  50. Mon Jun 10 11:25:56 2019 - [info] Alive Servers:
  51. Mon Jun 10 11:25:56 2019 - [info] db2(10.100.64.120:3306)
  52. Mon Jun 10 11:25:56 2019 - [info] db3(10.100.64.121:3306)
  53. Mon Jun 10 11:25:56 2019 - [info] Alive Slaves:
  54. Mon Jun 10 11:25:56 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  55. Mon Jun 10 11:25:56 2019 - [info] GTID ON
  56. Mon Jun 10 11:25:56 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  57. Mon Jun 10 11:25:56 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  58. Mon Jun 10 11:25:56 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  59. Mon Jun 10 11:25:56 2019 - [info] GTID ON
  60. Mon Jun 10 11:25:56 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  61. Mon Jun 10 11:25:56 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  62. Mon Jun 10 11:25:56 2019 - [info] Starting GTID based failover.
  63. Mon Jun 10 11:25:56 2019 - [info]
  64. Mon Jun 10 11:25:56 2019 - [info] ** Phase 1: Configuration Check Phase completed.
  65. Mon Jun 10 11:25:56 2019 - [info]
  66. Mon Jun 10 11:25:56 2019 - [info] * Phase 2: Dead Master Shutdown Phase..
  67. Mon Jun 10 11:25:56 2019 - [info]
  68. Mon Jun 10 11:25:56 2019 - [info] Forcing shutdown so that applications never connect to the current master..
  69. Mon Jun 10 11:25:56 2019 - [info] Executing master IP deactivation script:
  70. Mon Jun 10 11:25:56 2019 - [info] /etc/masterha/master_ip_failover --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --command=stop --orig_master_ssh_port=52000
  71. drop vip from original master
  72. ssh: connect to host db1 port 52000: Connection refused
  73. Mon Jun 10 11:25:59 2019 - [info] done.
  74. Mon Jun 10 11:25:59 2019 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
  75. Mon Jun 10 11:25:59 2019 - [info] * Phase 2: Dead Master Shutdown Phase completed.
  76. Mon Jun 10 11:25:59 2019 - [info]
  77. Mon Jun 10 11:25:59 2019 - [info] * Phase 3: Master Recovery Phase..
  78. Mon Jun 10 11:25:59 2019 - [info]
  79. Mon Jun 10 11:25:59 2019 - [info] * Phase 3.1: Getting Latest Slaves Phase..
  80. Mon Jun 10 11:25:59 2019 - [info]
  81. Mon Jun 10 11:25:59 2019 - [info] The latest binary log file/position on all slaves is mysql-bin.000037:194
  82. Mon Jun 10 11:25:59 2019 - [info] Latest slaves (Slaves that received relay log files to the latest):
  83. Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  84. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  85. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  86. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  87. Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  88. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  89. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  90. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  91. Mon Jun 10 11:25:59 2019 - [info] The oldest binary log file/position on all slaves is mysql-bin.000037:194
  92. Mon Jun 10 11:25:59 2019 - [info] Oldest slaves:
  93. Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  94. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  95. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  96. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  97. Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  98. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  99. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  100. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  101. Mon Jun 10 11:25:59 2019 - [info]
  102. Mon Jun 10 11:25:59 2019 - [info] * Phase 3.3: Determining New Master Phase..
  103. Mon Jun 10 11:25:59 2019 - [info]
  104. Mon Jun 10 11:25:59 2019 - [info] Searching new master from slaves..
  105. Mon Jun 10 11:25:59 2019 - [info] Candidate masters from the configuration file:
  106. Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  107. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  108. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  109. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  110. Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  111. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  112. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  113. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  114. Mon Jun 10 11:25:59 2019 - [info] Non-candidate masters:
  115. Mon Jun 10 11:25:59 2019 - [info] Searching from candidate_master slaves which have received the latest relay log events..
  116. Mon Jun 10 11:25:59 2019 - [info] New master is db2(10.100.64.120:3306)
  117. Mon Jun 10 11:25:59 2019 - [info] Starting master failover..
  118. Mon Jun 10 11:25:59 2019 - [info]
  119. From:
  120. db1(10.100.64.119:3306) (current master)
  121. +--db2(10.100.64.120:3306)
  122. +--db3(10.100.64.121:3306)
  123. To:
  124. db2(10.100.64.120:3306) (new master)
  125. +--db3(10.100.64.121:3306)
  126. Mon Jun 10 11:25:59 2019 - [info]
  127. Mon Jun 10 11:25:59 2019 - [info] * Phase 3.3: New Master Recovery Phase..
  128. Mon Jun 10 11:25:59 2019 - [info]
  129. Mon Jun 10 11:25:59 2019 - [info] Waiting all logs to be applied..
  130. Mon Jun 10 11:25:59 2019 - [info] done.
  131. Mon Jun 10 11:25:59 2019 - [info] Getting new master's binlog name and position..
  132. Mon Jun 10 11:25:59 2019 - [info] mysql-bin.000032:194
  133. Mon Jun 10 11:25:59 2019 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='db2 or 10.100.64.120', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
  134. Mon Jun 10 11:25:59 2019 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000032, 194, 479dbc89-e31e-11e8-83eb-005056a8803c:1-696587
  135. Mon Jun 10 11:25:59 2019 - [info] Executing master IP activate script:
  136. Mon Jun 10 11:25:59 2019 - [info] /etc/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --new_master_host=db2 --new_master_ip=10.100.64.120 --new_master_port=3306 --new_master_user='xucl' --orig_master_ssh_port=52000 --new_master_ssh_port=52000 --new_master_password=xxx
  137. Set read_only=0 on the new master.
  138. add vip to new master
  139. Warning: Permanently added '[db2]:52000' (ECDSA) to the list of known hosts.
  140. Mon Jun 10 11:25:59 2019 - [info] OK.
  141. Mon Jun 10 11:25:59 2019 - [info] ** Finished master recovery successfully.
  142. Mon Jun 10 11:25:59 2019 - [info] * Phase 3: Master Recovery Phase completed.
  143. Mon Jun 10 11:25:59 2019 - [info]
  144. Mon Jun 10 11:25:59 2019 - [info] * Phase 4: Slaves Recovery Phase..
  145. Mon Jun 10 11:25:59 2019 - [info]
  146. Mon Jun 10 11:25:59 2019 - [info]
  147. Mon Jun 10 11:25:59 2019 - [info] * Phase 4.1: Starting Slaves in parallel..
  148. Mon Jun 10 11:25:59 2019 - [info]
  149. Mon Jun 10 11:25:59 2019 - [info] -- Slave recovery on host db3(10.100.64.121:3306) started, pid: 2886. Check tmp log /var/log/masterha/app1/db3_3306_20190610112555.log if it takes time..
  150. Mon Jun 10 11:26:00 2019 - [info]
  151. Mon Jun 10 11:26:00 2019 - [info] Log messages from db3 ...
  152. Mon Jun 10 11:26:00 2019 - [info]
  153. Mon Jun 10 11:25:59 2019 - [info] Resetting slave db3(10.100.64.121:3306) and starting replication from the new master db2(10.100.64.120:3306)..
  154. Mon Jun 10 11:25:59 2019 - [info] Executed CHANGE MASTER.
  155. Mon Jun 10 11:25:59 2019 - [info] Slave started.
  156. Mon Jun 10 11:25:59 2019 - [info] gtid_wait(479dbc89-e31e-11e8-83eb-005056a8803c:1-696587) completed on db3(10.100.64.121:3306). Executed 0 events.
  157. Mon Jun 10 11:26:00 2019 - [info] End of log messages from db3.
  158. Mon Jun 10 11:26:00 2019 - [info] -- Slave on host db3(10.100.64.121:3306) started.
  159. Mon Jun 10 11:26:00 2019 - [info] All new slave servers recovered successfully.
  160. Mon Jun 10 11:26:00 2019 - [info]
  161. Mon Jun 10 11:26:00 2019 - [info] * Phase 5: New master cleanup phase..
  162. Mon Jun 10 11:26:00 2019 - [info]
  163. Mon Jun 10 11:26:00 2019 - [info] Resetting slave info on the new master..
  164. Mon Jun 10 11:26:00 2019 - [info] db2: Resetting slave info succeeded.
  165. Mon Jun 10 11:26:00 2019 - [info] Master failover to db2(10.100.64.120:3306) completed successfully.
  166. Mon Jun 10 11:26:00 2019 - [info]
  167. ----- Failover Report -----
  168. app1: MySQL Master failover db1(10.100.64.119:3306) to db2(10.100.64.120:3306) succeeded
  169. Master db1(10.100.64.119:3306) is down!
  170. Check MHA Manager logs at manager:/var/log/masterha/app1/app1.log for details.
  171. Started automated(non-interactive) failover.
  172. Invalidated master IP address on db1(10.100.64.119:3306)
  173. Selected db2(10.100.64.120:3306) as a new master.
  174. db2(10.100.64.120:3306): OK: Applying all logs succeeded.
  175. db2(10.100.64.120:3306): OK: Activated master IP address.
  176. db3(10.100.64.121:3306): OK: Slave started, replicating from db2(10.100.64.120:3306)
  177. db2(10.100.64.120:3306): Resetting slave info succeeded.
  178. Master failover to db2(10.100.64.120:3306) completed successfully.

从日志上我们可以做一下分解、切割

Phase 1: Configuration Check Phase..

  1. Mon Jun 10 11:25:56 2019 - [info] GTID failover mode = 1
  2. Mon Jun 10 11:25:56 2019 - [info] Dead Servers:
  3. Mon Jun 10 11:25:56 2019 - [info] db1(10.100.64.119:3306)
  4. Mon Jun 10 11:25:56 2019 - [info] Checking master reachability via MySQL(double check)...
  5. Mon Jun 10 11:25:56 2019 - [info] ok.
  6. Mon Jun 10 11:25:56 2019 - [info] Alive Servers:
  7. Mon Jun 10 11:25:56 2019 - [info] db2(10.100.64.120:3306)
  8. Mon Jun 10 11:25:56 2019 - [info] db3(10.100.64.121:3306)
  9. Mon Jun 10 11:25:56 2019 - [info] Alive Slaves:
  10. Mon Jun 10 11:25:56 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  11. Mon Jun 10 11:25:56 2019 - [info] GTID ON
  12. Mon Jun 10 11:25:56 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  13. Mon Jun 10 11:25:56 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  14. Mon Jun 10 11:25:56 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  15. Mon Jun 10 11:25:56 2019 - [info] GTID ON
  16. Mon Jun 10 11:25:56 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  17. Mon Jun 10 11:25:56 2019 - [info] Primary candidate for the new Master (candidate_master is set)

该阶段主要做的是:

  • 检查failover模式(这里是基于GTID的failover)
  • 检查dead master(double check)
  • 检查alive servers,并检查每个节点信息(GTID_MODE、从哪里复制、是否是candidate_master、read_only、replication filter)

Phase 2: Dead Master Shutdown Phase..

  1. Mon Jun 10 11:25:56 2019 - [info] * Phase 2: Dead Master Shutdown Phase..
  2. Mon Jun 10 11:25:56 2019 - [info]
  3. Mon Jun 10 11:25:56 2019 - [info] Forcing shutdown so that applications never connect to the current master..
  4. Mon Jun 10 11:25:56 2019 - [info] Executing master IP deactivation script:
  5. Mon Jun 10 11:25:56 2019 - [info] /etc/masterha/master_ip_failover --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --command=stop --orig_master_ssh_port=52000
  6. drop vip from original master
  7. ssh: connect to host db1 port 52000: Connection refused
  8. Mon Jun 10 11:25:59 2019 - [info] done.
  9. Mon Jun 10 11:25:59 2019 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
  10. Mon Jun 10 11:25:59 2019 - [info] * Phase 2: Dead Master Shutdown Phase completed.
  1. /etc/masterha/master_ip_failover --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --command=stop --orig_master_ssh_port=52000

调用master_ip_failover脚本,调用stop命令,我们再看下调用stop命令是进行了什么操作

  1. if ( $command eq "stop" || $command eq "stopssh" ) {
  2. # $orig_master_host, $orig_master_ip, $orig_master_port are passed.
  3. # If you manage master ip address at global catalog database,
  4. # invalidate orig_master_ip here.
  5. &drop_vip();
  6. my $exit_code = 1;
  7. eval {
  8. # updating global catalog, etc
  9. $exit_code = 0;
  10. };
  11. if ($@) {
  12. warn "Got Error: $@\n";
  13. exit $exit_code;
  14. }
  15. exit $exit_code;
  16. }

实际上,就是去原先的master执行了删除vip的动作,从我们的日志上看到的是,因为原先的server服务器无法ssh上去,因此也无法执行drop动作

Phase 3: Master Recovery Phase..

  1. Mon Jun 10 11:25:59 2019 - [info] * Phase 3: Master Recovery Phase..
  2. Mon Jun 10 11:25:59 2019 - [info]
  3. Mon Jun 10 11:25:59 2019 - [info] * Phase 3.1: Getting Latest Slaves Phase..
  4. Mon Jun 10 11:25:59 2019 - [info]
  5. Mon Jun 10 11:25:59 2019 - [info] The latest binary log file/position on all slaves is mysql-bin.000037:194
  6. Mon Jun 10 11:25:59 2019 - [info] Latest slaves (Slaves that received relay log files to the latest):
  7. Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  8. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  9. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  10. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  11. Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  12. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  13. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  14. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  15. Mon Jun 10 11:25:59 2019 - [info] The oldest binary log file/position on all slaves is mysql-bin.000037:194
  16. Mon Jun 10 11:25:59 2019 - [info] Oldest slaves:
  17. Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  18. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  19. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  20. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  21. Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  22. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  23. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  24. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  25. Mon Jun 10 11:25:59 2019 - [info]
  26. Mon Jun 10 11:25:59 2019 - [info] * Phase 3.3: Determining New Master Phase..
  27. Mon Jun 10 11:25:59 2019 - [info]
  28. Mon Jun 10 11:25:59 2019 - [info] Searching new master from slaves..
  29. Mon Jun 10 11:25:59 2019 - [info] Candidate masters from the configuration file:
  30. Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  31. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  32. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  33. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  34. Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
  35. Mon Jun 10 11:25:59 2019 - [info] GTID ON
  36. Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
  37. Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
  38. Mon Jun 10 11:25:59 2019 - [info] Non-candidate masters:
  39. Mon Jun 10 11:25:59 2019 - [info] Searching from candidate_master slaves which have received the latest relay log events..
  40. Mon Jun 10 11:25:59 2019 - [info] New master is db2(10.100.64.120:3306)
  41. Mon Jun 10 11:25:59 2019 - [info] Starting master failover..
  42. Mon Jun 10 11:25:59 2019 - [info]
  43. From:
  44. db1(10.100.64.119:3306) (current master)
  45. +--db2(10.100.64.120:3306)
  46. +--db3(10.100.64.121:3306)
  47. To:
  48. db2(10.100.64.120:3306) (new master)
  49. +--db3(10.100.64.121:3306)
  50. Mon Jun 10 11:25:59 2019 - [info]
  51. Mon Jun 10 11:25:59 2019 - [info] * Phase 3.3: New Master Recovery Phase..
  52. Mon Jun 10 11:25:59 2019 - [info]
  53. Mon Jun 10 11:25:59 2019 - [info] Waiting all logs to be applied..
  54. Mon Jun 10 11:25:59 2019 - [info] done.
  55. Mon Jun 10 11:25:59 2019 - [info] Getting new master's binlog name and position..
  56. Mon Jun 10 11:25:59 2019 - [info] mysql-bin.000032:194
  57. Mon Jun 10 11:25:59 2019 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='db2 or 10.100.64.120', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
  58. Mon Jun 10 11:25:59 2019 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000032, 194, 479dbc89-e31e-11e8-83eb-005056a8803c:1-696587
  59. Mon Jun 10 11:25:59 2019 - [info] Executing master IP activate script:
  60. Mon Jun 10 11:25:59 2019 - [info] /etc/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --new_master_host=db2 --new_master_ip=10.100.64.120 --new_master_port=3306 --new_master_user='xucl' --orig_master_ssh_port=52000 --new_master_ssh_port=52000 --new_master_password=xxx
  61. Set read_only=0 on the new master.
  62. add vip to new master
  63. Warning: Permanently added '[db2]:52000' (ECDSA) to the list of known hosts.
  64. Mon Jun 10 11:25:59 2019 - [info] OK.
  65. Mon Jun 10 11:25:59 2019 - [info] ** Finished master recovery successfully.
  66. Mon Jun 10 11:25:59 2019 - [info] * Phase 3: Master Recovery Phase completed.

整个第三阶段步骤是最多的,主要有如下:

  • 3.1: Getting Latest Slaves Phase
  • 该阶段主要通过复制信息确定哪个slave节点有最新的binlog
  • 3.2: Saving Dead Master’s Binlog Phase
  • 保存dead master上的binlog,如果是由于down机导致无法ssh,那么会提示Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost
  • 3.3: Determining New Master Phase
  • 该阶段即确认哪个slave节点会被选为新的master节点
  • 3.4:New Master Diff Log Generation Phase
  • 该阶段即调用save_binary_logs来补齐binlog
  • 3.5: Master Log Apply Phase
  • 等待新master应用完全部的binlog,确定应用完日志的最新位点,生成change master to语句
  • 完成以后调用failover命令
  1. /etc/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --new_master_host=db2 --new_master_ip=10.100.64.120 --new_master_port=3306 --new_master_user='xucl' --orig_master_ssh_port=52000 --new_master_ssh_port=52000 --new_master_password=xxx

这里调用start命令,源码如下:

  1. elsif ( $command eq "start" ) {
  2. # all arguments are passed.
  3. # If you manage master ip address at global catalog database,
  4. # activate new_master_ip here.
  5. # You can also grant write access (create user, set read_only=0, etc) here.
  6. my $exit_code = 10;
  7. eval {
  8. my $new_master_handler = new MHA::DBHelper();
  9. # args: hostname, port, user, password, raise_error_or_not
  10. $new_master_handler->connect( $new_master_ip, $new_master_port,
  11. $new_master_user, $new_master_password, 1 );
  12. ## Set read_only=0 on the new master
  13. $new_master_handler->disable_log_bin_local();
  14. print "Set read_only=0 on the new master.\n";
  15. $new_master_handler->disable_read_only();
  16. ## Creating an app user on the new master
  17. #print "Creating app user on the new master..\n";
  18. #FIXME_xxx_create_user( $new_master_handler->{dbh} );
  19. $new_master_handler->enable_log_bin_local();
  20. $new_master_handler->disconnect();
  21. ## Update master ip on the catalog database, etc
  22. &add_vip();
  23. $exit_code = 0;
  24. };
  25. if ($@) {
  26. warn $@;
  27. # If you want to continue failover, exit 10.
  28. exit $exit_code;
  29. }
  30. exit $exit_code;
  31. }

这里实际上执行的是设置read_only,然后绑定VIP的操作

Phase 4: Slaves Recovery Phase

  • Phase 4.1: Starting Parallel Slave Diff Log Generation Phase
  • 其他从库全部应用完relay log以后,从新的master上补齐差异binlog
  • Phase 4.2: Starting Parallel Slave Log Apply Phase
  • 其他节点应用差异的binlog

Phase 5: New master cleanup phase

该阶段即清理new master上的slave信息

最后完成fail over后,给出简单的报告:

  1. app1: MySQL Master failover db1(10.100.64.119:3306) to db2(10.100.64.120:3306) succeeded
  2. Master db1(10.100.64.119:3306) is down!
  3. Check MHA Manager logs at manager:/var/log/masterha/app1/app1.log for details.
  4. Started automated(non-interactive) failover.
  5. Invalidated master IP address on db1(10.100.64.119:3306)
  6. Selected db2(10.100.64.120:3306) as a new master.
  7. db2(10.100.64.120:3306): OK: Applying all logs succeeded.
  8. db2(10.100.64.120:3306): OK: Activated master IP address.
  9. db3(10.100.64.121:3306): OK: Slave started, replicating from db2(10.100.64.120:3306)
  10. db2(10.100.64.120:3306): Resetting slave info succeeded.
  11. Master failover to db2(10.100.64.120:3306) completed successfully.

简单给出了从哪个节点选举了新的master节点,其他节点成功复制,failover完成

【MySQL】从General Log看MHA原理 - 图1