failover过程
完整日志如下(GTID模式)
Mon Jun 10 11:25:51 2019 - [warning] Got error on MySQL select ping: 1053 (Server shutdown in progress)
Mon Jun 10 11:25:51 2019 - [info] Executing SSH check script: exit 0
Mon Jun 10 11:25:51 2019 - [warning] HealthCheck: SSH to db1 is NOT reachable.
Mon Jun 10 11:25:52 2019 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.100.64.119' (111))
Mon Jun 10 11:25:52 2019 - [warning] Connection failed 2 time(s)..
Mon Jun 10 11:25:53 2019 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.100.64.119' (111))
Mon Jun 10 11:25:53 2019 - [warning] Connection failed 3 time(s)..
Mon Jun 10 11:25:54 2019 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.100.64.119' (111))
Mon Jun 10 11:25:54 2019 - [warning] Connection failed 4 time(s)..
Mon Jun 10 11:25:54 2019 - [warning] Master is not reachable from health checker!
Mon Jun 10 11:25:54 2019 - [warning] Master db1(10.100.64.119:3306) is not reachable!
Mon Jun 10 11:25:54 2019 - [warning] SSH is NOT reachable.
Mon Jun 10 11:25:54 2019 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/masterha_default.conf and /etc/masterha/app1.conf again, and trying to connect to all servers to check server status..
Mon Jun 10 11:25:54 2019 - [info] Reading default configuration from /etc/masterha/masterha_default.conf..
Mon Jun 10 11:25:54 2019 - [info] Reading application default configuration from /etc/masterha/app1.conf..
Mon Jun 10 11:25:54 2019 - [info] Reading server configuration from /etc/masterha/app1.conf..
Mon Jun 10 11:25:55 2019 - [info] GTID failover mode = 1
Mon Jun 10 11:25:55 2019 - [info] Dead Servers:
Mon Jun 10 11:25:55 2019 - [info] db1(10.100.64.119:3306)
Mon Jun 10 11:25:55 2019 - [info] Alive Servers:
Mon Jun 10 11:25:55 2019 - [info] db2(10.100.64.120:3306)
Mon Jun 10 11:25:55 2019 - [info] db3(10.100.64.121:3306)
Mon Jun 10 11:25:55 2019 - [info] Alive Slaves:
Mon Jun 10 11:25:55 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:55 2019 - [info] GTID ON
Mon Jun 10 11:25:55 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:55 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:55 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:55 2019 - [info] GTID ON
Mon Jun 10 11:25:55 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:55 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:55 2019 - [info] Checking slave configurations..
Mon Jun 10 11:25:55 2019 - [info] read_only=1 is not set on slave db2(10.100.64.120:3306).
Mon Jun 10 11:25:55 2019 - [info] read_only=1 is not set on slave db3(10.100.64.121:3306).
Mon Jun 10 11:25:55 2019 - [info] Checking replication filtering settings..
Mon Jun 10 11:25:55 2019 - [info] Replication filtering check ok.
Mon Jun 10 11:25:55 2019 - [info] Master is down!
Mon Jun 10 11:25:55 2019 - [info] Terminating monitoring script.
Mon Jun 10 11:25:55 2019 - [info] Got exit code 20 (Master dead).
Mon Jun 10 11:25:55 2019 - [info] MHA::MasterFailover version 0.58.
Mon Jun 10 11:25:55 2019 - [info] Starting master failover.
Mon Jun 10 11:25:55 2019 - [info]
Mon Jun 10 11:25:55 2019 - [info] * Phase 1: Configuration Check Phase..
Mon Jun 10 11:25:55 2019 - [info]
Mon Jun 10 11:25:56 2019 - [info] GTID failover mode = 1
Mon Jun 10 11:25:56 2019 - [info] Dead Servers:
Mon Jun 10 11:25:56 2019 - [info] db1(10.100.64.119:3306)
Mon Jun 10 11:25:56 2019 - [info] Checking master reachability via MySQL(double check)...
Mon Jun 10 11:25:56 2019 - [info] ok.
Mon Jun 10 11:25:56 2019 - [info] Alive Servers:
Mon Jun 10 11:25:56 2019 - [info] db2(10.100.64.120:3306)
Mon Jun 10 11:25:56 2019 - [info] db3(10.100.64.121:3306)
Mon Jun 10 11:25:56 2019 - [info] Alive Slaves:
Mon Jun 10 11:25:56 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:56 2019 - [info] GTID ON
Mon Jun 10 11:25:56 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:56 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:56 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:56 2019 - [info] GTID ON
Mon Jun 10 11:25:56 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:56 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:56 2019 - [info] Starting GTID based failover.
Mon Jun 10 11:25:56 2019 - [info]
Mon Jun 10 11:25:56 2019 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Jun 10 11:25:56 2019 - [info]
Mon Jun 10 11:25:56 2019 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Jun 10 11:25:56 2019 - [info]
Mon Jun 10 11:25:56 2019 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Jun 10 11:25:56 2019 - [info] Executing master IP deactivation script:
Mon Jun 10 11:25:56 2019 - [info] /etc/masterha/master_ip_failover --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --command=stop --orig_master_ssh_port=52000
drop vip from original master
ssh: connect to host db1 port 52000: Connection refused
Mon Jun 10 11:25:59 2019 - [info] done.
Mon Jun 10 11:25:59 2019 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Jun 10 11:25:59 2019 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 3: Master Recovery Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] The latest binary log file/position on all slaves is mysql-bin.000037:194
Mon Jun 10 11:25:59 2019 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] The oldest binary log file/position on all slaves is mysql-bin.000037:194
Mon Jun 10 11:25:59 2019 - [info] Oldest slaves:
Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 3.3: Determining New Master Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] Searching new master from slaves..
Mon Jun 10 11:25:59 2019 - [info] Candidate masters from the configuration file:
Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] Non-candidate masters:
Mon Jun 10 11:25:59 2019 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Mon Jun 10 11:25:59 2019 - [info] New master is db2(10.100.64.120:3306)
Mon Jun 10 11:25:59 2019 - [info] Starting master failover..
Mon Jun 10 11:25:59 2019 - [info]
From:
db1(10.100.64.119:3306) (current master)
+--db2(10.100.64.120:3306)
+--db3(10.100.64.121:3306)
To:
db2(10.100.64.120:3306) (new master)
+--db3(10.100.64.121:3306)
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 3.3: New Master Recovery Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] Waiting all logs to be applied..
Mon Jun 10 11:25:59 2019 - [info] done.
Mon Jun 10 11:25:59 2019 - [info] Getting new master's binlog name and position..
Mon Jun 10 11:25:59 2019 - [info] mysql-bin.000032:194
Mon Jun 10 11:25:59 2019 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='db2 or 10.100.64.120', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Jun 10 11:25:59 2019 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000032, 194, 479dbc89-e31e-11e8-83eb-005056a8803c:1-696587
Mon Jun 10 11:25:59 2019 - [info] Executing master IP activate script:
Mon Jun 10 11:25:59 2019 - [info] /etc/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --new_master_host=db2 --new_master_ip=10.100.64.120 --new_master_port=3306 --new_master_user='xucl' --orig_master_ssh_port=52000 --new_master_ssh_port=52000 --new_master_password=xxx
Set read_only=0 on the new master.
add vip to new master
Warning: Permanently added '[db2]:52000' (ECDSA) to the list of known hosts.
Mon Jun 10 11:25:59 2019 - [info] OK.
Mon Jun 10 11:25:59 2019 - [info] ** Finished master recovery successfully.
Mon Jun 10 11:25:59 2019 - [info] * Phase 3: Master Recovery Phase completed.
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 4: Slaves Recovery Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 4.1: Starting Slaves in parallel..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] -- Slave recovery on host db3(10.100.64.121:3306) started, pid: 2886. Check tmp log /var/log/masterha/app1/db3_3306_20190610112555.log if it takes time..
Mon Jun 10 11:26:00 2019 - [info]
Mon Jun 10 11:26:00 2019 - [info] Log messages from db3 ...
Mon Jun 10 11:26:00 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] Resetting slave db3(10.100.64.121:3306) and starting replication from the new master db2(10.100.64.120:3306)..
Mon Jun 10 11:25:59 2019 - [info] Executed CHANGE MASTER.
Mon Jun 10 11:25:59 2019 - [info] Slave started.
Mon Jun 10 11:25:59 2019 - [info] gtid_wait(479dbc89-e31e-11e8-83eb-005056a8803c:1-696587) completed on db3(10.100.64.121:3306). Executed 0 events.
Mon Jun 10 11:26:00 2019 - [info] End of log messages from db3.
Mon Jun 10 11:26:00 2019 - [info] -- Slave on host db3(10.100.64.121:3306) started.
Mon Jun 10 11:26:00 2019 - [info] All new slave servers recovered successfully.
Mon Jun 10 11:26:00 2019 - [info]
Mon Jun 10 11:26:00 2019 - [info] * Phase 5: New master cleanup phase..
Mon Jun 10 11:26:00 2019 - [info]
Mon Jun 10 11:26:00 2019 - [info] Resetting slave info on the new master..
Mon Jun 10 11:26:00 2019 - [info] db2: Resetting slave info succeeded.
Mon Jun 10 11:26:00 2019 - [info] Master failover to db2(10.100.64.120:3306) completed successfully.
Mon Jun 10 11:26:00 2019 - [info]
----- Failover Report -----
app1: MySQL Master failover db1(10.100.64.119:3306) to db2(10.100.64.120:3306) succeeded
Master db1(10.100.64.119:3306) is down!
Check MHA Manager logs at manager:/var/log/masterha/app1/app1.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on db1(10.100.64.119:3306)
Selected db2(10.100.64.120:3306) as a new master.
db2(10.100.64.120:3306): OK: Applying all logs succeeded.
db2(10.100.64.120:3306): OK: Activated master IP address.
db3(10.100.64.121:3306): OK: Slave started, replicating from db2(10.100.64.120:3306)
db2(10.100.64.120:3306): Resetting slave info succeeded.
Master failover to db2(10.100.64.120:3306) completed successfully.
从日志上我们可以做一下分解、切割
Phase 1: Configuration Check Phase..
Mon Jun 10 11:25:56 2019 - [info] GTID failover mode = 1
Mon Jun 10 11:25:56 2019 - [info] Dead Servers:
Mon Jun 10 11:25:56 2019 - [info] db1(10.100.64.119:3306)
Mon Jun 10 11:25:56 2019 - [info] Checking master reachability via MySQL(double check)...
Mon Jun 10 11:25:56 2019 - [info] ok.
Mon Jun 10 11:25:56 2019 - [info] Alive Servers:
Mon Jun 10 11:25:56 2019 - [info] db2(10.100.64.120:3306)
Mon Jun 10 11:25:56 2019 - [info] db3(10.100.64.121:3306)
Mon Jun 10 11:25:56 2019 - [info] Alive Slaves:
Mon Jun 10 11:25:56 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:56 2019 - [info] GTID ON
Mon Jun 10 11:25:56 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:56 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:56 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:56 2019 - [info] GTID ON
Mon Jun 10 11:25:56 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:56 2019 - [info] Primary candidate for the new Master (candidate_master is set)
该阶段主要做的是:
- 检查failover模式(这里是基于GTID的failover)
- 检查dead master(double check)
- 检查alive servers,并检查每个节点信息(GTID_MODE、从哪里复制、是否是candidate_master、read_only、replication filter)
Phase 2: Dead Master Shutdown Phase..
Mon Jun 10 11:25:56 2019 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Jun 10 11:25:56 2019 - [info]
Mon Jun 10 11:25:56 2019 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Jun 10 11:25:56 2019 - [info] Executing master IP deactivation script:
Mon Jun 10 11:25:56 2019 - [info] /etc/masterha/master_ip_failover --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --command=stop --orig_master_ssh_port=52000
drop vip from original master
ssh: connect to host db1 port 52000: Connection refused
Mon Jun 10 11:25:59 2019 - [info] done.
Mon Jun 10 11:25:59 2019 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Jun 10 11:25:59 2019 - [info] * Phase 2: Dead Master Shutdown Phase completed.
/etc/masterha/master_ip_failover --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --command=stop --orig_master_ssh_port=52000
调用master_ip_failover脚本,调用stop命令,我们再看下调用stop命令是进行了什么操作
if ( $command eq "stop" || $command eq "stopssh" ) {
# $orig_master_host, $orig_master_ip, $orig_master_port are passed.
# If you manage master ip address at global catalog database,
# invalidate orig_master_ip here.
&drop_vip();
my $exit_code = 1;
eval {
# updating global catalog, etc
$exit_code = 0;
};
if ($@) {
warn "Got Error: $@\n";
exit $exit_code;
}
exit $exit_code;
}
实际上,就是去原先的master执行了删除vip的动作,从我们的日志上看到的是,因为原先的server服务器无法ssh上去,因此也无法执行drop动作
Phase 3: Master Recovery Phase..
Mon Jun 10 11:25:59 2019 - [info] * Phase 3: Master Recovery Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] The latest binary log file/position on all slaves is mysql-bin.000037:194
Mon Jun 10 11:25:59 2019 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] The oldest binary log file/position on all slaves is mysql-bin.000037:194
Mon Jun 10 11:25:59 2019 - [info] Oldest slaves:
Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 3.3: Determining New Master Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] Searching new master from slaves..
Mon Jun 10 11:25:59 2019 - [info] Candidate masters from the configuration file:
Mon Jun 10 11:25:59 2019 - [info] db2(10.100.64.120:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] db3(10.100.64.121:3306) Version=5.7.23-log (oldest major version between slaves) log-bin:enabled
Mon Jun 10 11:25:59 2019 - [info] GTID ON
Mon Jun 10 11:25:59 2019 - [info] Replicating from 10.100.64.119(10.100.64.119:3306)
Mon Jun 10 11:25:59 2019 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Jun 10 11:25:59 2019 - [info] Non-candidate masters:
Mon Jun 10 11:25:59 2019 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Mon Jun 10 11:25:59 2019 - [info] New master is db2(10.100.64.120:3306)
Mon Jun 10 11:25:59 2019 - [info] Starting master failover..
Mon Jun 10 11:25:59 2019 - [info]
From:
db1(10.100.64.119:3306) (current master)
+--db2(10.100.64.120:3306)
+--db3(10.100.64.121:3306)
To:
db2(10.100.64.120:3306) (new master)
+--db3(10.100.64.121:3306)
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] * Phase 3.3: New Master Recovery Phase..
Mon Jun 10 11:25:59 2019 - [info]
Mon Jun 10 11:25:59 2019 - [info] Waiting all logs to be applied..
Mon Jun 10 11:25:59 2019 - [info] done.
Mon Jun 10 11:25:59 2019 - [info] Getting new master's binlog name and position..
Mon Jun 10 11:25:59 2019 - [info] mysql-bin.000032:194
Mon Jun 10 11:25:59 2019 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='db2 or 10.100.64.120', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Jun 10 11:25:59 2019 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000032, 194, 479dbc89-e31e-11e8-83eb-005056a8803c:1-696587
Mon Jun 10 11:25:59 2019 - [info] Executing master IP activate script:
Mon Jun 10 11:25:59 2019 - [info] /etc/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --new_master_host=db2 --new_master_ip=10.100.64.120 --new_master_port=3306 --new_master_user='xucl' --orig_master_ssh_port=52000 --new_master_ssh_port=52000 --new_master_password=xxx
Set read_only=0 on the new master.
add vip to new master
Warning: Permanently added '[db2]:52000' (ECDSA) to the list of known hosts.
Mon Jun 10 11:25:59 2019 - [info] OK.
Mon Jun 10 11:25:59 2019 - [info] ** Finished master recovery successfully.
Mon Jun 10 11:25:59 2019 - [info] * Phase 3: Master Recovery Phase completed.
整个第三阶段步骤是最多的,主要有如下:
- 3.1: Getting Latest Slaves Phase
- 该阶段主要通过复制信息确定哪个slave节点有最新的binlog
- 3.2: Saving Dead Master’s Binlog Phase
- 保存dead master上的binlog,如果是由于down机导致无法ssh,那么会提示
Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost
- 3.3: Determining New Master Phase
- 该阶段即确认哪个slave节点会被选为新的master节点
- 3.4:New Master Diff Log Generation Phase
- 该阶段即调用save_binary_logs来补齐binlog
- 3.5: Master Log Apply Phase
- 等待新master应用完全部的binlog,确定应用完日志的最新位点,生成change master to语句
- 完成以后调用failover命令
/etc/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=db1 --orig_master_ip=10.100.64.119 --orig_master_port=3306 --new_master_host=db2 --new_master_ip=10.100.64.120 --new_master_port=3306 --new_master_user='xucl' --orig_master_ssh_port=52000 --new_master_ssh_port=52000 --new_master_password=xxx
这里调用start命令,源码如下:
elsif ( $command eq "start" ) {
# all arguments are passed.
# If you manage master ip address at global catalog database,
# activate new_master_ip here.
# You can also grant write access (create user, set read_only=0, etc) here.
my $exit_code = 10;
eval {
my $new_master_handler = new MHA::DBHelper();
# args: hostname, port, user, password, raise_error_or_not
$new_master_handler->connect( $new_master_ip, $new_master_port,
$new_master_user, $new_master_password, 1 );
## Set read_only=0 on the new master
$new_master_handler->disable_log_bin_local();
print "Set read_only=0 on the new master.\n";
$new_master_handler->disable_read_only();
## Creating an app user on the new master
#print "Creating app user on the new master..\n";
#FIXME_xxx_create_user( $new_master_handler->{dbh} );
$new_master_handler->enable_log_bin_local();
$new_master_handler->disconnect();
## Update master ip on the catalog database, etc
&add_vip();
$exit_code = 0;
};
if ($@) {
warn $@;
# If you want to continue failover, exit 10.
exit $exit_code;
}
exit $exit_code;
}
这里实际上执行的是设置read_only,然后绑定VIP的操作
Phase 4: Slaves Recovery Phase
- Phase 4.1: Starting Parallel Slave Diff Log Generation Phase
- 其他从库全部应用完relay log以后,从新的master上补齐差异binlog
- Phase 4.2: Starting Parallel Slave Log Apply Phase
- 其他节点应用差异的binlog
Phase 5: New master cleanup phase
该阶段即清理new master上的slave信息
最后完成fail over后,给出简单的报告:
app1: MySQL Master failover db1(10.100.64.119:3306) to db2(10.100.64.120:3306) succeeded
Master db1(10.100.64.119:3306) is down!
Check MHA Manager logs at manager:/var/log/masterha/app1/app1.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on db1(10.100.64.119:3306)
Selected db2(10.100.64.120:3306) as a new master.
db2(10.100.64.120:3306): OK: Applying all logs succeeded.
db2(10.100.64.120:3306): OK: Activated master IP address.
db3(10.100.64.121:3306): OK: Slave started, replicating from db2(10.100.64.120:3306)
db2(10.100.64.120:3306): Resetting slave info succeeded.
Master failover to db2(10.100.64.120:3306) completed successfully.
简单给出了从哪个节点选举了新的master节点,其他节点成功复制,failover完成