MySQL 5.7_MHA 集群部署手册

1、MHA介绍

MHA(Master High Availability)目前在MySQL高可用方面是一个相对成熟的解决方案,它由日本DeNA公司youshimaton(现就职于Facebook公司)开发,是一套优秀的作为MySQL高可用性环境下故障切换和主从提升的高可用软件。在MySQL故障切换过程中,MHA能做到在0~30秒之内自动完成数据库的故障切换操作,并且在进行故障切换的过程中,MHA能在最大程度上保证数据的一致性,以达到真正意义上的高可用。

2、MHA集群架构

2.1、架构图

image-20210618161935680.png

2.2、MHA集群组件介绍

组件缩写 组件名称 备注
MHA_manager MHA Manager MHA管理节点
MHA_node MHA Node MHA 数据节点
DB_Master MySQL Database MASTER Node 数据库主节点
DB_Slave MySQL Database Slave Nod 数据库从节点

3、部署MHA集群

3.1、集群角色说明

工作项 注意项
MHA集群部署环境前置要求 架构规划 确定集群规模,服务器数量和配置。
硬件 注意跨机架和跨机房上架服务器,至少db机器要跨机架。只兼容x86
网络 所有机器可以互相访问,如果有lvs的机器要在同一网段内。物理网卡建议是万兆网卡。
操作系统 推荐使用CentOS7分支(redhat),建议使用centos7.9
存储(文件系统) 推荐使用ssd盘或者nvme做数据库的数据盘,文件系统建议用xfs。
yum/apt源 所有服务器需配置好对应系统的yum/apt源。
NTP服务 部署NTP服务器,所有服务器连接NTP服务器,使用北京时间(UTC+08:00),保证服务器间的时间误差不超过3秒。
挂载NAS(可选) 如需使用NAS作为冷备存储,请将NAS在对应机器上先挂载好。
部署 部署规划 详细IP及部署模块规划列表。

3.2、集群基础环境建议

3.2.1、硬件建议
组件 机器数 机器配置(CPU/内存/磁盘) 备注
MHA_Manager 1 2C/4G/200G 管理节点
MHA Node 3 16C/64G/1000G Node节点和DB节点复用
DB 3 16C/64G/1000G Node节点和DB节点复用

共计4台机器,建议3台物理台机器(或者高配虚拟机),管理节点虚拟机

3.2.2、操作系统要求
项目 说明
CentOS CentOS Linux release 7.9
Glibc glibc-2.17

3.2.3、部署规划

一个最小化部署(测试环境)的规划样例如下:

10.211.55.101 10.211.55.102 10.211.55.103 10.211.55.100
MHA_Manager Y
MHA Node Y Y Y Y
DB Y Y Y

3.3、系统环境修改(所有节点)

  • 关闭防火墙
  1. systemctl stop firewalld
  2. systemctl disable firewalld
  • 关闭selinux
  1. sed -i 's/enforcing/disabled/' /etc/selinux/config # 永久
  2. setenfore 0 # 临时
  • 挂载数据盘

数据目录建议使用单独硬盘,需要对硬盘进行格式化并挂载,格式化整个磁盘,然后直接把整个磁盘挂载到/data/ 目录。

umount /data/ 
ls /dev/nvme* 
mkfs.xfs -f /dev/nvme0n1 
mount /dev/nvme0n1 /data/ 
df -Th
  • 修改主机名(可选)
hostnamectl set-hostname zabbixdb001
hostnamectl set-hostname zabbixdb002
hostnamectl set-hostname zabbixdb003
  • 修改profile
echo $LANG
echo "PS1='tty:[\e[1;36m\l\e[0m] jobs:[\e[1;36m\j\e[0m] cwd:[\e[1;36m\w\e[0m]\n\`date +%H:%M\` [\u@\`hostname\`]$ '">>/etc/bashrc
grep 'ulimit -HSn 100000' /etc/profile || echo 'ulimit -HSn 100000' >>/etc/profile

source /etc/profile
  • 关闭postfix服务
systemctl disable postfix
  • 依赖环境安装
yum -y install openssh net-tools sysstat ntpdate numactl tree strace dstat wget lrzsz libaio glibc gcc gcc-c++ libstdc++-devel
ldd --version
# 注:要求Linux系统的glibc版本要比2.12新,可以使用ldd --version查看glibc版本。
  • 配置时间同步
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

vi /etc/crontab
*/20 * * * * root ntpdate -s ntp.aliyun.com
  • 关闭numa(INTEL机器建议关闭numa)
vim /etc/default/grub
GRUB_CMDLINE_LINUX 最后添加 numa=off
# 编译配置文件,然后重启机器
grub2-mkconfig -o /etc/grub2.cfg

3.4、数据库节点安装(所有节点)

3.4.1、安装包下载
wget https://cdn.mysql.com/archives/mysql-5.7/mysql-5.7.33-linux-glibc2.12-x86_64.tar.gz

3.4.2、卸载冲突包
rpm  -qa | grep mysql | xargs rpm -e 
rpm  -qa | grep mariadb | xargs rpm -e

3.4.3、创建my.cnf

每个DB节点的server-id必须修改为不一样

innodb_buffer_pool_size建议设置为物理内存的75%

vim /etc/my.cnf

[client]
port    = 3306
socket  = /data/mysql/mysql3306/mysql.sock

[mysql]
prompt="\\u@\\h:\\p [\\d]>"
#pager="less -i -n -S"
#tee=/opt/mysql/query.log
no-auto-rehash


[mysqld]
user    = mysql
port    = 3306
basedir = /usr/local/mysql
datadir = /data/mysql/mysql3306/data
tmpdir = /data/mysql/mysql3306/tmp

socket  = /data/mysql/mysql3306/mysql.sock
pid-file = mysql3306.pid

character-set-server = utf8mb4
lower_case_table_names =1
explicit_defaults_for_timestamp=1
skip_name_resolve = 1

open_files_limit    = 65535
back_log = 1024
max_connections = 1024
max_connect_errors = 1000000


#timeout
interactive_timeout = 300
wait_timeout = 300

#log_slow
log-output=file
slow_query_log = 1
slow_query_log_file = slow.log
log-error = error.log
long_query_time = 1
log_queries_not_using_indexes =1
log_throttle_queries_not_using_indexes = 60
min_examined_row_limit = 100

log_slow_admin_statements = 1
log_slow_slave_statements = 1

#binlog 
server-id = 2106171636  # 修改id
binlog_format = row
binlog_checksum = 1
log-bin = /data/mysql/mysql3306/logs/mysql-bin
#设置0 高性能
sync_binlog = 0
binlog_cache_size = 4M
max_binlog_cache_size = 2G
max_binlog_size = 1G
expire_logs_days = 7
master_info_repository = TABLE
relay_log_info_repository = TABLE

#GTID
gtid_mode = off
#enforce_gtid_consistency = 1
log_slave_updates
#slave-rows-search-algorithms = 'INDEX_SCAN,HASH_SCAN'

#procedure 
log_bin_trust_function_creators=1

#relay log
skip_slave_start = 1
#max_relay_log_size = 512M
#relay_log_purge = 1
relay_log_recovery = 1
relay-log=/data/mysql/mysql3306/logs/relay-bin
relay-log-index=/data/mysql/mysql3306/logs/relay-bin.index
log_slave_updates
relay_log_info_repository = TABLE
#relay_log = /data/mysql/mysql3306/log/relay-bin
relay_log_purge=off

#buffers & cache
table_open_cache = 4096
table_definition_cache = 1024
#table_open_cache_instances = 64

sort_buffer_size = 4M
join_buffer_size = 4M
thread_cache_size = 200
thread_stack = 192K
max_heap_table_size = 32M
tmp_table_size = 32M
key_buffer_size = 2M
read_buffer_size = 2M
read_rnd_buffer_size = 8M
bulk_insert_buffer_size = 32M
max_allowed_packet = 32M



#innodb_thread_concurrency 设置1.5倍*CPU核数
innodb_thread_concurrency = 16
#innodb_sync_spin_loops = 100
#innodb_spin_wait_delay = 30

#transaction_isolation = REPEATABLE-READ
transaction_isolation = READ-COMMITTED
#innodb_additional_mem_pool_size = 16M

#设置物理内存的75%
innodb_buffer_pool_size = 2048M

innodb_buffer_pool_instances = 16
innodb_buffer_pool_load_at_startup = 1
innodb_buffer_pool_dump_at_shutdown = 1
innodb_data_file_path = ibdata1:2G:autoextend

#数据安全要求高设置1 
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 64M
innodb_log_file_size = 2G
innodb_log_files_in_group = 2
#innodb_max_undo_log_size = 1G
#innodb_undo_directory = undolog
#innodb_undo_tablespaces = 8

# 根据您的服务器IOPS能力适当调整
# 一般配普通SSD盘的话,可以调整到 10000 - 20000
# 配置高端PCIe SSD卡的话,则可以调整的更高,比如 50000 - 80000
innodb_io_capacity = 10000
innodb_io_capacity_max = 20000

innodb_flush_sync = 0
innodb_flush_neighbors = 0
innodb_write_io_threads = 8
innodb_read_io_threads = 8
innodb_purge_threads = 4
innodb_page_cleaners = 4
innodb_open_files = 65535
innodb_max_dirty_pages_pct = 70
innodb_flush_method = O_DIRECT
innodb_lru_scan_depth = 1024
#innodb_checksum_algorithm = crc32
innodb_lock_wait_timeout = 10
innodb_rollback_on_timeout = 1
innodb_print_all_deadlocks = 1
innodb_file_per_table = 1
#innodb_online_alter_log_max_size = 4G
internal_tmp_disk_storage_engine = InnoDB
innodb_stats_on_metadata = 0

# some var for MySQL 5.7
innodb_checksums = 1
#innodb_file_format = Barracuda
#innodb_file_format_max = Barracuda
query_cache_size = 0
query_cache_type = 0
query_cache_limit = 256K
query_cache_min_res_unit = 512
innodb_undo_logs = 64

innodb_status_file = 1
# 注意: 开启 innodb_status_output & innodb_status_output_locks 后, 可能会导致log-error文件增长较快
innodb_status_output = 0
innodb_status_output_locks = 0

#performance_schema
performance_schema = 1
performance_schema_instrument = '%=on'

#innodb monitor
innodb_monitor_enable="module_innodb"
innodb_monitor_enable="module_server"
innodb_monitor_enable="module_dml"
innodb_monitor_enable="module_ddl"
innodb_monitor_enable="module_trx"
innodb_monitor_enable="module_os"
innodb_monitor_enable="module_purge"
innodb_monitor_enable="module_log"
innodb_monitor_enable="module_lock"
innodb_monitor_enable="module_buffer"
innodb_monitor_enable="module_index"
innodb_monitor_enable="module_ibuf_system"
innodb_monitor_enable="module_buffer_page"
innodb_monitor_enable="module_adaptive_hash"

#myisam
myisam_sort_buffer_size = 128M
myisam_max_sort_file_size = 2G
myisam_repair_threads = 1

[mysqldump]
quick
max_allowed_packet = 32M

3.4.4、安装mysql

如果首次初始化mysql失败,则需删除/usr/local/mysql和/data/mysql目录,重新执行3.4.4步骤。否者文件生成不完整会导致mysql服务启动失败。

$ useradd mysql -s /bin/nologin
$ mkdir /opt/mysql
$ mv mysql-5.7.33-linux-glibc2.12-x86_64.tar.gz /opt/mysql/
$ cd /opt/mysql/;tar -zxvf mysql-5.7.33-linux-glibc2.12-x86_64.tar.gz 
$ chown -R mysql.mysql /opt/mysql/*

$ ln -s /opt/mysql/mysql-5.7.33-linux-glibc2.12-x86_64 /usr/local/mysql
$ chown -R mysql.mysql /usr/local/mysql
$ mkdir -p /data/mysql/mysql3306/{data,logs,tmp}
$ chown -R mysql.mysql /data/mysql/mysql3306
$ cd /usr/local/mysql/
#查看有没有依赖包缺失
ldd bin/mysqld
# 初始化mysql
/usr/local/mysql/bin/mysqld --defaults-file=/etc/my.cnf --initialize --user=mysql 

# 启动MySQL
/usr/local/mysql/bin/mysqld --defaults-file=/etc/my.cnf &

# 检查mysql 进程是否存在
ps -ef |grep mysqld

# 配置mysql环境变量
$ vi  /etc/profile
export PATH=$PATH:/usr/local/mysql/bin
$ source  /etc/profile

# 获取MySQL 初次登录密码并修改
sudo grep 'temporary password' /data/mysql/mysql3306/data/error.log
mysql -uroot -p 
ALTER USER 'root'@'localhost' IDENTIFIED BY 'root_pwd2021!';
FLUSH PRIVILEGES;

3.4.5、安全配置
# 1) Drop the anonymous account.
DELETE FROM mysql.user WHERE User=''; 

# 2) Force the root user to change the password on first connect.
UPDATE mysql.user SET Password_expired='Y' WHERE User='root'; 

# 3) remove remote accounts
DELETE FROM mysql.user WHERE Host <> 'localhost';

# 4) Drop the test database
DROP DATABASE IF EXISTS test;
DELETE FROM mysql.db WHERE Db='test' OR Db='test\\_%';
FLUSH PRIVILEGES;

3.5、安装Percona tool工具(所有节点)

Percona Toolkit简称pt工具—PT-Tools,是Percona公司开发用于管理MySQL的工具,功能包括检查主从复制的数据一致性、检查重复索引、定位IO占用高的表文件、在线DDL等。

# 下载:
wget https://downloads.percona.com/downloads/percona-toolkit/3.3.1/binary/redhat/7/x86_64/percona-toolkit-3.3.1-1.el7.x86_64.rpm
wget https://downloads.percona.com/downloads/Percona-XtraBackup-2.4/Percona-XtraBackup-2.4.22/binary/redhat/7/x86_64/percona-xtrabackup-24-2.4.22-1.el7.x86_64.rpm

# 安装:
yum localinstall -y percona-toolkit-3.3.1-1.el7.x86_64.rpm
yum localinstall -y percona-xtrabackup-24-2.4.22-1.el7.x86_64.rpm

# 检查安装:
[root@mysql01]$ rpm -qa | grep -i percona
percona-xtrabackup-24-2.4.22-1.el7.x86_64
percona-toolkit-3.3.1-1.el7.x86_64

3.6、MySQL主从同步复制配置

按照前面步骤,分别安装3个节点的数据库,并启动mysql。

注意:

  • binlog-do-db 和 replicate-ignore-db 设置必须相同。 MHA 在启动时候会检测过滤规则,如果过滤规则不同,MHA 不启动监控和故障转移。
  • 确保三台DB的配置文件中server-id都不一样。

3.6.1、重启数据库实例(所有节点)
#关闭
mysqladmin -uroot -p'root_pwd2021!'  shutdown
#启动
/usr/local/mysql/bin/mysqld --defaults-file=/etc/my.cnf &

3.6.2、创建复制用户(所有节点)
> CREATE USER 'repl'@'10.211.55.%' IDENTIFIED BY 'repl_pwd2021!';
> GRANT replication slave,replication client  ON *.* TO 'repl'@'10.211.55.%';
> flush privileges;

3.6.3、登录数据库,reset master,reset slave all

该步骤仅限新实例,老实例慎重操作,会清理同步信息和日志

root@localhost:mysql.sock [(none)]>reset master;
Query OK, 0 rows affected (0.00 sec)

root@localhost:mysql.sock [(none)]>reset slave all;
Query OK, 0 rows affected (0.00 sec)

root@localhost:mysql.sock [(none)]>show master logs;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000001 |       154 |
+------------------+-----------+
1 row in set (0.00 sec)

3.6.4、检测主库binlog信息(主节点)

查看 Master-Server , binlog File 文件名称和 Position值位置 并且记下来

root@localhost:mysql.sock [(none)]>show master status;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000001 |      154 |              |                  |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

3.6.5、从库配置同步(从节点)

要设置从库与主库进行通信,进行复制,使用必要的连接信息配置从库在从库上执行以下语句,将选项值替换为与系统相关的实际值。

# 参数格式,请勿执行:
mysql> CHANGE MASTER TO
    ->     MASTER_HOST='master_host_name',
    ->     MASTER_USER='replication_user_name',
    ->     MASTER_PASSWORD='replication_password',
    ->     MASTER_LOG_FILE='recorded_log_file_name',
    ->     MASTER_LOG_POS=recorded_log_position;

# 示例配置复制命令
CHANGE MASTER TO MASTER_HOST='10.211.55.101', MASTER_USER='repl', MASTER_PASSWORD='repl_pwd2021!', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=154;

3.6.6、启动从服务器复制线程(从节点)
mysql> START SLAVE;
Query OK, 0 rows affected (0.00 sec)
查看复制状态
mysql>  show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.211.55.101
                  Master_User: replication
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 629
               Relay_Log_File: master2-relay-bin.000003
                Relay_Log_Pos: 320
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
......
  • 检查主从复制通信状态,必须都是 Yes。
    Slave_IO_State #从站的当前状态
    Slave_IO_Running: Yes #读取主程序二进制日志的I/O线程是否正在运行
    Slave_SQL_Running: Yes #执行读取主服务器中二进制日志事件的SQL线程是否正在运行。与I/O线程一样
    Seconds_Behind_Master #是否为0,0就是已经同步了.

3.6.7、测试主从同步

在主库插入几条数据,看从库是否能同步。

3.7、MHA安装

3.7.1、下载安装包
wget https://github-releases.githubusercontent.com/2093258/993094a0-2de4-11e8-9c8a-f5a050f404a9?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210617%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210617T155232Z&X-Amz-Expires=300&X-Amz-Signature=94d993bb8de9f4f222b24ab12866c1180053651e78e4d64eadb693db857cc76c&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=2093258&response-content-disposition=attachment%3B%20filename%3Dmha4mysql-node-0.58-0.el7.centos.noarch.rpm&response-content-type=application%2Foctet-stream

wget https://github-releases.githubusercontent.com/2093236/f7852fb6-2de4-11e8-8326-424ead621d0f?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20210617%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210617T155201Z&X-Amz-Expires=300&X-Amz-Signature=1e493f0246b951a299c219644426e50ed00c42c2b97da59c2708d4e11ffa6357&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=2093236&response-content-disposition=attachment%3B%20filename%3Dmha4mysql-manager-0.58-0.el7.centos.noarch.rpm&response-content-type=application%2Foctet-stream

3.7.2、配置SSH互信

所有机器(DB机器和管理节点)都要配置SSH互信,如果MHA_Manager和数据库节点复用,也需要配置自己的SSH互信。

# 步骤1:修改10.211.55.100的sshd配置:
cp /etc/ssh/sshd_config /etc/ssh/sshd_config_bak

vi /etc/ssh/sshd_config
# 添加以下三项配置:
PubkeyAuthentication yes
RSAAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
# 步骤二:重启sshd:
systemctl restart sshd
# 步骤三:生成公钥:
ssh-keygen -t rsa
# 步骤四:copy公钥文件
ssh-copy-id -i /root/.ssh/id_rsa.pub 10.211.55.101
ssh-copy-id -i /root/.ssh/id_rsa.pub 10.211.55.102
ssh-copy-id -i /root/.ssh/id_rsa.pub 10.211.55.103
# 步骤五:验证无需需要密码即可登录进10.211.55.101
ssh 110.211.55.101

# 其余三台进行同样操作。

3.7.3、安装MHA Node(所有节点)
$ yum install perl-DBD-MySQL -y
$ yum localinstall mha4mysql-node-0.58-0.el7.centos.noarch.rpm -y

3.7.4、安装manager(MHA_Manager管理节点)
$ yum install epel-release -y
$ yum -y install perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Time-HiRes

$ yum localinstall mha4mysql-manager-0.58-0.el7.centos.noarch.rpm -y

3.7.5、配置MHA脚本文件

Node 节点安装完成后会在/usr/local/bin(或者/usr/bin)目录下生成以下脚本文件:

-r-xr-xr-x 1 root root 15498 Apr 20 10:05 apply_diff_relay_logs
-r-xr-xr-x 1 root root  4807 Apr 20 10:05 filter_mysqlbinlog
-r-xr-xr-x 1 root root  7401 Apr 20 10:05 purge_relay_logs
-r-xr-xr-x 1 root root  7263 Apr 20 10:05 save_binary_logs

Node工具包(这些工具通常由MHA Manager的脚本触发,无需人为操作)主要包括以下几个工具:

save_binary_logs 保存和复制master的二进制日志
apply_diff_relay_logs 识别差异的中继日志事件并将其差异的事件应用于其他的slave
filter_mysqlbinlog 去除不必要的ROLLBACK事件(MHA已不再使用这个工具)
purge_relay_logs 清除中继日志(不会阻塞SQL线程)

Manager 管理节点安装完成后会在/usr/local/bin(/usr/bin/)目录下面生成以下脚本文件,前面已经说过这些脚本的作用,这里不再重复。

-r-xr-xr-x 1 root root 15498 Apr 20 10:58 apply_diff_relay_logs
-r-xr-xr-x 1 root root  4807 Apr 20 10:58 filter_mysqlbinlog
-r-xr-xr-x 1 root root  1995 Apr 20 11:33 masterha_check_repl
-r-xr-xr-x 1 root root  1779 Apr 20 11:33 masterha_check_ssh
-r-xr-xr-x 1 root root  1865 Apr 20 11:33 masterha_check_status
-r-xr-xr-x 1 root root  3201 Apr 20 11:33 masterha_conf_host
-r-xr-xr-x 1 root root  2517 Apr 20 11:33 masterha_manager
-r-xr-xr-x 1 root root  2165 Apr 20 11:33 masterha_master_monitor
-r-xr-xr-x 1 root root  2373 Apr 20 11:33 masterha_master_switch
-r-xr-xr-x 1 root root  3749 Apr 20 11:33 masterha_secondary_check
-r-xr-xr-x 1 root root  1739 Apr 20 11:33 masterha_stop
-r-xr-xr-x 1 root root  7401 Apr 20 10:58 purge_relay_logs
-r-xr-xr-x 1 root root  7263 Apr 20 10:58 save_binary_logs

3.7.6、修改从库my.cnf配置

设置从库只读模式

# 在my.cnf中[mysqld]下
read_only=1
relay_log_purge=off
log_bin=log_bin
# 如果没有则需添加这些参数

# 从库登录数据库执行以下sql
set global read_only=1;
set global relay_log_purge=0;
show global variables like "read_only";

3.8、配置MHA

3.8.1、Manage节点创建MHA工作目录

并且创建相关配置文件(在软件包解压后的目录里面有样例配置文件)。

mkdir -p /etc/masterha
mkdir -p /home/mha/masterha/zabbix_mha
# 登录master数据库创建监控用户(在master上执行)
grant all privileges on *.* to 'mhamonitor'@'10.211.55.%' identified  by 'mha_pwd2021!';
flush  privileges;

3.8.2、创建配置文件app_zabbix.cnf
$ vim /etc/masterha/app_zabbix.cnf
[server default]
# working directory on the manager
manager_workdir=/home/mha/masterha/app_zabbix             
manager_log=/home/mha/masterha/app_zabbix/manager.log  
master_binlog_dir=/data/mysql/mysql3306/logs              
remote_workdir=/tmp 

# mysql root user and password  
user=mhamonitor               
password=mha_pwd2021! 

# MHA ping interval
ping_interval=5        

# repl user and password
repl_user=repl
repl_password=repl_pwd2021!  

## scripts to support MHA functions    
#report_script="/usr/bin/send_report" 
#master_ip_failover_script= "/usr/bin/master_ip_failover"   
#master_ip_online_change_script= "/usr/bin/master_ip_online_change"  
#shutdown_script="/usr/bin/power_manager"   
master_ip_failover_script="/usr/bin/master_ip_failover"
master_ip_online_change_script="/usr/bin/master_ip_online_change"      

[server1]
hostname=10.211.55.101  # db主节点ip
port=3306
ssh_user=root 
ssh_port=22 
candidate_master=1
check_repl_delay=0

[server2]
hostname=10.211.55.102  # db从节点ip
port=3306
ssh_user=root 
ssh_port=22
candidate_master=1
check_repl_delay=0


[server3]
hostname=10.211.55.103  # db从节点ip
port=3306
ssh_user=root 
ssh_port=22
ignore_fail=1
no_master=1
check_repl_delay=0

3.8.3、配置Manager节点相关脚本

3.8.3.1、master_ip_failover脚本
vim /usr/bin/master_ip_failover
#!/usr/bin/env perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
my (
  $command,        $ssh_user,         $orig_master_host,
  $orig_master_ip, $orig_master_port, $new_master_host,
  $new_master_ip,  $new_master_port,  $new_master_user,
  $new_master_password
);

my $vip = '192.168.0.196';
my $key = "1";
#my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip/24";
#my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down";
#my $ssh_send_garp = "/sbin/arping -U $vip -I eth0 -c 1";
my $ssh_start_vip = "/sbin/ifconfig ";
my $ssh_stop_vip = "/sbin/ifconfig ";
my $ssh_send_garp = "/sbin/ifconfig ";

GetOptions(
  'command=s'             => \$command,
  'ssh_user=s'            => \$ssh_user,
  'orig_master_host=s'    => \$orig_master_host,
  'orig_master_ip=s'      => \$orig_master_ip,
  'orig_master_port=i'    => \$orig_master_port,
  'new_master_host=s'     => \$new_master_host,
  'new_master_ip=s'       => \$new_master_ip,
  'new_master_port=i'     => \$new_master_port,
  'new_master_user=s'     => \$new_master_user,
  'new_master_password=s' => \$new_master_password,
);

exit &main();

sub main {
  if ( $command eq "stop" || $command eq "stopssh" ) {

    # $orig_master_host, $orig_master_ip, $orig_master_port are passed.
    # If you manage master ip address at global catalog database,
    # invalidate orig_master_ip here.
    my $exit_code = 1;
    eval {
      print "Disabling the VIP an old master: $orig_master_host \n";
      &stop_vip();
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {

    # all arguments are passed.
    # If you manage master ip address at global catalog database,
    # activate new_master_ip here.
    # You can also grant write access (create user, set read_only=0, etc) here.
    my $exit_code = 10;
    eval {

      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );

      ## Set read_only=0 on the new master
      $new_master_handler->disable_log_bin_local();
      print "Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();

      ## Creating an app user on the new master
      # print "Creating app user on the new master..\n";
      # FIXME_xxx_create_user( $new_master_handler->{dbh} );
      $new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();

      print "Enabling the VIP $vip on the new master: $new_master_host \n";
      &start_vip();
      $exit_code = 0;
    };
    if ($@) {
      warn $@;

      # If you want to continue failover, exit 10.
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {

    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}

sub start_vip(){
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
    `ssh $ssh_user\@$new_master_host \" $ssh_send_garp \"`;
}

sub stop_vip(){
    return 0  unless  ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
  print
"Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

3.8.3.2、master_ip_online_change脚本
vim /usr/bin/master_ip_online_change
#!/usr/bin/env perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use MHA::DBHelper;
use MHA::NodeUtil;
use Time::HiRes qw( sleep gettimeofday tv_interval );
use Data::Dumper;

my $_tstart;
my $_running_interval = 0.1;

my $vip = '192.168.0.196';
my $key = "1";
#my $ssh_start_vip = "/sbin/ifconfig eth0:$key $vip/24";
#my $ssh_stop_vip = "/sbin/ifconfig eth0:$key down";
#my $ssh_send_garp = "/sbin/arping -U $vip -I eth0 -c 1";

my $ssh_start_vip = "/sbin/ifconfig ";
my $ssh_stop_vip = "/sbin/ifconfig ";
my $ssh_send_garp = "/sbin/ifconfig ";

my (
  $command,              $orig_master_is_new_slave, $orig_master_host,
  $orig_master_ip,       $orig_master_port,         $orig_master_user,
  $orig_master_password, $orig_master_ssh_user,     $new_master_host,
  $new_master_ip,        $new_master_port,          $new_master_user,
  $new_master_password,  $new_master_ssh_user,
);
GetOptions(
  'command=s'                => \$command,
  'orig_master_is_new_slave' => \$orig_master_is_new_slave,
  'orig_master_host=s'       => \$orig_master_host,
  'orig_master_ip=s'         => \$orig_master_ip,
  'orig_master_port=i'       => \$orig_master_port,
  'orig_master_user=s'       => \$orig_master_user,
  'orig_master_password=s'   => \$orig_master_password,
  'orig_master_ssh_user=s'   => \$orig_master_ssh_user,
  'new_master_host=s'        => \$new_master_host,
  'new_master_ip=s'          => \$new_master_ip,
  'new_master_port=i'        => \$new_master_port,
  'new_master_user=s'        => \$new_master_user,
  'new_master_password=s'    => \$new_master_password,
  'new_master_ssh_user=s'    => \$new_master_ssh_user,
);

exit &main();

sub start_vip(){
    `ssh $new_master_ssh_user\@$new_master_host \" $ssh_start_vip \"`;
    `ssh $new_master_ssh_user\@$new_master_host \" $ssh_send_garp \"`;
}

sub stop_vip(){
    `ssh $orig_master_ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}


sub current_time_us {
  my ( $sec, $microsec ) = gettimeofday();
  my $curdate = localtime($sec);
  return $curdate . " " . sprintf( "%06d", $microsec );
}

sub sleep_until {
  my $elapsed = tv_interval($_tstart);
  if ( $_running_interval > $elapsed ) {
    sleep( $_running_interval - $elapsed );
  }
}

sub get_threads_util {
  my $dbh                    = shift;
  my $my_connection_id       = shift;
  my $running_time_threshold = shift;
  my $type                   = shift;
  $running_time_threshold = 0 unless ($running_time_threshold);
  $type                   = 0 unless ($type);
  my @threads;

  my $sth = $dbh->prepare("SHOW PROCESSLIST");
  $sth->execute();

  while ( my $ref = $sth->fetchrow_hashref() ) {
    my $id         = $ref->{Id};
    my $user       = $ref->{User};
    my $host       = $ref->{Host};
    my $command    = $ref->{Command};
    my $state      = $ref->{State};
    my $query_time = $ref->{Time};
    my $info       = $ref->{Info};
    $info =~ s/^\s*(.*?)\s*$/$1/ if defined($info);
    next if ( $my_connection_id == $id );
    next if ( defined($query_time) && $query_time < $running_time_threshold );
    next if ( defined($command)    && $command eq "Binlog Dump" );
    next if ( defined($user)       && $user eq "system user" );
    next
      if ( defined($command)
      && $command eq "Sleep"
      && defined($query_time)
      && $query_time >= 1 );

    if ( $type >= 1 ) {
      next if ( defined($command) && $command eq "Sleep" );
      next if ( defined($command) && $command eq "Connect" );
    }

    if ( $type >= 2 ) {
      next if ( defined($info) && $info =~ m/^select/i );
      next if ( defined($info) && $info =~ m/^show/i );
    }

    push @threads, $ref;
  }
  return @threads;
}

sub main {
  if ( $command eq "stop" ) {
    ## Gracefully killing connections on the current master
    # 1. Set read_only= 1 on the new master
    # 2. DROP USER so that no app user can establish new connections
    # 3. Set read_only= 1 on the current master
    # 4. Kill current queries
    # * Any database access failure will result in script die.
    my $exit_code = 1;
    eval {
      ## Setting read_only=1 on the new master (to avoid accident)
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error(die_on_error)_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );
      print current_time_us() . " Set read_only on the new master.. ";
      $new_master_handler->enable_read_only();
      if ( $new_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }
      $new_master_handler->disconnect();

      # Connecting to the orig master, die if any database error happens
      my $orig_master_handler = new MHA::DBHelper();
      $orig_master_handler->connect( $orig_master_ip, $orig_master_port,
        $orig_master_user, $orig_master_password, 1 );

      ## Drop application user so that nobody can connect. Disabling per-session binlog beforehand
      $orig_master_handler->disable_log_bin_local();
      # print current_time_us() . " Drpping app user on the orig master..\n";
      #drop_app_user($orig_master_handler);

      ## Waiting for N * 100 milliseconds so that current connections can exit
      my $time_until_read_only = 15;
      $_tstart = [gettimeofday];
      my @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_read_only > 0 && $#threads >= 0 ) {
        if ( $time_until_read_only % 5 == 0 ) {
          printf
"%s Waiting all running %d threads are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_read_only * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_read_only--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      ## Setting read_only=1 on the current master so that nobody(except SUPER) can write
      print current_time_us() . " Set read_only=1 on the orig master.. ";
      $orig_master_handler->enable_read_only();
      if ( $orig_master_handler->is_read_only() ) {
        print "ok.\n";
      }
      else {
        die "Failed!\n";
      }

      ## Waiting for M * 100 milliseconds so that current update queries can complete
      my $time_until_kill_threads = 5;
      @threads = get_threads_util( $orig_master_handler->{dbh},
        $orig_master_handler->{connection_id} );
      while ( $time_until_kill_threads > 0 && $#threads >= 0 ) {
        if ( $time_until_kill_threads % 5 == 0 ) {
          printf
"%s Waiting all running %d queries are disconnected.. (max %d milliseconds)\n",
            current_time_us(), $#threads + 1, $time_until_kill_threads * 100;
          if ( $#threads < 5 ) {
            print Data::Dumper->new( [$_] )->Indent(0)->Terse(1)->Dump . "\n"
              foreach (@threads);
          }
        }
        sleep_until();
        $_tstart = [gettimeofday];
        $time_until_kill_threads--;
        @threads = get_threads_util( $orig_master_handler->{dbh},
          $orig_master_handler->{connection_id} );
      }

      ## Terminating all threads
      print current_time_us() . " Killing all application threads..\n";
      $orig_master_handler->kill_threads(@threads) if ( $#threads >= 0 );
      print current_time_us() . " done.\n";
      $orig_master_handler->enable_log_bin_local();
      $orig_master_handler->disconnect();

      ## Droping the VIP     
      print "Disabling the VIP an old master: $orig_master_host \n";
      &stop_vip();

      ## After finishing the script, MHA executes FLUSH TABLES WITH READ LOCK
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "start" ) {
    ## Activating master ip on the new master
    # 1. Create app user with write privileges
    # 2. Moving backup script if needed
    # 3. Register new master's ip to the catalog database

# We don't return error even though activating updatable accounts/ip failed so that we don't interrupt slaves' recovery.
# If exit code is 0 or 10, MHA does not abort
    my $exit_code = 10;
    eval {
      my $new_master_handler = new MHA::DBHelper();

      # args: hostname, port, user, password, raise_error_or_not
      $new_master_handler->connect( $new_master_ip, $new_master_port,
        $new_master_user, $new_master_password, 1 );

      ## Set read_only=0 on the new master
      $new_master_handler->disable_log_bin_local();
      print current_time_us() . " Set read_only=0 on the new master.\n";
      $new_master_handler->disable_read_only();

      ## Creating an app user on the new master
      #print current_time_us() . " Creating app user on the new master..\n";
      # create_app_user($new_master_handler);
      print "Enabling the VIP $vip on the new master: $new_master_host \n";
      &start_vip();
      $new_master_handler->enable_log_bin_local();
      $new_master_handler->disconnect();

      ## Update master ip on the catalog database, etc
      $exit_code = 0;
    };
    if ($@) {
      warn "Got Error: $@\n";
      exit $exit_code;
    }
    exit $exit_code;
  }
  elsif ( $command eq "status" ) {

    # do nothing
    exit 0;
  }
  else {
    &usage();
    exit 1;
  }
}

sub usage {
  print
"Usage: master_ip_online_change --command=start|stop|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
  die;
}

3.8.3.3、power_manager脚本
vim /usr/bin/power_manager
#!/usr/bin/env perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;
use Pod::Usage;
use Net::Telnet;
use MHA::ManagerConst;
use MHA::ManagerUtil;

my $SSH_STOP_OK           = 10;
my $COMMAND_NOT_SUPPORTED = 20;
my $ILO_ADMIN             = 'Administrator';
my $DRAC_ADMIN            = 'root';
my $PASSWORD              = 'xxx';
my $max_retries           = 10;

exit &main();

sub get_power_status_drac_internal {
  my $telnet = shift;
  my $prompt = shift;
  $telnet->print("racadm serveraction powerstatus");
  ($_) = $telnet->waitfor($prompt);
  my $power_state = "void";
  my @cmd_out     = split /\n/;

  # discard command sent to DRAC
  $_ = shift @cmd_out;

  #strip ansi control chars
  s/\e\[(([0-9]+;)*[0-9]+)*[ABCDfHJKmsu]//g;
  s/^.*\x0D//;
  foreach (@cmd_out) {
    s/^\s+//g;
    s/\s+$//g;
    if (m/^Server power status: (\w+)/) {
      $power_state = lc($1);
      last;
    }
  }
  return $power_state;
}

sub power_off_drac_internal {
  my $telnet = shift;
  my $prompt = shift;
  $telnet->print("racadm serveraction powerdown");
  $telnet->waitfor($prompt);
}

sub power_on_drac_internal {
  my $telnet = shift;
  my $prompt = shift;
  $telnet->print("racadm serveraction powerup");
  $telnet->waitfor($prompt);
}

sub login_drac_internal {
  my $drac_addr = shift;
  my $prompt    = '/admin1|\$/';
  my $telnet    = new Net::Telnet(
    Timeout => 10,
    Prompt  => $prompt,
  );
  $telnet->open($drac_addr);
  $telnet->waitfor('/login/i');
  $telnet->print($DRAC_ADMIN);
  $telnet->waitfor('/password/i');
  $telnet->print($PASSWORD);
  $telnet->waitfor($prompt);

  return ( $telnet, $prompt );
}

sub power_off_drac {
  my $drac_addr    = shift;
  my $power_status = "void";
  local $@;
  eval {
    my ( $telnet, $prompt ) = login_drac_internal($drac_addr);
    power_off_drac_internal( $telnet, $prompt );
    $power_status = get_power_status_drac_internal( $telnet, $prompt );
    $telnet->close;
  };
  if ($@) {
    warn $@;
  }
  return $power_status;
}

sub power_on_drac {
  my $drac_addr    = shift;
  my $power_status = "void";
  local $@;
  eval {
    my ( $telnet, $prompt ) = login_drac_internal($drac_addr);
    power_on_drac_internal( $telnet, $prompt );
    $power_status = get_power_status_drac_internal( $telnet, $prompt );
    $telnet->close;
  };
  if ($@) {
    warn $@;
  }
  return $power_status;
}

sub power_status_drac {
  my $drac_addr    = shift;
  my $power_status = "void";
  local $@;
  eval {
    my ( $telnet, $prompt ) = login_drac_internal($drac_addr);
    $power_status = get_power_status_drac_internal( $telnet, $prompt );
    $telnet->close;
  };
  if ($@) {
    warn $@;
  }
  return $power_status;
}

sub power_status_ilo {
  my $ilo_addr     = shift;
  my $power_status = "void";
  local $@;
  eval {
    my $ipmi_out =
`ipmitool -H $ilo_addr -U $ILO_ADMIN -P $PASSWORD -I lanplus  power status`;
    die
"Failed to get power status from ipmitool. Maybe you need to upgrade ILO firmware version.\n"
      if ($?);
    chomp($ipmi_out);
    if ( $ipmi_out =~ m/^Chassis Power is (\w+)/ ) {
      $power_status = lc($1);
    }
  };
  if ($@) {
    warn $@;
  }
  return $power_status;
}

sub power_on_ilo {
  my $ilo_addr     = shift;
  my $power_status = "void";
  local $@;
  eval {
    $power_status = power_status_ilo($ilo_addr);
    if ( $power_status ne "off" ) {
      die "Power from ipmitool is already on.\n" if ( $power_status eq "on" );
      return $power_status;
    }
    `ipmitool -H $ilo_addr -U $ILO_ADMIN -P $PASSWORD -I lanplus  power on`;
    $power_status = power_status_ilo($ilo_addr);
  };
  if ($@) {
    warn $@;
  }
  return $power_status;
}

sub power_off_ilo {
  my $ilo_addr     = shift;
  my $power_status = "void";
  local $@;
  eval {
    $power_status = power_status_ilo($ilo_addr);
    if ( $power_status ne "on" ) {
      die "Power from ipmitool is already off.\n" if ( $power_status eq "off" );
      return $power_status;
    }
    `ipmitool -H $ilo_addr -U $ILO_ADMIN -P $PASSWORD -I lanplus  power off`;
    $power_status = power_status_ilo($ilo_addr);
  };
  if ($@) {
    warn $@;
  }
  return $power_status;
}

sub get_power_status {
  my ( $admin_addr, $server_type ) = @_;
  my $power_status = "void";
  if ( $server_type eq "ilo" ) {
    $power_status = power_status_ilo($admin_addr);
  }
  elsif ( $server_type eq "drac" ) {
    $power_status = power_status_drac($admin_addr);
  }
  return $power_status;
}

sub stop {
  my ( $real_host, $admin_addr, $server_type ) = @_;

  my $power_status = "void";
  if ( $server_type eq "ilo" ) {
    $power_status = power_off_ilo($admin_addr);
  }
  elsif ( $server_type eq "drac" ) {
    $power_status = power_off_drac($admin_addr);
  }

  if ( $power_status eq "off" ) {
    print "Power of $real_host was successfully turned off.\n";
    return 0;
  }
  elsif ( $power_status ne "on" ) {
    return $COMMAND_NOT_SUPPORTED;
  }

  my $retry_count = 0;
  while ( $retry_count < $max_retries ) {
    $power_status = get_power_status( $admin_addr, $server_type );
    last if ( $power_status eq "off" );
    print
"Waiting until power status becomes 'off'. Current status is $power_status ...\n";
    sleep 3;
    $retry_count++;
  }

  if ( $power_status eq "off" ) {
    print "Power of $real_host was successfully turned off.\n";
    return 0;
  }
  else {
    print
      "Power of $real_host was not turned off. Check the host for detail.\n";
    return 1;
  }
}

sub stopssh {
  my ( $ssh_user, $real_host, $real_ip, $pid_file ) = @_;
  my $ssh_user_host = $ssh_user . '@';
  if ($real_ip) {
    $ssh_user_host .= $real_ip;
  }
  else {
    $ssh_user_host .= $real_host;
  }

  my $command;
  my ( $high_ret, $low_ret );
  if ($pid_file) {
    $command =
"\"if [ ! -e $pid_file ]; then exit 1; fi; pid=\\\`cat $pid_file\\\`; rm -f $pid_file; kill -9 \\\$pid; a=\\\`ps ax | grep $pid_file | grep -v grep | wc | awk {'print \\\$1'}\\\`; if [ \"a\\\$a\" = \"a0\" ]; then exit 10; fi; sleep 1; a=\\\`ps ax | grep $pid_file | grep -v grep | wc | awk {'print \\\$1'}\\\`; if [ \"a\\\$a\" = \"a0\" ]; then exit 10; else exit 1; fi\"";
    ( $high_ret, $low_ret ) = MHA::ManagerUtil::exec_system(
      "ssh $ssh_user_host $MHA::ManagerConst::SSH_OPT_CHECK $command");
    if ( $high_ret == $SSH_STOP_OK && $low_ret == 0 ) {
      print "ssh reachable. mysqld stopped. power off not needed.\n";
      return $high_ret;
    }
    print "Killing mysqld instance based on $pid_file failed.\n";
  }

  print "Killing all mysqld instances on $real_host..\n";
  $command =
"\"killall -9 mysqld mysqld_safe; a=\\\`pidof mysqld\\\`; if [ \\\"a\\\$a\\\" = \\\"a\\\" ]; then exit 10; fi; sleep 1; a=\\\`pidof mysqld\\\`; if [ \\\"a\\\$a\\\" = \\\"a\\\" ]; then exit 10; else exit 1; fi\"";
  ( $high_ret, $low_ret ) = MHA::ManagerUtil::exec_system(
    "ssh $ssh_user_host $MHA::ManagerConst::SSH_OPT_CHECK $command");
  if ( $high_ret == $SSH_STOP_OK && $low_ret == 0 ) {
    print "ssh reachable. mysqld stopped. power off not needed.\n";
    return $high_ret;
  }
  else {
    print
      "ssh NOT reachable. Power off needed (rc1=$high_ret, rc2=$low_ret).\n";
    return 1;
  }
}

sub start {
  my ( $real_host, $admin_addr, $server_type ) = @_;

  my $power_status = "void";
  if ( $server_type eq "ilo" ) {
    $power_status = power_on_ilo($admin_addr);
  }
  elsif ( $server_type eq "drac" ) {
    $power_status = power_on_drac($admin_addr);
  }
  if ( $power_status eq "on" ) {
    print "Power of $real_host was successfully turned on.\n";
    return 0;
  }
  elsif ( $power_status ne "off" ) {
    return $COMMAND_NOT_SUPPORTED;
  }

  my $retry_count = 0;

  while ( $power_status ne "on" && $retry_count < $max_retries ) {
    $power_status = get_power_status( $admin_addr, $server_type );
    last if ( $power_status eq "on" );
    print
"Waiting until power status becomes 'on'. Current status is $power_status ...\n";
    sleep 3;
    $retry_count++;
  }

  if ( $power_status eq "on" ) {
    print "Power of $real_host was successfully turned on.\n";
    return 0;
  }
  else {
    print "Power of $real_host was not turned on. Check the host for detail.\n";
    return 1;
  }
}

sub status {
  my ( $real_host, $admin_addr, $server_type ) = @_;
  my $power_status = get_power_status( $admin_addr, $server_type );
  print "Current power status on $real_host : $power_status\n";
  if ( $power_status eq "on" ) {
    return 0;
  }
  elsif ( $power_status eq "off" ) {
    return 0;
  }
  else {
    return $COMMAND_NOT_SUPPORTED;
  }
}

# If ssh is reachable and mysqld process does not exist, exit with 2 and
# do not power off. If ssh is not reachable, do power off and exit with 0
# if successful. Otherwise exit with 1.
sub main {

  my ( $command, $ssh_user, $host, $ip, $port, $pid_file, $help );
  GetOptions(
    'command=s'  => \$command,
    'ssh_user=s' => \$ssh_user,
    'host=s'     => \$host,
    'ip=s'       => \$ip,
    'port=i'     => \$port,
    'pid_file=s' => \$pid_file,
    'help'       => \$help,
  );

  if ($help) {
    pod2usage(0);
  }

  pod2usage(1) unless ($command);

  my $rc            = 1;
  my $ssh_stop_fail = 0;

  if ( $command eq "stopssh" || $command eq "stopssh2" ) {
    pod2usage(1) unless ($ssh_user);
    pod2usage(1) unless ($host);
    $rc = stopssh( $ssh_user, $host, $ip, $pid_file );
    if ( $rc == $SSH_STOP_OK ) {
      exit $rc;
    }
    else {
      exit 1 if ( $command eq "stopssh2" );
      $ssh_stop_fail = 1;
    }
  }

  # Get server type (ilo/drac, etc) and administrative IP address.
  my ( $admin_addr, $server_type ) = FIXME_xxx( $host, $ip );
  if ( $command eq "start" ) {
    $rc = start( $host, $admin_addr, $server_type );
  }
  elsif ( $command eq "stop" || $ssh_stop_fail ) {
    $rc = stop( $host, $admin_addr, $server_type );
  }
  elsif ( $command eq "status" ) {
    $rc = status( $host, $admin_addr, $server_type );
  }
  else {
    pod2usage(1);
  }

  # Do other way to stop host
  if ( $rc == $COMMAND_NOT_SUPPORTED ) {
    $rc = FIXME_xxx( $command, $host, $ip );
  }

  if ( $rc == 0 ) {
    exit 0;
  }
  else {
    exit 1;
  }
}

#############################################################################

=head1 NAME

Main purpose of this command is node fencing so that split brain never happens.

=head1 SYNOPSIS

# power off

power_manager --command=stop --host=master_server

# killing mysqld and mysqld_safe at first. If not successful, forcing power off

power_manager --command=stopssh --host=master_server --ssh_user=root

# killing mysqld and mysqld_safe. If not successful, just exit.

power_manager --command=stopssh2 --host=master_server --ssh_user=root

# killing mysqld with specified pid file. This is useful when you run multiple MySQL instances and want to stop only specified instance

power_manager --command=stopssh --host=master_server --ssh_user=root --pid_file=/var/lib/mysql/mysqld.pid

# power on

power_manager --command=start --host=master_server

# checking power status

power_manager --command=status --host=master_server

3.8.3.4、sned_report脚本
vim /usr/bin/send_report
#!/usr/bin/perl

#  Copyright (C) 2011 DeNA Co.,Ltd.
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#  Foundation, Inc.,
#  51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA

## Note: This is a sample script and is not complete. Modify the script based on your environment.

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

#new_master_host and new_slave_hosts are set only when recovering master succeeded
my ( $dead_master_host, $new_master_host, $new_slave_hosts, $subject, $body );
GetOptions(
  'orig_master_host=s' => \$dead_master_host,
  'new_master_host=s'  => \$new_master_host,
  'new_slave_hosts=s'  => \$new_slave_hosts,
  'subject=s'          => \$subject,
  'body=s'             => \$body,
);

# Do whatever you want here

exit 0;

3.8.3.5、添加执行权限
chmod +x /usr/bin/master_ip_failover
chmod +x /usr/bin/master_ip_online_change 
chmod +x /usr/bin/power_manager 
chmod +x /usr/bin/send_report

3.8.4、设置定期清理relay脚本(两台slave服务器)
$ mkdir  -p /root/shell
$ vi /root/shell/purge_relay_log.sh
#!/bin/bash
dbuser=root
dbhost=localhost
dbpasswd=root_pwd2021!
dbsocket='/data/mysql/mysql3306/mysql.sock'
port=3306
log_dir='/data/mysql/mysql3306/logs'
work_dir='/data/mysql/mysql3306/data/'
purge='/usr/bin/purge_relay_logs'

if [ ! -d $log_dir ]
then
   mkdir $log_dir -p
fi

$purge --user=$dbuser  --host=$dbhost --password=$dbpasswd --socket=$dbsocket --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1

$ chmod 755 /root/shell/purge_relay_log.sh   

# 添加到crontab定期执行
vi /etc/crontab
0 4 * * * /bin/bash  /root/shell/purge_relay_log.sh

3.8.5、所有DB机器配置mysql软链:
ln -s /usr/local/mysql/bin/mysqlbinlog /usr/local/bin/mysqlbinlog
ln -s /usr/local/mysql/bin/mysql /usr/local/bin/mysql

3.8.6、执行互信验证
masterha_check_ssh --conf=/etc/masterha/app_zabbix.cnf

# 看到所有节点:
Fri Jun 18 13:28:49 2021 - [info] All SSH connection tests passed successfully.
# 检测通过

3.8.7、检查mysql集群主从复制是否正常
masterha_check_repl --conf=/etc/masterha/app_zabbix.cnf

# 结果:MySQL Replication Health is OK.
# 表示检查通过

3.8.8、启动manager
nohup masterha_manager --ignore_last_failover --conf=/etc/masterha/app_zabbix.cnf  &

# 备注:
#启动前检查
masterha_check_ssh  --conf=/etc/masterha/app_zabbix.cnf
masterha_check_repl --conf=/etc/masterha/app_zabbix.cnf
# 启动前建议清除之前的failover 日志:
rm -f /home/mha/masterha/app1/saved_master_binlog_from*

3.8.9、其他操作
#查看启动状态
masterha_check_status --conf=/etc/masterha/app_zabbix.cnf

#启动日志:
tail -f /home/mha/masterha/app_zabbix/manager.log

# 手动关闭manager 节点:
masterha_stop --conf=/etc/masterha/app_zabbix.cnf

#关闭DB
mysqladmin -uroot -p'root_pwd2021!'  shutdown
#启动DB
/usr/local/mysql/bin/mysqld --defaults-file=/etc/my.cnf &

4、故障模拟

4.1、验证MHA主从自动切换

通过模拟主节点宕机,验证是否触发主从自动切换。

  • 模拟主服务器宕机
$ systemctl stop mysqld
或
$ ps -ef |grep mysql | grep -v grep | awk '{print $2}' |xargs kill -9
  • 观察MHA Manger日志,查看切换过程
----- Failover Report -----

app_zabbix: MySQL Master failover 10.211.55.101(10.211.55.101:3306) to 10.211.55.102(10.211.55.102:3306) succeeded
......
Master failover to 10.211.55.102(10.211.55.102:3306) completed successfully.
  • Mysql-slave2节点查看主从状态
root@localhost:mysql.sock [zbxtest]>show slave status\G
No connection. Trying to reconnect...
Connection id:    27
Current database: zbxtest

*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.211.55.102
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 613
               Relay_Log_File: relay-bin.000002
                Relay_Log_Pos: 779
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
  • Mysql-Master(主从切换后的主节点)插入数据,验证主从同步
root@localhost:mysql.sock [zbxtest]>create table test (id INT, value varchar(40));
Query OK, 0 rows affected (0.02 sec)

root@localhost:mysql.sock [zbxtest]>insert into test values(3,'test');
Query OK, 1 row affected (0.04 sec)
  • Mysql-slave2节点查看数据是否同步
root@localhost:mysql.sock [zbxtest]>select * from test;
+------+-------+
| id   | value |
+------+-------+
|    3 | test  |
+------+-------+

4.2、将故障节点重新加入MHA

  • 启动mysql服务
$ systemctl start mysqld
或
/usr/local/mysql/bin/mysqld --defaults-file=/etc/my.cnf &
  • 从mangae.log中找到新master 日志的起始位置,以便老的master修复后重新加入MHA集群
$ cd /home/mha/masterha/app_zabbix
$ more manager.log |grep  -i change
Fri Jun 18 14:04:21 2021 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.137.138', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000003', MASTER_LOG_POS=1712, MASTER_USER='repl', MASTER_PASSWORD='xxx';
  • 登录故障数据库,设置为只读模式
root@localhost:mysql.sock [zbxtest]>set global read_only=1;
Query OK, 0 rows affected (0.01 sec)

root@localhost:mysql.sock [zbxtest]>set global relay_log_purge=0;
Query OK, 0 rows affected (0.00 sec)

root@localhost:mysql.sock [zbxtest]>stop slave ;
  • stop slave,配置主从复制
root@localhost:mysql.sock [zbxtest]>CHANGE MASTER TO MASTER_HOST='192.168.137.138', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000003', MASTER_LOG_POS=1712, MASTER_USER='repl', MASTER_PASSWORD='repl_pwd2021!';
Connection id:    6
Current database: zbxtest
  • 启动主从复制
root@localhost:mysql.sock [zbxtest]>start slave;
Query OK, 0 rows affected (0.01 sec)

root@localhost:mysql.sock [zbxtest]>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.211.55.102
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000002
          Read_Master_Log_Pos: 613
               Relay_Log_File: relay-bin.000004
                Relay_Log_Pos: 826
        Relay_Master_Log_File: mysql-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
  • 验证数据是否同步
root@localhost:mysql.sock [zbxtest]>select * from test;
+------+-------+
| id   | value |
+------+-------+
|    3 | test  |
+------+-------+
1 row in set (0.00 sec)

4.3、管理节点重新启动mha

# 启动前建议清除之前的failover 日志:
rm -f /home/mha/masterha/app_zabbix/saved_master_binlog_from*
rm -f /home/mha/masterha/app_zabbix/app_zabbix.failover.complete
rm -f /home/mha/masterha/app_zabbix/manager.log

#启动前检查
masterha_check_ssh  --conf=/etc/masterha/app_zabbix.cnf
masterha_check_repl --conf=/etc/masterha/app_zabbix.cnf

#启动
nohup masterha_manager --ignore_last_failover --conf=/etc/masterha/app_zabbix.cnf  &


#查看启动状态
masterha_check_status --conf=/etc/masterha/app_zabbix.cnf