0%

mysql日志类型

redo log 重做日志 确保事务的持久性
undo log 回滚日志 用于回滚
bin log 二进制日志 主从复制
relay log 中继日志 主从复制
error log 错误日志
slow query log 慢查询日志
general log 一般查询日志

percona-toolkit介绍

percona-toolkit是一组高级命令行工具的集合,用来执行各种通过手工执行非常复杂和麻烦的mysql和系统任务,这些任务包括:

  1. 检查master和slave数据的一致性
  2. 有效地对记录进行归档
  3. 查找重复的索引
  4. 对服务器信息进行汇总
  5. 分析来自日志和tcpdump的查询
  6. 当系统出问题的时候收集重要的系统信息

percona-toolkit工具中最主要的三个组件分别是:

  1. pt-table-checksum 负责监测mysql主从数据一致性
  2. pt-table-sync 负责当主从数据不一致时修复数据,让它们保存数据的一致性
  3. pt-heartbeat 负责监控mysql主从同步延迟

percona-toolkit安装

  1. 安装依赖
# 安装依赖
yum install perl-IO-Socket-SSL perl-DBD-MySQL perl-Time-HiRes perl perl-DBI -y
  1. 下载安装包
  • 百度云盘下载地址:https://pan.baidu.com/s/1bp1OOgf (提取密码:y462)
  • wget https://www.percona.com/downloads/percona-toolkit/2.2.7/RPM/percona-toolkit-2.2.7-1.noarch.rpm
  1. 安装
rpm -ivh percona-toolkit-2.2.7-1.noarch.rpm

percona-toolkit使用

pt-table-checksumPercona-Toolkit的组件之一,用于检测MySQL主、从库的数据是否一致。其原理是在主库执行基于statement的sql语句来生成主库数据块的checksum,把相同的sql语句传递到从库执行,并在从库上计算相同数据块的checksum,最后,比较主从库上相同数据块的checksum值,由此判断主从数据是否一致。

为了减少对数据库的干预,pt-table-checksum还会自动侦测并连接到从库,当然如果失败,可以指定–recursion-method选项来告诉从库在哪里。它的易用性还体现在,复制若有延迟,在从库 checksum 会暂停直到赶上主库的计算时间点。

为了保证主数据库服务的安全,该工具实现了许多保护措施:

  1. 自动设置 innodb_lock_wait_timeout1s,避免引起
  2. 默认当数据库有25个以上的并发查询时,pt-table-checksum会暂停。可以设置 --max-load 选项来设置这个阀值
  3. 当用Ctrl+C停止任务后,工具会正常的完成当前chunk检测,下次使用--resume选项启动可以恢复继续下一个chunk

创建授权帐号

-- 工具在主库上执行,IP写主库IP
mysql> GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE,CREATE,DELETE,INSERT,UPDATE ON *.* TO 'pt'@'192.168.41.141' identified by '123456';
mysql> flush privileges;

pt-table-checksum

在主(master)上通过执行校验的查询对复制的一致性进行检查,对比主从的校验值,从而产生结果。

注意:第一次运行的时候需要加上--create-replicate-table参数,生成checksums表!

常用参数解释:

--nocheck-replication-filters 不检查复制过滤器,建议启用。
--no-check-binlog-format 不检查复制的binlog模式,要是binlog模式是ROW,则会报错。
--replicate-check-only 只显示不同步的信息。
--replicate= 把checksum的信息写入到指定表中,建议直接写到被检查的数据库当中。
--databases= 指定需要被检查的数据库,多个则用逗号隔开。
--tables= 指定需要被检查的表,多个用逗号隔开
--recursion-method 指定复制检查的方式,默认为processlist,hosts
h= Master的地址
u= 用户名
p= 密码
P= 端口
-- 从库需要开启slave;
start slave;
show slave status\G

-- 主库processlist可以查到slave
mysql> show processlist;
+--------+-------------+---------------+------+------------------+--------+---------------------------------------------------------------+------------------+
| Id     | User        | Host          | db   | Command          | Time   | State                                                         | Info             |
+--------+-------------+---------------+------+------------------+--------+---------------------------------------------------------------+------------------+
|      2 | system user |               | NULL | Connect          | 589655 | Slave has read all relay log; waiting for more updates        | NULL             |
|   2049 | slave       | mysql02:37382 | NULL | Binlog Dump GTID | 587611 | Master has sent all binlog to slave; waiting for more updates | NULL             |
| 588590 | root        | localhost     | NULL | Query            |      0 | starting                                                      | show processlist |
+--------+-------------+---------------+------+------------------+--------+---------------------------------------------------------------+------------------+
3 rows in set (0.03 sec)
# 执行检查命令,注意第一次添加--create-replicate-table参数创建表
[root@mysql01 ~]# pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=test2.checksums --create-replicate-table --databases=test h=192.168.41.141,u=pt,p=123456,P=3306
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
08-05T17:35:54      0      1       13       1       0   0.091 test.t_department
08-05T17:35:54      0      0        8       1       0   0.202 test.testttt

# 结果字段说明
# TS :完成检查的时间。
# ERRORS :检查时候发生错误和警告的数量。
# DIFFS :0表示一致,1表示不一致。当指定--no-replicate-check时,会一直为0,当指定--replicate-check-only会显示不同的信息。
# ROWS :表的行数。
# CHUNKS :被划分到表中的块的数目。
# SKIPPED :由于错误或警告或过大,则跳过块的数目。
# TIME :执行的时间。
# TABLE :被检查的表名。

[root@mysql01 ~]# pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=test2.checksums --no-replicate-check  --databases=test h=192.168.41.141,u=pt,p=123456,P=3306
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
08-06T09:30:43      0      0       13       1       0   0.023 test.t_department
08-06T09:30:43      0      0        8       1       0   0.033 test.testttt
[root@mysql01 ~]# pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=test2.checksums --replicate-check-only --databases=test h=192.168.41.141,u=pt,p=123456,P=3306
Differences on mysql02
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
test.t_department 1 0 1 

pt-table-sync

pt-table-sync: 高效的同步MySQL表之间的数据,他可以做单向和双向同步的表数据。他可以同步单个表,也可以同步整个库。它不同步表结构、索引、或任何其他模式对象。所以在修复一致性之前需要保证他们表存在。

# 参数解释:
# --replicate= :   指定通过pt-table-checksum得到的表,这2个工具差不多都会一直用。
# --databases= :    指定执行同步的数据库。
# --tables= :      指定执行同步的表,多个用逗号隔开。
# --sync-to-master :指定一个DSN,即从的IP,他会通过show processlist或show slave status 去自动的找主。
# h= :             服务器地址,命令里有2个ip,第一次出现的是Master的地址,第2次是Slave的地址。
# u= :             帐号。
# p= :             密码。
# --print :        打印,但不执行命令。打印出来修复数据的sql语句,可以手动的在slave从库上执行,让他们数据保持一致性
# --execute :      执行命令。

[root@mysql01 ~]# pt-table-sync --replicate=test2.checksums h=192.168.41.141,u=pt,p=123456 h=192.168.41.142,u=pt,p=123456 --print
REPLACE INTO `test`.`t_department`(`department_id`, `department_name`, `parent_id`, `level`, `create_time`, `update_time`) VALUES ('7', '8', '0', '1', '2020-07-28 16:06:40', '2020-07-28 16:06:40') /*percona-toolkit src_db:test src_tbl:t_department src_dsn:h=192.168.41.141,p=...,u=pt dst_db:test dst_tbl:t_department dst_dsn:h=mysql02,p=...,u=pt lock:1 transaction:1 changing_src:test2.checksums replicate:test2.checksums bidirectional:0 pid:114329 user:root host:mysql01*/;


[root@mysql01 ~]# pt-table-sync --replicate=test2.checksums h=192.168.41.141,u=pt,p=123456 h=192.168.41.142,u=pt,p=123456 --print --execute
REPLACE INTO `test`.`t_department`(`department_id`, `department_name`, `parent_id`, `level`, `create_time`, `update_time`) VALUES ('7', '8', '0', '1', '2020-07-28 16:06:40', '2020-07-28 16:06:40') /*percona-toolkit src_db:test src_tbl:t_department src_dsn:h=192.168.41.141,p=...,u=pt dst_db:test dst_tbl:t_department dst_dsn:h=mysql02,p=...,u=pt lock:1 transaction:1 changing_src:test2.checksums replicate:test2.checksums bidirectional:0 pid:115260 user:root host:mysql01*/;

# 同步之后再次查看,主从数据已经同步
[root@mysql01 ~]# pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=test2.checksums --databases=test h=192.168.41.141,u=pt,p=123456,P=3306
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
08-06T09:36:23      0      0       13       1       0   0.293 test.t_department
08-06T09:36:23      0      0        8       1       0   0.018 test.testttt

可以编写脚本定时同步主备库数据

#!/bin/bash
NUM=$(/usr/bin/pt-table-checksum --nocheck-replication-filters --no-check-binlog-format --replicate=test2.checksums --databases=test  h=192.168.41.141,u=pt,p=123456,P=3306|awk -F" " '{print $3}'|sed -n '2p')
if [ $NUM -eq 1 ];then
  /usr/bin/pt-table-sync --replicate=test2.checksums h=192.168.41.141,u=pt,p=123456 h=192.168.41.142,u=pt,p=123456 --print --execute
else
  echo "data is ok"
fi

pt-heartbeat

对于MySQL数据库主从复制延迟的监控,可以借助pt-heartbeat来实现。

pt-heartbeat的工作原理通过使用时间戳方式在主库上更新特定表,然后在从库上读取被更新的时间戳然后与本地系统时间对比来得出其延迟。具体流程:

  1. 在主上创建一张heartbeat表,按照一定的时间频率更新该表的字段(把时间更新进去)。监控操作运行后,heartbeat表能促使主从同步!
  2. 连接到从库上检查复制的时间记录,和从库的当前系统时间进行比较,得出时间的差异。

参考链接

问题说明

mysql主从同步时难免出现问题导致数据不一致情况.此时需要手动将主从数据同步.

Maatkit是一个开源的工具包,为mySQL日常管理提供了帮助,它包含很多工具,这里主要说下面两个:

  • mk-table-checksum : 用来检测masterslave上的表结构和数据是否一致的;
  • mk-table-sync : 在主从数据不一致时,用来修复数据的;先主后从有效保证表一致的工具,不必重载从表而能够保证一致。

    上面两个perl脚本在运行时都会锁表,表的大小取决于执行的快慢,勿在高峰期间运行,可选择凌晨

安装Maatkit

  1. 安装依赖
# 安装依赖
yum install perl-IO-Socket-SSL perl-DBD-MySQL perl-Time-HiRes perl perl-DBI -y

yum install perl-ExtUtils-Embed perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker -y

yum -y install perl-Digest-MD5
  1. 下载maatkit
  1. 安装maatkit
# 解压进入目录
tar -zvxf maatkit-7540.tar.gz  && cd maatkit-7540 

[root@mysql01 maatkit-7540]# perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for maatkit

# 安装
make && make install

# 测试查看mk-相关命令
[root@mysql01 maatkit-7540]# mk-
mk-archiver               mk-duplicate-key-checker  mk-heartbeat              mk-log-player             mk-profile-compact        mk-query-profiler         mk-slave-move             mk-table-sync             mk-variable-advisor       
mk-checksum-filter        mk-error-log              mk-index-usage            mk-merge-mqd-results      mk-purge-logs             mk-show-grants            mk-slave-prefetch         mk-table-usage            mk-visual-explain         
mk-config-diff            mk-fifo-split             mk-kill                   mk-parallel-dump          mk-query-advisor          mk-slave-delay            mk-slave-restart          mk-tcp-model              
mk-deadlock-logger        mk-find                   mk-loadavg                mk-parallel-restore       mk-query-digest           mk-slave-find             mk-table-checksum         mk-upgrade                

工具使用

  1. 创建mysql检查帐号
-- 主从服务器上创建只有SELECT权限的帐号
mysql> GRANT SELECT ON *.* TO 'data_check'@'%' IDENTIFIED BY '123456';
Query OK, 0 rows affected (0.01 sec)
 
mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)
  1. 检查一致性
# 第一个h:   填写主库ip
# 第一个u:   主库授权的用户名
# 第一个p:   主库授权的密码
# 第二个h:   填写从库ip
# 第二个u:   从库授权的用户名
# 第二个p:   从库授权的密码
# 大写P:   mysql的端口
[root@mysql01 maatkit-7540]# mk-table-checksum h=192.168.41.141,u=data_check,p=123456,P=3306 h=192.168.41.142,u=data_check,p=123456,P=3306
DATABASE TABLE        CHUNK HOST           ENGINE      COUNT         CHECKSUM TIME WAIT STAT  LAG
test     t_department     0 192.168.41.141 InnoDB       NULL         83657212    0    0 NULL NULL
test     t_department     0 192.168.41.142 InnoDB       NULL       3344593131    0    0 NULL NULL
test     testttt          0 192.168.41.141 InnoDB       NULL       3722958311    0    0 NULL NULL
test     testttt          0 192.168.41.142 InnoDB       NULL       3722958311    0    0 NULL NULL
DATABASE TABLE                     CHUNK HOST           ENGINE      COUNT         CHECKSUM TIME WAIT STAT  LAG
mysql    columns_priv                  0 192.168.41.141 MyISAM       NULL                0    0    0 NULL NULL
mysql    columns_priv                  0 192.168.41.142 MyISAM       NULL                0    0    0 NULL NULL
mysql    db                            0 192.168.41.141 MyISAM       NULL       2629905140    0    0 NULL NULL
mysql    db                            0 192.168.41.142 MyISAM       NULL       2629905140    0    0 NULL NULL
mysql    engine_cost                   0 192.168.41.141 InnoDB       NULL       2678612005    0    0 NULL NULL
...


# 参数-d(或者--database) 指定要检查的数据库,多个数据库之间用逗号隔开,如果不加-d,校验的是所有库;
# --count    会计算出表的行数
[root@mysql01 maatkit-7540]# mk-table-checksum h=192.168.41.141,u=data_check,p=123456,P=3306 h=192.168.41.142,u=data_check,p=123456,P=3306 --count -dtest
DATABASE TABLE        CHUNK HOST           ENGINE      COUNT         CHECKSUM TIME WAIT STAT  LAG
test     t_department     0 192.168.41.141 InnoDB         13         5e339d70    0    0 NULL NULL
test     t_department     0 192.168.41.142 InnoDB         13         7a4c9153    0    0 NULL NULL
test     testttt          0 192.168.41.141 InnoDB          8         41db7636    0    0 NULL NULL
test     testttt          0 192.168.41.142 InnoDB          8         41db7636    0    0 NULL NULL

# 结果参数说明
# DATABASE      检查的库名
# TABLE:       检查的库中的表名
# CHUNK:       checksum时的近似数值
# HOST:        主从MYSQL的地址
# ENGINE:      表引擎
# COUNT:       表的行数
# CHECKSUM:    校验值
# TIME:        所用时间
# WAIT:        等待时间
# STAT:        MASTER_POS_WAIT()返回值
# LAG:         slave的延时时间


# mk-checksum-filter过滤出不一样的结果
[root@mysql01 maatkit-7540]# mk-table-checksum h=192.168.41.141,u=data_check,p=123456,P=3306 h=192.168.41.142,u=data_check,p=123456,P=3306 --count -dtest | mk-checksum-filter
test     t_department     0 192.168.41.141 InnoDB         13         5e339d70    0    0 NULL NULL
test     t_department     0 192.168.41.142 InnoDB         13         7a4c9153    0    0 NULL NULL
  1. 修复主从不同步的数据

mk-table-sync用来修复多个实例之间数据的不一致. mk-table-sync的工作方式是:先一行一行检查主从库的表是否一样,如果哪里不一样,就执行删除,更新,插入等操作,使其达到一致。

由于上面在mk-table-checksum检查时用的data_check只有select权限,权限太小,不能用于mk-table-sync修复数据只用。
所以还需要在主库和从库数据库里创建用于mk-table-sync修复数据之用的账号权限

mysql> GRANT ALL ON *.* to mksync@'%' identified by "123456";
Query OK, 0 rows affected (0.01 sec)
 
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
# 注意:主从数据库表和表结构必须一致,否则会报错!
# --print               显示同步需要执行的语句
# --execute             执行数据同步
# --charset=utf8mb4     设置字符集,避免从库乱码。
# --no-check-slave      如果不加这个参数,会提示报错,因为此工具默认不允许在从库上修改数据
# -t                    指定同步的表名,多个表时可以在-t后加=号,即-t=tables1,tables2,..
# > /root/result        命令后面加上 > /root/result 将结果导出

[root@mysql01 maatkit-7540]# mk-table-sync --execute --charset=utf8mb4 --print --no-check-slave --databases test h=192.168.41.141,u=mksync,p=123456,P=3306 h=192.168.41.142,u=mksync,p=123456,P=3306
UPDATE `test`.`t_department` SET `department_name`='44', `parent_id`='0', `level`='1', `create_time`='2020-07-28 16:06:24', `update_time`='2020-07-28 16:06:24' WHERE `department_id`='3' LIMIT 1 /*maatkit src_db:test src_tbl:t_department src_dsn:A=utf8mb4,P=3306,h=192.168.41.141,p=...,u=mksync dst_db:test dst_tbl:t_department dst_dsn:A=utf8mb4,P=3306,h=192.168.41.142,p=...,u=mksync lock:0 transaction:1 changing_src:0 replicate:0 bidirectional:0 pid:3869 user:root host:mysql01*/;
# 修复之后检查主从,通过CHECKSUM可以看出,数据已经完全一样了
[root@mysql01 maatkit-7540]# mk-table-checksum h=192.168.41.141,u=data_check,p=check@123,P=3306 h=192.168.41.142,u=data_check,p=check@123,P=3306 --count -dtest
DATABASE TABLE        CHUNK HOST           ENGINE      COUNT         CHECKSUM TIME WAIT STAT  LAG
test     t_department     0 192.168.41.141 InnoDB         13         5e339d70    0    0 NULL NULL
test     t_department     0 192.168.41.142 InnoDB         13         5e339d70    0    0 NULL NULL
test     testttt          0 192.168.41.141 InnoDB          8         41db7636    0    0 NULL NULL
test     testttt          0 192.168.41.142 InnoDB          8         41db7636    0    0 NULL NULL

参考链接

说明

  • Redis基于Reactor模式开发了网络事件处理器,这个处理器被称为文件事件处理器。它的组成结构为4部分:多个套接字、IO多路复用程序、文件事件分派器、事件处理器。因为文件事件分派器队列的消费是单线程的,所以Redis才叫单线程模型。

  • Redis 4.0 开始就有多线程的概念了,比如 Redis 通过多线程方式在后台删除对象、以及通过 Redis 模块实现的阻塞命令等。

  • Redis 6.0 网络处理多线程,指的是在网络 IO 处理方面上了多线程,如网络数据的读写和协议解析等,需要注意的是,执行命令的核心模块还是单线程的。

  • Redis 的瓶颈并不在 CPU,而在内存和网络。内存不够的话,可以加内存或者做数据结构优化和其他优化等,但网络的性能优化才是大头,网络 IO 的读写在 Redis 整个执行期间占用了大部分的 CPU 时间,如果把网络处理这部分做成多线程处理方式,那对整个 Redis 的性能会有很大的提升。

IO多路复用

-

说明

prometheus监控插件mysqld_exporter可以采集mysql指标并告警展示.

步骤

本文使用docker部署的方式

  1. 创建用户并授权
CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter#2020';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'exporter'@'%';
GRANT SELECT ON performance_schema.* TO 'exporter'@'%';
-- GRANT ALL ON *.* TO 'exporter'@'%';
  1. 运行mysqld_exporter
docker run -d \
  --name mysql_exporter \
  --restart always \
  -p 9104:9104 \
  -e DATA_SOURCE_NAME="exporter:123456@(192.168.41.141:3306)/" \
  prom/mysqld-exporter
  1. 查看指标

如果没有问题,可以将指标采集到prometheus

  1. 修改prometheus配置文件,scrape_configs下面添加
    ssh no
    如果需要,部署prometheus
# prometheus.yml
global:
  scrape_interval:     15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093
rule_files:
  # - "rules.yml"
  # - "second_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.41.128:9090']
  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.41.141:9104']
# 启动
docker run --name=prometheus -d -p 9090:9090  -v /root/prometheus/config/prometheus.yml:/etc/prometheus/prometheus.yml hub.deri.org.cn/k8s_monitor/prometheus:latest
# vim /usr/local/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.41.141:9104']

注意:如果是采用prometheus-operator方式安装prometheus,新增target示例:
新增mysql-exporter-endpoints.yaml,引入外部服务

apiVersion: v1
kind: Endpoints
metadata:
  name: mysql-exporter
  namespace: monitoring
subsets:
- addresses:
  - ip: 192.168.1.47
    targetRef:
      name: huawei-mysql
      namespace: default
    ports:
      - name: metrics
        port: 9104
        protocol: TCP

新增mysql-exporter-service.yaml

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: "true"
  labels:
    app: prometheus-mysql-exporter
    chart: prometheus-mysql-exporter
    heritage: Tiller
    jobLabel: mysql-exporter
    release: prometheus-operator
  # name与endpoints name保持一致
  name: mysql-exporter
  namespace: monitoring
spec:
  ports:
  - name: metrics
    port: 9104
    targetPort: 9104
    protocol: TCP
  type: ClusterIP

新增mysql-exporter-ServiceMonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  # 注意标签 当前版本prometheus-operator是通过release: prometheus-operator标签来发现ServiceMonitor的
  labels:
    app: prometheus-operator-mysql-exporter
    chart: prometheus-operator-8.2.4
    heritage: Tiller
    release: prometheus-operator
  name: prometheus-operator-mysql-exporter
  namespace: monitoring
spec:
  endpoints:
  - port: metrics
  jobLabel: jobLabel
  selector:
    # 通过标签发现service
    matchLabels:
      app: prometheus-mysql-exporter
      release: prometheus-operator
kubectl apply -f .
# 新增上面三个服务之后,prometheus页面就可以发现新增的mysql-exporter.
# prometheus数据库中也能查询到mysql指标数据
  1. 设置报警规则
groups:
- name: GaleraAlerts
  rules:
  - alert: MySQLGaleraNotReady
    expr: mysql_global_status_wsrep_ready != 1
    for: 5m
    labels:
      severity: warning
    annotations:
      description: '{{$labels.job}} on {{$labels.instance}} is not ready.'
      summary: Galera cluster node not ready
  - alert: MySQLGaleraOutOfSync
    expr: (mysql_global_status_wsrep_local_state != 4 and mysql_global_variables_wsrep_desync
      == 0)
    for: 5m
    labels:
      severity: warning
    annotations:
      description: '{{$labels.job}} on {{$labels.instance}} is not in sync ({{$value}}
        != 4).'
      summary: Galera cluster node out of sync
  - alert: MySQLGaleraDonorFallingBehind
    expr: (mysql_global_status_wsrep_local_state == 2 and mysql_global_status_wsrep_local_recv_queue
      > 100)
    for: 5m
    labels:
      severity: warning
    annotations:
      description: '{{$labels.job}} on {{$labels.instance}} is a donor (hotbackup)
        and is falling behind (queue size {{$value}}).'
      summary: xtradb cluster donor node falling behind
  - alert: MySQLReplicationNotRunning
    expr: mysql_slave_status_slave_io_running == 0 or mysql_slave_status_slave_sql_running
      == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      description: Slave replication (IO or SQL) has been down for more than 2 minutes.
      summary: Slave replication is not running
  - alert: MySQLReplicationLag
    expr: (mysql_slave_lag_seconds > 30) and on(instance) (predict_linear(mysql_slave_lag_seconds[5m],
      60 * 2) > 0)
    for: 1m
    labels:
      severity: critical
    annotations:
      description: The mysql slave replication has fallen behind and is not recovering
      summary: MySQL slave replication is lagging
  - alert: MySQLReplicationLag
    expr: (mysql_heartbeat_lag_seconds > 30) and on(instance) (predict_linear(mysql_heartbeat_lag_seconds[5m],
      60 * 2) > 0)
    for: 1m
    labels:
      severity: critical
    annotations:
      description: The mysql slave replication has fallen behind and is not recovering
      summary: MySQL slave replication is lagging
  - alert: MySQLInnoDBLogWaits
    expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
    labels:
      severity: warning
    annotations:
      description: The innodb logs are waiting for disk at a rate of {{$value}} /
        second
      summary: MySQL innodb log writes stalling
# 告警规则示例
groups:
- name: MySQLStatsAlert
  rules:
  - alert: MySQL is down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: open files high
    expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} open files high"
      description: "Open files is high. Please consider increasing open_files_limit."
  - alert: Read buffer size is bigger than max. allowed packet size
    expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
      description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet).This can break your replication."
  - alert: Sort buffer possibly missconfigured
    expr: mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Sort buffer possibly missconfigured"
      description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
  - alert: Thread stack size is too small
    expr: mysql_global_variables_thread_stack <196608
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is too small"
      description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs for example. A typical is 256k for thread_stack_size."
  - alert: Used more than 80% of max connections limited 
    expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
      description: "Used more than 80% of max connections limited"
  - alert: InnoDB Force Recovery is enabled
    expr: mysql_global_variables_innodb_force_recovery != 0 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
      description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
  - alert: InnoDB Log File size is too small
    expr: mysql_global_variables_innodb_log_file_size < 16777216 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
      description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
  - alert: InnoDB Flush Log at Transaction Commit
    expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
      description: "InnoDB Flush Log at Transaction Commit is set to a values != 1. This can lead to a loss of commited transactions in case of a power failure."
  - alert: Table definition cache too small
    expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table definition cache too small"
      description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
  - alert: Table open cache too small
    expr: mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table open cache too small"
      description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
  - alert: Thread stack size is possibly too small
    expr: mysql_global_variables_thread_stack < 262144
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
      description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs for example. A typical is 256k for thread_stack_size."
  - alert: InnoDB Buffer Pool Instances is too small
    expr: mysql_global_variables_innodb_buffer_pool_instances == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
      description: "If you are using MySQL 5.5 and higher you should use several InnoDB Buffer Pool Instances for performance reasons. Some rules are: InnoDB Buffer Pool Instance should be at least 1 Gbyte in size. InnoDB Buffer Pool Instances you can set equal to the number of cores of your machine."
  - alert: InnoDB Plugin is enabled
    expr: mysql_global_variables_ignore_builtin_innodb == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
      description: "InnoDB Plugin is enabled"
  - alert: Binary Log is disabled
    expr: mysql_global_variables_log_bin != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Binary Log is disabled"
      description: "Binary Log is disabled. This prohibits you to do Point in Time Recovery (PiTR)."
  - alert: Binlog Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
      description: "Binlog Cache size is possibly to small. A value of 1 Mbyte or higher is OK."
  - alert: Binlog Statement Cache size too small
    expr: mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
      description: "Binlog Statement Cache size is possibly to small. A value of 1 Mbyte or higher is typically OK."
  - alert: Binlog Transaction Cache size too small
    expr: mysql_global_variables_binlog_cache_size  <1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
      description: "Binlog Transaction Cache size is possibly to small. A value of 1 Mbyte or higher is typically OK."
  - alert: Sync Binlog is enabled
    expr: mysql_global_variables_sync_binlog == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
      description: "Sync Binlog is enabled. This leads to higher data security but on the cost of write performance."
  - alert: IO thread stopped
    expr: mysql_slave_status_slave_io_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} IO thread stopped"
      description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
  - alert: SQL thread stopped 
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    expr: rate(mysql_slave_status_seconds_behind_master[1m]) >30 
    for: 1m
    labels:
      severity: warning 
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
  - alert: Slave is NOT read only(Please ignore this warning indicator.)
    expr: mysql_global_variables_read_only != 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Slave is NOT read only"
      description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."
  1. 监控展示

如果需要,部署grafana

docker run -d -p 3000:3000 --name grafana hub.deri.org.cn/k8s_monitor/grafana:6.4.2

Grafana dashboard一些可用的ID,可直接通过ID导入dashboard.

  • 6239
  • 7362
  • 11329
  • 10654
  • 11323

参考链接

说明

  • 使用的docker镜像mysql:5.7.18;
  • 容器中mysql配置文件目录/etc/mysql/;

步骤

  1. 启动镜像,将镜像中配置目录拷贝到宿主机
#启动,并设置root密码
docker run -d --name mysql  -e MYSQL_ROOT_PASSWORD=123456 hub.deri.org.cn/library/mysql:5.7

#复制
docker cp mysql:/etc/mysql /root
#配置文件目录结构
[root@mysql01 mysql]# tree
.
├── conf.d
│   ├── docker.cnf
│   └── mysql.cnf
├── my.cnf -> /etc/alternatives/my.cnf
├── my.cnf.fallback
├── mysql.cnf
└── mysql.conf.d
    └── mysqld.cnf

2 directories, 6 files
  1. 查看mysql data目录,便于将docker中数据持久化到宿主机
-- 登录
docker exec -it mysql bash
mysql -uroot -p123456
-- 查询
mysql> show variables like '%datadir%';
+---------------+-----------------+
| Variable_name | Value           |
+---------------+-----------------+
| datadir       | /var/lib/mysql/ |
+---------------+-----------------+
1 row in set (0.01 sec)
  1. 修改mysql.cnf
#参考mysql主从配置
[mysqld]
server-id=1 # 主从设置不一样
log-bin=mysql-bin #备无需开启
auto-increment-increment=2
auto-increment-offset=2 # 主从设置不一样:1,2
log-slave-updates=true
#replicate-ignore-db=information_schema
max_binlog_size=1024M
binlog-format=ROW
#binlog_rows_query_log_events=on # 在row 模式的binlog中包含SQL EVENTS(即SQL语句也会保留)
#master-info-repository=TABLE
#relay-log-info-repository=TABLE
#slave-skip-errors=all # 忽略复制产生的错误
#slave-skip-errors=1062,1032,1060 # 跳过已知错误,主键冲突、表已存在等错误代码如1062,1032,1060等
#relay-log-purge = 1    #是否自动清空不再需要中继日志时。默认值为1(启用)
#expire_logs_days = 30  #超过 30 天的 binlog 删除
  1. 启动mysql
#映射端口、配置文件和data
docker run -d --name mysql -v /root/mysql:/etc/mysql -v /var/lib/mysql/:/var/lib/mysql/ -e MYSQL_ROOT_PASSWORD=123456 -p 3306:3306  hub.deri.org.cn/library/mysql:5.7

#也可以按需只映射一个配置文件mysql.cnf
docker run -d --name mysql -v /root/mysql/mysql.cnf:/etc/mysql/mysql.cnf -v /var/lib/mysql/:/var/lib/mysql/ -e MYSQL_ROOT_PASSWORD=123456 -p 3307:3306  hub.deri.org.cn/library/mysql:5.7
  1. 其它主从配置,参考非docker安装配置
  • 添加主从帐号
  • 配置主从信息
  • 开启半同步复制
  • 开启从节点只读模式
  1. 使用keepalive工具

写一个脚本/etc/docker_mysql.sh判断dockermysql是否存活

#! /bin/bash
count=$(docker ps |grep hub.deri.org.cn/library/mysql:5.7 | grep Up| wc -l)
if [ $count == "1" ];then
    exit 0
else
    exit 1
fi


#测试这个脚本
[root@mysql01 ~]# docker stop mysql
mysql
[root@mysql01 ~]# ./docker_mysql.sh 
[root@mysql01 ~]# ./docker_mysql.sh ;echo $?
1
[root@mysql01 ~]# docker start mysql 
mysql
[root@mysql01 ~]# ./docker_mysql.sh ;echo $?
0

#配置keepalived
vrrp_script chk_mysql {
    script "/etc/docker_mysql.sh"
    interval 2
    timeout 2
    fall 3
}