Upgrading TiDB 2.0 to 2.1
Install Ansible and its dependencies on the control machine
TiDB-Ansible release-2.1 requires Ansible 2.4.2 or later but below 2.7.0 (ansible>=2.4.2,<2.7.0), plus the Python modules jinja2>=2.9.6 and jmespath>=0.9.0. To simplify dependency management, the new version installs Ansible and its dependencies with pip; see "Install Ansible and its dependencies on the control machine". For offline environments, see "Install Ansible and its dependencies offline on the control machine".
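A minimal sketch of the pip-based installation that matches the version constraints above (assuming pip is already available on the control machine; the exact procedure is described in the documents referenced above):
$ sudo pip install "ansible>=2.4.2,<2.7.0" "jinja2>=2.9.6" "jmespath>=0.9.0"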
After the installation is complete, check the versions with the following commands:
$ ansible --version
ansible 2.6.8
$ pip show jinja2
Name: Jinja2
Version: 2.10
$ pip show jmespath
Name: jmespath
Version: 0.9.3
Note: Be sure to install Ansible and its dependencies as described in the documents above. Confirm that the Jinja2 version is correct, otherwise Grafana fails to start; confirm that the jmespath version is correct, otherwise the rolling upgrade of TiKV fails.
Download TiDB-Ansible on the control machine
Log in to the control machine as the tidb user, change to the /home/tidb directory, and back up the tidb-ansible folder of your TiDB 2.0 or TiDB 2.1 rc deployment:
$ mv tidb-ansible tidb-ansible-bak
Download the latest tidb-ansible release-2.1 branch; the default folder name is tidb-ansible.
$ git clone -b release-2.1 https://github.com/pingcap/tidb-ansible.git
Edit the inventory.ini file and configuration files
Log in to the control machine as the tidb user and change to the /home/tidb/tidb-ansible directory.
Edit the inventory.ini file, or copy the original inventory.ini file directly
Edit the inventory.ini file, referring to the backup file /home/tidb/tidb-ansible-bak/inventory.ini for the IP information.
The following variables need special attention; for their meanings, refer to the inventory.ini variable description.
Confirm that:
ansible_user is set to a normal user. For unified privilege management, remote installation as the root user is no longer supported. The default configuration uses the tidb user as both the SSH remote user and the user that runs the services:
## Connection
# ssh via normal user
ansible_user = tidb
To configure mutual trust between hosts automatically, refer to "How to configure SSH mutual trust and sudo rules".
process_supervision should stay consistent with the previous version; systemd is recommended by default:
# process supervision, [systemd, supervise]
process_supervision = systemd
If you need to change it, refer to "How to change the process supervision mode from supervise to systemd": first change the supervision mode using the backed-up /home/tidb/tidb-ansible-bak/ branch, then upgrade.
Edit the TiDB cluster component configuration files, or directly copy the previously modified parameter files under conf
If you previously customized the TiDB cluster component configuration files, modify the corresponding configuration files under /home/tidb/tidb-ansible/conf by referring to the backup files.
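To see exactly which parameters you changed before, a simple comparison against the backup can help (a sketch using the paths mentioned above, assuming your previous changes were made under the conf directory; tikv.yml is just one example file):
$ diff /home/tidb/tidb-ansible-bak/conf/tikv.yml /home/tidb/tidb-ansible/conf/tikv.yml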
In the TiKV configuration file tikv.yml, end-point-concurrency has been split into three parameters: high-concurrency, normal-concurrency, and low-concurrency:
readpool:
coprocessor:
# Notice: if CPU_NUM > 8, default thread pool size for coprocessors
# will be set to CPU_NUM * 0.8.
# high-concurrency: 8
# normal-concurrency: 8
# low-concurrency: 8
With multiple TiKV instances on a single machine, you need to modify these three parameters. Recommended setting: number of instances * parameter value = number of CPU cores * 0.8, as in the example below.
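For example, with 2 TiKV instances on a hypothetical 40-core machine, each value would be 40 * 0.8 / 2 = 16 (the numbers are illustrative, not a recommendation for your hardware):
readpool:
  coprocessor:
    high-concurrency: 16
    normal-concurrency: 16
    low-concurrency: 16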
Download the TiDB 2.1 binary to the control machine
Confirm that tidb_version = v2.1.0 in the tidb-ansible/inventory.ini file, then run the following command to download the TiDB 2.1 binary to the control machine.
$ ansible-playbook local_prepare.yml
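A quick way to confirm the value before downloading (run from /home/tidb/tidb-ansible; the expected output assumes the default variable name shown above):
$ grep '^tidb_version' inventory.ini
tidb_version = v2.1.0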
Rolling upgrade of TiDB cluster components
Rolling upgrade of all cluster components
$ ansible-playbook rolling_update.yml
To roll-upgrade only a specific part (pd, tikv, tidb, and so on), use the corresponding tag. If you do not know what the tags are, view rolling_update.yml with more and find the values under the tags key.
$ ansible-playbook rolling_update.yml --tags=pd
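If you prefer not to page through the playbook, a grep sketch (assuming the tags are declared under a tags: key in rolling_update.yml, as described above) can list them:
$ grep -n 'tags:' rolling_update.yml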
Verify that the upgrade to 2.1 succeeded. In the check below, Server version: 5.7.10-TiDB-v2.1.4 shows that the cluster has been upgraded to TiDB-v2.1.4:
[tidb@ip-172-16-30-86 tidb-ansible]$ mysql -uroot -p -h172.16.30.86 -P5000
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 40
Server version: 5.7.10-TiDB-v2.1.4 MySQL Community Server (Apache License 2.0)
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql>
Rolling upgrade of TiDB monitoring components
Rolling upgrade of all monitoring components
$ ansible-playbook rolling_update_monitor.yml
To roll-upgrade only a specific part (prometheus, grafana, and so on), use the corresponding tag. If you do not know what the tags are, view rolling_update_monitor.yml with more and find the values under the tags key.
$ ansible-playbook rolling_update_monitor.yml --tags=prometheus
Manually download the binary, then do a rolling upgrade with Ansible
The upgrade method above downloads the binary into downloads directly on the Ansible control machine and then performs the rolling upgrade with Ansible; the method here downloads the binary manually. The official recommendation is the unified Ansible upgrade.
wget http://download.pingcap.org/tidb-{version}-linux-amd64.tar.gz
Besides the method described above for downloading the binary automatically, you can also download the binary manually, extract it, and manually replace the binaries under ${deploy_dir}/resource/bin/; note that you need to replace the version number in the link.
$ wget http://download.pingcap.org/tidb-v2.0.7-linux-amd64.tar.gz
If you use the master branch of tidb-ansible, download the binary with the following command:
$ wget http://download.pingcap.org/tidb-latest-linux-amd64.tar.gz
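A rough sketch of the manual replacement steps, assuming the tarball unpacks into a directory with a bin/ subdirectory (the target path follows the text above; adjust both paths to your actual version and layout):
$ tar -xzf tidb-v2.0.7-linux-amd64.tar.gz
$ cp tidb-v2.0.7-linux-amd64/bin/* ${deploy_dir}/resource/bin/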
Rolling upgrade with Ansible
Rolling upgrade of PD nodes (upgrades only the PD service)
$ ansible-playbook rolling_update.yml --tags=pd
If there are 3 or more PD instances, Ansible migrates the PD leader to another node before shutting down the leader instance during the rolling upgrade.
Rolling upgrade of TiKV nodes (upgrades only the TiKV service)
$ ansible-playbook rolling_update.yml --tags=tikv
When rolling-upgrading a TiKV instance, Ansible first migrates the region leaders to other nodes. Specifically, it calls the PD API to add an evict leader scheduler, probes the leader_count of that TiKV instance every 10 seconds, and waits until leader_count drops below 1 or the probe has run more than 18 times (a three-minute timeout). It then shuts down the TiKV instance, upgrades it, and removes the evict leader scheduler after the instance starts successfully. Instances are processed serially.
If the upgrade fails midway, log in with pd-ctl, run scheduler show, and check whether an evict-leader-scheduler exists; if so, remove it manually. Replace {PD_IP} and {STORE_ID} with your PD IP and the store_id of the TiKV instance.
$ /home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://{PD_IP}:2379"
» scheduler show
[
"label-scheduler",
"evict-leader-scheduler-{STORE_ID}",
"balance-region-scheduler",
"balance-leader-scheduler",
"balance-hot-region-scheduler"
]
» scheduler remove evict-leader-scheduler-{STORE_ID}
Rolling upgrade of TiDB nodes (upgrades only the TiDB service; if binlog is enabled in the TiDB cluster, pump is upgraded along with the TiDB service)
$ ansible-playbook rolling_update.yml --tags=tidb
Rolling upgrade of all services (upgrades the PD, TiKV, and TiDB services in order; if binlog is enabled in the TiDB cluster, pump is upgraded along with the TiDB service)
$ ansible-playbook rolling_update.yml
Rolling upgrade of monitoring components
$ ansible-playbook rolling_update_monitor.yml
Rolling upgrade issue notes
Confirm that tidb_version = v2.1.0 in the tidb-ansible/inventory.ini file before running the command that downloads the TiDB 2.1 binary to the control machine. Make sure tidb_version has been changed; otherwise TiDB cannot start, the rolling upgrade fails, and the cluster stops serving traffic, with errors like the following:
ERROR MESSAGE SUMMARY **********************************************************
[TiDB]: Ansible FAILED! => playbook: rolling_update.yml; TASK: wait until the TiDB port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiDB port 5000 is not up"}
tidb_stderr.log repeatedly reports: flag provided but not defined: -advertise-address
Usage of bin/tidb-server:
-L string
log level: info, debug, warn, error, fatal (default "info")
-P string
tidb server port (default "4000")
-V print version information and exit (default false)
-binlog-socket string
socket file to write binlog
-config string
config file path
-host string
tidb server host (default "0.0.0.0")
-lease string
schema lease duration, very dangerous to change only if you know what you do (default "45s")
-log-file string
log file path
-log-slow-query string
slow query file path
-metrics-addr string
prometheus pushgateway address, leaves it empty will disable prometheus push.
-metrics-interval uint
prometheus client push interval in second, set "0" to disable prometheus push. (default 15)
-path string
tidb storage path (default "/tmp/tidb")
-proxy-protocol-header-timeout uint
proxy protocol header read timeout, unit is second. (default 5)
-proxy-protocol-networks string
proxy protocol networks allowed IP or *, empty mean disable proxy protocol support
-report-status
If enable status report HTTP service. (default true)
-run-ddl
run ddl worker on this tidb-server (default true)
-socket string
The socket file to use for connection.
-status string
tidb server status port (default "10080")
-store string
registered store name, [tikv, mocktikv] (default "mocktikv")
-token-limit int
the limit of concurrent executed sessions (default 1000)
After changing tidb_version in the inventory.ini file, run the command above to download the TiDB 2.1 binary to the control machine, then retry the rolling upgrade of the TiDB cluster components. If you see errors like the following, the rolling upgrade has already failed; before directly retrying the rolling upgrade, you must start the original TiDB cluster first, or otherwise upgrade each component manually.
2019/02/19 11:42:24.198 pd.go:126: [error] updateTS error: rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader
2019/02/19 11:42:24.198 client.go:391: [error] [pd] getTS error: rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader
2019/02/19 11:42:24.201 client.go:212: [info] [pd] leader switches to: http://172.16.30.86:2479, previous: http://172.16.30.88:2479
2019/02/19 11:42:24.724 client.go:391: [error] [pd] getTS error: rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader
2019/02/19 11:42:25.091 client.go:391: [error] [pd] getTS error: rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader
2019/02/19 11:42:47.227 manager.go:287: [info] [stats] ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c watch owner key /tidb/stats/owner/6cbc69036706a283 watcher is closed, no owner
2019/02/19 11:42:47.227 manager.go:287: [info] [ddl] ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c watch owner key /tidb/ddl/fg/owner/6cbc69036706a279 watcher is closed, no owner
2019/02/19 11:42:47.227 manager.go:234: [warning] [ddl] /tidb/ddl/fg/owner ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c isn't the owner
2019/02/19 11:42:47.227 manager.go:234: [warning] [stats] /tidb/stats/owner ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c isn't the owner
2019/02/19 11:42:47.229 domain.go:350: [warning] [ddl] reload schema in loop, schema syncer need rewatch
2019/02/19 11:42:47.229 syncer.go:220: [info] [syncer] watch global schema finished
2019/02/19 11:42:47.229 manager.go:269: [info] [stats] /tidb/stats/owner ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c, owner is 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c
2019/02/19 11:42:47.229 manager.go:269: [info] [ddl] /tidb/ddl/fg/owner ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c, owner is 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c
2019/02/19 11:42:47.234 domain.go:676: [info] [domain] reload privilege success.
2019/02/19 11:42:47.234 domain.go:661: [error] [domain] load privilege loop watch channel closed.
2019/02/19 11:42:56.741 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:56.741 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:56.907 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:56.929 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.027 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.064 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.361 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.435 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.770 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.866 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:59.360 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:59.447 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:00.630 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:00.832 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:01.776 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:02.140 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:03.158 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:03.927 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:04.300 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:05.486 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:05.576 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:06.626 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:07.284 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:08.063 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:08.774 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:08.837 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:08.921 coprocessor.go:686: [info] [TIME_COP_PROCESS] resp_time:12.015128862s txn_start_ts:406466796380749826 region_id:25 store_addr:172.16.30.88:20191 backoff_ms:12162 backoff_types:[tikvRPC,regionMiss,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader]
2019/02/19 11:43:08.922 adapter.go:390: [warning] [SLOW_QUERY] cost_time:12.180696571s backoff_time:12.162s request_count:1 total_keys:1 succ:true con:0 user:<nil> txn_start_ts:406466796380749826 database: table_ids:[19],index_ids:[1],sql:SELECT version, table_id, modify_count, count from mysql.stats_meta where version > 0 order by version
Note: You need to restart the original TiDB cluster first. The start may report that TiDB is not up even though PD and TiKV have started (see the output below); this can be ignored, because TiDB is stateless and the not-yet-started TiDB component does not block the rolling upgrade. Finally, rerun ansible-playbook rolling_update.yml.
TASK [wait for TiDB up] ******************************************************************************************************************************************************************
fatal: [TiDB]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 172.16.30.86 closed.\r\n", "unreachable": true}
to retry, use: --limit @/data/tidb/wentaojin/tidb-ansible-bak/retry_files/start.retry
PLAY RECAP *******************************************************************************************************************************************************************************
PD01 : ok=7 changed=1 unreachable=0 failed=0
PD02 : ok=7 changed=1 unreachable=0 failed=0
PD03 : ok=7 changed=1 unreachable=0 failed=0
TiDB : ok=6 changed=1 unreachable=1 failed=0
TiKV1-1 : ok=9 changed=1 unreachable=0 failed=0
TiKV1-2 : ok=9 changed=1 unreachable=0 failed=0
TiKV2-1 : ok=9 changed=1 unreachable=0 failed=0
TiKV2-2 : ok=9 changed=1 unreachable=0 failed=0
TiKV3-1 : ok=9 changed=1 unreachable=0 failed=0
TiKV3-2 : ok=9 changed=1 unreachable=0 failed=0
altmgr3086 : ok=7 changed=1 unreachable=0 failed=0
grafana3086 : ok=5 changed=0 unreachable=0 failed=0
localhost : ok=1 changed=0 unreachable=0 failed=0
nodeblack3086 : ok=9 changed=2 unreachable=0 failed=0
nodeblack3087 : ok=9 changed=2 unreachable=0 failed=0
nodeblack3088 : ok=9 changed=2 unreachable=0 failed=0
nodeblack3089 : ok=9 changed=2 unreachable=0 failed=0
prometheus3086 : ok=8 changed=1 unreachable=0 failed=0
ERROR MESSAGE SUMMARY ********************************************************************************************************************************************************************
[TiDB]: Ansible UNREACHABLE! => playbook: start.yml; TASK: wait for TiDB up; message: {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 172.16.30.86 closed.\r\n", "unreachable": true}
Ask for help:
Contact us: support@pingcap.com
It seems that you encounter some problems. You can send an email to the above email address, attached with the tidb-ansible/inventory.ini and tidb-ansible/log/ansible.log files and the error message, or new issue on https://github.com/pingcap/tidb-ansible/issues. We'll try our best to help you deploy a TiDB cluster. Thanks. :-)
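Putting the recovery steps above together, a rough sketch (start.yml is the playbook seen in the output above; run both commands from the tidb-ansible directory on the control machine):
$ ansible-playbook start.yml
$ ansible-playbook rolling_update.yml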