
TiDB Cluster Upgrade from 2.0 to 2.1

2019-02-19 18:28 · Original · TiDB
Author: Marvinn

Upgrading TiDB from 2.0 to 2.1

Install Ansible and Its Dependencies on the Control Machine

TiDB-Ansible release-2.1 requires Ansible 2.4.2 or later but below 2.7.0 (ansible>=2.4.2,<2.7.0), plus the Python modules jinja2>=2.9.6 and jmespath>=0.9.0. To make dependency management easier, the new version installs Ansible and its dependencies via pip; see "Install Ansible and its dependencies on the control machine". For offline environments, see "Install Ansible and its dependencies offline on the control machine".
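
A minimal pip invocation matching these constraints might look like the following (a sketch; the official document may pin the versions through a requirements file instead):

$ pip install 'ansible>=2.4.2,<2.7.0' 'jinja2>=2.9.6' 'jmespath>=0.9.0'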

After installation, you can check the versions with the following commands:

$ ansible --version
ansible 2.6.8
$ pip show jinja2
Name: Jinja2
Version: 2.10
$ pip show jmespath
Name: jmespath
Version: 0.9.3

Note: Be sure to install Ansible and its dependencies as described in the documents above. Confirm that the Jinja2 version is correct, otherwise Grafana will fail to start; confirm that the jmespath version is correct, otherwise the rolling upgrade of TiKV will fail.

Download TiDB-Ansible on the Control Machine

Log in to the control machine as the tidb user, go to the /home/tidb directory, and back up the tidb-ansible folder from the TiDB 2.0 or TiDB 2.1 RC deployment:

$ mv tidb-ansible tidb-ansible-bak

Download the latest tidb-ansible release-2.1 branch; the default folder name is tidb-ansible:

$ git clone -b release-2.1 https://github.com/pingcap/tidb-ansible.git

Edit the inventory.ini File and the Configuration Files

Log in to the control machine as the tidb user and go to the /home/tidb/tidb-ansible directory.

Edit the inventory.ini file, or copy the original inventory.ini over directly.

When editing, take the IP information from the backup file /home/tidb/tidb-ansible-bak/inventory.ini.
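
To spot differences between the backed-up inventory and the new template, a plain diff may help (a sketch):

$ diff /home/tidb/tidb-ansible-bak/inventory.ini /home/tidb/tidb-ansible/inventory.ini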


The following variable settings need careful confirmation; see the inventory.ini variable reference for their meanings. A quick verification sketch follows this list.

  1. Confirm that ansible_user is set to a regular user. For unified privilege management, remote installation as the root user is no longer supported. The default configuration uses the tidb user as both the SSH remote user and the user the processes run as.

    ## Connection
    # ssh via normal user
    ansible_user = tidb
    

    See "How to configure SSH mutual trust and sudo rules" to set up mutual trust between hosts automatically.

  2. Keep the process_supervision variable consistent with the previous version; the recommended default is systemd.

    # process supervision, [systemd, supervise]
    process_supervision = systemd
    

    To change it, see "How to switch process supervision from supervise to systemd": switch the supervision method using the backed-up /home/tidb/tidb-ansible-bak/ branch first, then upgrade.
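
A quick way to confirm that both variables match the backup (a sketch):

$ grep -E '^(ansible_user|process_supervision)' /home/tidb/tidb-ansible-bak/inventory.ini
$ grep -E '^(ansible_user|process_supervision)' /home/tidb/tidb-ansible/inventory.ini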

Edit the TiDB Cluster Component Configuration Files (or Copy the Previously Modified Files Under conf Directly)

If you previously customized the TiDB cluster component configuration files, edit the corresponding files under /home/tidb/tidb-ansible/conf with reference to the backups.
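
A recursive diff against the backup shows which conf files carry customizations (a sketch):

$ diff -ru /home/tidb/tidb-ansible-bak/conf /home/tidb/tidb-ansible/conf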

In the TiKV configuration file tikv.yml, the end-point-concurrency parameter has been replaced by three parameters: high-concurrency, normal-concurrency, and low-concurrency:


readpool:
  coprocessor:
    # Notice: if CPU_NUM > 8, default thread pool size for coprocessors
    # will be set to CPU_NUM * 0.8.
    # high-concurrency: 8
    # normal-concurrency: 8
    # low-concurrency: 8

When a single machine runs multiple TiKV instances, these three parameters must be adjusted. The recommended setting is: number of instances × parameter value = number of CPU cores × 0.8.
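
For example, on a hypothetical 40-core host running 2 TiKV instances, 40 × 0.8 / 2 = 16, so each instance's tikv.yml would read (a sketch under that assumed host profile):

readpool:
  coprocessor:
    high-concurrency: 16
    normal-concurrency: 16
    low-concurrency: 16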

Download the TiDB 2.1 Binary to the Control Machine

Confirm that tidb_version = v2.1.0 in the tidb-ansible/inventory.ini file, then run the following command to download the TiDB 2.1 binary to the control machine.
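
A quick pre-flight check of the version variable (a sketch; the troubleshooting section below shows what happens when this step is skipped):

$ grep '^tidb_version' /home/tidb/tidb-ansible/inventory.ini
tidb_version = v2.1.0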

$ ansible-playbook local_prepare.yml

Rolling Upgrade of the TiDB Cluster Components

Rolling upgrade of all cluster components:
$ ansible-playbook rolling_update.yml

To do a rolling upgrade of only one part (pd, tikv, tidb, and so on), use tags. If you do not know the available tags, view rolling_update.yml (for example with more) and find the values of its tags keys.

$ ansible-playbook rolling_update.yml --tags=pd
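
To list the tags without paging through the whole playbook, a grep may suffice (a sketch; tag names and YAML layout vary with the tidb-ansible version, so a line of trailing context is included):

$ grep -n -A 1 'tags:' /home/tidb/tidb-ansible/rolling_update.yml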

Verify that the cluster is now on 2.1. The check below shows the upgrade to TiDB-v2.1.4: Server version: 5.7.10-TiDB-v2.1.4

[tidb@ip-172-16-30-86 tidb-ansible]$ mysql -uroot -p -h172.16.30.86 -P5000
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 40
Server version: 5.7.10-TiDB-v2.1.4 MySQL Community Server (Apache License 2.0)

Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

Rolling Upgrade of the TiDB Monitoring Components

Rolling upgrade of all monitoring components:
$ ansible-playbook rolling_update_monitor.yml

To do a rolling upgrade of only one part (prometheus, grafana, and so on), use tags. If you do not know the available tags, view rolling_update_monitor.yml (for example with more) and find the values of its tags keys.

$ ansible-playbook rolling_update_monitor.yml --tags=prometheus

Manually Download the Binary, Then Do the Rolling Upgrade with Ansible

In the upgrade method above, the Ansible control machine downloads the binaries into downloads automatically and then performs the rolling upgrade with Ansible. The method here downloads them manually; the unified Ansible upgrade is what the official documentation recommends.

Besides the method described in "Automatically download the binary", you can also download the binary manually, extract it, and replace the binaries under ${deploy_dir}/resource/bin/ by hand. Note the version number in the download link:

wget http://download.pingcap.org/tidb-{version}-linux-amd64.tar.gz

$ wget http://download.pingcap.org/tidb-v2.0.7-linux-amd64.tar.gz
If you use the master branch of tidb-ansible, download the binary with the following command:
$ wget http://download.pingcap.org/tidb-latest-linux-amd64.tar.gz
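
A minimal sketch of the manual replacement, assuming the archive unpacks into a directory named after the version and that ${deploy_dir} points at the deployment directory (both assumptions; adjust to your layout):

$ tar xzf tidb-v2.0.7-linux-amd64.tar.gz
$ cp tidb-v2.0.7-linux-amd64/bin/* ${deploy_dir}/resource/bin/
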
Rolling upgrade with Ansible

Rolling upgrade of the PD nodes (upgrades only the PD service):
$ ansible-playbook rolling_update.yml --tags=pd
If there are 3 or more PD instances, Ansible migrates the PD leader to another node before shutting down the leader instance during the rolling upgrade.
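
To watch which node holds the PD leader while this happens, pd-ctl can show it (a sketch; replace {PD_IP} with your PD address, and note the exact subcommand may differ across pd-ctl versions):

$ /home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://{PD_IP}:2379"
» member leader show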

Rolling upgrade of the TiKV nodes (upgrades only the TiKV service):
$ ansible-playbook rolling_update.yml --tags=tikv
When rolling-upgrading a TiKV instance, Ansible first migrates the region leaders to other nodes. Concretely: it calls the PD API to add an evict-leader scheduler, probes the instance's leader_count every 10 seconds, and waits until leader_count drops below 1 or 18 probes have elapsed (a three-minute timeout); it then shuts the TiKV down for the upgrade, and once the instance restarts successfully it removes the evict-leader scheduler. Instances are processed serially.
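
The leader_count being polled can also be checked by hand through the PD HTTP API (a sketch; {PD_IP} and {STORE_ID} are placeholders, and the counter appears under the store's status in the response):

$ curl "http://{PD_IP}:2379/pd/api/v1/store/{STORE_ID}"
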
If the upgrade fails midway, log in with pd-ctl and run scheduler show to check whether an evict-leader-scheduler is left behind; if so, remove it manually. Replace {PD_IP} and {STORE_ID} with your PD IP and the TiKV instance's store_id.
$ /home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://{PD_IP}:2379"
» scheduler show
[
  "label-scheduler",
  "evict-leader-scheduler-{STORE_ID}",
  "balance-region-scheduler",
  "balance-leader-scheduler",
  "balance-hot-region-scheduler"
]
» scheduler remove evict-leader-scheduler-{STORE_ID}

Rolling upgrade of the TiDB nodes (upgrades only the TiDB service; if binlog is enabled in the TiDB cluster, pump is upgraded along with the TiDB service):
$ ansible-playbook rolling_update.yml --tags=tidb

Rolling upgrade of all services (upgrades PD, TiKV, and TiDB in order; if binlog is enabled in the TiDB cluster, pump is upgraded along with the TiDB service):
$ ansible-playbook rolling_update.yml

Rolling upgrade of the monitoring components:
$ ansible-playbook rolling_update_monitor.yml

Rolling Upgrade Issues Encountered

Confirm that tidb_version = v2.1.0 in tidb-ansible/inventory.ini before running the command that downloads the TiDB 2.1 binary to the control machine. Make absolutely sure tidb_version has been changed; otherwise TiDB will fail to start, the rolling upgrade will fail, and the cluster will stop serving. The error looks like this:

ERROR MESSAGE SUMMARY **********************************************************
[TiDB]: Ansible FAILED! => playbook: rolling_update.yml; TASK: wait until the TiDB port is up; message: {"changed": false, "elapsed": 300, "msg": "the TiDB port 5000 is not up"}

tidb_stderr.log logs the following error in a loop: flag provided but not defined: -advertise-address

Usage of bin/tidb-server:
  -L string
        log level: info, debug, warn, error, fatal (default "info")
  -P string
        tidb server port (default "4000")
  -V    print version information and exit (default false)
  -binlog-socket string
        socket file to write binlog
  -config string
        config file path
  -host string
        tidb server host (default "0.0.0.0")
  -lease string
        schema lease duration, very dangerous to change only if you know what you do (default "45s")
  -log-file string
        log file path
  -log-slow-query string
        slow query file path
  -metrics-addr string
        prometheus pushgateway address, leaves it empty will disable prometheus push.
  -metrics-interval uint
        prometheus client push interval in second, set "0" to disable prometheus push. (default 15)
  -path string
        tidb storage path (default "/tmp/tidb")
  -proxy-protocol-header-timeout uint
        proxy protocol header read timeout, unit is second. (default 5)
  -proxy-protocol-networks string
        proxy protocol networks allowed IP or *, empty mean disable proxy protocol support
  -report-status
        If enable status report HTTP service. (default true)
  -run-ddl
        run ddl worker on this tidb-server (default true)
  -socket string
        The socket file to use for connection.
  -status string
        tidb server status port (default "10080")
  -store string
        registered store name, [tikv, mocktikv] (default "mocktikv")
  -token-limit int
        the limit of concurrent executed sessions (default 1000)
flag provided but not defined: -advertise-address

After changing tidb_version in inventory.ini, run the download command again to fetch the TiDB 2.1 binary to the control machine, then redo the rolling upgrade of the cluster components. If errors like the following appear, the rolling upgrade has already failed; to retry it you must first bring the original TiDB cluster back up, or else upgrade each component manually.


2019/02/19 11:42:24.198 pd.go:126: [error] updateTS error: rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader
2019/02/19 11:42:24.198 client.go:391: [error] [pd] getTS error: rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader
2019/02/19 11:42:24.201 client.go:212: [info] [pd] leader switches to: http://172.16.30.86:2479, previous: http://172.16.30.88:2479
2019/02/19 11:42:24.724 client.go:391: [error] [pd] getTS error: rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader
2019/02/19 11:42:25.091 client.go:391: [error] [pd] getTS error: rpc error: code = Unknown desc = rpc error: code = Unavailable desc = not leader
2019/02/19 11:42:47.227 manager.go:287: [info] [stats] ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c watch owner key /tidb/stats/owner/6cbc69036706a283 watcher is closed, no owner
2019/02/19 11:42:47.227 manager.go:287: [info] [ddl] ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c watch owner key /tidb/ddl/fg/owner/6cbc69036706a279 watcher is closed, no owner
2019/02/19 11:42:47.227 manager.go:234: [warning] [ddl] /tidb/ddl/fg/owner ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c isn't the owner
2019/02/19 11:42:47.227 manager.go:234: [warning] [stats] /tidb/stats/owner ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c isn't the owner
2019/02/19 11:42:47.229 domain.go:350: [warning] [ddl] reload schema in loop, schema syncer need rewatch
2019/02/19 11:42:47.229 syncer.go:220: [info] [syncer] watch global schema finished
2019/02/19 11:42:47.229 manager.go:269: [info] [stats] /tidb/stats/owner ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c, owner is 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c
2019/02/19 11:42:47.229 manager.go:269: [info] [ddl] /tidb/ddl/fg/owner ownerManager 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c, owner is 22c7f9a9-9947-4ca1-ae0f-b0d255311a4c
2019/02/19 11:42:47.234 domain.go:676: [info] [domain] reload privilege success.
2019/02/19 11:42:47.234 domain.go:661: [error] [domain] load privilege loop watch channel closed.
2019/02/19 11:42:56.741 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:56.741 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:56.907 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:56.929 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.027 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.064 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.361 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.435 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.770 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:57.866 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:59.360 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:42:59.447 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:00.630 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:00.832 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:01.776 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:02.140 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:03.158 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:03.927 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:04.300 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:05.486 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:05.576 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:06.626 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:07.284 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:08.063 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:08.774 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:08.837 region_cache.go:470: [info] drop regions of store 1 from cache due to request fail, err: rpc error: code = Unavailable desc = grpc: the connection is unavailable
2019/02/19 11:43:08.921 coprocessor.go:686: [info] [TIME_COP_PROCESS] resp_time:12.015128862s txn_start_ts:406466796380749826 region_id:25 store_addr:172.16.30.88:20191 backoff_ms:12162 backoff_types:[tikvRPC,regionMiss,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader,tikvRPC,updateLeader]
2019/02/19 11:43:08.922 adapter.go:390: [warning] [SLOW_QUERY] cost_time:12.180696571s backoff_time:12.162s request_count:1 total_keys:1 succ:true con:0 user:<nil> txn_start_ts:406466796380749826 database: table_ids:[19],index_ids:[1],sql:SELECT version, table_id, modify_count, count from mysql.stats_meta where version > 0 order by version

Note: The original TiDB cluster must be restarted. The restart may report that TiDB did not start even though PD and TiKV did; this can be ignored, because TiDB is stateless and only serves requests passively, and it does not block redoing the rolling upgrade. Finally, rerun ansible-playbook rolling_update.yml.
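
Concretely, the recovery sequence is roughly the following (a sketch; start.yml is the standard tidb-ansible start playbook, and a TiDB-only failure such as the output below can be ignored):

$ ansible-playbook start.yml
$ ansible-playbook rolling_update.yml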

TASK [wait for TiDB up] ******************************************************************************************************************************************************************
fatal: [TiDB]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 172.16.30.86 closed.\r\n", "unreachable": true}
        to retry, use: --limit @/data/tidb/wentaojin/tidb-ansible-bak/retry_files/start.retry

PLAY RECAP *******************************************************************************************************************************************************************************
PD01                       : ok=7    changed=1    unreachable=0    failed=0   
PD02                       : ok=7    changed=1    unreachable=0    failed=0   
PD03                       : ok=7    changed=1    unreachable=0    failed=0   
TiDB                       : ok=6    changed=1    unreachable=1    failed=0   
TiKV1-1                    : ok=9    changed=1    unreachable=0    failed=0   
TiKV1-2                    : ok=9    changed=1    unreachable=0    failed=0   
TiKV2-1                    : ok=9    changed=1    unreachable=0    failed=0   
TiKV2-2                    : ok=9    changed=1    unreachable=0    failed=0   
TiKV3-1                    : ok=9    changed=1    unreachable=0    failed=0   
TiKV3-2                    : ok=9    changed=1    unreachable=0    failed=0   
altmgr3086                 : ok=7    changed=1    unreachable=0    failed=0   
grafana3086                : ok=5    changed=0    unreachable=0    failed=0   
localhost                  : ok=1    changed=0    unreachable=0    failed=0   
nodeblack3086              : ok=9    changed=2    unreachable=0    failed=0   
nodeblack3087              : ok=9    changed=2    unreachable=0    failed=0   
nodeblack3088              : ok=9    changed=2    unreachable=0    failed=0   
nodeblack3089              : ok=9    changed=2    unreachable=0    failed=0   
prometheus3086             : ok=8    changed=1    unreachable=0    failed=0   

ERROR MESSAGE SUMMARY ********************************************************************************************************************************************************************
[TiDB]: Ansible UNREACHABLE! => playbook: start.yml; TASK: wait for TiDB up; message: {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 172.16.30.86 closed.\r\n", "unreachable": true}
Ask for help:
Contact us: support@pingcap.com
It seems that you encounter some problems. You can send an email to the above email address, attached with the tidb-ansible/inventory.ini and tidb-ansible/log/ansible.log files and the error message, or new issue on https://github.com/pingcap/tidb-ansible/issues. We'll try our best to help you deploy a TiDB cluster. Thanks. :-)

Copyright notice: This is an original article by the author; reproduction without the author's permission is prohibited.
