
Recovering an Unavailable TiDB PD Cluster

2019-03-05 19:33 · Original · TiDB
Author: Marvinn

Use case

When more than half of the PD nodes (or all of them) are unavailable, the PD cluster has to be rebuilt with pd-recover.

pd-recover recovery steps

1. Stop two of the three PD nodes (or all of them); the remaining node can no longer provide service.

$ ansible-playbook stop.yml -l PD03
$ ansible-playbook stop.yml -l PD02

pd.log

The surviving node's pd.log keeps starting new elections but never manages to elect a leader:

2019/03/05 15:33:26.855 raft.go:857: [info] 152210ed6a66ca42 is starting a new election at term 17
2019/03/05 15:33:26.855 raft.go:684: [info] 152210ed6a66ca42 became pre-candidate at term 17
2019/03/05 15:33:26.855 raft.go:755: [info] 152210ed6a66ca42 received MsgPreVoteResp from 152210ed6a66ca42 at term 17
2019/03/05 15:33:26.855 raft.go:742: [info] 152210ed6a66ca42 [logterm: 17, index: 1082742] sent MsgPreVote request to f925a8ed690ecbc at term 17
2019/03/05 15:33:26.855 raft.go:742: [info] 152210ed6a66ca42 [logterm: 17, index: 1082742] sent MsgPreVote request to 4c461b0861cbd30b at term 17
2019/03/05 15:33:27.821 log.go:84: [warning] etcdserver: [timed out waiting for read index response]
2019/03/05 15:33:28.506 log.go:84: [warning] etcdserver: [read-only range request "key:/"/tidb-binlog/v1/pumps//" range_end:/"/tidb-binlog/v1/pumps0/" " took too long (5.000031308s) to execute]
2019/03/05 15:33:29.551 log.go:84: [warning] etcdserver: [read-only range request "key:/"/tidb/store/gcworker/saved_safe_point/" " took too long (4.410328359s) to execute]
2019/03/05 15:33:29.887 log.go:84: [warning] rafthttp: [health check for peer f925a8ed690ecbc could not connect: dial tcp 172.16.30.88:2480: connect: connection refused]
2019/03/05 15:33:29.889 log.go:84: [warning] rafthttp: [health check for peer 4c461b0861cbd30b could not connect: dial tcp 172.16.30.87:2480: connect: connection refused]
2019/03/05 15:33:30.355 raft.go:857: [info] 152210ed6a66ca42 is starting a new election at term 17
2019/03/05 15:33:30.355 raft.go:684: [info] 152210ed6a66ca42 became pre-candidate at term 17
2019/03/05 15:33:30.355 raft.go:755: [info] 152210ed6a66ca42 received MsgPreVoteResp from 152210ed6a66ca42 at term 17
2019/03/05 15:33:30.355 raft.go:742: [info] 152210ed6a66ca42 [logterm: 17, index: 1082742] sent MsgPreVote request to f925a8ed690ecbc at term 17
2019/03/05 15:33:30.355 raft.go:742: [info] 152210ed6a66ca42 [logterm: 17, index: 1082742] sent MsgPreVote request to 4c461b0861cbd30b at term 17
2019/03/05 15:33:33.855 raft.go:857: [info] 152210ed6a66ca42 is starting a new election at term 17
2019/03/05 15:33:33.855 raft.go:684: [info] 152210ed6a66ca42 became pre-candidate at term 17
2019/03/05 15:33:33.855 raft.go:755: [info] 152210ed6a66ca42 received MsgPreVoteResp from 152210ed6a66ca42 at term 17
2019/03/05 15:33:33.855 raft.go:742: [info] 152210ed6a66ca42 [logterm: 17, index: 1082742] sent MsgPreVote request to f925a8ed690ecbc at term 17
2019/03/05 15:33:33.855 raft.go:742: [info] 152210ed6a66ca42 [logterm: 17, index: 1082742] sent MsgPreVote request to 4c461b0861cbd30b at term 17
2019/03/05 15:33:34.887 log.go:84: [warning] rafthttp: [health check for peer f925a8ed690ecbc could not connect: dial tcp 172.16.30.88:2480: connect: connection refused]
2019/03/05 15:33:34.889 log.go:84: [warning] rafthttp: [health check for peer 4c461b0861cbd30b could not connect: dial tcp 172.16.30.87:2480: connect: connection refused]
2019/03/05 15:33:35.552 log.go:84: [warning] etcdserver: [read-only range request "key:/"/tidb/store/gcworker/saved_safe_point/" " took too long (410.522646ms) to execute]
2019/03/05 15:33:35.552 log.go:84: [warning] etcdserver: [read-only range request "key:/"/tidb/store/gcworker/saved_safe_point/" " took too long (2.55356525s) to execute]
2019/03/05 15:33:35.552 log.go:84: [warning] etcdserver: [read-only range request "key:/"/tidb/store/gcworker/saved_safe_point/" " took too long (2.416047557s) to execute]
2019/03/05 15:33:35.552 log.go:84: [warning] etcdserver: [read-only range request "key:/"/tidb/store/gcworker/saved_safe_point/" " took too long (2.416873303s) to execute]

tikv.log

tikv.log shows TiKV continuously trying to connect to PD, but every request hits the PD deadline and is retried:

2019/03/05 15:38:41.741 ERRO util.rs:297: request failed: Grpc(RpcFailure(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded") })), retry
2019/03/05 15:38:41.741 WARN util.rs:243: updating PD client, block the tokio core
2019/03/05 15:38:41.741 INFO util.rs:406: connecting to PD endpoint: "http://172.16.30.86:2479"
2019/03/05 15:38:41.742 INFO util.rs:406: connecting to PD endpoint: "http://172.16.30.88:2479"
2019/03/05 15:38:42.743 ERRO util.rs:297: request failed: Grpc(RpcFailure(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded") })), retry
2019/03/05 15:38:42.743 WARN util.rs:243: updating PD client, block the tokio core
2019/03/05 15:38:42.743 INFO util.rs:406: connecting to PD endpoint: "http://172.16.30.86:2479"
2019/03/05 15:38:42.744 INFO util.rs:406: connecting to PD endpoint: "http://172.16.30.88:2479"
2019/03/05 15:38:42.831 ERRO util.rs:297: request failed: Grpc(RpcFailure(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded") })), retry
2019/03/05 15:38:43.745 ERRO util.rs:297: request failed: Grpc(RpcFailure(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded") })), retry
2019/03/05 15:38:43.745 WARN util.rs:243: updating PD client, block the tokio core
2019/03/05 15:38:43.745 INFO util.rs:406: connecting to PD endpoint: "http://172.16.30.86:2479"
2019/03/05 15:38:43.746 INFO util.rs:406: connecting to PD endpoint: "http://172.16.30.88:2479"
2019/03/05 15:38:44.747 ERRO util.rs:297: request failed: Grpc(RpcFailure(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded") })), retry
2019/03/05 15:38:44.747 WARN util.rs:243: updating PD client, block the tokio core
2019/03/05 15:38:44.747 INFO util.rs:406: connecting to PD endpoint: "http://172.16.30.86:2479"
2019/03/05 15:38:44.748 INFO util.rs:406: connecting to PD endpoint: "http://172.16.30.88:2479"
2019/03/05 15:38:44.832 ERRO util.rs:297: request failed: Grpc(RpcFailure(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded") })), retry
2019/03/05 15:38:45.749 ERRO util.rs:297: request failed: Grpc(RpcFailure(RpcStatus { status: DeadlineExceeded, details: Some("Deadline Exceeded") })), retry
2019/03/05 15:38:45.749 WARN util.rs:243: updating PD client, block the tokio core

pd-ctl command line

Connecting with pd-ctl shows that the cluster can still serve requests while two PD nodes are alive; once only one node is left, requests fail with [500] redirect failed:

[tidb@ip-172-16-30-86 bin]$ ./pd-ctl -u http://172.16.30.86:2479
» health
[
  {
    "name": "pd3",
    "member_id": 1122058826700352700,
    "client_urls": [
      "http://172.16.30.88:2479"
    ],
    "health": true
  },
  {
    "name": "pd1",
    "member_id": 1522798235883063874,
    "client_urls": [
      "http://172.16.30.86:2479"
    ],
    "health": true
  },
  {
    "name": "pd5",
    "member_id": 5496110118066705163,
    "client_urls": [
      "http://172.16.30.87:2479"
    ],
    "health": false
  }
]

» member
{
  "header": {
    "cluster_id": 6659519785752570180
  },
  "members": [
    {
      "name": "pd3",
      "member_id": 1122058826700352700,
      "peer_urls": [
        "http://172.16.30.88:2480"
      ],
      "client_urls": [
        "http://172.16.30.88:2479"
      ]
    },
    {
      "name": "pd1",
      "member_id": 1522798235883063874,
      "peer_urls": [
        "http://172.16.30.86:2480"
      ],
      "client_urls": [
        "http://172.16.30.86:2479"
      ]
    },
    {
      "name": "pd5",
      "member_id": 5496110118066705163,
      "peer_urls": [
        "http://172.16.30.87:2480"
      ],
      "client_urls": [
        "http://172.16.30.87:2479"
      ]
    }
  ],
  "leader": {
    "name": "pd3",
    "member_id": 1122058826700352700,
    "peer_urls": [
      "http://172.16.30.88:2480"
    ],
    "client_urls": [
      "http://172.16.30.88:2479"
    ]
  },
  "etcd_leader": {
    "name": "pd3",
    "member_id": 1122058826700352700,
    "peer_urls": [
      "http://172.16.30.88:2480"
    ],
    "client_urls": [
      "http://172.16.30.88:2479"
    ]
  }
}

» health
[500] redirect failed


2. Obtain the Cluster ID from the PD log

Find the cluster's Cluster ID and Alloc ID. The Cluster ID can usually be obtained from the PD, TiKV, or TiDB logs.
The Alloc IDs that have already been handed out can be read from the PD log, or from the Metadata Information panel of the PD monitoring dashboard.
When specifying alloc-id, pick a value larger than the current maximum Alloc ID.
If there is no way to obtain the Alloc ID, estimate a sufficiently large number from the count of Regions and Stores in the cluster, typically a few orders of magnitude higher; for example, with the latest allocated id of 1000 shown below, this article uses 1008 in step 6, and any comfortably larger value would also work.

Open pd.log (e.g. with vi) and search for the cluster id:
server.go:205: [info] init cluster id 6659519785752570180
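Alternatively, a grep one-liner pulls the same line without opening the log (a sketch, assuming the deploy path shown in run_pd.sh later in this article):

$ grep "init cluster id" /data/tidb/deploy_tidb/pd/log/pd.log | tail -n 1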

3. Obtain the largest allocated ID (Alloc ID) from the PD log

Open pd.log and search for "allocates":

2019/02/19 09:37:58.915 server.go:201: [info] init cluster id 6659519785752570180
2019/02/19 09:37:58.919 tso.go:104: [info] sync and save timestamp: last 0001-01-01 00:00:00 +0000 UTC save 2019-02-19 09:38:01.918721307 +0800 CST m=+4.521776344
2019/02/19 09:37:58.919 leader.go:269: [info] PD cluster leader pd3 is ready to serve
2019/02/19 09:38:01.732 id.go:90: [info] idAllocator allocates a new id: 1000
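Since IDs are handed out monotonically, the last matching line gives the value the chosen -alloc-id must exceed (a grep sketch, same assumed path as above):

$ grep "allocates a new id" /data/tidb/deploy_tidb/pd/log/pd.log | tail -n 1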

4. Delete the data-dir (the data.pd directory) of the PD node chosen for the rebuild


[tidb@ip-172-16-30-87 pd]# pwd
/data/tidb/deploy_tidb/pd/
[tidb@ip-172-16-30-87 pd]# rm -rf data.pd/
[tidb@ip-172-16-30-87 pd]# ls
backup  conf        default.pd4  log        scripts
bin     default.pd  default.pd5  nohup.out  status

5. Start one or more PD nodes (or redeploy them on new machines); here one node from the original PD cluster is reused to rebuild PD

In run_pd.sh, set
--initial-cluster="pd1=http://172.16.30.86:2480,pd2=http://172.16.30.87:2480,pd3=http://172.16.30.88:2480"
and remove the --join line; the two flags cannot be used at the same time (the exact error is shown below).
Then run start_pd.sh; the log prints [info] init cluster id 6637221545670907823, i.e. a brand-new cluster id is generated.

[tidb@ip-172-16-30-87 pd]# cd scripts/
[tidb@ip-172-16-30-87 scripts]# vi run_pd.sh 
#!/bin/bash
set -e
ulimit -n 1000000

# WARNING: This file was auto-generated. Do not edit!
#          All your edit might be overwritten!
DEPLOY_DIR=/data/tidb/deploy_tidb/pd

cd "${DEPLOY_DIR}" || exit 1

exec bin/pd-server \
    --name="pd2" \
    --client-urls="http://172.16.30.87:2479" \
    --advertise-client-urls="http://172.16.30.87:2479" \
    --peer-urls="http://172.16.30.87:2480" \
    --advertise-peer-urls="http://172.16.30.87:2480" \
    --data-dir="/data/tidb/deploy_tidb/pd/data.pd" \
    --initial-cluster="pd2=http://172.16.30.87:2480,pd1=http://172.16.30.86:2480,pd3=http://172.16.30.88:2480" \
    --join="http://172.16.30.87:2479" \
    --config=conf/pd.toml \
    --log-file="/data/tidb/deploy_tidb/pd/log/pd.log" 2>> "/data/tidb/deploy_tidb/pd/log/pd_stderr.log"


[tidb@ip-172-16-30-87 scripts]$ sh start_pd.sh 
ok: started!

Error:
[tidb@ip-172-16-30-87 log]$ tail -20f pd_stderr.log
Error: -initial-cluster and -join can not be provided at the same time
time="2019-03-05T16:24:33+08:00" level=fatal msg="parse cmd flags error: -initial-cluster and -join can not be provided at the same time/ngithub.com/pingcap/pd/server.(*Config).validate/n/t/home/jenkins/workspace/build_pd_2.1/go/src/github.com/pingcap/pd/server/config.go:267/ngithub.com/pingcap/pd/server.(*Config).Adjust/n/t/home/jenkins/workspace/build_pd_2.1/go/src/github.com/pingcap/pd/server/config.go:340/ngithub.com/pingcap/pd/server.(*Config).Parse/n/t/home/jenkins/workspace/build_pd_2.1/go/src/github.com/pingcap/pd/server/config.go:261/nmain.main/n/t/home/jenkins/workspace/build_pd_2.1/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:40/nruntime.main/n/t/usr/local/go/src/runtime/proc.go:201/nruntime.goexit/n/t/usr/local/go/src/runtime/asm_amd64.s:1333/n"

Edit run_pd.sh again and remove the --join line:
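If you prefer a one-liner, the --join flag can also be dropped with sed (a sketch; it assumes --join sits on its own continued line exactly as in the script above, so the surrounding backslash continuations stay intact):

$ sed -i '/--join=/d' run_pd.sh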

[tidb@ip-172-16-30-87 scripts]# vi run_pd.sh 
#!/bin/bash
set -e
ulimit -n 1000000

# WARNING: This file was auto-generated. Do not edit!

# All your edit might be overwritten!

DEPLOY_DIR=/data/tidb/deploy_tidb/pd

cd "${DEPLOY_DIR}" || exit 1

exec bin/pd-server \
    --name="pd2" \
    --client-urls="http://172.16.30.87:2479" \
    --advertise-client-urls="http://172.16.30.87:2479" \
    --peer-urls="http://172.16.30.87:2480" \
    --advertise-peer-urls="http://172.16.30.87:2480" \
    --data-dir="/data/tidb/deploy_tidb/pd/data.pd" \
    --initial-cluster="pd2=http://172.16.30.87:2480,pd1=http://172.16.30.86:2480,pd3=http://172.16.30.88:2480" \
    --config=conf/pd.toml \
    --log-file="/data/tidb/deploy_tidb/pd/log/pd.log" 2>> "/data/tidb/deploy_tidb/pd/log/pd_stderr.log"

[tidb@ip-172-16-30-87 scripts]$ sh start_pd.sh 
ok: started!

Check whether the run_pd.sh and pd-server processes have started:
[tidb@ip-172-16-30-87 scripts]$ ps -ef|grep run_pd.sh 
tidb_le+  70017      1  0 17:21 ?        00:00:00 bin/supervise status/pd /data2/leifu/deploy_tidb/pd/scripts/run_pd.sh
tidb     124649      1  0 18:22 pts/0    00:00:00 bin/supervise status/pd /data/tidb/deploy_tidb/pd/scripts/run_pd.sh


[tidb@ip-172-16-30-87 scripts]$ ps -ef|grep pd-server

tidb     124650 124649  3 18:22 pts/0    00:00:00 bin/pd-server --name=pd2 --client-urls=http://172.16.30.87:2479 --advertise-client-urls=http://172.16.30.87:2479 --peer-urls=http://172.16.30.87:2480 --advertise-peer-urls=http://172.16.30.87:2480 --data-dir=/data/tidb/deploy_tidb/pd/data.pd --initial-cluster=pd2=http://172.16.30.87:2480 --config=conf/pd.toml --log-file=/data/tidb/deploy_tidb/pd/log/pd.log
tidb     125630 124060  0 18:22 pts/0    00:00:00 grep --color=auto pd-server

pd.log now shows the new cluster_id:

2019/03/05 18:27:01.087 server.go:205: [info] init cluster id 6664850091666764736


2019/03/05 18:27:01.080 raft.go:857: [info] 581353672014e5e6 is starting a new election at term 2
2019/03/05 18:27:01.080 raft.go:684: [info] 581353672014e5e6 became pre-candidate at term 2
2019/03/05 18:27:01.080 raft.go:755: [info] 581353672014e5e6 received MsgPreVoteResp from 581353672014e5e6 at term 2
2019/03/05 18:27:01.080 raft.go:669: [info] 581353672014e5e6 became candidate at term 3
2019/03/05 18:27:01.080 raft.go:755: [info] 581353672014e5e6 received MsgVoteResp from 581353672014e5e6 at term 3
2019/03/05 18:27:01.080 raft.go:712: [info] 581353672014e5e6 became leader at term 3
2019/03/05 18:27:01.080 node.go:306: [info] raft.node: 581353672014e5e6 elected leader 581353672014e5e6 at term 3
2019/03/05 18:27:01.081 server.go:166: [info] create etcd v3 client with endpoints [http://172.16.30.87:2479]
2019/03/05 18:27:01.081 log.go:88: [info] etcdserver: [published {Name:pd2 ClientURLs:[http://172.16.30.87:2479]} to cluster 5c45ad6fda8ad27c]
2019/03/05 18:27:01.081 log.go:88: [info] embed: [ready to serve client requests]
2019/03/05 18:27:01.083 log.go:86: [info] embed: [serving insecure client requests on 172.16.30.87:2479, this is strongly discouraged!]
2019/03/05 18:27:01.087 server.go:205: [info] init cluster id 6664850091666764736
2019/03/05 18:27:01.088 namespace_classifier.go:438: [info] load 0 namespacesInfo cost 1.21073ms
2019/03/05 18:27:01.089 leader.go:95: [warning] leader is still name:"pd2" member_id:6346508002280138214 peer_urls:"http://172.16.30.87:2480" client_urls:"http://172.16.30.87:2479" , delete and campaign again
2019/03/05 18:27:01.091 tso.go:104: [info] sync and save timestamp: last 2019-03-05 18:25:52.409239341 +0800 CST save 2019-03-05 18:27:04.091126462 +0800 CST m=+7.525554246 next 2019-03-05 18:27:01.091126462 +0800 CST m=+4.525554246
2019/03/05 18:27:01.091 leader.go:263: [info] cluster version is 0.0.0
2019/03/05 18:27:01.091 leader.go:264: [info] PD cluster leader pd2 is ready to serve

6. Run pd-recover (on the tidb-ansible control machine)
The alloc-id passed to pd-recover must be larger than any id already allocated in the original cluster.


[tidb@ip-172-16-30-86 tidb-ansible]$  ./resources/bin/pd-recover -endpoints http://172.16.30.87:2479 -alloc-id 1008 -cluster-id 6659519785752570180
recover success! please restart the PD cluster

[tidb@ip-172-16-30-86 tidb-ansible]$ ansible-playbook stop.yml -l PD02
[tidb@ip-172-16-30-86 tidb-ansible]$ ansible-playbook start.yml -l PD02

Restart this PD node; it now resumes the original cluster id (6659519785752570180). The pd.log from the restart:

2019/03/05 18:32:16.628 server.go:205: [info] init cluster id 6659519785752570180

2019/03/05 18:32:16.623 raft.go:712: [info] 581353672014e5e6 became leader at term 4
2019/03/05 18:32:16.623 node.go:306: [info] raft.node: 581353672014e5e6 elected leader 581353672014e5e6 at term 4
2019/03/05 18:32:16.623 server.go:166: [info] create etcd v3 client with endpoints [http://172.16.30.87:2479]
2019/03/05 18:32:16.623 log.go:88: [info] etcdserver: [published {Name:pd2 ClientURLs:[http://172.16.30.87:2479]} to cluster 5c45ad6fda8ad27c]
2019/03/05 18:32:16.623 log.go:88: [info] embed: [ready to serve client requests]
2019/03/05 18:32:16.624 log.go:86: [info] embed: [serving insecure client requests on 172.16.30.87:2479, this is strongly discouraged!]
2019/03/05 18:32:16.628 server.go:205: [info] init cluster id 6659519785752570180
2019/03/05 18:32:16.629 namespace_classifier.go:438: [info] load 0 namespacesInfo cost 313.167µs
2019/03/05 18:32:16.631 cluster_info.go:71: [info] load 0 stores cost 231.421µs
2019/03/05 18:32:16.631 cluster_info.go:77: [info] load 0 regions cost 213.281µs
2019/03/05 18:32:16.631 namespace_classifier.go:438: [info] load 0 namespacesInfo cost 172.677µs
2019/03/05 18:32:16.631 coordinator.go:208: [info] coordinator: Start collect cluster information
2019/03/05 18:32:16.631 coordinator.go:211: [info] coordinator: Cluster information is prepared
2019/03/05 18:32:16.631 coordinator.go:220: [info] coordinator: Run scheduler
2019/03/05 18:32:16.632 coordinator.go:235: [info] create scheduler balance-region-scheduler
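Optionally, before restarting TiKV and TiDB, the recovered cluster id can also be confirmed from pd-ctl (the same member command used in step 9, filtered with grep); it should report "cluster_id": 6659519785752570180:

[tidb@ip-172-16-30-86 tidb-ansible]$ ./resources/bin/pd-ctl -u "http://172.16.30.87:2479" -d member | grep cluster_id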

7. Restart the TiKV and TiDB clusters

[tidb@ip-172-16-30-86 tidb-ansible]$ ansible-playbook start.yml --tags=tidb,tikv

8. Join the other two PD nodes back into the PD cluster: in each node's run_pd.sh, replace the --initial-cluster flag with --join="http://172.16.30.87:2479"
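The swap can be done by hand on each node or, as a sketch, with sed (assuming the --initial-cluster line is written exactly as in the scripts in this article; the trailing line-continuation backslash is kept by the replacement):

$ sed -i 's|--initial-cluster=.*|--join="http://172.16.30.87:2479" \\|' run_pd.sh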

Second PD node:

The data-dir (data.pd directory) of this node also needs to be deleted; PD will recreate it on startup. If it is not removed, PD reports errors (the exact messages are shown at the end of this article).
[tidb@ip-172-16-30-88 scripts]$ cd ..
[tidb@ip-172-16-30-88 pd]$ ls
backup  bin  conf  data.pd  log  scripts  status
[tidb@ip-172-16-30-88 pd]$ rm -rf data.pd/

Edit run_pd.sh:
[tidb@ip-172-16-30-88 pd]$ cd scripts/
[tidb@ip-172-16-30-88 scripts]$ cat run_pd.sh 
#!/bin/bash
set -e
ulimit -n 1000000

# WARNING: This file was auto-generated. Do not edit!

# All your edit might be overwritten!

DEPLOY_DIR=/data/tidb/deploy_tidb/pd

cd "${DEPLOY_DIR}" || exit 1

exec bin/pd-server \
    --name="pd3" \
    --client-urls="http://172.16.30.88:2479" \
    --advertise-client-urls="http://172.16.30.88:2479" \
    --peer-urls="http://172.16.30.88:2480" \
    --advertise-peer-urls="http://172.16.30.88:2480" \
    --data-dir="/data/tidb/deploy_tidb/pd/data.pd" \
    --join="http://172.16.30.87:2479" \
    --config=conf/pd.toml \
    --log-file="/data/tidb/deploy_tidb/pd/log/pd.log" 2>> "/data/tidb/deploy_tidb/pd/log/pd_stderr.log"


[tidb@ip-172-16-30-88 scripts]$ sh stop_pd.sh
sync ... done!
ok: stopped!
[tidb@ip-172-16-30-88 scripts]$ sh start_pd.sh 
ok: started!

Check with ps whether the corresponding pd-server and run_pd processes have started:
[tidb@ip-172-16-30-88 scripts]$ ps -ef|grep run_pd
[tidb@ip-172-16-30-88 scripts]$ ps -ef|grep pd-server

First PD node:

The data-dir (data.pd directory) of this node also needs to be deleted; PD will recreate it on startup. If it is not removed, PD reports errors (the exact messages are shown at the end of this article).
[tidb@ip-172-16-30-86 scripts]$ cd ..
[tidb@ip-172-16-30-86 pd]$ ls
backup  bin  conf  data.pd  log  scripts  status
[tidb@ip-172-16-30-86 pd]$ rm -rf data.pd/

Edit run_pd.sh:
[tidb@ip-172-16-30-86 pd]$ cd scripts/
[tidb@ip-172-16-30-86 scripts]$ cat run_pd.sh 
#!/bin/bash
set -e
ulimit -n 1000000

# WARNING: This file was auto-generated. Do not edit!

# All your edit might be overwritten!

DEPLOY_DIR=/data/tidb/deploy_tidb/pd

cd "${DEPLOY_DIR}" || exit 1

exec bin/pd-server \
    --name="pd1" \
    --client-urls="http://172.16.30.86:2479" \
    --advertise-client-urls="http://172.16.30.86:2479" \
    --peer-urls="http://172.16.30.86:2480" \
    --advertise-peer-urls="http://172.16.30.86:2480" \
    --data-dir="/data/tidb/deploy_tidb/pd/data.pd" \
    --join="http://172.16.30.87:2479" \
    --config=conf/pd.toml \
    --log-file="/data/tidb/deploy_tidb/pd/log/pd.log" 2>> "/data/tidb/deploy_tidb/pd/log/pd_stderr.log"

[tidb@ip-172-16-30-86 scripts]$ sh stop_pd.sh 
sync ... done!
ok: stopped!
[tidb@ip-172-16-30-86 scripts]$ sh start_pd.sh 
ok: started!

Check with ps whether the corresponding pd-server and run_pd processes have started:
[tidb@ip-172-16-30-86 scripts]$ ps -ef|grep run_pd
[tidb@ip-172-16-30-86 scripts]$ ps -ef|grep pd-server

9. Check the member information

[tidb@ip-172-16-30-86 tidb-ansible]$ ./resources/bin/pd-ctl -u "http://172.16.30.86:2479" -d member
{
  "header": {
    "cluster_id": 6659519785752570180
  },
  "members": [
    {
      "name": "pd1",
      "member_id": 2413783433345583533,
      "peer_urls": [
        "http://172.16.30.86:2480"
      ],
      "client_urls": [
        "http://172.16.30.86:2479"
      ]
    },
    {
      "name": "pd2",
      "member_id": 6346508002280138214,
      "peer_urls": [
        "http://172.16.30.87:2480"
      ],
      "client_urls": [
        "http://172.16.30.87:2479"
      ]
    }
  ],
  "leader": {
    "name": "pd2",
    "member_id": 6346508002280138214,
    "peer_urls": [
      "http://172.16.30.87:2480"
    ],
    "client_urls": [
      "http://172.16.30.87:2479"
    ]
  },
  "etcd_leader": {
    "name": "pd2",
    "member_id": 6346508002280138214,
    "peer_urls": [
      "http://172.16.30.87:2480"
    ],
    "client_urls": [
      "http://172.16.30.87:2479"
    ]
  }
}

10. Error messages (pd.log errors caused by a stale data.pd)

The PD cluster uses etcd for service registration and discovery. Because the cluster was rebuilt with pd-recover, a single etcd (PD) node was first started on its own as the cluster leader. When the other etcd (PD) nodes start in this cluster for the first time, they are bootstrapped through that discovery, so their old member information (the data.pd directory) must be deleted first. If a node's data.pd directory is left in place, its pd.log loops endlessly with the following errors:

2019/03/05 16:42:43.599 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.599 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.699 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.699 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.799 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.799 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.899 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.899 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.999 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:43.999 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.099 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.099 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.199 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.199 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.299 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.299 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.399 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.399 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.499 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
2019/03/05 16:42:44.499 log.go:82: [error] rafthttp: [request cluster ID mismatch (got e62d88fc2a22d675 want 5c45ad6fda8ad27c)]
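The fix is the one already applied in step 8: stop the stuck node, delete its stale data.pd, and start it again so it re-joins through --join (a minimal sketch using the paths from the earlier steps):

$ cd /data/tidb/deploy_tidb/pd/scripts
$ sh stop_pd.sh
$ rm -rf ../data.pd
$ sh start_pd.sh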

Copyright notice: this is an original post by the author and may not be reproduced without permission.
