Ceph: How to Fix HEALTH_WARN clock skew detected -- cnDBA.cn

HEALTH_WARN clock skew detected means that the clocks on the cluster nodes are out of sync.

1. Check the cluster status

[root@ceph-osd1 ~]# ceph status
    cluster 21ed0f42-69d2-450c-babf-b1a44c1b82e4
     health HEALTH_ERR
        clock skew detected on mon.ceph-osd2, mon.ceph-osd3  -- this is the problem: the clocks are out of sync
            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
            too few PGs per OSD (7 < min 30)
            Monitor clock skew detected 
     monmap e2: 3 mons at {ceph-osd1=192.168.1.141:6789/0,ceph-osd2=192.168.1.142:6789/0,ceph-osd3=192.168.1.143:6789/0}
            election epoch 12, quorum 0,1,2 ceph-osd1,ceph-osd2,ceph-osd3
     osdmap e60: 9 osds: 9 up, 9 in
            flags sortbitwise
      pgmap v238: 64 pgs, 1 pools, 0 bytes data, 0 objects
            300 MB used, 359 GB / 359 GB avail
                  64 creating

1.1 View the health details

[root@ceph-osd1 ~]# ceph health detail
....
too few PGs per OSD (7 < min 30)
mon.ceph-osd2 addr 192.168.1.142:6789/0 clock skew 2.56682s > max 0.05s (latency 0.0020987s)
mon.ceph-osd3 addr 192.168.1.143:6789/0 clock skew 2.56706s > max 0.05s (latency 0.00193141s)
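
When several monitors are affected, it can help to reduce this output to just the monitor name, the measured skew, and the allowed maximum. The following is a minimal sketch, not part of Ceph itself: the function name list_clock_skew is made up for illustration, and it assumes the line layout shown in the output above, which may differ in other Ceph releases.

```shell
#!/bin/sh
# list_clock_skew (hypothetical helper): read `ceph health detail` output on
# stdin and print "<monitor> <skew_seconds> <max_seconds>" for each monitor
# that is reported with clock skew.
# Assumed line layout (taken from the output above):
#   mon.NAME addr IP:PORT/N clock skew 2.56682s > max 0.05s (latency ...)
list_clock_skew() {
    awk '$1 ~ /^mon\./ && $4 == "clock" && $5 == "skew" {
        skew = $6; max = $9
        sub(/s$/, "", skew)   # strip the trailing unit "s"
        sub(/s$/, "", max)
        print $1, skew, max
    }'
}
```

Possible usage on a node with an admin keyring: ceph health detail | list_clock_skew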

1.2 Check the currently configured values

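[root@ceph-osd1 ~]# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show  | grep clock
    "mon_clock_drift_allowed": "0.05",--- clock drift between monitors above 0.05 seconds is reported as skew
    "mon_clock_drift_warn_backoff": "5",--- exponential backoff factor applied to repeated clock drift warnings
    "clock_offset": "0",--- offset added to the system clock (0 by default)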

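2. Fix

A quick fix is to synchronize the clocks once by hand (steps 2.1 and 2.2), but this is not recommended as a lasting solution. Step 2.3 configures NTP so that the clocks stay synchronized.

2.1 Stop the ntpd service on all nodes, if it is running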

# systemctl stop ntpd

2.2 Synchronize against a public time server

# ntpdate time.nist.gov

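2.3 Configure the NTP service

Here the NTP server runs on the ceph-admin node, and the three nodes ceph-1/2/3 are NTP clients, so that time synchronization is solved at the root. (A multi-server setup has not been configured yet.)

On the ceph-admin node:

Edit /etc/ntp.conf: comment out the four default server lines and add the following three lines: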

vim  /etc/ntp.conf
###comment following lines:
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
###add following lines:
server 127.127.1.0 minpoll 4
fudge 127.127.1.0 stratum 0
restrict 192.168.56.0 mask 255.255.0.0 nomodify notrap # adjust this line to the clients' IP range (the nodes in this article use 192.168.1.x)

Edit /etc/ntp/step-tickers as follows:

# List of NTP servers used by the ntpdate service.
# 0.centos.pool.ntp.org
192.168.1.131

Restart the NTP service and check that the server is healthy. It is healthy when the last line of the ntpq -p output begins with *:

[root@ceph-admin ~]# systemctl enable ntpd
Created symlink from /etc/systemd/system/multi-user.target.wants/ntpd.service to /usr/lib/systemd/system/ntpd.service.
[root@ceph-admin ~]# systemctl restart ntpd 
[root@ceph-admin ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        .LOCL.           0 l    -   16    1    0.000    0.000   0.000

The NTP server is now configured. Next, configure the clients.

On the three nodes ceph-1/ceph-2/ceph-3:

Edit /etc/ntp.conf: comment out the four server lines and add one server line that points to ceph-admin:

vim /etc/ntp.conf
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
server 192.168.1.131

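Restart the NTP service and check that the client has connected to the server. As before, the connection is correct when the last line of the ntpq -p output begins with *: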

[root@ceph-1 ~]# systemctl enable ntpd
Created symlink from /etc/systemd/system/multi-user.target.wants/ntpd.service to /usr/lib/systemd/system/ntpd.service.
[root@ceph-1 ~]# systemctl restart ntpd
[root@ceph-1 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ceph-admin          .LOCL.           1 u    1   64    1    0.329    0.023   0.000

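This does not take long; in production the * state is reached within 5 minutes at most. The output below shows a client that has not connected correctly: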

[root@ceph-1 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ceph-admin  .INIT.          16 u    -   64    0    0.000    0.000   0.000
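
The "line begins with *" check can also be scripted, for example to poll a client until it has selected its server. The following is a minimal sketch under the assumption that ntpq -p prints its peers in the layout shown above; the function name ntp_selected_peer is made up for illustration.

```shell
#!/bin/sh
# ntp_selected_peer (hypothetical helper): read `ntpq -p` output on stdin,
# print the name of the peer that ntpd has selected (the line whose first
# character is "*"), and return non-zero when no peer is selected yet.
ntp_selected_peer() {
    awk '/^\*/ { print substr($1, 2); found = 1 }
         END   { exit !found }'
}
```

Possible usage on a client: ntpq -p | ntp_selected_peer || echo "not synchronized yet"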

3. Restart the monitors

[root@ceph-osd2 ~]# systemctl restart ceph-mon@ceph-osd2

-- the part after @ is the host name of the monitor node
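
If several monitors reported skew, the restart has to be repeated on each of them. The following is a rough sketch that only prints the commands so they can be reviewed first. It assumes root SSH access to every host and that each monitor ID equals its host name, as in the unit name above; the function name mon_restart_commands is made up for illustration.

```shell
#!/bin/sh
# mon_restart_commands (hypothetical helper): print one restart command per
# monitor host passed as an argument. Nothing is executed here.
# Assumptions: root SSH access to each host, and monitor ID == host name.
mon_restart_commands() {
    for host in "$@"; do
        printf 'ssh root@%s systemctl restart ceph-mon@%s\n' "$host" "$host"
    done
}
```

Possible usage: mon_restart_commands ceph-osd2 ceph-osd3, then run the printed commands one at a time and confirm with ceph status that the monitor has rejoined the quorum before restarting the next one.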

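3.1 Check the cluster status again: the clock skew warning is gone (the remaining HEALTH_ERR entries concern PGs, not time synchronization)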

[root@ceph-osd1 ~]# ceph -w
    cluster 21ed0f42-69d2-450c-babf-b1a44c1b82e4
     health HEALTH_ERR
            64 pgs are stuck inactive for more than 300 seconds
            64 pgs stuck inactive
            too few PGs per OSD (7 < min 30)
     monmap e2: 3 mons at {ceph-osd1=192.168.1.141:6789/0,ceph-osd2=192.168.1.142:6789/0,ceph-osd3=192.168.1.143:6789/0}
            election epoch 16, quorum 0,1,2 ceph-osd1,ceph-osd2,ceph-osd3
     osdmap e60: 9 osds: 9 up, 9 in
            flags sortbitwise
      pgmap v238: 64 pgs, 1 pools, 0 bytes data, 0 objects
            300 MB used, 359 GB / 359 GB avail
                  64 creating
2016-11-08 21:12:42.759296 mon.0 [INF] osdmap e60: 9 osds: 9 up, 9 in

Expect-乐
