RAC Node 2: Troubleshooting a CRS That Will Not Start

A friend of a friend has a database where node 2 would not start, and I was asked to take a look.

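I had been busy and offline during the day for the past two days, so I only looked at it after getting home. It turned out that node 2 had already been down for three months. I connected to node 1 and found nothing wrong with the environment apart from the clocks of the two RAC nodes being out of sync. Because this is a production database, I double-checked several times before touching anything.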

DB version:

SQL> select * from v$version;
 
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production

On node 2 I ran crsctl stop crs -f and then started CRS again normally. The log showed the following:

2016-10-30 19:13:34.297
[cssd(16596)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log
2016-10-30 19:13:34.297
[cssd(16596)]CRS-1603:CSSD on node rac2 shutdown by user.
2016-10-30 19:13:39.685
[ohasd(15702)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'rac2'.
2016-10-30 19:13:43.032
[ohasd(15702)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2016-10-30 19:13:51.520
[cssd(16761)]CRS-1713:CSSD daemon is started in clustered mode
2016-10-30 19:13:57.298
[cssd(16761)]CRS-1707:Lease acquisition for node rac2 number 2 completed
2016-10-30 19:13:58.590
[cssd(16761)]CRS-1605:CSSD voting file is online: /dev/asmdisk1; details in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.

Startup then hung at this point. I waited a long time; an 11g RAC stack is slow to start, but not this slow. So I looked at /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log:

2016-10-30 19:21:22.201: [    CSSD][1407166208]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2016-10-30 19:21:22.691: [    CSSD][1616029440]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2016-10-30 19:21:23.101: [    CSSD][1620760320]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 298902583, wrtcnt, 74191108, LATS 226154536, lastSeqNo 74191107, uniqueness 1452102906, timestamp 1477826482/4248117224
2016-10-30 19:21:23.201: [    CSSD][1612875520]clssnmSendingThread: sending join msg to all nodes
2016-10-30 19:21:23.201: [    CSSD][1612875520]clssnmSendingThread: sent 5 join msgs to all nodes
2016-10-30 19:21:23.691: [    CSSD][1616029440]clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0
2016-10-30 19:21:24.102: [    CSSD][1620760320]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 298902583, wrtcnt, 74191109, LATS 226155546, lastSeqNo 74191108, uniqueness 1452102906, timestamp 1477826483/4248118224


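The error message is clear: "has a disk HB, but no network HB". Node 2 sees node 1's disk heartbeat but not its network heartbeat, so CRS cannot start. A ping across the network showed no problem. The network heartbeat runs over HAIP, an 11g new feature that has never been short of bugs, so I searched MOS and found two notes: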
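To see at a glance how often the message occurs and whether the peer's disk heartbeat counter is still advancing, the ocssd.log entries can be summarized. The following is only a minimal sketch: the function name summarize_missing_network_hb is made up for illustration, and the field positions are assumed from the log lines shown above, not from any documented format.

```shell
# Sketch: summarize "disk HB, but no network HB" entries from ocssd.log text on stdin.
# Prints one line per peer node: <node> <entries> <first wrtcnt> <last wrtcnt>
# Assumption: the layout "node <n>, <name>, ... wrtcnt, <value>," matches the lines above.
summarize_missing_network_hb() {
  awk '
    /has a disk HB, but no network HB/ {
      node = ""; wrt = ""
      for (i = 1; i <= NF; i++) {
        # "node 1, rac1," -> the node name is two fields after the word "node"
        if ($i == "node" && node == "") { node = $(i + 2); sub(/,$/, "", node) }
        # "wrtcnt, 74191108," -> the counter is the next field
        if ($i == "wrtcnt,") { wrt = $(i + 1); sub(/,$/, "", wrt) }
      }
      if (node == "" || wrt == "") next
      if (!(node in count)) { first[node] = wrt; order[++n] = node }
      count[node]++
      last[node] = wrt
    }
    END {
      for (k = 1; k <= n; k++) {
        name = order[k]
        printf "%s %d %s %s\n", name, count[name], first[name], last[name]
      }
    }
  '
}
```

Usage: summarize_missing_network_hb < /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log. A wrtcnt that keeps increasing suggests the peer is alive and writing to the voting disk, which points at the interconnect rather than at storage.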


Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement (Doc ID 1212703.1)

GI Fails to Start as no Private Network Interface is Available (Doc ID 1481176.1)

If the communication can be established successfully, we will see a log entry on node2 containing "gipchaLowerProcessAcks: ESTABLISH finished" for the peer node (node1). If the communication cannot be established, we will not see this log entry. Instead, we will see an entry indicating that the network communication cannot be established. This entry will look similar to the one shown below:

2010-09-16 23:13:15.839: [ CSSD][1087465792]clssnmvDHBValidateNCopy: node 1, node1, has a disk HB, but no network HB, DHB has rcfg 180134562, wrtcnt, 8627, LATS 9564064, lastSeqNo 8624, uniqueness 1284701023, timestamp 1284703995/10564774

The above log entry indicates that CSSD is unable to establish network communication on the interface used for the private interconnect. In this particular case, the issue was that multicast communication on the 230.0.1.0 IP was blocked on the network used as the private interconnect.
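The check described in the note can be roughly automated by looking for the two quoted strings in the log. This is a sketch only: classify_interconnect is a hypothetical helper, and matching the peer node name on the same line as the "ESTABLISH finished" text is an assumption, since the note quotes the string but not the full line format.

```shell
# Sketch: report the most recent interconnect state seen for a peer node.
# $1 = peer node name (e.g. rac1); log text is read from stdin.
# Assumption: the peer name appears on the same line as the quoted message.
classify_interconnect() {
  awk -v peer="$1" '
    index($0, peer) && index($0, "gipchaLowerProcessAcks: ESTABLISH finished") {
      state = "established"
    }
    index($0, peer) && index($0, "has a disk HB, but no network HB") {
      state = "no-network-hb"
    }
    END {
      # Later lines overwrite earlier ones, so this is the last state in the log.
      if (state == "") state = "unknown"
      print state
    }
  '
}
```

Usage: classify_interconnect rac1 < ocssd.log. Because later lines overwrite earlier ones, the result reflects the last matching entry in the log rather than any earlier successful handshake.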

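Based on this description, it looked like I really had hit a bug. gipcd.bin, the process that manages the cluster interconnect, is a non-core process; non-core processes like this can be killed directly and the cluster restarts them automatically. Only one node was running at this point anyway, so killing the process should have no negative impact.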

Kill the process directly at the OS level on node 1:

ps -ef|grep gipcd.bin

Then kill it.
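One small trap with ps -ef | grep is that the grep command itself shows up in the output. The sketch below picks out only the real gipcd.bin PID and prints what it would do unless told to proceed. The function names and the CONFIRM variable are hypothetical, and the column layout assumed is the usual Linux ps -ef format (PID in column 2, command in column 8).

```shell
# Sketch: print the PID(s) of gipcd.bin from `ps -ef`-style text on stdin.
# Matching on the command column ($8) skips the "grep gipcd.bin" line itself.
gipcd_pids() {
  awk '$8 ~ /(^|\/)gipcd\.bin$/ { print $2 }'
}

# Dry run by default; CONFIRM=yes is a made-up safety switch for this sketch.
kill_gipcd() {
  pids="$(ps -ef | gipcd_pids)"
  if [ -z "$pids" ]; then
    echo "no gipcd.bin process found" >&2
    return 1
  fi
  if [ "${CONFIRM:-no}" = "yes" ]; then
    # Unquoted on purpose so several PIDs become separate arguments.
    kill $pids
  else
    echo "would kill gipcd.bin PID(s): $pids (set CONFIRM=yes to proceed)"
  fi
}
```

Run kill_gipcd once to see the PID, then CONFIRM=yes kill_gipcd to send the signal. On a production node it is worth checking with ps afterwards that the process has come back with a new PID.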

After that, CRS on node 2 immediately resumed and started normally.

A final check:

[grid@www.cndba.cn ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.CRS.dg
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.DATA.dg
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.asm
               ONLINE  ONLINE       rac1                     Started             
               ONLINE  ONLINE       rac2                     Started             
ora.gsd
               OFFLINE OFFLINE      rac1                                         
               OFFLINE OFFLINE      rac2                                         
ora.net1.network
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
ora.ons
               ONLINE  ONLINE       rac1                                         
               ONLINE  ONLINE       rac2                                         
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2                                         
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac1                                         
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac1                                         
ora.cvu
      1        ONLINE  ONLINE       rac1                                         
ora.oc4j
      1        ONLINE  ONLINE       rac1                                         
ora.rac1.vip
      1        ONLINE  ONLINE       rac1                                         
ora.rac2.vip
      1        ONLINE  ONLINE       rac2                                         
ora.racdb.db
      1        ONLINE  ONLINE       rac1                     Open                
      2        ONLINE  ONLINE       rac2                     Open                
ora.scan1.vip
      1        ONLINE  ONLINE       rac2                                         
ora.scan2.vip
      1        ONLINE  ONLINE       rac1                                         
ora.scan3.vip
      1        ONLINE  ONLINE       rac1                                         
[grid@www.cndba.cn ~]$

Startup succeeded.
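With this many resources it is easy to miss one row that is not ONLINE. A rough filter over the crsctl stat res -t output can list only the rows whose target is ONLINE but whose state is something else. This is a sketch that assumes the 11.2 tabular layout shown above; resources_not_online is a made-up name, and the SERVER column may be empty for a resource that is not running anywhere.

```shell
# Sketch: list resources whose TARGET is ONLINE but whose STATE is not.
# Reads `crsctl stat res -t` output on stdin and prints: <resource> <server> <state>
resources_not_online() {
  awk '
    /^ora\./ { res = $1; next }          # a resource name line starts a new group
    {
      # Cluster resource rows begin with an instance number; local rows do not.
      off = ($1 ~ /^[0-9]+$/) ? 1 : 0
      target = $(1 + off); state = $(2 + off); server = $(3 + off)
      if (target == "ONLINE" && state != "ONLINE") print res, server, state
    }
  '
}
```

Usage: crsctl stat res -t | resources_not_online. Empty output means every resource that is supposed to be up is up. ora.gsd is skipped because its target is OFFLINE, which is the default in 11g.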

dave

QQ交流群

注册联系QQ