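A friend of a friend has a database whose node 2 would not start, and asked me to take a look.
I have been busy and offline during the day for the past two days, so I only looked at it after getting home. It turned out that node 2 had already been down for three months. I connected to node 1 and it looked fine: apart from the clocks on the two RAC nodes being out of sync, I found nothing else wrong with the environment. Because this is a production database, I double-checked several times before touching anything.
DB version: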
SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
On node 2, I ran crsctl stop crs -f and then started CRS again in the normal way. The log showed:
2016-10-30 19:13:34.297 [cssd(16596)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log
2016-10-30 19:13:34.297 [cssd(16596)]CRS-1603:CSSD on node rac2 shutdown by user.
2016-10-30 19:13:39.685 [ohasd(15702)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'rac2'.
2016-10-30 19:13:43.032 [ohasd(15702)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2016-10-30 19:13:51.520 [cssd(16761)]CRS-1713:CSSD daemon is started in clustered mode
2016-10-30 19:13:57.298 [cssd(16761)]CRS-1707:Lease acquisition for node rac2 number 2 completed
2016-10-30 19:13:58.590 [cssd(16761)]CRS-1605:CSSD voting file is online: /dev/asmdisk1; details in /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log.
After that it just hung. I waited a long time; 11g RAC startup is generally slow, but not this slow. So I looked at /u01/app/11.2.0/grid/log/rac2/cssd/ocssd.log:
2016-10-30 19:21:22.201: [ CSSD][1407166208]clssnmLocalJoinEvent: takeover aborted due to cluster member node found on disk
2016-10-30 19:21:22.691: [ CSSD][1616029440]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2016-10-30 19:21:23.101: [ CSSD][1620760320]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 298902583, wrtcnt, 74191108, LATS 226154536, lastSeqNo 74191107, uniqueness 1452102906, timestamp 1477826482/4248117224
2016-10-30 19:21:23.201: [ CSSD][1612875520]clssnmSendingThread: sending join msg to all nodes
2016-10-30 19:21:23.201: [ CSSD][1612875520]clssnmSendingThread: sent 5 join msgs to all nodes
2016-10-30 19:21:23.691: [ CSSD][1616029440]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2016-10-30 19:21:24.102: [ CSSD][1620760320]clssnmvDHBValidateNCopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 298902583, wrtcnt, 74191109, LATS 226155546, lastSeqNo 74191108, uniqueness 1452102906, timestamp 1477826483/4248118224
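When ocssd.log is long, the repeating clssnmvDHBValidateNCopy entries can be pulled out with a short script instead of being read by eye. The following is only a minimal Python sketch, assuming the line format shown in the excerpt above; the function name `missing_network_hb` is made up for illustration and is not part of any Oracle tool.

```python
import re

# Matches the clssnmvDHBValidateNCopy entries seen in the excerpt above.
# Assumption: other Grid Infrastructure versions may word this line differently.
DHB_RE = re.compile(
    r"clssnmvDHBValidateNCopy: node (?P<num>\d+), (?P<name>[^,]+), "
    r"has a disk HB, but no network HB, .*?wrtcnt, (?P<wrtcnt>\d+)"
)


def missing_network_hb(log_text):
    """Return (node_number, node_name, wrtcnt) for every matching entry."""
    return [
        (int(m.group("num")), m.group("name"), int(m.group("wrtcnt")))
        for m in DHB_RE.finditer(log_text)
    ]
```

Run against the two entries above, this returns node 1 (rac1) twice, with wrtcnt going from 74191108 to 74191109. The counter is still advancing, which suggests node 1 is still writing its disk heartbeat even though no network heartbeat arrives.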
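The error message is unambiguous: "has a disk HB, but no network HB". Node 2 sees node 1's disk heartbeat but not its network heartbeat, so CRS cannot finish starting. I pinged across the network and found nothing wrong. The network heartbeat runs over HAIP, a feature new in 11g that has never been short of bugs, so I searched MOS and found two notes:
Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement (Doc ID 1212703.1)
GI Fails to Start as no Private Network Interface is Available (Doc ID 1481176.1)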
If the communication can be established successfully, we will see a log entry on node2 containing "gipchaLowerProcessAcks: ESTABLISH finished" for the peer node (node1). If the communication cannot be established, we will not see this log entry. Instead, we will see an entry indicating that the network communication cannot be established. This entry will look similar to the one shown below:
2010-09-16 23:13:15.839: [ CSSD][1087465792]clssnmvDHBValidateNCopy: node 1, node1, has a disk HB, but no network HB, DHB has rcfg 180134562, wrtcnt, 8627, LATS 9564064, lastSeqNo 8624, uniqueness 1284701023, timestamp 1284703995/10564774
The above log entry indicates that CSSD is unable to establish network communication on the interface used for the private interconnect. In this particular case, the issue was that multicast communication on the 230.0.1.0 IP was blocked on the network used as the private interconnect.
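The two markers described in the note can be checked mechanically. What follows is a rough Python sketch, not an Oracle-provided tool: the function name `interconnect_status` and the rule that the most recent marker wins are my own assumptions.

```python
# Marker strings quoted from the MOS note and from the log excerpt.
ESTABLISHED = "gipchaLowerProcessAcks: ESTABLISH finished"
NO_NETWORK_HB = "has a disk HB, but no network HB"


def interconnect_status(log_text):
    """Classify a chunk of log text by whichever marker appears last.

    Returns "established", "no network heartbeat", or "unknown"
    (when neither marker is present).
    """
    last_ok = log_text.rfind(ESTABLISHED)
    last_bad = log_text.rfind(NO_NETWORK_HB)
    if last_ok == -1 and last_bad == -1:
        return "unknown"
    # Assumption: the later marker reflects the current state.
    return "established" if last_ok > last_bad else "no network heartbeat"
```

For the ocssd.log excerpt in this post, the function would return "no network heartbeat", since the ESTABLISH entry never appears.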
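Going by that description, this may well be a real bug. The gipcd.bin process, which manages the cluster interconnect, is a non-core process; non-core processes like this can simply be killed, and the cluster restarts them automatically. Only one node is running at the moment anyway, so killing the process should have no negative impact.
On node 1, kill the process directly at the OS level: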
ps -ef|grep gipcd.bin
Then kill it.
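One small trap with ps -ef | grep is that the grep command itself shows up in the output. As an illustration only, here is a Python sketch that picks the PID out of captured `ps -ef` text and skips the grep line. It assumes the usual Linux column layout (UID PID PPID C STIME TTY TIME CMD), and the helper name `find_pids` is hypothetical.

```python
def find_pids(ps_output, process_name="gipcd.bin"):
    """Return the PIDs of `ps -ef` lines whose command mentions process_name."""
    pids = []
    for line in ps_output.splitlines():
        # Assumed columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or something unexpected
        cmd = fields[7]
        if process_name in cmd and not cmd.startswith("grep"):
            pids.append(int(fields[1]))
    return pids
```

The PID it returns is the one to pass to kill on node 1, as in the step above.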
After that, CRS on node 2 immediately resumed and started up normally.
A final check:
[grid@www.cndba.cn ~]$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.CRS.dg
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.DATA.dg
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.asm
               ONLINE  ONLINE       rac1                     Started
               ONLINE  ONLINE       rac2                     Started
ora.gsd
               OFFLINE OFFLINE      rac1
               OFFLINE OFFLINE      rac2
ora.net1.network
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
ora.ons
               ONLINE  ONLINE       rac1
               ONLINE  ONLINE       rac2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac2
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac1
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac1
ora.cvu
      1        ONLINE  ONLINE       rac1
ora.oc4j
      1        ONLINE  ONLINE       rac1
ora.rac1.vip
      1        ONLINE  ONLINE       rac1
ora.rac2.vip
      1        ONLINE  ONLINE       rac2
ora.racdb.db
      1        ONLINE  ONLINE       rac1                     Open
      2        ONLINE  ONLINE       rac2                     Open
ora.scan1.vip
      1        ONLINE  ONLINE       rac2
ora.scan2.vip
      1        ONLINE  ONLINE       rac1
ora.scan3.vip
      1        ONLINE  ONLINE       rac1
[grid@www.cndba.cn ~]$
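Started successfully.
Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.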