Oracle 12c RAC: the CRS-1019 bug, and fixing the ora.gipcd error caused by applying the patch

Problem Description

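At 2018-08-30 15:30:43 I was about to fix a slow system clock on a customer's Oracle 12c RAC system when I was surprised to find that CRS was down on both RAC nodes. The application was nevertheless working normally, and sqlplus / as sysdba still connected on the servers, so I immediately suspected a bug.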

System environment:

Redhat 7.4

ORACLE 12C

Troubleshooting

1. CRS is not running on either node

Node 1:

grid@sy-oa-01:/home/grid>crsctl check cluster
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
grid@sy-oa-01:/home/grid>


Node 2:
grid@sy-oa-02:/home/grid>crsctl check cluster
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
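
When the same check has to be repeated on several nodes, it can help to filter the output down to the services that are not online. The following is a minimal sketch; the function name not_online is my own, and it only assumes the message format shown in the output above.

```shell
# Print every CRS-nnnn line of `crsctl check cluster` output that does NOT
# end in "is online". Reads the command output on stdin.
not_online() {
  grep -E '^CRS-[0-9]+:' | grep -Ev 'is online[[:space:]]*$' || true
}

# Typical use on a cluster node (assumes crsctl is on PATH):
#   crsctl check cluster | not_online
```

Fed the output above, it prints only the CRS-4535 line.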

2. Check the cluster alert log

Node 1 cluster alert log (looks normal):
2018-08-30 11:46:11.017 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 11:56:36.089 [CRSD(15820)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 15820
2018-08-30 12:16:12.003 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 12:18:39.215 [CRSD(22447)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22447
2018-08-30 12:40:42.335 [CRSD(29321)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 29321
2018-08-30 12:46:12.975 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 13:02:45.508 [CRSD(36506)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 36506
2018-08-30 13:16:13.954 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 13:24:48.728 [CRSD(43574)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 43574
2018-08-30 13:46:14.957 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 13:46:51.764 [CRSD(50568)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 50568
2018-08-30 14:08:54.838 [CRSD(57112)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 57112
2018-08-30 14:16:15.931 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 14:30:57.979 [CRSD(64492)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 64492
2018-08-30 14:46:16.921 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 14:53:01.095 [CRSD(73026)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 73026
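
The excerpt above shows CRSD being started nine times, each with a new process ID, roughly 22 minutes apart. A small filter makes that pattern easier to see in a long alert log. This is a sketch only; crsd_starts is a hypothetical helper name and it relies on the CRS-8500 line format shown above.

```shell
# Print "date time pid" for every CRSD start message (CRS-8500) found in a
# clusterware alert log read from stdin.
crsd_starts() {
  sed -n 's/^\([0-9-]* [0-9:.]*\) \[CRSD(\([0-9]*\))]CRS-8500:.*/\1 \2/p'
}

# Typical use (the alert log path depends on your installation):
#   crsd_starts < alert.log
#   crsd_starts < alert.log | wc -l    # number of CRSD starts
```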

Node 2 cluster alert log (abnormal, reports CRS-01019):
Incident details in: /g01/grid/diag/crs/sy-oa-02/crs/incident/incdir_57/crsd_i57.trc

2018-07-13 10:55:36.282 [CRSD(60745)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 60745 encountered internal error CRS-01019
2018-07-13 10:55:36.580 [CRSD(60863)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 60863
2018-07-13 10:55:38.779 [CRSD(60863)]CRS-1019: The OCR Service exited on host sy-oa-02. Details in /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc
2018-07-13T10:55:38.801751+08:00
Errors in file /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc  (incident=65):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /g01/grid/diag/crs/sy-oa-02/crs/incident/incdir_65/crsd_i65.trc

2018-07-13 10:55:38.818 [CRSD(60863)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 60863 encountered internal error CRS-01019
2018-07-13 10:55:39.109 [CRSD(60916)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 60916
2018-07-13 10:55:41.330 [CRSD(60916)]CRS-1019: The OCR Service exited on host sy-oa-02. Details in /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc
2018-07-13T10:55:41.350376+08:00
Errors in file /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc  (incident=73):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /g01/grid/diag/crs/sy-oa-02/crs/incident/incdir_73/crsd_i73.trc

2018-07-13 10:55:41.368 [CRSD(60916)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 60916 encountered internal error CRS-01019
2018-07-13 10:55:41.617 [OHASD(45271)]CRS-2771: Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
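
The pattern in this log is a loop: CRSD starts, the OCR service exits with CRS-1019, and OHASD finally stops restarting ora.crsd (CRS-2771). A rough way to summarize that from an alert log is sketched below; alert_summary is a hypothetical name, and the matching is based only on the message formats visible above.

```shell
# Summarize a clusterware alert log read from stdin: count OCR service
# exits (CRS-1019) and list resources OHASD gave up restarting (CRS-2771).
alert_summary() {
  awk -v q="'" '
    /\]CRS-1019:/ { exits++ }
    /\]CRS-2771:/ {
      # the resource name is the text between the single quotes
      if (match($0, q "[^" q "]+" q))
        gave_up[substr($0, RSTART + 1, RLENGTH - 2)] = 1
    }
    END {
      printf "CRS-1019 exits: %d\n", exits
      for (r in gave_up) printf "restart limit reached: %s\n", r
    }'
}

# Typical use:
#   alert_summary < alert.log
```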

Open the trace file on node 2: /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc

This confirms that we hit
Bug 24396050 - crsd.bin failed several times with error CRS-1019 (Doc ID 24396050.8)

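Patching and Troubleshooting

Patching process

Automatic patching on Oracle 12c RAC differs from earlier versions. For the procedure and for fixes to common errors, see my earlier post: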

https://www.cndba.cn/Marvinn/article/2781

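Patching did not go smoothly, however. After one RAC node had been patched, patching the other node failed because CRS could not be started.

The sequence was:
1. Patch RAC node 2 by following the linked post. This went smoothly (stop the listener first; the cluster and database are stopped and started automatically).
2. Start the listener and open the PDBs.
3. Patch RAC node 1 by following the linked post (stop the listener first).

This time, the automatic cluster stop/start during patching failed immediately: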

OPatchauto session is initiated at Thu Aug 30 22:25:59 2018

System initialization log file is /g01/app/grid/12.2.0/cfgtoollogs/opatchautodb/systemconfig2018-08-30_10-26-04PM.log.

Session log file is /g01/app/grid/12.2.0/cfgtoollogs/opatchauto/opatchauto2018-08-30_10-26-09PM.log
The id for this session is 3YXV

Executing OPatch prereq operations to verify patch applicability on home /g01/app/grid/12.2.0
Patch applicability verified successfully on home /g01/app/grid/12.2.0

Checking shared status of home.....

Bringing down CRS service on home /g01/app/grid/12.2.0
Prepatch operation log file location: /g01/grid/crsdata/sy-oa-01/crsconfig/crspatch_sy-oa-01_2018-08-30_10-26-32PM.log
Failed to bring down CRS service on home /g01/app/grid/12.2.0

Execution of [GIShutDownAction] patch action failed, check log for more details. Failures:
Patch Target : sy-oa-01->/g01/app/grid/12.2.0 Type[crs]
Details: [
---------------------------Patching Failed---------------------------------
Command execution failed during patching in home: /g01/app/grid/12.2.0, host: sy-oa-01.
Command failed:  /g01/app/grid/12.2.0/perl/bin/perl -I/g01/app/grid/12.2.0/perl/lib -I/g01/app/grid/12.2.0/OPatch/auto/dbtmp/bootstrap_sy-oa-01/patchwork/crs/install /g01/app/grid/12.2.0/OPatch/auto/dbtmp/bootstrap_sy-oa-01/patchwork/crs/install/rootcrs.pl -prepatch
Command failure output:
Using configuration parameter file: /g01/app/grid/12.2.0/OPatch/auto/dbtmp/bootstrap_sy-oa-01/patchwork/crs/install/crsconfig_params
The log of current session can be found at:
  /g01/grid/crsdata/sy-oa-01/crsconfig/crspatch_sy-oa-01_2018-08-30_10-26-32PM.log
2018/08/30 22:26:34 CLSRSC-378: Failed to get the configured node role for the local node
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
2018/08/30 22:30:43 CLSRSC-117: Failed to start Oracle Clusterware stack

After fixing the cause of failure Run opatchauto resume

]
OPATCHAUTO-68061: The orchestration engine failed.
OPATCHAUTO-68061: The orchestration engine failed with return code 1
OPATCHAUTO-68061: Check the log for more details.
OPatchAuto failed.

OPatchauto session completed at Thu Aug 30 22:30:45 2018
Time taken to complete the session 4 minutes, 46 seconds

 opatchauto failed with error code 42

Error code:
CLSRSC-378: Failed to get the configured node role for the local node
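
opatchauto output is long, and the useful part is usually the handful of CLSRSC and CRS messages. As a sketch (patch_errors is a hypothetical helper, and the pattern is based only on the messages in the output above), they can be pulled out like this:

```shell
# Print the distinct "CLSRSC-nnn: ..." and "CRS-nnnn: ..." messages from an
# opatchauto or rootcrs log read from stdin, in order of first appearance.
patch_errors() {
  grep -oE '(CLSRSC|CRS)-[0-9]+: .*' | awk '!seen[$0]++'
}

# Typical use, with the session log path printed by opatchauto:
#   patch_errors < /path/to/opatchauto-session.log
```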

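Main role of the gipcd process, which turns out to be involved here:
It runs in the cluster as the daemon gipcd.bin. Its main functions are:

1. When the cluster starts, it discovers the cluster's private-network interfaces. As described in an earlier article, the private-network information comes from the GPnP profile. It then checks the interfaces it has discovered.

2. Using those private interfaces, it discovers the other nodes in the cluster and establishes connections with their private interfaces.

3. If the cluster is configured with several private interfaces and one or more of them fail on a node, it takes the faulty private network offline and notifies the other nodes.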
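
To see which interfaces the cluster has registered as private interconnects, the output of oifcfg getif can be filtered. The sketch below assumes the usual four-column layout (interface, subnet, scope, type); interconnect_ifs is a hypothetical helper, and whether oifcfg responds at all depends on how much of the stack is up.

```shell
# From `oifcfg getif` output on stdin, print "interface subnet" for every
# entry whose type column includes cluster_interconnect.
# Assumed column layout: <interface> <subnet> <scope> <type>
interconnect_ifs() {
  awk '$4 ~ /cluster_interconnect/ { print $1, $2 }'
}

# Typical use (assumes oifcfg from the Grid home is on PATH):
#   oifcfg getif | interconnect_ifs
```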

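This looks related to network configuration, since the node role cannot be determined. But the problem only appeared after RAC node 2 had been patched successfully.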

Diagnosis and Resolution

Try starting and stopping the stack manually:
[root@sy-oa-01 bin]# ./crsctl start has
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.

[root@sy-oa-01 bin]# ./crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'sy-oa-01'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'sy-oa-01'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'sy-oa-01'
CRS-2679: Attempting to clean 'ora.gipcd' on 'sy-oa-01'
CRS-2673: Attempting to stop 'ora.evmd' on 'sy-oa-01'
CRS-2680: Clean of 'ora.gipcd' on 'sy-oa-01' failed
CRS-2677: Stop of 'ora.mdnsd' on 'sy-oa-01' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'sy-oa-01' succeeded
CRS-2677: Stop of 'ora.evmd' on 'sy-oa-01' succeeded
CRS-2799: Failed to shut down resource 'ora.gipcd' on 'sy-oa-01'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'sy-oa-01' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.

[root@sy-oa-01 bin]# ps -ef|grep d.bin
root      3540     1  2 21:34 ?        00:00:08 /g01/app/grid/12.2.0/bin/ohasd.bin reboot
root      6369  3461  0 21:40 pts/0    00:00:00 grep --color=auto d.bin
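
Only ohasd.bin is left running here. To make that kind of check less error-prone, the ps output can be compared against a list of daemons you expect to see. This is a sketch: missing_daemons is a hypothetical helper and the daemon list is illustrative, not exhaustive.

```shell
# Report which of the expected clusterware daemons are absent from
# `ps -ef` output read on stdin. The list below is illustrative only.
missing_daemons() (
  ps_out="$(cat)"
  for d in ohasd.bin gipcd.bin gpnpd.bin mdnsd.bin evmd.bin ocssd.bin crsd.bin; do
    # match "/<daemon>" so that the grep command line itself is not counted
    printf '%s\n' "$ps_out" | grep -q "/$d" || echo "$d"
  done
)

# Typical use:
#   ps -ef | missing_daemons
```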

The cluster alert log shows that the gipcd process cannot be initialized. Open the trace file it reports:
trc:
2018-08-30 21:19:15.716 :  OCRMSG:3791640320: prom_listen: Failed to listen at endpoint [20]
2018-08-30 21:19:15.716 :  OCRMSG:3791640320: GIPC error [20] msg [gipcretAddressInUse]
2018-08-30 21:19:15.716 :  OCRSRV:3791640320: th_listen: prom_listen failed retval= 24, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2018-08-30 21:19:15.716 :  OCRSRV:4182315072: th_init: Local listener did not reach valid state
2018-08-30 21:19:15.716 :  OCRAPI:4182315072: a_init:18!: Thread init unsuccessful : [24]
2018-08-30 21:19:15.726 :  CRSRPT:2298464000: {0:0:2} Enabled
2018-08-30 21:19:15.727 :   CRSPE:2420074240: {0:0:2} Getting config role...
2018-08-30 21:19:15.731 :   CRSPE:2420074240: {0:0:2} ...done : 1
2018-08-30 21:19:15.731 :   CRSPE:2420074240: {0:0:2} Getting the target role...
2018-08-30 21:19:15.733 :   CRSPE:2420074240: {0:0:2} ...done: 1
2018-08-30 21:19:15.734 :   CRSPE:2420074240: {0:0:2} Server Attribute ACTIVE_CSS_ROLE=hub
2018-08-30 21:19:15.734 :   CRSPE:2420074240: {0:0:2} Server Attribute CONFIGURED_CSS_ROLE=hub
2018-08-30 21:19:15.734 :   CRSPE:2420074240: {0:0:2} Server Attribute SITE_GUID=sy-oa-01
2018-08-30 21:19:15.734 :   CRSPE:2420074240: {0:0:2} Server Attribute SITE_NAME=sy-oa-01
2018-08-30 21:19:15.734 :   CRSPE:2420074240: {0:0:2} Server Attribute SITE_QUARANTINED=0
2018-08-30 21:19:15.734 :   CRSPE:2420074240: {0:0:2} PE Role|State Update: old role [INVALID] new [INVALID]; old state [Not yet initialized] new [Enabling: waiting for role]
2018-08-30 21:19:15.735 :   CRSSE:2296362752: {0:0:2} SE module master election disabled
2018-08-30 21:19:15.735 :   CRSSE:2296362752: {0:0:2} Master Change Event; New Master Node ID:0 This Node's ID:0
2018-08-30 21:19:15.735 :   CRSSE:2296362752: {0:0:2} Node down monitor enabled
2018-08-30 21:19:15.735 :   CRSPE:2420074240: {0:0:2} PE Role|State Update: old role [INVALID] new [MASTER]; old state [Enabling: waiting for role] new [Configuring]

The key lines are:
2018-08-30 21:19:15.716 :  OCRMSG:3791640320: GIPC error [20] msg [gipcretAddressInUse]
2018-08-30 21:19:15.735 :   CRSPE:2420074240: {0:0:2} PE Role|State Update: old role [INVALID] new [MASTER]; old state [Enabling: waiting for role] new [Configuring]
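
The gipcretAddressInUse error belongs to a failed listen call, and the th_listen line in the trace names the endpoint. As a sketch (failed_listen_keys is a hypothetical helper, based only on the line format in the trace above), the affected IPC keys can be extracted like this:

```shell
# From a crsd trace read on stdin, print the distinct IPC keys of the
# endpoints whose listen call failed ("prom_listen failed" lines).
failed_listen_keys() {
  grep 'prom_listen failed' | sed -n 's/.*(KEY=\([^)]*\)).*/\1/p' | sort -u
}

# Typical use:
#   failed_listen_keys < crsd.trc
```

For the trace above it prints procr_local_conn_0_PROL.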

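gipcd cannot be cleaned up and re-initialized: the cleanup of its previous state failed and no new state was assigned, so it is left in an unknown state. Check the state of the gipcd resource: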

$crsctl status res -t -init
Status:
      1        ONLINE  OFFLINE                               STABLE
ora.gipcd
      1        ONLINE  UNKNOWN      sy-oa-01                 STABLE

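The ora.gipcd resource is in the UNKNOWN state, which matches the trace file. (Keep in mind that this only appeared after the patch had been applied successfully on the other node, so it may be a side effect of the patch.)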
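
The full crsctl status res -t -init listing is long, so it can be useful to print only the resources whose target is ONLINE but whose state is something else. This is a sketch; unhealthy_resources is a hypothetical name, and the parsing assumes the tabular layout shown above (a resource name line followed by an instance line).

```shell
# From `crsctl status res -t -init` output on stdin, print "resource state"
# for every instance whose target is ONLINE but whose state is not.
unhealthy_resources() {
  awk '
    /^ora\./ { res = $1; next }          # remember the current resource name
    res != "" && $1 ~ /^[0-9]+$/ && $2 == "ONLINE" && $3 != "ONLINE" {
      print res, $3
    }'
}

# Typical use (assumes crsctl is on PATH):
#   crsctl status res -t -init | unhealthy_resources
```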
See the MOS note:
CRS is not starting after applying the latest RU in 12.2 (Doc ID 2373945.1)

Solution:
# crsctl stop crs -f

Lock the Grid Home using below command:

# $GRID_HOME/crs/install/rootcrs.sh -lock

Start the CRS Home:
# crsctl start crs

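Note: the patch has not been applied on this node yet, and the cluster is now shut down, so we cannot use the fully automatic mode that patches GI_HOME and ORACLE_HOME in a single run. Each home has to be patched explicitly, still as root, and only the first two steps of the solution above are needed.

The procedure is as follows (in effect, this is how you patch with the cluster and the database shut down):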
# ./crsctl stop crs -f
# $GI_HOME/crs/install/rootcrs.sh -lock
# ./opatchauto.sh apply /psu/24396050/ -oh /g01/app/grid/12.2.0/
# ./opatchauto.sh apply /psu/24396050/ -oh /u01/oracle/12.2.0/
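
The four commands above can be wrapped in a small script that prints each step before running it. This is a sketch under assumptions: the variable names, the DRY_RUN switch, and the default locations of crsctl and opatchauto.sh are mine (the home and patch paths are the ones used above), so adjust them to your environment. By default nothing is executed.

```shell
# Sketch of the manual sequence above. DRY_RUN defaults to 1, so the
# commands are only printed; set DRY_RUN=0 and run as root to execute them.
GI_HOME="${GI_HOME:-/g01/app/grid/12.2.0}"
DB_HOME="${DB_HOME:-/u01/oracle/12.2.0}"
PATCH_DIR="${PATCH_DIR:-/psu/24396050/}"
CRSCTL="${CRSCTL:-$GI_HOME/bin/crsctl}"        # assumed location of crsctl
OPATCHAUTO="${OPATCHAUTO:-./opatchauto.sh}"    # as invoked above; adjust path
DRY_RUN="${DRY_RUN:-1}"

# Print a command; execute it only when DRY_RUN=0, aborting on failure.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@" || { echo "step failed: $*" >&2; exit 1; }
  fi
}

patch_node() {
  run "$CRSCTL" stop crs -f
  run "$GI_HOME/crs/install/rootcrs.sh" -lock
  run "$OPATCHAUTO" apply "$PATCH_DIR" -oh "$GI_HOME"
  run "$OPATCHAUTO" apply "$PATCH_DIR" -oh "$DB_HOME"
}

# Usage: review the printed commands first, then rerun with DRY_RUN=0.
#   patch_node
```

The script stops at the first failing step. If the forced stop reports errors in your environment, as crsctl stop has -f did earlier, check the state by hand before continuing rather than rerunning blindly.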

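At this point the patch is fully applied. The last step is to apply the SQL changes; see the final step of the 12c automatic patching post. It only needs to be run on one node. Remember that the PDBs must be opened in UPGRADE mode, then reopened READ WRITE once the step has finished.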
https://www.cndba.cn/Marvinn/article/2781

Marvinn