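Oracle 12c RAC: the CRS-1019 bug, and fixing the ora.gipcd error caused by applying the patch
Problem description
On 2018-08-30 at 15:30:43 I was about to fix a slow system clock on a customer's Oracle 12c RAC system when I was surprised to find that the CRS daemon was down on both RAC nodes. The application was still running normally, and `sqlplus / as sysdba` still worked on the servers, so I immediately suspected a bug.
System environment: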
Redhat 7.4
ORACLE 12C
Troubleshooting
1. CRSD is not running on either node
Node 1:
grid@sy-oa-01:/home/grid>crsctl check cluster
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
grid@sy-oa-01:/home/grid>
Node 2:
grid@sy-oa-02:/home/grid>crsctl check cluster
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
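A check like this is easy to script so it can be repeated on every node. The sketch below only inspects text in the format shown above. The function name `crs_down` is made up for illustration, and it assumes that a CRS-4535 line is what signals an unreachable CRSD.

```shell
# Reads `crsctl check cluster` output on stdin.
# Returns 0 (true) if CRSD is reported unreachable (CRS-4535), 1 otherwise.
crs_down() {
    grep -q '^CRS-4535:'
}

# Possible use on a live node, as the grid user:
#   crsctl check cluster | crs_down && echo "CRSD is down on $(hostname)"
```

CSS and EVM being online while CRSD is unreachable matches what both nodes report here, which is why the databases kept running.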
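2. Check the cluster alert logs
Node 1 cluster alert log (no errors reported, although CRSD is restarted with a new PID roughly every 22 minutes):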
2018-08-30 11:46:11.017 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 11:56:36.089 [CRSD(15820)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 15820
2018-08-30 12:16:12.003 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 12:18:39.215 [CRSD(22447)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 22447
2018-08-30 12:40:42.335 [CRSD(29321)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 29321
2018-08-30 12:46:12.975 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 13:02:45.508 [CRSD(36506)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 36506
2018-08-30 13:16:13.954 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 13:24:48.728 [CRSD(43574)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 43574
2018-08-30 13:46:14.957 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 13:46:51.764 [CRSD(50568)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 50568
2018-08-30 14:08:54.838 [CRSD(57112)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 57112
2018-08-30 14:16:15.931 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 14:30:57.979 [CRSD(64492)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 64492
2018-08-30 14:46:16.921 [OCTSSD(4379)]CRS-2412: The Cluster Time Synchronization Service detects that the local time is significantly different from the mean cluster time. Details in /g01/grid/diag/crs/sy-oa-01/crs/trace/octssd.trc.
2018-08-30 14:53:01.095 [CRSD(73026)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 73026
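The restart pattern is easier to see with the time-synchronization warnings filtered out. The following is a minimal sketch, assuming the alert log line format shown above; `crsd_starts` is a hypothetical helper name.

```shell
# Reads a Clusterware alert log on stdin and prints one line per CRSD
# start message (CRS-8500): "<date> <time> <pid>".
crsd_starts() {
    awk '{
        # Third field looks like "[CRSD(15820)]CRS-8500:"; split on the parentheses.
        n = split($3, p, /[()]/)
        if (n == 3 && p[1] == "[CRSD" && p[3] == "]CRS-8500:")
            print $1, $2, p[2]
    }'
}

# Possible use: crsd_starts < /path/to/alert.log
```

Each output line carries a different PID, which shows that CRSD keeps being started again rather than staying up.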
Node 2 cluster alert log (abnormal, reports CRS-01019):
Incident details in: /g01/grid/diag/crs/sy-oa-02/crs/incident/incdir_57/crsd_i57.trc
2018-07-13 10:55:36.282 [CRSD(60745)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 60745 encountered internal error CRS-01019
2018-07-13 10:55:36.580 [CRSD(60863)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 60863
2018-07-13 10:55:38.779 [CRSD(60863)]CRS-1019: The OCR Service exited on host sy-oa-02. Details in /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc
2018-07-13T10:55:38.801751+08:00
Errors in file /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc (incident=65):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /g01/grid/diag/crs/sy-oa-02/crs/incident/incdir_65/crsd_i65.trc
2018-07-13 10:55:38.818 [CRSD(60863)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 60863 encountered internal error CRS-01019
2018-07-13 10:55:39.109 [CRSD(60916)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 60916
2018-07-13 10:55:41.330 [CRSD(60916)]CRS-1019: The OCR Service exited on host sy-oa-02. Details in /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc
2018-07-13T10:55:41.350376+08:00
Errors in file /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc (incident=73):
CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /g01/grid/diag/crs/sy-oa-02/crs/incident/incdir_73/crsd_i73.trc
2018-07-13 10:55:41.368 [CRSD(60916)]CRS-8505: Oracle Clusterware CRSD process with operating system process ID 60916 encountered internal error CRS-01019
2018-07-13 10:55:41.617 [OHASD(45271)]CRS-2771: Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
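When a log repeats the same failure many times, it helps to pull out just the incident trace files that need to be read. This sketch assumes the "Incident details in:" line format shown above; `incident_traces` is a hypothetical helper name.

```shell
# Reads a Clusterware alert log on stdin and prints each incident trace
# file it mentions once, in order of first appearance.
incident_traces() {
    awk '/^Incident details in: / { if (!seen[$4]++) print $4 }'
}

# Possible use: incident_traces < /path/to/alert.log
```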
Open the trace file on node 2, /g01/grid/diag/crs/sy-oa-02/crs/trace/crsd.trc
This confirms that we have hit:
Bug 24396050 - crsd.bin failed several times with error CRS-1019 (Doc ID 24396050.8)
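Patching and troubleshooting
Patching process
Automatic patching on Oracle 12c RAC works differently from earlier versions. For the procedure and for fixes to common errors, see my earlier article: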
https://www.cndba.cn/Marvinn/article/2781
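However, patching did not go smoothly. After one RAC node had been patched, patching the other node failed because CRS could not be started.
The sequence was as follows:
1. RAC node 2: followed the linked procedure without any problems (stop the listener first; the cluster and database are stopped and restarted automatically)
2. Started the listener and opened the PDBs
3. RAC node 1: followed the linked procedure to apply the patch (stop the listener first)
This time the automatic stop and start of the cluster during patching failed outright: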
OPatchauto session is initiated at Thu Aug 30 22:25:59 2018
System initialization log file is /g01/app/grid/12.2.0/cfgtoollogs/opatchautodb/systemconfig2018-08-30_10-26-04PM.log.
Session log file is /g01/app/grid/12.2.0/cfgtoollogs/opatchauto/opatchauto2018-08-30_10-26-09PM.log
The id for this session is 3YXV
Executing OPatch prereq operations to verify patch applicability on home /g01/app/grid/12.2.0
Patch applicability verified successfully on home /g01/app/grid/12.2.0
Checking shared status of home.....
Bringing down CRS service on home /g01/app/grid/12.2.0
Prepatch operation log file location: /g01/grid/crsdata/sy-oa-01/crsconfig/crspatch_sy-oa-01_2018-08-30_10-26-32PM.log
Failed to bring down CRS service on home /g01/app/grid/12.2.0
Execution of [GIShutDownAction] patch action failed, check log for more details. Failures:
Patch Target : sy-oa-01->/g01/app/grid/12.2.0 Type[crs]
Details: [
---------------------------Patching Failed---------------------------------
Command execution failed during patching in home: /g01/app/grid/12.2.0, host: sy-oa-01.
Command failed: /g01/app/grid/12.2.0/perl/bin/perl -I/g01/app/grid/12.2.0/perl/lib -I/g01/app/grid/12.2.0/OPatch/auto/dbtmp/bootstrap_sy-oa-01/patchwork/crs/install /g01/app/grid/12.2.0/OPatch/auto/dbtmp/bootstrap_sy-oa-01/patchwork/crs/install/rootcrs.pl -prepatch
Command failure output:
Using configuration parameter file: /g01/app/grid/12.2.0/OPatch/auto/dbtmp/bootstrap_sy-oa-01/patchwork/crs/install/crsconfig_params
The log of current session can be found at:
/g01/grid/crsdata/sy-oa-01/crsconfig/crspatch_sy-oa-01_2018-08-30_10-26-32PM.log
2018/08/30 22:26:34 CLSRSC-378: Failed to get the configured node role for the local node
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-4124: Oracle High Availability Services startup failed.
CRS-4000: Command Start failed, or completed with errors.
2018/08/30 22:30:43 CLSRSC-117: Failed to start Oracle Clusterware stack
After fixing the cause of failure Run opatchauto resume
]
OPATCHAUTO-68061: The orchestration engine failed.
OPATCHAUTO-68061: The orchestration engine failed with return code 1
OPATCHAUTO-68061: Check the log for more details.
OPatchAuto failed.
OPatchauto session completed at Thu Aug 30 22:30:45 2018
Time taken to complete the session 4 minutes, 46 seconds
opatchauto failed with error code 42
Error code:
CLSRSC-378: Failed to get the configured node role for the local node
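In a long opatchauto transcript the message codes are the quickest thing to search for on MOS. The sketch below lists the distinct codes in a saved transcript. It assumes only the CLSRSC, CRS and OPATCHAUTO prefixes seen in the output above, and `error_codes` is a hypothetical helper name. It relies on `grep -o`, which GNU and BSD grep both provide.

```shell
# Reads opatchauto/rootcrs output on stdin and prints each distinct
# message code once, sorted in byte order.
error_codes() {
    grep -oE '(CLSRSC|OPATCHAUTO|CRS)-[0-9]+' | LC_ALL=C sort -u
}

# Possible use: error_codes < opatchauto_session.log
```

Not every code it prints is an error (CRS-4123, for example, is informational), so the list is a starting point for reading the log, not a verdict.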
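What the process involved here (gipcd) does:
It runs in the cluster as the daemon gipcd.bin. Its main functions are:
1. When the cluster starts, discover the cluster's private network interfaces. As described in an earlier article, the private network information comes from the gpnp profile. The discovered private interfaces are then checked.
2. Using the discovered private interfaces, discover the other nodes in the cluster and establish connections with their private interfaces.
3. If the cluster is configured with multiple private interfaces and one or more of them fail on a node, take the faulty interfaces offline and notify the other nodes.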
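To see which interfaces Clusterware treats as the private interconnect, `oifcfg getif` lists the registered interfaces. The sketch below filters that output, assuming the usual four-column layout (interface, subnet, scope, type). The function name `private_interfaces` is made up for illustration, and no `oifcfg` output from this system was recorded in the article.

```shell
# Reads `oifcfg getif` output on stdin and prints "<interface> <subnet>"
# for every interface whose type includes cluster_interconnect.
private_interfaces() {
    awk '$4 ~ /cluster_interconnect/ { print $1, $2 }'
}

# Possible use on a live node, as the grid user:
#   oifcfg getif | private_interfaces
```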
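This looks related to the network configuration, since the node role cannot be obtained. But the problem only appeared after RAC node 2 had been patched successfully.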
Diagnosis and fix
Try starting and stopping the stack manually
[root@sy-oa-01 bin]# ./crsctl start has
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
[root@sy-oa-01 bin]# ./crsctl stop has -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'sy-oa-01'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'sy-oa-01'
CRS-2673: Attempting to stop 'ora.gpnpd' on 'sy-oa-01'
CRS-2679: Attempting to clean 'ora.gipcd' on 'sy-oa-01'
CRS-2673: Attempting to stop 'ora.evmd' on 'sy-oa-01'
CRS-2680: Clean of 'ora.gipcd' on 'sy-oa-01' failed
CRS-2677: Stop of 'ora.mdnsd' on 'sy-oa-01' succeeded
CRS-2677: Stop of 'ora.gpnpd' on 'sy-oa-01' succeeded
CRS-2677: Stop of 'ora.evmd' on 'sy-oa-01' succeeded
CRS-2799: Failed to shut down resource 'ora.gipcd' on 'sy-oa-01'
CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'sy-oa-01' has failed
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
[root@sy-oa-01 bin]# ps -ef|grep d.bin
root 3540 1 2 21:34 ? 00:00:08 /g01/app/grid/12.2.0/bin/ohasd.bin reboot
root 6369 3461 0 21:40 pts/0 00:00:00 grep --color=auto d.bin
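The `ps` check above can be made a little stricter so that the `grep` process itself never shows up in the result. This is a minimal sketch, assuming the standard `ps -ef` column layout with the command in the eighth field; `clusterware_daemons` is a hypothetical helper name.

```shell
# Reads `ps -ef` output on stdin and prints the name of every running
# Clusterware daemon binary (anything whose executable ends in "d.bin").
clusterware_daemons() {
    awk '$8 ~ /d\.bin$/ {
        n = split($8, parts, "/")   # keep only the file name
        print parts[n]
    }'
}

# Possible use: ps -ef | clusterware_daemons
```

In the transcript above only ohasd.bin is left, which is consistent with the lower stack having been stopped while the gipcd clean failed.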
The cluster alert log shows that the gipcd process cannot be initialized. Open the trace file it points to.
Trace file:
2018-08-30 21:19:15.716 : OCRMSG:3791640320: prom_listen: Failed to listen at endpoint [20]
2018-08-30 21:19:15.716 : OCRMSG:3791640320: GIPC error [20] msg [gipcretAddressInUse]
2018-08-30 21:19:15.716 : OCRSRV:3791640320: th_listen: prom_listen failed retval= 24, addr= [(ADDRESS=(PROTOCOL=ipc)(KEY=procr_local_conn_0_PROL))]
2018-08-30 21:19:15.716 : OCRSRV:4182315072: th_init: Local listener did not reach valid state
2018-08-30 21:19:15.716 : OCRAPI:4182315072: a_init:18!: Thread init unsuccessful : [24]
2018-08-30 21:19:15.726 : CRSRPT:2298464000: {0:0:2} Enabled
2018-08-30 21:19:15.727 : CRSPE:2420074240: {0:0:2} Getting config role...
2018-08-30 21:19:15.731 : CRSPE:2420074240: {0:0:2} ...done : 1
2018-08-30 21:19:15.731 : CRSPE:2420074240: {0:0:2} Getting the target role...
2018-08-30 21:19:15.733 : CRSPE:2420074240: {0:0:2} ...done: 1
2018-08-30 21:19:15.734 : CRSPE:2420074240: {0:0:2} Server Attribute ACTIVE_CSS_ROLE=hub
2018-08-30 21:19:15.734 : CRSPE:2420074240: {0:0:2} Server Attribute CONFIGURED_CSS_ROLE=hub
2018-08-30 21:19:15.734 : CRSPE:2420074240: {0:0:2} Server Attribute SITE_GUID=sy-oa-01
2018-08-30 21:19:15.734 : CRSPE:2420074240: {0:0:2} Server Attribute SITE_NAME=sy-oa-01
2018-08-30 21:19:15.734 : CRSPE:2420074240: {0:0:2} Server Attribute SITE_QUARANTINED=0
2018-08-30 21:19:15.734 : CRSPE:2420074240: {0:0:2} PE Role|State Update: old role [INVALID] new [INVALID]; old state [Not yet initialized] new [Enabling: waiting for role]
2018-08-30 21:19:15.735 : CRSSE:2296362752: {0:0:2} SE module master election disabled
2018-08-30 21:19:15.735 : CRSSE:2296362752: {0:0:2} Master Change Event; New Master Node ID:0 This Node's ID:0
2018-08-30 21:19:15.735 : CRSSE:2296362752: {0:0:2} Node down monitor enabled
2018-08-30 21:19:15.735 : CRSPE:2420074240: {0:0:2} PE Role|State Update: old role [INVALID] new [MASTER]; old state [Enabling: waiting for role] new [Configuring]
The key lines are:
2018-08-30 21:19:15.716 : OCRMSG:3791640320: GIPC error [20] msg [gipcretAddressInUse]
2018-08-30 21:19:15.735 : CRSPE:2420074240: {0:0:2} PE Role|State Update: old role [INVALID] new [MASTER]; old state [Enabling: waiting for role] new [Configuring]
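Trace files are long, so it can help to summarize which GIPC return codes appear and how often. The sketch below assumes the return codes are spelled `gipcret...` as in the trace above. `gipc_errors` is a hypothetical helper name, and `grep -o` is assumed to be available (GNU and BSD grep both have it).

```shell
# Reads a trace file on stdin and prints "<count> <code>" for each
# distinct gipcret* return code found.
gipc_errors() {
    grep -o 'gipcret[A-Za-z]*' | LC_ALL=C sort | uniq -c | awk '{ print $1, $2 }'
}

# Possible use: gipc_errors < /path/to/crsd.trc
```

`gipcretAddressInUse` on the local IPC endpoint suggests that something left over from the previous run still holds the address, which fits the failed clean of ora.gipcd seen earlier.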
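gipcd could not be cleaned up and re-initialized: the previous state was not cleaned up successfully and no new state was assigned, so the resource is left in an unknown state. Check the status of gipcd:
$ crsctl status res -t -init
Status: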
1 ONLINE OFFLINE STABLE
ora.gipcd
1 ONLINE UNKNOWN sy-oa-01 STABLE
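As you can see, ora.gipcd is in the UNKNOWN state, which matches the contents of the trace file. (Keep in mind that the problem only appeared after a patch had been applied successfully, so it may well be a side effect of patching.)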
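With many resources in the listing, the ones stuck in UNKNOWN can be picked out automatically. This is a minimal sketch, assuming the tabular layout of `crsctl status res -t -init` shown above: a resource name line starting with `ora.`, followed by instance lines of the form number, target, state, server, state details. `unknown_resources` is a hypothetical helper name.

```shell
# Reads `crsctl status res -t -init` output on stdin and prints
# "<resource> <server>" for every instance whose state is UNKNOWN.
unknown_resources() {
    awk '
        /^ora\./ { res = $1; next }                       # remember the current resource
        $1 ~ /^[0-9]+$/ && $3 == "UNKNOWN" { print res, $4 }
    '
}

# Possible use: crsctl status res -t -init | unknown_resources
```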
MOS reference:
CRS is not starting after applying the latest RU in 12.2 (Doc ID 2373945.1)
Solution:
# crsctl stop crs -f
Lock the Grid Home using below command:
# $GRID_HOME/crs/install/rootcrs.sh -lock
Start the CRS Home:
# crsctl start crs
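One thing to note here: this node has not been patched yet, and the cluster is now shut down, so we cannot simply use the fully automatic method and let it patch GI_HOME and ORACLE_HOME by itself. The patch has to be applied to each home manually, still as the root user. Of the solution steps above, only the first two need to be run.
The steps are as follows (this is effectively the method for patching with the cluster and the database shut down):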
# ./crsctl stop crs -f
# $GI_HOME/crs/install/rootcrs.sh -lock
# ./opatchauto.sh apply /psu/24396050/ -oh /g01/app/grid/12.2.0/
# ./opatchauto.sh apply /psu/24396050/ -oh /u01/oracle/12.2.0/
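At this point the patch is fully applied. The last step is to apply the SQL changes; see the final step in the 12c automatic patching article. Run it on one node only. Remember that the PDBs must be opened in UPGRADE mode first, and reopened READ WRITE once the changes have been applied.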
https://www.cndba.cn/Marvinn/article/2781



