Handling a RAC Node Outage Caused by an Inaccessible OCR Disk Group
Yesterday, right after sorting out the DB_LINK issue, I found that node 2 of this cluster had gone down... so this morning I was back to fixing cluster problems... so much for the weekend.
The troubleshooting process is as follows:
A first check shows that Oracle High Availability Services is up, while all the other key daemons are offline.
grid@ecology2:/home/grid>crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
grid@ecology2:/home/grid>crsctl check has
CRS-4638: Oracle High Availability Services is online
grid@ecology2:/home/grid>crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
Check the status of the node's startup resources:
grid@ecology2:/home/grid>crsctl status res -t -init
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE OFFLINE
ora.cluster_interconnect.haip
1 ONLINE OFFLINE
ora.crf
1 ONLINE ONLINE ecology2
ora.crsd
1 ONLINE OFFLINE
ora.cssd
1 ONLINE OFFLINE
ora.cssdmonitor
1 ONLINE ONLINE ecology2
ora.ctssd
1 ONLINE OFFLINE
ora.diskmon
1 OFFLINE OFFLINE
ora.drivers.acfs
1 ONLINE ONLINE ecology2
ora.evmd
1 ONLINE OFFLINE
ora.gipcd
1 ONLINE ONLINE ecology2
ora.gpnpd
1 ONLINE ONLINE ecology2
ora.mdnsd
1 ONLINE ONLINE ecology2
Check crsd.log:
2018-03-22 20:14:44.127: [ CRSMAIN][1368024864] Checking the OCR device
2018-03-22 20:14:44.127: [ CRSMAIN][1368024864] Sync-up with OCR
2018-03-22 20:14:44.127: [ CRSMAIN][1368024864] Connecting to the CSS Daemon
2018-03-22 20:14:44.127: [ CRSMAIN][1368024864] Getting local node number
2018-03-22 20:14:44.127: [ CRSMAIN][1361573632] Policy Engine is not initialized yet!
2018-03-22 20:14:44.128: [ CRSMAIN][1368024864] Initializing OCR
[ CLWAL][1368024864]clsw_Initialize: OLR initlevel [70000]
2018-03-22 20:14:44.437: [ OCRASM][1368024864]proprasmo: Error in open/create file in dg [OCR]
[ OCRASM][1368024864]SLOS : SLOS: cat=8, opn=kgfoOpen01, dep=15056, loc=kgfokge
2018-03-22 20:14:44.437: [ OCRASM][1368024864]ASM Error Stack :
2018-03-22 20:14:44.470: [ OCRASM][1368024864]proprasmo: kgfoCheckMount returned [6]
2018-03-22 20:14:44.470: [ OCRASM][1368024864]proprasmo: The ASM disk group OCR is not found or not mounted
2018-03-22 20:14:44.471: [ OCRRAW][1368024864]proprioo: Failed to open [+OCR]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.
2018-03-22 20:14:44.471: [ OCRRAW][1368024864]proprioo: No OCR/OLR devices are usable
2018-03-22 20:14:44.471: [ OCRASM][1368024864]proprasmcl: asmhandle is NULL
2018-03-22 20:14:44.472: [ GIPC][1368024864] gipcCheckInitialization: possible incompatible non-threaded init from [prom.c : 690], original from [clsss.c : 5343]
2018-03-22 20:14:44.472: [ default][1368024864]clsvactversion:4: Retrieving Active Version from local storage.
2018-03-22 20:14:44.475: [ OCRRAW][1368024864]proprrepauto: The local OCR configuration matches with the configuration published by OCR Cache Writer. No repair required.
2018-03-22 20:14:44.476: [ OCRRAW][1368024864]proprinit: Could not open raw device
2018-03-22 20:14:44.476: [ OCRASM][1368024864]proprasmcl: asmhandle is NULL
2018-03-22 20:14:44.478: [ OCRAPI][1368024864]a_init:16!: Backend init unsuccessful : [26]
2018-03-22 20:14:44.479: [ CRSOCR][1368024864] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2018-03-22 20:14:44.479: [ CRSD][1368024864] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2018-03-22 20:14:44.479: [ CRSD][1368024864][PANIC] CRSD exiting: Could not init OCR, code: 26
2018-03-22 20:14:44.479: [ CRSD][1368024864] Done.
grid@ecology2:/home/g01/grid/products/11.2.0/log/ecology2/crsd>
The log shows that the physical storage behind the OCR disk group cannot be accessed: the ASM disk group OCR is not found or not mounted, no usable OCR device is left, and CRSD exits with PROC-26, so the cluster stack cannot come up.
Check the ASM disk groups:
ASMCMD> lsdg
State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name
MOUNTED NORMAL N 512 4096 1048576 1772388 1073246 295398 388924 0 N DATA/
MOUNTED NORMAL Y 512 4096 4194304 16408 15972 0 7986 1 Y OCR/
ASMCMD> exit
The OCR disk group has one disk OFFLINE (Offline_disks = 1).
-- Check the disks
SQL> col name for a30;
SQL> col path for a40;
SQL> set linesize 999
SQL> select name,path,header_status,mount_status,state from v$asm_disk;
NAME PATH HEADER_STATUS MOUNT_STATUS STATE
------------------------------ ---------------------------------------- ------------------------ -------------- ----------------
/dev/dm-7 MEMBER CLOSED NORMAL
/dev/dm-6 MEMBER CLOSED NORMAL
_DROPPED_0000_OCR UNKNOWN MISSING FORCING
DATA_0003 /dev/dm-15 MEMBER CACHED NORMAL
DATA_0004 /dev/dm-14 MEMBER CACHED NORMAL
DATA_0002 /dev/dm-13 MEMBER CACHED NORMAL
DATA_0000 /dev/dm-11 MEMBER CACHED NORMAL
DATA_0001 /dev/dm-10 MEMBER CACHED NORMAL
DATA_0005 /dev/dm-9 MEMBER CACHED NORMAL
OCR_0001 /dev/dm-8 MEMBER CACHED NORMAL
10 rows selected.
The OCR disk OCR_0000 is in MISSING/FORCING state, while the physical devices /dev/dm-6 and /dev/dm-7 show up as MEMBER disks that are CLOSED.
Check which disk group and failgroup each disk belongs to:
col diskname for a30;
col failgroup for a30;
col state for a20;
col path for a20;
col diskgroup for a20;
set linesize 999;
SQL> select b.name as diskgroup, b.state as diskgroupstat,a.name as diskname,a.failgroup,b.type,a.path,a.header_status,a.mount_status,a.state from v$asm_disk a,v$asm_diskgroup b where a.group_number=b.group_number;
DISKGROUP DISKGROUPSTAT DISKNAME FAILGROUP TYPE PATH HEADER_STATUS MOUNT_STATUS STATE
-------------------- ---------------------- ------------------------------ ------------------------------ ------------ -------------------- ------------------------ -------------- --------------------
OCR MOUNTED _DROPPED_0000_OCR OCR_0000 NORMAL UNKNOWN MISSING FORCING
DATA MOUNTED DATA_0003 DATA_0003 NORMAL /dev/dm-15 MEMBER CACHED NORMAL
DATA MOUNTED DATA_0004 DATA_0004 NORMAL /dev/dm-14 MEMBER CACHED NORMAL
DATA MOUNTED DATA_0002 DATA_0002 NORMAL /dev/dm-13 MEMBER CACHED NORMAL
DATA MOUNTED DATA_0000 DATA_0000 NORMAL /dev/dm-11 MEMBER CACHED NORMAL
DATA MOUNTED DATA_0001 DATA_0001 NORMAL /dev/dm-10 MEMBER CACHED NORMAL
DATA MOUNTED DATA_0005 DATA_0005 NORMAL /dev/dm-9 MEMBER CACHED NORMAL
OCR MOUNTED OCR_0001 OCR_0001 NORMAL /dev/dm-8 MEMBER CACHED NORMAL
8 rows selected.
Fortunately the disk groups use NORMAL redundancy; otherwise the whole cluster might already be beyond saving....
With NORMAL redundancy and no failgroup specified when the disk groups were created, each disk ends up as its own failgroup, which is exactly what we see here.
Another problem is that the ASM disks were created directly on the raw /dev/dm-* device nodes rather than on the /dev/mapper multipath aliases, which is painful: the /dev/dm-* numbering can change after a reboot....
That is most likely why the PATH of the missing disk can no longer be seen at all....
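For reference, here is a rough sketch of how a disk group like this could be created with explicit failgroups on the stable /dev/mapper aliases instead of the raw dm-* nodes (the mpathh/mpathg paths are hypothetical examples, and asm_diskstring must include /dev/mapper/* for them to be discovered):
SQL> create diskgroup OCR normal redundancy
       failgroup fg1 disk '/dev/mapper/mpathh'
       failgroup fg2 disk '/dev/mapper/mpathg'
       attribute 'compatible.asm'='11.2';
The stable mapper paths survive reboots, so ASM keeps finding its members even when the kernel renumbers the dm-* devices.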
Try to add the disk back:
SQL> alter diskgroup OCR add disk '/dev/dm-7';
alter diskgroup OCR add disk '/dev/dm-7'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15033: disk '/dev/dm-7' belongs to diskgroup "OCR"
SQL> alter diskgroup OCR add disk '/dev/dm-7' force;
alter diskgroup OCR add disk '/dev/dm-7' force
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 74649
Session ID: 1648 Serial number: 6713
Disaster.... now the whole cluster is down: neither node 1 nor node 2 can be accessed...
The suspicion is that the disk cannot simply be force-added; its header information would have to be wiped with dd before adding it back.
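That dd step was not actually carried out here, but for reference a header wipe would look roughly like the following (this is destructive; only do it after double-checking that /dev/dm-7 really is the failed OCR member and carries nothing else):
# dd if=/dev/zero of=/dev/dm-7 bs=1M count=100    -- zero the first 100 MB, which clears the ASM disk header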
Check the ASM alert log on node 1.
It reports an ORA-600; the relevant MOS note is ORA-00600 : [kfdvfGetCurrent_baddsk] While adding failed disk into Voting diskgroup (Doc ID 2081484.1).
The solution boils down to restoring or recreating the OCR.
Recreating the OCR is kept as the last resort: first try restoring the OCR from its backup; if that fails, recreate it; and only if even that does not work, deconfigure and clean up the stack and then restore from backup.
-- Restoring the OCR from an automatic backup
The environment is 11.2.0.4, which backs up the OCR automatically, so a usable backup should exist.
[root@ee bin]# ./ocrcheck
Status of Oracle Cluster Registry is as follows :
Version : 3
Total space (kbytes) : 262120
Used space (kbytes) : 3112
Available space (kbytes) : 259008
ID : 257970053
Device/File Name : +CRSVOTEDISK
Device/File integrity check succeeded
Device/File not configured
Device/File not configured
Device/File not configured
Cluster registry integrity check succeeded
Logical corruption check succeeded
[root@ee bin]# ./crsctl query css votedisk
No voting disks are returned; the command fails because it cannot contact the CSS daemon.
Check the OCR backups:
[root@ee bin]# ./ocrconfig -showbackup
The file whose name ends in backup00.ocr is normally the most recent automatic backup.
cd into the directory that contains that file.
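Before restoring, it does no harm to keep a spare copy of the backup file somewhere safe (the target path below is just an example):
# ./ocrconfig -showbackup auto                 -- list only the automatic backups
# cp -p backup00.ocr /tmp/ocr_backup00.ocr     -- spare copy outside the Grid home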
Restore procedure:
1) Stop the clusterware on all nodes
# crsctl stop crs
# crsctl stop crs -f
2) As root, start the clusterware on one node in exclusive mode
# crsctl start crs -excl -nocrs
Note: if crsd turns out to be running, stop it with the command below.
$ crsctl status res -t -init    -- check whether the crsd resource is still up
# crsctl stop resource ora.crsd -init
3) Create a new disk group to hold the OCR and voting disk, using the same name as the original (to change the location instead, edit /etc/oracle/ocr.loc; a sketch of that file follows after this procedure)
Note: if the disk group cannot be created, one troubleshooting option is to drop the old one first:
SQL> drop diskgroup disk_group force including contents;
4) Restore the OCR and check it
# ocrconfig -restore backup00.ocr
# ocrcheck
5) Recreate the voting disk and check it
# crsctl replace votedisk +ocr    -- the disk group name must match the one in /etc/oracle/ocr.loc; adjust it if the location was changed
# crsctl query css votedisk
6) Stop the clusterware that was running in exclusive mode
# crsctl stop has -f
7) Start the clusterware normally on all nodes
# crsctl start has
8) Verify the OCR integrity on all RAC nodes with CVU
$ cluvfy comp ocr -n all -verbose
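As referenced in step 3, /etc/oracle/ocr.loc is where the clusterware records the OCR location. Its contents are environment specific, but look roughly like this (the +OCR value here is hypothetical and must match the actual disk group):
[root@ee bin]# cat /etc/oracle/ocr.loc
ocrconfig_loc=+OCR
local_only=FALSE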
After a little while node 1 started up normally and could provide service. The log $ORACLE_HOME/log/alertee1.log showed that the previously added dm-7 disk was back online and serving, and querying with the earlier statements in sqlplus / as sysasm also showed the dm-7 disk.
Node 2, however, still would not start. This is exactly the weakness of building ASM disks on dm-* devices: after the reboot the dm-* numbering changed, the OCR devices could no longer be read, and node 2 was evicted from the cluster. So the next step was to sort out node 2.
Fix for node 2:
The only option is to take the dm devices on node 1, find the WWIDs of their corresponding mapper devices, and then on node 2 look up which dm numbers those WWIDs now map to. Doing so shows that the dm device numbers have indeed changed.
Node 1 mapper -> dm:                               Node 2 mapper -> dm:
mpathj (2001738003212004c) dm-6 IBM,2810XIV mpathg (2001738003212004c) dm-5 IBM,2810XIV
mpathh (2001738003212004d) dm-7 IBM,2810XIV mpathh (2001738003212004d) dm-6 IBM,2810XIV
mpathg (2001738003212004e) dm-8 IBM,2810XIV mpathi (2001738003212004e) dm-8 IBM,2810XIV
mpathq (2001738003212005d) dm-9 IBM,2810XIV mpathp (2001738003212005d) dm-15 IBM,2810XIV
mpathl (2001738003212005b) dm-10 IBM,2810XIV mpathm (2001738003212005b) dm-14 IBM,2810XIV
mpathk (2001738003212005a) dm-11 IBM,2810XIV mpathk (2001738003212005a) dm-9 IBM,2810XIV
mpathm (20017380032120059) dm-13 IBM,2810XIV mpathl (20017380032120059) dm-10 IBM,2810XIV
mpathp (2001738003212005e) dm-14 IBM,2810XIV mpathq (2001738003212005e) dm-16 IBM,2810XIV
mpatho (2001738003212005c) dm-15 IBM,2810XIV mpatho (2001738003212005c) dm-12 IBM,2810XIV
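The mapping above is presumably taken straight from the multipath output; a quick way to collect it on each node (assuming device-mapper-multipath is what presents these LUNs):
# multipath -ll | grep 2810XIV     -- one line per LUN: mpath alias, WWID, current dm-* name
# dmsetup info -c                  -- device-mapper name to major:minor (dm-*) mapping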
Based on the node 2 results, update the udev rules file as follows:
KERNEL=="dm-9",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
KERNEL=="dm-5",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
KERNEL=="dm-6",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
KERNEL=="dm-8",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
KERNEL=="dm-10",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
KERNEL=="dm-12",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
KERNEL=="dm-14",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
KERNEL=="dm-15",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
KERNEL=="dm-16",OWNER:="grid", GROUP:="asmadmin", MODE:="660"
Then reload udev on node 2:
/sbin/start_udev
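Since the dm-* renumbering is the root cause of this whole incident, a more robust variant is to key the udev rules on the multipath WWID rather than on the kernel device name. A sketch, assuming the distribution's device-mapper udev rules export DM_UUID for multipath devices (its value has the form mpath-<WWID>):
ENV{DM_UUID}=="mpath-2001738003212004d", OWNER:="grid", GROUP:="asmadmin", MODE:="660"
One such line per LUN gives the same result on both nodes and keeps working after a reboot, no matter how the dm-* numbers shuffle.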
Then restart the clusterware stack on node 2 and keep an eye on the logs...
crsctl stop has -f
crsctl start has
Finally the node 2 clusterware came up normally...
At this point all cluster nodes are up and accessible again.... problem handled.
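For completeness, a few commands that can be used to double-check the final state (a quick sketch, not from the original session):
# crsctl check cluster -all          -- CRS/CSS/EVM status on every node
# crsctl status res -t               -- where each cluster resource is running
$ cluvfy comp ocr -n all -verbose    -- re-verify OCR integrity across nodes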
Copyright notice: this is an original post by the author; please do not reproduce it without the author's permission.