Mirrored Disk Failure in Normal Redundancy Mode. (A Failure Story Part 1)

 IRON MAN  WAS DOWN

I’ve moved my blog from https://insanedba.blogspot.com to https://dincosman.com. Please update your bookmarks and follow/subscribe at the new address for all the latest updates and content. A more up-to-date version of this post may be available there.

Lately, at our DR (Disaster Recovery) site, we experienced two mirrored disk failures in a normal redundancy disk group, which ended with us recreating the Data Guard databases. I will try to explain the problem in detail.

    Our databases were down. CRS state was offline. ASM was down. Iron Man was down.

We started diagnosing the issue by manually starting the ASM instance on one node. We have three disk groups; two of them mounted, but one (+DATA) could not be mounted. This disk group (+DATA) was holding the OCR and serving as the voting disk location. Here are the command and output.

SYS@+ASM1> startup
ASM instance started
Total System Global Area 3213349952 bytes
Fixed Size 8901696 bytes
Variable Size 3170893824 bytes
ASM Cache 33554432 bytes
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15066: offlining disk "73" in group "DATA" may result in a data loss
ORA-15042: ASM disk "133" is missing from group number "1"
ORA-15042: ASM disk "73" is missing from group number "1"

We checked the alert logs of all ASM instances to establish the chronological order of events. Let's examine the findings.

    At 07:17:22, ASM initiated the offline of cell disk 08 on exacel11. This was the first faulty disk.

...
2020-08-02T07:17:22.818522+03:00
NOTE: process _user321942_+asm1 (321942) initiating offline of disk 50.3167206305 (RECO_CD_08_EXACEL11) with mask 0x7e in group 3 (RECO) without client assisting
...

    At 07:17:25, after offlining the first faulty disk, the Exadata disk worker process (XDWK) was started, and all subsequent I/Os to the faulty ASM disks were diverted to their online partner disks (the partner relationships can be listed with the query sketched after the log excerpt below).

...
Starting background process XDWK
2020-08-02T07:17:25.851671+03:00
XDWK started with pid=39, OS id=166367
2020-08-02T07:17:25.869309+03:00
WARNING: I/Os to unhealthy ASM disk (DATA_CD_08_EXACEL11) in group DATA 1/0x975748AF will be diverted to its online partner disks
WARNING: I/Os to unhealthy ASM disk (RECO_CD_08_EXACEL11) in group RECO 3/0x976748B1 will be diverted to its online partner disks
WARNING: I/Os to unhealthy ASM disk (MORE_CD_08_EXACEL11) in group MORE 2/0x976748B0 will be diverted to its online partner disks
2020-08-02T07:17:25.952679+03:00
...
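
ASM disk partnerships can be listed by joining v$asm_disk to the undocumented x$kfdpartner view. A minimal sketch (run as SYSASM; the view is internal and its layout may vary by version):

SYS@+ASM1> SELECT d.group_number, d.disk_number, d.name,
  2         p.number_kfdpartner AS partner_disk
  3  FROM   v$asm_disk d, x$kfdpartner p
  4  WHERE  d.group_number = p.grp
  5  AND    d.disk_number  = p.disk
  6  ORDER  BY 1, 2, 4;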

    At 07:17:32, the Exadata auto disk management feature decided to start rebalancing immediately, with no disk_repair_time wait. The corresponding physical disk was in a failed state rather than a proactive/predictive failure, so waiting would have been useless, and the ASM disks were dropped. That is the normal procedure. Had the physical disk been in proactive/predictive failure status, ASM would have waited for the disk_repair_time period before dropping the ASM disks (see the attribute sketch after the log excerpt below); I will cover that situation in another post. More information is available in My Oracle Support notes: Doc ID 1484274.1 (Auto Disk Management Feature in Exadata), Doc ID 1452325.1 (Determining when Disks should be replaced on Oracle Exadata Database Machine), and Doc ID 1390836.1 (How to Replace a Hard Drive in an Exadata Storage Cell Server (Predictive Failure)).

...
2020-08-02T07:17:32.505170+03:00
SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
alter diskgroup DATA drop
disk DATA_CD_08_EXACEL11 force
...
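
The disk_repair_time attribute that governs this grace period is set per disk group and can be checked, and raised ahead of planned maintenance, with something like the sketch below (the 8.5h value is only an example):

SYS@+ASM1> SELECT group_number, name, value
  2  FROM   v$asm_attribute
  3  WHERE  name = 'disk_repair_time';

SYS@+ASM1> ALTER DISKGROUP DATA SET ATTRIBUTE 'disk_repair_time' = '8.5h';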

    At 07:18:58, after the first faulty disk had been dropped, the rebalance had to read from its partner disk, and Murphy's law took the stage:

Anything That Can Go Wrong Will Go Wrong.


...
2020-08-02T07:18:58.745833+03:00
WARNING: Read Failed. group:1 disk:73 AU:381 offset:0 size:1048576
path:o/192.168.10.18/DATA_CD_06_exacel10
incarnation:0xbcc7ba7b asynchronous result:'I/O error'
subsys:OSS krq:0x7f303e6e09c0 bufp:0x7f303d0d1000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 36233 usec Time waited on I/O: 0 usec
WARNING: Read Failed. group:1 disk:73 AU:413 offset:2097152 size:1048576
path:o/192.168.10.18/DATA_CD_06_exacel10
incarnation:0xbcc7ba7b asynchronous result:'I/O error'
subsys:OSS krq:0x7f303e6de9d0 bufp:0x7f303c8d1000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 37220 usec Time waited on I/O: 0 usec
NOTE: Suppressing further IO Read errors on group:1 disk:73
WARNING: Read Failed. group:1 disk:73 AU:396 offset:1048576 size:1048576
path:o/192.168.10.18/DATA_CD_06_exacel10
incarnation:0xbcc7ba7b asynchronous result:'I/O error'
subsys:OSS krq:0x7f303e6d6080 bufp:0x7f303dde1000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 39234 usec Time waited on I/O: 0 usec
NOTE: All mirrors failed, try recover mode for gn 1 fn 269 extent 1981
...
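
While it lasted, the rebalance triggered by the forced drop could be followed from v$asm_operation. A minimal sketch:

SYS@+ASM1> SELECT group_number, operation, state, power,
  2         sofar, est_work, est_minutes, error_code
  3  FROM   v$asm_operation;
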
    Unfortunately, the mirrored partner disk (cell disk 06 on exacel10) also failed and was taken offline, and the fault was detected. This ended with the dismounting of the +DATA disk group. Since this disk group was storing the Oracle Cluster Registry (OCR) and voting files, CRS eventually went down.

ERROR: disk 73 () in group 1 (DATA) cannot be offlined because all disks [73(), 133()] with mirrored data would be offline.
 
...
NOTE: process _arb0_+asm1 (167050) initiating offline of disk 73.3167206011 (DATA_CD_06_EXACEL10) with mask 0x7e in group 1 (DATA) with client assisting
NOTE: initiating PST update: grp 1 (DATA), dsk = 73/0xbcc7ba7b, mask = 0x6a, op = clear mandatory
...
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_167050.trc:
ORA-15130: diskgroup "" is being dismounted
ORA-15066: offlining disk "DATA_CD_06_EXACEL10" in group "DATA" may result in a data loss
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 100)
...
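
With the OCR and voting files on the dismounted disk group, the clusterware state can be confirmed from the Grid Infrastructure home. A minimal sketch (the Grid home path and host name are illustrative):

[root@exadb01 ~]# /u01/app/19.0.0.0/grid/bin/crsctl check crs
[root@exadb01 ~]# /u01/app/19.0.0.0/grid/bin/crsctl query css votedisk
[root@exadb01 ~]# /u01/app/19.0.0.0/grid/bin/ocrcheck
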
    We changed the status of the second faulty physical disk from "Unconfigured(bad)" to "Unconfigured-Good".

[root@exacel10 ~]# /opt/MegaRAID/storcli/storcli64 -Pdlist -aAll | grep "Slot\|Firmware"
Slot Number: 0
Firmware state: Online, Spun Up
Device Firmware Level: A8C0
Slot Number: 1
Firmware state: Online, Spun Up
Device Firmware Level: A8C0
Slot Number: 2
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
Slot Number: 3
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
Slot Number: 4
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
Slot Number: 5
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
Slot Number: 6
FIRMWARE STATE: UNCONFIGURED(BAD) ------
Device Firmware Level: 0B25
Slot Number: 7
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
Slot Number: 8
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
Slot Number: 9
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
Slot Number: 10
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
Slot Number: 11
Firmware state: Online, Spun Up
Device Firmware Level: 0B25
[root@exacel10 ~]# /opt/MegaRAID/storcli/storcli64 -PDMakeGood -PhysDrv[20:6] -a0
Adapter: 0: EnclId-20 SlotId-6 state changed to Unconfigured-Good.
Exit Code: 0x00
[root@exacel10 ~]# /opt/MegaRAID/storcli/storcli64 -Pdlist -aAll | grep "Slot\|Firmware"
...
Slot Number: 6
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: 0B25
...
[root@exacel10 ~]# /opt/MegaRAID/storcli/storcli64 /c0/e20/sall show
CLI Version = 007.0530.0000.0000 Sep 21, 2018
Operating system = Linux 4.1.12-124.30.1.el7uek.x86_64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive Information :
=================
------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
------------------------------------------------------------------------------
20:0 28 Onln 0 557.861 GB SAS HDD N N 512B HUS1560SCSUN600G U -
20:1 30 Onln 10 557.861 GB SAS HDD N N 512B HUS1560SCSUN600G U -
20:2 25 Onln 1 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
20:3 16 Onln 2 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
20:4 29 Onln 3 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
20:5 24 Onln 4 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
20:6 13 UGood - 557.861 GB SAS HDD N N 512B ST360057SSUN600G U - -------->
20:7 27 Onln 5 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
20:8 23 Onln 6 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
20:9 22 Onln 7 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
20:10 9 Onln 8 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
20:11 8 Onln 9 557.861 GB SAS HDD N N 512B ST360057SSUN600G U -
------------------------------------------------------------------------------
EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
SeSz-Sector Size|Sp-Spun|U-Up|D-Down/PowerSave|T-Transition|F-Foreign
UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
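
On the storage cell itself, the drive and its cell disk can be cross-checked with CellCLI. A minimal sketch:

[root@exacel10 ~]# cellcli -e list physicaldisk attributes name, status, slotNumber
[root@exacel10 ~]# cellcli -e list celldisk attributes name, status
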
   Then we tried to change the disk status to "Online". This command was crucial; it could have been our deus ex machina.

[root@exacel10 ~]# /opt/MegaRAID/MegaCli/MegaCli64/MegaCli64 -PDOnline -PhysDrv[20:6] -a0
[root@exacel10 ~]# /opt/MegaRAID/storcli/storcli64 /c0/e20/s6 set online
CLI Version = 007.0530.0000.0000 Sep 21, 2018
Operating system = Linux 4.1.12-124.30.1.el7uek.x86_64
Controller = 0
Status = Failure
Description = Set Drive Online Failed. ------> THIS IS WHERE WE GOT THE ERROR.
Detailed Status :
===============
------------------------------------------------
Drive Status ErrCd ErrMsg
------------------------------------------------
/c0/e20/s6 Failure 255 Operation not allowed.
------------------------------------------------
We could not bring it online again. We conceded the loss of the standby databases and the DATA disk group, but did not give up yet. We tried to save the cluster configuration and started by restoring the OCR. That is a big story to tell, and it will be covered in my next post.


Hope it helps.
