Mirrored Disk Failure in Normal Redundancy Mode (A Failure Story, Part 1)

IRON MAN WAS DOWN

I’ve moved my blog from https://insanedba.blogspot.com to https://dincosman.com. Please update your bookmarks and follow/subscribe at the new address for all the latest updates and content. A more up-to-date version of this post may be available there.

Lately, at our DR (Disaster Recovery) site, we experienced failures of two mirrored disks in a normal redundancy disk group, which ended with us recreating the Data Guard databases. I will try to explain our problem in detail.

    Our databases were down. CRS state was offline. ASM was down. Iron Man was down.

Mirrored Disk Failure in Normal Redundancy Mode


We started diagnosing the issue by manually starting up the ASM instance on one node. We have three disk groups. Two of them mounted, but one disk group (+DATA) could not be mounted. This disk group (+DATA) was holding the OCR and serving as the voting disk location. A sketch of the command and the kind of output we saw is below.
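The original screenshot of the output is not reproduced here, so the following is only a minimal sketch of the attempt (disk group and instance names are from our environment; ORA-15017/ORA-15063 are the typical errors when too many disks of a normal redundancy disk group are missing, and the exact messages on your system may differ):

    $ sqlplus / as sysasm
    SQL> startup
    ...
    SQL> select name, state from v$asm_diskgroup;
    SQL> alter diskgroup DATA mount;
    alter diskgroup DATA mount
    *
    ERROR at line 1:
    ORA-15032: not all alterations performed
    ORA-15017: diskgroup "DATA" cannot be mounted
    ORA-15063: ASM discovered an insufficient number of disks for diskgroup "DATA"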


We checked the alert logs of all ASM instances to establish the chronological order of events. Let's examine the findings.

    At 07:17:22, disk 8 on exacel11 was taken offline. This was the first faulty disk.


    At 07:17:25, after offlining the first faulty disk, the Exadata disk worker process (XDWK) accessed the partner disks of the faulty ASM disk. All subsequent I/Os to the faulty ASM disk would be directed to its partners.


    At 07:17:32, the Exadata Auto Disk Management feature decided to start rebalancing immediately (i.e., without waiting out the disk_repair_time period), because the corresponding physical disk was in a failed state and this was not a proactive failure detection, so waiting would have been useless. For that reason, the ASM disks were dropped. That is the normal procedure. If the underlying physical disk had been in a proactive/predictive failure status, ASM would have waited for the disk_repair_time period before dropping the ASM disks; I will write another post about that situation. (The disk_repair_time attribute, the per-disk repair timers, and the disk partnership can be inspected as in the sketch after these findings.) More information is available in the MOS notes: Doc ID 1484274.1 (Auto Disk Management Feature in Exadata), Doc ID 1452325.1 (Determining When Disks Should Be Replaced on Oracle Exadata Database Machine), and Doc ID 1390836.1 (How to Replace a Hard Drive in an Exadata Storage Cell Server (Predictive Failure)).


    At 07:18:58, after the first faulty disk had been dropped, during the rebalance phase, while reading from the partner disk, Murphy's law took the stage:

Anything That Can Go Wrong Will Go Wrong.


    Unfortunately, the mirrored partner disk (exacel10 disk 6) also went offline and the fault was detected. This ended with the dismount of the +DATA disk group. Since this disk group was storing the Oracle Cluster Registry and voting files, CRS eventually went down.

ERROR: disk 73 () in group 1 (DATA) cannot be offlined because all disks [73(), 133()] with mirrored data would be offline.
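As referenced above, the disk_repair_time attribute, the per-disk repair timers, and the disk partnership can all be inspected from the ASM instance. This is a minimal sketch under our assumptions (group_number 1 corresponds to DATA here, and x$kfdpartner is an internal, undocumented view, so treat these queries as illustrative only):

    -- status and remaining repair timer (seconds) of the disks in group 1
    SQL> select name, mount_status, mode_status, state, repair_timer
           from v$asm_disk
          where group_number = 1;

    -- the grace period ASM waits before dropping a transiently offlined disk
    SQL> select name, value
           from v$asm_attribute
          where group_number = 1
            and name = 'disk_repair_time';

    -- which disks hold the mirror copies (partners) of each disk in group 1
    SQL> select disk "disk", number_kfdpartner "partner"
           from x$kfdpartner
          where grp = 1
          order by 1, 2;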
 
    We changed the second faulty physical disk's status from "UNCONFIGURED(BAD)" to "Unconfigured-Good".
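I am not reproducing the exact controller output here; the sketch below shows the MegaCli-style commands for this state change, with a placeholder [enclosure:slot] address (252:6 is illustrative, not our real slot), so verify the address of the failed disk first:

    # list the physical disks and their firmware state on the cell
    /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep -iE 'slot|firmware state'

    # flip the disk from Unconfigured(bad) to Unconfigured(good)
    /opt/MegaRAID/MegaCli/MegaCli64 -PDMakeGood -PhysDrv[252:6] -a0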

   Then we tried to change the disk status to "Online". This command was crucial; it could have been our deus ex machina. 
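I am not reproducing the exact commands and output from that night; roughly, the status checks and the re-enable attempt on the storage cell look like the sketch below (the physical disk name 20:6 is a placeholder, and reenable force should really only be run with Oracle Support's guidance):

    CellCLI> list physicaldisk where disktype=harddisk attributes name, status, slotNumber
    CellCLI> list griddisk attributes name, status, asmmodestatus, asmdeactivationoutcome
    CellCLI> alter physicaldisk 20:6 reenable force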
We couldn't bring it online again. We conceded the loss of the standby databases and the DATA disk group, but we did not give up yet. We tried to save the cluster configuration and started with restoring the OCR. That is a big story to tell, and it will be available in my next post.


Hope it helps.
