Exadata: Disk controller was hung. Cell was power cycled
Just another manic magic Monday.
After a great weekend, we came to the office and ran our daily health checks, as we do every Monday. One of the storage servers (cells) of our Exadata X2-2 (X4270 M2) had lost 11 of its 34 ASM disks. We were lucky: all databases were still up despite the losses.
Let's examine what happened to our cell server. When I checked the mailbox, I saw an alert mail from the problematic cell stating "Disk controller was hung. Cell was power cycled." It looked like the cell's disk controller had stopped responding (maybe because of a bug or a peak in load) and forced the server to reboot. Normally, though, a reboot does not end in disk losses.
I started by checking the cell's physical disk status.
The output showed one failed flash disk, one failed hard disk (disk number 3), and one hard disk in import failure status (disk number 7). That did not explain the loss of 11 ASM disks, though. According to the output, it should have been 6 ASM disks. There had to be something more.
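The mismatch can be checked with a quick tally. The sketch below assumes three grid disks per affected hard disk (the count this post reports for the import failure disk) and assumes the failed flash disk carries no ASM grid disks of its own. Both are assumptions, not values read from the cell output.

```python
# Assumption: each affected hard disk carries three grid disks, and each
# grid disk maps to exactly one ASM disk. The flash disk is treated as
# cache only, so it contributes no ASM disks to the count.
GRID_DISKS_PER_HARD_DISK = 3

affected_hard_disks = {
    3: "failed",
    7: "import failure",
}

expected_lost = len(affected_hard_disks) * GRID_DISKS_PER_HARD_DISK
observed_lost = 11
unexplained = observed_lost - expected_lost

print(f"expected from physical disk status: {expected_lost}")
print(f"observed in ASM: {observed_lost}")
print(f"still unexplained: {unexplained}")
```

The remainder is what the next check has to account for.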
I continued with checking grid disks.
Five grid disks belonging to two physical disks (disk numbers 0 and 11) were in "cacheContentLost" status. I searched Oracle Support for grid disks in that status.
Doc ID 2346075.1 matched our problem. The document was clear and explained the steps to recover grid disks in the cacheContentLost state.
When write-back flash cache is active on the storage cells and a flash disk fails, the grid disks cached by the failed flash disk may hold stale data. If the flash disk failure occurs while the Exadata storage software is running, a resilvering operation is started to resynchronize the stale blocks from the other storage servers.
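For cells with many grid disks, picking out the affected ones by eye gets tedious. Here is a minimal sketch that groups grid disks in a given status by cell disk number. The sample listing is invented for illustration: the layout is what I would expect from `cellcli -e list griddisk attributes name,status`, and the names and the `cel01` suffix are hypothetical, not our real output.

```python
import re
from collections import defaultdict

# Hypothetical listing, NOT real output from our cell. Names and the
# "cel01" suffix are made up; the two-column layout is an assumption.
SAMPLE = """\
DATA_CD_00_cel01     cacheContentLost
RECO_CD_00_cel01     cacheContentLost
DATA_CD_01_cel01     active
DATA_CD_11_cel01     cacheContentLost
RECO_CD_11_cel01     cacheContentLost
DBFS_DG_CD_11_cel01  cacheContentLost
"""

def grid_disks_by_status(listing, wanted):
    """Group grid disk names in the wanted status by cell disk number."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, status = parts
        match = re.search(r"_CD_(\d+)_", name)
        if status == wanted and match:
            groups[int(match.group(1))].append(name)
    return dict(groups)

lost = grid_disks_by_status(SAMPLE, "cacheContentLost")
for disk_no, names in sorted(lost.items()):
    print(disk_no, names)
```

Grouping by cell disk number makes it easy to see which physical disks sit behind the affected grid disks.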
But if the flash disk failure happens while the storage software is not running, or while the cell is rebooting, the resilvering operation is not started and the grid disks are marked with the 'cacheContentLost' state. The grid disks stay offline to prevent the databases from accessing stale data.
In our case, the disk controller hung and the server was rebooted as a result. The flash disk failure occurred during the reboot, and the grid disks were stuck in "cacheContentLost". Our team checked the gv$asm_operation view for ongoing rebalance operations, but there were no rows. The ASM disks related to those grid disks had already been dropped, and the disk repair time had already passed.
We decided to recreate the grid disks that were in the "cacheContentLost" state to make them visible in ASM again.
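The two outcomes can be summarized as a small decision function. This only restates the behaviour as the support note describes it for write-back flash cache. It is not Exadata code, and the function and enum names are invented for illustration.

```python
from enum import Enum

class Outcome(Enum):
    RESILVERING = "resilvering started, stale blocks resynchronized from other cells"
    CACHE_CONTENT_LOST = "grid disks marked cacheContentLost and kept offline"

def outcome_after_flash_failure(storage_software_running: bool) -> Outcome:
    """Hypothetical helper: models the write-back flash cache case only."""
    if storage_software_running:
        return Outcome.RESILVERING
    return Outcome.CACHE_CONTENT_LOST

# A flash disk failure during a reboot, when the storage software is down:
print(outcome_after_flash_failure(storage_software_running=False).name)
```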
After executing those commands, we were left with one hard disk in import failure status, one failed hard disk, and one flash disk with four FMODs in a failed state. We opened SRs for the flash disk and the failed hard disk, and replaced both with the new spares we had on hand. Normally, no additional steps are required to re-create the cell disks or grid disks after a flash disk or hard disk replacement.
Now let's continue with our case. Only one hard disk still had problems, the one in import failure status, and only its three grid disks and their ASM disks were still missing. We executed the commands below to check the hard disk information.
The foreign state did not look good, so I tried to change it for that hard disk. The commands below were executed to clear the foreign state and reconfigure RAID on that hard disk.
The grid disks for that disk were still in the "not present" state, so we decided to run the re-enable command for that physical disk. The commands for re-enable are as follows.
Now everything is perfect again. It really was a manic Monday. To avoid a similar situation in the future, we also decided to update the image on our Exadata servers to the latest version. The issue has not recurred since.
Hope it helps.