Be CAREFUL! Before Applying Release Update 19.13
After applying Release Update 19.13 to our standby site cluster, consisting of two physical machines (not Oracle-engineered systems), we started to experience node evictions and reboot problems. We immediately started to search for the root cause and followed the steps below to make the nodes stable again before applying RU 19.13 to our production sites.
* We chose a sample incident that reflects the problem: on Feb 22, at 13:35, the host machine (bltdb02) rebooted.
* We looked at the lastgasp log files to find out why the node was rebooted. The Grid cssdagent or cssdmonitor can record node reboots there.
* These files are plain ASCII text; apart from the timestamps there was nothing worth mentioning, and the records in them were from another date.
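For reference, on Linux these lastgasp files normally live under /etc/oracle/lastgasp; a rough sketch of how to inspect them (file names vary by host and version):
ls -ltr /etc/oracle/lastgasp
cat /etc/oracle/lastgasp/*     # small ASCII records written by cssdagent/cssdmonitor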
* We searched /var/log/messages for anything related around the reboot time.
* Nothing useful there.
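The search was roughly of this form (the patterns and the time window are only an illustration, not the exact commands we ran):
grep "Feb 22 13:2" /var/log/messages | grep -iE "error|fail|warn|panic|i/o"     # minutes leading up to the 13:35 reboot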
* We looked for errors, warnings, and ORA- errors in the RDBMS database alert logs.
* The database had detected a forward time drift and traced it; after that there were no more records in the alert log. The instance crashed without having time to record anything useful, and there was a 12-minute gap between the last trace and the database restart.
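A rough sketch of that kind of scan, assuming the default ADR layout (replace <dbname> and <SID> with your own values):
grep -inE "ORA-|error|warning|drift" $ORACLE_BASE/diag/rdbms/<dbname>/<SID>/trace/alert_<SID>.log | tail -50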
* VKTM is a high-priority background process; when one of these critical processes dies, the instance dies with it.
* Let's examine the VKTM trace file.
* The VKTM process was alive and had detected a drift. There are some bugs on support.oracle.com related to high CPU usage by VKTM, but the CPU usage of the VKTM process was in the 1%-5% range on our system. That is a normal rate, so there had to be another root cause.
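A quick, hedged way to watch VKTM's CPU usage on a running instance (pgrep and pidstat are standard Linux tools; ora_vktm_<SID> is the usual background-process name):
pidstat -u -p $(pgrep -d, -f ora_vktm) 5 3     # sample VKTM's CPU usage every 5 seconds, 3 times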
* There was nothing useful in the database alert logs of the rebooted server.
* We then examined the database alert logs of the healthy server in the cluster (not the rebooted one).
* The processes making cross-instance calls on the healthy node were waiting for a reply, but the evicted node did not answer in a timely fashion, so a "Reconfiguration" started. The line "GES: System Load is HIGH" also pointed to a problem with the Global Enqueue Service.
* We also investigated the logs of ASM instances. Nothing useful.
* At this point, we started to investigate the Grid Infrastructure (CRS) logs.
* At 13:28:08, the Grid inter-process communication daemon had been non-responsive for 10 seconds. Afterwards, other daemons also started to fail, and cross-instance calls could not get a reply either. Given that, we needed to investigate the OS resources from around 13:27:58 onwards.
* We looked at details in the ocssd (Oracle Cluster Synchronization Services Daemon) trace file.
* It was time to view the gipcd (Grid Interprocess Communication Daemon) trace file.
* According to these traces, the gipc daemon could not open the OLR (Oracle Local Registry). The OLR is stored on the local disk, and I/O requests to the local disks were not being answered.
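For reference, on 12c and later Grid Infrastructure these daemon traces usually sit under the Grid owner's ADR; a hedged pointer to where to look (ORACLE_BASE here is the Grid owner's base, and the hostname directory is normally the lowercase short hostname):
ls -l $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/ocssd.trc
ls -l $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/gipcd.trc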
* We also looked for any core dumps from crs. There weren't any.
* Up to this point, we had looked at the Linux system logs, the RDBMS and ASM alert logs, and the CRS alert and daemon logs. We suspected that some kind of resource exhaustion had taken place.
* We decided to look at the OSWatcher statistics (iostat, vmstat, pidstat, top, etc.) for the node eviction time period.
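The OSWatcher layout below is only a sketch; the install directory and file names depend on how oswbb was deployed on your system:
cd /u01/oswbb/archive          # hypothetical install path; adjust to your deployment
ls                             # oswiostat  oswvmstat  oswmpstat  oswps  oswtop  ...
grep -n "zzz" oswiostat/*_iostat_*.dat | grep "13:2"     # each sample starts with a "zzz" timestamp line, so you can jump to the incident window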
* There was a warning message in the oswps.log file as below:
BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected
kernel BUG at list_del corruption
list_add corruption
do_IRQ: stack overflow: kernel stack overflow (current):
general protection fault
unable to handle kernel double fault:
RTNL: assertion failed seek!
page_mapcount(page) went negative! madness at
NETDEV WATCHDOG
sysctl table check failed: nobody cared
IRQ handler type mismatch
Kernel panic - not syncing: Machine Check Exception: Machine check events logged
divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
* That was just the argument string passed to the abrt binary (the abrt log-watcher's own command line, captured in the ps output), so we IGNORED it.
* In the oswiostat log file, there was an I/O spike on the local disk (dm-0) at 13:28:17.
* GRID_HOME is installed under the "/" filesystem, and the local disk (dm-0) was starving at that time.
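To confirm what dm-0 maps to and where GRID_HOME actually sits, something like the following is enough (standard Linux tools; the oswbb path is the same hypothetical one as above):
lsblk -o NAME,KNAME,TYPE,MOUNTPOINT     # the KNAME column shows which volume is dm-0 and where it is mounted
df -h $GRID_HOME                        # with GRID_HOME set to your Grid installation path
grep "^dm-0" /u01/oswbb/archive/oswiostat/*_iostat_*.dat | less     # then read the await and %util columns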
* We also created a dashboard to see clearly what was happening on the server.
* We examined the OSWatcher graphs; the one worth mentioning is the average wait time of dm-0.
* Average wait time should not exceed 20 ms on a healthy system, and we saw a huge spike for dm-0.
* The OSWatcher analysis report told us the same thing in other words.
* This disk had not experienced that kind of issue before the 19.13 RU and had been working as expected on RU 19.7. It looks like our "/" local disk was having a hard time with 19.13. Oracle Grid and database software mostly uses the local disks for writing audit and trace files, and trace files are mostly created in times of trouble.
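One quick, hedged way to see how much the trace and audit destinations on the local disk are growing (default ADR and audit locations assumed; adjust to your layout):
du -sh $ORACLE_BASE/diag $ORACLE_BASE/admin/*/adump 2>/dev/null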
* We downloaded a one-off patch and applied it to GRID_HOME. After two days of monitoring, the same problem occurred again.
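A quick way to confirm what is actually installed in the Grid home after patching (run as the Grid owner):
$GRID_HOME/OPatch/opatch lspatches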
* We decided to analyze the OSWatcher statistics of all the incident times together, and were lucky to find that, just before each I/O spike, a java process was consuming a huge amount of CPU in every sample (197%, 374%, and so on).
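A rough sketch of how such a process can be spotted in the archived samples (the oswbb path is hypothetical, and the %CPU column position may differ between top versions); cross-checking the PID against the ps samples then shows which component the java process belongs to:
grep -h "java" /u01/oswbb/archive/oswtop/*_top_*.dat | sort -k9 -rn | head     # column 9 is %CPU in standard top output
grep "<pid>" /u01/oswbb/archive/oswps/*_ps_*.dat | head                        # look the PID up to see its full command line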
* THE PROBLEM WAS CVU (THE CLUSTER VERIFICATION UTILITY).
* We checked its parameters, as below:
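A sketch of how the ora.cvu resource can be checked (run as the Grid owner; output wording varies by version, and by default CVU runs its checks every 6 hours, i.e. 360 minutes):
srvctl config cvu                                  # reports the CVU check interval
crsctl stat res ora.cvu -p | grep -i interval      # CHECK_INTERVAL from the resource profile
crsctl stat res ora.cvu -t                         # shows which node the resource is currently running on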
* The first occurrence of the issue was just 6 hours after the patch was applied, and the intervals between the other incidents were also multiples of 6 hours, which matches the CVU check interval.
* It looks like, with version 19.13, the cvu resource uses too much CPU or has a bug. With that much CPU usage, some processes get stuck in "D" state and others stop replying in a timely fashion. During this chaotic period, trace files also get written to the local disk and sudden I/O spikes are recorded. Afterwards, the healthy node evicts the node on which the cvu resource is running, and cvu relocates to the healthy node.
* The purpose of this resource is to invoke clusterware health checks at regular intervals, but instead it became the reason the clusterware node evictions happened.
* We stopped and disabled it and will watch its behaviour in future RUs.
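For reference, stopping and disabling the resource is done with srvctl (run as the Grid owner); re-enabling it later is simply the reverse (srvctl enable cvu, srvctl start cvu):
srvctl stop cvu
srvctl disable cvu
srvctl status cvu      # confirm it is no longer running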
Update in October 2023.
Hope it helps everyone experiencing "Node evictions and Reboots".