Be CAREFUL! Before Applying Release Update 19.13
After applying Release Update 19.13 to our standby site cluster, consisting of two physical machines (not Oracle-engineered systems), we started to experience node evictions and reboot problems. We immediately started to search for the root cause and followed the steps below to make the nodes stable again before applying RU 19.13 to our production sites.
* We chose a sample incident that reflects the problem: on Feb 22, at 13:35, the host machine (bltdb02) rebooted.
* We looked at the lastgasp log files to find out why the node was rebooted. The Grid cssdagent or cssdmonitor can record node reboots there.
* These files are plain ASCII text; apart from the timestamps there was nothing worth mentioning, and the records in them were from another date.
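For reference, on Linux these lastgasp files normally live under /etc/oracle/lastgasp; a rough sketch of how to inspect them (file names vary by host and version):
ls -ltr /etc/oracle/lastgasp
cat /etc/oracle/lastgasp/*     # small ASCII records written by cssdagent/cssdmonitor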
* We searched /var/log/messages for anything related around the reboot time.
* Nothing useful there.
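The search was roughly of this form (the patterns and the time window are only an illustration, not the exact commands we ran):
grep "Feb 22 13:2" /var/log/messages | grep -iE "error|fail|warn|panic|i/o"     # minutes leading up to the 13:35 reboot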
* We looked for errors, warnings, and ORA- errors in the RDBMS database alert logs.
* The database had detected a forward time drift and traced it; after that there were no more records in the alert log. The instance crashed without having time to record anything useful, and there was a 12-minute gap between the last trace and the database restart.
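A rough sketch of that kind of scan, assuming the default ADR layout (replace <dbname> and <SID> with your own values):
grep -inE "ORA-|error|warning|drift" $ORACLE_BASE/diag/rdbms/<dbname>/<SID>/trace/alert_<SID>.log | tail -50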
* VKTM is a high-priority background process; when one of these critical processes dies, the instance dies with it.
* Let's examine the VKTM trace file.
* The VKTM process was alive and had detected a drift. There are some bugs on support.oracle.com related to high CPU usage by VKTM, but the CPU usage of the VKTM process was in the 1%-5% range on our system. That is a normal rate, so there had to be another root cause.
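A quick, hedged way to watch VKTM's CPU usage on a running instance (pgrep and pidstat are standard Linux tools; ora_vktm_<SID> is the usual background-process name):
pidstat -u -p $(pgrep -d, -f ora_vktm) 5 3     # sample VKTM's CPU usage every 5 seconds, 3 times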
* There was nothing useful in the database alert logs of the rebooted server.
* We then examined the database alert logs of the healthy server in the cluster (not the rebooted one).
* The processes making cross-instance calls on the healthy node were waiting for a reply, but the evicted node did not answer in a timely fashion, so a "Reconfiguration" started. The line "GES: System Load is HIGH" also pointed to a problem with the Global Enqueue Service.
* We also investigated the logs of ASM instances. Nothing useful.
* At this point, we started to investigate the Grid Infrastructure (CRS) logs.
* At 13:28:08, the Grid inter-process communication daemon had been non-responsive for 10 seconds. Afterwards, other daemons also started to fail, and cross-instance calls could not get a reply either. Given that, we needed to investigate the OS resources from around 13:27:58 onwards.
* We looked at details in the ocssd (Oracle Cluster Synchronization Services Daemon) trace file.
* It was time to view the gipcd (Grid Interprocess Communication Daemon) trace file.
* According to these traces, the gipc daemon could not open the OLR (Oracle Local Registry). The OLR is stored on the local disk, and I/O requests to the local disks were not being answered.
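For reference, on 12c and later Grid Infrastructure these daemon traces usually sit under the Grid owner's ADR; a hedged pointer to where to look (ORACLE_BASE here is the Grid owner's base, and the hostname directory is normally the lowercase short hostname):
ls -l $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/ocssd.trc
ls -l $ORACLE_BASE/diag/crs/$(hostname -s)/crs/trace/gipcd.trc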
* We also looked for any core dumps from crs. There weren't any.
* Up to this point, we had looked at the Linux system logs, the RDBMS and ASM alert logs, and the CRS alert and daemon logs. We suspected that some kind of resource exhaustion had taken place.
* We decided to look at the OSWatcher statistics (iostat, vmstat, pidstat, top, etc.) for the node eviction time period.
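The OSWatcher layout below is only a sketch; the install directory and file names depend on how oswbb was deployed on your system:
cd /u01/oswbb/archive          # hypothetical install path; adjust to your deployment
ls                             # oswiostat  oswvmstat  oswmpstat  oswps  oswtop  ...
grep -n "zzz" oswiostat/*_iostat_*.dat | grep "13:2"     # each sample starts with a "zzz" timestamp line, so you can jump to the incident window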
* There was a warning message in the oswps.log file as below:
BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected
kernel BUG at list_del corruption
list_add corruption
do_IRQ: stack overflow: kernel stack overflow (current):
general protection fault
unable to handle kernel double fault:
RTNL: assertion failed seek!
page_mapcount(page) went negative! madness at
NETDEV WATCHDOG
sysctl table check failed: nobody cared
IRQ handler type mismatch
Kernel panic - not syncing: Machine Check Exception: Machine check events logged
divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
* That was just the argument string passed to the abrt binary (the abrt log-watcher's own command line, captured in the ps output), so we IGNORED it.
* In the oswiostat log file, there was an I/O spike on the local disk (dm-0) at 13:28:17.
* GRID_HOME is installed under the "/" filesystem, and the local disk (dm-0) was starving at that time.
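To confirm what dm-0 maps to and where GRID_HOME actually sits, something like the following is enough (standard Linux tools; the oswbb path is the same hypothetical one as above):
lsblk -o NAME,KNAME,TYPE,MOUNTPOINT     # the KNAME column shows which volume is dm-0 and where it is mounted
df -h $GRID_HOME                        # with GRID_HOME set to your Grid installation path
grep "^dm-0" /u01/oswbb/archive/oswiostat/*_iostat_*.dat | less     # then read the await and %util columns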
* We also created a dashboard to see clearly what was happening on the server.
* We examined the OSWatcher graphs; the one worth mentioning is the average wait time of dm-0.
* Average wait time should not exceed 20 ms on a healthy system, and we saw a huge spike for dm-0.
* The OSWatcher analysis report told us the same thing in other words.
* This disk had not experienced that kind of issue before the 19.13 RU and had been working as expected on RU 19.7. It looks like our "/" local disk was having a hard time with 19.13. Oracle Grid and database software mostly uses the local disks for writing audit and trace files, and trace files are mostly created in times of trouble.
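One quick, hedged way to see how much the trace and audit destinations on the local disk are growing (default ADR and audit locations assumed; adjust to your layout):
du -sh $ORACLE_BASE/diag $ORACLE_BASE/admin/*/adump 2>/dev/null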
* We downloaded a one-off patch and applied it to GRID_HOME. After two days of monitoring, the same problem occurred again.
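A quick way to confirm what is actually installed in the Grid home after patching (run as the Grid owner):
$GRID_HOME/OPatch/opatch lspatches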
* We decided to analyze the OSWatcher statistics of all the incident times together, and were lucky to find that, just before each I/O spike, a java process was consuming a huge amount of CPU in every sample (197%, 374%, and so on).
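A rough sketch of how such a process can be spotted in the archived samples (the oswbb path is hypothetical, and the %CPU column position may differ between top versions); cross-checking the PID against the ps samples then shows which component the java process belongs to:
grep -h "java" /u01/oswbb/archive/oswtop/*_top_*.dat | sort -k9 -rn | head     # column 9 is %CPU in standard top output
grep "<pid>" /u01/oswbb/archive/oswps/*_ps_*.dat | head                        # look the PID up to see its full command line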
* THE PROBLEM WAS CVU (THE CLUSTER VERIFICATION UTILITY).
* We checked its parameters, as below:
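A sketch of how the ora.cvu resource can be checked (run as the Grid owner; output wording varies by version, and by default CVU runs its checks every 6 hours, i.e. 360 minutes):
srvctl config cvu                                  # reports the CVU check interval
crsctl stat res ora.cvu -p | grep -i interval      # CHECK_INTERVAL from the resource profile
crsctl stat res ora.cvu -t                         # shows which node the resource is currently running on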
* The first occurrence of the issue was just 6 hours after the patch was applied, and the intervals between the other incidents were also multiples of 6 hours, which matches the CVU check interval.
* It looks like, with version 19.13, the cvu resource uses too much CPU or has a bug. With that much CPU usage, some processes get stuck in "D" state and others stop replying in a timely fashion. During this chaotic period, trace files also get written to the local disk and sudden I/O spikes are recorded. Afterwards, the healthy node evicts the node on which the cvu resource is running, and cvu relocates to the healthy node.
* The purpose of this resource is to invoke clusterware health checks at regular intervals, but instead it became the reason the clusterware node evictions happened.
* We stopped and disabled it and will watch its behaviour in future RUs.
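For reference, stopping and disabling the resource is done with srvctl (run as the Grid owner); re-enabling it later is simply the reverse (srvctl enable cvu, srvctl start cvu):
srvctl stop cvu
srvctl disable cvu
srvctl status cvu      # confirm it is no longer running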
Update in October 2023.
Hope it helps everyone experiencing "Node evictions and Reboots".