Node Eviction after Applying Release Update 19.13

Be CAREFUL Before Applying a Release Update!

I’ve moved my blog from https://insanedba.blogspot.com to https://dincosman.com. Please update your bookmarks and follow/subscribe at the new address for all the latest updates and content. A more up-to-date version of this post may be available there.

After applying Release Update 19.13 to our standby site cluster, which consists of two physical (non-Oracle-engineered) machines, we started to experience node evictions and reboots. We immediately began searching for the root cause and took several steps to stabilize the cluster before applying RU 19.13 to our production sites.

* We chose a sample incident that reflects the problem. On Feb 22, at 13:35, the host machine (bltdb02) rebooted.

[oracle@bltdb02 ~]$ last reboot
reboot system boot 4.14.35-2047.506 Tue Feb 22 13:35 - 11:32 (1+21:56)
reboot system boot 4.14.35-2047.506 Sat Feb 19 19:05 - 11:32 (4+16:27)
...

* We looked at the lastgasp log files for details on why the node rebooted. The Grid cssdagent or cssdmonitor can record reboot advisories there.

* These files are plain ASCII text; nothing in them was worth mentioning except the timestamps, and the records were from another date. A quick way to inspect them is sketched below.
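* A minimal way to check them; /etc/oracle/lastgasp is the default reboot advisory location (the CRS-8017 message later in the CRS alert log confirms it on our node):
[root@bltdb02 ~]# ls -ltr /etc/oracle/lastgasp
[root@bltdb02 ~]# file /etc/oracle/lastgasp/*
[root@bltdb02 ~]# strings /etc/oracle/lastgasp/* | head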

* We searched /var/log/messages for anything relevant around the reboot time.

[root@bltdb02 ~]# vi /var/log/messages
...
Feb 22 13:25:15 bltdb02 systemd-logind: Removed session 1202.
Feb 22 13:25:57 bltdb02 systemd-logind: New session 1203 of user oracle.
Feb 22 13:25:57 bltdb02 systemd: Started Session 1203 of user oracle.
Feb 22 13:25:57 bltdb02 systemd-logind: Removed session 1203.
Feb 22 13:27:25 bltdb02 chronyd[2019]: Selected source 95.140.32.4
Feb 22 13:28:13 bltdb02 rtkit-daemon[1999]: The canary thread is apparently starving. Taking action.
Feb 22 13:28:13 bltdb02 rtkit-daemon[1999]: Demoting known real-time threads.
Feb 22 13:28:13 bltdb02 rtkit-daemon[1999]: Demoted 0 threads.
Feb 22 13:35:49 bltdb02 kernel: microcode: microcode updated early to revision 0x46, date = 2021-01-27
Feb 22 13:35:49 bltdb02 kernel: Linux version 4.14.35-2047.506.10.el7uek.x86_64 (mockbuild@host-100-100-224-6) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16.0.3) (GCC)) #2 SMP Thu Aug 12 17:12:22 PDT 2021
Feb 22 13:35:49 bltdb02 kernel: Command line: BOOT_IMAGE=/vmlinuz-4.14.35-2047.506.10.el7uek.x86_64 root=/dev/mapper/ol_bltdb02-root ro rd.lvm.lv=ol_bltdb02/root rd.lvm.lv=ol_bltdb02/swap rhgb quiet numa=off transparent_hugepage=never LANG=en_US.UTF-8
...

* Nothing useful here.

* We looked for errors, warnings, and ORA- errors in the RDBMS alert logs.
[oracle@bltdb02 trace]$ vi $ORACLE_BASE/diag/rdbms/clouddg/clouddg2/trace/alert_clouddg2.log
...
2022-02-22T13:28:11.385787+03:00
Warning: VKTM detected a forward time drift.
2022-02-22T13:28:12.528795+03:00
Please see the VKTM trace file for more details:
/u01/app/oracle/diag/rdbms/clouddg/clouddg2/trace/clouddg2_vktm_7250.trc
2022-02-22T13:40:27.076354+03:00
Starting ORACLE instance (normal) (OS id: 24477)
...
* The database detected a forward time drift and traced it. There were no further records in the database alert.log: the instance crashed and had no time to record anything useful. About 12 minutes passed between the last entry and the instance restart. A sketch for ruling out a real OS clock step follows.
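* A quick way to rule out a genuine OS clock step; chronyd is the time service according to /var/log/messages above (output not shown here):
[root@bltdb02 ~]# chronyc tracking
[root@bltdb02 ~]# chronyc sources -v
[root@bltdb02 ~]# timedatectl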

* VKTM is a highest-priority process; if one of these high-priority processes dies, the instance dies.

SQL> SELECT PI.KSPPINM PARAMETER_NAME, PSV.KSPPSTVL INSTANCE_VALUE
FROM X$KSPPI PI, X$KSPPCV PCV, X$KSPPSV PSV
WHERE PI.INDX = PCV.INDX
AND PI.INDX = PSV.INDX
AND ( PI.KSPPINM = '_high_priority_processes'
OR PI.KSPPINM = '_highest_priority_processes' );
PARAMETER_NAME INSTANCE_VALUE
-------------- --------------
_highest_priority_processes VKTM
_high_priority_processes LMS*|LM1*|LM2*|LM3*|LM4*|LM5*|LM6*|LM7*|LM8*|LM9*|LM*|LCK0|CKPT|DBRM|RMS0|LGWR|CR*|RS0*|RS1*|RS2*

* Let's examine the VKTM trace file.

[oracle@bltdb02 trace]$ vi /u01/app/oracle/diag/rdbms/clouddg/clouddg2/trace/clouddg2_vktm_7250.trc
...
*** 2022-02-22T13:28:12.530124+03:00
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time jumped forward by (29192208)usec at (238987772848) whereas (1000000) is allowed
*** 2022-02-22T13:28:52.029681+03:00
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time jumped forward by (8942061)usec at (239010720349) whereas (1000000) is allowed
*** 2022-02-22T13:30:35.618002+03:00
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time jumped forward by (106464068)usec at (239117184417) whereas (1000000) is allowed
...

* The VKTM process was alive and detected a drift. There are some bugs on support.oracle.com related to high VKTM CPU usage, but the CPU usage of the VKTM process was in the 1%-5% range on our system. That is a normal rate, so the root cause had to be something else. A quick way to sample it is sketched below.
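* A simple way to sample VKTM CPU usage at the OS level; ora_vktm_clouddg2 is the standard background process name for our instance, adjust it for yours:
[oracle@bltdb02 ~]$ pidstat -u -p $(pgrep -f ora_vktm_clouddg2) 5 3
[oracle@bltdb02 ~]$ top -b -n 1 -p $(pgrep -f ora_vktm_clouddg2)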

* There was nothing useful in the rebooted server's database alert.log.

* We examined the database alert.log of the healthy server in the cluster (not the rebooted one).

[oracle@bltdb01 trace]$ vi $ORACLE_BASE/diag/rdbms/clouddg/clouddg1/trace/alert_clouddg1.log
...
2022-02-22T13:22:57.223935+03:00
ARC4 (PID:19605): Archived Log entry 59912 added for T-1.S-40494 ID 0x79b8ac20 LAD:1
2022-02-22T13:29:06.766088+03:00
Warning: VKTM detected a forward time drift.
2022-02-22T13:29:07.895922+03:00
Time drifts can result in unexpected behavior such as time-outs.
Please see the VKTM trace file for more details:
/u01/app/oracle/diag/rdbms/clouddg/clouddg1/trace/clouddg1_vktm_18186.trc
2022-02-22T13:29:14.051676+03:00
rfs (PID:21834): Possible network disconnect with primary database
2022-02-22T13:29:16.020365+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 14 secs.
2022-02-22T13:29:16.020486+03:00
LGWR (ospid: 18286) is hung in an acceptable location (inwait 0x1.ffff).
2022-02-22T13:29:17.310015+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 98 secs.
2022-02-22T13:29:26.992888+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 108 secs.
GES: System Load is HIGH.
GES: Current load is 205.25 and high load threshold is 60.00
2022-02-22T13:30:37.076993+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 178 secs.
2022-02-22T13:30:47.036633+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 188 secs.
2022-02-22T13:30:57.116521+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 198 secs.
2022-02-22T13:31:07.045602+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 208 secs.
2022-02-22T13:31:17.124637+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 218 secs.
2022-02-22T13:31:27.068715+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 228 secs.
2022-02-22T13:31:37.012616+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 238 secs.
2022-02-22T13:31:47.092550+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 248 secs.
2022-02-22T13:31:57.028611+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 258 secs.
2022-02-22T13:32:07.110413+03:00
LGWR (ospid: 18286) waits for event 'reliable message' for 268 secs.
2022-02-22T13:32:07.818327+03:00
Increasing priority of 2 RS
Reconfiguration started (old inc 43, new inc 45)
List of instances (total 1) :
1
Dead instances (total 1) :
2
My inst 1
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
2022-02-22T13:32:08.011516+03:00
...

* Processes making cross-instance calls on the healthy node were waiting for a reply, but the evicted node did not respond in a timely fashion, so the reconfiguration started. The "GES: System Load is HIGH" lines also show that Global Enqueue Services was seeing system load far above its threshold (205.25 vs 60); a way to cross-check that load is sketched below.
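* A hedged way to cross-check system load for that window after the fact, assuming the sysstat collector was running (sar keeps one file per day; sa22 corresponds to Feb 22):
[root@bltdb01 ~]# sar -q -f /var/log/sa/sa22 -s 13:20:00 -e 13:40:00
[root@bltdb01 ~]# uptime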
* We also investigated the ASM instance alert logs. Nothing useful.
* At this point, we started to investigate the Grid Infrastructure (CRS) logs.
[oracle@bltdb02 ~]$ vi /u01/app/oracle/diag/crs/bltdb02/crs/trace/alert.log
...
2022-02-22 13:28:08.409 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10220' milliseconds. Additional diagnostics: process name: 'ocssd'.
2022-02-22 13:28:10.689 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10220' milliseconds. Additional diagnostics: process name: 'evmd'.
2022-02-22 13:28:10.690 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10140' milliseconds. Additional diagnostics: process name: 'octssd'.
2022-02-22 13:28:10.691 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10230' milliseconds. Additional diagnostics: process name: 'crsd'.
2022-02-22 13:28:46.083 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '11070' milliseconds. Additional diagnostics: process name: 'gipcd'.
2022-02-22 13:28:55.975 [ORAROOTAGENT(31749)]CRS-5818: Aborted command 'check' for resource 'ora.drivers.acfs'. Details at (:CRSAGF00113:) {0:12:2} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_orarootagent_root.trc.
2022-02-22 13:29:02.667 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '11020' milliseconds. Additional diagnostics: process name: 'ohasd'.
2022-02-22 13:29:13.572 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10610' milliseconds. Additional diagnostics: process name: 'evmd'.
2022-02-22 13:29:24.861 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10500' milliseconds. Additional diagnostics: process name: 'octssd'.
2022-02-22 13:29:37.440 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10330' milliseconds. Additional diagnostics: process name: 'crsd'.
2022-02-22 13:30:23.167 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:30:40.073 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:30:46.340 [ORAROOTAGENT(2123)]CRS-5818: Aborted command 'check' for resource 'ora.asmnet2.asmnetwork'. Details at (:CRSAGF00113:) {2:33488:492} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/crsd_orarootagent_root.trc.
2022-02-22 13:30:54.601 [ORAROOTAGENT(31749)]CRS-5818: Aborted command 'check' for resource 'ora.ctssd'. Details at (:CRSAGF00113:) {0:12:2} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_orarootagent_root.trc.
2022-02-22 13:30:54.943 [ORAROOTAGENT(2123)]CRS-5818: Aborted command 'check' for resource 'ora.asmnet1.asmnetwork'. Details at (:CRSAGF00113:) {2:33488:492} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/crsd_orarootagent_root.trc.
2022-02-22 13:31:08.813 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:31:10.393 [ORAROOTAGENT(31749)]CRS-5818: Aborted command 'check' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:12:2} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_orarootagent_root.trc.
2022-02-22 13:31:17.882 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '135990' milliseconds. Additional diagnostics: process name: 'gipcd'.
2022-02-22 13:31:34.429 [OCSSD(11975)]CRS-1608: This node was evicted by node 1, bltdb01; details at (:CSSNM00005:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ocssd.trc.
2022-02-22 13:31:36.556 [OCSSD(11975)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ocssd.trc
2022-02-22 13:31:39.514 [OCSSD(11975)]CRS-1652: Starting clean up of CRSD resources.
2022-02-22 13:31:37.929 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:31:47.967 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '134250' milliseconds. Additional diagnostics: process name: 'ohasd'.
2022-02-22 13:31:54.206 [CSSDAGENT(11954)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 14109 milliseconds; details are at (:CLSN00121:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_cssdagent_root.trc.
2022-02-22 13:31:53.576 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '131800' milliseconds. Additional diagnostics: process name: 'ocssd'.
2022-02-22 13:31:51.782 [ORAAGENT(6585)]CRS-5818: Aborted command 'check' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_oraagent_oracle.trc.
2022-02-22 13:31:56.356 [CSSDMONITOR(11917)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 14119 milliseconds; details are at (:CLSN00121:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_cssdmonitor_root.trc.
2022-02-22 13:31:56.359 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:31:57.971 [CSSDAGENT(11954)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 6999 milliseconds; details are at (:CLSN00121:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_cssdagent_root.trc.
2022-02-22 13:31:58.642 [CSSDMONITOR(11917)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 7019 milliseconds; details are at (:CLSN00121:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_cssdmonitor_root.trc.
2022-02-22 13:31:57.393 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '134250' milliseconds. Additional diagnostics: process name: 'evmd'.
2022-02-22 13:36:28.536 [OHASD(3926)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 3926 ---> Starting again.
2022-02-22 13:36:28.729 [OHASD(3926)]CRS-0714: Oracle Clusterware Release 19.0.0.0.0.
2022-02-22 13:36:28.838 [OHASD(3926)]CRS-2112: The OLR service started on node bltdb02.
2022-02-22 13:36:29.427 [OHASD(3926)]CRS-8017: location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2022-02-22 13:36:29.427 [OHASD(3926)]CRS-1301: Oracle High Availability Service started on node bltdb02.
2022-02-22 13:36:31.253 [ORAROOTAGENT(6324)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 6324
2022-02-22 13:36:31.580 [CSSDAGENT(6347)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 6347
2022-02-22 13:36:31.580 [CSSDMONITOR(6350)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 6350
2022-02-22 13:36:31.662 [ORAAGENT(6339)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 6339
...
* At 13:28:08, the Grid Interprocess Communication daemon had already been non-responsive for about 10 seconds; afterwards the other daemons also started to fail, and cross-instance calls could not get a reply. So the OS resources around 13:27:58 were what we needed to investigate. A quick way to pull that timeline out of the CRS alert log is sketched below.
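* The message numbers are the ones seen in the excerpt above:
[oracle@bltdb02 ~]$ grep -E "CRS-7508|CRS-7509|CRS-7510|CRS-1608|CRS-1661" \
      /u01/app/oracle/diag/crs/bltdb02/crs/trace/alert.log | head -40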
* We looked at details in the ocssd (Oracle Cluster Synchronization Services Daemon) trace file.
[oracle@bltdb02 ~]$ vi /u01/app/oracle/diag/crs/bltdb02/crs/trace/ocssd_605.trc
...
2022-02-22 13:31:21.778 : CSSD:2703701760: [ INFO] GR+DB_+ASM:10693-22179350,
2022-02-22 13:31:22.639 :GIPCHTHR:2700547840: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30010 loopCount 119 sendCount 33 recvCount 87 postCount 33 sendCmplCount 33 recvCmplCount 53
2022-02-22 13:31:23.586 : CSSD:2720790272: [ INFO] clssgmpHandleFenceReq: Received fence request from node bltdb01, number 1, for clientID 2:20983:1, request does not kill primary
2022-02-22 13:31:23.586 : CSSD:2720790272: [ INFO] clssgmpcSendLocalFenceReq: Sending a fence rquest to node bltdb02, number 2, for clientID 2:20983:1
2022-02-22 13:31:23.586 : CSSD:2703701760: [ INFO] clssgmfInitLocalFence: Member level fence for global groupmemberID 14:2:1:33, fence type 1
2022-02-22 13:31:23.586 : CSSD:2703701760: [ INFO] clssgmfInitLocalFence: Fencing member (0x7fda4c06b520) with memberID 14:2:1 at the request of clientID 1:402:1
2022-02-22 13:31:23.586 : CSSD:2703701760: clssgmFenceMember: global grock(DB+ASM), member(1), type(1), cookie(0x7fdaa1267b00)
2022-02-22 13:31:23.586 : CSSD:2703701760: [ INFO] clssgmFenceObj: fencing member (0x7fda4c06b520) memberID 14:2:1 grock DB+ASM fence type 1 flags 0x0
2022-02-22 13:31:23.586 : CSSD:2703701760: [ INFO] clssgmProcFenceDeath: Fencing target(0x7fda4c06b520), type(3)
2022-02-22 13:31:23.586 : CSSD:2703701760: clssgmProcessFenceMembership: member(0x7fda4c06b520), memberID 14:2:1
2022-02-22 13:31:23.586 : CSSD:2703701760: clssgmProcessFenceClient: Client clientID 2:20983:1 with pid(10676-22179343), proc(0x7fda543dd890), client(0x7fda544654d0)
2022-02-22 13:31:23.587 : CSSD:2703701760: clssscpidentry_construct: entry(0x7fda4c07e8a0) initialized(1), context(0x7fda4c06b520), name(0x7fda4c07e8c0/10676-22179343), pid(0x7fda4c0a0bb0) type(0)
...
2022-02-22 13:31:23.592 :GIPCXCPT:2706855680: gipcInternalDissociate: obj 0x7fda54976d50 [0000000001e29c43] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_bltdb02_)(GIPCID=5b964801-494e6ae3-11975))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_bltdb02_)(GIPCID=494e6ae3-5b964801-10881))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 10881, readyRef (nil), ready 1, wobj (nil), sendp (nil) status 0flags 0x2013061e, flags-2 0x0, usrFlags 0x10010 } not associated with any container, ret gipcretFail (1)
2022-02-22 13:31:23.592 :GIPCXCPT:2706855680: gipcDissociateF [clssgmDeadClient : clssgm1.c : 3160]: EXCEPTION[ ret gipcretFail (1) ] failed to dissociate obj 0x7fda54976d50 [0000000001e29c43] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_bltdb02_)(GIPCID=5b964801-494e6ae3-11975))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_bltdb02_)(GIPCID=494e6ae3-5b964801-10881))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 10881, readyRef (nil), ready 1, wobj (nil), sendp (nil) status 0flags 0x2013061e, flags-2 0x0, usrFlags 0x10010 }, flags 0x0
...
2022-02-22 13:31:36.055 : CSSD:939521792: [ ERROR] (:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node bltdb01, number 1, sync 540650595, stamp 42948914, fence inited 0
...
* It was time to view the gipcd (Grid Interprocess Communication Daemon) trace file.
* According to that trace, the gipc daemon could not open the OLR. The OLR is stored on the local disk, and I/O requests to the local disks were not being answered.
* We also looked for any core dumps from CRS. There weren't any. A sketch of the places we checked is below.
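* The paths below are common defaults and may differ on other setups:
[root@bltdb02 ~]# cat /proc/sys/kernel/core_pattern
[root@bltdb02 ~]# ls -ltr /var/spool/abrt 2>/dev/null
[root@bltdb02 ~]# find /u01/app/oracle/diag/crs -name "core*" -mtime -2 2>/dev/null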
* Up to this point, we had looked at the Linux system logs, the RDBMS and ASM alert logs, and the CRS alert and daemon logs. We suspected that some kind of resource exhaustion had taken place.
* We decided to look at the OSWatcher statistics (iostat, vmstat, pidstat, top, etc.) for the node eviction time period.
[oracle@bltdb02 archive]$ cd /u01/app/oracle/oracle.ahf/data/repository/suptools/bltdb02/oswbb/oracle/archive/oswiostat
[oracle@bltdb02 archive]$ gunzip bltdb02.kkk.tsk_iostat_22.02.22.1300.dat.gz
[oracle@bltdb02 archive]$ cd /u01/app/oracle/oracle.ahf/data/repository/suptools/bltdb02/oswbb/oracle/archive/oswpidstat
[oracle@bltdb02 archive]$ gunzip bltdb02.kkk.tsk_pidstat_22.02.22.1300.dat.gz
[oracle@bltdb02 archive]$ cd /u01/app/oracle/oracle.ahf/data/repository/suptools/bltdb02/oswbb/oracle/archive/oswpidstatd
[oracle@bltdb02 archive]$ gunzip bltdb02.kkk.tsk_pidstatd_22.02.22.1300.dat.gz
[oracle@bltdb02 archive]$ cd /u01/app/oracle/oracle.ahf/data/repository/suptools/bltdb02/oswbb/oracle/archive/oswvmstat
[oracle@bltdb02 archive]$ gunzip bltdb02.kkk.tsk_vmstat_22.02.22.1300.dat.gz
[oracle@bltdb02 archive]$ cd /u01/app/oracle/oracle.ahf/data/repository/suptools/bltdb02/oswbb/oracle/archive/oswtop
[oracle@bltdb02 archive]$ gunzip bltdb02.kkk.tsk_top_22.02.22.1300.dat.gz
[oracle@bltdb02 archive]$ cd /u01/app/oracle/oracle.ahf/data/repository/suptools/bltdb02/oswbb/oracle/archive/oswmpstat
[oracle@bltdb02 archive]$ gunzip bltdb02.kkk.tsk_mpstat_22.02.22.1300.dat.gz
[oracle@bltdb02 archive]$ cd /u01/app/oracle/oracle.ahf/data/repository/suptools/bltdb02/oswbb/oracle/archive/oswps
[oracle@bltdb02 archive]$ gunzip bltdb02.kkk.tsk_ps_22.02.22.1300.dat.gz
* There was a warning message in the oswps.log file as below:
        BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected
        kernel BUG at list_del corruption
        list_add corruption
        do_IRQ: stack overflow: kernel stack overflow (current):
        general protection fault
        unable to handle kernel double fault:
        RTNL: assertion failed seek!
        page_mapcount(page) went negative! madness at
        NETDEV WATCHDOG
        sysctl table check failed: nobody cared
        IRQ handler type mismatch
        Kernel panic - not syncing: Machine Check Exception: Machine check events logged
        divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
    
* That was just the argument string passed to the abrt binary, so we ignored it.

* In the oswiostat log file, there was an I/O spike on the local disk (dm-0) at 13:28:17. A quick way to scan the archive for such samples is sketched after the excerpt.
zzz ***Tue Feb 22 13:28:17 +03 2022
avg-cpu: %user %nice %system %iowait %steal %idle
13.06 0.00 5.39 64.53 0.00 17.02
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 7.00 0.00 912.00 6.00 62180.00 48.00 135.57 55.75 64.85 65.27 0.00 1.09 100.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
..
dm-0 0.00 0.00 940.00 6.00 61092.00 48.00 129.26 54.33 60.48 60.87 0.00 1.06 100.00
dm-1 0.00 0.00 1.00 0.00 4.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00
..
dm-22 0.00 0.00 1.00 1.00 16.00 16.00 32.00 0.00 0.50 1.00 0.00 0.50 0.10
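* A small awk sketch for scanning the uncompressed iostat archive for samples where dm-0 is near saturation; it assumes the column layout of the excerpt above, with %util in the last field:
[oracle@bltdb02 oswiostat]$ awk '/^zzz/ {ts=$0} /^dm-0 / && $NF+0 > 90 {print ts; print}' \
      bltdb02.kkk.tsk_iostat_22.02.22.1300.dat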

* GRID_HOME is installed under the "/" filesystem, and the local disk (dm-0) was starving at that time.
[oracle@bltdb02 ~]$ ls -ls /dev/mapper
...
0 lrwxrwxrwx 1 root root 7 Feb 22 13:35 ol_bltdb02-root -> ../dm-0
..
[oracle@bltdb02 ~]$ lsblk
...
sda 8:0 0 136.7G 0 disk
├─sda2 8:2 0 1G 0 part /boot
├─sda3 8:3 0 135.5G 0 part
│ ├─ol_bltdb02-swap 252:1 0 16G 0 lvm [SWAP]
│ └─ol_bltdb02-root 252:0 0 119.5G 0 lvm /
└─sda1 8:1 0 200M 0 part /boot/efi
...

* We also created a dashboard to see clearly what was happening on the server.
[oracle@bltdb02 ~]$ export ORACLE_HOME=/u01/app/19/grid
[oracle@bltdb02 ~]$ tfactl run oswbb
--> Enter D for dashboard

* We examined the OSWatcher graphs; below is one worth mentioning.
[OSWatcher graph: disk I/O activity]
* The average wait time should not exceed 20 ms on a healthy system; we saw a huge spike for dm-0.

* The OSWatcher analyzer report told us the same thing in other words.
# Section 1: System Status
...
Subsystem Status
------------------------
CPU OK
MEMORY OK
I/O WARNING
NET OK
...
# Section 3: System General Findings
...
WARNING: Disk high average wait queue time observed.
Advise: Check disks to see why average wait queue time is so high.
Check: I/O requests are taking too long to be serviced.
Check: The application. Are there more read/write requests to this disk than normal?
Check: Disk may be defective if service time is also high.
...

* This disk had not experienced that kind of issue before the 19.13 RU and behaved as expected on RU 19.7, so our "/" local disk was clearly having a hard time with 19.13. Oracle Grid and Database software uses the local disks mostly for writing audit and trace files, and trace files are mostly created in times of trouble; a way to see what was written during the incident window is sketched below.
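* A hedged way to count the trace files written to the local disk during the incident window (GNU find with -newermt; the diag path is the one seen in the logs above):
[oracle@bltdb02 ~]$ find /u01/app/oracle/diag -type f \( -name "*.trc" -o -name "*.trm" \) \
      -newermt "2022-02-22 13:25" ! -newermt "2022-02-22 13:40" 2>/dev/null | wc -l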

* We searched support.oracle.com for "grid clusterware threads and local disk io" and found a bug that could be related to our issue: Bug 33744602 "Exadata: Clusterware threads are having scheduling delays when io to the local disk is busy".

* We downloaded the one-off patch and applied it to GRID_HOME. After two days of monitoring, the same problem occurred again.

* We decided to analyze the OSWatcher statistics of all the incident times together and were lucky to find that, just before each I/O spike, a java process was consuming a huge amount of CPU in every sample (197%, 374% as below).
Tue Feb 22 13:24:52 EET 2022 28479 oracle 197.00 1/app/19/grid/jlib/jackson-databind.jar -DCV_DESTLOC=/tmp -DCV_HOME=/u01/app/19/grid -DRUNNING.MODE=cvuresource -DCV_LOG_LOCATION=/u01/app/oracle/crsdata/bltdb02/cvu/cvulog -DCV_PID_FILE=/u01/app/oracle/crsdata/bltdb02/cvu/init/cvu.pid -DCV_TRACELOC=/u01/app/oracle/crsdata/bltdb02/cvu/cvutrc -Xmx2048m oracle.ops.verification.client.CluvfyDriver comp healthcheck -mandatory -_format -_check_type partial
...
Tue Feb 22 13:27:22 EET 2022 8313 oracle 374.00 1/app/19/grid/OPatch/jlib/oracle.opatch.classpath.jar -DCV_HOME=/u01/app/19/grid -DGI_HOME=/u01/app/19/grid -DCV_TRACELOC=/u01/app/oracle/crsdata/bltdb02/cvu/cvutrc oracle.ops.verification.helper.helper19.CVUHelper19 -getOraclePatchList /u01/app/19/grid

* THE PROBLEM WAS CVU (THE CLUSTER VERIFICATION UTILITY).
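* To confirm the pattern across the samples, the OSWatcher ps archives can be searched for the CVU java class names seen above (the file naming follows the archives extracted earlier):
[oracle@bltdb02 oswps]$ zgrep -hE "CluvfyDriver|CVUHelper19" bltdb02.kkk.tsk_ps_22.02.22.*.dat.gz | head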
* We checked its parameters:
[oracle@bltdb01 trace]$ crsctl stat res ora.cvu -p
...
RUN_INTERVAL=21600
...
-- 21600 seconds equals 6 hours.

* The first occurrence of the issue came just 6 hours after the patch was applied, and the other incidents were spaced at multiples of 6 hours.

* It looks like, with version 19.13, the cvu resource uses too much CPU or has a bug. With that much CPU consumption, some processes get stuck in "D" state and others do not reply in a timely fashion. During this chaotic period, trace files also get written to the local disk and sudden I/O spikes are recorded. The healthy node then evicts the node on which the cvu resource is running, and cvu relocates to the healthy node.

* The purpose of this resource is to invoke clusterware health checks at regular intervals, but instead it became the reason the clusterware node evictions happened.

* We stopped and disabled it and will watch its behaviour in future RUs; a few checks to confirm its state are sketched after the commands.
[oracle@bltdb01 ~]$ /u01/app/19/grid/bin/srvctl stop cvu
[oracle@bltdb01 ~]$ /u01/app/19/grid/bin/srvctl disable cvu
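* A few standard checks to confirm the resource really stays down (output omitted here):
[oracle@bltdb01 ~]$ /u01/app/19/grid/bin/srvctl status cvu
[oracle@bltdb01 ~]$ /u01/app/19/grid/bin/srvctl config cvu
[oracle@bltdb01 ~]$ /u01/app/19/grid/bin/crsctl stat res ora.cvu -t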

Update in October 2023.

I hope this helps everyone experiencing node evictions and reboots.
