Announcement: This blog site has been relocated to https://dincosman.com and is currently in READ-ONLY mode.
Node Eviction after Applying Release Update 19.13
Be CAREFUL before applying a Release Update!
After applying Release Update 19.13 to our standby-site cluster, which consists of two physical (non-Oracle-engineered) machines, we started to experience node evictions and reboots. We immediately began searching for the root cause and took the steps below to stabilize the cluster before applying RU 19.13 to our production sites.
* We chose a sample that reflects the problem: on Feb 22 at 13:35, the host machine (bltdb02) rebooted.
* We looked at the lastgasp log files for details on why the node rebooted; the Grid cssdagent or cssdmonitor can record node reboots there.
* These files are plain ASCII text, and there was nothing worth mentioning except the timestamps: the records in them were related to another date.
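A quick way to inspect that area is sketched below. The path is the usual location on our OL7 nodes, but the loop is guarded so it degrades gracefully on a machine without clusterware installed:

```shell
# Hypothetical check of the lastgasp area; prints each record file,
# or a notice if none exist on this machine.
for f in /etc/oracle/lastgasp/*.lg; do
  if [ -e "$f" ]; then
    echo "== $f"
    cat "$f"
  else
    echo "no lastgasp files found"
  fi
done
```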
* We searched for anything related in /var/log/messages.
Feb 22 13:25:15 bltdb02 systemd-logind: Removed session 1202.
Feb 22 13:25:57 bltdb02 systemd-logind: New session 1203 of user oracle.
Feb 22 13:25:57 bltdb02 systemd: Started Session 1203 of user oracle.
Feb 22 13:25:57 bltdb02 systemd-logind: Removed session 1203.
Feb 22 13:27:25 bltdb02 chronyd[2019]: Selected source 95.140.32.4
Feb 22 13:28:13 bltdb02 rtkit-daemon[1999]: The canary thread is apparently starving. Taking action.
Feb 22 13:28:13 bltdb02 rtkit-daemon[1999]: Demoting known real-time threads.
Feb 22 13:28:13 bltdb02 rtkit-daemon[1999]: Demoted 0 threads.
Feb 22 13:35:49 bltdb02 kernel: microcode: microcode updated early to revision 0x46, date = 2021-01-27
Feb 22 13:35:49 bltdb02 kernel: Linux version 4.14.35-2047.506.10.el7uek.x86_64 (mockbuild@host-100-100-224-6) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16.0.3) (GCC)) #2 SMP Thu Aug 12 17:12:22 PDT 2021
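Note the silence between the last pre-crash message (13:28:13) and the boot banner (13:35:49): the machine logged nothing for several minutes before coming back up, which points to a hard hang rather than a clean shutdown. The gap can be measured with GNU date:

```shell
# Measure the gap between the last pre-crash message and the boot banner,
# using the timestamps from the /var/log/messages excerpt.
t1=$(date -d "2022-02-22 13:28:13" +%s)
t2=$(date -d "2022-02-22 13:35:49" +%s)
echo "$((t2 - t1)) seconds with no log output before the reboot"
```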
* We looked for errors, warnings, and ORA- errors in the RDBMS alert.logs.
* The database detected a forward time drift and traced it; there were no further records in the db alert.log. The instance crashed without time to record anything useful. Twelve minutes passed between the last trace and the database restart.
* VKTM is a high-priority process; when one of the high-priority background processes dies, the instance dies with it.
* The VKTM process was alive and had detected a drift. There are bugs on support.oracle.com related to high VKTM CPU usage, but the CPU usage of the VKTM process was in the 1%-5% range on our system. That is a normal rate, so there had to be another root cause.
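A sanity check like ours can be sketched as below; the ps output line is mocked so the snippet runs anywhere, and on a live node you would pipe `ps -eo pcpu,comm --no-headers` into the awk filter instead:

```shell
# Filter process CPU usage for the VKTM background process
# (mocked sample lines; process names are illustrative).
printf '1.3 ora_vktm_prod1\n97.0 java\n0.2 ora_pmon_prod1\n' |
awk '/vktm/ { print "VKTM cpu: " $1 "%" }'
```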
* There was nothing useful in the rebooted server's database alert.logs.
* We examined the database alert.logs of the healthy (non-rebooted) server in the cluster.
* The processes making cross-instance calls on the healthy node were waiting for a reply, but the evicted node did not reply in a timely fashion and the "Reconfiguration" started. The line "GES: System Load is HIGH" also points to a problem with the Global Enqueue Services.
* We also investigated the logs of ASM instances. Nothing useful.
* At this point, we started to investigate the grid/CRS logs.
[oracle@bltdb02 ~]$ vi /u01/app/oracle/diag/crs/bltdb02/crs/trace/alert.log
...
2022-02-22 13:28:08.409 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10220' milliseconds. Additional diagnostics: process name: 'ocssd'.
2022-02-22 13:28:10.689 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10220' milliseconds. Additional diagnostics: process name: 'evmd'.
2022-02-22 13:28:10.690 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10140' milliseconds. Additional diagnostics: process name: 'octssd'.
2022-02-22 13:28:10.691 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10230' milliseconds. Additional diagnostics: process name: 'crsd'.
2022-02-22 13:28:46.083 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '11070' milliseconds. Additional diagnostics: process name: 'gipcd'.
2022-02-22 13:28:55.975 [ORAROOTAGENT(31749)]CRS-5818: Aborted command 'check' for resource 'ora.drivers.acfs'. Details at (:CRSAGF00113:) {0:12:2} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_orarootagent_root.trc.
2022-02-22 13:29:02.667 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '11020' milliseconds. Additional diagnostics: process name: 'ohasd'.
2022-02-22 13:29:13.572 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10610' milliseconds. Additional diagnostics: process name: 'evmd'.
2022-02-22 13:29:24.861 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10500' milliseconds. Additional diagnostics: process name: 'octssd'.
2022-02-22 13:29:37.440 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '10330' milliseconds. Additional diagnostics: process name: 'crsd'.
2022-02-22 13:30:23.167 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:30:40.073 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:30:46.340 [ORAROOTAGENT(2123)]CRS-5818: Aborted command 'check' for resource 'ora.asmnet2.asmnetwork'. Details at (:CRSAGF00113:) {2:33488:492} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/crsd_orarootagent_root.trc.
2022-02-22 13:30:54.601 [ORAROOTAGENT(31749)]CRS-5818: Aborted command 'check' for resource 'ora.ctssd'. Details at (:CRSAGF00113:) {0:12:2} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_orarootagent_root.trc.
2022-02-22 13:30:54.943 [ORAROOTAGENT(2123)]CRS-5818: Aborted command 'check' for resource 'ora.asmnet1.asmnetwork'. Details at (:CRSAGF00113:) {2:33488:492} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/crsd_orarootagent_root.trc.
2022-02-22 13:31:08.813 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:31:10.393 [ORAROOTAGENT(31749)]CRS-5818: Aborted command 'check' for resource 'ora.cluster_interconnect.haip'. Details at (:CRSAGF00113:) {0:12:2} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_orarootagent_root.trc.
2022-02-22 13:31:17.882 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '135990' milliseconds. Additional diagnostics: process name: 'gipcd'.
2022-02-22 13:31:34.429 [OCSSD(11975)]CRS-1608: This node was evicted by node 1, bltdb01; details at (:CSSNM00005:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ocssd.trc.
2022-02-22 13:31:36.556 [OCSSD(11975)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ocssd.trc
2022-02-22 13:31:39.514 [OCSSD(11975)]CRS-1652: Starting clean up of CRSD resources.
2022-02-22 13:31:37.929 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:31:47.967 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '134250' milliseconds. Additional diagnostics: process name: 'ohasd'.
2022-02-22 13:31:54.206 [CSSDAGENT(11954)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 14109 milliseconds; details are at (:CLSN00121:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_cssdagent_root.trc.
2022-02-22 13:31:53.576 [GIPCD(6935)]CRS-7509: The 'worker thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '131800' milliseconds. Additional diagnostics: process name: 'ocssd'.
2022-02-22 13:31:51.782 [ORAAGENT(6585)]CRS-5818: Aborted command 'check' for resource 'ora.gpnpd'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_oraagent_oracle.trc.
2022-02-22 13:31:56.356 [CSSDMONITOR(11917)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 14119 milliseconds; details are at (:CLSN00121:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_cssdmonitor_root.trc.
2022-02-22 13:31:56.359 [GIPCD(6935)]CRS-7508: The Oracle Grid Interprocess Communication (GIPC) high availability 'worker thread' is not scheduled correctly on 'gipcd' process.
2022-02-22 13:31:57.971 [CSSDAGENT(11954)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 6999 milliseconds; details are at (:CLSN00121:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_cssdagent_root.trc.
2022-02-22 13:31:58.642 [CSSDMONITOR(11917)]CRS-1661: The CSS daemon is not responding. If this persists, a reboot will occur in 7019 milliseconds; details are at (:CLSN00121:) in /u01/app/oracle/diag/crs/bltdb02/crs/trace/ohasd_cssdmonitor_root.trc.
2022-02-22 13:31:57.393 [GIPCD(6935)]CRS-7510: The 'daemon thread' of Oracle Grid Interprocess communication (GIPC) has been non-responsive for '134250' milliseconds. Additional diagnostics: process name: 'evmd'.
2022-02-22 13:36:28.536 [OHASD(3926)]CRS-8500: Oracle Clusterware OHASD process is starting with operating system process ID 3926 ---> Starting again.
* At 13:28:08, the Grid inter-process communication daemon had already been non-responsive for about 10 seconds, so the hang began around 13:27:58. Afterwards, the other daemons started to fail as well, and cross-instance calls could not get a reply. We therefore needed to investigate the OS resources as of 13:27:58.
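The 13:27:58 onset comes from simple arithmetic on the CRS alert.log entry: the 13:28:08.409 message minus the reported 10220 ms of non-responsiveness. With GNU date:

```shell
# Back-compute when the GIPC hang began: message timestamp minus the
# reported non-responsive interval (values from the CRS alert.log).
msg_ms=$(date -d "2022-02-22 13:28:08.409" +%s%3N)   # epoch milliseconds
onset=$(( (msg_ms - 10220) / 1000 ))
date -d "@$onset" +%H:%M:%S
```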
* We looked at details in the ocssd (Oracle Cluster Synchronization Services Daemon) trace file.
2022-02-22 13:31:22.639 :GIPCHTHR:2700547840: gipchaWorkerWork: workerThread heart beat, time interval since last heartBeat 30010 loopCount 119 sendCount 33 recvCount 87 postCount 33 sendCmplCount 33 recvCmplCount 53
2022-02-22 13:31:23.586 : CSSD:2720790272: [ INFO] clssgmpHandleFenceReq: Received fence request from node bltdb01, number 1, for clientID 2:20983:1, request does not kill primary
2022-02-22 13:31:23.586 : CSSD:2720790272: [ INFO] clssgmpcSendLocalFenceReq: Sending a fence rquest to node bltdb02, number 2, for clientID 2:20983:1
2022-02-22 13:31:23.586 : CSSD:2703701760: [ INFO] clssgmfInitLocalFence: Member level fence for global groupmemberID 14:2:1:33, fence type 1
2022-02-22 13:31:23.586 : CSSD:2703701760: [ INFO] clssgmfInitLocalFence: Fencing member (0x7fda4c06b520) with memberID 14:2:1 at the request of clientID 1:402:1
2022-02-22 13:31:23.586 : CSSD:2703701760: clssgmFenceMember: global grock(DB+ASM), member(1), type(1), cookie(0x7fdaa1267b00)
2022-02-22 13:31:23.586 : CSSD:2703701760: [ INFO] clssgmFenceObj: fencing member (0x7fda4c06b520) memberID 14:2:1 grock DB+ASM fence type 1 flags 0x0
* It was time to view the gipcd (Grid Interprocess Communication Daemon) trace file.
* The gipcd trace showed that the gipc daemon could not open the OLR. The OLR is stored on local disk, and I/O requests to the local disks were not being answered.
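The OLR location itself is registered in /etc/oracle/olr.loc. The sketch below mocks that file under /tmp with a hypothetical grid-home path so it runs anywhere; on a real node you would point sed at the real file:

```shell
# Extract the OLR file path from an olr.loc-style file
# (mocked content; the grid-home path is illustrative).
printf 'olrconfig_loc=/u01/app/19c/grid/cdata/bltdb02.olr\ncrs_home=/u01/app/19c/grid\n' > /tmp/olr.loc.demo
sed -n 's/^olrconfig_loc=//p' /tmp/olr.loc.demo
```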
* We also looked for any core dumps from CRS. There weren't any.
* Up to this point, we had looked at the Linux system logs, the RDBMS and ASM alert.logs, and the CRS alert and daemon logs. We suspected resource exhaustion.
* We decided to look at the OSWatcher stats (iostat, vmstat, pidstat, top, etc.) for the node-eviction time period.
* There was a warning message in the oswps.log file as below:
BUG: WARNING: at WARNING: CPU: INFO: possible recursive locking detected
kernel BUG at list_del corruption
list_add corruption
do_IRQ: stack overflow: kernel stack overflow (current):
general protection fault
unable to handle kernel double fault:
RTNL: assertion failed seek!
page_mapcount(page) went negative! madness at
NETDEV WATCHDOG
sysctl table check failed: nobody cared
IRQ handler type mismatch
Kernel panic - not syncing: Machine Check Exception: Machine check events logged
divide error: bounds: coprocessor segment overrun: invalid TSS: segment not present: invalid opcode: alignment check: stack segment: fpu exception: simd exception: iret exception: /var/log/messages -- /usr/bin/abrt-dump-oops -xtD
* That was just an argument string passed to the abrt binary, so we ignored it.
* In the oswiostat log file, there was an I/O spike on the local disk (dm-0) at 13:28:17.
* GRID_HOME is installed under the "/" filesystem, and the local disk (dm-0) was starving at that time.
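Scanning the archived iostat samples for await spikes can be done with a one-line awk filter. The sample line below is mocked in the shape of `iostat -x` output (await is column 10); against a real OSWatcher archive you would feed the log file instead of printf:

```shell
# Flag dm-0 samples whose await exceeds the 20 ms healthy threshold
# (mocked iostat -x line; column 10 is await in milliseconds).
printf 'dm-0 0.00 0.00 12.00 340.00 96.00 2720.00 16.00 9.80 412.50 0.60 98.70\n' |
awk '$1 == "dm-0" && $10 > 20 { print $1 " await=" $10 "ms (threshold 20ms)" }'
```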
* We also created a dashboard to see clearly what was happening on the server.
* We examined the OSWatcher graphs; one worth mentioning is below.
* Average wait time should not exceed 20 ms on a healthy system, yet we saw a huge spike for dm-0.
* The OSWatcher analysis document told us the same thing in other words.
* This disk was not experiencing this kind of issue before RU 19.13 and worked as expected on RU 19.7. It looks like our "/" local disk has a hard time with RU 19.13. Oracle Grid and Database software uses the local disks mostly for writing audit and trace files, and trace files are mostly created in times of trouble.
* We downloaded the one-off patch and applied it to GRID_HOME. After two days of monitoring, the same problem occurred again.
* We decided to analyze the OSWatcher statistics of all the incident times together, and were lucky to find that, just before each I/O spike, a java process was consuming a huge amount of CPU in every sample (197%, 374%, and similar, as below).
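Spotting that pattern across OSWatcher top archives reduces to filtering for high-%CPU java lines. The top line below is mocked (in `top -b` batch output, field 9 is %CPU and field 12 the command); on a real archive you would grep the oswtop files and feed them to awk:

```shell
# Flag java processes using more than one full core
# (mocked `top -b` process line; fields: pid user pr ni virt res shr S %cpu %mem time command).
printf '31295 oracle 20 0 9.8g 1.1g 20m S 374.0 1.6 3:05.2 java\n' |
awk '$9 > 100 && $12 == "java" { print "pid " $1 " java at " $9 "% cpu" }'
```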
* THE PROBLEM WAS CVU (CLUSTER VERIFICATION UTILITY).
* We checked its parameters:
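The scheduling interval of the cvu resource can be read from its resource profile. The sketch below parses a mocked `crsctl status resource ora.cvu -p`-style dump (the attribute names are real; 21600 seconds is the 6-hour cadence):

```shell
# Convert the cvu resource's RUN_INTERVAL (seconds) to hours
# (mocked crsctl profile dump; on a node, pipe the real crsctl output).
printf 'NAME=ora.cvu\nCHECK_INTERVAL=60\nRUN_INTERVAL=21600\n' |
awk -F= '$1 == "RUN_INTERVAL" { print $2 / 3600 " hours between cvu runs" }'
```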
* The first occurrence of the issue was just 6 hours after the patch was applied, and the other incident intervals were multiples of 6 hours.
* It looks like with version 19.13 the cvu resource uses too much CPU or has a bug. Under that CPU pressure, some processes get stuck in "D state" and others do not reply in a timely fashion. In this chaotic period, trace files also get written to the local disk and sudden I/O spikes are recorded. Afterwards, the healthy node evicts the node on which the cvu resource is running, and cvu relocates to the healthy node.
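"D state" means uninterruptible sleep, usually a process blocked on I/O. Spotting such processes is a one-liner; the ps output is mocked below so the snippet runs anywhere, and on a live node you would pipe `ps -eo state,pid,comm --no-headers` into the filter:

```shell
# List processes stuck in uninterruptible sleep (state D)
# (mocked ps lines; process names are illustrative).
printf 'S 1 systemd\nD 6935 gipcd.bin\nR 11975 ocssd.bin\n' |
awk '$1 == "D" { print $2, $3 }'
```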
* The purpose of this resource is to invoke clusterware health checks at regular intervals, but instead it became the reason for the clusterware node evictions.
* We stopped and disabled it, and will watch its behaviour in future RUs.
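Stopping and disabling the resource is done with srvctl, run as the grid owner. The sketch is guarded so it is a harmless no-op on a machine without clusterware:

```shell
# Stop and disable the cvu resource, then confirm its state;
# skipped gracefully where srvctl is not installed.
if command -v srvctl >/dev/null 2>&1; then
  srvctl stop cvu
  srvctl disable cvu
  srvctl status cvu
else
  echo "srvctl not found - run this on a cluster node"
fi
```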