IBM Support

Oracle RAC process reboots server with no warning

Technote (troubleshooting)


Problem (Abstract)

Oracle RAC process reboots server with no warning

Symptom

AIX server shuts down and/or reboots.

A REBOOT_ID is logged in /var/adm/ras/errlog indicating "SYSTEM SHUTDOWN BY USER" although no shutdown or reboot command was issued by any user.

Example error message:

LABEL: REBOOT_ID
IDENTIFIER: 2BFA76F6

Date/Time: Wed Dec 3 08:19:09 2008
Sequence Number: 1447
Machine Id: 0000ABCD1234
Node Id: nodeA
Class: S
Type: TEMP
Resource Name: SYSPROC

Description
SYSTEM SHUTDOWN BY USER

Probable Causes
SYSTEM SHUTDOWN

Detail Data
USER ID
0
0=SOFT IPL 1=HALT 2=TIME REBOOT
0
TIME TO REBOOT (FOR TIMED REBOOT ONLY)
0
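As a rough aid for reading the Detail Data above, the following sketch decodes the shutdown type code that follows the "0=SOFT IPL 1=HALT 2=TIME REBOOT" line. The function name decode_reboot_type is hypothetical, and the sketch assumes the detailed report has the same layout as the example shown here.

```shell
# Hypothetical helper: reads a detailed error report on stdin and prints
# the shutdown type found after the "0=SOFT IPL 1=HALT 2=TIME REBOOT" line.
# Assumes the layout matches the example entry in this technote.
decode_reboot_type() {
    awk '
        /0=SOFT IPL 1=HALT 2=TIME REBOOT/ { want = 1; next }
        want {
            code = $1
            if (code == 0)      print "SOFT IPL"
            else if (code == 1) print "HALT"
            else if (code == 2) print "TIME REBOOT"
            else                print "UNKNOWN (" code ")"
            want = 0
        }
    '
}

# Possible usage on AIX, selecting entries by the identifier shown above:
#   errpt -a -j 2BFA76F6 | decode_reboot_type
```

In the example entry the code is 0, so the helper would report a soft IPL.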


Cause

Oracle Real Application Clusters (RAC) is known to reboot the operating system with no warning, due to the configuration of the oprocd daemon.

Environment

AIX with Oracle RAC

Diagnosing the problem

Oracle Real Application Clusters (RAC) runs processes that can, under certain circumstances, reboot the server without any warning to users. Servers that experience node evictions because critical processes cannot be scheduled in a timely fashion may be rebooted by the Oracle RAC processes.



Oracle 10g and Oracle 11gR1

Oracle 10g and 11gR1 run a process called oprocd. The idea behind oprocd is straightforward: its goal is to provide I/O fencing. oprocd sets a timer and then sleeps. If, when it wakes up and is scheduled onto a CPU, it finds that more time has passed than the acceptable margin allows, oprocd reboots the node.

You can check for the oprocd process with the ps command:

# ps -ef | grep oprocd
root 221672 1 0 08:27:44 - 0:00 /u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 500 -f

The options passed to oprocd here are -t 1000 (wake up every 1000 ms) and -m 500 (allow up to 500 ms margin of error on the wake-up time before rebooting). In other words, if oprocd wakes up more than 1.5 seconds after it went to sleep, it forces a reboot.
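The arithmetic behind the 1.5 second figure can be sketched as follows: the reboot threshold is the -t interval plus the -m margin. The function name oprocd_threshold_ms is hypothetical; it only parses a command line string such as the one printed by ps and does not query oprocd itself.

```shell
# Hypothetical helper: given an oprocd command line (as printed by ps),
# print the wake-up interval (-t) plus the margin (-m) in milliseconds.
oprocd_threshold_ms() {
    t="" m="" prev=""
    # Walk the words of the command line; a value follows each flag.
    for word in $1; do
        case "$prev" in
            -t) t=$word ;;
            -m) m=$word ;;
        esac
        prev=$word
    done
    # Both values must be present and purely numeric.
    [ -n "$t" ] && [ -n "$m" ] || return 1
    case "$t$m" in *[!0-9]*) return 1 ;; esac
    echo "$((t + m))"
}

oprocd_threshold_ms '/u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 500 -f'   # prints 1500
```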


Oracle 11gR2

The Oracle Cluster Synchronization Service Daemon (OCSSD) performs some of the clusterware functions on AIX and other operating systems. Oracle 11gR2 has the following OCSSD-related processes, which can reboot the node:

/apps/crs/11.2.0/bin/ohasd.bin reboot
/apps/crs/11.2.0/bin/crsd.bin reboot
/apps/crs/11.2.0/bin/octssd.bin reboot

/apps/crs/11.2.0/bin/ocssd
/apps/crs/11.2.0/bin/cssdagent
/apps/crs/11.2.0/bin/cssdmonitor

/apps/crs/11.2.0/bin/ocssd.bin
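To check which of the processes listed above are running, a process listing can be filtered by name, in the same way as the oprocd check for 10g. The sketch below is only illustrative: the function name list_css_processes is hypothetical, and it matches on the binary names from the list above regardless of the installation path.

```shell
# Hypothetical helper: reads process listing lines on stdin and keeps
# only those whose command is one of the 11gR2 binaries listed above.
list_css_processes() {
    grep -E '/(ohasd\.bin|crsd\.bin|octssd\.bin|ocssd|ocssd\.bin|cssdagent|cssdmonitor)( |$)'
}

# Possible usage:
#   ps -ef | list_css_processes
```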

Resolving the problem

Oracle 10g / Oracle 11gR1:


oprocd

*** IBM recommends the customer contact Oracle Support before making any modifications ***

The timeout and margin times are computed from the diagwait and reboot time settings. Changing them through the init.cssd file is not recommended; use the command 'crsctl set css diagwait <secs>' instead.

A formula is used to calculate the times. For example, with a reboot time of 3 and a diagwait setting of 13, oprocd runs with -t 1000 -m 10000.

# crsctl set css diagwait 13 -force

# ps -ef | grep oprocd
root 221672 1 0 08:27:44 - 0:00 /u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 10000 -f

The margin has changed to 10000 ms, that is, 10 seconds in place of the default 0.5 seconds. This 20-fold increase gives oprocd more time to determine whether the node needs to be rebooted.
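This technote does not spell out the formula. The sketch below assumes the margin is (diagwait - reboot time) * 1000 ms, which is consistent with the example above (diagwait 13, reboot time 3, margin 10000) but should be confirmed with Oracle Support. The function name oprocd_margin_ms is hypothetical.

```shell
# Hypothetical helper illustrating the example above.
# ASSUMPTION: margin_ms = (diagwait - reboottime) * 1000. This matches the
# example in this technote but is not stated in it.
oprocd_margin_ms() {
    diagwait=$1
    reboottime=$2
    # Both arguments must be present and purely numeric.
    [ -n "$diagwait" ] && [ -n "$reboottime" ] || return 1
    case "$diagwait$reboottime" in *[!0-9]*) return 1 ;; esac
    # diagwait must exceed the reboot time for the margin to be positive.
    [ "$diagwait" -gt "$reboottime" ] || return 1
    echo "$(( (diagwait - reboottime) * 1000 ))"
}

oprocd_margin_ms 13 3   # prints 10000
```

Dividing that result by the default margin of 500 ms gives the 20-fold increase.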

IBM and Oracle agreed that a diagwait value of 13 is suitable, provided that best practices are followed.


IBM recommends that customers follow best practices and, if possible, update to AIX 6.1 or AIX 7.1 with current Technology Levels, which include the new non-pageable kernel, as the preferred corrective action.

The Oracle master document can be found here:

http://www.oracle.com/technetwork/database/clusterware/overview/rac-aix-system-stability-131022.pdf



Oracle 11gR2:

ocssd, cssdagent, cssdmonitor, ohasd.bin, crsd.bin, octssd.bin

*** IBM recommends the customer contact Oracle Support before making any modifications ***

Oracle documentation describes the cssdagent process, which is related to oprocd, as follows:


The cssdagent process monitors the cluster and provides I/O fencing. This service formerly was provided by Oracle Process Monitor Daemon (oprocd), also known as OraFenceService on Windows. A cssdagent failure may result in Oracle Clusterware restarting the node.

root 11010182 1 0 18:43:40 - 0:05 /GDICMP/oracle/cloud/product/11.2/bin/cssdagent


ADDENDUM:

The following document was co-authored by Oracle and IBM engineers and discusses the above processes and contains recommended configurations...


Cross reference information
Segment: Operating Systems
Product: AIX family
Platform: AIX
Version: 5.2, 5.3, 6.1

Document information

More support for: AIX family

Software version: Version Independent

Operating system(s): AIX

Reference #: T1011228

Modified date: 19 October 2011

