
Intelligent Management: troubleshooting health management

When health management is not working, or is not working the way that you expect, check for the following problems.

Finding the correct logs

The health controller is a distributed resource that is managed by the high availability (HA) manager. It exists within all node agent and deployment manager processes and is active within one of these processes. If a process fails, the controller becomes active on another node agent or deployment manager process.

To determine where the health controller is running, click Runtime operations > Component stability > Core components in the administrative console. The location and stability status of the health controller are displayed.
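
If you prefer scripting, you can also look for the controller from wsadmin. The following Jython sketch queries the cell for a health controller MBean; the MBean type name HealthController is an assumption, so confirm the actual type string in your cell before relying on it.
# Jython sketch for wsadmin: locate the process that currently hosts the health controller.
# The MBean type name 'HealthController' is an assumption; verify it in your cell.
controllers = AdminControl.queryNames('type=HealthController,*')
if controllers:
    for objectName in controllers.splitlines():
        # The node= and process= keys in the object name identify the node agent or
        # deployment manager in which the controller is currently active.
        print 'Health controller MBean: %s' % objectName
else:
    print 'No health controller MBean found; check Core components in the console.'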

The performance advisor is enabled with the predefined memory leak health policy

The predefined memory leak health policy uses the performance advisor function, so the performance advisor is enabled when this policy has assigned members. To disable the performance advisor, remove this health policy or narrow the membership of the health policy. To preserve the health policy for future use, keep the memory leak policy, but remove all members. To change the members, click Operational policies > Health policies > memory_leak_policy. You can edit the health policy memberships by adding and removing specific members.
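
You can also review the policy and its memberships from wsadmin. The following Jython sketch is a minimal example; the configuration type name HealthPolicy is an assumption, so confirm the exact type with AdminConfig.types() in your installation.
# Jython sketch for wsadmin: list the health policy configuration objects and dump
# their attributes, including members. 'HealthPolicy' is an assumed type name;
# run AdminConfig.types() to confirm the exact name in your installation.
policies = AdminConfig.list('HealthPolicy')
for policyId in policies.splitlines():
    print AdminConfig.showall(policyId)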

Health controller settings

The following list contains issues that are encountered as a result of the health controller settings:
Health controller is disabled
To verify the setting in the administrative console, click Operational policies > Autonomic managers > Health controller and check the setting on both the Configuration and Runtime tabs. The health controller is enabled by default.
Restarts are prohibited at this time
To verify the prohibited restart times in the administrative console, click Operational policies > Autonomic managers > Health controller and select the Prohibited restart field. By default, no time values are prohibited.
Restarting too soon after the previous restart
To check or change the minimum restart interval in the administrative console, click Operational policies > Autonomic managers > Health controller and review the Minimum restart interval field. No minimum interval is defined by default.
Control cycle is too long
To check the control cycle length in the administrative console, click Operational policies > Autonomic managers > Health controller and adjust the value if necessary. The health controller checks for policy violations periodically. If its control cycle length is too long, it might not restart servers quickly enough.
The server was restarted X times consecutively, and the health condition continues to be violated
In this case, X is the value of the Maximum consecutive restarts setting of the health controller. The health controller concludes that restarts are not fixing the problem, and disables restarts for the server. The following message is written to the log:
WXDH0011W: Server servername exceeded maximum verification failures: disabling restarts.
The health controller continues to monitor the server and writes the following message to the log if the health policy is still violated:
WXDH0012W: Server servername with restarts disabled failed health check.
You can enable restarts for the server by performing any of the following actions:
  • Disable and then enable the health controller.
  • Adjust the Maximum consecutive restarts controller setting.
  • Run the following command from the prompt:
    wsadmin -profile HmmControllerProcs.jacl enableServer servername
    This script is available in the <app_server_root>\bin directory on the node agent or deployment manager nodes. This script requires a running deployment manager.
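
You can also review the controller settings that are described in this section from wsadmin. The following Jython sketch is a minimal example; the configuration type name HealthController and the attribute names in the comments are assumptions, so confirm them with AdminConfig.types() and AdminConfig.showall() before changing anything.
# Jython sketch for wsadmin: inspect the health controller configuration.
# 'HealthController' and the attribute names below are assumptions; confirm them
# with AdminConfig.types() and AdminConfig.showall() in your environment.
controllerId = AdminConfig.list('HealthController')
if controllerId:
    print AdminConfig.showall(controllerId)
    # Assumed attribute names for the settings that are discussed in this section:
    #   enable, controlCycleLength, minRestartInterval, maxConsecutiveRestarts,
    #   prohibitedRestartTimes
    # Example change (assumed attribute name), followed by a save:
    # AdminConfig.modify(controllerId, [['maxConsecutiveRestarts', '5']])
    # AdminConfig.save()
else:
    print 'No configuration object found for the assumed type name HealthController.'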

Health policy settings

The following issues are encountered as a result of the health policy settings:
The server is not part of a health policy
To verify that the health policy memberships apply to your server in the administrative console, click Operational policies > Health policies.
The reaction mode of a policy that contains the server is supervised
In the administrative console, click System administration > Task management > Runtime tasks and look for approval requests for a restart action for a policy in supervised mode. Servers are restarted automatically only when you set the reaction mode to Automatic. The following message is written to the log for the supervised condition:
WXDH0024I: Server server name has violated the health policy health condition, 
reaction mode is supervised.
The server is a member of a static cluster, and it is the only cluster member that is running
A health policy does not stop all members of a cluster at the same time. If a cluster has only one member, or only one member is running, that member is not restarted.
The server is a member of a dynamic cluster. The number of running instances does not exceed the minimum value, and the placement controller is disabled
In this case, health management treats the dynamic cluster like a static cluster and honors the minimum number of instances setting. To check the minimum number of instances that is required for the dynamic cluster, click Servers > Clusters > Dynamic clusters in the administrative console.
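If you prefer scripting, the following Jython sketch for wsadmin lists the dynamic cluster configuration objects so that you can look for the minimum number of instances setting; the type name DynamicCluster is an assumption, so confirm it with AdminConfig.types().
# Jython sketch for wsadmin: dump the dynamic cluster configuration and look for
# the minimum number of instances setting. 'DynamicCluster' is an assumed type
# name; run AdminConfig.types() to confirm it.
clusters = AdminConfig.list('DynamicCluster')
for clusterId in clusters.splitlines():
    print AdminConfig.showall(clusterId)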
The health controller has not received the policy
The health controller might not be running on the deployment manager, where the health policies are created. If the deployment manager is restarted after the health controller started, the health controller might not have received the new policy.
To resolve this problem, perform the following steps:
  1. Disable the health controller. In the administrative console, click Operational policies > Autonomic managers > Health controller.
  2. Synchronize the configuration repositories with the back-end nodes. In the administrative console, click System administration > Nodes. Select the nodes to synchronize, and click Synchronize.
  3. Restart the health controller. In the administrative console, click Operational policies > Autonomic managers > Health controller.
  4. Synchronize the configuration repositories with the back-end nodes. In the administrative console, click System administration > Nodes. Select the nodes to synchronize, and click Synchronize.
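Steps 2 and 4 can also be scripted. The following Jython sketch for wsadmin calls the NodeSync MBean on each node agent to request synchronization; the node names are placeholders for the nodes in your cell.
# Jython sketch for wsadmin: synchronize the configuration repositories by calling
# the NodeSync MBean on each node agent. Replace the node names with your own.
for nodeName in ['node01', 'node02']:
    syncMBean = AdminControl.completeObjectName('type=NodeSync,node=%s,*' % nodeName)
    if syncMBean:
        print '%s synchronized: %s' % (nodeName, AdminControl.invoke(syncMBean, 'sync'))
    else:
        print 'Node agent for %s is not running; cannot synchronize.' % nodeName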

Application placement controller interactions

The following list contains issues triggered by health management and application placement controller interactions:

The server is a member of a dynamic cluster, but the placement controller cannot be contacted
For dynamic cluster members, health monitoring checks with the application placement controller to determine whether a server can be restarted. If the application placement controller is enabled, but cannot be contacted, the following message displays in the log:
WXDH1018E: Could not contact the placement controller
Verify that the placement controller is running. To determine where the application placement controller is running, click Runtime operations > Component stability > Core components in the administrative console. The location and stability status of each core component are displayed. The health controller writes its messages to the logs of the node agent or deployment manager process that is indicated by its current location.
The server is stopped, but not started
In a dynamic cluster, a restart can take one of several forms:
  • Restart in place (stop server, start server).
    Note: A restart in place always occurs when the dynamic cluster is in manual mode.
  • Start a server instance on another node, and stop the failing one.
  • Stop the failing server only, assuming that the remaining application instances can satisfy demand.

Sensor problems

The following list contains issues that are related to health management sensors:

No sensor data is received for the server
Health management cannot detect a policy violation if it receives no data from the sensors that are required by the policy. If sensor data is not received during the control cycle, health management prints the following log message:
WXDH3001E: No sensor data received during control cycle from server server_name for 
health class healthpolicy.
For response time conditions, health management receives data from the on demand router (ODR). No data is generated for these conditions until requests are sent through the ODR.

Task management status

Sometimes a Restart action task status ends up in the Failed or Unknown state. This scenario happens when the server does not stop within the time period that is allocated by default, or when the task times out. Use the HMM.StopServerTimeout cell custom property to adjust the stop timeout for your environment. The value is expressed in milliseconds, and the default value is 10000. This property allows health management to extend the wait time for server stop notifications that are received from the on demand configuration.
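
You can set the cell custom property from wsadmin as well as from the console. The following Jython sketch is a minimal example that creates the property under the cell's properties attribute; myCell is a placeholder for your cell name, and 60000 milliseconds is only an example value. After you save the change, synchronize the nodes so that running processes pick it up.
# Jython sketch for wsadmin: define the HMM.StopServerTimeout cell custom property.
# Replace 'myCell' with your cell name; 60000 (milliseconds) is an example value.
cellId = AdminConfig.getid('/Cell:myCell/')
AdminConfig.create('Property', cellId,
                   [['name', 'HMM.StopServerTimeout'], ['value', '60000']],
                   'properties')
AdminConfig.save()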

To increase the restart timeout for your environment, go to Operational policies > Autonomic managers > Health controller > Restart timeout. The default value is 5 minutes. The restart task times out after twice the specified amount of time, which allows the server to stop and then start.