Intelligent Management: troubleshooting health management
Check for the following problems when health management is not working, or is not working the way that you expect.
Finding the correct logs
The health controller is a distributed resource that is managed by the high availability (HA) manager. It exists within all node agent and deployment manager processes and is active within one of these processes. If a process fails, the controller becomes active on another node agent or deployment manager process.
To determine where the health controller is running, click in the administrative console. The location and stability status of the health controller are displayed.
The performance advisor is enabled with the predefined memory leak health policy
The predefined memory leak health policy uses the performance advisor function, so the performance advisor is enabled when this policy has assigned members. To disable the performance advisor, remove this health policy or narrow its membership. To preserve the health policy for future use, keep the memory leak policy, but remove all of its members. To change the members, click . You can edit the health policy memberships by adding and removing specific members.
Health controller settings
The following list contains issues that are encountered as a result of the health controller settings:
- Health controller is disabled
- To verify the setting in the administrative console, click and select both the Configuration and Runtime tabs. The health controller is enabled by default.
- Restarts are prohibited at this time
- To verify the prohibited restart times in the administrative console, click and select the Prohibited restart field. By default, no time values are prohibited.
- Restarting too soon after the previous restart
- To check the minimum restart interval in the administrative console, click and modify the Minimum restart interval field. No minimum interval is defined by default.
- Control cycle is too long
- To check the control cycle length in the administrative console, click and adjust the value if necessary. The health controller checks for policy violations periodically; if the control cycle length is too long, it might not restart servers quickly enough.
- The server was restarted X times consecutively, and the health condition continues to be violated
- In this case, X indicates the maximum consecutive restart parameter of the health controller. The health controller concludes that restarts are not fixing the problem, and disables the restarts for the server. The following message displays in the log:
WXDH0011W: Server servername exceeded maximum verification failures: disabling restarts.
The health controller continues to monitor the server and writes messages to the log if the health policy is violated:
WXDH0012W: Server servername with restarts disabled failed health check.
You can enable restarts for the server by performing any of the following actions:
- Disable and then enable the health controller.
- Adjust the Maximum consecutive restarts controller setting.
- Run the following command from the command prompt:
wsadmin -profile HmmControllerProcs.jacl enableServer servername
This script is available in the <app_server_root>\bin directory on the node agent or deployment manager nodes, and it requires a running deployment manager.
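The restart-gating rules in the list above can be sketched in a few lines of Python. This is an illustrative model only, not the product's implementation; the class and parameter names (RestartGate, prohibited_hours, and so on) are assumptions for the sketch.

```python
from datetime import datetime, timedelta

class RestartGate:
    """Illustrative model of the health controller's restart checks."""

    def __init__(self, enabled=True, prohibited_hours=(),
                 min_interval=timedelta(0), max_consecutive=3):
        self.enabled = enabled
        self.prohibited_hours = set(prohibited_hours)  # hours when restarts are blocked
        self.min_interval = min_interval               # minimum time between restarts
        self.max_consecutive = max_consecutive         # maximum consecutive restarts
        self.last_restart = None
        self.consecutive_failures = 0
        self.restarts_disabled = False

    def may_restart(self, now):
        if not self.enabled or self.restarts_disabled:
            return False                               # controller off, or WXDH0011W state
        if now.hour in self.prohibited_hours:
            return False                               # "Restarts are prohibited at this time"
        if self.last_restart and now - self.last_restart < self.min_interval:
            return False                               # "Restarting too soon after the previous restart"
        return True

    def record_restart(self, now, still_violating):
        self.last_restart = now
        if still_violating:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_consecutive:
                self.restarts_disabled = True          # analogous to WXDH0011W
        else:
            self.consecutive_failures = 0
```

In this model, re-enabling restarts (as the actions above describe) would amount to resetting restarts_disabled and the failure counter.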
Health policy settings
The following issues are encountered as a result of the health policy settings:
- The server is not part of a health policy
- To verify that the health policy memberships apply to your server in the administrative console, click .
- The reaction mode of a policy containing the server is a supervised mode
- To check the reaction mode in the administrative console, click . Servers are restarted automatically only when you set the reaction mode to Automatic. The following message is written to the log for the supervised condition:
WXDH0024I: Server server name has violated the health policy health condition, reaction mode is supervised.
Find approval requests for a restart action for a policy in .
- The server is a member of a static cluster, and it is the only cluster member that is running
- The health policy does not bring down all members of a cluster at the same time. If a cluster has only one cluster member, or only one cluster member is running, that member is not restarted.
- The server is a member of a dynamic cluster. The number of running instances does not exceed the minimum value, and the placement controller is disabled
- To check the minimum number of instances that are required for the dynamic cluster, click in the administrative console. In this case, health management treats the dynamic cluster like a static cluster, using the minimum number of instances parameter.
- The health controller has not received the policy
- The health controller does not run on the deployment manager where
the health policies are created. If the deployment manager is restarted
after the health controller started, the health controller might not
have the new policy. To resolve this problem, perform the following steps:
- Disable the health controller. In the administrative console, click .
- Synchronize the configuration repositories with the back-end nodes. In the administrative console, click . Select the nodes to synchronize, and click Synchronize.
- Restart the health controller. In the administrative console, click .
- Synchronize the configuration repositories with the back-end nodes. In the administrative console, click . Select the nodes to synchronize, and click Synchronize.
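Taken together, the cluster-membership rules above amount to a simple check: a health restart is allowed only while enough other members stay up. The following Python sketch is illustrative only; the function name and parameters are assumptions, not product APIs.

```python
def can_restart_member(running_members, minimum_instances=1):
    """Illustrative check of whether a health restart may stop a cluster member.

    A health restart never stops the only running member of a cluster.
    For a dynamic cluster with the placement controller disabled, the
    number of running instances must stay above the configured minimum
    (health management then treats the dynamic cluster like a static one).
    """
    return running_members > max(minimum_instances, 1)
```

For example, with two running members and a minimum of two instances, the restart is deferred because stopping one member would drop below the minimum.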
Application placement controller interactions
The following list contains issues triggered by health management and application placement controller interactions:
- The server is a member of a dynamic cluster, but the placement controller cannot be contacted
- For dynamic cluster members, health monitoring checks with the
application placement controller to determine whether a server can
be restarted. If the application placement controller is enabled,
but cannot be contacted, the following message displays in the log:
WXDH1018E: Could not contact the placement controller
Verify that the placement controller is running. To determine where the health controller is running, click in the administrative console. The location and stability status of the health controller are displayed. The health controller logs messages to the node agent or deployment manager process that is indicated by the current location.
- The server is stopped, but not started
- In a dynamic cluster, a restart can take one of several forms:
- Restart in place (stop the server, then start it). Note: This form always occurs when the dynamic cluster is in manual mode.
- Start a server instance on another node, and stop the failing one.
- Stop the failing server only, assuming that the remaining application instances can satisfy demand.
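One way to picture how the restart form might be chosen is the following Python sketch. It is purely illustrative: the decision inputs and return strings are assumptions for this sketch, not the placement controller's actual logic.

```python
def choose_restart_form(mode, demand_met_without_server):
    """Illustrative selection among the dynamic-cluster restart forms above.

    mode: "manual" or "automatic" (the dynamic cluster's operating mode).
    demand_met_without_server: whether the remaining application
    instances can satisfy demand if the failing server is simply stopped.
    """
    if mode == "manual":
        # A manual-mode dynamic cluster always gets a restart in place.
        return "restart in place"
    if demand_met_without_server:
        # Stopping the failing server alone is enough.
        return "stop only"
    # Otherwise, start a replacement instance elsewhere, then stop the failing one.
    return "start replacement, then stop failing server"
```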
Sensor problems
The following list contains issues that are related to health management sensors:
- No sensor data is received for the server.
- Health management cannot detect a policy violation if it receives
no data from the sensors that are required by the policy. If sensor
data is not received during the control cycle, health management prints
the following log message:
WXDH3001E: No sensor data received during control cycle from server server_name for health class healthpolicy.
For response time conditions, health management receives data from the on demand router (ODR). No data is generated for these conditions until requests are sent through the ODR.
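The control-cycle check described above can be sketched as follows. The function and sensor names are illustrative assumptions, not the product's internals.

```python
def missing_sensor_data(required_sensors, readings):
    """Return the sensors that produced no data during this control cycle.

    A non-empty result corresponds to the situation reported by
    WXDH3001E: the policy condition cannot be evaluated because some
    required sensor data never arrived (for response time conditions,
    this happens until requests flow through the ODR).
    """
    return [s for s in required_sensors
            if s not in readings or not readings[s]]
```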
Task management status
Sometimes a Restart action task status ends up in the Failed or Unknown state. This scenario happens when the server does not stop within the time period that is allocated by default, or when the task times out. Use the following cell-level custom property to adjust the timeout for your environment:
HMM.StopServerTimeout
The value is expressed in milliseconds, and the default value is 10000. This property allows health management to extend the wait time for server stop notifications that are received from the on demand configuration. To increase the timeout for your environment, go to . The default value is 5 minutes; the restart task starts after twice the specified amount of time, allowing the server to stop and start.
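A cell custom property such as HMM.StopServerTimeout can typically also be set with a wsadmin (Jython) script instead of the console. The following is a sketch only: AdminConfig.getid, AdminConfig.create, and AdminConfig.save are standard wsadmin calls, but verify the property location for your release before relying on it; the 30000 value is just an example.

```python
# Run inside a wsadmin -lang jython session; AdminConfig is provided by wsadmin.
cell = AdminConfig.getid('/Cell:/')              # the (only) cell in the repository
AdminConfig.create('Property', cell,
                   [['name', 'HMM.StopServerTimeout'],
                    ['value', '30000']])         # example: 30 seconds, in milliseconds
AdminConfig.save()                               # commit the configuration change
```

After saving, synchronize the nodes so that the new value reaches the node agents.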