Health management
With the health monitoring and management subsystem, you can take a policy-driven approach to monitoring the application server environment and take action when certain criteria are discovered.
Health monitoring and management subsystem
The health management subsystem continuously monitors the state of servers and the work that is performed by the servers in your environment. The health management subsystem consists of two main elements: the health controller and health policies.
The health controller is the autonomic manager that controls the health monitoring and management subsystem, and acts on your health policies to ensure that certain conditions exist. The health controller is a distributed resource managed by the high availability manager, and exists within all node agent and deployment manager processes. The health controller is active in one of these processes. If the active process fails, the health controller can become active on another node agent or deployment manager process.
The health controller runs on a control cycle. The control cycle length defines the amount of time between environment checks initiated by the health controller. At the end of the control cycle, the health controller checks the environment and generates runtime tasks to resolve any breaches in the health conditions.
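The control cycle described above can be pictured as a simple periodic loop. The following is a minimal conceptual sketch, not the product's implementation; the function and parameter names (`check_conditions`, `resolve`, `cycle_length_s`) are illustrative assumptions:

```python
import time

def control_cycle(check_conditions, resolve, cycle_length_s=300, cycles=1):
    """Conceptual sketch of the health controller's control cycle.

    check_conditions() returns the list of breached health conditions
    found when the environment is checked; resolve(breach) turns each
    breach into a runtime task. All names here are illustrative, not
    part of the product API.
    """
    tasks = []
    for _ in range(cycles):
        time.sleep(cycle_length_s)          # wait one control cycle
        for breach in check_conditions():   # check the environment
            tasks.append(resolve(breach))   # generate a runtime task
    return tasks
```

A shorter cycle length detects breaches sooner at the cost of more frequent environment checks.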
You define the health policies, which include the health conditions that you want to monitor in your environment and the health actions to take if these conditions are not met.
You can disable or enable health management using the health controller, while still having multiple health policies defined on the system. You can limit the server restart frequency or prohibit restarts during certain periods.
The health management subsystem functions when Intelligent Management is in automatic or supervised operating mode. When the reaction mode on the policy is set to automatic, the health management system takes action when a health policy violation is detected. In supervised mode, the health management system creates a runtime task that offers one or more reactions. The system administrator can approve or deny the proposed actions.
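The difference between the two reaction modes can be sketched as a small dispatch: automatic mode runs the action immediately, while supervised mode only creates a runtime task for an administrator to approve or deny. The function names are illustrative assumptions, not the product API:

```python
def react(mode, violation, run_action, create_task):
    """Sketch of reaction-mode handling for a health policy violation.

    mode is "automatic" or "supervised"; run_action and create_task are
    placeholder callbacks standing in for the subsystem's behavior.
    """
    if mode == "automatic":
        return run_action(violation)        # take action immediately
    if mode == "supervised":
        return create_task(violation)       # awaits administrator approval
    raise ValueError("unknown reaction mode: %s" % mode)
```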
Health conditions
- Age-based condition
- Tracks the amount of time that the server is running. If the amount of time exceeds the defined threshold, the health actions run.
- Excessive request timeout condition
- Specifies a percentage of HTTP requests that can time out. When the percentage of requests exceeds the defined value, the health actions run. The timeout value depends on your environment configuration. For more information about the excessive request timeout health condition, see excessive request timeout health policy target timeout value.
- Excessive response time condition
- Tracks the amount of time that requests take to complete. If the time exceeds the defined response time threshold, the health actions run.

  Attention: Any requests that exceed the timeout threshold are not included in the excessive response time calculations. For example, if the default timeout value of 60 seconds is in effect, any requests that exceed that threshold and time out are not included in the calculations for excessive response time. This restriction applies even if you do not have the excessive request timeout health condition defined in your environment.
- Memory condition: excessive memory usage
- Tracks the memory usage for a member. When the memory usage exceeds a percentage of the heap size for a specified time, health actions run to correct this situation.
- Memory condition: memory leak
- Tracks consistent downward trends in free memory that is available to a server in the Java™ heap. When the Java heap approaches the maximum configured size, you can perform either heap dumps or server restarts.
- Storm drain condition
- Tracks requests that have a significantly decreased response time. This policy relies on change point detection on given time series data.
- Workload condition
- Specifies a number of requests that are serviced before policy members restart to clean out memory and cache data.
- Garbage collection percentage condition
- Monitors a Java virtual machine (JVM) or set of JVMs to determine whether they spend more than a defined percentage of time in garbage collection during a specified time period.
These predefined health policy conditions are designed to optimize the distribution of the required data, minimize the impact of monitoring, and enforce the health policy in your environment.
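The timeout exclusion noted for the excessive response time condition can be illustrated with a small sketch. This is a simplified model of the calculation, assuming the default 60-second timeout; it is not the product's actual algorithm:

```python
def mean_response_time(samples_ms, timeout_ms=60000):
    """Illustrative excessive-response-time calculation.

    Requests that exceed the timeout threshold are excluded from the
    average, matching the restriction described for this condition.
    samples_ms is a list of per-request response times in milliseconds.
    """
    completed = [s for s in samples_ms if s <= timeout_ms]
    if not completed:
        return 0.0
    return sum(completed) / len(completed)
```

With the default timeout, a 70-second request is dropped from the average even though it clearly indicates slowness, which is why pairing this condition with the excessive request timeout condition is useful.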
You can also define custom conditions for your health policy if the predefined health conditions do not fit your needs. You define custom conditions as a subexpression that is tested against metrics in your environment, for example:

PMIMetric_FromServerStart$systemModule$cpuUtilization > 90L

When you define a custom condition, consider the cost of collecting the data, analyzing the data, and, if needed, enforcing the health policy. This cost can increase with the amount of traffic and the number of servers in your network. Analyze the performance of your custom health conditions before you use them in production.
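A custom-condition subexpression such as the CPU utilization example above is, at its core, a metric compared against a threshold. The sketch below evaluates only the simple "metric operator number" shape for illustration; the product's expression language is richer, and this parser is an assumption, not its implementation:

```python
import operator
import re

def evaluate_custom_condition(expression, metrics):
    """Illustrative evaluation of a simple custom-condition subexpression.

    Handles only "<metric> <op> <integer>[L]" expressions, e.g.
    "PMIMetric_FromServerStart$systemModule$cpuUtilization > 90L".
    metrics maps metric names to their current values.
    """
    ops = {">": operator.gt, "<": operator.lt,
           ">=": operator.ge, "<=": operator.le}
    m = re.match(r"\s*(\S+)\s*(>=|<=|>|<)\s*(\d+)L?\s*$", expression)
    if not m:
        raise ValueError("unsupported expression: %s" % expression)
    name, op, value = m.groups()
    return ops[op](metrics[name], int(value))
```

Evaluating such an expression against every server on every control cycle is where the collection and analysis cost mentioned above comes from.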
Health actions
Health actions define the process to use when a health condition is not met. Depending on the conditions that you define, the actions can vary. The following table lists the health actions that are supported in various server environments:
Health action | WebSphere® application servers that run in the same Intelligent Management cell | Other middleware servers (including external WebSphere application servers) |
---|---|---|
Restart server | Supported | Supported |
Take thread dumps | Supported | Not supported |
Take Java virtual machine (JVM) heap dumps | Supported for servers that are running on the IBM® Software Development Kit | Not supported |
Put server into maintenance mode | Supported | Supported |
Put server into maintenance mode and break HTTP and SIP request affinity to the server | Supported | Supported |
Take server out of maintenance mode | Supported | Supported |
Generate a Simple Network Management Protocol (SNMP) trap | Supported | Supported |
Important: To preserve disk space, the Performance and Diagnostic Advisor does not take heap dumps if more than 10 heap dumps already exist in the WebSphere Application Server home directory. Depending on the size of the heap and the workload on the application server, taking a heap dump might be expensive and might temporarily affect system performance.
The restart server action can take one of the following forms:

- Restart in place (stop the server, then start the server). This restart always occurs when a dynamic cluster is in manual mode.
- Start a server instance on another node, and stop the failing one.
- Stop the failing server only, assuming that the remaining application instances can satisfy demand.
You can also define a custom action. With a custom action, you define an executable file to run when the health condition breaches. You must define custom actions before you create the health policy that contains the custom actions.
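Conceptually, a custom action is an administrator-supplied executable that the subsystem invokes when the condition breaches. The sketch below shows one way such an invocation might look; the argument convention (passing the server name and condition to the executable) is an assumption for illustration, not the product's contract:

```python
import subprocess

def run_custom_action(executable, server, condition, timeout_s=60):
    """Illustrative invocation of a custom health action.

    Runs the administrator-defined executable with the breaching server
    and condition as arguments, and reports success via the exit code.
    """
    result = subprocess.run(
        [executable, server, condition],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.returncode == 0
```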
Health policy targets
Health policy targets can be a single server, each of the servers in a cluster or dynamic cluster, the on demand router (ODR), or each of the servers in a cell. You can define multiple health policies to monitor the same set of servers.
If you are using predefined health conditions, the support varies depending on the server type. Certain middleware servers do not support all of the policy types. The following table summarizes the health policy support, by server type:

Predefined health policy | WebSphere application servers that run in the same Intelligent Management cell | Other middleware servers (including external WebSphere application servers) |
---|---|---|
Age-based policy | Supported | Supported |
Workload policy | Supported | Supported |
Memory leak detection | Supported | Not supported |
Excessive memory usage | Supported | Supported for WebSphere Application Server Community Edition servers. Not supported for other middleware server types. |
Excessive request timeout | Supported | Supported for other middleware servers to which the ODR routes requests. |
Excessive response time | Supported | Supported |
Storm drain detection | Supported | Supported |
Garbage collection percentage | Supported | Not supported |
Default health policies
You can create default health policies using predefined health conditions installed with the product.
To create a default health policy, click , and select one of the predefined health conditions.

- Default memory leak: Default standard detection level. The default memory leak health policy uses the performance advisor function. The performance advisor is enabled when this policy is enabled. To disable the performance advisor, remove this health policy or narrow the membership of the health policy. To preserve the health policy for future use, keep the default memory leak policy, but remove all members. To change the members, click . You can edit the health policy memberships by adding and removing members from the policy.
- Default excessive memory usage: Set to 95 percent of the JVM heap size for 15 minutes
- Default excessive request timeout: Set for 5 percent of the requests timing out
- Default excessive response time: Set to 120 seconds
- Default storm drain: Default standard detection level
- Garbage collection percentage: Set to 10 percent. The default sampling time is 2 minutes.
To view the recommendations made by default health policies and to take actions on these recommendations, click .