By performing these routine procedures in addition to your daily, weekly,
monthly and quarterly health checks, you ensure that your Tivoli Monitoring environment
continues to run smoothly.
Run the taudit.js tool every day. It provides an overall status of the
environment. You can find the tool in the Tivoli® Open Process Automation Library (OPAL)
by searching for "Web SOAP scheduled reporting tools" or for navigation
code "1TW10TM0U".
Take a monitoring server backup every 24 hours in the early stages, and then move
to weekly backups. If you have effective snapshot software, you can take backups
with the monitoring server or portal server, or both, online. Otherwise, shut down
the monitoring server and portal server before taking a backup. Test these backups
after you first develop the process and at least twice a year thereafter by
restoring to a monitoring server in your test environment. This confirms that the
backups contain the data you need to restore your production monitoring server
after an outage or to roll back to a previous state.
Make sure the portal server database backup is in the plan and is being
made daily while the environment is being rolled out, and then weekly as the
environment matures and changes become less frequent. Test these
backups after you first develop the process and at least twice a year thereafter
by restoring to a portal server in your test environment. This confirms that the
backups contain the data you need to restore your production portal
server after an outage or to roll back to a previous state.
Make sure the DB2® warehouse backup is in the plan and is being made weekly.
The backup is made weekly because of the large size of the warehouse database.
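The cadence in the three backup items above can be sketched as a small lookup. This is only an illustration: the component keys, the phase names "rollout" and "mature", and the 182-day approximation of "at least twice a year" are assumptions made for the example, not product settings. The restore test applies to the monitoring server and portal server backups.

```python
from datetime import datetime, timedelta

# Intervals taken from the text: daily while the environment is being rolled
# out, weekly once it matures; the warehouse is backed up weekly throughout.
# The component keys and phase names are hypothetical labels.
BACKUP_INTERVALS = {
    ("monitoring_server", "rollout"): timedelta(days=1),
    ("monitoring_server", "mature"): timedelta(weeks=1),
    ("portal_server_db", "rollout"): timedelta(days=1),
    ("portal_server_db", "mature"): timedelta(weeks=1),
    ("warehouse_db", "rollout"): timedelta(weeks=1),
    ("warehouse_db", "mature"): timedelta(weeks=1),
}

# Assumption: "at least twice a year" is approximated as every 182 days.
RESTORE_TEST_INTERVAL = timedelta(days=182)


def backup_due(component, phase, last_backup, now):
    """Return True when the interval for this component and phase has elapsed."""
    return now - last_backup >= BACKUP_INTERVALS[(component, phase)]


def restore_test_due(last_restore_test, now):
    """Return True when no restore test has been run yet or the last one is too old."""
    if last_restore_test is None:
        return True
    return now - last_restore_test >= RESTORE_TEST_INTERVAL
```

A scheduler or a daily health-check script could call `backup_due` for each component and report the ones that return True.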
Check daily that the warehouse agent is working by looking at the warehouse
logs (hostname_hd_timestamp-nn.log).
Check daily that the Summarization and Pruning agent is working by looking at its logs (hostname_sy_timestamp-nn.log).
Check the monitoring server (hostname_ms_timestamp-nn.log) and portal server logs
(hostname_cq_timestamp-nn.log) for any obvious
errors and exceptions.
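The log checks above can be partly automated. The following is a minimal sketch that recognizes the file name patterns quoted in the text and flags lines containing "error" or "exception". The keyword match, the function names, and the assumption that the logs are readable as text are choices made for the example. The actual log contents may call for different patterns.

```python
import re
from pathlib import Path

# Component codes as they appear in the log file names quoted in the text.
COMPONENT_CODES = {
    "hd": "warehouse agent",
    "sy": "Summarization and Pruning agent",
    "ms": "monitoring server",
    "cq": "portal server",
}

# Assumption: a case-insensitive keyword match is enough to surface
# "obvious" errors and exceptions.
SUSPECT = re.compile(r"error|exception", re.IGNORECASE)


def parse_log_name(filename):
    """Split hostname_cc_timestamp-nn.log into its parts, or return None."""
    if not filename.endswith(".log"):
        return None
    head, dash, sequence = filename[:-4].rpartition("-")
    if not dash or not sequence.isdigit():
        return None
    parts = head.rsplit("_", 2)  # split from the right: hostnames may contain "_"
    if len(parts) != 3 or parts[1] not in COMPONENT_CODES:
        return None
    hostname, code, timestamp = parts
    if not hostname or not timestamp:
        return None
    return {"hostname": hostname, "code": code, "timestamp": timestamp,
            "sequence": int(sequence), "component": COMPONENT_CODES[code]}


def find_suspect_lines(lines):
    """Return (line number, text) for every line that mentions an error or exception."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if SUSPECT.search(line)]


def scan_logs(log_dir):
    """Map each recognized log file name in log_dir to its suspect lines."""
    report = {}
    for path in sorted(Path(log_dir).iterdir()):
        if path.is_file() and parse_log_name(path.name) is not None:
            with path.open(encoding="utf-8", errors="replace") as handle:
                hits = find_suspect_lines(handle)
            if hits:
                report[path.name] = hits
    return report
```

Calling `scan_logs` on the directory that holds the component logs returns only the files with suspect lines, so an empty result means that nothing obvious was found.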
Check that there are no monitoring servers overloaded with agents. One
way to do this is by checking the "Self-Monitoring Topology" workspace, which
has a "Managed Systems per TEMS" view showing the number of agents reporting
to each monitoring server.
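If you export the same information that the "Managed Systems per TEMS" view shows, the check can be scripted. This sketch assumes a hypothetical mapping of managed system name to monitoring server name and a site-specific limit, `max_agents`. The limit is a value you choose for your environment, not a product constant.

```python
from collections import Counter


def agents_per_tems(agent_to_tems):
    """Count how many agents report to each monitoring server."""
    return Counter(agent_to_tems.values())


def overloaded_tems(agent_to_tems, max_agents):
    """Return the monitoring servers whose agent count exceeds max_agents."""
    counts = agents_per_tems(agent_to_tems)
    return {tems: count for tems, count in counts.items() if count > max_agents}
```

For example, `overloaded_tems(mapping, max_agents=500)` returns an empty dictionary when every monitoring server is at or below the chosen limit.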
For DB2, run REORGCHK and RUNSTATS on the warehouse database daily.
For DB2, run REORGCHK and RUNSTATS on the portal server database weekly.
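The two schedules above can be sketched as a helper that decides which databases are due and builds the DB2 command line processor statements for them. The database names WAREHOUS and TEPS, the choice of Sunday for the weekly run, and the table names passed in are assumptions for illustration, so substitute the values from your own installation. The sketch gathers statistics with RUNSTATS first and then runs REORGCHK against the current statistics. That order is a design choice, not something the text prescribes.

```python
from datetime import date

# Assumptions: placeholder database names and weekly maintenance day.
WAREHOUSE_DB = "WAREHOUS"
PORTAL_DB = "TEPS"
WEEKLY_DAY = 6  # date.weekday(): Monday is 0, Sunday is 6


def databases_due(today):
    """Warehouse database every day; portal server database once a week."""
    due = [WAREHOUSE_DB]
    if today.weekday() == WEEKLY_DAY:
        due.append(PORTAL_DB)
    return due


def maintenance_commands(database, tables):
    """Build the DB2 CLP statements for one database.

    `tables` must contain schema-qualified table names, which RUNSTATS requires.
    """
    commands = [f"CONNECT TO {database}"]
    for table in tables:
        commands.append(
            f"RUNSTATS ON TABLE {table} WITH DISTRIBUTION AND DETAILED INDEXES ALL"
        )
    commands.append("REORGCHK CURRENT STATISTICS ON TABLE ALL")
    commands.append("CONNECT RESET")
    return commands
```

Each returned statement can be passed to the `db2` command line processor from a DB2 command environment. Review the REORGCHK output before you reorganize any table.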
Check that events, including events from user-created Universal Agents, are
reaching the Tivoli Enterprise Console server.
Check that all fired situations have been answered with a response and have
not remained in an open state for a long period of time.
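What counts as "a long period of time" depends on your operations. As a sketch, assuming that you can export situation events as records with a status and an opening time, a filter such as the following lists the events that have stayed open longer than a limit you choose. The record layout and the `max_open` parameter are hypothetical.

```python
from datetime import datetime, timedelta


def long_open_events(events, now, max_open):
    """Return events still open for longer than max_open, oldest first.

    Each event is a dict with hypothetical keys "situation", "status",
    and "opened_at" (a datetime).
    """
    stale = [event for event in events
             if event["status"] == "open" and now - event["opened_at"] > max_open]
    return sorted(stale, key=lambda event: event["opened_at"])
```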
Check that all the agents are responding by making SOAP down calls to
each agent. Running taudit.js (as mentioned above) checks this automatically.
Check the memory and CPU usage of the core component processes, and make sure
that you have situations created to monitor them.