Auditing TEMS for High Impact Workloads

Technote (FAQ)


How can a user determine what workload is badly impacting a TEMS?


New releases are now published here:
Sitworld: TEMS Audit Process and Tool

Release 1.28000 is available along with a documentation rewrite.

Auditing TEMS for High Impact Workloads

Overview – Version 1.10000 12 August 2013

There have been cases every year where a TEMS was running with high enough CPU/Storage resource usage that the customer was concerned. In some cases, the TEMS experienced a steady storage growth and failure after seven or more days. In all recent cases, this condition has been triggered by workloads. Situations are the most common workload but SOAP calls, historical data collection and Portal Client workspaces can be involved.

Some years ago, the root cause for this symptom was more likely to be a product defect where storage is allocated and not freed – a storage leak. However that cause is now rare.

Another older root cause involves Pure situations configured with DisplayItem where the DisplayItem value took on a widely set of varying values. The ITM product now has a solution for that issue - see Appendix 1.

The remaining causes are hidden because the symptom is not logged by obvious error messages or incorrect results. Typical diagnostic tracing does not show the issue – nor does the dump or core files that are captured. The issue shows a diffuse pattern of high utilization. The results are obvious but the cause not so much.

This technote explains a procedure for auditing a TEMS to identify the workload stressors that cause the high usage. When the stressors are identified, the situation/SOAP/workspace query can be modified to remove the stress and performance will be immediately improved. If for some reason no changes are possible, then the user can simply plan for the needed periodic TEMS recycles to avoid unplanned outages.

A recent addition reports on historical data exporter processing.

This procedure points to the problem. At that point the identified problem needs to be worked on and understood. In one recent case, a single situation was split into two situations and the remote TEMS workload on six large servers dropped by 90%. Perform this audit periodically to avoid these types of issues.

Audit process stage 1: capture the diagnostic data needed.

Pick a TEMS which is heavily utilized. This will typically be a remote TEMS. There does not have to be a large number of agents. In one case there were about 400 AIX servers reporting.

Turn on diagnostic tracing as described in Appendix 3. The trace filter needed is

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

When the diagnostic trace has been running for a least an hour, gather the diagnostic logs which contain the data. The logs will be in

Windows: <installdir>\logs
Linux/Unix: <installdir>/logs
z/OS: RKLVLOG sysout

For non-zOS cases, the log names will have this form [default settings]


NN will be 01/02/03/04/05.

There is one set of such logs for each time the TEMS is started. You can identify them as the ones with most recent dates. In the default configuration, the logs are written in segments: 01 is first, then 02/03/04/05 and when the 05 segment is full it wraps back to segment 02. The end of each log segment contains a pointer to the next one. The current active one does not have a pointer. You must create a single log file of the segments in the correct order.

Automatic processing of log order is possible but that would complicate the audit program. In addition, you might choose to select only one segment for analysis and any automatic process would prevent that.

Audit Process: Analysis

Attached to this technote is a Perl program This reads the log created above and creates a report file in comma separated variables format. Here is what the result of an actual case [with names changed] as displayed in Lotus Symphony. Some rows were removed for brevity, so the total lines do not add up.

The optional first report section involves “too big” situation/Soap/Workspace view cases. There is also a remote SQL time out count if that occurred.

First Report Column Definitions:

Situation Name of the situation if present
Table Table the filter object were associated with
Object sizes The sizes of the Filter Plan and Pool objects
Count Number of too big messages seen

If you are unfamiliar with the issue of “too big” filters, read this technote, especially discussion on Stage 4 transmission to agent.

ITM Situation Limits

The second report section provides an overall summary of the processing results.

The third report section summarizes TEMS processing results by situation. The results are sorted in descending order by the total result bytes seen.

Situation Report Column Definitions:
Situation Name of the situation if present
Table Table the result set was associated with
Count Number of times a result was processed
Rows Total number of rows in all results
ResultBytes Total number of bytes in all results
Result/Min Result Bytes per minute
Fraction What fraction this situation contributed
Cumulative% The total fraction of situations to this point
MinResult Minimum size of results
MaxResult Maximum size of results
MaxNode First agent which supplied maximum results

*Note: if the Situation or Table begin (NULL)- that means the results came from a real time data request such as a Portal Client workspace view.

The fourth report section summarizes TEMS processing results by managed system [agent or node] and excludes HEARTBEAT processing. The results are sorted in descending order by the total result bytes seen.

Managed System Report Column Definitions:
Node Name of the managed system [Agent]
Table Table the result set was associated with
Count Number of times a result was processed
Rows Total number of rows in all results
ResultBytes Total number of bytes in all results
Result/Min Result Bytes per minute
MinResult Minimum size of results
MaxResult Maximum size of results
MaxSituation First situation which supplied maximum results

And in the *total line

Table Duration of results capture in seconds
Count Total number of results
Rows Total number of rows
ResultBytes Total number of result bytes
Result/Min total result bytes per second

Analysis – Background on Results Activity at the TEMS

Most of TEMS processing is driven by incoming results. Each agent is given situations to perform. In the best cases, as the situations are evaluated, result row(s) are calculated and sent to the TEMS. Situations contain formula to filter the results so that partial results are transmitted. For Pure situations [without a sampling interval] each result stands alone and is sent to the TEMS. Results are always sent to the Agent Proxy [part of TEMS]. If there are no results, 0 rows are sent which indicates a negative formula. The Agent Proxy then propagates the results to SITMON when appropriate and a situation event may result.

Agent side filtering is clearly a best case, but there are cases where the agent does no filtering and here are four examples.

1) The formula is too big to be transmitted to the agent.
2) Results must be sorted
3) The Situation formula contains column functions [*AVG, *COUNT, *SUM, *MAX, *MIN, *TIME, etc] which can only be evaluated at the TEMS. Only the *MISSING column functions can be processed at the agent. The *IN column function is a shortcut notation and not a column function.
4) History collection situations

Situation Formula can have issues (1) and (2) and (3).

Portal Client and SOAP requests can have issues (2) and (3).

Historical data is sent to the TEMS when historical configuration is specified that way. There is no filtering of Historical data. The alternative is to collect data at the Agent and not transship to the TEMS.

Analysis – A Voyage of Discovery

TEMS performance can be improved by reducing the number of result rows processed. The Second section – Results report – can be used to identify what workload has the biggest impact. The situation needs to be understood completely and new situation formula needs to be created. If a SOAP or TEP Workspace view or Historical data is the issue, that also needs to be understood.

Example Case 1 – Formula Too Big

If the situation shows up in the Too Big report [first section] the solution is usually easy. Here is an example - IBM_UX_FileMonitoring2_Warn. In this case the situation results composed 93% to 96% of the TEMS processed results bytes, depending on the remote TEMS view. The Too Big report showed

Situation Table FilterSize
IBM_UX_FileMonitoring2_Warn FILEINFO 39776 + 22968

The filter plan is 39776 bytes and the limit for Agent transmission is 32767. See Appendix 1 for details.

The Situation formula can be captured using

tacmd login
tacmd viewsit -s <sitname> >sitname.lst

In this case, the formula definition was

*IF ( ( *VALUE File_Information.File *IN (''crontab'',''syslog.conf'',''logrotate.conf'',''ftpusers'',''passwd'',''shadow'',''pam.conf'',''hosts.allow'',''hosts.deny'',''shutdown.allow'',''sysconfig'',''sysctl.conf'',''inittab'',''init'',''services'',''inetd.conf'',''securetty'') *AND *VALUE File_Information.File_Content_Changed *EQ Yes *AND *VALUE File_Information.Path *EQ ''/etc/'' ) *OR ( *VALUE File_Information.File *EQ grub.conf *AND *VALUE File_Information.File_Content_Changed *EQ Yes *AND *VALUE File_Information.Path *EQ ''/boot/grub/'' ) *OR ( *VALUE File_Information.File *EQ init *AND *VALUE File_Information.File_Content_Changed *EQ Yes *AND *VALUE File_Information.Path *EQ ''/sbin/'' ) )

Superficially this doesn’t look that bad, However the *IN clause – which is a shortcut notation for

“*VALUE attribute *EQ ‘xxx’ *OR *VALUE attribute * EQ ‘yyy’ …”

consumes a huge amount of filter plan object space. In typical use, you can only have 15 to 20 values before exceeding the limits.

The suggestion was to split 8 of the tests into a new situation like this

*IF ( *VALUE File_Information.File *IN (''hosts.deny'',''shutdown.allow'',''sysconfig'',''sysctl.conf'',''inittab'',''init'',''services'',''inetd.conf'',''securetty'') *AND *VALUE File_Information.File_Content_Changed *EQ Yes *AND *VALUE File_Information.Path *EQ ''/etc/'' )

Then those cases must be removed from the original situation. That is sure to work and needs little testing.

This situation so dominated the TEMS incoming results, it wasn’t worthwhile looking at others. On six remote TEMS tasks the CPU dropped from 90% to 10% utilization.

Example Case 2 – Use of Column Functions

Here are some cases where situations were regularly delivering 100K bytes of data every 5 minute cycle:

*IF *VALUE R/3_Intermediate_Documents.Direction *EQ Inbound *AND
*VALUE R/3_Intermediate_Documents.Message_Type *EQ INVOIC *AND *VALUE
R/3_Intermediate_Documents.Status_Number *EQ 56 *AND *TIME
R/3_Intermediate_Documents.Create_Time *LE 'Local_Time.Timestamp - 1H'

==> In this case there was a *TIME test, which forces TEMS evaluation.

*IF *COUNT R/3_Updates_Information.Status *GE 200 *AND *COUNT
R/3_Updates_Information.Status *LT 300 *AND *VALUE
R/3_Updates_Information.State_Code *EQ V1

==> *COUNT needs TEMS evaluation


==> In many cases, *SCAN requires TEMS evaluation

*IF *VALUE R/3_Work_Processes.Status *EQ Waiting *AND *COUNT
R/3_Work_Processes.Number *EQ 0

==> *COUNT for value zero will never fire. Should using MISSING

Here are some ways to decrease the impact of such situations:

1) If the results are coming from a specific node, reconfigure that system to reduce the number of results.
2) Increase Sampling interval – for example from 5 minutes to 15 minutes.
3) Use only one *COUNT in a formula.
4) The time based schemes could be pushed to a two situation scheme. In that scheme a situation does the non-TIME tests and has an action command that calls a shell file which does the time test. When required the situation writes a log entry. A second situation picks up the log entry and fires

Example Case 3 – Maximum Result Size

A large Maximum Result size can have a dramatic impact even though the average is not the highest. You can review the diagnostic traces for Rcvd lines like this

Rcvd 2500 rows sz 1132 tbl *.KSAIDOCS req IBM_Outbound_IDoc_Stat29 <11535572,481297336> node <XXX:YYY:mySAP>

The agents producing the high number of rows should be studied to determine why that is happening.

Example Case 4 – Managed System Report

In one case, the report on a heavily stressed remote TEMS showed 2.5 megabytes of results processed per minute over 8 hours. 70% of the results data came from a single SAP agent. Another SAP agent contributed 13% and a dozen others contributed 2% or less. The customer was able to focus on that SAP environment. By tuning and balancing that SAP environment, processing improved dramatically and as a minor side effect the stress on the remote TEMS was reduced.

SOAP Related Results

SOAP requests can result in heavy results activity. The two causes are SQL with large WHERE clauses and SQL which require a sort. The following trace at the hub TEMS will show what is happening.

error (unit:kpxrpcrq state)(unit:kshdhtp,Entry="getHeaderValue" all) (unit:kshreq,Entry="buildSQL" all)

The getHeaderValue entries will tell what ip address the SOAP is coming from and the buildSQL entries will show you the SQL that results.

In general for SOAP the usual solution is to simplify the SQL by using a smaller WHERE clause. In that case there will be more results, so there will be post-filtering in the program doing the SOAP. If this is a sort issue, sort the results after the data is retrieved instead of using SQL.

Portal Client related results

Portal Client workspace views can also generate large WHERE clauses and Sorting. Here is the trace filter:

error (unit:ctsql in er) (UNIT:ctdatabus IN ER)

The TEPS trace will show the outgoing SQL. It may take some investigation to discover the workspace view causing the high impact. You could have people logon one at a time and then see which one causes the issue. Most times it will be a customer created workspace. The ctdatabus trace entries will show which user(s) and which workspace(s) are involved.

The solution for Portal Client is like SOAP: make the SQL simpler, with fewer WHERE clauses. Post filtering is built into the Portal Client on the Properties/Filter dialog. In Portal Client workspaces the views can be sorted with a click on the title column so a SQL sort is not that important.

Historical Data Collection Export Processing on TEMS

You can configure historical data to collect at the Agent or the TEMS. Best practice is to collect at the Agent because a TEMS or communications outage will not lead to a loss of historical data. However collecting at the TEMS is a valid and supported environment. In that case you have the potential of a different sort of TEMS failure when the historical data is arriving faster than it can be exported to the Warehouse Proxy Agent. In that case the historical data may build up and cause serious processing problems by filling the mount point or disk.

A TEMS audit can reveal the incoming data and the export rate. You can use that to decide what data is collected, how frequently and whether Agent collection is best. The incoming data is the result rows associated with UADVISOR situations. The export reports require an additional trace control.

error (unit:khdxdacl,Entry="routeExportRequest" state er) (unit:khdxdacl,Entry=" routeData" detail er)

This trace will result in recording the export details including size, number and attribute group.

When the diagnostic file is processed by three additional reports will be generated. Here is an example of what is produced:

This run used the -expslot 10 option. This made the time slots 10 minutes instead of the default one hour.

Here are the three reports:

Historical Export summary by object - Column Definitions:
Object Name of Attribute Group
Table Table Name of Attribute Group
Appl Application Name of Attribute Group
Rowsize Size of one row
Rows Number of rows
Bytes Total number of bytes in all rows
Bytes_Min Result Bytes per minute

Historical Export summary by time - Column Definitions:
Time Time of trace
Rows Total number of rows in time slot
Bytes Total number of bytes in time slot
Secs Duration of time in seconds
Bytes_Min Result Bytes per minute

Historical Export summary by Object and time - Column Definitions:
Object Name of Attribute Group
Table Table Name of Attribute Group
Appl Application Name of Attribute Group
Rowsize Size of one row
Rows Number of rows
Bytes Total number of bytes
Time Time slot of traces

The goal is to understand the actual export rate. Compare this with the UADVISOR situations input rate and then make choices to avoid problems or do capacity planning. For example, having a Warehouse Proxy Agent on the same server as the TEMS process will maximize throughput assuming the server has adequate capacity.

Historical Data Collection Export Processing on Agent

If you are collecting data at the Agent the historical data export also occurs there. There may be times where you want to determine if the export process is not falling behind.

This can be analyzed using the following trace:

error (unit:khdxcpub,Entry="KHD_ValidateHistoryFile" state er) (unit:khdxhist,Entry="openMetaFile" state er) (unit:khdxhist,Entry="open" state er) (unit:khdxhist,Entry="close" state er) (unit:khdxhist,Entry="copyHistoryFile" state er)

Here is a sample report somewhat truncated so all three sections can be viewed.

The report columns for sections two and three are identical. The first section has some added columns of interest.

Historical Export summary by object - Column Definitions:
Object Name of Attribute Group
Table Table Name of Attribute Group
Appl Application Name of Attribute Group
Rowsize Size of one row
Rows Number of rows
Bytes Total number of bytes in all rows
Bytes_Min Result Bytes per minute
Cycles Number of export cycles during trace
MinRows Minimum number of rows after export
MaxRows Maximum number of rows after export
AvgRows Average number of rows after export
LastRows Number of rows after export on last cycle

Each cycle the export process prunes rows older than [default] 24 hours. If you run the audit on successive days you can see if the number of rows is stable or increasing. If it is not stable, there may be too much data for the network or the WPA to handle. At ITM 622 FP7 and ITM 623 GA there was an APAR IZ99471 change which can make the export process up to 10 times faster.

Summary: Make Changes and Retest

After you change things, gather tracing again and see how much improvement there is. When you are satisfied with the results, then stop making changes. If the issue is causing storage growth, you can do a planned recycle of the affected TEMS tasks to avoid unplanned outages.

At ITM 622 FP7 and ITM 623 GA an APAR IZ99471 change was made that can make the export process up to 10 times faster.

If you have remote TEMS tasks with different workload characteristics, then repeat the audit process.

Repeat this audit as needed to identify new stressors. For some sites that might be a couple times a year. For others that might be once a day.

Appendix 1: Pure situations with DisplayItem

A common cause of TEMS storage growth involves pure situations configured with DisplayItem where the DisplayItem value takes on a widely varying set of values. See this technote for details:

ITM Pure Situation events and Event Merging Logic

These are easy to spot since the TEMS database file QA1CSTSC.DB becomes very large indeed. Since the table is cached in storage, the TEMS process space becomes very large. See the technote for a solution introduced at ITM 621 FP3 and ITM 622 FP3.

Appendix 2: TEMS Audit Program Installation and Usage

There is one attached file - - a 1300 line Perl program. If you are viewing this document as an IBM Technote on a web page, look for an unlabeled icon just above the “Document Feedback” section at the end. Do a right-click and then save that file locally. It is a Perl program. That is usually present for Linux/Unix systems. For Windows you may need to install from or any other source. The program only uses core services.

It has been tested with

Activstate Perl on Windows - perl 5, version 16, subversion 2 (v5.16.3)

zLinux with Perl v5.8.7

No CPAN modules are used.


-h display help information
-z z/OS RKLVLOG log
-b Show HEARTBEATs in Managed System section
-expslot <mins> Historical export report, display slot size - default 60 mins
-v Produce progress messages in STDERR
-inplace Process the log segments without copying to work directory
-logpath Directory path to TEMS logs directory
-workpath Directory path to work directory, default is the system
Environment variable Windows – TEMP, Linux/Unix tmp

If there is a remaining parameter, it is a full qualified name of a log file with diagnostic trace data.

A recent change allows hands-free operation. In this mode, you do not specify a specific file. Instead you specify a path to a normal TEMS logs directory [or run it in that directory. In this mode the logpath [or default current directory] is examined to determine the current active log segments. Then they are processed in the correct time order. By default the log segments are copied into a work directory to avoid the issue where a log is deleted as we use it. The -inplace option skips that copying issue.

Note: If there is a one hour or higher time gap between the first and next segment, the first segment is processed only for too big messages. Otherwise all the log segments are processed in time order.

The report is written to standard output and that must be redirected to capture the report file.

On Linux/Unix, use chmod to make the program executable. You may also have to alter the first line of depending on where Perl is installed in your environment.

Run example:

perl cmsall.log >audit.csv

Example output: This is imported into a spreadsheet such as Lotus Symphony to view the results and do additional processing such as sorting or percentage calculations. The lines shown are a partial report to match the spreadsheet display earlier. That is why the *total line does not add up. This is from an earlier level of the Audit report so some things are different.

TEMS Audit report v0.75

Too Big Report
IBM_UX_FileMonitoring2_Warn,FILEINFO,39776 + 22968,
IBM_DBA_ArchiveMountPt_Warn,UNIXDISK,67808 + 15668,
IBM_UX_FileMonitoring1_Warn,FILEINFO,37472 + 21280,
IBM_DBA_ArchiveMountPt_Fatal,UNIXDISK,60512 + 14732,
IBM_DBA_ArchiveMountPt_Critical,UNIXDISK,65888 + 15220,
IBM_MDW_MQ_MountPoint_Fatal,UNIXDISK,37856 + 8928,
IBM_MDW_MQ_MountPoint_Warn,UNIXDISK,43232 + 9552,
IBM_DBA_ArchiveMountPt1_Warn,UNIXDISK,36320 + 8448,
IBM_DBA_ArchiveMountPt1_Critica,UNIXDISK,36320 + 8448,
KLZSYS-nosituation,KLZSYS,44960 + 10240,
KLZCPU-nosituation,KLZCPU,44960 + 10240,
TEST_IBM_EventIdsMonitoring,NTEVTLOG,40928 + 10012,
IBM_MDW_MQ_MountPoint1_Fatal,UNIXDISK,37856 + 8928,

Situation Summary Report

Managed System Summary Report - non-HEARTBEAT situations

Appendix 3 – Turning Tracing On and Off

A trace filter looks like this

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

The first token is a general level of tracing. The unit: specifies a local source type. The optional Entry= specifies a function name.

The easiest way is to turn on the tracing dynamically. If you are unfamiliar see this technote:

Dynamically modify trace settings for an IBM Tivoli Monitoring component

The commands needed
ras1 set error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

ras1 set error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" any er)

The general rule is that an “any” specification will remove the effect of any prior set(s). The er [error] should be added back.

Sometimes you cannot use the diagnostic console. Here is what you do for the different platforms.


Add this to the TEMS config file – which has the name hostname_ms_temsname.config

KBB_RAS1= 'error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)'

The values must be in quotes because the .config file is source included into a shell file. You could also update the ms.ini file, but you would need to do an itmcmd config before it takes place. The ms.ini file would not use the single quotes.


- right click on TEMS line
- Select Advanced
- Select Edit Trace Parms
- In the “Enter RAS1 Filters:” input area add

error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)


Add this to the RKANDATU(KBBENV) file

KBB_RAS1= error (unit:kpxrpcrq,Entry="IRA_NCS_Sample" state er)

Afterwards recycle the TEMS.

To remove the traces, restore the line to “error” in all cases and recycle the TEMS.

# 0.50000 - initial development
# 0.60000 - too big report first, add managed system report
# 0.70000 - convert RKLVLOG time stamps to Unix epoch seconds
# 0.75000 – Add optional SOAP SQL report, sort reports by results or count
# 0.80000 - Add historical exporter reports for TEMS
# 0.90000 – Add historical exporter reports for agent
# 0.95000 – Move Summary before report and add fraction and cumulative%
# 0.99000 - Calculate the correct log segments
# 1.00000 – Implement work directory for log segments
# 1.01000 - erase work directory files when finished
# 1.05000 - count trace lines and size per minute
# - inplace added, capture too big cases in hands off mode
# - Add count of remote SQL time out messages
# - Handle isolated historical data lines better
# 1.10000 - Add Advisory section
# - Fix hands off logic with Windows logs
# - Add 16meg truncated result advisory

Document information

More support for:

IBM Tivoli Monitoring V6
ITM Tivoli Enterprise Mgmt Server V6

Software version:

All Versions

Operating system(s):

AIX, HP-UX, Linux, Solaris, Windows, z/OS

Software edition:

All Editions

Reference #:


Modified date:


Translate my page

Content navigation