IBM Support

Recognizing and fixing session timeout problems

Troubleshooting


Problem

How do you recognize and fix IBM Spectrum Protect session timeout problems?

Symptom

An aborted backup, restore, archive or retrieve can be the result of an IBM Spectrum Protect session timeout. The symptoms depend on the type of data you are backing up, restoring, archiving or retrieving, and on where the timeout originates. Suspect a session timeout when you see any of the following messages or client return codes:

ACTLOG
- - -
ANR0530W Transaction failed ...
ANR0538I A resource waiter has been aborted.
ANR0480W Session ... severed
ANR0481W Session ... for node ... terminated - client did not respond ..
ANR0482W Session ... for node ... terminated - idle ...
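
If the activity log is available as plain text, the message IDs listed above can be searched for mechanically. The following is a minimal sketch, not an IBM tool: the function name find_timeout_messages is hypothetical, and the only assumption about the log format is that each message ID appears literally somewhere in its line.

```python
import re

# Message IDs listed above as typical indicators of a session timeout.
TIMEOUT_MESSAGE_IDS = ("ANR0530W", "ANR0538I", "ANR0480W", "ANR0481W", "ANR0482W")

_PATTERN = re.compile(r"\b(" + "|".join(TIMEOUT_MESSAGE_IDS) + r")\b")


def find_timeout_messages(actlog_text):
    """Return (line_number, message_id, line) for every line that
    contains one of the timeout-related message IDs."""
    hits = []
    for number, line in enumerate(actlog_text.splitlines(), start=1):
        match = _PATTERN.search(line)
        if match:
            hits.append((number, match.group(1), line.strip()))
    return hits


if __name__ == "__main__":
    # The sample lines are the abbreviated messages from the list above.
    sample = (
        "ANR0538I A resource waiter has been aborted.\n"
        "ANR0482W Session ... for node ... terminated - idle ...\n"
    )
    for number, message_id, line in find_timeout_messages(sample):
        print(number, message_id, line)
```

A hit is only a hint that a timeout may be involved; the surrounding log entries still have to be read in context.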

Client return codes
- - - - - - - - - -
#define DSM_RC_WILL_ABORT 157 /* Transaction will be aborted */
#define DSM_RC_TSM_FAILURE -71 /* TSM communications failure */
#define DSM_RC_TSM_ABORT -72 /* Session aborted abnormally */

Here are two examples of errors reported by application clients.

DB2: db2diag.log
- - - - - - - -
2010-02-04-07.47.12.350916-480 E12401049A348 LEVEL: Error
PID : 4452766 TID : 1 PROC : db2vend
INSTANCE: instpd01 NODE : 000
EDUID : 1
FUNCTION: DB2 UDB, database utilities, sqluvput, probe:1338
DATA #1 : TSM RC, PD_DB2_TYPE_TSM_RC, 4 bytes
TSM RC=0x0000009D=157 -- see TSM API Reference for meaning.

2010-02-04-07.47.12.677852-480 E12402534A348 LEVEL: Error
PID : 4452766 TID : 1 PROC : db2vend
INSTANCE: instpd01 NODE : 000
EDUID : 1
FUNCTION: DB2 UDB, database utilities, sqluvend, probe:1504
DATA #1 : TSM RC, PD_DB2_TYPE_TSM_RC, 4 bytes
TSM RC=0xFFFFFFB8=-72 -- see TSM API Reference for meaning.
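
The db2diag.log entries above print the return code as a 4-byte hexadecimal value next to its decimal form. Negative codes appear in two's complement, which is why 0xFFFFFFB8 corresponds to -72. The sketch below shows that conversion and maps the result to the names from the "Client return codes" list. The function name decode_tsm_rc is hypothetical, and the lookup table contains only the three codes quoted in this document.

```python
# Return codes quoted in the "Client return codes" list above.
KNOWN_RETURN_CODES = {
    157: "DSM_RC_WILL_ABORT",
    -71: "DSM_RC_TSM_FAILURE",
    -72: "DSM_RC_TSM_ABORT",
}


def decode_tsm_rc(hex_value):
    """Interpret a 32-bit hexadecimal string as a signed integer."""
    raw = int(hex_value, 16) & 0xFFFFFFFF
    # If the top bit is set, the value is negative in two's complement.
    return raw - 0x100000000 if raw & 0x80000000 else raw


if __name__ == "__main__":
    for text in ("0x0000009D", "0xFFFFFFB8"):
        rc = decode_tsm_rc(text)
        print(text, rc, KNOWN_RETURN_CODES.get(rc, "not in the list above"))
```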

Oracle: RMAN output
- - - - - - - - - -
ORA-27192: skgfcls: sbtclose2 returned error - failed to close file
ORA-19511: Error received from media manager layer, error text:
ANS1235E (RC-71) An unknown system error has occurred ...

Because the variety of symptoms is considerable, this list is not exhaustive.

Cause

You most likely ran into one of the following timeouts:
commtimeout, idletimeout, resourcetimeout, or some external timeout
triggered by one of your networking components.

Resolving The Problem

Calculation and activation of timeout settings:

o Estimate the total time required for the desired backup, restore, archive or retrieve.
o Add 50% to the estimate and convert the result into minutes for resourcetimeout and idletimeout, into seconds for commtimeout.
o You can activate the result at runtime, without restarting the IBM Spectrum Protect Server / StorageAgent:
At the administrative command prompt (dsmadmc) run the commands:
'setopt commtimeout sec'
'setopt idletimeout min'
'setopt resourcetimeout min'
(Replace sec with the seconds calculated above, replace min with the minutes calculated above. Important for resourcetimeout: See the notes below.)
o If your IBM Spectrum Protect Server is part of a library manager - library client setup, make sure all components involved are set to the calculated timeout values.
(You can issue all commands from the same command prompt as above by command routing. Simply prepend each command with the name of the component, e.g. 'library_manager_name: setopt commtimeout sec')
o If you are using LanFree, you have to perform the same changes for the StorageAgent (same method).
o Keep in mind that in many cases timeouts do NOT originate from IBM Spectrum Protect, but from external network components (firewalls, gateways, routers, switches, etc.).
You may have to ask your network administrator to check the current settings and implement any necessary changes:
Calculate and set the timeouts for these external components using the same method as described above for the IBM Spectrum Protect (formerly Tivoli Storage Manager) timeouts, so that every component can stay connected to every other component for the complete duration of the backup, restore, archive or retrieve.
o If possible, use the socket KEEPALIVE feature.
Although the KEEPALIVE feature has been officially documented in the "Administrator's Reference" manual only since IBM Spectrum Protect server version 7.1.5, it is available in earlier product versions, starting with Server / StorageAgent version 6.3.4.200.
It helps prevent timeouts caused by network components outside IBM Spectrum Protect, such as firewalls, gateways, routers and switches, and must be activated with the command 'setopt KEEPALIVE Yes'.
It takes effect for any NEW session opened AFTER the command. Product versions before 6.3.4.200 rely on a LanFree keep alive (ping) whose interval is the lesser of the values for resourcetimeout and idletimeout, divided by 4.
Therefore, do NOT set resourcetimeout and idletimeout higher than the calculated value; doing so unnecessarily lengthens the LanFree keep alive interval between Server and StorageAgent.
o Now you are ready to retry your failed backup, archive, restore or retrieve.
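
The calculation described in the steps above can be sketched as follows. This is only an illustration. The function names are hypothetical, the result is rounded up to whole minutes and seconds as an assumption, and the legacy keep alive interval is assumed to be in minutes because both inputs are. The generated strings follow the 'setopt' and command routing forms quoted above.

```python
import math


def timeout_settings(estimated_minutes, margin=0.5):
    """Add the safety margin (50% by default) to the estimated run time.

    idletimeout and resourcetimeout are returned in minutes,
    commtimeout in seconds.
    """
    total_minutes = estimated_minutes * (1 + margin)
    minutes = math.ceil(total_minutes)
    return {
        "commtimeout": math.ceil(total_minutes * 60),
        "idletimeout": minutes,
        "resourcetimeout": minutes,
    }


def setopt_commands(settings, target=None):
    """Build the 'setopt' command strings for dsmadmc.

    With a target, each command is prefixed for command routing,
    e.g. 'library_manager_name: setopt commtimeout 10800'.
    """
    prefix = f"{target}: " if target else ""
    return [f"{prefix}setopt {name} {value}" for name, value in settings.items()]


def legacy_keepalive_interval(resourcetimeout, idletimeout):
    """LanFree keep alive interval before 6.3.4.200: the lesser of the
    two values divided by 4 (unit assumed to be minutes)."""
    return min(resourcetimeout, idletimeout) / 4


if __name__ == "__main__":
    settings = timeout_settings(120)  # hypothetical 2-hour backup
    for command in setopt_commands(settings):
        print(command)
    print(legacy_keepalive_interval(settings["resourcetimeout"],
                                    settings["idletimeout"]))
```

For an estimated run time of 120 minutes, this gives 180 minutes for idletimeout and resourcetimeout and 10800 seconds for commtimeout. The same values then apply to the StorageAgent and to each library manager or library client involved.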

Special note regarding the option resourcetimeout

Note 1
Increasing the value of resourcetimeout is especially recommended as a response to the actlog message:
ANR0538I A resource waiter has been aborted.


Document Information

Modified date:
17 June 2018

UID

swg21689909