IT37869: FDC containing a NullPointerException occurs if a managed transfer enters recovery during negotiation and times out

APAR status

Closed as program error.

Error description

An MQ 9.2 Managed File Transfer agent is acting as the source
agent for managed transfers, where the managed transfers have a
transfer recovery timeout of 30 seconds.

Intermittently, the managed transfers processed by the agent go
into recovery during the negotiation phase because the
destination agent is not running. This causes the source agent
to write the following messages to its event log (output0.log):

  BFGTR0068I: Transfer ID: <transfer identifier>, to destination
agent <destination agent>, is entering recovery, immediately,
due to recoverable error: BFGTR0064I: The transfer sender timed
out waiting for a response from the transfer receiver.

  BFGTR0081I: Transfer ID: '<transfer identifier>' is now in
recovering state. The agent will wait for 30 seconds for the
transfer to recover.

30 seconds later, the destination agent still isn't running, and
so the managed transfer times out. When this happens, the source
agent writes the message:

  BFGSS0081E: Recovery of transfer ID: '<transfer_identifier>'
timed out after 30 seconds. The managed transfer has been
terminated.

to its event log, and moves the managed transfer into a
RecoveryTimedOut state.


Shortly after the destination agent restarts, the source agent
generates an FDC containing the information shown below:

...
Thread:     38 (TransferCompletedWorker)
Class:
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerWorkerThread
Method:     processCommand
Probe:      FFDC_001
Cause:
com.ibm.wmqfte.cmdhandler.FTECommandHandlerErrorException:
BFGCH0066E: An internal error has occurred. An exception has
been reported by the State Store. The exception is
java.lang.NullPointerException
com.ibm.wmqfte.cmdhandler.FTECommandHandlerErrorException:
BFGCH0066E: An internal error has occurred. An exception has
been reported by the State Store. The exception is
java.lang.NullPointerException
	at
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerStateStoreImpl.p
rocessInternalCommand(FTECommandHandlerStateStoreImpl.java:...)
	at
com.ibm.wmqfte.cmdhandler.impl.MessageHandlerImpl.internalMessag
eHandler(MessageHandlerImpl.java:...)
	at
com.ibm.wmqfte.cmdhandler.impl.MessageHandlerImpl.processCommand
(MessageHandlerImpl.java:...)
	at
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerWorkerThread.pro
cessCommand(FTECommandHandlerWorkerThread.java:...)
	at
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerWorkerThread.run
(FTECommandHandlerWorkerThread.java:...)
	at java.lang.Thread.run(Thread.java:...)
	at com.ibm.wmqfte.thread.FTEThread.run(FTEThread.java:...)
Caused by: java.lang.NullPointerException
	at
com.ibm.wmqfte.transfer.impl.TransferSenderRunnable.auditComplet
e(TransferSenderRunnable.java:...)
	at
com.ibm.wmqfte.transfer.impl.TransferSenderImpl.onComplete(Trans
ferSenderImpl.java:...)
	at
com.ibm.wmqfte.statestore.impl.FTEStateStoreImpl.sendForgetToFar
End(FTEStateStoreImpl.java:...)
	at
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerStateStoreImpl.p
rocessInternalCommand(FTECommandHandlerStateStoreImpl.java:...)
	... 6 more

and the managed transfer gets stuck in the RecoveryTimedOut
state.

Local fix

```
n/a
```

Problem summary

****************************************************************
USERS AFFECTED:
This issue affects users of MQ 9.2 Managed File Transfer, who
have enabled the transfer recovery timeout functionality for
managed transfers.


Platforms affected:
MultiPlatform

****************************************************************
PROBLEM DESCRIPTION:
The MQ Managed File Transfer recovery timeout functionality
provides a mechanism to automatically fail a managed transfer if
it enters recovery, and stays in recovery for longer than the
specified timeout period. Some information on this can be found
in the topic:

 Transfer recovery timeout concepts

in the MQ 9.2 section of IBM Documentation. The URI for the
topic is:

https://www.ibm.com/docs/en/ibm-mq/9.2?topic=transfers-transfer-
recovery-timeout-concepts


Now, if an agent was acting as the source agent for a managed
transfer that had a transfer recovery timeout specified, and the
managed transfer entered recovery during the transfer
negotiation phase because the destination agent did not respond,
then the source agent would write the following messages to its
event log (output0.log):

  BFGTR0068I: Transfer ID: <transfer identifier>, to destination
agent <destination agent>, is entering recovery, immediately,
due to recoverable error: BFGTR0064I: The transfer sender timed
out waiting for a response from the transfer receiver.

  BFGTR0081I: Transfer ID: '<transfer identifier>' is now in
recovering state. The agent will wait for <timeout> seconds for
the transfer to recover.

If the managed transfer subsequently timed out due to the
transfer recovery timeout, the source agent moved the managed
transfer into a "RecoveryTimedOut" state, and wrote the message:

  BFGSS0081E: Recovery of transfer ID: '<transfer_identifier>'
timed out after <timeout> seconds. The managed transfer has been
terminated.

to its event log.

When the destination agent eventually responded to the transfer
negotiation message, the source agent started to perform the
required cleanup for the managed transfer before cancelling it.
As part of this processing, it tried to generate some internal
audit information for the managed transfer based on the items
that had been transferred. However, because the managed transfer
hadn't actually started, the attempt to do this failed and the
source agent generated an FDC containing the following
information:

...
Thread:     38 (TransferCompletedWorker)
Class:
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerWorkerThread
Method:     processCommand
Probe:      FFDC_001
Cause:
com.ibm.wmqfte.cmdhandler.FTECommandHandlerErrorException:
BFGCH0066E: An internal error has occurred. An exception has
been reported by the State Store. The exception is
java.lang.NullPointerException
com.ibm.wmqfte.cmdhandler.FTECommandHandlerErrorException:
BFGCH0066E: An internal error has occurred. An exception has
been reported by the State Store. The exception is
java.lang.NullPointerException
&#09;at
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerStateStoreImpl.p
rocessInternalCommand(FTECommandHandlerStateStoreImpl.java:...)
&#09;at
com.ibm.wmqfte.cmdhandler.impl.MessageHandlerImpl.internalMessag
eHandler(MessageHandlerImpl.java:...)
&#09;at
com.ibm.wmqfte.cmdhandler.impl.MessageHandlerImpl.processCommand
(MessageHandlerImpl.java:...)
&#09;at
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerWorkerThread.pro
cessCommand(FTECommandHandlerWorkerThread.java:...)
&#09;at
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerWorkerThread.run
(FTECommandHandlerWorkerThread.java:...)
&#09;at java.lang.Thread.run(Thread.java:...)
&#09;at com.ibm.wmqfte.thread.FTEThread.run(FTEThread.java:...)
Caused by: java.lang.NullPointerException
&#09;at
com.ibm.wmqfte.transfer.impl.TransferSenderRunnable.auditComplet
e(TransferSenderRunnable.java:...)
&#09;at
com.ibm.wmqfte.transfer.impl.TransferSenderImpl.onComplete(Trans
ferSenderImpl.java:...)
&#09;at
com.ibm.wmqfte.statestore.impl.FTEStateStoreImpl.sendForgetToFar
End(FTEStateStoreImpl.java:...)
&#09;at
com.ibm.wmqfte.cmdhandler.impl.FTECommandHandlerStateStoreImpl.p
rocessInternalCommand(FTECommandHandlerStateStoreImpl.java:...)
&#09;... 6 more

This caused the managed transfer to remain registered in the
agent's internal state store, where it occupied a slot in the
"source transfer list". This meant that the number of source
transfers that the agent could participate in, as specified by
the agent property:

  maxSourceTransfers

was reduced by 1.

If the issue occurred multiple times during the lifetime of an
agent, it was possible for all of the source transfer slots to
be occupied by managed transfers that had encountered the issue.
This would result in the agent being unable to act as the source
agent for any new managed transfers.

Problem conclusion

To resolve this issue, two changes have been made:

Change 1:
-----------
The first change modifies the logic used to handle managed
transfers that have timed out due to a transfer recovery
timeout, so that the source agent only collects audit
information for the items in the managed transfer if the managed
transfer actually started before it entered recovery and timed
out. This prevents the FDC containing the NullPointerException
reported in the APAR from occurring.

Change 2:
-----------
In addition to this, the design of the transfer recovery timeout
functionality has been changed, so that managed transfers which
have entered a RecoveryTimedOut state are moved into an
"inactive transfers list". This allows the source agent to
participate in other managed transfer requests while it is
waiting to perform the processing required to clean up state for
managed transfers that have entered recovery and timed out.

---------------------------------------------------------------
The fix is targeted for delivery in the following PTFs:

Version    Maintenance Level
v9.2 LTS   9.2.0.5
v9.x CD    9.2.5

The latest available maintenance can be obtained from
'WebSphere MQ Recommended Fixes'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006037

If the maintenance level is not yet available information on
its planned availability can be found in 'WebSphere MQ
Planned Maintenance Release Dates'
http://www-1.ibm.com/support/docview.wss?rs=171&uid=swg27006309
---------------------------------------------------------------

Temporary fix

Comments

APAR Information

APAR number
IT37869
Reported component name
MQ BASE V9.2
Reported component ID
5724H7281
Reported release
920
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-08-03
Closed date
2021-11-29
Last modified date
2021-11-29

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name
MQ BASE V9.2
Fixed component ID
5724H7281

Applicable component levels

[{"Line of Business":{"code":"LOB45","label":"Automation"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSYHRD","label":"IBM MQ"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"920"}]

Document Information

Modified date:
30 November 2021

Tips

IT37869: FDC containing a NullPointerException occurs if a managed transfer enters recovery during negotiation and times out

Subscribe to this APAR

APAR status

Closed as program error.

Error description

Local fix

Problem summary

Problem conclusion

Temporary fix

Comments

APAR Information

APAR number

Reported component name

Reported component ID

Reported release

Status

PE

HIPER

Special Attention

Submitted date

Closed date

Last modified date

APAR is sysrouted FROM one or more of the following:

APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name

Fixed component ID

Applicable component levels

Document Information

Share your feedback

Need support?