Restarting batch jobs

Stopping and restarting batch jobs are essential functions when processing large volumes of data.

The batch processor can restart the following types of stopped batch jobs:

  • Batch jobs that were stopped by the runbatch.sh -stop <processId> command.
  • Batch jobs that stopped gracefully due to manual intervention (updateTask). The status of these batch jobs is Stopped.
  • Batch jobs that stopped gracefully due to a system error. The status of these batch jobs is Stopped.
  • Batch jobs that stopped unexpectedly due to system failure (crash). The status of these batch jobs depends on the state the task was in when the failure occurred:
    • If the job was in staging, the status is Pending.
    • If the job was being processed, the status is In Progress.
    • If the job was in the process of being stopped but had not yet stopped, the status is Stopping.

To restart a stopped task, update the task status to Pending by using either of the following methods:

  • Sending an updateTask XML transaction request to InfoSphere® MDM with a start task action code.
  • Running the runbatch.sh -start <processId> command.

The batch processor then picks up the pending request and attempts to restart the job from the point where it stopped.
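
The command-line route can be wrapped in a small helper, for example in an operations script. The following is a minimal sketch only: the ./runbatch.sh path and the function names are assumptions, and only the -start and -stop options that are named in this topic are used.

```python
import subprocess


def build_runbatch_command(action, process_id, script="./runbatch.sh"):
    """Build the argument list for runbatch.sh.

    The script path is an assumption; adjust it to your installation.
    Only the -start and -stop options mentioned in this topic are accepted.
    """
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action!r}")
    return [script, f"-{action}", str(process_id)]


def restart_job(process_id, script="./runbatch.sh"):
    """Run runbatch.sh -start <processId>; raises if the exit code is non-zero."""
    command = build_runbatch_command("start", process_id, script)
    return subprocess.run(command, check=True)
```

Building the argument list separately from running it makes the command easy to log or review before it is executed.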

The batch processor is able to restart a batch job at the correct place by using three detailed status files that are kept for each task: a Stage file, a Result file, and a Restart file.

Stage file
When the batch processor starts a new batch job, it creates a Stage file. The name of the Stage file is based on the process ID, task name, and task ID. For example, if the process ID is 15787858, the task name is Persist Entities, and the task ID is 680132805003874901, then the Stage file name is 15787858_Persist Entities_680132805003874901_stage.
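
The naming pattern can be expressed as a short helper. This is an illustrative sketch that is inferred from the example name above; the function name is hypothetical and is not part of the product.

```python
def status_file_name(process_id, task_name, task_id, kind):
    """Compose a status file name from the process ID, task name, and task ID.

    kind is one of the three suffixes used in this topic:
    "stage", "result", or "restart".
    """
    if kind not in ("stage", "result", "restart"):
        raise ValueError(f"unknown status file kind: {kind!r}")
    return f"{process_id}_{task_name}_{task_id}_{kind}"


name = status_file_name(15787858, "Persist Entities", 680132805003874901, "stage")
```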

The first line of the Stage file is a title. Each line after the title stores a message ID and record. The message ID is a unique, sequential number generated by the batch processor at runtime to identify each record involved in the job. The types of information for each record in the Stage file depend on what is defined in the METADATA_KEY of the CDMETADATAINFOTP code table for the task.

For example:

MessageID,ENTITY_ID,ENTITY_TYPE
1,100000000000000001,mdmper
2,100000000000000002,mdmper
3,100000000000000003,mdmper
4,100000000000000004,mdmper
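
A Stage file in this layout can be read with a few lines of code, for example when you inspect a job by hand. The following sketch assumes only what is described above: a title line, followed by lines that start with a message ID and a comma. Because a record can itself contain commas, each line is split at the first comma only. The function name is hypothetical.

```python
def parse_stage_lines(lines):
    """Return (title, records) where records maps message ID to record text."""
    cleaned = [line.rstrip("\r\n") for line in lines]
    cleaned = [line for line in cleaned if line]  # ignore blank lines
    if not cleaned:
        return None, {}
    title, body = cleaned[0], cleaned[1:]
    records = {}
    for line in body:
        # Split at the first comma only; the rest of the line is the record.
        message_id, record = line.split(",", 1)
        records[int(message_id)] = record
    return title, records


title, records = parse_stage_lines([
    "MessageID,ENTITY_ID,ENTITY_TYPE\n",
    "1,100000000000000001,mdmper\n",
    "2,100000000000000002,mdmper\n",
])
```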

Alternate example:

MessageID,NO_TITLE_LINEENTITY_ID
1,<?xml version="1.0" encoding="UTF-8"?><TCRMService ...
2,<?xml version="1.0" encoding="UTF-8"?><TCRMService ...
3,<?xml version="1.0" encoding="UTF-8"?><TCRMService ...
4,<?xml version="1.0" encoding="UTF-8"?><TCRMService ...

Result file
The batch processor records the results of each job in a Result file. Like the Stage file, the Result file has a name that is based on the process ID, task name, and task ID, for example 15787858_Persist Entities_680132805003874901_result.

The Result file stores the unique message ID of each record in the batch job along with a result category to represent the outcome of the processing for that record:

  • F represents a failed outcome.
  • S represents a successful outcome.

Each line in the Result file represents a different record. For example:

1,S
2,F
3,S
4,S
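
To summarize a Result file, you can load it into a mapping from message ID to result category. This is a minimal sketch that assumes the two-column layout shown above; the function name is hypothetical.

```python
from collections import Counter


def parse_result_lines(lines):
    """Return a dict that maps each message ID to its result category (S or F)."""
    outcomes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        message_id, category = line.split(",", 1)
        if category not in ("S", "F"):
            raise ValueError(f"unexpected result category: {category!r}")
        outcomes[int(message_id)] = category
    return outcomes


outcomes = parse_result_lines(["1,S", "2,F", "3,S", "4,S"])
counts = Counter(outcomes.values())  # number of records per category
```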

The batch processor determines whether to mark a processing outcome as a success or a failure by using the result categorizer class that is specified by the resultCategorizer property in the Batch.properties file:

resultCategorizer=com.ibm.mdm.batchframework.message.BatchMessageCategorizer

The BatchMessageCategorizer determines the message outcome based on whether the transaction results in a DWLResponseException message. If so, the outcome is a failure (F); otherwise, the outcome is a success (S).
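
The decision rule can be pictured as follows. This sketch is a Python analogue for illustration only and is not the product's Java implementation: ResponseException is a hypothetical stand-in for DWLResponseException, and modeling the failure as a raised exception is an assumption.

```python
class ResponseException(Exception):
    """Hypothetical stand-in for DWLResponseException."""


def categorize_by_exception(submit, record):
    """Return "F" if submitting the record raises ResponseException, else "S".

    submit is any callable that sends one record as a transaction.
    """
    try:
        submit(record)
    except ResponseException:
        return "F"
    return "S"
```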

Tip: If the behavior of the default BatchMessageCategorizer class is not appropriate for your implementation, you can use the ResultCodeMessageCategorizer class instead. Change the resultCategorizer property as follows:
resultCategorizer=com.ibm.mdm.batchframework.bulkprocessing.restart.ResultCodeMessageCategorizer

The ResultCodeMessageCategorizer determines the message outcome based on the value of the <ResultCode> tag from its response output. If the value is SUCCESS, then the outcome is S; otherwise, the outcome is F.
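
The same rule can be sketched in a few lines. This is an illustration of the logic only, not the product class: the function name is hypothetical, the response is assumed to be well-formed XML without namespaces, and the first <ResultCode> element that is found is used.

```python
import xml.etree.ElementTree as ET


def categorize_by_result_code(response_xml):
    """Return "S" if the first <ResultCode> element contains SUCCESS, else "F"."""
    root = ET.fromstring(response_xml)
    node = next(root.iter("ResultCode"), None)
    code = (node.text or "").strip() if node is not None else ""
    return "S" if code == "SUCCESS" else "F"
```

In this sketch, a response that has no <ResultCode> element is categorized as a failure, which follows from the "otherwise" branch of the rule.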

Restart file
When the batch processor restarts a batch job, it creates a Restart file by comparing the Stage file to the Result file and determining the list of remaining, unprocessed records. Similar to the Stage and Result files, the Restart file name is based on the process ID, task name, and task ID, such as 15787858_Persist Entities_680132805003874901_restart.

The Restart file has the same format as the Stage file and contains a subset of its records: the entities in the Stage file, minus a subset of the entities that are recorded in the Result file.

The batch processor uses the Restart file as an input file to process the remaining entities in the restarted batch job.
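
The comparison of the Stage file with the Result file can be sketched as a set difference on message IDs. This is an illustration under stated assumptions, not the product's algorithm: the function name is hypothetical, and this topic does not state which result categories count as processed, so the done parameter makes that choice explicit. By default, any record that has an S or F entry is treated as processed.

```python
def build_restart_lines(stage_lines, result_lines, done=("S", "F")):
    """Return the Restart file lines: the title plus the unprocessed records.

    done lists the result categories that count as processed (an assumption).
    """
    stage = [line.rstrip("\r\n") for line in stage_lines if line.strip()]
    if not stage:
        return []
    title, records = stage[0], stage[1:]

    # Collect the message IDs that already have a qualifying result.
    processed = set()
    for line in result_lines:
        line = line.strip()
        if not line:
            continue
        message_id, category = line.split(",", 1)
        if category in done:
            processed.add(message_id)

    # Keep the Stage records whose message ID has no qualifying result.
    remaining = [r for r in records if r.split(",", 1)[0] not in processed]
    return [title] + remaining


stage = [
    "MessageID,ENTITY_ID,ENTITY_TYPE",
    "1,100000000000000001,mdmper",
    "2,100000000000000002,mdmper",
    "3,100000000000000003,mdmper",
    "4,100000000000000004,mdmper",
]
restart = build_restart_lines(stage, ["1,S", "2,F"])
```

With these inputs, the restart list holds the title line and the records with message IDs 3 and 4. Passing done=("S",) would also keep record 2 for reprocessing.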