IBM InfoSphere Streams Version 4.1.0

Checkpointing and Cleanup

The ITE application implements a checkpointing mechanism to allow recovery after failures.

Checkpointing

Some components of the ITE application are stateful: they hold data in memory that they need to provide their functionality. These components are the filename deduplication, the record deduplication, and potentially the custom correlation functions that the user implements in the CustomContext component.

The data held in memory includes:
  • The list of already processed filenames, for the filename deduplication
  • The data used by the Bloom filter for record deduplication
  • Data structures used in the custom context to implement correlation functions, for example tables or lists for aggregations

Without additional protection, this data would be lost after a host or application failure. Although the application could be restarted to continue processing files, the results would be incorrect. For example, the record deduplication would not be able to detect duplicates of records that were processed before the failure.

To solve this problem, the ITE application writes checkpoint files after each processed input file. If a failure occurs, the ITE application automatically recovers its internal state from the checkpoint files after the restart. The recovery process may take some time. When you initiate a graceful shutdown of the ITE application with the provided command line tool, some optimizations are applied to reduce the time needed to restore the state after a restart.
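The checkpoint-and-recover cycle described above can be illustrated with a minimal Python sketch. It is not the ITE implementation: the class name FilenameDedup, the JSON file format, and the write-to-temporary-file-then-rename step are all assumptions chosen for illustration. It only shows the general idea of persisting state after each processed file and reloading it on restart.

```python
import json
import os
import tempfile


class FilenameDedup:
    """Hypothetical stand-in for a stateful deduplication component."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.seen = set()
        self.restore()  # recover state left behind by a previous run

    def restore(self):
        """Reload the state from the checkpoint file, if one exists."""
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path, "r", encoding="utf-8") as fh:
                self.seen = set(json.load(fh))

    def checkpoint(self):
        """Write the state to a temporary file, then rename it into place.

        The rename (an assumption of this sketch) avoids leaving a
        half-written checkpoint behind if the process dies mid-write.
        """
        directory = os.path.dirname(self.checkpoint_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(sorted(self.seen), fh)
        os.replace(tmp_path, self.checkpoint_path)

    def process(self, filename):
        """Return True if the file is new; checkpoint after each new file."""
        if filename in self.seen:
            return False
        self.seen.add(filename)
        self.checkpoint()
        return True
```

Creating a second FilenameDedup instance with the same checkpoint path simulates a restart: a file that was processed before the "failure" is still recognized as a duplicate.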

Cleanup

Keeping the state of the deduplication components in memory causes a second problem. The filename deduplication holds a list of already processed filenames. This list can become very large, so old entries must be removed from it periodically. A similar problem occurs in the record deduplication: if old entries are never removed, the error rate of the Bloom filter increases up to a point where the filter becomes useless.

To solve these problems, the ITE application periodically performs a cleanup process that removes old data from the deduplication components. The cleanup process is invoked at a configurable time and interval. By default, it runs every day at midnight. You can also configure how long entries in the deduplication components are retained, for example to keep the filename history for 10 days.
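The retention and scheduling behavior of the cleanup can be sketched as follows. The function names cleanup and next_cleanup, the dictionary layout of the filename history, and the parameter names are hypothetical and are not ITE configuration parameters. The defaults only mirror the values mentioned above (a daily run at midnight, a 10-day retention example).

```python
from datetime import datetime, timedelta


def cleanup(history, now, retention=timedelta(days=10)):
    """Return a copy of history without entries older than the retention period.

    history maps a filename to the datetime at which the file was processed.
    """
    cutoff = now - retention
    return {name: ts for name, ts in history.items() if ts >= cutoff}


def next_cleanup(now, hour=0, minute=0, interval=timedelta(days=1)):
    """Return the next scheduled cleanup time strictly after now."""
    run = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    while run <= now:
        run += interval
    return run
```

For example, with a history that contains one file processed 14 days ago and one processed yesterday, cleanup keeps only the second entry, and next_cleanup returns the following midnight.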

Checkpointing for custom correlation functions

If you implement stateful functionality in the Group component, you can decide whether your function participates in the checkpointing and recovery mechanism. Your composite operator receives commands from the ITE application control that tell it when to read, write, or clear its internal state. If your use case does not require checkpointing and recovery in this component, you can simply ignore these commands.
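The command-driven pattern can be pictured with the following Python sketch. It is only a conceptual model: a real implementation is a Streams composite operator, and the command strings "read", "write", and "clear", the class CorrelationState, and the in-memory saved attribute (standing in for a checkpoint file) are assumptions made for illustration, not the actual ITE control interface.

```python
class CorrelationState:
    """Hypothetical stateful correlation function with a per-key counter table."""

    def __init__(self):
        self.table = {}
        self.saved = None  # stands in for a checkpoint file in this sketch

    def add(self, key):
        """Update the aggregation table for one incoming record."""
        self.table[key] = self.table.get(key, 0) + 1

    def on_command(self, command):
        """React to a control command; unknown commands are ignored."""
        if command == "write":
            self.saved = dict(self.table)        # persist the current state
        elif command == "read":
            self.table = dict(self.saved or {})  # restore the persisted state
        elif command == "clear":
            self.table = {}                      # drop the in-memory state
```

A component that does not need checkpointing would correspond to an on_command method that does nothing.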