IBM InfoSphere Streams Version 4.1.0

Developing streams processing applications with consistent regions

Because of business requirements, some applications require that all tuples in an application are processed at least once. You can use a consistent region in your streams processing applications to avoid data loss due to software or hardware failure and meet your requirements for at-least-once processing.

A consistent region is a subgraph where the states of the operators become consistent by processing all the tuples and punctuation marks within defined points on a stream. This enables tuples within the subgraph to be processed at least once. The consistent region is periodically drained of its current tuples. All tuples in the consistent region are processed through to the end of the subgraph. In-memory state of operators are automatically serialized and stored on checkpoint for each of the operators in the region.

If any operator in a consistent region fails at run time, InfoSphere® Streams detects the failure and triggers the restart of the operator and the reset of the region. In-memory state of operators are automatically reloaded, and deserialized on reset of the operators.

The capability to drain the subgraph, which is coupled with start operators that can replay their output streams, enables a consistent region to achieve at-least-once processing.

A stream processing application can be defined with zero, one, or more consistent regions. You can define the start of a consistent region with the @consistent annotation on an operator. InfoSphere Streams then determines the scope of the consistent region automatically, but you can reduce the scope of the region with the @autonomous annotation. Your application must include one new JobControlPlane operator that coordinates the draining and resetting of the operators in the consistent regions.

When a subgraph is a consistent region, InfoSphere Streams enables the operators in that region to drain and reset. When a region is draining, it establishes logical divisions in the output streams of each operator in the region. A drain is successful when all operators in the region establish a logical division in their output streams, and when all tuples before the logical division are processed in their input streams. If a drain is successful, it means that all operators in the region consumed all input streams up until the established logical division. While the region is draining or resetting, operators in the region that completed their draining or resetting cannot submit new tuples. This behavior means that the tuple flow within the subgraph briefly stops while the region is draining or resetting.

Operators in a consistent region do not process or submit final punctuation markers. When final markers are received or submitted, that punctuation is silently dropped by the InfoSphere Streams instance.

Note: Applications with consistent regions must use the TCP transport.
Note: Processing Elements with operators that reside in a consistent region startup with a Java virtual machine. As a result, applications with consistent regions are expected to consume more memory than applications without consistent regions.