IBM Streams 4.2

Batching in a sink operator

Batching is used in a sink operator to transmit a batch of information that represents multiple tuples to the external system in a single operation.

When batching tuples, an implementation choice here is to either manage a batch of tuple objects, or manage the batch in the external system's Java™ API.

Batching using Tuple objects

When batching tuples, the Operator.process() adds the incoming tuple object to a set of tuples. A separate batch processing method receives the set of tuples and interacts with the external system's Java API to translate and transmit the batch as efficiently as possible. The Java Operator sample class com.ibm.streams.operator.samples.patterns.TupleConsumer provides generic batching function that is based on a set of tuples.

Batching using the external system's Java API

For batching using the external system's Java API, the Operator.process() method translates the tuple object, and then adds it to any batching mechanism provided by the API. Then, the batch processing method would transmit the batch to the external system. For example, with JDBC, the PreparedStatement API information from the tuples would be translated into parameters and added using addBatch(), and then run later using executeBatch().

Batching considerations

When batching tuples using either approach, there are a number of items to be considered:
  • What defines a complete batch, for example, number of tuples, elapsed time, or memory consumption of the batch.
  • How to handle final punctuation, which indicates no more tuples arrive and thus the processing of the operator can complete. Typically, the batch is processed even though it might be incomplete according to the definition of a complete batch.
  • How to handle delays in tuple arrivals. In any stream, there might be delays in tuple arrivals, which would delay the transmission of the batch to the external system. This arrival delay might cause delays in responses as seen by the external system. For example, a dashboard might not be updated. Typically, this problem is solved by having a timeout where the incomplete batch is transmitted to the external system if there is no incoming tuple for some time.