IBM InfoSphere Streams Version 4.1.0

Filename ingestion

The filename ingestion component scans one or more directories for input files and feeds the chains with the file names to process.

Most of the functionality can be adapted through configuration parameters. The File Type Validator and the File Sort functions can additionally be customized by implementing an SPL composite operator.

Directory scan

This component periodically scans the directories that contain the input files to process. Scanning starts after the ITE application control has detected that all parts of the application are initialized. The scanner works similarly to the standard DirectoryScan operator in the SPL standard toolkit, but can optionally scan multiple directories in parallel.

One of the directories can be designated as a high priority directory. Files found in this directory are marked as urgent and are processed by the application before non-urgent files. Hidden files, that is, files whose names start with a dot character, are ignored by the scanner. The scanner does not sort the filenames; sorting is implemented in the File Sort component. Other configurable aspects are the scan interval and the filename pattern that decides whether a filename is accepted or ignored.

It is recommended to move files into the input directories instead of copying them. A large file that is copied might still be incomplete when the scanner detects it, so processing could start before the file is completely available.
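The scan rules above can be illustrated with a minimal Python sketch. It is not the toolkit's implementation: the function name scan_once, its parameters, and the (path, urgent) result shape are assumptions made for illustration only. It performs a single scan pass, skips hidden files, applies a filename pattern, and marks files from the high priority directory as urgent without sorting them.

```python
import re
from pathlib import Path


def scan_once(directories, pattern, urgent_directory=None):
    """Scan each directory once and return (path, urgent) pairs.

    Hypothetical helper: names and return shape are illustrative only.
    """
    accepted = re.compile(pattern)
    found = []
    for directory in directories:
        # A file is urgent when it comes from the high priority directory.
        urgent = (urgent_directory is not None
                  and Path(directory) == Path(urgent_directory))
        for entry in Path(directory).iterdir():
            if not entry.is_file():
                continue
            if entry.name.startswith("."):
                continue  # hidden files are ignored
            if not accepted.fullmatch(entry.name):
                continue  # filename pattern decides accept or ignore
            found.append((str(entry), urgent))
    # No sorting here: ordering is left to a separate sort step.
    return found
```

A producer that follows the move recommendation would typically write the file to a staging directory on the same file system and then rename it into the input directory, so that the scanner never sees a partially written file.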

FileType Validator

Use this optional function if you want to process multiple different file formats in the ITE application. For example, if you have input files with CSV encoding and ASN.1 encoding, or ASN.1 input files with two different grammars, you need to configure a list of FileReaders in the ChainProcessor. The list contains one entry for each input file format that you want to parse. The ChainProcessor then needs to know which FileReader from the list to invoke for any given input file. It retrieves the list index from a file type attribute, and it is the responsibility of the File Type Validator to set this attribute to the correct value. For example, the validator could check the filename suffix and set the FileType to 1 for files with a “.csv” suffix and to 2 for files with an “.asn” suffix.

If you use this feature, you must implement the mapping to file types in an SPL composite operator. This gives you the freedom to use any algorithm to map a given input file to the FileReader that parses it. You could even inspect the file content to determine which reader to use.

The Validator can also be used to reject files that do not meet certain conditions. For example, you can implement logic to reject files older than 10 days or files bigger than 1 GB, or whatever the use case requires.
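The suffix mapping and the rejection rules described above can be sketched as follows. This is a Python illustration of the logic only, not the SPL composite operator that the framework expects. The function name validate, its parameters, the reject reasons, and the interpretation of 1 GB as 1024^3 bytes are all assumptions.

```python
import time
from pathlib import Path

# Assumed mapping, taken from the example in the text.
SUFFIX_TO_FILE_TYPE = {".csv": 1, ".asn": 2}
MAX_AGE_SECONDS = 10 * 24 * 3600   # "older than 10 days"
MAX_SIZE_BYTES = 1024 ** 3         # "bigger than 1 GB" (assumed binary GB)


def validate(filename, size_bytes, modified_epoch, now=None):
    """Return (file_type, reject_reason); exactly one of them is None."""
    now = time.time() if now is None else now
    file_type = SUFFIX_TO_FILE_TYPE.get(Path(filename).suffix.lower())
    if file_type is None:
        return None, "unknown suffix"
    if now - modified_epoch > MAX_AGE_SECONDS:
        return None, "older than 10 days"
    if size_bytes > MAX_SIZE_BYTES:
        return None, "bigger than 1 GB"
    return file_type, None
```

Returning the reject reason alongside the file type keeps the decision in one place; a real validator might equally set attributes on the tuple that carries the filename.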

Sort function

This optional function sorts the detected files before they are sent to the chains. You can sort by filename, file size, or file time. The file time can also be extracted from the filename instead of using the timestamp provided by the directory scanner. You can configure the sort order to be ascending or descending. You can also customize the Sort function by implementing your own sort logic in an SPL composite operator. The sort function is applied to all files detected by the scanner in one scan interval.
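The sort options above can be sketched in Python as follows. The record layout (dictionaries with name, size, and mtime keys), the function names, and the 14-digit YYYYMMDDhhmmss timestamp convention in the filename are hypothetical; the actual configuration parameters of the framework are not shown here.

```python
import re
from datetime import datetime, timezone


def time_from_filename(name):
    """Extract a YYYYMMDDhhmmss stamp from the name (assumed convention)."""
    match = re.search(r"\d{14}", name)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(), "%Y%m%d%H%M%S")
    return stamp.replace(tzinfo=timezone.utc).timestamp()


def sort_files(files, key="name", descending=False, time_from_name=False):
    """Sort the files found in one scan interval by name, size, or time."""
    def sort_key(item):
        if key == "name":
            return item["name"]
        if key == "size":
            return item["size"]
        if key == "time":
            if time_from_name:
                stamp = time_from_filename(item["name"])
                if stamp is not None:
                    return stamp
            return item["mtime"]  # fall back to the scanner's timestamp
        raise ValueError(f"unsupported sort key: {key}")
    return sorted(files, key=sort_key, reverse=descending)
```

Because the sort covers only the files of one scan interval, a file that arrives in a later interval is never reordered ahead of files that were already sent to the chains.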

Filename deduplication

This optional function ensures that each input file is processed only once. Internally, it keeps a list of all filenames processed so far. If the name of a new input file is already in the list, the file is not processed but is moved to the duplicate directory. Files that have not been processed successfully are not stored in the list, so they can be reprocessed until they succeed. You can also configure the component to allow certain filenames to bypass the deduplication. Only the filenames are remembered, so if you use multiple input directories, files with the same name from different input directories are detected as duplicates. Old entries in the filename history are removed at regular intervals to prevent the list from growing indefinitely. See Checkpointing and Cleanup for details.
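The behavior described above can be sketched as a small in-memory history. This is a Python illustration under stated assumptions: the class FilenameHistory, its method names, the bypass pattern parameter, and the age-based cleanup rule are hypothetical, and the real component persists its state as described in Checkpointing and Cleanup.

```python
import os
import re


class FilenameHistory:
    """Remembers names of successfully processed files (illustrative only)."""

    def __init__(self, bypass_pattern=None):
        self._processed = {}  # filename -> time of successful processing
        self._bypass = re.compile(bypass_pattern) if bypass_pattern else None

    def is_duplicate(self, path):
        # Only the filename counts, not the directory it came from.
        name = os.path.basename(path)
        if self._bypass is not None and self._bypass.fullmatch(name):
            return False  # configured to bypass deduplication
        return name in self._processed

    def record_success(self, path, now):
        # Called only after successful processing, so failed files
        # stay eligible for reprocessing.
        self._processed[os.path.basename(path)] = now

    def cleanup(self, now, max_age):
        # Drop old entries so the history does not grow indefinitely.
        self._processed = {name: t for name, t in self._processed.items()
                           if now - t <= max_age}
```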

FileMove

This function is responsible for moving processed files from the input directory to another directory. The target directory depends on the outcome of the file processing. If the file was processed successfully, it is moved to the archive directory. If errors occurred, it is moved to the invalid directory. If one of the Validators found an error, it is moved to the rejected directory. If the file has been processed before, it is moved to the duplicate directory.
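The outcome-to-directory mapping above can be sketched as follows. The outcome labels, the directory names under a common base directory, and the function move_processed_file are assumptions for illustration; the framework's actual directory parameters are configured elsewhere.

```python
import shutil
from pathlib import Path

# Assumed outcome labels and directory names, mirroring the text.
TARGET_DIRECTORIES = {
    "success": "archive",
    "error": "invalid",
    "rejected": "rejected",
    "duplicate": "duplicate",
}


def move_processed_file(path, outcome, base_dir):
    """Move a processed file to the directory that matches its outcome."""
    target_dir = Path(base_dir) / TARGET_DIRECTORIES[outcome]
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(path).name
    shutil.move(str(path), str(destination))
    return destination
```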

Related links:

  • Reference > Toolkits > Specialized toolkits > com.ibm.streams.teda 1.0.2 > Application framework > Architecture > ITE application > Checkpointing and Cleanup
  • Filename distribution: The File Ingestion component uses several mechanisms to distribute filenames to the chains.