How SuperC partitions and processes large files

SuperC places no limit on the size of the files it processes, whether measured in lines, words, or bytes. Internally, however, each work area storage structure has a maximum field size (for example, array size and precision of variables). SuperC therefore performs the overall comparison by breaking very large files into smaller comparison partitions and combining the intermediate results into one overall result. The partitioning has to be done carefully: if a break were simply placed wherever an arbitrary limit was reached, the results on either side of the break point could be affected. Instead, each break point is chosen heuristically, based on the comparison results of the preceding intermediate pass.

A fixed partition size of 32K lines, words, or bytes was selected on the basis of test studies. The compare process runs up to this limit and then iteratively adjusts the ending break point of the pass by an adaptive method. The next pass continues from the adjusted end point, which might even be moved back to records that had already been processed. The objective is to give the next pass the best compare set for the records that remain unprocessed.

The overall process ends when both files reach end-of-file during a pass. The results of the intermediate passes are then combined into a single end result for the user. In most large compares, nothing in the output suggests that the files were partitioned and recombined.
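
The pass-and-recombine cycle described above can be sketched as follows. This is only an illustration under stated assumptions, not SuperC's actual algorithm: it uses Python's difflib as a stand-in comparison engine, reads "32K" as 32 * 1024, and invents a simple break-point heuristic (back up to the end of the last matching run). The names compare_pass and compare_partitioned are hypothetical.

```python
from difflib import SequenceMatcher

PARTITION_LIMIT = 32 * 1024  # assumed reading of "32K"; the unit here is lines


def compare_pass(old, new, old_start, new_start, limit):
    """Compare one partition; return (opcodes, next_old_start, next_new_start).

    Opcodes use difflib's (tag, i1, i2, j1, j2) form with file-wide indices.
    """
    old_win = old[old_start:old_start + limit]
    new_win = new[new_start:new_start + limit]
    at_eof = (old_start + len(old_win) == len(old)
              and new_start + len(new_win) == len(new))

    opcodes = SequenceMatcher(None, old_win, new_win, autojunk=False).get_opcodes()

    if not at_eof:
        # Hypothetical break-point heuristic: differences at the tail of the
        # window may be artifacts of the cut, so back up to the end of the
        # last matching run and let the next pass reprocess what follows.
        equal_positions = [k for k, op in enumerate(opcodes) if op[0] == "equal"]
        if equal_positions:
            opcodes = opcodes[:equal_positions[-1] + 1]

    # Shift window-relative indices to file-wide ones.
    shifted = [(tag, i1 + old_start, i2 + old_start, j1 + new_start, j2 + new_start)
               for tag, i1, i2, j1, j2 in opcodes]
    next_old = shifted[-1][2] if shifted else old_start + len(old_win)
    next_new = shifted[-1][4] if shifted else new_start + len(new_win)
    return shifted, next_old, next_new


def compare_partitioned(old, new, limit=PARTITION_LIMIT):
    """Run passes until both inputs reach end-of-file, merging results as we go."""
    combined = []
    old_pos = new_pos = 0
    while old_pos < len(old) or new_pos < len(new):
        opcodes, old_pos, new_pos = compare_pass(old, new, old_pos, new_pos, limit)
        for op in opcodes:
            # Rejoin a run that was split only because a pass ended inside it.
            if combined and combined[-1][0] == op[0]:
                tag, i1, _, j1, _ = combined[-1]
                combined[-1] = (tag, i1, op[2], j1, op[4])
            else:
                combined.append(op)
    return combined


old_lines = list("abcdef")
new_lines = list("abXYcdef")
print(compare_partitioned(old_lines, new_lines, limit=4))
```

With the artificially small limit of 4, the first pass sees "abcd" against "abXY" and would report a replacement at its tail. Backing the break point up to the end of the matching "ab" lets the second pass recognize "XY" as an insertion, so the combined result is the same as an unpartitioned compare of these two inputs.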

The unlimited file size solution may at first appear unnecessary for Line compare, which uses a virtual address space that is nearly unlimited. Yet there usually has to be some limit, even if it is a high value: programs need to store data with predetermined precision limits, and they work better with limits that are reasonable. Word compare and Byte compare, moreover, needed a partitioning limit because the number of words and bytes becomes large even for small files.
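
A rough calculation shows how quickly word and byte counts pass the partition size. The sketch below assumes that "32K" means 32 * 1024 units and that words are separated by white space. The function name min_passes is hypothetical, and the figures are lower bounds, because an adaptive break point that backs up can add passes.

```python
PARTITION_LIMIT = 32 * 1024  # assumed reading of "32K"


def min_passes(text: bytes, limit: int = PARTITION_LIMIT) -> dict:
    """Return the minimum number of passes per compare type for one file."""
    counts = {
        "lines": len(text.splitlines()),
        "words": len(text.split()),  # assumption: white-space-delimited words
        "bytes": len(text),
    }
    # Ceiling division; even an empty file takes one pass.
    return {unit: max(1, -(-n // limit)) for unit, n in counts.items()}


# 5000 lines of 44 bytes each: 5000 lines, 45000 words, 220000 bytes.
sample = b"the quick brown fox jumps over the lazy dog\n" * 5000
print(min_passes(sample))
```

For this 5000-line sample, a Line compare fits in a single partition, while a Word compare needs at least two passes and a Byte compare at least seven.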

Because of this partitioning process, comparisons of large files may take a long time.