Partial name preprocessing for IBM InfoSphere Global Name Management

Product documentation


To speed up name processing, you can preprocess only the new entries when they
are added to a name list that was already preprocessed.


Parallel name preprocessing scenario


Running partial name preprocessing as an adjunct to parallel name preprocessing avoids repeating name analysis and streamlines preprocessing for new name entries. During partial name preprocessing, unique name identification (phase two of name preprocessing) is disabled in the application. Disabling phase two preserves the results from phase one, so the original names are not analyzed a second time unnecessarily.

The scenario

Partial name preprocessing involves the following steps.

  1. When new name entries are added to the list, each new entry is run through the first phase of name preprocessing and then appended to the phase-one results that were preserved for the original list of names.
  2. The application creates new name lists by combining the original name list with any additional name entries.
  3. Phase two is then run on the new, complete list to identify any unique names.
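
The three steps above can be illustrated with a toy model. The following sketch is not the application and does not call it: phase_one and phase_two are hypothetical stand-ins (upper-casing with space squeezing, and sort -u) that were chosen only to show why per-entry results can be preserved while the whole-list step is rerun. It assumes a POSIX shell with mktemp available.

```shell
#!/bin/sh
# Toy model of partial preprocessing; NOT the pnpp application.
# phase_one: per-entry analysis (stand-in: upper-case and squeeze spaces).
# phase_two: whole-list step (stand-in: keep only unique lines).
phase_one() { tr '[:lower:]' '[:upper:]' | tr -s ' '; }
phase_two() { sort -u; }

workdir=$(mktemp -d)

# Original list: phase one runs once and its results are preserved.
printf 'linda smith\nJohn  Doe\n' | phase_one > "$workdir/big.p1"

# Step 1: only the new entries go through phase one...
printf 'LINDA SMITH\nmaria garcia\n' | phase_one > "$workdir/small.p1"

# Step 2: ...and are appended to the preserved phase-one results.
cat "$workdir/small.p1" >> "$workdir/big.p1"

# Step 3: phase two reruns on the new, complete list.
phase_two < "$workdir/big.p1" > "$workdir/unique.txt"
cat "$workdir/unique.txt"
```

In this model the original entries pass through phase_one only once, and the duplicate LINDA SMITH entry is removed when phase_two runs on the combined list.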

The application

You configure and run the application to preprocess your name data.

Extract the contents of the following file to your computer to begin using the application.

Speeding up name preprocessing for existing name lists

In the following scenario, you complete partial name preprocessing as a supplement to parallel name preprocessing. New names are preprocessed and then added to the existing list of names without preprocessing the original name entries a second time.

Before you begin

Complete parallel name preprocessing on your name lists. See the parallel name preprocessing scenario that is linked to at the beginning of this document for more information.

  1. Extract the contents of the folder to your computer.
  2. Open the pnpp.1.config file in a text editor and enable the name analysis parameters that you want to run as part of the preprocessing operation. In the following code sample, categorization, transliteration, parsing, classification, and invalid character cleanup are enabled.


  3. Ensure that all file directories in the pnpp.1.config file match your computer environment. Modify any file paths to the appropriate directory as necessary. In the following example, the transliteration file is included so that you can identify and preprocess personal names like Linda Smith and business names like Linda Smith Architecture.


  4. Open a command prompt and run the following command to compile the application, where extract_location is the location where you extracted the contents of the file. An ant script, build.xml, is included to build a .jar file from the file.

    cd extract_location/pnpp

  5. Use the command line options to modify default parameters. The following command changes the file name for the configuration template files and the name of the output files.

    java -jar pnpp.jar -config=my_pnpp_config -output=my_output_files

  6. Run partial name preprocessing for all new names that are added to the existing list of names. In each of the following steps, run the specified command to complete the appropriate action.
    • Analyze the original batch of names. The names are read from a .csv file called BigList.csv, and the -output option sets the name of the output files to BigList.

      java -jar pnpp.jar -phase2 -output=BigList < BigList.csv

    • Split the original list of names (BigList.csv) into four sub lists. Using sub lists can improve performance, but using more sub lists than available CPU cores can negatively affect performance.

      java -jar pnpp.jar -phase1 -sublists=4 -output=BigList

    • Save each new name entry that is added to the list to a .csv file called SmallList.csv, and then analyze that file.

      java -jar pnpp.jar -phase2 -output=SmallList < SmallList.csv

    • Combine the small name list with the big name list.

      cat SmallList.npp >> BigList.npp

    • Split the combined list into four sub lists.

      java -jar pnpp.jar -phase1 -sublists=4 -output=BigList


If no analysis errors are returned, multiple output files contain the preprocessed names.
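
The combine and split steps can be guarded with a few small shell helpers. The following is a minimal sketch and is not part of the application: the function names cpu_cores, capped_sublists, and append_with_backup are hypothetical. The sketch assumes a POSIX shell, and it assumes that getconf _NPROCESSORS_ONLN reports the online core count on your platform (it falls back to 1 otherwise). It caps the sub list count at the number of CPU cores and keeps a backup copy of the big list so that an accidental second append can be undone.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from the pnpp application.

# Print the number of online CPU cores, or 1 if it cannot be determined.
cpu_cores() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
    case "$n" in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric answer
    esac
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

# Print the requested sub list count, capped at the core count.
capped_sublists() {
    requested=$1
    cores=$(cpu_cores)
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Append the small list to the big list, keeping a backup of the big list.
append_with_backup() {
    small=$1
    big=$2
    if [ ! -s "$small" ]; then
        echo "nothing to append: $small is missing or empty" >&2
        return 1
    fi
    cp "$big" "$big.bak" || return 1
    cat "$small" >> "$big"
}
```

For example, append_with_backup SmallList.npp BigList.npp replaces the plain cat step, and the value that capped_sublists 4 prints can be passed to the -sublists option of the split command.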

Document information

More support for:

InfoSphere Global Name Management
InfoSphere Global Name Recognition

Software version:

4.1, 4.2

Operating system(s):

AIX, Linux, Solaris, Windows
