Partial name preprocessing for IBM InfoSphere Global Name Management

Product Documentation

Abstract

To speed up name processing, you can preprocess only the new entries when they
are added to a name list that was already preprocessed.

Content

Parallel name preprocessing scenario

Preprocessing additional name data

The pnpp.java application
Speeding up preprocessing for existing name lists

Solution

Partial name preprocessing is an adjunct to parallel name preprocessing. It avoids repeating name analysis and streamlines preprocessing for new name entries. In partial name preprocessing, you run the pnpp.java application with unique name identification (phase two of name preprocessing) disabled. Disabling phase two preserves the results from phase one, so the original names do not need to be analyzed a second time.

The scenario

Partial name preprocessing involves the following steps.

When new name entries are added to the list, each entry is run through the first phase of name preprocessing and then appended to the original list of names that were preserved from phase one.
The pnpp.java application creates new name lists by combining the original name list with any additional name entries.
Phase two is then run on the new, complete list to identify any unique names.
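
The flow above can be pictured with a toy shell sketch. It is only an illustration: phase_one and phase_two are hypothetical stand-ins (uppercasing each name and removing duplicates), not the analysis that the pnpp.java application performs, and the file names are placeholders.

```shell
# Toy model of partial name preprocessing. phase_one and phase_two are
# hypothetical stand-ins, NOT the real pnpp.java analysis.

# Stand-in for phase one: per-name analysis (here, uppercase each name).
phase_one() { tr '[:lower:]' '[:upper:]'; }

# Stand-in for phase two: unique name identification on the complete list.
phase_two() { LC_ALL=C sort -u; }

# partial_preprocess PRESERVED_RESULTS NEW_NAMES
# Runs phase one on the new names only, appends the result to the preserved
# phase-one results, and then runs phase two on the combined list.
partial_preprocess() {
    preserved=$1
    new_names=$2
    phase_one < "$new_names" >> "$preserved"
    phase_two < "$preserved"
}
```

The point of the sketch is that only the new names pass through phase_one; the preserved results are read again only by phase_two.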

The pnpp.java application

You configure and run the pnpp.java application to preprocess your name data.

Extract the contents of the following file to your computer to begin using the pnpp.java application.

pnpp.zip

Speeding up name preprocessing for existing name lists

In the following scenario, you complete partial name preprocessing as a supplement to parallel name preprocessing. New names are preprocessed and then added to the existing list of names without preprocessing the original name entries a second time.

Before you begin

Complete parallel name preprocessing on your name lists. See the parallel name preprocessing scenario that is linked to at the beginning of this document for more information.

Extract the contents of the pnpp.zip file to your computer.
Open the pnpp.1.config file in a text editor and enable the name analysis parameters that you want to run as part of the preprocessing operation. In the following code sample, categorization, transliteration, parsing, classification, and invalid character cleanup are enabled.

...
doCategorize=true
doRegularize=false
doTransliterate=true
doParse=true
doClassify=true
doNhClean=true
doFullName=false
parseThreshold=0.5
...
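
Before you run the application, it can help to confirm which parameters are enabled. The following is a minimal sketch that assumes the configuration file uses plain key=value lines, as in the sample above. The list_enabled function name is hypothetical and is not part of pnpp.

```shell
# list_enabled CONFIG_FILE
# Prints the name of every parameter whose value is exactly "true".
# Assumes plain key=value lines; anything else is ignored.
list_enabled() {
    sed -n 's/^[[:space:]]*\([A-Za-z][A-Za-z0-9]*\)[[:space:]]*=[[:space:]]*true[[:space:]]*$/\1/p' "$1"
}
```

For example, list_enabled pnpp.1.config prints one enabled parameter name per line.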

Ensure that all file paths and directories in the pnpp.1.config file match your computer environment, and modify any path that does not. In the following example, the latinTransRule.ibm transliteration file is included so that you can identify and preprocess personal names like Linda Smith and business names like Linda Smith Architecture.

...
ndaDir=/gnr/data
sifterRulesFile=/gnr/data/SifterRules.ibm
maxGnCacheSize=4000000
maxOnCacheSize=0
maxSnCacheSize=4000000
latinTransFile=/gnr/data/latinTransRule.ibm
angloRegFile=angloRegRule.ibm
genericOnRegFile=genericOnRegRule.ibm
...
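
A quick way to catch paths that do not match your environment is to test each one. The following is a minimal sketch that assumes key=value lines and UNIX-style absolute paths that begin with a slash. Relative values and Windows paths are not checked, and the missing_paths function name is hypothetical.

```shell
# missing_paths CONFIG_FILE
# Prints "key=path" for every parameter whose value is an absolute path
# that does not exist on this computer. Relative values are skipped.
missing_paths() {
    { grep -E '^[A-Za-z][A-Za-z0-9]*=/' "$1" || true; } |
    while IFS='=' read -r key path; do
        [ -e "$path" ] || printf '%s=%s\n' "$key" "$path"
    done
}
```

If missing_paths pnpp.1.config prints nothing, every absolute path in the file exists.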

Open a command prompt and run the following commands to compile the pnpp.java application, where extract_location is the location where you extracted the contents of the pnpp.zip file. An Ant script, build.xml, is included to build a .jar file from the pnpp.java file.

cd extract_location/pnpp
ant

Use the command line options to modify default parameters. The following command changes the file name for the configuration template files and the name of the output files.

java -jar pnpp.jar -config=my_pnpp_config -output=my_output_files

Run partial name preprocessing for all new names that are added to the existing list of names. In each of the following steps, run the specified command to complete the appropriate action.
- Analyze the original batch of names. The command reads the names from a .csv file called BigList.csv and writes the output files under the name BigList.
  
  java -jar pnpp.jar -phase2 -output=BigList < BigList.csv
- Split the original list of names (BigList.csv) into four sub lists. Using sub lists can improve performance, but using more sub lists than available CPU cores can negatively affect performance.
  
  java -jar pnpp.jar -phase1 -sublists=4 -output=BigList
- Save each new name entry that is added to the list to a .csv file called SmallList.csv, and then analyze the new entries.
  
  java -jar pnpp.jar -phase2 -output=SmallList < SmallList.csv
- Combine the small name list with the big name list.
  
  cat SmallList.npp >> BigList.npp
- Split the combined list into four sub lists.
  
  java -jar pnpp.jar -phase1 -sublists=4 -output=BigList
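
Two of the steps above lend themselves to small safeguards: keeping the number of sub lists at or below the number of CPU cores, and appending the small list without fusing two entries onto one line. The following is a sketch under stated assumptions. The function names are hypothetical, getconf _NPROCESSORS_ONLN is not available on every platform (hence the fallback to 1), and the .npp files are assumed to be line-oriented text, as the cat command implies.

```shell
# detect_cores
# Prints the number of online CPU cores, or 1 if it cannot be determined.
detect_cores() {
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cores=1
    case $cores in
        ''|*[!0-9]*) cores=1 ;;   # empty or non-numeric answer
    esac
    echo "$cores"
}

# cap_sublists REQUESTED CORES
# Prints REQUESTED, reduced to CORES when it exceeds the core count.
cap_sublists() {
    if [ "$1" -gt "$2" ]; then echo "$2"; else echo "$1"; fi
}

# append_list SMALL BIG
# Appends SMALL to BIG. If BIG is non-empty and its last line is not
# terminated, a newline is added first so that the last old entry and the
# first new entry stay on separate lines.
append_list() {
    if [ -s "$2" ] && [ -n "$(tail -c 1 "$2")" ]; then
        echo >> "$2"
    fi
    cat "$1" >> "$2"
}
```

For example, cap_sublists 4 "$(detect_cores)" prints a value that can be passed to the -sublists option, and append_list SmallList.npp BigList.npp can stand in for the cat command in the combine step.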

Results

If no analysis errors are returned, multiple output files contain the preprocessed names.
