Partial name preprocessing for IBM InfoSphere Global Name Management

Product documentation


Abstract

To speed up name processing, you can preprocess only the new entries when they
are added to a name list that was already preprocessed.

Content

Parallel name preprocessing scenario


Solution

Completing partial name preprocessing as an adjunct to parallel name preprocessing avoids repeating name analysis and streamlines preprocessing for new name entries. During partial name preprocessing, unique name identification (phase two of name preprocessing) is disabled in the pnpp.java application. Disabling phase two preserves the results from phase one, so the original names are not analyzed a second time.

The scenario

Partial name preprocessing involves the following steps.

  1. When new name entries are added to the list, each entry is run through the first phase of name preprocessing and then appended to the original list of names that were preserved from phase one.
  2. The pnpp.java application creates new name lists by combining the original name list with any additional name entries.
  3. Phase two is then run on the new, complete list to identify any unique names.
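The following toy sketch models the flow of these steps so that the order of operations is easier to see. It does not call the pnpp.java application: analyze_names and unique_names are hypothetical stand-ins for phase one and phase two, and the "analysis" here is only trimming and uppercasing each line. The point it illustrates is that only the new entries are analyzed, and the stand-in for phase two then runs on the complete list.

```shell
# Toy model only: these functions are hypothetical stand-ins,
# not the phases that the pnpp.java application implements.

# Stand-in for phase one: analyze each entry independently
# (here: trim surrounding whitespace and uppercase the line).
analyze_names() {
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//' | tr '[:lower:]' '[:upper:]'
}

# Stand-in for phase two: identify the unique names in the complete list.
unique_names() {
    LC_ALL=C sort -u "$1"
}

# $1 = file that contains the new raw names
# $2 = file that holds the preserved phase-one results
# Only the new names are analyzed; the preserved results are reused as-is.
partial_preprocess() {
    analyze_names < "$1" >> "$2"
    unique_names "$2"
}
```

In this model, the file that is passed as the second argument grows with each batch of new names, but the entries that it already contains are never analyzed again.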

The pnpp.java application

You configure and run the pnpp.java application to preprocess your name data.

Extract the contents of the following file to your computer to begin using the pnpp.java application.

pnpp.zip

Speeding up name preprocessing for existing name lists

In the following scenario, you complete partial name preprocessing as a supplement to parallel name preprocessing. New names are preprocessed and then added to the existing list of names without preprocessing the original name entries a second time.

Before you begin

Complete parallel name preprocessing on your name lists. See the parallel name preprocessing scenario that is linked to at the beginning of this document for more information.

  1. Extract the contents of the pnpp.zip file to your computer.
  2. Open the pnpp.1.config file in a text editor and enable the name analysis parameters that you want to run as part of the preprocessing operation. In the following code sample, categorization, transliteration, parsing, classification, and invalid character cleanup are enabled.

    ...
    doCategorize=true
    doRegularize=false
    doTransliterate=true
    doParse=true
    doClassify=true
    doNhClean=true
    doFullName=false
    parseThreshold=0.5
    ...

  3. Ensure that all directories and file paths in the pnpp.1.config file match your computer environment, and modify any that do not. In the following example, the latinTransRule.ibm transliteration file is included so that you can identify and preprocess personal names like Linda Smith and business names like Linda Smith Architecture.

    ...
    ndaDir=/gnr/data
    sifterRulesFile=/gnr/data/SifterRules.ibm
    maxGnCacheSize=4000000
    maxOnCacheSize=0
    maxSnCacheSize=4000000
    latinTransFile=/gnr/data/latinTransRule.ibm
    angloRegFile=angloRegRule.ibm
    genericOnRegFile=genericOnRegRule.ibm
    ...

  4. Open a command prompt and run the following commands to compile the pnpp.java application, where extract_location is the location where you extracted the contents of the pnpp.zip file. An Ant script, build.xml, is included to build a .jar file from the pnpp.java file.

    cd extract_location/pnpp
    ant

  5. Use the command line options to modify default parameters. The following command changes the file name for the configuration template files and the name of the output files.

    java -jar pnpp.jar -config=my_pnpp_config -output=my_output_files

  6. Run partial name preprocessing for all new names that are added to the existing list of names. In each of the following steps, run the specified command to complete the appropriate action.
    • Analyze the original batch of names. The input is read from a .csv file called BigList.csv.

      java -jar pnpp.jar -phase2 -output=BigList < BigList.csv


    • Split the original list of names (BigList.csv) into four sub lists. Using sub lists can improve performance, but using more sub lists than available CPU cores can negatively affect performance.

      java -jar pnpp.jar -phase1 -sublists=4 -output=BigList


    • Save each new name entry that is added to the list to a .csv file called SmallList.csv, and then analyze the entries.

      java -jar pnpp.jar -phase2 -output=SmallList < SmallList.csv


    • Append the analyzed small name list to the analyzed big name list.

      cat SmallList.npp >> BigList.npp


    • Split the combined list into four sub lists.

      java -jar pnpp.jar -phase1 -sublists=4 -output=BigList
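Because using more sub lists than available CPU cores can negatively affect performance, it can help to cap the sub list count before you run the split commands. The following helper is a minimal sketch and is not part of the pnpp.zip file. The function name suggest_sublists is hypothetical, and the sketch assumes that your system provides getconf with the _NPROCESSORS_ONLN variable. If the core count cannot be determined, the sketch falls back to 1.

```shell
# Hypothetical helper, not shipped with pnpp.zip.
# $1 = the sub list count that you would like to use
# Prints the smaller of that count and the number of online CPU cores.
suggest_sublists() {
    wanted=$1
    # Assumption: getconf _NPROCESSORS_ONLN is available on this system.
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cores=1
    # Fall back to 1 if the value is empty or not a plain number.
    case "$cores" in
        ''|*[!0-9]*) cores=1 ;;
    esac
    if [ "$cores" -lt 1 ]; then cores=1; fi
    if [ "$wanted" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$wanted"
    fi
}
```

You can pass the result to the -sublists option of the split commands, for example -sublists="$(suggest_sublists 4)".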


Results

If no analysis errors are returned, multiple output files contain the preprocessed names.
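The step that appends SmallList.npp to BigList.npp produces no output of its own, so you might want a quick check that every new entry arrived in the combined list. The following wrapper is a sketch under one assumption that this document does not confirm: that .npp files are line-oriented text with one record per line. The function name append_and_check is hypothetical.

```shell
# Hypothetical helper. Assumes .npp files hold one record per line.
# $1 = newly analyzed list (for example SmallList.npp)
# $2 = existing analyzed list (for example BigList.npp)
append_and_check() {
    small=$1
    big=$2
    if [ ! -r "$small" ] || [ ! -w "$big" ]; then
        echo "cannot read $small or write to $big" >&2
        return 1
    fi
    # Refuse to append if the last record of the existing list has no
    # trailing newline, because the first new record would be joined to it.
    if [ -s "$big" ] && [ -n "$(tail -c 1 "$big")" ]; then
        echo "$big does not end with a newline" >&2
        return 1
    fi
    before=$(wc -l < "$big" | tr -d '[:space:]')
    added=$(wc -l < "$small" | tr -d '[:space:]')
    cat "$small" >> "$big" || return 1
    after=$(wc -l < "$big" | tr -d '[:space:]')
    if [ "$after" -ne $((before + added)) ]; then
        echo "line count mismatch: expected $((before + added)), found $after" >&2
        return 1
    fi
    echo "appended $added lines; $big now has $after lines"
}
```

For example, append_and_check SmallList.npp BigList.npp performs the same append as the cat command in step 6 and returns a nonzero status if the line counts do not add up.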

Document information


More support for:

InfoSphere Global Name Management
InfoSphere Global Name Recognition

Software version:

4.1, 4.2

Operating system(s):

AIX, Linux, Solaris, Windows, Windows 2003 server

Reference #:

7019348

Modified date:

2013-05-15
