Parallel name preprocessing scenario for IBM InfoSphere Global Name Recognition
Name Preprocessor converts a comma-delimited file of name records into the files that you use in name searching. In this scenario, you analyze a list of names in parallel and then identify the unique names in the list.
Preprocessing name data in parallel
Using Name Preprocessor to preprocess names for searching can be time consuming. Because Name Preprocessor is a single-threaded application, it does not attempt any parallelization, and a run occupies most of the processing capacity of a single CPU. IBM InfoSphere Global Name Recognition products first analyze each name by performing transliteration, categorization, parsing, and cultural classification. Name Preprocessor then identifies unique names and creates data files, composed of data tables, that the search engine uses when searching a list of names.
The pnpp.java application can run multiple instances of Name Preprocessor at the same time and treats name analysis and unique name identification as separate phases. In the first phase, multiple instances of Name Preprocessor run concurrently, and each instance analyzes a portion of the names. The application then combines the outputs of the first phase as the input for the second phase. By running multiple Name Preprocessor processes in parallel on multiple CPU cores, you can increase the overall throughput of name preprocessing.
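Because the benefit of parallel processing depends on the number of CPU cores, it can help to check that number before you choose how many processes to run. The following is a minimal sketch, not part of pnpp.java: the class name ProcessCount and the method cap are hypothetical, and the requested value of 4 is taken from the documented default number of analysis processes.

```java
// Hypothetical helper for illustration only; not part of pnpp.java.
class ProcessCount {

    // Limits the requested number of processes to the number of available
    // cores, and never returns less than 1.
    static int cap(int requested, int availableCores) {
        return Math.max(1, Math.min(requested, availableCores));
    }

    public static void main(String[] args) {
        // Number of processors that the Java virtual machine can use.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores=" + cores + ", processes=" + cap(4, cores));
    }
}
```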
You want to use your organization's intranet to search a list of customer name records. However, the list contains thousands of personal names from various cultures, some of which have similar spellings, and often refer to the same individual. In addition, hundreds of business names are included in the list, and many of these names contain personal names. For example, a business name like Linda Smith Architecture might be listed separately from the entry Linda Smith, although these entries refer to the same individual. To enhance your search results, you must preprocess your name data before making it available to your search engine.
You can run the pnpp.java application to perform name analysis in preparation for the searches. In addition, the application performs unique name identification, which is useful when you search very large name lists where many common names can occur hundreds or thousands of times. The pnpp.java application completes these processes concurrently and logs any errors that occur.
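The idea behind unique name identification can be illustrated with a short sketch. It does not show how Name Preprocessor is implemented: the class UniqueNames and its normalization rule (trim white space and ignore case) are assumptions made only for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of pnpp.java or Name Preprocessor.
class UniqueNames {

    // Collapses repeated names into one entry each and counts the occurrences.
    // A LinkedHashMap keeps the entries in the order in which they first appear.
    static Map<String, Integer> count(List<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : names) {
            // Assumed normalization: trim white space and ignore case.
            String key = raw.trim().toUpperCase(Locale.ROOT);
            if (key.isEmpty()) {
                continue; // skip blank records
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "Linda Smith", "linda smith ", "Linda Smith Architecture");
        // Prints {LINDA SMITH=2, LINDA SMITH ARCHITECTURE=1}
        System.out.println(count(names));
    }
}
```

In a list where common names occur hundreds or thousands of times, a step like this reduces the number of entries that a search has to consider.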
The pnpp.java application
You configure and run the pnpp.java application to preprocess your name data.
Extract the contents of the following file to your computer to begin using the pnpp.java application.
Running the pnpp.java application
Before you begin
The Java Runtime Environment (JRE), Version 1.5 or later, is required to run the pnpp.java application. Because you compile the application before you run it, ensure that Java Development Kit (JDK) 5 or later is installed on the computer where you want to run the pnpp.java application.
- Extract the contents of the pnpp.zip file to your computer.
- Open the pnpp.1.config file in a text editor and enable the name analysis parameters that you want to run as part of the preprocessing operation. In the following code sample, categorization, transliteration, parsing, classification, and invalid character cleanup are enabled.
- Ensure that all file directories in the pnpp.1.config file match your computer environment. Modify any file paths to the appropriate directory as necessary. In the following example, the latinTransRule.ibm transliteration file is included so that you can identify and preprocess personal names like Linda Smith and business names like Linda Smith Architecture.
- Open a command prompt and run the following command to compile the pnpp.java application, where extract_location is the location where you extracted the contents of the pnpp.zip file. An ant script, build.xml, is included to build a .jar file from the pnpp.java file.
- Use the command line options to modify default parameters. The following command changes the file name for the configuration template files and the name of the output files.
java pnpp.jar -config=my_pnpp_config -output=my_output_files
- Enter the following command to run the pnpp.jar file, where input_file is the file name of the list of name records that you want to preprocess.
In the Java Runtime Environment, Version 1.5
java pnpp.jar < input_file
In the Java Runtime Environment, Version 1.6
java -jar pnpp.jar < input_file
The first phase of preprocessing runs and begins the name analysis operations that you specified in the pnpp.1.config file. When these processes complete, the pnpp.java application scans the log files for errors. If the application detects any errors, the name preprocessing operation ends with an error condition and writes the analysis errors to the standard error output stream, which in Java is the System.err PrintStream. You specify the destination to which the pnpp.java application writes this output.
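The error check that follows the first phase can be pictured with the sketch below. It is an illustration under stated assumptions, not the code in pnpp.java: the class LogScanner is hypothetical, and the marker text ERROR is an assumed convention because the format of the Name Preprocessor log files is not described here.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the implementation in pnpp.java.
class LogScanner {

    // Assumed text that identifies an error line in a log file.
    static final String ERROR_MARKER = "ERROR";

    // Returns every line that contains the assumed error marker.
    static List<String> findErrors(List<String> logLines) {
        List<String> errors = new ArrayList<>();
        for (String line : logLines) {
            if (line.contains(ERROR_MARKER)) {
                errors.add(line);
            }
        }
        return errors;
    }

    // Scans each log file, writes any error lines to the given stream,
    // and returns true when at least one error line was found.
    static boolean reportErrors(List<Path> logFiles, PrintStream err) throws IOException {
        boolean failed = false;
        for (Path log : logFiles) {
            List<String> lines = Files.readAllLines(log, StandardCharsets.UTF_8);
            for (String line : findErrors(lines)) {
                err.println(log + ": " + line);
                failed = true;
            }
        }
        return failed;
    }

    public static void main(String[] args) throws IOException {
        List<Path> logs = new ArrayList<>();
        for (String arg : args) {
            logs.add(Paths.get(arg));
        }
        // A nonzero exit status signals the error condition to the caller.
        System.exit(reportErrors(logs, System.err) ? 1 : 0);
    }
}
```

Passing the PrintStream as a parameter keeps the scan independent of where the error output is finally written.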
Parameters for parallel preprocessing
The pnpp.java application can read a comma-delimited file of name records, split these records into multiple sub lists, and run multiple, parallel instances of Name Preprocessor to perform name analysis. The application then combines the individual results into a single list and runs another name preprocessing operation to perform unique name identification.
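The split, analyze, and combine pattern that is described above can be sketched as follows. This is a simplified illustration, not the pnpp.java source: the class ParallelSketch is hypothetical, worker threads stand in for the separate Name Preprocessor processes, and the analyze method is a placeholder that only trims and uppercases each record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; threads stand in for Name Preprocessor processes.
class ParallelSketch {

    // Splits the records into at most `parts` contiguous sub lists of similar size.
    static List<List<String>> split(List<String> records, int parts) {
        List<List<String>> subLists = new ArrayList<>();
        int size = records.size();
        for (int i = 0; i < parts; i++) {
            int from = (int) ((long) size * i / parts);
            int to = (int) ((long) size * (i + 1) / parts);
            if (from < to) {
                subLists.add(new ArrayList<>(records.subList(from, to)));
            }
        }
        return subLists;
    }

    // Placeholder for the name analysis phase.
    static List<String> analyze(List<String> subList) {
        List<String> out = new ArrayList<>();
        for (String record : subList) {
            out.add(record.trim().toUpperCase(Locale.ROOT));
        }
        return out;
    }

    // Analyzes the sub lists concurrently and combines the results in input order.
    static List<String> analyzeInParallel(List<String> records, int parts, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (final List<String> subList : split(records, parts)) {
                Callable<List<String>> task = () -> analyze(subList);
                futures.add(pool.submit(task));
            }
            List<String> combined = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                combined.addAll(future.get()); // waits for that sub list to finish
            }
            return combined;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Analysis was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Analysis of a sub list failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting the results in the order in which the sub lists were submitted keeps the combined list in the same order as the input, whichever worker finishes first.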
The pnpp.java application uses two configuration file templates, one for each phase of preprocessing. The default file names are pnpp.1.config and pnpp.2.config, but you can override the base name (pnpp). The pnpp.java application uses these configuration file templates to create temporary configuration files, where text enclosed in percent signs (such as %INPUT%) is replaced by values that are specific to individual Name Preprocessor instances. You can modify entries in the configuration file templates to provide site-specific values, such as the location of linguistic support files.
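The placeholder replacement that is described above can be sketched in a few lines. This is a minimal illustration, not the pnpp.java source: the class TemplateFiller is hypothetical, and only the %INPUT% token is taken from this document. Any other token names or configuration keys that you try with it are your own assumptions.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the implementation in pnpp.java.
class TemplateFiller {

    // Matches text that is enclosed in percent signs, such as %INPUT%.
    private static final Pattern TOKEN = Pattern.compile("%([A-Z_]+)%");

    // Replaces each known token with its value for one Name Preprocessor instance.
    static String fill(String template, Map<String, String> values) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            // Tokens without a value are left unchanged.
            String replacement = (value != null) ? value : matcher.group();
            // quoteReplacement keeps backslashes and dollar signs in paths literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Quoting the replacement text matters for Windows file paths, where a backslash would otherwise be treated as an escape character.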
You can alter the standard behavior of the pnpp.java application by specifying any of the following command-line options. If you specify the same option more than once, the application uses the most recent value.
Overrides the number of parallel analysis processes. The analysis process runs slower if the number of processes is greater than the number of available CPUs. The default value is 4.
Overrides the number of sub lists that are created. Creating more sub lists than available CPU cores can negatively affect performance. The default value is 2.
Supplies the base file name for the configuration template files (the -config option). The default value is pnpp.
Supplies the name of the directory to use for temporary files. The default value is the current directory, which is represented by a period (.).
Supplies the character encoding of the input file. The pnpp.java application supports all character encodings that are supported by Java. The default value is UTF-8.
Supplies the base file name for the output files (the -output option). The default value is pnpp.
Disables the name analysis phase and proceeds with unique name identification.
Disables the unique name identification phase and stops after name analysis is complete.
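The rule that the most recent value of an option is used can be illustrated with the following sketch. It is not the option parser in pnpp.java: the class OptionSketch is hypothetical, only the -config and -output option names and their default value of pnpp are taken from this document, and the handling of options without a value is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the option parser in pnpp.java.
class OptionSketch {

    // Parses arguments of the form -name=value. A later occurrence of an
    // option replaces an earlier one, so the most recent value is used.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        // Documented defaults for the two base file names.
        options.put("config", "pnpp");
        options.put("output", "pnpp");
        for (String arg : args) {
            if (!arg.startsWith("-")) {
                continue; // not an option
            }
            int eq = arg.indexOf('=');
            String name = (eq < 0) ? arg.substring(1) : arg.substring(1, eq);
            // Assumption: an option without a value acts as a switch.
            String value = (eq < 0) ? "true" : arg.substring(eq + 1);
            options.put(name, value);
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse(args));
    }
}
```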
Software version: 4.1, 4.2
Operating system(s): AIX, Linux, Solaris, Windows
Reference #: 7019314
Modified date: 15 May 2013