Explains how IBM Global Name Management products handle Unicode input, specifically text encoded as UTF-8 or UTF-16.
NameClassifier, NameGenderizer, and Name Variation Generator (NVG) are ASCII-only. Input bytes outside the ASCII range (values above 0x7F) are filtered out as noise, so these products have no Unicode support. The same is true of MetaMatch.
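The practical effect of ASCII-only filtering on UTF-8 input can be illustrated with a minimal Python sketch. This is not the products' code; the function name `strip_non_ascii` is hypothetical, and the sketch assumes that "filtered out" means the bytes are simply dropped.

```python
def strip_non_ascii(raw: bytes) -> bytes:
    """Drop every byte whose value is above 0x7F (assumed filtering behavior)."""
    return bytes(b for b in raw if b < 0x80)


# In UTF-8, each accented letter here is a two-byte sequence, and both bytes
# fall outside the ASCII range, so the whole letter disappears.
name = "José Muñoz".encode("utf-8")
print(strip_non_ascii(name))  # b'Jos Muoz'
```

Because the accented letters vanish rather than fold to their base letters, names containing non-ASCII characters are best converted to plain ASCII before they are passed to these products.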
The Stats API in NameParser accepts input in virtually any encoding (conversion is done with ICU) and produces output in UTF-8, because its internals are UTF-8-based. It can handle most of the extended-Latin characters in Unicode. Currently, the UTF-8 output consists entirely of ASCII characters.
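The following Python sketch shows one way such a pipeline could behave: decode input from a named encoding, reduce extended-Latin letters to ASCII, and emit UTF-8. It uses the Python standard library rather than ICU, and the function name `to_ascii_utf8` and the accent-stripping approach are assumptions for illustration, not the Stats API's actual method.

```python
import unicodedata


def to_ascii_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode `raw`, strip diacritics, and return ASCII-only bytes.

    An ASCII-only byte sequence is also valid UTF-8, which matches the
    description of output that is UTF-8 but contains only ASCII characters.
    """
    text = raw.decode(encoding)
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Letters with no decomposition (for example "ø") are dropped in this sketch.
    return folded.encode("ascii", errors="ignore")


print(to_ascii_utf8("Müller".encode("latin-1"), "latin-1"))  # b'Muller'
print(to_ascii_utf8("Dvořák".encode("utf-16"), "utf-16"))    # b'Dvorak'
```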
NameHunter 3.0 will also accept input in virtually any encoding and includes a modular transliteration system that enables handling of non-Latin scripts. Modules are currently available for extended Latin, Greek, Cyrillic, and Arabic. Modules for Thai, Korean, and Chinese have been discussed, but no decisions have been made on future script support. We currently handle non-Latin scripts by transliterating to Latin, although migration to more native support of non-Latin scripts is likely in future versions.
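As a rough illustration of what a modular transliteration design can look like, the Python sketch below registers one mapping table per script and converts text to Latin by lookup. Everything here is hypothetical: the `Transliterator` class, its methods, and the tiny Greek and Cyrillic tables are invented for this example and do not reflect the NameHunter modules or their mapping rules.

```python
import unicodedata
from typing import Dict

# Hypothetical per-script modules: tiny illustrative tables, far from complete.
GREEK: Dict[str, str] = {"Ν": "N", "ί": "i", "κ": "k", "ο": "o", "ς": "s"}
CYRILLIC: Dict[str, str] = {"И": "I", "в": "v", "а": "a", "н": "n"}


class Transliterator:
    """Combines independently registered script modules into one lookup table."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def register(self, module: Dict[str, str]) -> None:
        # Normalize keys to NFC so lookups match NFC-normalized input.
        for source, latin in module.items():
            self._table[unicodedata.normalize("NFC", source)] = latin

    def to_latin(self, text: str) -> str:
        # Characters not covered by any registered module pass through unchanged.
        normalized = unicodedata.normalize("NFC", text)
        return "".join(self._table.get(ch, ch) for ch in normalized)


translit = Transliterator()
translit.register(GREEK)
translit.register(CYRILLIC)
print(translit.to_latin("Νίκος"))  # Nikos
print(translit.to_latin("Иван"))   # Ivan
```

In a design like this, adding support for a new script means registering another module rather than changing the core lookup, which is the property a modular system is meant to provide.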
NameHunter 3.0 represents the direction in which all of our products are moving: they will all have Unicode-based internals, will accept input and produce output in virtually any encoding, and will handle a wide variety of non-Latin scripts.