IBM Global Name Recognition product support of Unicode

Technote (FAQ)


Question

How do IBM Global Name Management products handle Unicode characters, specifically UTF-8 and UTF-16?

Answer

NameClassifier, NameGenderizer, Name Variation Generator (NVG), and MetaMatch are ASCII-only and have no Unicode support. Input byte values outside the ASCII range are filtered out as noise.
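The effect of this filtering on UTF-8 input can be illustrated with a minimal Python sketch. It is not the products' actual code; the function name strip_non_ascii is hypothetical, and the sketch only assumes that "outside the ASCII range" means byte values of 0x80 and above.

```python
def strip_non_ascii(raw: bytes) -> bytes:
    """Drop every byte outside the 7-bit ASCII range (0x00-0x7F).

    Hypothetical helper: it mimics the "filtered out as noise" behavior
    described above, not the products' real implementation.
    """
    return bytes(b for b in raw if b < 0x80)


# "José" in UTF-8 is b'Jos\xc3\xa9'; both bytes of "é" are >= 0x80 and are dropped.
print(strip_non_ascii("José".encode("utf-8")))  # b'Jos'
```

Under this assumption an accented letter is not folded to its base letter; it disappears, so "José" would be processed as "Jos".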

The Stats API in NameParser uses ICU to accept input in virtually any encoding and produces output in UTF-8 (its internals are UTF-8-based). It can handle most of the extended-Latin characters in Unicode. Currently, the UTF-8 output consists entirely of ASCII characters.
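The following Python sketch shows one way such a pipeline could behave: decode from a declared encoding, reduce extended-Latin letters to ASCII, and emit UTF-8. It uses Python's built-in codecs and the unicodedata module in place of ICU. The function name to_ascii_utf8 is hypothetical, and folding by removing diacritics is an assumption made for illustration, not a documented description of how NameParser works.

```python
import unicodedata


def to_ascii_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode `raw` from `encoding` and return ASCII-only UTF-8 bytes.

    Assumption: extended-Latin letters are folded by removing diacritics.
    """
    text = raw.decode(encoding)
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Discard the combining marks and keep the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Pure ASCII is valid UTF-8; anything still non-ASCII is dropped here.
    return base.encode("ascii", errors="ignore")


print(to_ascii_utf8("Müller".encode("latin-1"), "latin-1"))  # b'Muller'
print(to_ascii_utf8("José".encode("utf-16"), "utf-16"))      # b'Jose'
```

Letters that have no Unicode decomposition, such as "ł", are dropped by this sketch rather than mapped to a base letter. A real system would need explicit mappings for them.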

NameHunter 3.0 will also accept input in virtually any encoding and includes a modular transliteration system that enables handling of non-Latin scripts. Modules are currently available for extended Latin, Greek, Cyrillic, and Arabic. Modules for Thai, Korean, and Chinese have been discussed, but no decisions have been made on future script support. We currently handle non-Latin scripts by transliterating to Latin, although migration to more native support of non-Latin scripts is likely in future versions.
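As a rough illustration of a modular transliteration design, the Python sketch below keeps one mapping table per script and dispatches each character to the module for its script. Everything in it is hypothetical: the table contents, the MODULES registry, and the transliterate function are assumptions for illustration and do not reflect NameHunter's actual modules or API. The tables cover only the letters used in the examples, and the Greek example omits the accent for brevity.

```python
import unicodedata

# Hypothetical, deliberately tiny tables; a real module would cover a whole script.
GREEK = {"Ν": "N", "ι": "i", "κ": "k", "ο": "o", "ς": "s"}
CYRILLIC = {"И": "I", "в": "v", "а": "a", "н": "n"}

# Registry of script modules, keyed by the first word of the Unicode character name.
MODULES = {"GREEK": GREEK, "CYRILLIC": CYRILLIC}


def transliterate(name: str) -> str:
    """Map each character to Latin using the module for its script, if one exists."""
    out = []
    for ch in name:
        # For example, unicodedata.name("в") is "CYRILLIC SMALL LETTER VE".
        script = unicodedata.name(ch, "").split(" ")[0]
        table = MODULES.get(script)
        # Characters with no module, or missing from a table, pass through unchanged.
        out.append(table.get(ch, ch) if table else ch)
    return "".join(out)


print(transliterate("Иван"))   # Ivan
print(transliterate("Νικος"))  # Nikos
```

With this layout, supporting another script would mean adding one table and one registry entry, leaving the dispatch logic unchanged.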

NameHunter 3.0 represents the direction in which all of our products will be moving: they will all have Unicode-based internals, accept input and produce output in virtually any encoding, and handle a wide variety of non-Latin scripts.

Cross reference information
Segment: Information Management
Product: InfoSphere Global Name Scoring
Component: IBM NameClassifier
Version: 2.1

Document information


More support for:

InfoSphere Global Name Management
InfoSphere Global Name Analytics

Software version:

2.1

Operating system(s):

AIX, Linux Red Hat - iSeries, Linux Red Hat - pSeries, Solaris, Windows

Reference #:

1247560

Modified date:

2013-05-15
