IBM InfoSphere Global Name Management, Version 5.0

Chinese transliteration overview

IBM InfoSphere Global Name Management matches Chinese Hanzi personal names in their exact and equivalent Hanzi forms and matches Hanzi and transliterated Roman name equivalents to each other.

Written Chinese and Japanese share many characters, so Chinese names written in Hanzi resemble Japanese Kanji names. Chinese Hanzi names have the following characteristics that differentiate them from Kanji names:

These features imply that almost all Chinese names have only one pronunciation in Mandarin. Because of this, Chinese names can be transliterated directly within the NameTransliterator component. This differs from Japanese names, which often have many-to-many relationships between Kanji names and Romanized forms.

The International Components for Unicode (ICU) open source project has a set of system rules that transliterate commonly used Chinese characters into Mandarin Pinyin representations. Each character has only one output form. In the case of characters with multiple pronunciations, the most common one is selected. The IBM Global Name Management transliteration process uses the ICU internal rule set for most Chinese characters. Exceptions are handled by special rules.

Processing Chinese names requires more than adding transliteration rules. All Chinese characters in Mandarin Chinese are monosyllabic. There are about 1,350 syllables with tones and 410 syllables without tones. With tens of thousands of Chinese characters, a single syllable can commonly be written with dozens of different characters. As a result, names written in different characters can be transliterated into the same Latin form. In other words, there is a many-to-one relationship between Hanzi names and Romanized forms. A problem arises when the query name is a Chinese character name and the data list contains other Chinese character names that are pronounced the same way and have been transliterated into the same Romanized form. Without additional filtering procedures, these different Hanzi names will be returned by a name search as perfect matches.

Consider the following list of five names, each of which differs from the others by at least one character:

  1. 黄书东 - name written with the simplified character set
  2. 黃書東 - same name as (1), but written with the traditional character set
  3. 黄书冬 - different last character in given name
  4. 皇书东 - different surname character
  5. 皇舒冬 - all different characters.

All of these names are transliterated into the same Latin form, namely “HUANG SHU DONG” (or HUANG2 SHU1 DONG1 if numeric tone markings are included). However, only names (1) and (2) are the same Chinese name. If these Roman forms are all in the data list, querying (1) “黄书东” would also return (3), (4), and (5) with a score of 1.0, even though a native speaker reads them all as different names. The NameHunter search process is enhanced to deal with this type of problematic result.
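The collapse described above can be reproduced with a minimal sketch. The `PINYIN` table and the `romanize` function are hypothetical illustrations that cover only the nine characters in the list; they are not the ICU rule set or the NameTransliterator API.

```python
# Hypothetical toneless Pinyin table covering only the characters in the
# five example names. A real rule set covers thousands of characters.
PINYIN = {
    "黄": "HUANG", "黃": "HUANG", "皇": "HUANG",
    "书": "SHU", "書": "SHU", "舒": "SHU",
    "东": "DONG", "東": "DONG", "冬": "DONG",
}


def romanize(hanzi_name):
    """Transliterate character by character, one output form per character."""
    return " ".join(PINYIN[ch] for ch in hanzi_name)


names = ["黄书东", "黃書東", "黄书冬", "皇书东", "皇舒冬"]
roman_forms = {name: romanize(name) for name in names}

# Five distinct Hanzi strings yield a single Roman form, so a comparison
# made only on the Roman form cannot tell them apart.
distinct_roman = set(roman_forms.values())
print(len(set(names)), "Hanzi names ->", distinct_roman)
```

Because the Roman form carries no trace of which character produced each syllable, any filtering has to go back to the original Hanzi.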

Handling Chinese Hanzi name data

The NameHunter function analyzes Chinese Hanzi name data with the following general process:

Capabilities include:
  • Recognizing given name and surname elements in Hanzi and processing these elements appropriately.
  • Matching Hanzi personal names based on the Hanzi variant character table.
  • Matching Hanzi and Roman personal name equivalents.

Chinese scoring is applied on a pass or fail basis. If the Hanzi names pass, the scores generated for the Roman name mappings are used, except where the Roman-based score is 1.0 and the Hanzi-based score is less than 1.0. In that case, a penalty of .02 is subtracted from the Roman score so that it becomes .98. The penalty indicates that the Hanzi name data is not a perfect match and thus prevents false positives.
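The rule in the preceding paragraph can be written out as a short sketch. The function name, its parameters, and the choice to return `None` when the Hanzi comparison fails are assumptions made for illustration; the product's actual interface is not shown in this topic.

```python
ROMAN_PENALTY = 0.02  # from the text: a Roman score of 1.0 becomes .98


def combined_score(roman_score, hanzi_score, hanzi_passed):
    """Return the reported score for a pair of Hanzi names.

    Assumption: a failed Hanzi comparison yields no score (None).
    """
    if not hanzi_passed:
        return None
    # A perfect Roman match backed by an imperfect Hanzi match is reduced
    # so that it is not reported as an exact match.
    if roman_score == 1.0 and hanzi_score < 1.0:
        return round(roman_score - ROMAN_PENALTY, 2)
    # Otherwise the Roman-based score is used unchanged.
    return roman_score


print(combined_score(1.0, 0.995, True))
print(combined_score(1.0, 1.0, True))
```

The penalty applies only to a Roman score of exactly 1.0; lower Roman scores pass through unchanged.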

The scoring algorithm uses a Chinese variant table that includes simplified and traditional character pairs along with other variants. The highest variant score is .995. The table is in a format similar to other NameHunter variant tables and is expandable. For example, you can add pairs of characters that are not true variants but are pronounced the same and have similar strokes, so that they score higher than the default score for two different characters.
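As a rough illustration of how a variant table could feed a Hanzi-based score, the sketch below looks up each character pair and takes the lowest result. The table layout, the default score of 0.0 for unrelated characters, and the use of the minimum are all assumptions; only the .995 ceiling for variants comes from the text.

```python
# Hypothetical variant table: unordered character pairs mapped to a score.
VARIANT_TABLE = {
    frozenset("黄黃"): 0.995,  # simplified / traditional
    frozenset("书書"): 0.995,
    frozenset("东東"): 0.995,
}
DEFAULT_DIFFERENT = 0.0  # assumed score for two unrelated characters


def char_score(a, b):
    """Score one character pair: identical, listed variant, or different."""
    if a == b:
        return 1.0
    return VARIANT_TABLE.get(frozenset((a, b)), DEFAULT_DIFFERENT)


def hanzi_score(name_a, name_b):
    """Score two Hanzi names position by position (assumed: take the minimum)."""
    if not name_a or len(name_a) != len(name_b):
        return DEFAULT_DIFFERENT
    return min(char_score(a, b) for a, b in zip(name_a, name_b))


print(hanzi_score("黄书东", "黃書東"))  # simplified vs traditional forms
print(hanzi_score("黄书东", "黄书冬"))  # last character is not a variant
```

Adding an entry to `VARIANT_TABLE` for a near-variant pair would raise its score above the default, which is the kind of extension the paragraph above describes.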

Chinese surnames and given names are not delimited in normal usage. Even structured name data, such as from a residency application form, typically has only one full name field. The transliteration rule file includes a parsing algorithm by which an unparsed Chinese character full name is parsed into a surname and given name before being transliterated. This parsing is essential for cross-language name processing and helps provide correct Roman forms for those few exceptional surname characters that do not follow the most common pronunciation.
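One simple way to picture the parsing step is a longest-prefix match against a surname list, followed by a surname-specific reading where one exists. This is a sketch under stated assumptions, not the algorithm in the rule file: the surname list, the override table, and both function names are hypothetical.

```python
# Hypothetical surname list; a production list would be far larger.
SURNAMES = {"黄", "皇", "单", "欧阳", "司马"}

# Surname readings that differ from the character's most common reading.
# Example: 单 is usually read DAN but is read SHAN as a surname.
SURNAME_READINGS = {"单": "SHAN"}


def split_full_name(full_name):
    """Split an undelimited full name into (surname, given name).

    Two-character surnames are tried before one-character surnames.
    Fallback assumption: the first character is the surname.
    """
    for length in (2, 1):
        prefix = full_name[:length]
        if prefix in SURNAMES and len(full_name) > length:
            return prefix, full_name[length:]
    return full_name[:1], full_name[1:]


def surname_reading(surname, common_reading):
    """Prefer the surname-specific reading when one is listed."""
    return SURNAME_READINGS.get(surname, common_reading)


print(split_full_name("黄书东"))
print(split_full_name("欧阳修"))
print(surname_reading("单", "DAN"))
```

Parsing before transliteration is what makes the surname override possible: the same character keeps its common reading when it appears in a given name.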

Chinese Hanzi name data analysis has the following limitations:

Chinese transliteration requires the file chineseTransRule.ibm. Updates to the configuration files for NameParser, Distributed Search, and NameWorks are also required if migrating from an earlier release of the product.


