IBM InfoSphere Global Name Management, Version 5.0

Chinese transliteration overview

IBM InfoSphere Global Name Management matches Chinese Hanzi personal names in their exact and equivalent Hanzi forms and matches Hanzi and transliterated Roman name equivalents to each other.

Chinese and Japanese written languages share many characters, so Chinese names written in Hanzi resemble Japanese names written in Kanji. However, Chinese Hanzi names have the following characteristics that differentiate them from Kanji names:

There are few multi-character Chinese surnames, so most Chinese full names can be parsed unambiguously into one surname and one given name.
There are few characters with multiple readings (tonal variations aside) in modern Mandarin.
For characters with multiple readings, the most common one is usually assumed when used in personal names. There is a small set of characters that have a surname-specific pronunciation. Because of this, pronunciation assistance is typically not provided for Chinese personal names in normal usage.

These features imply that almost all Chinese names have only one pronunciation in Mandarin. Because of this, Chinese names can be transliterated directly within the NameTransliterator component. This differs from Japanese names, which often have many-to-many relationships between Kanji names and Romanized forms.

The International Components for Unicode (ICU) open source project has a set of system rules that transliterate commonly used Chinese characters into Mandarin Pinyin representations. Each character has only one output form. In the case of characters with multiple pronunciations, the most common one is selected. The IBM Global Name Management transliteration process uses the ICU internal rule set for most Chinese characters. Exceptions are handled by special rules.

Processing Chinese names requires more than adding transliteration rules. All Chinese characters in Mandarin Chinese are monosyllabic. There are about 1,350 syllables with tones and 410 syllables without tones. With tens of thousands of Chinese characters, a single syllable can commonly be written with dozens of different characters. As a result, names written in different characters can be transliterated into the same Latin form. In other words, there is a many-to-one relationship between Hanzi names and Romanized forms. A problem arises when the query name is a Chinese character name and the data list contains other Chinese character names that are pronounced the same way and have been transliterated into the same Romanized form. Without additional filtering procedures, these different Hanzi names will be returned by a name search as perfect matches.

Consider the following list of five names, each of which differs from the others by at least one character:

(1) 黄书东 - name written with the simplified character set
(2) 黃書東 - same name as (1), but written with the traditional character set
(3) 黄书冬 - different last character in given name
(4) 皇书东 - different surname character
(5) 皇舒冬 - all different characters

All of these names are transliterated into the same Latin form, namely “HUANG SHU DONG” (or HUANG2 SHU1 DONG1 if numeric tone markings are included). However, only names (1) and (2) are the same Chinese name. If these Roman forms are all in the data list, querying (1) “黄书东” would also return (3), (4), and (5) with a score of 1.0, even though they are all different names to a native speaker. The NameHunter search process is enhanced to deal with this type of problematic result.
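The collision can be reproduced with a minimal Python sketch. The nine-entry `TOY_PINYIN` table and the `romanize` helper are illustrative assumptions covering only the characters in the five example names; they are not the ICU rule set or the NameTransliterator API.

```python
# Toy toneless character-to-Pinyin table (hypothetical; covers only the
# characters used in the five example names).
TOY_PINYIN = {
    "黄": "HUANG", "黃": "HUANG", "皇": "HUANG",
    "书": "SHU", "書": "SHU", "舒": "SHU",
    "东": "DONG", "東": "DONG", "冬": "DONG",
}


def romanize(hanzi_name: str) -> str:
    """Map each character to exactly one syllable and join with spaces."""
    return " ".join(TOY_PINYIN[ch] for ch in hanzi_name)


names = ["黄书东", "黃書東", "黄书冬", "皇书东", "皇舒冬"]

# Collect the distinct Roman forms produced by the five Hanzi names.
roman_forms = {romanize(name) for name in names}
print(roman_forms)  # {'HUANG SHU DONG'}
```

Five distinct Hanzi strings collapse into a single Roman form, which is why a comparison on the Roman form alone cannot tell them apart.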

Handling Chinese Hanzi name data

The NameHunter function analyzes Chinese Hanzi name data with the following general process:

Hanzi name transliteration is done outside of NameHunter. Personal Hanzi names are transliterated before being sent to NameHunter.
NameHunter accepts both Roman name equivalents and Hanzi name data.
NameHunter matches personal names in Roman form and then eliminates false positives (which can be created by many-to-one Hanzi-to-Roman mappings) by performing Chinese scoring.

Capabilities include:

Recognizing given name and surname elements in Hanzi and processing these elements appropriately.
Matching Hanzi personal names based on the Hanzi variant character table.
Matching Hanzi and Roman personal name equivalents.

Chinese scoring is applied on a pass or fail basis. If the Hanzi names pass, the scores generated for the Roman name mappings are used, except where the Roman-based score is 1.0 and the Hanzi-based score is less than 1.0. In this case, a penalty of 0.02 is applied to the Roman score so that it becomes 0.98. This indicates that the Hanzi name data is not a perfect match and thus prevents false positives.
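The adjustment for candidates that pass Chinese scoring can be sketched as follows. The function name `final_score` and its parameters are hypothetical, and the sketch covers only the pass case; how failing candidates are handled is not modeled here.

```python
def final_score(roman_score: float, hanzi_score: float,
                penalty: float = 0.02) -> float:
    """Return the reported score for a candidate that passed Chinese scoring.

    The Roman-based score is used as-is unless it is a perfect 1.0 while
    the Hanzi-based score is below 1.0; in that case the penalty is
    subtracted. (Names and structure are assumptions for illustration.)
    """
    if roman_score == 1.0 and hanzi_score < 1.0:
        # round() guards against floating-point noise in the subtraction.
        return round(roman_score - penalty, 2)
    return roman_score


print(final_score(1.0, 1.0))     # identical Hanzi: stays 1.0
print(final_score(1.0, 0.995))   # Hanzi variant only: becomes 0.98
print(final_score(0.93, 0.995))  # Roman score below 1.0: unchanged, 0.93
```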

The scoring algorithm uses a Chinese variant table that includes simplified versus traditional along with other variants. The highest variant score is .995. The table is in a format similar to other NameHunter variant tables and is expandable. For example, you can add character sets that are not true variants, but are pronounced the same with similar strokes to score them higher than the default score resulting from a different character.
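As a rough illustration of a variant-table lookup, the following Python sketch scores two Hanzi names character by character. The table layout, the `DEFAULT_DIFFERENT` value, and the choice to take the minimum per-character score are all assumptions made for this example; the actual NameHunter table format and aggregation rule are not described here.

```python
# Hypothetical variant table: an unordered pair of characters -> score.
# 0.995 is the highest variant score mentioned in the text.
VARIANTS = {
    frozenset("黄黃"): 0.995,  # simplified vs. traditional
    frozenset("书書"): 0.995,
    frozenset("东東"): 0.995,
}

# Placeholder default for two unrelated characters (assumed value).
DEFAULT_DIFFERENT = 0.5


def char_score(a: str, b: str) -> float:
    """Score one character pair: identical, listed variant, or default."""
    if a == b:
        return 1.0
    return VARIANTS.get(frozenset((a, b)), DEFAULT_DIFFERENT)


def hanzi_score(name_a: str, name_b: str) -> float:
    """Score two Hanzi names as the weakest character pair (assumption)."""
    if len(name_a) != len(name_b):
        return DEFAULT_DIFFERENT
    return min((char_score(a, b) for a, b in zip(name_a, name_b)),
               default=1.0)


print(hanzi_score("黄书东", "黃書東"))  # variants only: 0.995
print(hanzi_score("黄书东", "黄书冬"))  # one unrelated character: 0.5
```

Under this layout, expanding the table amounts to adding another character pair with a score between the default and 0.995.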

Chinese surnames and given names are not delimited in normal usage. Even structured name data, such as from a residency application form, typically has only one full name field. The transliteration rule file includes a parsing algorithm by which an unparsed Chinese character full name is parsed into a surname and given name before being transliterated. This parsing is essential for cross-language name processing and helps provide correct Roman forms for those few exceptional surname characters that do not follow the most common pronunciation.
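The parsing step can be sketched in Python as shown below. This is not the format or logic of the transliteration rule file; the compound surname set, the reading tables, and the function names are hypothetical and deliberately tiny. The sketch assumes that a known two-character surname is preferred when at least one character remains for the given name, and that a surname-specific reading overrides the common reading only in the surname position.

```python
# Hypothetical sample data; a real rule set would be far larger.
COMPOUND_SURNAMES = {"欧阳", "司马", "诸葛"}
SURNAME_READINGS = {"曾": "ZENG", "单": "SHAN"}  # surname-only readings
COMMON_READINGS = {
    "曾": "CENG", "单": "DAN", "国": "GUO", "藩": "FAN",
    "欧": "OU", "阳": "YANG", "修": "XIU",
}


def parse_full_name(full_name: str) -> tuple[str, str]:
    """Split an undelimited Hanzi full name into (surname, given name)."""
    if len(full_name) > 2 and full_name[:2] in COMPOUND_SURNAMES:
        return full_name[:2], full_name[2:]
    # Default: the first character is the surname.
    return full_name[:1], full_name[1:]


def romanize_parsed(full_name: str) -> tuple[str, str]:
    """Parse first, then pick readings according to the name element."""
    surname, given = parse_full_name(full_name)
    surname_roman = " ".join(
        SURNAME_READINGS.get(ch) or COMMON_READINGS[ch] for ch in surname
    )
    given_roman = " ".join(COMMON_READINGS[ch] for ch in given)
    return surname_roman, given_roman


print(romanize_parsed("曾国藩"))  # ('ZENG', 'GUO FAN')
print(romanize_parsed("欧阳修"))  # ('OU YANG', 'XIU')
```

Without the parsing step, 曾 would receive its common reading, so the surname would be romanized as CENG rather than ZENG.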

Chinese Hanzi name data analysis has the following limitations:

Chinese transliteration works with Mandarin Pinyin only.
Comparison between Chinese characters is only possible if their Romanized forms match at the pre-defined threshold.
Names that look similar but are pronounced differently are unlikely to pass initial matching. Adding characters that look similar but have different pronunciations to the character variant table is not effective. Name matching between such names is not possible because direct search and comparison of Chinese character names is not supported.
Chinese transliteration is currently only intended for personal names. Organization names in Chinese script often cannot be directly transliterated and require translation as well. Hanzi characters that are not Chinese personal names are transliterated to the Roman alphabet and are treated as personal names.

Chinese transliteration requires the file chineseTransRule.ibm. If you are migrating from an earlier release of the product, updates to the configuration files for NameParser, Distributed Search, and NameWorks are also required.
