The challenges of East Asian languages

Conversion Considerations

A number of different encodings exist for each Asian language. The same character set may be encoded in several different ways depending on where the data originates and where it will be used. For example, a host system may use an EBCDIC-based encoding for Japanese, while a Windows workstation may encode the same character set as Shift JIS. Many systems today use Unicode internally for text processing and storage. Unicode allows a single binary software installation to process text in all languages and is therefore a good basis for a globalized system. For systems to communicate and share data, conversion is often required. This is true not just for Asian languages but for all languages.
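As a rough illustration of one text having several byte representations, the sketch below encodes the same Japanese string with a few codecs from Python's standard library. The sample string and the choice of codecs are assumptions made for the example; host EBCDIC code pages for Japanese are not covered here.

```python
# Sample text chosen for illustration: "Japanese language"
text = "日本語"

# Encode the same three characters under several encodings.
encodings = ["shift_jis", "euc_jp", "iso2022_jp", "utf-8", "utf-16-be"]
encoded = {name: text.encode(name) for name in encodings}

# Print the bytes each encoding produces; the sequences differ
# even though the characters are identical.
for name, data in encoded.items():
    print(f"{name:12} {len(data):2d} bytes  {data.hex(' ')}")

# Each byte sequence decodes back to the same text when the
# matching encoding name is used.
round_trips = all(data.decode(name) == text for name, data in encoded.items())
print("all round trips succeed:", round_trips)
```

Decoding any of these byte sequences with the wrong codec name would either fail or yield unrelated characters, which is why a conversion step is needed between systems that use different encodings.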

The best way to avoid data loss due to conversion is to avoid data conversion. The best way to avoid data conversion is to use Unicode whenever possible. Using Unicode will not eliminate the need for conversion in all cases; however, it should minimize the amount of conversion required. Whether Unicode or some other encoding is used, it is essential that the data be tagged with its encoding. Knowing the encoding of the data makes identifying and selecting an appropriate converter much easier.
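One possible way to picture tagging is to carry the encoding name alongside the bytes and let it select the converter. The following is a minimal sketch under that assumption; `TaggedText`, `to_unicode`, and `convert` are hypothetical names invented for this example, not part of any particular product or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    """Raw bytes plus the name of the encoding they are in (hypothetical type)."""
    data: bytes
    encoding: str  # a Python codec name such as "shift_jis" or "utf-8"


def to_unicode(item: TaggedText) -> str:
    # The tag selects the converter; strict mode fails loudly
    # instead of silently corrupting the text.
    return item.data.decode(item.encoding, errors="strict")


def convert(item: TaggedText, target: str) -> TaggedText:
    # Convert through Unicode and tag the result with its new encoding.
    text = to_unicode(item)
    return TaggedText(text.encode(target, errors="strict"), target)


# Shift JIS data converts to UTF-8 without loss.
source = TaggedText("東京".encode("shift_jis"), "shift_jis")
as_utf8 = convert(source, "utf-8")
print(as_utf8.encoding, to_unicode(as_utf8))

# Converting to a narrower encoding can lose data: Korean Hangul
# has no representation in Shift JIS, so strict mode raises an error.
lossy_refused = False
try:
    convert(TaggedText("한국".encode("utf-8"), "utf-8"), "shift_jis")
except UnicodeEncodeError:
    lossy_refused = True
print("lossy conversion refused:", lossy_refused)
```

Keeping the data in Unicode avoids the second situation altogether, since no narrowing conversion is needed.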

Traditional and Simplified Chinese

Several decades ago, China introduced 'simplified' forms of some complex ideographs with the goal of making it easier to learn to read and write. In places like Taiwan and Hong Kong, the 'traditional' forms continue to be used. Legacy codepages for each region use one set of characters or the other, while Unicode contains both. It is possible to transform text using one set of these characters into text using the other, but the relationship is not always one-to-one. As with text boundaries, good software implementations require a lot of development and are only available commercially. (See Basis Technology's 'Chinese Script Converter' at www.basistech.com/base-linguistics/asian
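To show why the mapping is not one-to-one, here is a toy sketch with a three-character table and a two-entry phrase table. Both tables and the function `to_traditional` are made up for illustration; a real converter needs far larger dictionaries and proper word segmentation.

```python
# Toy tables, assumed for this example only.
# Simplified character -> possible traditional forms.
CHAR_TABLE = {
    "发": ["發", "髮"],  # "to send/emit" or "hair": one-to-many
    "头": ["頭"],        # one-to-one
    "现": ["現"],        # one-to-one
}

# Two-character words whose context resolves the ambiguity.
PHRASE_TABLE = {
    "头发": "頭髮",  # "hair"
    "发现": "發現",  # "to discover"
}


def to_traditional(text: str) -> str:
    """Convert using phrase matches first, then a per-character fallback."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in PHRASE_TABLE:
            out.append(PHRASE_TABLE[pair])
            i += 2
            continue
        # Fallback: take the first candidate, which is only a guess
        # when the character has more than one traditional form.
        out.append(CHAR_TABLE.get(text[i], [text[i]])[0])
        i += 1
    return "".join(out)


print(to_traditional("头发"))  # resolved by the phrase table
print(to_traditional("发现"))  # resolved by the phrase table
print(to_traditional("发"))    # ambiguous: falls back to the first candidate
```

The same simplified character becomes a different traditional character depending on the word it appears in, so a character-by-character table cannot be correct on its own.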

GB 18030

Software marketed in China is required to support the relatively new GB 18030 codepage standard. Software must be able to input and output GB 18030 text, which, for Unicode-based code, means support for conversion between GB 18030 and Unicode. The Chinese government tests software in a certification process to verify that this requirement has been met and that the software allows keyboard input and display of a certain set of characters.
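For Unicode-based code, the conversion part of this requirement can be sketched with the `gb18030` codec that ships with Python's standard library. The helper names and sample strings below are assumptions for the example, and the sketch says nothing about the keyboard input and display aspects of certification.

```python
def unicode_to_gb18030(text: str) -> bytes:
    """Encode Unicode text as GB 18030 bytes."""
    return text.encode("gb18030")


def gb18030_to_unicode(data: bytes) -> str:
    """Decode GB 18030 bytes into Unicode text."""
    return data.decode("gb18030")


# Sample strings chosen for illustration: an ASCII letter, two common
# Chinese characters, and an ideograph outside the Basic Multilingual Plane.
samples = ["A", "中文", "\U00020000"]

for sample in samples:
    raw = unicode_to_gb18030(sample)
    # The conversion is expected to round-trip without loss.
    assert gb18030_to_unicode(raw) == sample
    print(f"U+{ord(sample[0]):04X}  {len(raw)} bytes  {raw.hex(' ')}")
```

GB 18030 uses one, two, or four bytes per character, which the byte counts printed by the loop make visible.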
