The Challenges of Processing East Asian Languages

East Asian languages include Chinese, Japanese, Korean, and related languages and dialects. They are used in populous countries with large and growing internet usage, as shown by Internet World Stats. Each of these languages is written using several thousand characters. Compared with English and other languages that use alphabetic scripts, these large character sets present particular challenges for software developers.

Each script also has some unique features. The Chinese script uses only ideographic symbols, which represent things or ideas. Japanese uses some ideographs, but also uses phonetic characters (one character per syllable) for suffixes, particles and individual words. Korean is written in an alphabetic script, but the letters of each syllable are arranged to form a single square glyph. Computers often handle each syllable as a unit, resulting in thousands of characters even though they are composed from a few dozen alphabetic components.
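The relationship between Korean syllable blocks and their alphabetic components can be illustrated with a short sketch. It assumes the arithmetic layout of precomposed Hangul syllables in Unicode (19 leading consonants, 21 vowels, 27 optional trailing consonants); the function name decompose_hangul is made up for this example.

```python
import unicodedata

# Constants from the Unicode layout of precomposed Hangul syllables.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"


def decompose_hangul(syllable: str) -> str:
    """Split one precomposed Hangul syllable into its alphabetic letters (jamo)."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = chr(L_BASE + index // (V_COUNT * T_COUNT))
    vowel = chr(V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT)
    tail_index = index % T_COUNT
    tail = chr(T_BASE + tail_index) if tail_index else ""
    return lead + vowel + tail


# A few dozen letters combine into thousands of syllable characters.
syllable_count = L_COUNT * V_COUNT * T_COUNT

letters = decompose_hangul("한")  # one square glyph, three letters
same_as_nfd = letters == unicodedata.normalize("NFD", "한")
```

The last line cross-checks the hand-written arithmetic against canonical decomposition (NFD) from the standard library, which performs the same split.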

Computer Representation of Text

Computers represent characters using numeric codes. Each character is represented using one or more 8-bit bytes. In alphabetic scripts such as Latin, which is used for English, French, German and many other languages, each character is usually represented by a single 8-bit byte. In order to handle the thousands of characters required for East Asian writing systems, each character must be represented by two or more bytes. Traditional East Asian codepages were designed to cover Latin characters as well as the most common characters of the script. Newer standards and systems were developed to handle many more of the less frequently used characters of the script.
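As a rough illustration of how a traditional codepage mixes one-byte Latin characters with two-byte East Asian characters, the sketch below measures encoded lengths using codecs that ship with Python. The choice of Shift_JIS, GBK and EUC-KR as representative codepages is an assumption made for this example; the text above does not name specific encodings.

```python
# One Latin letter followed by one common character of the script.
samples = {
    "shift_jis": "A日",  # Japanese
    "gbk": "A中",        # Simplified Chinese
    "euc_kr": "A한",     # Korean
}


def byte_lengths(text: str, encoding: str) -> list[int]:
    """Return the number of bytes each character occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]


widths = {enc: byte_lengths(text, enc) for enc, text in samples.items()}

for enc, lengths in widths.items():
    print(enc, lengths)
```

Because character widths differ within one string, code that steps through such text byte by byte has to track where each character starts.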

Today, the Unicode standard and the international and national standards that define the same character set (ISO 10646, GB 13000, JIS X 0221, etc.) include all characters in modern use in East Asian and most other languages, as well as a growing number of historic characters and scripts with small user communities.

Historically, character sets for alphabetic scripts were called single-byte character sets (SBCS), in contrast to early East Asian character sets which were called double-byte character sets (DBCS). These terms are still widely used although many codes are now multi-byte (MBCS) using one, two, three or even four bytes per character.
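The one-to-four-byte range can be seen with Unicode encodings. The sketch below uses UTF-8 and UTF-16, which are not named in the text above and are chosen here only as familiar examples of multi-byte encodings; the sample characters are likewise arbitrary.

```python
# A Latin letter, an accented Latin letter, a common ideograph,
# and a rarer ideograph outside the Basic Multilingual Plane (U+2000B).
chars = ["A", "\u00e9", "\u65e5", "\U0002000B"]

# Bytes per character in UTF-8: grows with the code point value.
utf8_widths = [len(ch.encode("utf-8")) for ch in chars]

# Bytes per character in UTF-16 (little-endian, no byte order mark):
# two bytes for most characters, four for supplementary ones.
utf16_widths = [len(ch.encode("utf-16-le")) for ch in chars]

for ch, w8, w16 in zip(chars, utf8_widths, utf16_widths):
    print(f"U+{ord(ch):04X}: UTF-8 {w8} bytes, UTF-16 {w16} bytes")
```

In both encodings the number of bytes, not the number of characters, determines buffer sizes and storage limits.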

Keyboard input and display

Keyboards for alphabetic languages require only one or two shift keys and sometimes a simple 'dead-key' mechanism for diacritics. This approach does not work for selecting among thousands of characters. For East Asian keyboard input, computer systems instead provide input method editors (IMEs). An IME shows a selection of characters in a small window and lets the user pick one with the mouse. The user may also be able to "type" phonetic syllables or special codes on a regular-size keyboard. Ambiguous input is sometimes resolved by selecting from a list of final character choices. Once the appropriate character has been composed or selected, it is sent to the application.
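The phonetic path through an IME (type a syllable, see the candidates, pick one, commit it) can be sketched as a toy lookup. Everything here is hypothetical: the two-entry CANDIDATES table, the function names, and the use of Chinese pinyin as the phonetic input. A real IME relies on large dictionaries and context, and is provided by the operating system rather than written by the application.

```python
# Hypothetical candidate table: phonetic input -> characters with that reading.
CANDIDATES = {
    "ma": ["妈", "马", "吗", "麻", "骂"],
    "ri": ["日"],
}


def candidates_for(phonetic: str) -> list[str]:
    """Return the characters the user may choose from for this phonetic input."""
    return CANDIDATES.get(phonetic, [])


def commit(phonetic: str, choice: int = 0) -> str:
    """Resolve ambiguous input by list position and return the text sent to the application."""
    options = candidates_for(phonetic)
    if not options:
        return phonetic  # unknown input is passed through unchanged
    if not 0 <= choice < len(options):
        raise IndexError("choice is outside the candidate list")
    return options[choice]


shown = candidates_for("ma")   # what the candidate window would display
committed = commit("ma", 1)    # the user picks the second candidate
```

The application only ever receives the committed string, which is why most programs need no IME-specific code.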

The display and printing of East Asian text is mostly straightforward. The characters are selected one by one from large fonts, and they do not interact typographically. Text is displayed either in horizontal rows, as in English, or in vertical columns (top-down, with columns progressing from right to left). A further complication is the annotation of ideographic text with a phonetic pronunciation guide, printed in a smaller font parallel to the main text. Software developers rarely need to deal with these issues directly because they are handled by the operating system or other runtime environment (e.g., Java).

Text boundaries

East Asian languages are written without spaces between words. This means that for word selection, line breaking and similar operations, special algorithms need to be used to analyze the text. Such algorithms work best with language-specific dictionaries and by taking grammatical rules into account. While simple heuristic algorithms are readily available in various libraries (such as Java and ICU), more sophisticated ones require substantial development and are only available commercially.
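One simple dictionary-based heuristic is greedy longest match: at each position, take the longest dictionary word that fits, and fall back to a single character otherwise. The sketch below illustrates that idea only; it is not the algorithm used by Java or ICU, and the five-word dictionary and the function name segment are invented for the example.

```python
def segment(text: str, dictionary: set[str]) -> list[str]:
    """Split unspaced text into words by greedy longest match against a dictionary."""
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; real systems use tens of thousands of entries.
toy_dictionary = {"我", "喜欢", "北京", "大学", "北京大学"}

words = segment("我喜欢北京大学", toy_dictionary)
print(words)
```

Greedy matching picks the wrong boundary whenever a shorter word followed by another word would have been the correct reading, which is one reason dictionary quality and grammatical rules matter for better results.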
