A successful global application must process text written in any language. In some cases, applications must even be able to process text written in more than one language. Many techniques that work well when processing English text will not work when processing languages like Arabic, Hindi and Thai. In this article, we will examine some reasons why processing multiple languages is more complicated, and then we'll look at the techniques needed to process them.
Processing text in applications
Each character processed by an application is represented by a number called a character code, or a code point. A collection of characters and their assigned character codes is called a coded character set, a code set, or a code page. Over the years, many different coded character sets have been defined for different hardware, for different communications environments, and for different languages. One of the most recent coded character sets is Unicode, which is designed to contain the characters needed to write all modern languages, and many older languages as well.
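To make the idea of character codes concrete, here is a minimal sketch in Python. The helper name code_points and the sample characters are chosen only for illustration. It lists the Unicode code point of each character in a string, and then shows that the same character is assigned different codes in different coded character sets.

```python
def code_points(text):
    """Return the Unicode code point of each character, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]

# 'A', 'e' with acute accent, and a Chinese character, written with escapes
# so that the example does not depend on the encoding of the source file.
sample = "A\u00e9\u4f60"
print(code_points(sample))

# The same character has a different code in different coded character sets.
e_acute = "\u00e9"
latin1_code = e_acute.encode("latin-1")  # ISO 8859-1
cp437_code = e_acute.encode("cp437")     # original IBM PC code page
print(latin1_code.hex(), cp437_code.hex())
```

Because the two legacy code pages disagree about the code for the same character, an application that reads bytes must know which coded character set they were written in before it can interpret them.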
The code points that represent a piece of text are stored sequentially in memory, usually in spoken order, which is also called phonetic order or logical order. For English text, this means that the left-most character is stored at the lowest address and the right-most character at the highest address. For Arabic text, which is written right-to-left, the positions are reversed: the right-most character is stored at the lowest address and the left-most character at the highest. For right-to-left text, such as Arabic and Hebrew, the order in which text is stored in memory therefore does not correspond to the order in which it is displayed or printed (assuming a display or printer that is designed for left-to-right text). Later in this article we'll discuss other situations where this can be true.
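The difference between storage order and display order can be sketched in Python as follows. The helper name describe and the sample words are assumptions made for this illustration. The final step, which reverses the string to obtain a left-to-right display order, is a deliberate simplification that only holds for a single run of right-to-left letters with no digits, combining marks, or embedded left-to-right text; it is not a general reordering algorithm.

```python
import unicodedata

def describe(text):
    """List (index, code point, character name, direction class) in storage order."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
        for i, ch in enumerate(text)
    ]

english = "cat"
# The Hebrew word "shalom", stored in spoken (logical) order:
# shin, lamed, vav, final mem.
hebrew = "\u05e9\u05dc\u05d5\u05dd"

for row in describe(english):
    print(row)
for row in describe(hebrew):
    print(row)

# Simplifying assumption: for one pure right-to-left run, the left-to-right
# sequence of characters on the display is the stored sequence reversed.
visual_left_to_right = hebrew[::-1]
print([unicodedata.name(ch) for ch in visual_left_to_right])
```

In both strings, index 0 holds the first character that is spoken. For the English word that is also the left-most character on the display, while for the Hebrew word it is the right-most one.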