Complex text
Displaying English text is a straightforward process. For each character in the text, you retrieve the corresponding glyph image from the font, display it at the next display position, and advance the display position by the width of the glyph image. When the total width of the glyphs exceeds the length of the line, you start a new line. Of course, the process becomes more complex if your application supports multiple fonts and styles, but for basic text display it’s quite easy.
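The simple process described above can be sketched in a few lines. This is only an illustration: `layout_simple`, `advance_of`, and the fixed widths used in the usage lines are hypothetical names and values, not part of any real font API.

```python
def layout_simple(text, advance_of, line_width):
    """Place one glyph per character, left to right, wrapping when a line is full.

    advance_of is a hypothetical callback that returns the glyph width
    for a character; a real application would ask the font for it.
    """
    lines = [[]]
    x = 0
    for ch in text:
        width = advance_of(ch)
        # Start a new line when this glyph would not fit on the current one.
        if x + width > line_width and lines[-1]:
            lines.append([])
            x = 0
        lines[-1].append((ch, x))  # record the glyph and its x position
        x += width                 # advance the display position
    return lines


# Usage: every glyph is 10 units wide and a line holds 50 units.
lines = layout_simple("hello world", lambda ch: 10, 50)
```

Note that the sketch breaks lines at any character rather than at word boundaries, which keeps it short but is not how real line breaking works.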
For some scripts and languages, however, displaying text is more complicated. These scripts and languages are often referred to as “complex,” and text written in them is often called “complex text.” Here are the major ways in which complex text can be more complex than English text:
It can be bidirectional. For example, Arabic and Hebrew are written from right to left, but numbers are written from left to right, as they are in English. Sometimes an Arabic or Hebrew paragraph will contain a few words from a language that is written left to right, like English, and sometimes text in a left-to-right language will contain a few words of Arabic or Hebrew. This can get even more complicated. For example, you might have some English text that includes some Hebrew text containing some numbers.
So, for bidirectional text, the relationship between the order in which the code points are stored in memory and the order in which the glyphs are displayed can be quite complex. This makes selection and editing quite challenging, since a contiguous run of code points in memory might not be contiguous on the display, and vice versa. The Unicode Standard describes an algorithm, the Unicode Bidirectional Algorithm, for computing the display order of bidirectional text.
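As a small illustration of the raw input to that computation, the sketch below uses Python's standard `unicodedata` module to look up the bidirectional class of each character in a mixed string. It does not compute display order; a full implementation of the Unicode algorithm is considerably more involved. The helper name `bidi_classes` and the sample string are made up for this example.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin "a", Hebrew alef, Arabic alef, and the digit "1", in memory order.
sample = "a\u05D0\u06271"
classes = bidi_classes(sample)
# "L" is left-to-right, "R" is right-to-left, "AL" is Arabic letter
# (also right-to-left), and "EN" is a European number.
```

The digit gets its own class because, as described above, numbers run left to right even inside right-to-left text.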
There can be reordering. Sometimes, even though a script is written in a single direction, there may be some cases where the order in which the letters are written does not match the order in which they are spoken. Many of the scripts used in India have this behavior. For example, in the Devanagari script, used to write Sanskrit and Hindi, the short “i” vowel is written before the consonants it modifies. (If English were written this way, the word “Hindi” would be written as “iHndi” and the word “strict” would be written as “istrct.”)
This type of reordering also happens in the Thai script, but the Thai standard for text processing specifies that the text be stored in display order rather than logical order. This simplifies the display, selection and editing of Thai text, but it makes other operations, such as sorting, more complicated.
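The Devanagari case described above can be sketched as a toy reordering step. This is a deliberate simplification: it moves the short “i” vowel sign (U+093F) in front of the single character that precedes it, whereas a real shaping engine moves it in front of the whole consonant cluster. The function name `to_visual_order` is hypothetical.

```python
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I


def to_visual_order(logical):
    """Toy reordering: place the short-i vowel sign before the preceding character.

    The result is a glyph order for display only; the text itself stays
    stored in logical (spoken) order.
    """
    out = []
    for ch in logical:
        if ch == VOWEL_SIGN_I and out:
            out.insert(len(out) - 1, ch)  # move in front of the previous character
        else:
            out.append(ch)
    return "".join(out)


# Logical order: HA (U+0939) followed by VOWEL SIGN I (U+093F).
visual = to_visual_order("\u0939\u093F")
```

Text stored in display order, as in Thai, would skip this step at display time, which is the simplification mentioned above.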
It can have contextual forms. Some letters are written differently depending on the letters around them. For example, in the Arabic script, which is cursive (written with the letters connected), a letter can have up to four different forms: one when it is written in isolation, a second when it’s written at the start of a word, a third when it’s written in the middle of a word, and a fourth when it’s written at the end of a word. Sometimes a letter will take yet another form depending on the particular letters before or after it.
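One way to see these forms is through the compatibility characters that Unicode encodes in the Arabic Presentation Forms-B block (U+FE70 to U+FEFF). The sketch below uses Python's standard `unicodedata` module to collect the forms listed there for a given letter. It is only an illustration: the helper name `contextual_forms` is made up, and modern fonts normally select contextual glyphs through font tables rather than through these code points.

```python
import unicodedata


def contextual_forms(base):
    """Map form name -> presentation-form character for one Arabic letter."""
    target = f"{ord(base):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B
        # Decompositions look like "<initial> 0628"; unassigned code points give "".
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


beh_forms = contextual_forms("\u0628")   # ARABIC LETTER BEH
alef_forms = contextual_forms("\u0627")  # ARABIC LETTER ALEF
```

Beh has isolated, initial, medial, and final forms in this block, while alef, which does not connect to the following letter, has only isolated and final forms.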
The number of code points and the number of glyphs may be different. Sometimes a sequence of code points will be transformed to a single glyph, called a “ligature.” For example, in the Latin script, the letter “f” followed by the letter “i” is sometimes written as the single glyph “ﬁ.” This particular ligature is optional, but for text written in some scripts, such as Arabic, there are some ligatures which must be used for the text to be considered correct. Another example is the use of accented letters in the Latin script. Most fonts will contain a single glyph for each of the commonly used accented letters, but the text might contain a code point for the base letter followed by a code point for the accent.
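Both cases have a rough analogue at the character level, which the following sketch shows with Python's standard `unicodedata` module. It is an analogy only: fonts map code points to glyphs through their own tables, and Unicode normalization is not what a rendering engine uses to form ligatures.

```python
import unicodedata

# A base letter followed by a combining accent: two code points.
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT
# NFC composes them into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Unicode also encodes the "fi" ligature as a compatibility character.
ligature = "\uFB01"  # LATIN SMALL LIGATURE FI
# NFKD expands it back to the two letters it stands for.
expanded = unicodedata.normalize("NFKD", ligature)
```

Here `decomposed` has two code points and `composed` has one, yet both would normally be drawn with the same single accented glyph.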
Sometimes a single code point will be transformed into more than one glyph. This can happen for two reasons:
- In some scripts, a single letter is written in two or more parts. For example, in the Tamil script, some of the vowels have two pieces, one of which is written before the consonant, and one which is written after the consonant. Even though the vowel is written in two pieces, there is only a single code point for the vowel, which is stored after the consonant.
- A single code point may be written with more than one glyph to simplify the font. For example, the Urdu language, spoken in Pakistan and India, is written in a particular style of the Arabic script called “Nastaliq.” Nastaliq uses many ligatures and contextual forms. Often an entire word will become a single ligature. Many Arabic letters differ from other letters only by the number of diacritical marks written above or below the letter. Letters which differ from each other in this way will form ligatures and contextual forms which differ from each other only in the number and placement of diacritical marks. Since there are thousands of such ligatures and contextual forms, the font designer will often draw a single, unmarked ligature and separate diacritical marks, and require the software to render the base ligature and the marks separately. So, for text written in Urdu, the number of code points and the number of glyphs can be quite different.
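The Tamil case in the first item above can be sketched with Python's standard `unicodedata` module, because Unicode gives the two-part Tamil vowel signs a canonical decomposition into their left and right pieces. The function name `split_vowel_glyph_order` is hypothetical, and the sketch handles only two-part vowel signs; other vowel signs simply fall through to the stored order.

```python
import unicodedata


def split_vowel_glyph_order(consonant, vowel_sign):
    """Return the display order of pieces for a consonant plus a two-part vowel sign."""
    parts = unicodedata.normalize("NFD", vowel_sign)
    if len(parts) == 2:
        # The left piece is drawn before the consonant, the right piece after it.
        return [parts[0], consonant, parts[1]]
    # Not a two-part vowel sign: this sketch leaves it in stored order.
    return [consonant, vowel_sign]


# TAMIL LETTER KA (U+0B95) followed by TAMIL VOWEL SIGN O (U+0BCA).
pieces = split_vowel_glyph_order("\u0B95", "\u0BCA")
```

One stored vowel code point yields two display pieces here, one on each side of the consonant.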
It can require complex positioning. Consider the Urdu example in the previous paragraph. If the software is required to render diacritical marks separately, it must contain complex logic to draw them in the correct place with respect to the base ligature. This is required for many of the other scripts used in India as well. Sometimes the vowels are written above or below a consonant ligature. There may also be other letters or diacritical marks written above or below the consonants, and the positioning requirements can be quite complex.
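A common way to express this kind of positioning is with anchor points: the base glyph carries an attachment point, the mark carries a matching one, and the mark is drawn so that the two coincide. The sketch below shows only that arithmetic. The function name `attach_mark` and all coordinates are hypothetical values in font units, not taken from any real font.

```python
def attach_mark(base_origin, base_anchor, mark_anchor):
    """Return the origin at which to draw a mark so its anchor meets the base's anchor.

    base_origin: (x, y) where the base glyph is drawn.
    base_anchor: attachment point, relative to the base glyph's origin.
    mark_anchor: attachment point, relative to the mark glyph's origin.
    """
    bx, by = base_origin
    return (bx + base_anchor[0] - mark_anchor[0],
            by + base_anchor[1] - mark_anchor[1])


# Base drawn at (100, 0) with a top anchor at (250, 600);
# the mark's own anchor sits at (50, 0) in its coordinate space.
mark_origin = attach_mark((100, 0), (250, 600), (50, 0))
```

Real text needs more than this, for example stacking several marks on one base, which is part of why the positioning logic becomes complex.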