Introduction to bidirectional languages

Bidi characteristics: shaping

Bidi scripts are characterized by bidirectionality and shaping.

The following are key aspects of shaping:

Shaping is the process by which characters are rendered in appropriate presentation forms. This can involve presenting characters in a form different from the one in which they are stored. To simplify processing, an unshaped (abstract or basic) representation is generally used internally. Shaping takes into account the character being shaped and the characters in its vicinity, and replaces its abstract representation (or that of its parts) with the proper shape. Shaping is a characteristic of many complex-text languages, particularly the cursive languages of the Middle East.

Shaping in cursive script languages
Arabic script is cursive. A writing system is cursive if adjacent characters in a word are connected to each other, as is typical of handwriting rather than printing. This is the only way in which Arabic script is used, whether in books, newspapers, signs, or workstation displays. Although English can be handwritten in a cursive style, it is seldom published or displayed that way, which is why the script used for English is not considered cursive.

In cursive scripts, letters can assume different shapes according to their position in the word and to the connective properties of adjacent letters. There are as many as four shapes for each letter. For example, the letter Ghayn has four shapes: Isolated, Final, Initial, and Middle:

  • Isolated: the character is not linked to either the preceding or the following character

  • Final: the character is linked to the preceding character but not to the following one

  • Initial: the character is linked to the following character but not to the preceding one

  • Middle: the character is linked to both the preceding and following characters

Only one shape per letter is represented on Arabic keyboards, but all shapes must be available for presentation. Similarly, text in most cursive languages is not stored in its shaped form. Each character has a base form, an abstraction that allows the selection of a cursive character without specifying its shape.

The proper shape can be selected by a shape determination routine, which automatically selects the appropriate shape for each character according to its context. Such a routine may also allow user or software selection of any of the four shapes mentioned above, or it may be temporarily deactivated under software or user control so that data passes through unchanged.
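As an illustration of automatic shape selection, the following minimal sketch picks one of the four shapes from the connective properties of the neighboring letters. The JOINING table and the function name determine_shapes are assumptions made for this example; the table covers only a handful of letters and is not a complete description of Arabic joining behavior.

```python
# Hypothetical joining classes for a few Arabic letters (illustration only).
# "D" = dual-joining: can link to both the preceding and following letter.
# "R" = right-joining: can link only to the preceding letter.
# Characters missing from the table are treated as non-joining.
JOINING = {
    "\u0628": "D",  # BEH
    "\u063A": "D",  # GHAIN (Ghayn)
    "\u0644": "D",  # LAM
    "\u0645": "D",  # MEEM
    "\u0627": "R",  # ALEF
    "\u062F": "R",  # DAL
    "\u0631": "R",  # REH
    "\u0648": "R",  # WAW
}


def determine_shapes(text):
    """Return 'isolated', 'final', 'initial', or 'middle' for each character.

    The text is assumed to be in logical (storage) order, so the preceding
    character is the one at the previous index.
    """
    shapes = []
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        # Linked to the preceding character: that character must be able to
        # link forward (dual-joining) and this one must be able to link back.
        links_prev = JOINING.get(prev_ch) == "D" and JOINING.get(ch) in ("D", "R")
        # Linked to the following character: this one must be able to link
        # forward and the next one must be able to link back.
        links_next = JOINING.get(ch) == "D" and JOINING.get(next_ch) in ("D", "R")
        if links_prev and links_next:
            shapes.append("middle")
        elif links_prev:
            shapes.append("final")
        elif links_next:
            shapes.append("initial")
        else:
            shapes.append("isolated")
    return shapes


# Three Ghayn letters in a row take the initial, middle, and final shapes.
print(determine_shapes("\u063A\u063A\u063A"))
# BEH + ALEF + BEH: ALEF cannot link forward, so the last BEH stays isolated.
print(determine_shapes("\u0628\u0627\u0628"))
```

A production routine would take the joining classes from a full character database rather than a hand-written table, and would also have to handle characters such as diacritics that are transparent to joining.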

Whenever cursive language characters are processed into one shape, they must be reshaped using the same algorithm prior to presentation. In some cases this process can corrupt data, because the algorithm may not be perfectly reversible. As an analogy, English "mono-casing" would change 12Ab2 to 12AB2, and converting back to lowercase would yield 12ab2, which no longer matches the original. Although most cursive language text is stored as basic shapes only, there are cases when it may be stored with characters shaped as presented, as in messages or online help text.
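The casing analogy can be reproduced in a few lines. This is only a sketch of the analogy, not of an actual shaping algorithm, and the function name mono_case_round_trip is made up for the example.

```python
def mono_case_round_trip(text):
    """Fold text to a single case, then try to restore it by lowercasing."""
    folded = text.upper()      # information about the original case is lost here
    restored = folded.lower()  # so this cannot recover the mixed-case original
    return folded, restored


folded, restored = mono_case_round_trip("12Ab2")
print(folded)               # 12AB2
print(restored)             # 12ab2
print(restored == "12Ab2")  # False
```

The same kind of loss can occur when text stored in shaped form is reduced to base forms and then reshaped: the reshaping step can only produce what its rules infer from context, not necessarily what was originally stored.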

Character composition, ligatures, and diacritics
In complex-text languages, there may not be a one-to-one correspondence between the number of characters of text stored for processing and the number of characters in the presented text. Sometimes two or more characters might be represented by a single glyph occupying one presentation cell.
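One well-known case is the Arabic lam-alef ligature. The sketch below uses Python's standard unicodedata module and the Unicode presentation form U+FEFB (an encoding choice assumed here for illustration; the source does not name a specific code page) to show one presented cell corresponding to two stored characters.

```python
import unicodedata

# U+FEFB is the presentation form of the lam-alef ligature:
# a single glyph occupying one presentation cell.
LAM_ALEF_LIGATURE = "\uFEFB"

# Compatibility normalization (NFKC) maps the presentation form back to the
# two base characters that would normally be stored for processing.
stored = unicodedata.normalize("NFKC", LAM_ALEF_LIGATURE)

print(len(LAM_ALEF_LIGATURE), len(stored))    # 1 2
print([unicodedata.name(c) for c in stored])  # ['ARABIC LETTER LAM', 'ARABIC LETTER ALEF']
```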

Because of limitations in display devices and the number of code points available, bidirectional languages such as Hebrew have had to compromise on the ability to represent vowels by diacritics. Vowel sounds have to be surmised by the reader based on knowledge of the language and the semantics of the text. As an example, this is how the first line of the Bible appears in Hebrew when no vowel points (diacritics) are used:

However, guesswork is not acceptable for some applications, such as poetry or classical texts, which require the use of diacritics.
Here is how the same text from the Bible looks when the vowel points (as well as some other diacritics used for cantillation purposes) are included:

Arabic is also written without vowel signs. They are added in poetry and religious books such as the Quran.
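The relationship between pointed and unpointed text can be sketched with Unicode combining marks. The example assumes Unicode encoding, in which Hebrew points and Arabic vowel signs are combining characters; the function name strip_points and the sample word (the first word of the verse, written with escapes) are chosen for this illustration.

```python
import unicodedata

# First word of the verse with vowel points, spelled out with escapes:
# BET + dagesh + sheva, RESH + tsere, ALEF, SHIN + shin dot + hiriq, YOD, TAV.
POINTED = "\u05D1\u05BC\u05B0\u05E8\u05B5\u05D0\u05E9\u05C1\u05B4\u05D9\u05EA"


def strip_points(text):
    """Drop combining marks (vowel points and similar diacritics)."""
    # Base letters have combining class 0; points and accents do not.
    return "".join(c for c in text if unicodedata.combining(c) == 0)


unpointed = strip_points(POINTED)
print(len(POINTED), len(unpointed))  # 11 6
```

Stripping is straightforward, but the reverse direction is not: restoring the points requires knowledge of the language, which is the guesswork the reader normally performs.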

In Arabic, spacing diacritics are currently used as a compromise. In several Arabic systems, some or all of the Arabic diacritics are implemented as separate characters that are rendered after the character to which the diacritic belongs.

National numbers
In both Latin-based languages and Hebrew, numbers are represented using the so-called "Arabic" digits (1, 2, 3, 4, 5, 6, 7, 8, 9, 0). Cursive languages such as Arabic, Persian, and Urdu, as well as many other complex-text languages, have their own national glyphs for digits. The local name for digits used in the cursive languages is not "Arabic digits", but rather Hindi or Arabic-Indic digits.

In most cases, text stored for processing has numbers encoded in their Arabic (Western) form. When displayed, these numbers might use either national glyphs for digits or ordinary Arabic digits, according to the intent of the user or application developer.
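A minimal sketch of this display-time substitution, assuming Unicode, where the Arabic-Indic digits occupy U+0660 to U+0669. The function name to_national_digits and its zero parameter are hypothetical, introduced only for this example.

```python
def to_national_digits(text, zero="\u0660"):
    """Replace Western digits 0-9 with national digits for display.

    `zero` is the national glyph for 0; the other nine digits are assumed
    to follow it consecutively, as they do for U+0660..U+0669.
    """
    table = {ord("0") + d: chr(ord(zero) + d) for d in range(10)}
    return text.translate(table)


stored = "2024"                       # stored with Western digits
displayed = to_national_digits(stored)
print(displayed == "\u0662\u0660\u0662\u0664")  # True
print(int(displayed))                 # 2024: Python parses either digit set
```

Keeping the stored form in Western digits and substituting only at presentation time means that arithmetic, sorting, and data exchange are unaffected by the display preference.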