Introduction to bidirectional languages

Arabic script

Languages that use Arabic script include Arabic, Urdu, Persian (or Farsi) and others.

With the spread of Islam, the Arabic alphabet was adopted by several non-Arab nations for writing their own languages. To accommodate sounds that do not exist in Arabic, the alphabets of these languages added new characters based on Arabic letters, with additional diacritic dots and signs positioned above or below them. Persian (Farsi), for example, uses the Arabic characters plus four additional characters to represent sounds that do not exist in Arabic: p, ch, zh, and g. Until 1929, the Ottoman Turks also used the Arabic alphabet with one additional letter.

The Arabic alphabet is or has been used to write other Turkic languages and dialects, such as Uighur, Kazakh, and Uzbek, as well as Tajik, a variety of Persian. Several other languages use or have used the Arabic alphabet, including Urdu, Malay, Swahili, Hausa, Algerian Tribal, Old Malay, Baluchi, Kashmiri, Sindhi, Pashto, Lahnda, Dargwa, Moroccan Arabic, Adighe, Ingush, Berber, Kurdish, and Jawi/Javanese.

Arabic text is cursive: characters are generally connected to one another so that the text appears handwritten, even when printed:

[Image: the Arabic sentence for "Arabic is a beautiful language"]
In English, this translates to 'Arabic is a beautiful language.'

Arabic character shapes
Shape refers to the form a character takes depending on how it connects to the preceding and following characters. Depending on context, an Arabic character or ligature can have from one to four shapes. The possible shapes for the Arabic character Ghayn are:

Isolated: the character is not linked to either the preceding or the following character

Final: the character is linked to the preceding character but not to the following one

Initial: the character is linked to the following character but not to the preceding one

Middle: the character is linked to both the preceding and following characters

In a text string, the shaping rules that govern a character, its neighbors, and its position within a word determine its presentation shape.
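The shaping rules can be illustrated with a minimal sketch. It assumes a simplified joining table (the RIGHT_JOINING set and both function names are hypothetical, introduced only for this example), covers only the basic Arabic letters, and ignores diacritics and ligatures; a real rendering engine uses far more complete data. The helper ghayn_glyph looks up the Unicode presentation-form character for each shape of Ghayn (spelled GHAIN in Unicode character names).

```python
import unicodedata

# Simplified joining data (an assumption for this sketch): letters that link
# only to the preceding character, never to the following one.
RIGHT_JOINING = {
    "\u0622", "\u0623", "\u0624", "\u0625", "\u0627",  # Alef variants, Waw with Hamza
    "\u0629",                                          # Teh Marbuta
    "\u062F", "\u0630", "\u0631", "\u0632",            # Dal, Thal, Reh, Zain
    "\u0648",                                          # Waw
}


def joining_type(ch):
    """Return 'R' (links to preceding only), 'D' (links both sides) or 'U' (no linking)."""
    if ch in RIGHT_JOINING:
        return "R"
    if ch == "\u0621":  # Hamza never links
        return "U"
    if "\u0620" <= ch <= "\u064A" and unicodedata.category(ch) == "Lo":
        return "D"
    return "U"


def presentation_shapes(word):
    """Return the shape name of each character in word, in logical order."""
    types = [joining_type(ch) for ch in word]
    shapes = []
    for i, t in enumerate(types):
        # Linked to the preceding character only if that character links forward.
        linked_prev = i > 0 and t in ("D", "R") and types[i - 1] == "D"
        # Linked to the following character only if this one links forward.
        linked_next = i + 1 < len(types) and t == "D" and types[i + 1] in ("D", "R")
        if linked_prev and linked_next:
            shapes.append("middle")
        elif linked_prev:
            shapes.append("final")
        elif linked_next:
            shapes.append("initial")
        else:
            shapes.append("isolated")
    return shapes


def ghayn_glyph(shape):
    """Look up the Unicode presentation form of Ghayn for a shape name."""
    unicode_word = {"isolated": "ISOLATED", "final": "FINAL",
                    "initial": "INITIAL", "middle": "MEDIAL"}[shape]
    return unicodedata.lookup("ARABIC LETTER GHAIN " + unicode_word + " FORM")


# Beh + Ghayn + Beh: Ghayn sits between two linking letters.
print(presentation_shapes("\u0628\u063A\u0628"))
# Dal + Ghayn: Dal does not link forward, so Ghayn stays unlinked.
print(presentation_shapes("\u062F\u063A"))
```

Strings are written with escape sequences so that the listing reads the same regardless of how an editor displays right-to-left text.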

Arabic ligatures
A ligature is a glyph that replaces two or more characters. Ligatures are prevalent in Arabic. The most widely used is Lam-Aleph, which replaces the characters Lam and Aleph:

The Lam-Aleph ligature is so widely used that it has its own set of shapes. Because Aleph does not link to the following character, the ligature normally appears only in isolated and final shapes.

In some encodings (code pages), the Lam-Aleph ligature has its own code point. In others, there is no special encoding and the rendering engine algorithmically selects the appropriate Lam-Aleph ligature glyph for presentation. Be careful when converting Arabic text from an encoding where the Lam-Aleph ligature has a defined code point (such as IBM Code Page 420) to an encoding that has no code points for that ligature (such as MS Code Page 1256). In the latter case, the Lam-Aleph ligature may need to be separated into two discrete characters: Lam and Aleph.
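The separation step can be sketched as follows. Python ships no codec for IBM Code Page 420, so this example assumes the text has already been decoded to Unicode, where the ligature appears as a presentation-form character in the range U+FEF5 to U+FEFC. The function names split_lam_aleph and to_cp1256 are hypothetical. The sketch relies on Unicode compatibility normalization (NFKC), which maps each ligature code point to Lam followed by the matching Aleph variant.

```python
import unicodedata

LAM_ALEPH_ISOLATED = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM


def split_lam_aleph(text):
    """Replace Lam-Aleph ligature code points with the two characters Lam + Aleph."""
    return "".join(
        # Normalize only the ligature range so other characters are left untouched.
        unicodedata.normalize("NFKC", ch) if "\uFEF5" <= ch <= "\uFEFC" else ch
        for ch in text
    )


def to_cp1256(text):
    """Encode text as Code Page 1256 after separating any Lam-Aleph ligatures."""
    return split_lam_aleph(text).encode("cp1256")


# Encoding the ligature directly fails, because the target has no code point for it.
try:
    LAM_ALEPH_ISOLATED.encode("cp1256")
except UnicodeEncodeError:
    print("ligature has no code point in cp1256")

# After separation, the text encodes as two bytes: one for Lam, one for Aleph.
print(len(to_cp1256(LAM_ALEPH_ISOLATED)))
```

Normalizing only the ligature range, rather than the whole string, avoids changing other characters that compatibility normalization would also rewrite.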

Arabic character set size
The basic Arabic alphabet has 28 letters. Lam-Aleph is sometimes counted as an additional letter.

Numerals used in Arabic
In countries using Arabic script, numbers are represented either by the same "Arabic" digits used in the Western world (0 to 9), or by "national" digit shapes, known as Hindi shapes (called Arabic-Indic digits in Unicode).
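Converting between the two digit sets is a one-to-one substitution. The following minimal sketch assumes the national shapes are the Unicode Arabic-Indic digits U+0660 to U+0669; the function names are hypothetical.

```python
WESTERN_DIGITS = "0123456789"
# Arabic-Indic ("Hindi") digits occupy U+0660..U+0669 in the same order.
NATIONAL_DIGITS = "".join(chr(0x0660 + i) for i in range(10))

TO_NATIONAL = str.maketrans(WESTERN_DIGITS, NATIONAL_DIGITS)
TO_WESTERN = str.maketrans(NATIONAL_DIGITS, WESTERN_DIGITS)


def to_national_digits(text):
    """Replace Western digits with national digit shapes; other text is unchanged."""
    return text.translate(TO_NATIONAL)


def to_western_digits(text):
    """Replace national digit shapes with Western digits; other text is unchanged."""
    return text.translate(TO_WESTERN)


year = to_national_digits("2024")
print([hex(ord(ch)) for ch in year])  # code points of the converted digits
print(to_western_digits(year))
```

Only the digit shapes change. The digits stay in the same order in memory, most significant digit first, in both representations.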

Mathematical expressions in Arabic script
In Arabic, mathematical expressions are written from right to left, even though the numbers within them are still written from left to right, regardless of whether Arabic or Hindi digits are used. In Persian (Farsi), mathematical expressions are written from left to right.
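One way to see how software tells digits apart from letters when ordering such text is to inspect Unicode bidirectional classes. This sketch assumes a rendering engine that follows the Unicode Bidirectional Algorithm, which the text above does not mention; the function name bidi_classes is hypothetical. Arabic letters have class AL (right to left), Western digits have class EN, and Arabic-Indic digits have class AN. Both digit classes keep a multi-digit number in left-to-right order.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]


samples = {
    "Arabic letter Beh": "\u0628",
    "Western digit 1": "1",
    "Arabic-Indic digit 1": "\u0661",
    "Extended Arabic-Indic digit 1": "\u06F1",  # the digit set used for Persian
    "plus sign": "+",
}

for label, ch in samples.items():
    print(label, bidi_classes(ch)[0])
```

The classes describe individual characters only. The overall direction of an expression, such as right to left in Arabic or left to right in Persian, is still decided by the application and its layout rules.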