On the keytops of an Arabic keyboard, only one shape per character is represented, but all the shapes must be available for presentation during input and output. The proper shape is selected by a "shape-determination routine", which may allow for automatic or algorithmic selection of the appropriate shape according to context as directed by the software user.
This rule applies to any product that processes and presents Arabic text. In many situations, your product can call upon services that are available either from the underlying software or from the platform to ensure compliance with the rule.
Your design must provide an appropriate entry point and enough memory to invoke a shape-determination routine. The routine can operate in different modes:
- Automatic shape determination, where the routine selects the appropriate shape of each character algorithmically. For example, on Linux and Windows, shaping occurs automatically at the font engine level. On iSeries i5/OS, keyboard functions specified within continued-entry fields allow automatic shape determination for Arabic when the cursor direction is right-to-left;
- Specific shape selection, where the user or software can explicitly choose any of the four shapes (isolated, initial, middle, or final);
- Pass-through, where the shaping algorithm is bypassed and no shaping is done.
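The three modes can be sketched as a single routine. The following is a minimal illustration, not a production shaper. The function name `shape` and the mode names are hypothetical. The form table is derived from the compatibility decompositions of the Unicode Arabic Presentation Forms-B block, and the sketch ignores diacritics and ligatures when it computes the joining context.

```python
import unicodedata

# Map Unicode decomposition tags to the shape names used in this sketch.
TAGS = {"<isolated>": "isolated", "<initial>": "initial",
        "<medial>": "medial", "<final>": "final"}

def build_forms():
    """Collect the single-letter presentation forms for each base letter."""
    forms = {}
    for cp in range(0xFE70, 0xFEFD):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep only "<tag> XXXX" entries, which skips ligatures and tashkeel forms.
        if len(parts) == 2 and parts[0] in TAGS:
            base = chr(int(parts[1], 16))
            forms.setdefault(base, {})[TAGS[parts[0]]] = chr(cp)
    return forms

FORMS = build_forms()

def shape(text, mode="automatic"):
    """Shape `text` (logical order). `mode` is 'automatic', 'passthru',
    or one of 'isolated', 'initial', 'medial', 'final' (hypothetical names)."""
    if mode == "passthru":
        return text  # the shaping algorithm is bypassed entirely
    out = []
    for i, ch in enumerate(text):
        forms = FORMS.get(ch)
        if forms is None:
            out.append(ch)  # not a shapeable Arabic letter
            continue
        if mode != "automatic":
            out.append(forms.get(mode, ch))  # explicit shape selection
            continue
        prev = FORMS.get(text[i - 1], {}) if i > 0 else {}
        nxt = FORMS.get(text[i + 1], {}) if i + 1 < len(text) else {}
        # A letter with an initial form can join to the letter that follows it;
        # a letter with a final form can join to the letter that precedes it.
        joins_prev = "initial" in prev and "final" in forms
        joins_next = "initial" in forms and "final" in nxt
        if joins_prev and joins_next:
            key = "medial"
        elif joins_prev:
            key = "final"
        elif joins_next:
            key = "initial"
        else:
            key = "isolated"
        out.append(forms.get(key, ch))
    return "".join(out)
```

For instance, `shape("\u0628\u0627\u0628")` (Beh, Alef, Beh) selects the initial form of the first Beh, the final form of Alef, and the isolated form of the last Beh, because Alef does not join to the letter that follows it.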
For a list of Unicode shaping classes (such as Joining_Type and Joining_Group) for Arabic and Syriac positional shaping, see http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt.
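As a rough illustration of how that data file can be consumed, the sketch below parses lines in the ArabicShaping.txt field layout (code point; schematic name; joining type; joining group). The names `JoiningInfo`, `SAMPLE`, and `parse_arabic_shaping` are hypothetical, and the sample is a three-line excerpt rather than the full file.

```python
from typing import Dict, NamedTuple

class JoiningInfo(NamedTuple):
    name: str
    joining_type: str   # for example R (right-joining) or D (dual-joining)
    joining_group: str

# Small excerpt in the ArabicShaping.txt layout, used here instead of the full file.
SAMPLE = """\
# code point; schematic name; joining type; joining group
0627; ALEF; R; ALEF
0628; BEH; D; BEH
0644; LAM; D; LAM
"""

def parse_arabic_shaping(text: str) -> Dict[str, JoiningInfo]:
    """Return a mapping from character to its joining information."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, name, jtype, jgroup = [field.strip() for field in line.split(";")]
        table[chr(int(code, 16))] = JoiningInfo(name, jtype, jgroup)
    return table

table = parse_arabic_shaping(SAMPLE)
```

With this excerpt, `table["\u0627"].joining_type` is `"R"`, which is why Alef never takes an initial or middle shape.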
A ligature is a glyph that replaces two or more characters; ligatures are prevalent in Arabic. The Lam-Alef ligature (U+FEFB in its isolated form), shown in the figure below, is the most widely used. It replaces the characters Lam (U+0644) and Alef (U+0627).
In some code page encodings, such as CCSIDs 420, 864 and 1046, the Lam-Alef ligature has its own character encoding. In others, such as CCSIDs 425, 1256 and ISO8859-6, there is no special encoding, and the rendering engine algorithmically selects the appropriate Lam-Alef ligature glyph for presentation.
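One way to observe this difference is to try encoding the ligature code point with different codecs. The sketch below relies on the codec tables bundled with Python (`cp864`, `cp1256`, `iso8859_6`), which are assumed here to approximate the corresponding CCSIDs and may not match the IBM definitions exactly. The helper name `has_own_encoding` is hypothetical.

```python
LAM_ALEF_ISOLATED = "\uFEFB"  # Lam-Alef ligature, isolated form

def has_own_encoding(ch: str, codec: str) -> bool:
    """Return True if `codec` assigns a code to the character `ch`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

for codec in ("cp864", "cp1256", "iso8859_6"):
    print(codec, has_own_encoding(LAM_ALEF_ISOLATED, codec))
```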
Tashkeel or diacritic marks
A Tashkeel or "diacritic mark", illustrated in the figure below, gives a different pronunciation to the Arabic letter it is associated with. An Arabic word can have different meanings depending on its pronunciation.
Example: AIX on the IBM System p has the -csd CharShape option for the bterm command that emulates terminals in bidirectional mode. The -csd option specifies the shape of Arabic characters through the CharShape variable which can be set to automatic, isolated, initial, middle, final, or passthru. The default is automatic shaping.
Since it is more convenient to process Arabic text when each character has only one representative value, shaped Arabic characters are usually converted to their base shapes by a deshaping process for storage and processing. The base shape is an abstraction that allows the selection of a character without specifying the shape, and the use of only one code point per character. The deshaping routine maps the code points of the shaped characters to the corresponding code points of the base characters.
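For Unicode text, deshaping can be sketched with the compatibility decompositions that the Unicode Character Database defines for the Arabic presentation forms. This is a minimal sketch under that assumption; the function name `deshape` is hypothetical, and code page based products use their own conversion tables instead.

```python
import unicodedata

# Decomposition tags that mark a positional (shaped) presentation form.
SHAPE_TAGS = {"<isolated>", "<initial>", "<medial>", "<final>"}

def deshape(text: str) -> str:
    """Replace each shaped character with its base character(s)."""
    out = []
    for ch in text:
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in SHAPE_TAGS:
            # parts[1:] holds the base code points in hexadecimal.
            out.append("".join(chr(int(cp, 16)) for cp in parts[1:]))
        else:
            out.append(ch)  # already a base character, or not Arabic
    return "".join(out)
```

Note that a ligature such as Lam-Alef (U+FEFC, final form) deshapes into two base characters, so the deshaped text can be longer than the shaped text.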
Example: Deshaping will occur when the Text Shaping attribute of the source Arabic CCSID is shaped but the Text Shaping attribute of the target Arabic CCSID is unshaped. When converting from one Arabic CCSID to another Arabic CCSID, DB2 UDB will employ the logic below to deshape (or expand) the lam-alef ligature:
- If the last character of the data stream is a blank character, then every character after the lam-alef ligature is shifted toward the end of the data stream, thereby freeing a position so that the current lam-alef ligature can be deshaped (expanded) into its two constituent characters: lam and alef.
- If the first character of the data stream is a blank character, then every character before the lam-alef ligature is shifted toward the beginning of the data stream, thereby freeing a position so that the current lam-alef ligature can be deshaped (expanded) into its two constituent characters: lam and alef.
- If there is no blank character at either the beginning or the end of the data stream, the lam-alef ligature cannot be deshaped. In that case, the lam-alef ligature remains as is if the target CCSID has the lam-alef ligature; otherwise, it is replaced by the target CCSID's SUB (substitution) character.
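The three rules above can be sketched on a fixed-length Unicode string. This is an illustration only, not DB2 code: the function name `expand_lam_alef`, the `target_has_ligature` flag, and the use of U+001A as the SUB character are assumptions, and only the plain Lam-Alef ligatures (U+FEFB and U+FEFC) are handled.

```python
LAM, ALEF, BLANK = "\u0644", "\u0627", " "
SUB = "\x1a"                        # assumed substitution character
LIGATURES = {"\uFEFB", "\uFEFC"}    # Lam-Alef, isolated and final forms

def expand_lam_alef(data: str, target_has_ligature: bool = False) -> str:
    """Expand Lam-Alef ligatures without changing the length of `data`."""
    chars = list(data)
    i = 0
    while i < len(chars):
        if chars[i] not in LIGATURES:
            i += 1
        elif chars[-1] == BLANK:
            # Rule 1: consume the trailing blank; the tail shifts toward the end.
            chars.pop()
            chars[i:i + 1] = [LAM, ALEF]
            i += 2
        elif chars[0] == BLANK:
            # Rule 2: consume the leading blank; the head shifts toward the start.
            chars.pop(0)                     # the ligature is now at index i - 1
            chars[i - 1:i] = [LAM, ALEF]
            i += 1
        else:
            # Rule 3: no room to expand; keep the ligature or substitute it.
            if not target_has_ligature:
                chars[i] = SUB
            i += 1
    return "".join(chars)
```

In every branch the result has the same length as the input, which matches the fixed-length nature of the data stream described above.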
Conversely, when converting from an Arabic CCSID whose Text Shaping attribute is unshaped to an Arabic CCSID whose Text Shaping attribute is shaped, the source lam and alef characters are contracted to one ligature character, and a blank character is inserted at the end of the target area data stream.
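The contraction direction can be sketched in the same way. Again, this is a hedged illustration on Unicode strings rather than the DB2 implementation: the function name `contract_lam_alef` is hypothetical, the sketch always uses the isolated ligature form (U+FEFB) regardless of context, and it assumes one blank is appended per contracted pair so that the length of the data stream is preserved.

```python
LAM, ALEF, BLANK = "\u0644", "\u0627", " "
LAM_ALEF = "\uFEFB"  # isolated form, used here regardless of joining context

def contract_lam_alef(data: str) -> str:
    """Contract each lam + alef pair and pad the end with blanks."""
    pairs = data.count(LAM + ALEF)
    contracted = data.replace(LAM + ALEF, LAM_ALEF)
    # One blank per contracted pair keeps the stream at its original length.
    return contracted + BLANK * pairs
```

The appended blank is what later makes the reverse conversion possible, since expansion needs a blank at one end of the data stream to make room for the second character.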