Introduction to Indic languages

Storage, input and display of Indic text

The visual re-ordering of Indic syllables presents an interesting problem in data handling. This section addresses and simplifies storage, display and input issues associated with Indic languages, such as the following:

consonant 'ha' + vowel sign 'i' + consonant 'na' + 'halant' + consonant 'da' + vowel sign 'ii' = word 'hindii'

In older encoding mechanisms, visual encoding was often used directly for data encoding. This meant that there was a one-to-one correspondence between the visual appearance of text, the entry of text elements, and data storage. If a developer could solve the problem of creating unique basic units for conjunct formation, both data entry and storage were simplified. But the encoded data did not usually have any linguistic structure, making data analysis difficult.

The ISCII scheme made a quantum leap by localizing the complexity of display, simplifying input, storage, analysis, and manipulation. Since ISCII was created by the linguistic deconstruction of Indic languages, it preserved the linguistic integrity of the words. Unicode, whose Indic encoding is based on ISCII-88, preserves this integrity as well.

Storage mechanisms

As mentioned above, Unicode encoding does not use the pure consonant approach. It combines vowels, consonants (each carrying the inherent vowel 'a') and special-modifiers to form the basic character set for an alphabet. Unicode also uses 'vowel signs' to construct consonant-vowel syllables. To construct consonant-consonant syllables, Unicode uses the 'halant' marker to suppress the inherent vowel of the base consonant, producing a 'dead' consonant. This dead consonant combines with the following consonant to form a syllable complete with an inherent vowel. The special-modifiers combine with either vowels or consonants to create their modified forms.
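The building blocks described above can be illustrated with a minimal Python sketch for Devanagari. The function name role and the code point ranges are simplifying assumptions made for this example (they skip the nukta and several less common characters) and are not taken from any standard API.

```python
# Simplified, illustrative ranges for the Devanagari block (an assumption of
# this sketch; the real block contains more characters than covered here).
def role(ch: str) -> str:
    cp = ord(ch)
    if 0x0901 <= cp <= 0x0903:
        return "special-modifier"   # candrabindu, anusvara, visarga
    if 0x0905 <= cp <= 0x0914:
        return "vowel"              # independent vowels
    if 0x0915 <= cp <= 0x0939:
        return "consonant"          # each carries the inherent vowel 'a'
    if 0x093E <= cp <= 0x094C:
        return "vowel sign"         # dependent vowel (matra)
    if cp == 0x094D:
        return "halant"             # suppresses the inherent vowel
    return "other"

KA, SSA = "\u0915", "\u0937"
HALANT, SIGN_I, ANUSVARA = "\u094D", "\u093F", "\u0902"

ki = KA + SIGN_I          # consonant-vowel syllable 'ki'
kssa = KA + HALANT + SSA  # consonant-consonant syllable 'kssa'
kam = KA + ANUSVARA       # consonant with a special-modifier

for text in (ki, kssa, kam):
    print(text, [role(ch) for ch in text])
```

Each syllable is simply a string of code points, and the halant is an ordinary stored character rather than a separate "pure consonant" letter.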

All of these characters are stored in left-to-right sequence, and preserve the linguistic integrity of the data. Even with split vowels, for example, the split vowel-modifier is entered after the consonant. The rearrangement is performed by the display mechanism, but storage is unambiguous.
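As a small illustration of this storage order, the following Python sketch lists the code points of the word 'hindii' as they are stored. It uses only the standard unicodedata module; the variable names are chosen for this example.

```python
import unicodedata

# The word 'hindii' in stored (logical) order: ha, vowel sign i, na,
# halant, da, vowel sign ii.
WORD = "\u0939\u093F\u0928\u094D\u0926\u0940"

names = [unicodedata.name(ch) for ch in WORD]
for ch, name in zip(WORD, names):
    print(f"U+{ord(ch):04X} {name}")
```

The vowel sign 'i' is stored after 'ha' even though it is drawn to the left of it; Unicode names the halant VIRAMA.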

Unicode encoding is simplest because it uses the linguistic basis of Indic languages. If one ignores pure consonants, Unicode encoding is also the simplest for users to understand.

Input mechanisms

Input mechanisms consist of keyboard designs (or layouts) in which each keystroke represents an element of the script. The mechanism should allow users to enter the required keystrokes and have words rendered in a predictable and intuitive manner.

For most Indic input mechanisms, keystrokes do not correspond to the basic alphabet of the script, but are designed to ensure that they allow words to display as expected visually. For example, a matra could be applied before a consonant. When these characters are stored, the input mechanisms make no effort to convert them to a linguistically correct form, but implement direct encoding from display to storage. This is called visual encoding. Given the nature of Indic languages, it creates difficulty in data analysis. Most of the many input mechanisms used in India work this way.

Keyboard layouts used in India include Remington, Akruti, Phonetic, DoE, and Inscript. The Inscript layout has one-to-one mapping to ISCII (and Unicode characters) for each Indic script. Other layouts can also encode their output to Unicode, but the solution provider has to forward-map keyboard characters to their equivalent Unicode codes. This is done using an Input Method Editor (IME) engine, which does text processing and maps the visual elements to their linguistic equivalents.
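The forward-mapping step can be sketched in Python as follows. This is a toy illustration, not the algorithm of any particular IME: the function visual_to_logical is hypothetical, it handles only the Devanagari vowel sign 'i' (the matra drawn to the left of its consonant), and it assumes the keystrokes have already been converted to Unicode characters.

```python
HALANT = "\u094D"
PRE_BASE_MATRAS = {"\u093F"}  # vowel sign 'i', typed first in visual order

def is_consonant(ch: str) -> bool:
    return "\u0915" <= ch <= "\u0939"

def visual_to_logical(keys: str) -> str:
    """Move a matra typed before a consonant cluster to after that cluster."""
    out = []
    i, n = 0, len(keys)
    while i < n:
        ch = keys[i]
        if ch in PRE_BASE_MATRAS and i + 1 < n and is_consonant(keys[i + 1]):
            j = i + 2  # first consonant of the cluster is consumed
            # Extend over any (halant + consonant) pairs that follow.
            while j + 1 < n and keys[j] == HALANT and is_consonant(keys[j + 1]):
                j += 2
            out.append(keys[i + 1:j])  # the consonant cluster
            out.append(ch)             # then the matra, in logical order
            i = j
        else:
            out.append(ch)
            i += 1
    return "".join(out)

typed = "\u093F\u0939\u0928\u094D\u0926\u0940"  # 'i' keyed before 'ha'
stored = visual_to_logical(typed)
print([f"U+{ord(c):04X}" for c in stored])
```

A real IME would need rules for every script and every reordered sign; the sketch only shows where the visual and linguistic orders differ.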

Inscript constructs syllables using the following keystrokes:

Vowel / Vowel-Modifier / Consonant + Consonant / Vowel + Vowel / Vowel-Modifier / Consonant/ Special-modifier

Each consonant can be:

a. Consonant
b. Conjunct-cluster
c. Consonant + (Vowel-modifier/Special-modifier) + (Consonant)

Input methods for Indic scripts differ between implementations. This creates differing behavior due to the unavailability of clear specifications:

  1. It is not clear how text selection should happen. Some implementations allow character by character selection while others only allow it syllable by syllable.
  2. The backspace and delete keys of some input devices remove characters one by one while others delete whole syllables.
  3. Some implementations do not allow the caret (insertion point) to be placed inside a syllable while others do.
  4. When there is a break at the end of a line it can be unclear what should be used as the anchor for the break. This means that some text analysis has to be performed before the anchor is determined, but the rules are not clear.
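The difference between character-level and syllable-level behavior can be made concrete with a short Python sketch. The syllable pattern below is a deliberately simplified assumption for Devanagari and does not represent a standardized definition; the function names are hypothetical.

```python
import re

CONSONANT = "[\u0915-\u0939]"
VOWEL = "[\u0905-\u0914]"
VOWEL_SIGN = "[\u093E-\u094C]"
MODIFIER = "[\u0901-\u0903]"
HALANT = "\u094D"

# Simplified syllable: a consonant cluster or an independent vowel, then an
# optional vowel sign and modifiers. Anything else is a one-character unit.
SYLLABLE = re.compile(
    f"(?:{CONSONANT}(?:{HALANT}{CONSONANT})*{HALANT}?|{VOWEL})"
    f"{VOWEL_SIGN}?{MODIFIER}*|.",
    re.DOTALL,
)

def syllables(text: str) -> list[str]:
    return [m.group(0) for m in SYLLABLE.finditer(text)]

def backspace_code_point(text: str) -> str:
    return text[:-1]

def backspace_syllable(text: str) -> str:
    return "".join(syllables(text)[:-1])

def break_offsets(text: str) -> list[int]:
    """Offsets between syllables, usable as candidate selection or break anchors."""
    offsets, pos = [], 0
    for unit in syllables(text)[:-1]:
        pos += len(unit)
        offsets.append(pos)
    return offsets

word = "\u0939\u093F\u0928\u094D\u0926\u0940"  # 'hindii'
print(len(syllables(word)), "syllables")
print(len(backspace_code_point(word)), "code points left after character delete")
print(len(backspace_syllable(word)), "code points left after syllable delete")
print(break_offsets(word))
```

For 'hindii', deleting one code point removes only the final vowel sign, while deleting one syllable removes the whole 'ndii' cluster. Which behavior is right is exactly the question the list above leaves open.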

We hope our effort to create a single source for defining technical standards for Indic languages will solve this problem.

Display mechanism

The display mechanism is an application used to convert keystrokes to visual elements rendered on computer screens. This system also performs other important functions. The Input Method Editor (IME) is a part of this system.

The display mechanism consists of these components:

  1. A text pre-processor that analyzes the text
  2. A layout table that maps the keystrokes to the text units they represent. The information for re-arranging/reconstructing a character is based on the characters that surround it. TrueType and OpenType fonts contain layout tables that can be used for text display.
  3. A layout engine that receives information from the layout tables and does the actual output formatting that is sent to the rasterizer for display.

For Indic scripts, the text pre-processor and IME reduce a script to its basic parts, and the mapping table allocates the layout to be used for the characters, based on the context in which the characters appear (i.e. which characters precede and follow them). The layout engine then uses data from the layout table and displays it on the screen.
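The three stages can be sketched as a toy pipeline in Python. Everything here is an assumption made for illustration: the glyph name n_da.conjunct, the one-entry LAYOUT_TABLE, and the function names are hypothetical, and real fonts and layout engines apply far richer rules.

```python
import re

HALANT = "\u094D"
PRE_BASE_MATRA = "\u093F"  # vowel sign 'i', drawn left of its consonant
_C = "[\u0915-\u0939]"
_SYLLABLE = re.compile(f"{_C}(?:{HALANT}{_C})*[\u093E-\u094C]?|.", re.DOTALL)

# Hypothetical layout table: stored sequence -> name of a ligature glyph.
LAYOUT_TABLE = {"\u0928\u094D\u0926": "n_da.conjunct"}

def preprocess(text: str) -> list[str]:
    """Stage 1: split stored text into simplified syllable units."""
    return _SYLLABLE.findall(text)

def layout_syllable(syllable: str) -> list[str]:
    """Stages 2 and 3: look up conjuncts, then put the matra in visual order."""
    last = syllable[-1]
    matra = last if "\u093E" <= last <= "\u094C" else ""
    base = syllable[:-1] if matra else syllable
    glyphs = [LAYOUT_TABLE[base]] if base in LAYOUT_TABLE else list(base)
    if matra == PRE_BASE_MATRA:
        return [matra] + glyphs   # drawn before the consonant cluster
    return glyphs + ([matra] if matra else [])

def display_order(text: str) -> list[str]:
    result = []
    for syllable in preprocess(text):
        result.extend(layout_syllable(syllable))
    return result

glyph_run = display_order("\u0939\u093F\u0928\u094D\u0926\u0940")  # 'hindii'
print(glyph_run)
```

The stored text is never modified; only the glyph run handed to the rasterizer is rearranged, which is what keeps storage unambiguous.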

International Components for Unicode (ICU), published by IBM (http://www.ibm.com/software/globalization/icu/index.html), is a set of open source libraries that contain layout engines for developers to use.

The OpenType (http://www.microsoft.com/typography/SpecificationsOverview.mspx) libraries developed by Adobe, Apple and Microsoft also contain components for displaying Indic text.

For input and display mechanisms to address the shaping of Indic characters, a clear description of all possible combinations among the elements of the character set is required. This data is gradually becoming available. Font developers inside and outside of India have access to the necessary expertise to ensure that users and solution providers get fonts containing the necessary ligatures. For a limited list of ligatures, please see the magazines released by the Technology Development for Indian Languages group (TDIL) of the Ministry of Communications and Information Technology (http://tdil.mit.gov.in/).