Introduction to Indic languages

Word formation in Indic languages

The first place where Indic languages differ from other languages is in the formation of words. In English, a simple left-to-right reading of a word is adequate to construct the sounds (or 'phonemes') that represent the word. In Indic languages, characters are reordered and reshaped because the physical representation of Indic words differs from their pronunciation. This requires some explanation.

Indic writing systems are 'orthographic,' which means that they are formed by a mixture of 'phonemic' and 'syllabic' forms. Although Indic words are composed of syllables, a syllabic unit is also an individual visual unit (or glyph). In some cases the glyphs are completely reconstructed. In other cases, visual markers are applied above, below, to the left, or to the right of the glyph.

Syllable formation always centers around a single character (whether or not it is a conjunct-cluster) referred to as the 'base' or 'root' character. When two characters are combined, their component parts may be rearranged. Sometimes the result is identifiable, but at other times the resulting form can be identified only by a trained eye. This happens most often in Case 4, and sometimes in Case 3.

Most technical implementations of Indic syllable formation do not follow any linguistic logic, making application development more difficult. The Unicode implementation, with its ISCII roots, is close to the linguistic view of Indic languages, but still requires developers to apply rules. This article provides a parallel representation of the linguistic view and the Unicode approach to better explain the logic behind syllable formation scenarios. It is important that developers understand this logic.

The simplest way to understand syllable formation is to think of each consonant character as a 'dead' (or 'pure') consonant, that is, a consonant without its inherent vowel. This is the approach taken by linguists. Dead consonants reduce each unit in a combining sequence to its most basic form, allowing the units to combine and generate the resulting glyph for the syllable.

Types of Indic syllables
Case 1. An independent vowel
Case 2. An independent consonant
Case 3. Combination of a vowel and consonant
Case 4. Combination of consonant and consonant ('conjunct-cluster')
Case 5. Combination of vowel and special character
Case 6. Combination of consonant and special character

Cases 1 and 2
Cases 1 and 2 are not issues in Indic word formation and will not be discussed.

Case 3
Vowels can form syllables by themselves, but they must follow a consonant to form consonant-syllables. When this happens, the pure consonant acquires the sound of the vowel following it:

Linguistic: pure consonant 'k' + vowel 'ii' = syllable 'kii'
Unicode: consonant 'ka' + vowel sign 'ii' = syllable 'kii'

Note glyph arrangement

The stroke that appears on the right of the base character represents a 'vowel sign' (called a 'matra' in Hindi). If you use the pure-consonant approach, the stroke appears only in the final syllable because, linguistically, the matra has no independent existence. It is used as an independent unit for Unicode implementation only.
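The Unicode view of this syllable can be sketched in a few lines of Python. This is only an illustration, assuming the Devanagari code points U+0915 (KA) and U+0940 (VOWEL SIGN II) for the 'ka' and 'ii' of the example; the helper name `describe` is hypothetical.

```python
import unicodedata

KA = "\u0915"             # DEVANAGARI LETTER KA (consonant with inherent vowel 'a')
VOWEL_SIGN_II = "\u0940"  # DEVANAGARI VOWEL SIGN II (the matra)

# The syllable 'kii' is stored as two code points, consonant first, matra second.
syllable = KA + VOWEL_SIGN_II


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(syllable):
    print(code_point, name)
```

The matra is a separate code point in storage even though, linguistically, it never stands alone.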

Obviously, the pure consonant approach provides a simple solution: one character in the alphabet (a pure consonant), combines with another element of the alphabet (a vowel), to create a syllable. By extension, pure consonants, when modified by vowels, give rise to independent consonants. Applying vowels to consonants can be equated with creating syllables.

Since matras are needed only to represent vowel-modifications of base consonants, there are as many matras in a script as there are vowels.

Here is an example of a matra applied on top:

Linguistic: pure consonant 'k' + vowel 'ay' = syllable 'kay'
Unicode: consonant 'ka' + vowel sign 'ay' = syllable 'kay'

Here is an example of a matra applied below:

Linguistic: pure consonant 'kh' + vowel 'uu' = syllable 'khuu'
Unicode: consonant 'kha' + vowel sign 'uu' = syllable 'khuu'

Here is a more complicated example of vowel modification:

Linguistic: pure consonant 'g' + vowel 'i' = syllable 'gi'
Unicode: consonant 'ga' + vowel sign 'i' = syllable 'gi'

Note glyph rearrangement

The matra in this case is placed to the left of the base consonant. For the reader, this means that the visual left-to-right format of Devanagari is interrupted: the syllable must be read phonetically from right to left (first the 'ga' and then the 'i' to get 'gi').

Complicated vowel modification also occurs when vowel modifiers appear before and after the base consonant. These are called 'split' matras and are found in Bengali and Tamil, among other scripts. The reading of these Indic syllables is neither left-to-right, nor right-to-left, but is truly orthographic.

The following is an example in Tamil:

Unicode: consonant 'ka' + vowel sign 'o' = syllable 'ko'
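The split nature of this matra can be observed through Unicode normalization. The sketch below assumes the Tamil code points U+0B95 (KA) and U+0BCA (VOWEL SIGN O); canonical decomposition (NFD) breaks the single vowel sign into the two parts that are drawn on either side of the base consonant.

```python
import unicodedata

TAMIL_KA = "\u0B95"            # TAMIL LETTER KA
TAMIL_VOWEL_SIGN_O = "\u0BCA"  # TAMIL VOWEL SIGN O (a split matra)

# In storage the syllable 'ko' is still consonant first, vowel sign second.
ko = TAMIL_KA + TAMIL_VOWEL_SIGN_O

# NFD decomposes the split matra into its two component vowel signs.
parts = unicodedata.normalize("NFD", TAMIL_VOWEL_SIGN_O)
part_names = [unicodedata.name(ch) for ch in parts]

print(part_names)
```

The first component is drawn before the base consonant and the second after it, although both follow the consonant in the stored text.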

A matra is usually rendered at the same location for every character it is applied to. The following example shows a variation from this:

Unicode: consonant 'ka' + vowel sign 'uu' = syllable 'kuu'

However, notice how the matras are in a different location:

Unicode: consonant 'ra' + vowel sign 'uu' = syllable 'ruu'

Multiple vowel sounds (such as the 'ou' in 'house') do not mean that more than one vowel-modifier is applied to a base consonant; this cannot be done. There are separate vowels, and separate vowel markers, for these sounds.

This complicated glyph rearrangement creates a data storage problem because the visual order of the characters does not match the phonetic order (that is, the pronunciation). Solutions are discussed later in this article.
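The mismatch between the two orders can be made concrete with a toy example. The sketch below is not how a real shaping engine works (those reorder glyphs at display time and leave the stored text alone); it assumes a single Devanagari syllable as input, treats only U+093F (VOWEL SIGN I) as a pre-base matra, and uses the hypothetical function name `visual_order`.

```python
PRE_BASE_MATRAS = {"\u093F"}  # DEVANAGARI VOWEL SIGN I, drawn left of the base


def visual_order(syllable: str) -> str:
    """Toy reordering: move a pre-base matra in front of the rest of the syllable."""
    for i, ch in enumerate(syllable):
        if ch in PRE_BASE_MATRAS:
            return ch + syllable[:i] + syllable[i + 1:]
    return syllable  # nothing to reorder


GA = "\u0917"            # DEVANAGARI LETTER GA
VOWEL_SIGN_I = "\u093F"  # matra rendered to the left of 'ga'

phonetic = GA + VOWEL_SIGN_I     # stored (phonetic) order: 'ga', then 'i'
visual = visual_order(phonetic)  # left-to-right order as it appears to the eye

print([f"U+{ord(c):04X}" for c in phonetic])
print([f"U+{ord(c):04X}" for c in visual])
```

Keeping the phonetic order in storage and deriving the visual order only for display means that searching and sorting can operate on the pronunciation order.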

Case 4
In this case, consonants combine with one another to form a single syllable.
Consonant-syllables (called 'yuktakshar' in Hindi) are referred to as 'conjunct-clusters' in this discussion.

Continuing the pure consonant approach to understanding syllable-formation, pure consonants are the only consonants that can combine. It is easy to see why: a consonant syllable cannot be pronounced unless the inherent vowel 'a' is removed. For example, 'Hindi' would become 'hinadi' if the inherent vowel in 'na' were not removed.

There are two display rules for conjunct-clusters:

1. When the conjunct-cluster forms a single graphical unit or ligature (preferred)
2. When the conjunct-cluster does not form a single ligature

The choice does not affect the pronunciation of the conjunct-cluster:

Linguistic: pure consonant 'k' + pure consonant 'ss' + vowel 'a' = conjunct-cluster consonant syllable 'kssa'
Unicode: consonant 'ka' + halant + consonant 'ssa' = syllable 'kssa'

Note single ligature formation
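In code, the Unicode form of this conjunct is simply the consonants joined by the halant, which Unicode names the virama. The sketch below assumes the Devanagari code points U+0915 (KA), U+094D (SIGN VIRAMA) and U+0937 (SSA); the helper `make_conjunct` is a hypothetical name used only for illustration.

```python
import unicodedata

KA = "\u0915"      # DEVANAGARI LETTER KA
SSA = "\u0937"     # DEVANAGARI LETTER SSA
HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA (the halant)


def make_conjunct(*consonants: str) -> str:
    """Join consonants with a halant between each pair.

    The halant removes the inherent vowel of every consonant except the last.
    """
    return HALANT.join(consonants)


kssa = make_conjunct(KA, SSA)
names = [unicodedata.name(ch) for ch in kssa]

print(names)
```

Whether the three stored code points are drawn as one ligature depends on the font and the rendering engine, not on the stored text.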

Conjunct-clusters can be displayed without ligatures, but this is restrictive and impossible in advanced cases. An inscriber can write the simplest representation by sequentially writing the consonants that need to be combined, with each followed by a halant. Within reasonable limits, this 'separated' or 'split' series is the equivalent of a single ligature formed out of the combination of consonants. Here is an example:

Linguistic: 'k' + 'ss' + 'a' + 'tt' + 'i' = 'kssati' = 'kssati' (the same word shown in two display forms)

Remember that the ligature form and split form of the conjunct 'kssa' are equivalent in every respect except display. The separated, or split, form of the conjunct 'kssa' is represented in Unicode using the 'zero-width non-joiner' or ZWNJ.

Two Unicode special formatting control characters play an important role in representing the different forms of Indic conjuncts: the 'zero-width joiner' (ZWJ), and the 'zero-width non-joiner' (ZWNJ).

  1. A Zero Width Joiner (ZWJ) is typically used to fuse two characters that normally do not form a ligature or a fused form, or do not join in a cursive script, to form a new shape or ligature or join in a cursive script.
  2. A Zero Width Non Joiner (ZWNJ) is typically used to represent the separated form of characters that normally fuse together to form a ligature or join in a cursive script.

In the context of Indic scripts, the halant representation has an implicit behaviour similar to ZWJ, as can be seen from the examples given earlier involving the halant. The ZWJ and ZWNJ, among other things, can be used to represent different forms of conjuncts as shown in the following example:

[Table of ZWJ and ZWNJ forms]

The ZWJ following the consonant + halant sequence ('sha' + halant in the example) represents the half-consonant form of the syllable ('shva' in the above example). This behaviour is equivalent to the use of the 'invisible' (INV) character in ISCII. The ZWNJ, on the other hand, is used in representing the split or separated form of the conjunct. When neither the ZWJ nor the ZWNJ appears following the halant character, the conjunct is shown in the customary full ligature form. In some cases, the customary ligature form and half-consonant forms are identical.
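The three forms of the 'shva' conjunct described here can be written out as code point sequences. This is a minimal sketch, assuming Devanagari U+0936 (SHA) and U+0935 (VA) for the example; the dictionary keys and the helper `strip_joiners` are hypothetical names.

```python
SHA = "\u0936"     # DEVANAGARI LETTER SHA
VA = "\u0935"      # DEVANAGARI LETTER VA
HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA (the halant)
ZWJ = "\u200D"     # ZERO WIDTH JOINER
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER

forms = {
    "full_ligature": SHA + HALANT + VA,      # no joiner: customary ligature
    "half_consonant": SHA + HALANT + ZWJ + VA,   # ZWJ after the halant
    "split": SHA + HALANT + ZWNJ + VA,       # ZWNJ after the halant
}


def strip_joiners(text: str) -> str:
    """Remove ZWJ and ZWNJ, leaving only the letters and the halant."""
    return text.replace(ZWJ, "").replace(ZWNJ, "")


for label, text in forms.items():
    print(label, [f"U+{ord(c):04X}" for c in text])
```

Stripping the joiners from any of the three sequences yields the same underlying text, which reflects the point that the forms differ only in display. Code that compares or searches such text may want to apply a step like `strip_joiners` first, although whether that is appropriate depends on the application.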

Generally speaking, the ligatures available for conjunct-clusters in a script would be adequate for the 'original' words of a language. Split or separated conjunct-clusters are mostly used when writing foreign words, or when complex ligatures are not known to the inscriber.

Therefore it is possible to have multiple ligatures for the same conjunct-cluster. This is an alien concept in English and other Latin scripts.

Fully-formed conjunct-clusters can function as individual consonants. This enables them to combine with vowel modifiers in the same way as consonants do.

For those who know, a single complex ligature is the easiest inscription. But no ligature formation rule can be applied as a silver bullet to create graphical displays of conjunct-clusters. As experience is gained with the implementation of Indic languages with Unicode, additional characters or explanatory material are being added to the standard.

Case 5
Syllables of this type involve vowels and special characters. Some Indic scripts have special-modifiers that modify the sounds of the base character:

vowel 'a' + special-modifier 'ng' = modified character 'ang'
vowel 'oo' + special-modifier 'nn' = modified character 'oonn' (as in 'Henri' in French)

Note glyph rearrangement

Case 6
Syllables of this type involve consonants and special characters. When applied to consonants, the same special-modifiers as in case 5 modify the sounds of the pure consonants:

pure consonant 'ch' + special-modifier 'ng' = modified character 'chng'
consonant 'ch' + special-modifier 'nn' = modified character 'chn'

Note glyph rearrangements
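Cases 5 and 6 can be sketched in Unicode terms as a base character followed by a modifier sign. The mapping used below is an assumption, since the text does not name the code points: 'ng' is taken to be the Devanagari anusvara (U+0902) and 'nn' the candrabindu (U+0901), with U+0905 (A) and U+091A (CA) as the base characters.

```python
import unicodedata

A = "\u0905"            # DEVANAGARI LETTER A (independent vowel)
CA = "\u091A"           # DEVANAGARI LETTER CA (consonant, assumed for 'ch')
ANUSVARA = "\u0902"     # DEVANAGARI SIGN ANUSVARA (assumed for 'ng')
CANDRABINDU = "\u0901"  # DEVANAGARI SIGN CANDRABINDU (assumed for 'nn')

case5 = A + ANUSVARA      # vowel + special-modifier
case6 = CA + CANDRABINDU  # consonant + special-modifier


def is_combining_mark(ch: str) -> bool:
    """True if the character is a combining mark (category Mn or Mc)."""
    return unicodedata.category(ch) in ("Mn", "Mc")


for text in (case5, case6):
    print([(unicodedata.name(ch), is_combining_mark(ch)) for ch in text])
```

In both cases the modifier is stored after the base character and is a combining mark, so it has no independent display form of its own.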

These six cases address all instances of glyph formation for syllables, and are the main cause of difficulty in the development of computing support for Indic languages.
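As a rough summary, the six cases can be told apart by inspecting the code points of a syllable. The classifier below is only a sketch under several assumptions: it handles Devanagari only, it expects the input to be exactly one syllable, it uses fixed code point ranges for vowels, consonants and vowel signs, and it lets a special-modifier take precedence when a syllable matches more than one case. The name `classify` is hypothetical.

```python
HALANT = "\u094D"                                   # DEVANAGARI SIGN VIRAMA
SPECIAL_MODIFIERS = {"\u0901", "\u0902", "\u0903"}  # candrabindu, anusvara, visarga


def is_independent_vowel(ch: str) -> bool:
    return "\u0904" <= ch <= "\u0914"


def is_consonant(ch: str) -> bool:
    return "\u0915" <= ch <= "\u0939"


def is_vowel_sign(ch: str) -> bool:
    return "\u093E" <= ch <= "\u094C"


def classify(syllable: str) -> int:
    """Return the case number (1 to 6) for a single Devanagari syllable."""
    if not syllable:
        raise ValueError("empty syllable")
    first, rest = syllable[0], syllable[1:]
    has_special = any(ch in SPECIAL_MODIFIERS for ch in rest)
    if is_independent_vowel(first):
        return 5 if has_special else 1
    if is_consonant(first):
        if has_special:
            return 6
        if HALANT in rest:
            return 4  # conjunct-cluster
        if any(is_vowel_sign(ch) for ch in rest):
            return 3  # consonant + matra
        return 2
    raise ValueError("not a Devanagari vowel or consonant")


print(classify("\u0915\u0940"))        # 'ka' + vowel sign 'ii'
print(classify("\u0915\u094D\u0937"))  # 'ka' + halant + 'ssa'
```

Splitting running text into syllables in the first place is the harder problem and is outside the scope of this sketch.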