Introduction to Indic languages

Sorting sequences

The fact that some Indic languages share the same script creates interesting problems related to their sorting behavior. The absence of standard specifications for these languages means that it is unclear what the correct sorting sequence should be.

Indic scripts originate from one script, Brahmi. Consequently, some Indian languages share the same script (Hindi, Marathi, Sanskrit, Konkani), and others have scripts that are very similar (Tamil-Malayalam, Kannada-Telugu).

The Unicode code charts assigned to Indic scripts make no distinction between languages. As a result, a single code chart serves several languages:

  1. Devanagari: Hindi, Marathi, Sanskrit, Konkani, Nepali
  2. Bengali: Bengali, Assamese, Manipuri
  3. Arabic: Urdu, Kashmiri, Sindhi
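
As a small illustration of the point above, the following Python sketch uses the standard unicodedata module to show that the characters of a Hindi word and a Marathi word are identified only by script, not by language. The helper name script_prefixes and the two sample words are chosen here for illustration.

```python
import unicodedata

def script_prefixes(text):
    """Return the set of first words of the Unicode character names in text."""
    return {unicodedata.name(ch).split()[0] for ch in text}

hindi_word = "\u0939\u093f\u0902\u0926\u0940"    # हिंदी ("Hindi")
marathi_word = "\u092e\u0930\u093e\u0920\u0940"  # मराठी ("Marathi")

# Every character name starts with the script name; nothing in the
# code points records which language the word belongs to.
print(script_prefixes(hindi_word))
print(script_prefixes(marathi_word))
```

Both calls report only the script name, so any language-specific behavior has to come from information outside the text itself.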

The ISCII-88 standard, on which the Unicode Indic code charts are based, ordered characters by phonetic commonality rather than by correct sorting sequence. This distorted some traditional sorting conventions, so developers should not assume that the code point order of a script is the same as the collation sequence of a language that uses it. For example, although Hindi and Marathi both use the Devanagari code chart, the Hindi sorting sequence is not the same as the Marathi one. Sorting must therefore be tailored to languages rather than to scripts.
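
The difference between code point order and a language-specific order can be sketched in a few lines of Python. The helper make_sort_key is hypothetical. Placing ळ (U+0933) after ह (U+0939) is used here purely as an illustrative tailoring, and whether a given language requires it should be checked against that language's own collation rules.

```python
def make_sort_key(moved_after):
    """Build a key that sorts by code point, except that each character in
    `moved_after` is placed immediately after its anchor character."""
    def key(text):
        result = []
        for ch in text:
            if ch in moved_after:
                # Reuse the anchor's position, then order just after it.
                result.append((ord(moved_after[ch]), 1))
            else:
                result.append((ord(ch), 0))
        return result
    return key

LLA, SHA, HA = "\u0933", "\u0936", "\u0939"  # ळ, श, ह
letters = [HA, LLA, SHA]

# Plain sorted() compares code points: U+0933, U+0936, U+0939.
code_point_order = sorted(letters)

# Illustrative tailoring (an assumption, not an official table): ळ after ह.
tailored_order = sorted(letters, key=make_sort_key({LLA: HA}))

print([f"U+{ord(c):04X}" for c in code_point_order])
print([f"U+{ord(c):04X}" for c in tailored_order])
```

The same input produces two different orders, which is why a sort routine needs to know the language and not only the script.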

The Unicode Collation Algorithm (UCA), specified in Unicode Technical Standard #10, addresses the Devanagari script as a whole. It provides a default sort order that should be used only when no additional information is available.

The ISO counterpart to UCA is the International Standard ISO/IEC 14651. It defines a reference comparison method that is applied to two character strings to determine their relative order in a sorted list, independently of context, and that can be applied to strings drawing on the full repertoire of ISO/IEC 10646-1. It also provides a Common Template Table that can be tailored to meet the requirements of a given language and culture while retaining universal properties for other scripts, which allows a fully deterministic ordering to be specified. The table is a starting point for specifying an international string ordering adapted to different cultures, without requiring an implementer to know all the different scripts already encoded in the UCS.
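
Both UCA and ISO/IEC 14651 compare strings through multi-level weights rather than code points. The Python sketch below is a simplified illustration of that idea, not an implementation of either standard. The weight values in TOY_WEIGHTS are invented, and treating the nukta sign as a secondary-level difference is an assumption of this toy table; real weights come from the default table of the standard in use.

```python
KA, KHA, NUKTA = "\u0915", "\u0916", "\u093c"  # क, ख, and the nukta sign

# Toy table: character -> (primary, secondary, tertiary) weights.
# A primary weight of 0 means the character is ignored at the first level.
TOY_WEIGHTS = {
    KA: (1, 1, 1),
    KHA: (2, 1, 1),
    NUKTA: (0, 2, 1),
}

def sort_key(text, weights=TOY_WEIGHTS):
    """Concatenate non-zero weights level by level, separated by 0."""
    elements = [weights[ch] for ch in text]
    key = []
    for level in range(3):
        if level:
            key.append(0)  # level separator, lower than any real weight
        key.extend(e[level] for e in elements if e[level] != 0)
    return tuple(key)

def compare(left, right):
    """Return -1, 0, or 1 according to the order of the two sort keys."""
    kl, kr = sort_key(left), sort_key(right)
    return (kl > kr) - (kl < kr)

with_nukta = KA + NUKTA + KA   # क़क
without_nukta = KA + KHA       # कख

# Code point order is decided by the nukta (U+093C > U+0916) ...
print(sorted([with_nukta, without_nukta]) == [without_nukta, with_nukta])
# ... while the weighted order skips it at the primary level (क before ख).
print(compare(with_nukta, without_nukta))
```

Tailoring for a particular language then amounts to adjusting entries in the weight table while leaving the comparison method unchanged.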