What do you need to know to set up multi-byte or different locale databases in IBM Informix Servers?
You currently have a typical en_us (US English) database, which is a normal, default style database with no extra intervention needed for the setup. You want to expand your company database into another country and use a multi-byte character set, or a different locale for the database. What is needed for the setup?
There are several issues to consider when setting up a database with a non-US locale. Here is an overview of the considerations, which are discussed below:
- An understanding of code-sets and their impact on a database.
- An understanding of default vs nondefault locale code-sets.
- Unicode vs language code-set usage.
Code-sets and their impact
The ASCII code-set defines 128 characters (95 of them printable), each of which fits in 7 bits of a single byte. Many characters used in other languages cannot be represented by a single byte value. One example of a multibyte code-set is the Japanese code-set SJIS (locale ja_jp.sjis), sometimes loosely called Kanji, in which characters are encoded in 1 or 2 bytes. Some code-sets use up to 4 bytes to represent a printable character. There are several issues to consider in using any code-set:
- Column info: Because the number of bytes per character varies, character columns in a table must be defined as either CHAR or NCHAR (or VARCHAR or NVARCHAR). The data type chosen for the column determines its collation order.
- CHAR data type follows code set order.
- NCHAR data type follows collation order determined by DB_LOCALE, and can be single or multibyte.
- LVARCHAR can be single or multibyte and can hold data in the code set of the client or database locale -- if you write the input and output support functions to interpret the LVARCHAR data in the correct locale.
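The effect of byte width and code-set order can be illustrated outside the database. The following is a minimal Python sketch, not Informix code: the Python codec names (shift_jis, latin-1, utf-8) only stand in for the corresponding Informix code-sets, and the helper names byte_length and code_set_order are hypothetical. It shows how many bytes the same text needs in different code-sets, and how sorting by encoded byte value (the code-set order that the CHAR bullet above describes) places an accented capital after every unaccented letter.

```python
# Sketch only: Python codecs stand in for Informix code-sets.
# byte_length and code_set_order are hypothetical helper names.

def byte_length(text, codec):
    """Return the number of bytes text occupies when encoded with codec."""
    return len(text.encode(codec))

def code_set_order(words, codec):
    """Sort words by their encoded byte values (code-set order)."""
    return sorted(words, key=lambda word: word.encode(codec))

samples = [
    ("Tokyo", "shift_jis"),
    ("東京", "shift_jis"),
    ("東京", "utf-8"),
    ("café", "latin-1"),
    ("café", "utf-8"),
]
for text, codec in samples:
    # Character count and byte count diverge as soon as a character
    # needs more than one byte in the chosen code-set.
    print(f"{text!r} in {codec}: {len(text)} characters, "
          f"{byte_length(text, codec)} bytes")

# In code-set order, 'Ä' (0xC4 in latin-1) sorts after 'z' (0x7A).
print(code_set_order(["zebra", "Äpfel", "apple"], "latin-1"))
```

A localized collation, such as the one an NCHAR column takes from DB_LOCALE, would normally place the accented word next to the other words starting with "a"; the sketch does not attempt to reproduce that ordering.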
Default vs nondefault Code-sets
A default code-set is relative to the language in which it is used. Some languages have the same letter characters as English, but they also have character variations. Because of these character variations, some code-sets are nondefault. Some considerations for default and nondefault code-sets:
- Default code-sets depend on the platform, and allow for character variations. If your database connects with another default code-set, it is automatically supported.
- Nondefault code-sets may or may not support default code-sets. If your database connects with a nondefault code-set, it may or may not work correctly. Nondefault locales that will work with the default English UNIX code-set (8859-1) include British English (en_gb.8859-1), French (fr_fr.8859-1), Spanish (es_es.8859-1), and German (de_de.8859-1).
- Nondefault locales, such as Japanese SJIS (ja_jp.sjis), Korean (ko_kr.ksc), and Chinese (zh_cn.gb), contain multibyte code-sets. (The unified Chinese code-set is GB18030-2000.)
- When the client and the server use different nondefault locales, data movement can be complex. A character may have no mapped equivalent between client and server at the time of transfer. In some cases, a character inserted into a database with a different locale will not display correctly, or will be replaced by an incorrect substitute character. Informix Servers support only one locale per database. When data is converted to a different code-set, the conversion is handled by a conversion object (.cvo) file. This can only happen if the appropriate rules, locales, and .cvo files are present. For more information, see the section Performing Code Set Conversion in the IBM Informix GLS User's Guide (GLS = Global Language Support).
- Unicode code-sets permit much greater flexibility. Rather than using a single standard for each language, Unicode provides a unique number representation for every character, no matter what the platform, program, or language. If you want a database that handles 2 or more locales at the same time, you will want to use a Unicode locale. As an example, Unicode sets for English and European languages are mostly UTF-8 (Unicode, 8 bit). Some Asian locales are UTF-16, and some code-sets have allowances for UTF-32. Informix Servers only work with UTF-8.
- Collation order refers to how the printable characters of a code-set are arranged for sorting and indexing purposes. The default locale determines how data is sorted. When Unicode is in use, GLS for Unicode (GLU) is a feature that allows your application to use the International Components for Unicode (ICU) libraries instead of the usual GLS libraries. The main advantage of using the ICU libraries is that they take the locale into account when collating Unicode characters; the GLS libraries do not. To force use of GLS for Unicode library collation, set GL_USEGLU=1 in the client and server environment, or compile your application using the -glu option with the esql command. For more information on compiling, see the ESQL/C Programmer's Guide.
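The caveat about characters that have no mapped equivalent can be reproduced with a small Python sketch. This is an analogy only: Python codecs stand in for the client and database code-sets, the substitution character "?" is Python's choice and not necessarily what an Informix .cvo file would produce, and the helper names convert and unmappable are hypothetical.

```python
# Sketch only: Python codecs imitate a conversion between two code-sets.
# convert and unmappable are hypothetical helper names.

def convert(text, target_codec):
    """Round-trip text through target_codec, substituting '?' for any
    character the target code-set cannot represent."""
    return text.encode(target_codec, errors="replace").decode(target_codec)

def unmappable(text, target_codec):
    """Return the characters in text with no equivalent in target_codec."""
    missing = []
    for ch in text:
        try:
            ch.encode(target_codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing

# Every character of the German word exists in latin-1 (ISO 8859-1).
print(convert("Grüße", "latin-1"))

# The Japanese characters do not, so they are silently substituted.
print(convert("東京 Tokyo", "latin-1"))

# Checking before the transfer shows which characters would be lost.
print(unmappable("東京 Tokyo", "latin-1"))
```

Checking for unmappable characters before moving data between locales is one way to catch silent substitutions early; whether a given pair of Informix locales converts cleanly depends on the .cvo files that are installed.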
Code-set installation and setup
- If you want an entire software instance in a specific language, first set the OS environment variable LANG for that language. For example, when prompted during the installation process, a user can specify the French language as spoken in Canada. The code-set automatically defaults to ISO8859-1. With this information, the system sets the value of the default locale, specified by the LANG environment variable, to fr_CA (fr for ISO8859-1 French and CA for Canada). Every process uses this locale unless the LC_* or LANG environment variables are modified. (Note: The default locale assumes the 7-bit ASCII character set; however, Extended Parallel Server (XPS) rejects any filename that is not 7-bit ASCII.)
- The available code-sets are in the $INFORMIXDIR/gls directory. Many additional languages, especially UTF-8 language sets, are included in the International Language Support (ILS) product, available as a separate purchase.
- There are 3 environment settings needed for any change in locale: CLIENT_LOCALE, DB_LOCALE, and SERVER_LOCALE.
- The setup steps for using different languages and code-sets are documented in the IBM Informix GLS User's Guide and the ESQL/C Programmer's Guide.
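As a rough illustration of the three locale settings named above, the following Python sketch builds an environment in which CLIENT_LOCALE, DB_LOCALE, and SERVER_LOCALE are set before a client program is launched. The helper name locale_environment is hypothetical, and using ja_jp.sjis for all three values is an assumption made for the example; the correct values for a given installation should be taken from the GLS User's Guide.

```python
import os

# Sketch only: locale_environment is a hypothetical helper, and the
# locale values below are illustrative assumptions.

def locale_environment(client_locale, db_locale, server_locale,
                       use_glu=False, base=None):
    """Return a copy of the environment with the locale variables set."""
    env = dict(os.environ if base is None else base)
    env["CLIENT_LOCALE"] = client_locale
    env["DB_LOCALE"] = db_locale
    env["SERVER_LOCALE"] = server_locale
    if use_glu:
        # GL_USEGLU=1 requests GLS for Unicode (ICU) collation.
        env["GL_USEGLU"] = "1"
    return env

env = locale_environment("ja_jp.sjis", "ja_jp.sjis", "ja_jp.sjis")
for name in ("CLIENT_LOCALE", "DB_LOCALE", "SERVER_LOCALE"):
    print(f"{name}={env[name]}")
```

The resulting dictionary can be passed as the env argument of subprocess.run() so that only the launched program sees the changed locale, leaving the parent shell untouched.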