Unicode Support

The Unicode Standard is a standardized character code designed to encode international text for display and storage. It uses a unique 16- or 32-bit value to represent each individual character, regardless of platform, language, or program. Using Unicode, you can develop a software product that works across platforms, languages, and countries or regions. Unicode also allows data to be transported through many different systems.

The compiler and run time provide two forms of Unicode support. This section describes both forms, along with some of their features and considerations for using them. For additional information about Unicode, visit the Unicode Home Page at www.unicode.org.

The first type of Unicode support is UCS-2 support. When the LOCALETYPE(*LOCALEUCS2) option is specified on the compilation command, the compiler and run time use wide characters (that is, characters of the wchar_t type) and wide character strings (that is, strings of the wchar_t * type) that represent 2-byte Unicode characters. Narrow (non-wide) characters and narrow character strings represent EBCDIC characters, just as they do when the UCS-2 support is not enabled. The Unicode characters represent codepoints in CCSID 13488.
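
Because each wide character under the UCS-2 support is a single 2-byte value, only code points that fit in one 16-bit unit can be represented. The following is a minimal sketch of that check, not part of the run time; the helper name fits_in_ucs2 is hypothetical, and the code assumes a C99 compiler that provides <stdint.h>.

```c
#include <stdint.h>

/* Hypothetical helper (not a run-time function): returns 1 if the
   code point can be stored in a single 2-byte UCS-2 character,
   0 otherwise. */
int fits_in_ucs2(uint32_t cp)
{
    if (cp > 0xFFFFu)
        return 0;   /* needs more than 16 bits */
    if (cp >= 0xD800u && cp <= 0xDFFFu)
        return 0;   /* surrogate range, not a character on its own */
    return 1;
}
```

For example, fits_in_ucs2(0x20AC) (the euro sign) returns 1, while fits_in_ucs2(0x1F600) returns 0 because that code point lies above U+FFFF.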

The second type of Unicode support is UTF-8 or UTF-32 support (also known as UTF support). When the LOCALETYPE(*LOCALEUTF) option is specified on the compilation command, the compiler and run time use wide characters and wide character strings that represent 4-byte Unicode characters. Each 4-byte character represents a single UTF-32 character. Narrow characters and narrow character strings represent UTF-8 characters. Each UTF-8 character is from 1 to 4 bytes in size. All 7-bit ASCII characters map directly to UTF-8 and are 1 byte in size; all other characters are 2 to 4 bytes in size. The UTF-8 characters represent code points in CCSID 1208.
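
The relationship between one 4-byte UTF-32 character and its 1- to 4-byte UTF-8 form can be sketched as follows. This is an illustrative helper only, not a run-time function; the name utf32_to_utf8 and its signature are hypothetical, and the code assumes a C99 compiler that provides <stdint.h>.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a run-time function): encodes one UTF-32
   code point as UTF-8. Writes 1 to 4 bytes to out and returns the
   number of bytes written, or 0 if cp is not a valid code point. */
size_t utf32_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7Fu) {                      /* 7-bit ASCII: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FFu) {                     /* 2 bytes */
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp >= 0xD800u && cp <= 0xDFFFu)     /* surrogates are invalid */
        return 0;
    if (cp <= 0xFFFFu) {                    /* 3 bytes */
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    if (cp <= 0x10FFFFu) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0u | (cp >> 18));
        out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 4;
    }
    return 0;                               /* beyond U+10FFFF */
}
```

For example, the letter A (U+0041) encodes to the single byte 0x41, while the euro sign (U+20AC) encodes to the three bytes 0xE2 0x82 0xAC.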

When the UTF support is enabled, not only do the wide characters become UTF-32 Unicode, but the narrow characters become UTF-8 Unicode as well. As an example, consider the following HelloWorld program.

#include <stdio.h>

int main(void) {
   printf("Hello World\n");
   return 0;
}

When this program is compiled with UTF support, the character string is stored within the program as UTF-8 characters, not EBCDIC characters. The printf() function expects this, so it parses the UTF-8 characters and generates the output as expected. However, if this program calls a user-supplied routine that does not know how to handle UTF-8 characters, that routine might produce incorrect results or behavior.
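
One way such a routine can go wrong is by assuming that every character occupies exactly one byte. The sketch below contrasts a byte count with a character count; the helper name utf8_code_points is hypothetical, and the code assumes the input is well-formed UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper (not a run-time function): counts the
   characters (code points) in a well-formed UTF-8 string.
   Continuation bytes have the bit pattern 10xxxxxx and are skipped,
   so each character is counted once, at its first byte. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0u) != 0x80u)
            ++n;
    }
    return n;
}
```

For the UTF-8 string "h\xC3\xA9llo" (the word "hello" with an accented e), strlen() returns 6 because the accented letter occupies 2 bytes, while utf8_code_points() returns 5. A routine that uses the byte count as a character count would pad, truncate, or index such a string incorrectly.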
