Guideline F: Coded character sets

F1: Coding graphic characters




When a product depends on a particular graphic character, or when an external standard creates such a dependency, users must be able to use that character without any knowledge of its internal representation.


Guideline F1


Do not hard code the code points of characters required at the user interface.

The phrase "hard code the code points" refers to the practice of accepting only a predefined code point as the representation of a particular character. Developers rarely do this intentionally, but it happens whenever they assume that their local environment is the same as their customers' environments.

Example: The following C segment illustrates the problem:

char cInput;
:
if ( cInput == '#' )
    goto quit;

The '#' character has the code point X'7B' in CCSID 00293 and the code point X'B1' in CCSID 00297. If the above C segment is compiled using CCSID 00293, the executable code compares cInput against X'7B', so the program fails to recognize # in France, where CCSID 00297 is used.

A solution is to use the X/Open XPG4 function mbtowc(...) or the ISO/IEC C function mbrtowc(...) to convert, or normalize, cInput to wide-character form before the comparison. The encoding of wide characters (the C data type wchar_t) is strictly internal to the compiler and is therefore completely hidden from the application. Both functions, mbtowc(...) and mbrtowc(...), assume that cInput is encoded in the user's currently active CCSID.

wchar_t wcInput;
char cInput;
(void)setlocale( LC_ALL, "" ); /* set to user's preference */
:
(void)mbtowc( &wcInput, &cInput, 1 ); /* char-->wide char */
if ( wcInput == L'#' ) /* the L prefix converts '#' to a wide char */
    goto quit;

The above example ensures that the # character is recognized regardless of its code point. However, if the active CCSID or the user's keyboard does not contain the # character, your product is still unusable. It is better to design your product so that the special termination character is not hard coded. See Guideline F2 - Using Graphic Characters for more discussion.
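As a minimal sketch of both points, the helper below uses mbrtowc(...) with explicit error checks and receives the termination character as a parameter instead of embedding it in the comparison. The name matches_terminator and its signature are illustrative assumptions, not part of any standard, and how the caller obtains the terminator (for example, from a configuration setting) is left open. The sketch also assumes that the caller has already called setlocale( LC_ALL, "" ) once at startup.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (name and signature are assumptions).
 * Returns 1 if the first multibyte character in buf (n bytes available)
 * equals terminator, and 0 otherwise, including on a conversion error.
 */
int matches_terminator( const char *buf, size_t n, wchar_t terminator )
{
    wchar_t   wc;
    mbstate_t state;
    size_t    len;

    if ( n == 0 )
        return 0;                        /* nothing to examine */

    memset( &state, 0, sizeof state );   /* start in the initial shift state */
    len = mbrtowc( &wc, buf, n, &state );

    if ( len == (size_t)-1 || len == (size_t)-2 )
        return 0;                        /* invalid or incomplete character */

    return wc == terminator;
}
```

A caller might write matches_terminator( &cInput, 1, wcQuit ), where wcQuit is a wide character chosen at run time rather than a constant in the source.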


Guideline F1-1


Avoid any dependency on coded character sets.

The guideline is satisfied when the program hard codes no code point and makes no assumption about the coded character set of the data.

Example: The following C segment converts the input character to uppercase if necessary:

char cInput;
:
if ( cInput >= 'a' && cInput <= 'z' ) /* lowercase? */
    cInput -= 0x20; /* convert */

Several assumptions were made in the above code segment:

  1. Only the characters between a and z are checked.
  2. The code points for all lowercase characters are between the code points for a and z.
  3. There is always an uppercase character that corresponds to every lowercase character, and its code point is always X'20' less than its lowercase counterpart.

To remove the coded character set and character dependency from the program, do the following:

  1. Convert cInput to wide character form.
  2. Use the X/Open XPG4 defined wide-character C functions for the validation and conversion.

The wide-character functions are sensitive to the active CCSID in the user's environment.

wchar_t wcInput;
char cInput;
:
(void)setlocale( LC_ALL, "" ); /* set to user's preference */
:
(void)mbtowc( &wcInput, &cInput, 1 ); /* char-->wide char */
if ( iswlower( (wint_t)wcInput ) ) /* lowercase ? */
    wcInput = (wchar_t)towupper( (wint_t)wcInput ); /* convert */
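The same two steps can be applied to a whole string. The following is a sketch only: the function name to_upper_wide, its parameters, and its error convention are assumptions made for illustration. It uses the ISO/IEC C function mbrtowc(...) so that characters occupying more than one byte are handled, and it assumes that the caller has already called setlocale( LC_ALL, "" ).

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (name and signature are assumptions).
 * Converts the multibyte string src to uppercase wide characters in dst,
 * which has room for dst_len wide characters including the terminator.
 * Returns the number of wide characters written (excluding the terminator),
 * or (size_t)-1 on a conversion error or if dst is too small.
 */
size_t to_upper_wide( const char *src, wchar_t *dst, size_t dst_len )
{
    mbstate_t state;
    size_t    remaining = strlen( src );
    size_t    count = 0;

    if ( dst_len == 0 )
        return (size_t)-1;

    memset( &state, 0, sizeof state );   /* start in the initial shift state */

    while ( remaining > 0 ) {
        wchar_t wc;
        size_t  len = mbrtowc( &wc, src, remaining, &state ); /* char-->wide char */

        if ( len == (size_t)-1 || len == (size_t)-2 )
            return (size_t)-1;           /* invalid or incomplete character */
        if ( len == 0 )
            break;                       /* null character reached */
        if ( count + 1 >= dst_len )
            return (size_t)-1;           /* no room left for the terminator */

        if ( iswlower( (wint_t)wc ) )    /* lowercase ? */
            wc = (wchar_t)towupper( (wint_t)wc ); /* convert */

        dst[count++] = wc;
        src += len;
        remaining -= len;
    }

    dst[count] = L'\0';
    return count;
}
```

Because the classification and the conversion are both delegated to iswlower(...) and towupper(...), the sketch contains no code point and no assumption about the distance between lowercase and uppercase characters.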

Many compilers reserve specific numeric values for special meanings. This can cause problems when the program is used in regions where those values also represent valid national characters.

Example: Many C compilers use the integer -1 to indicate the end-of-file (EOF) condition. This creates a problem when that value is compared with a variable of type char.

char cInput;
:
if ( cInput == EOF ) goto endOfFile;

If cInput holds the graphic character ÿ, which has the code point X'FF' in the ISO/IEC 8859-1 coded character set, some C compilers (those that treat char as a signed type) sign-extend cInput to the integer X'FF...FF' before comparing it with EOF (which is -1, or X'FF...FF'). The comparison then succeeds incorrectly, and the program treats a valid character as the end of the file.
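One common way to avoid this problem is to keep the return value of the input function in an int, which can hold every character value as well as EOF, and to narrow it to char only after the EOF test. The following sketch illustrates the idea; the function name count_bytes is an assumption made for this example.

```c
#include <stdio.h>

/*
 * Hypothetical helper (name is an assumption).
 * Counts the bytes in stream up to the end of the file.
 */
long count_bytes( FILE *stream )
{
    int  c;          /* int, not char: distinguishes X'FF' from EOF */
    long count = 0;

    while ( ( c = getc( stream ) ) != EOF )
        count++;     /* c is a valid character value here, even X'FF' */

    return count;
}
```

Because getc(...) returns each character as an unsigned char value converted to int, the byte X'FF' arrives as 255 and never compares equal to EOF.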