Guideline F: Coded character sets

F2: Using graphic characters




Your customers may require certain graphic characters that are different from those used by your product developers, and your product developers may require characters that are not required by your customers.

Example: French users require their national characters, such as é and à, but have little need for the North American cent character (¢). French keyboards must let users type all their national characters, but need not provide the ¢ character.

If a user must enter a specific graphic character to trigger one of your product functions, and if that character is not shown on the keyboard, then your product is not usable.

Guidelines F2


If the product requires specific graphic characters, provide users with a way to meet the product's syntax requirements.

Specific characters are often used to act as delimiters, as short-cut ways to initiate functions, as operators in functions, or to impart special meaning in a naming convention. Users must be able to enter all graphic characters that are critical to the proper use of your product.

Example: Your spreadsheet product requires that all functions begin with the @ character, as in @sum(...). The @ symbol is absent from some coded character sets and keyboards, such as the French 251 keyboard. In that case the user must substitute another character for @, and your product must accept that substitution character.
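
The following is a minimal sketch of one way a product could accept a substitution character, assuming the set of accepted prefix characters comes from a locale-specific setting. The names DEFAULT_PREFIXES and parse_function_call are hypothetical, and "!" is only a stand-in for whatever substitute a product chooses.

```python
import re

# Hypothetical setting: characters accepted as the function prefix.
# "@" is the product default; "!" stands in for a substitution character
# offered to users whose keyboard lacks "@".
DEFAULT_PREFIXES = ("@", "!")


def parse_function_call(text, prefixes=DEFAULT_PREFIXES):
    """Return (name, argument_text) for a prefixed function call, else None."""
    # Escape each prefix so it is safe inside a regular-expression class.
    prefix_class = "".join(re.escape(p) for p in prefixes)
    match = re.fullmatch(rf"[{prefix_class}]([A-Za-z]+)\((.*)\)", text.strip())
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)
```

With this sketch, both "@sum(A1:A3)" and "!sum(A1:A3)" parse to the same function name and arguments, so the user is not blocked by a missing key.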

Example: Some keyboards do not support the entire set of characters required by the C programming language. The C programming language standard defines the following trigraph sequences so that programmers can create C programs on those keyboards. A trigraph is a sequence of three graphic characters that represents another character.


Character   #     [     \     ]     ^     {     |     }     ~
Trigraph    ??=   ??(   ??/   ??)   ??'   ??<   ??!   ??>   ??-
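
The trigraph table can be applied mechanically. The sketch below is illustrative only: it performs the substitution on a source string and is not how any particular C compiler implements it. The names TRIGRAPHS and replace_trigraphs are hypothetical.

```python
# The nine trigraph sequences from the table above.
TRIGRAPHS = {
    "??=": "#", "??(": "[", "??/": "\\", "??)": "]", "??'": "^",
    "??<": "{", "??!": "|", "??>": "}", "??-": "~",
}


def replace_trigraphs(source):
    """Scan left to right and replace each trigraph with its character."""
    out = []
    i = 0
    while i < len(source):
        chunk = source[i:i + 3]
        if chunk in TRIGRAPHS:
            out.append(TRIGRAPHS[chunk])
            i += 3
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

For example, replace_trigraphs("??=include <stdio.h>") yields "#include <stdio.h>".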

IBM has defined the Syntactic Character Set 00640, whose 81 graphic characters and the implicit SPACE character are present in many IBM character sets. The set was originally created for programmers to use for syntactic purposes, to maximize portability and interchangeability across systems and national boundaries.


Alphabet           ABCDEFGHIJKLMNOPQRSTUVWXYZ               26
                   abcdefghijklmnopqrstuvwxyz               26
Digits             0 1 2 3 4 5 6 7 8 9                      10
Specials           . , : ; ? ( ) ' " / - _ & + % * = < >    19
Total characters                                            81

Figure 2: IBM Syntactic Character Set 00640
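
As a rough illustration, a product's syntax can be checked against this set. The sketch below builds the 81 characters from Figure 2 and reports any character in a string that falls outside them. The names SYNTACTIC_SET_640 and non_portable_characters are hypothetical.

```python
import string

# The 81 graphic characters listed in Figure 2 (SPACE is implicit).
SYNTACTIC_SET_640 = frozenset(
    string.ascii_uppercase
    + string.ascii_lowercase
    + string.digits
    + ".,:;?()'\"/-_&+%*=<>"
)


def non_portable_characters(text):
    """Return the sorted characters in text that are outside set 00640."""
    return sorted(
        {ch for ch in text if ch != " " and ch not in SYNTACTIC_SET_640}
    )
```

Running non_portable_characters("@sum(A1)") returns ["@"], which matches the earlier spreadsheet example: the @ prefix is not part of the syntactic set.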

Guideline F2-1


Do not assume that lowercase English letters are in all coded character sets.

Example: The Japanese Katakana character set 00332 does not contain the lowercase English letters.

Many scripts, such as Chinese, Japanese, Thai, Arabic, and Hebrew, have no concept of letter case.
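
The absence of case in these scripts can be observed with Unicode case mappings. The sketch below is only an illustration: the helper name has_case is hypothetical, and the test is a simple heuristic (a character "has case" if upper- or lowercasing changes it).

```python
def has_case(ch):
    """True if upper- or lowercasing changes the character."""
    return ch.upper() != ch or ch.lower() != ch


# One sample character per script mentioned above, plus Latin for contrast.
SAMPLES = {
    "Latin": "a",
    "Chinese": "中",
    "Japanese": "ア",
    "Thai": "ก",
    "Arabic": "ب",
    "Hebrew": "א",
}

CASED_SCRIPTS = [name for name, ch in SAMPLES.items() if has_case(ch)]
```

Only the Latin sample is reported as cased, so logic that depends on converting text to lowercase or uppercase has no effect on text in the other scripts.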

Guideline F2-2


Never assume a character has the same code point across coded character sets.

Example: The graphic character 's' has a code point of X'73' in the ISO/IEC 8859-1 coded character set used by the workstation. The same character is represented by X'A2' in the IBM EBCDIC coded character set 00500 used by the host. When transferring files between the host and the workstation, conversion of code points must occur to preserve character integrity.
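
The conversion step can be sketched as follows. This assumes Python's built-in "cp500" and "latin-1" codecs as stand-ins for coded character set 00500 and ISO/IEC 8859-1. The function name host_to_workstation is hypothetical.

```python
def host_to_workstation(data: bytes) -> bytes:
    """Convert EBCDIC (code page 500) bytes to ISO/IEC 8859-1 bytes."""
    # Decode to characters first, then encode in the target character set,
    # so that each character keeps its identity even though its code point changes.
    return data.decode("cp500").encode("latin-1")


# The same character, two different code points.
WORKSTATION_S = "s".encode("latin-1")   # b'\x73'
HOST_S = "s".encode("cp500")            # b'\xa2'
```

Copying the host bytes to the workstation without this step would leave X'A2' in the file, which ISO/IEC 8859-1 interprets as a different character.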

Example: The double-byte SPACE character used in Asian coded character sets has the following code point assignments:


Script                CCSID   DBCS Code Page ID   Code Point
Japanese              942     301                 X'8140'
Korean                949     951                 X'A1A1'
Simplified Chinese    1381    1389                X'A1A1'
Traditional Chinese   950     947                 X'A140'
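
The values in the table can be reproduced approximately with Python's built-in codecs. The codecs used here (shift_jis, euc_kr, gb2312, big5) are assumptions: they are commonly available encodings chosen as stand-ins and are not exact implementations of the IBM CCSIDs listed above. The names STAND_IN_CODECS and double_byte_space_code_points are hypothetical.

```python
# U+3000 IDEOGRAPHIC SPACE is the Unicode character for the double-byte SPACE.
IDEOGRAPHIC_SPACE = "\u3000"

# Assumed stand-in codecs, not exact equivalents of the IBM CCSIDs.
STAND_IN_CODECS = {
    "Japanese": "shift_jis",
    "Korean": "euc_kr",
    "Simplified Chinese": "gb2312",
    "Traditional Chinese": "big5",
}


def double_byte_space_code_points():
    """Map each script to the hex code point of its double-byte SPACE."""
    return {
        script: IDEOGRAPHIC_SPACE.encode(codec).hex().upper()
        for script, codec in STAND_IN_CODECS.items()
    }
```

A program that searches for the double-byte SPACE by a fixed byte value would therefore work for at most some of these encodings.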

Even characters with control-like functions can have different code points on various platforms.

Example: The wild card character, usually represented by an asterisk '*', appears visually identical on the various platforms. It has a code point of X'2A' in the ISO/IEC 8859-1 coded character set, and X'5C' in the IBM EBCDIC coded character set 00500.

The reverse is also true in some situations: the same code point can represent different characters. The single-byte code point X'5C' in a file name is interpreted as the path separator in both the USA and Japanese editions of the IBM PC. On the USA edition running coded character set 00437 or 00850, X'5C' is the backslash '\' character, but on the Japanese edition running coded character set 00897 or 01041, X'5C' is the yen '¥' character.
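
One consequence is that the separator should be handled by its code point and displayed according to the active coded character set. The sketch below assumes a path made up of single-byte characters only; it would not be safe for double-byte data, where X'5C' can also occur as the second byte of a character. The lookup table simply restates the paragraph above, and all names are hypothetical.

```python
PATH_SEPARATOR_BYTE = 0x5C

# Glyph displayed for X'5C' under each coded character set named above.
GLYPH_FOR_5C = {
    "00437": "\\",
    "00850": "\\",
    "00897": "¥",
    "01041": "¥",
}


def split_path(raw: bytes):
    """Split a single-byte path on the separator code point X'5C'."""
    return raw.split(bytes([PATH_SEPARATOR_BYTE]))


def render_path(components, ccsid):
    """Join path components with the glyph the user sees for X'5C'."""
    return GLYPH_FOR_5C[ccsid].join(components)
```

The split is identical on both editions because it depends only on the code point; only the rendered glyph differs.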

Some organizations may change code page definitions over time, resulting in situations where the same code point in the code page can represent different characters. IBM practice is not to change the existing code page definition, although new characters may be assigned to the currently unassigned code points.

Example: In earlier versions of the ISO/IEC 646 international standard, 7-Bit Character Set for Information Interchange, the code point X'24' was defined to be either the dollar symbol '$' or the national currency symbol. North American systems used X'24' to represent the dollar sign, and systems in the United Kingdom used the same code point to represent the pound sterling sign '£'. To avoid confusion when financial figures are interchanged, IBM practice is to assign unique coded character set identifiers (CCSIDs) to different versions of the ISO/IEC 646 international standard.

Fonts

Modern displays and printers can obtain character shapes from files known as fonts. A font is a set of graphic characters with similar design characteristics; that is, a font is a designer's concept of how a set of graphic characters should appear. Fonts can reside in the device ROM, or be downloaded to the device RAM when needed.

There are two types of fonts: bitmap fonts and vector fonts. A bitmap font, also called a raster font, contains the actual shape of every character, individually hand crafted and tuned to a particular resolution. A vector font, sometimes called an outline font, does not contain the actual shapes; instead, each character shape is drawn from a series of vectors that can be scaled up or down.

Guideline F2-3


Do not assume a particular font contains all the characters in a coded character set.
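
A product can check coverage before choosing a font. The sketch below assumes the set of code points a font covers has already been obtained from the platform's font API; the coverage set used here is a made-up example, and the names are hypothetical.

```python
# Hypothetical coverage for an example font: Basic Latin and Latin-1 only.
EXAMPLE_LATIN_FONT_COVERAGE = frozenset(range(0x20, 0x100))


def missing_glyphs(text, covered_code_points):
    """Return the sorted characters in text that the font cannot display."""
    return sorted(
        {
            ch
            for ch in text
            if not ch.isspace() and ord(ch) not in covered_code_points
        }
    )
```

If missing_glyphs returns a non-empty list, the product should select a different or additional font instead of displaying substitute glyphs.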

All characters of a single design, regardless of attributes such as width, weight, posture, and size, are grouped into a type family. Popular Latin-script type families include Courier, Times New Roman, and Helvetica. Styles within a type family, such as normal, bold, and italic, are called typefaces.

There is no standard naming scheme for the type family names. The same font can be known by different names on various platforms.

Example: Microsoft Windows AvanteGarde is identical to IBM/Adobe ITC Avante Garde Gothic.

Some font names are translated. For example, Courier is known as Courrier on French systems.

Because of these variations, it is important not to hard-code the font type family name.

Guideline F2-4


Do not hard-code the font type family name.

One way to select the most appropriate font is to specify the font by its attributes or properties, such as those documented in the international standard ISO/IEC 9541, Font Information Interchange Standard. Examples of font properties include:

Font selection should be driven by text script or language. Software must display the data with the appropriate fonts for each language and country.

For example, web browsers already have a mechanism that allows the user to select fonts that are appropriate in terms of code point range, acceptable size and glyphs. When mechanisms such as style sheets are used, the software needs to ensure that the script being displayed causes an appropriate font to be selected. This means that if Japanese text is being presented, then the style sheet should specify a Japanese font to be used by default.
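
A minimal sketch of language-driven font selection follows. It assumes the font preferences live in a configuration table keyed by language tag. The family names are placeholders, not real fonts, and the names EXAMPLE_FONT_STACKS, font_stack_for, and css_rule are hypothetical.

```python
# Hypothetical configuration: language tag -> ordered font-family preferences.
# The family names are placeholders; a real product would load them from a
# locale-specific resource rather than embed them in code.
EXAMPLE_FONT_STACKS = {
    "ja": ["Example Gothic JP", "sans-serif"],
    "th": ["Example Thai", "sans-serif"],
    "und": ["sans-serif"],  # fallback when the language is unknown
}


def font_stack_for(language_tag, stacks=EXAMPLE_FONT_STACKS):
    """Find the font stack for a tag, falling back from 'ja-JP' to 'ja'."""
    tag = language_tag.lower()
    while tag:
        if tag in stacks:
            return stacks[tag]
        tag = tag.rpartition("-")[0]  # drop the last subtag
    return stacks["und"]


def css_rule(language_tag, families):
    """Build a style-sheet rule that applies the families to one language."""
    quoted = ", ".join(
        f'"{name}"' if " " in name else name for name in families
    )
    return f":lang({language_tag}) {{ font-family: {quoted}; }}"
```

Because the family names come from configuration, they can be replaced per platform or per locale without changing the code, which also satisfies Guideline F2-4.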