A UCMAP source file defines UCS-2 (Unicode) conversion mappings
for input to the uconvdef command. Conversion mapping values are defined
using UCS-2 symbolic character names followed by character encoding
(code point) values for the multibyte code set. For example:
<U0020> \x20
represents the mapping between the <U0020> UCS-2 symbolic
character name for the space character and the \x20 hexadecimal code
point for the space character in ASCII.
In addition to the code set mappings, directives are interpreted
by the uconvdef command to produce the compiled table. These directives
must precede the code set mapping section. They consist of the following
keywords surrounded by <> (angle brackets), starting in column
1, followed by white space and the value to be assigned to the symbol:
- <comment_char>
- Character used to denote the start of a comment. The default comment
character is <number_sign> (#). In ucmap source shipped by C/370, <percent_sign>
(%) is specified for <comment_char>.
- <escape_char>
- Character used to denote the start of an escape sequence. The default escape
character is <backslash> (\). In ucmap source shipped by C/370, <slash>
(/) is specified for <escape_char>.
- <code_set_name>
- The name of the coded character set, enclosed in quotation marks ("),
for which the character set description file is defined.
- <mb_cur_max>
- The maximum number of bytes in a multibyte character. The default
value is 1.
- <mb_cur_min>
- An unsigned positive integer value that defines the minimum
number of bytes in a character for the encoded character set. The
value is less than or equal to <mb_cur_max>. If not specified,
the minimum number is equal to <mb_cur_max>.
- <char_name_mask>
- A quoted string consisting of format specifiers for the UCS-2
symbolic names. The value must be AXXXX, indicating an alphabetic
character followed by 4 hexadecimal digits. The alphabetic character
must be U, and the hexadecimal digits must represent the UCS-2 code
point for the character. An example of a symbolic character name based
on this mask is <U0020>, the Unicode space character.
- <uconv_class>
- Specifies the type of the code set. It must be one of the following:
- SBCS
- Single-byte encoding
- DBCS
- Stateless double-byte, single-byte, or mixed encodings
- EBCDIC_STATEFUL
- Stateful double-byte, single-byte, or mixed encodings
- MBCS
- Stateless multibyte encoding
The uconvdef command uses this type to decide which kind of table to build.
The type is also stored in the table to indicate which processing
algorithm the UCS conversion methods use.
- <locale>
- Specifies the default locale name to be used if locale information
is needed.
- <subchar>
- Specifies the encoding of the default substitute character in
the multibyte code set.
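The directive rules above can be sketched in a few lines of Python. This is an illustration only, not how uconvdef itself is implemented: the function name parse_directives is hypothetical, the sample values in the usage note are made up, and the sketch assumes each directive sits on its own line with the value after white space. The defaults it fills in are the ones stated above.

```python
def parse_directives(lines):
    """Collect <keyword> value pairs that appear before the CHARMAP line."""
    directives = {}
    for line in lines:
        if line.strip() == "CHARMAP":
            break  # directives must precede the mapping section
        # A directive starts in column 1 with a keyword in angle brackets.
        if not line.startswith("<"):
            continue
        keyword, _, value = line.partition(">")
        # Drop the leading "<" and any quotation marks around the value.
        directives[keyword[1:]] = value.strip().strip('"')
    # Defaults described in the text above.
    directives.setdefault("comment_char", "#")
    directives.setdefault("escape_char", "\\")
    directives.setdefault("mb_cur_max", "1")
    directives.setdefault("mb_cur_min", directives["mb_cur_max"])
    return directives
```

For instance, passing the lines `<code_set_name> "IBM-932"`, `<mb_cur_max> 2`, and `CHARMAP` would yield a dictionary in which mb_cur_min falls back to "2" and comment_char to "#". Values are kept as strings; converting them to integers or bytes is left to the caller.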
The mapping definition section consists of a sequence of mapping
definition lines preceded by a CHARMAP declaration and terminated
by an END CHARMAP declaration. Empty lines and lines containing <comment_char>
in the first column are ignored.
Symbolic character names in mapping lines must follow the pattern
specified in the <char_name_mask>, except for the reserved symbolic
name, <unassigned>, that indicates the associated code points are
unassigned.
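As a rough illustration of the naming rule, a symbolic name can be checked with a regular expression. The helper name is_valid_symbolic_name is hypothetical, and accepting lowercase hexadecimal digits is an assumption; the text above does not say whether uconvdef accepts them.

```python
import re

# Assumption: hexadecimal digits may be upper- or lowercase.
_UCS2_NAME = re.compile(r"<U[0-9A-Fa-f]{4}>")


def is_valid_symbolic_name(name):
    """Return True for <Uxxxx> names and for the reserved <unassigned>."""
    return name == "<unassigned>" or _UCS2_NAME.fullmatch(name) is not None
```

Under these assumptions, `<U0020>` and `<unassigned>` pass the check, while `<X0020>` and `<U20>` do not.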
Each noncomment line of the character set mapping definition must
be in one of the following formats:
- This format defines a single symbolic character name and a corresponding
encoding.
"%s %s %s\n", <symbolic_name>, <encoding>, <comments>
For example:
<U3004> \x81\x57
The encoding part is expressed as one or more concatenated decimal,
hexadecimal, or octal constants in the following formats:
- "%cd%d",<escape_char>, <decimal byte value>
- "%cx%x",<escape_char>,<hexadecimal byte value>
- "%c%o",<escape_char>,<octal byte value>
Decimal constants are represented by two or more decimal digits
preceded by the escape character and the lowercase letter d, as in \d97 or \d143.
Hexadecimal constants are represented by two or more hexadecimal digits
preceded by an escape character and the lowercase letter x, as in \x61 or \x8f.
Octal constants are represented by two or more octal digits preceded
by an escape character.
Each constant represents a single-byte
value. When constants are concatenated for multibyte character values,
the last value specifies the least significant octet and preceding
constants specify successively more significant octets.
- This format defines a range of symbolic character names and corresponding
encodings. The range is interpreted as a series of symbolic names
formed from the alphabetic prefix and all the values in the range
defined by the numeric suffixes.
"%s...%s %s %s\n", <symbolic_name>, <symbolic_name>, <encoding>, <comments>
For example:
<U3003>...<U3006> \x81\x56
The listed encoding value is assigned to the first symbolic name, and subsequent
symbolic names in the range are assigned corresponding incremental
values. For example, the line:
<U3003>...<U3006> \x81\x56
is interpreted as:
<U3003> \x81\x56
<U3004> \x81\x57
<U3005> \x81\x58
<U3006> \x81\x59
- This format defines a range of one or more unassigned encodings.
"<unassigned> %s...%s %s\n", <encoding>, <encoding>, <comments>
For example, the line:
<unassigned> \x9b...\x9c
is interpreted as:
<unassigned> \x9b
<unassigned> \x9c
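The three line formats can be tried out with a small Python sketch. It is a reading of the rules above, not the uconvdef implementation: the names parse_encoding and expand_mapping_line are hypothetical, fields are assumed to be separated by white space, and incrementing is assumed to treat the whole encoding as one big-endian integer (the text does not say what happens when the last byte overflows). Expanded names are written with uppercase hexadecimal digits.

```python
def parse_encoding(text, escape_char="\\"):
    """Turn concatenated constants such as \\x81\\x57 into bytes."""
    parts = text.split(escape_char)
    if parts[0] != "":
        raise ValueError("encoding must start with the escape character")
    out = bytearray()
    for part in parts[1:]:
        if part.startswith("d"):      # decimal constant, e.g. \d97
            value = int(part[1:], 10)
        elif part.startswith("x"):    # hexadecimal constant, e.g. \x61
            value = int(part[1:], 16)
        else:                         # octal constant: digits only
            value = int(part, 8)
        out.append(value)             # bytearray rejects values above 255
    return bytes(out)


def _step(first, count):
    """Return `count` encodings counting up from `first`, same byte width."""
    base = int.from_bytes(first, "big")
    return [(base + i).to_bytes(len(first), "big") for i in range(count)]


def expand_mapping_line(line, escape_char="\\"):
    """Expand one noncomment mapping line into (name, encoding) pairs."""
    fields = line.split()
    name_field, enc_field = fields[0], fields[1]  # the rest is comments
    if name_field == "<unassigned>":
        first, _, last = enc_field.partition("...")
        lo = parse_encoding(first, escape_char)
        hi = parse_encoding(last or first, escape_char)
        count = int.from_bytes(hi, "big") - int.from_bytes(lo, "big") + 1
        return [("<unassigned>", enc) for enc in _step(lo, count)]
    first_name, _, last_name = name_field.partition("...")
    start = int(first_name[2:-1], 16)             # digits between <U and >
    end = int(last_name[2:-1], 16) if last_name else start
    encodings = _step(parse_encoding(enc_field, escape_char), end - start + 1)
    return [("<U%04X>" % (start + i), enc) for i, enc in enumerate(encodings)]
```

Calling expand_mapping_line on the range line from the example yields four pairs, from <U3003> with bytes 0x81 0x56 through <U3006> with bytes 0x81 0x59. Because the last constant is the least significant octet, the bytes are combined in big-endian order before incrementing.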