LC_COLLATE category
The LC_COLLATE category defines character or string collation information. Within LC_COLLATE, you can use the cpysyscol keyword to specify an existing sort sequence. The sort sequence named by the cpysyscol keyword is then used in place of the LC_COLLATE category definitions.
A collation element is the unit of comparison for collation. A collation element may be a character or a sequence of characters. Every collation element in the locale has a set of weights, which determine whether the collation element collates before, equal to, or after the other collation elements in the locale. The CRTLOCALE command assigns collation weights to each collation element when it creates the locale from the locale definition source file. Application programs that compare strings then use these collation weights.
Every character defined in the CCSID that is specified in the CRTLOCALE command is itself a collating element. Additional collating elements can be defined using the collating-element statement. The syntax is:
collating-element character-symbol from string
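As an illustration of how multi-character collating elements change the unit of comparison, the following Python sketch splits a string into collating elements. It is not the system's implementation: the function name split_collating_elements is hypothetical, and the longest-match rule it uses is an assumption made for the example.

```python
def split_collating_elements(text, multi_char_elements):
    """Split text into collating elements (illustrative sketch only).

    Assumption: when several defined elements match at the same
    position, the longest one is chosen.
    """
    # Try longer multi-character elements first so "ch" wins over "c".
    candidates = sorted(multi_char_elements, key=len, reverse=True)
    elements = []
    i = 0
    while i < len(text):
        for candidate in candidates:
            if text.startswith(candidate, i):
                elements.append(candidate)
                i += len(candidate)
                break
        else:
            # No multi-character element matches here, so the single
            # character is itself a collating element.
            elements.append(text[i])
            i += 1
    return elements


print(split_collating_elements("chico", {"ch"}))  # ['ch', 'i', 'c', 'o']
```

With no multi-character elements defined, every character is returned as its own collating element.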
The LC_COLLATE category begins with the LC_COLLATE keyword and ends with the END LC_COLLATE keyword.
The following keywords are recognized in the LC_COLLATE category:
- cpysyscol
- This statement specifies that a system collating sequence table is to be used for the collation information for the category. If the locale is intended to be used to set the sort sequence table for the job, the CPYSYSCOL keyword is required. If the CPYSYSCOL keyword is specified, no other keyword can be specified. The syntax for the CPYSYSCOL keyword is:
CPYSYSCOL sort sequence path name;langid
The sort sequence path name is a string specifying the fully expanded path name of an existing sort sequence table to use as the definition for this category. The path name delimiter must be a slash (/). Instead of a path name, the string can contain one of the following special values:
- *JOB
- The sort sequence of the job.
- *LANGIDUNQ
- The unique-weighted sort sequence table that is associated with the language identifier specified by langid.
- *LANGIDSHR
- The shared-weighted sort sequence table that is associated with the language identifier specified by langid.
- *HEX
- The sort sequence according to the hexadecimal value of the characters.
The langid is a string specifying the language identifier of the sort sequence table to be used. The langid value must be in uppercase. Valid values are strings containing one of the following:
- *JOB
- Use the language identifier of the job.
- language id
- A valid 3-character language identifier. For example, DAN is the language identifier for Danish.
- collating-element
- The collating-element statement specifies multi-character collating elements.
The syntax for the collating-element statement is:
collating-element symbolic-name from string
The symbolic-name value names a collating element that treats a string of characters as a single collating element. The symbolic-name value cannot duplicate any system predefined symbolic name or any other symbolic name defined in this collation definition. The string value specifies a string of two or more characters or character symbols that make up the collating element. The following are examples of the syntax for the collating-element statement:
collating-element <ch> from "<c><h>"
collating-element <e-acute> from "<acute><e>"
collating-element <11> from "<1><1>"
A symbolic-name value defined by the collating-element statement is recognized only within the LC_COLLATE category.
- order_start
- The order_start statement may be followed by one or more collation order statements, assigning collation weights to collating elements. This statement is required. The syntax for the order_start statement is:
order_start sort-rules;sort-rules;...sort-rules
collation-order-statements
order_end
The sort-rules have the following syntax:
directive, directive,...directive
where each directive is one of forward, backward, or position.
The sort-rules are optional. If present, they define the rules to apply during string comparison. The number of sort-rules specified defines the number of weights each collating element is assigned (that is, the number of collation orders in the locale). If no sort-rules are present, one forward directive is assumed.
If present, the first sort-rules value applies when comparing strings using the primary weight, the second when comparing strings using the secondary weight, and so on. The sort-rules values are separated by a semicolon (;). Each sort-rules value consists of one or more comma-separated directives. The following directives are supported:
- forward
- Specifies that collation weight comparisons proceed from the beginning of a string toward the end of the string.
- backward
- Specifies that collation weight comparisons proceed from the end of a string toward the beginning of the string.
- position
- Specifies that collation weight comparisons consider the relative position of non-ignored elements in the string. That is, if strings compare equal, the element with the shortest distance from the starting point of the string collates first.
The forward and backward directives are mutually exclusive. The following example shows the syntax for the sort-rules directives:
order_start forward;backward
- order_end
- This keyword ends collating order entries introduced by the order_start keyword.
The order of the characters and elements specified between the order_start and order_end keywords defines the character order used in range expressions and regular expressions. If no weights are assigned to the characters, then the character order also becomes the collation sequence weight.
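The effect of the forward and backward directives can be sketched in Python. This is a simplified model rather than the system's comparison routine: the WEIGHTS table, the function names, and the two-level weights are assumptions made for the example, and the position directive is not modeled.

```python
# Hypothetical two-level weights: (primary, secondary).
# "a" and "A" share a primary weight and differ only at the secondary level.
WEIGHTS = {"a": (1, 1), "A": (1, 2), "b": (2, 1), "B": (2, 2)}


def collation_key(text, weights, sort_rules):
    """Build one list of weights per level, honoring forward/backward."""
    key = []
    for level, rule in enumerate(sort_rules):
        level_weights = [weights[ch][level] for ch in text]
        if rule == "backward":
            # Compare this level from the end of the string toward the start.
            level_weights.reverse()
        key.append(level_weights)
    return key


def sort_words(words, sort_rules):
    """Sort words by comparing level 1 first, then level 2, and so on."""
    return sorted(words, key=lambda w: collation_key(w, WEIGHTS, sort_rules))


words = ["AB", "Ab", "aB", "ab"]
print(sort_words(words, ["forward", "forward"]))   # ['ab', 'aB', 'Ab', 'AB']
print(sort_words(words, ["forward", "backward"]))  # ['ab', 'Ab', 'aB', 'AB']
```

All four words have equal primary weights, so the secondary level decides the order. With backward at the second level, the difference nearest the end of the string is considered first, which swaps aB and Ab.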
Special symbols
Special symbols must be written entirely in uppercase. The following special symbols can be used in the LC_COLLATE category:
- IGNORE
The optional operands for each collation element are used to define the primary, secondary, or subsequent weights for the collating element. The special symbol IGNORE is used to indicate a collating element that is to be ignored when strings are compared.
- UNDEFINED
All characters in the character set must be placed in the collation order, either explicitly or implicitly by using the UNDEFINED symbol. The UNDEFINED symbol includes all coded character set values not specified explicitly. These characters are inserted in the character collation order at the point indicated by the UNDEFINED symbol, in the order of their character code page values. If a collating weight is not explicitly specified for the UNDEFINED symbol, then by default all of the undefined characters are assigned the same collating weight, equal to the relative order of the first undefined character in the collating sequence. If no UNDEFINED special symbol exists and the collation order does not specify all collation elements from the coded character set, a warning is issued, and all undefined characters are placed at the end of the character collation order and given the same collating weight.
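The following Python sketch models how IGNORE and UNDEFINED can affect a comparison at one level. It is an illustration under stated assumptions, not the system's algorithm: the names IGNORE, level_key, and WEIGHTS, and the numeric weights, are hypothetical.

```python
IGNORE = None  # Stand-in for the IGNORE special symbol in this sketch.

# Hypothetical single-level weights; "-" is explicitly ignored.
WEIGHTS = {"c": (1,), "o": (2,), "p": (3,), "-": (IGNORE,)}


def level_key(text, weights, undefined, level=0):
    """Return the weights used to compare text at one level.

    Characters missing from the table take the weights given for
    UNDEFINED; elements whose weight is IGNORE contribute nothing.
    """
    key = []
    for ch in text:
        weight = weights.get(ch, undefined)[level]
        if weight is not IGNORE:
            key.append(weight)
    return key


# UNDEFINED IGNORE: undefined characters are skipped during comparison.
print(level_key("co-op", WEIGHTS, (IGNORE,)))  # [1, 2, 2, 3]
print(level_key("co_op", WEIGHTS, (IGNORE,)))  # [1, 2, 2, 3]

# UNDEFINED with one shared weight: "_" and "#" compare equal.
print(level_key("c_", WEIGHTS, (99,)))  # [1, 99]
print(level_key("c#", WEIGHTS, (99,)))  # [1, 99]
```

In the first case, "co-op", "co_op", and "coop" all produce the same key, so they compare equal at this level. In the second case, every undefined character shares the single weight 99, an arbitrary value chosen for the example.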
Example 1
Here is an example of a collation order statement in the LC_COLLATE locale definition source file category.
The explanatory text (the lines beginning with #) that follows the LC_COLLATE keywords has been added for clarity and does not appear in the locale source file.
order_start forward;backward
# The order_start has two sort rules specified:
# forward and backward
UNDEFINED IGNORE;IGNORE
# The UNDEFINED special symbol indicates that
# all characters in the CCSID of the locale
# that are not specified in the definition
# are ignored for collation purposes.
<LOW>
# <LOW> is a collating symbol that is ordered
# after all undefined characters. For example, if there
# were only two undefined characters, then the <LOW> symbol
# would be third in the order.
# All collating elements between <space> and <a> have the
# same primary equivalence class and individual secondary
# weights based on their coded character set values.
<a> <a>;<a>
<a-acute> <a>;<a-acute>
<a-grave> <a>;<a-grave>
<A> <a>;<A>
<A-acute> <a>;<A-acute>
<A-grave> <a>;<A-grave>
# All characters between <a> and <A-grave> belong to the
# same primary equivalence class because they have the same
# primary weight.
<ch> <ch>;<ch>
<Ch> <ch>;<Ch>
# The <c><h> multi-character collating element is
# represented by the <ch> collating symbol and belongs to the
# same primary equivalence class as the <Ch> multi-character
# collating element.
<s> <s>;<s>
<eszet> "<s><s>";<s>
# A one-to-many mapping is indicated by the <eszet>
# character collated as an <s><s> string. That is, one
# <eszet> character is expanded to <s><s> characters before
# comparing.
<HIGH>
order_end
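The primary weights in Example 1 can be modeled with a short Python sketch. It is a simplification, not the system's implementation: it assumes that the weight of a symbol is its position among the entries shown in the example, and the names ORDER, POSITION, PRIMARY, and primary_key are hypothetical.

```python
# Collating elements in the order they appear in Example 1.
ORDER = ["<a>", "<a-acute>", "<a-grave>", "<A>", "<A-acute>", "<A-grave>",
         "<ch>", "<Ch>", "<s>", "<eszet>"]

# Assumption: a symbol's weight is its position in the order above.
POSITION = {symbol: index for index, symbol in enumerate(ORDER)}

# Primary weight operand of each entry, written as a list of symbols.
PRIMARY = {
    "<a>": ["<a>"], "<a-acute>": ["<a>"], "<a-grave>": ["<a>"],
    "<A>": ["<a>"], "<A-acute>": ["<a>"], "<A-grave>": ["<a>"],
    "<ch>": ["<ch>"], "<Ch>": ["<ch>"],
    "<s>": ["<s>"],
    "<eszet>": ["<s>", "<s>"],  # One-to-many mapping.
}


def primary_key(elements):
    """Expand each collating element to its primary weight or weights."""
    key = []
    for element in elements:
        key.extend(POSITION[symbol] for symbol in PRIMARY[element])
    return key


print(primary_key(["<eszet>"]))     # [8, 8]
print(primary_key(["<s>", "<s>"]))  # [8, 8]
print(primary_key(["<A-grave>"]))   # [0]
```

At the primary level, <eszet> compares equal to the string <s><s>, and <A-grave> falls in the same equivalence class as <a>. The secondary weights, compared backward in Example 1, would then separate them.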
Example 2
Here is an example of a CPYSYSCOL statement in the LC_COLLATE locale definition source file category.
LC_COLLATE
CPYSYSCOL "//QSYS.LIB//QLA10025S.TBL";"ENU"
END LC_COLLATE