Word Separator Tables

The single-byte character set (SBCS) EBCDIC code page is taken from the CCSID of the current job. The characters in the word list are mapped from the job's code page to the multinational code page 500, except for Greek and Turkish: Greek is mapped to code page 875, and Turkish is mapped to code page 1026.
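
The remapping described above can be sketched with Python's standard EBCDIC codecs. This is a minimal sketch, not the system's implementation: the function names target_codec and remap_word, the language argument, and the choice of code page 37 as the job's code page are assumptions made for illustration.

```python
def target_codec(language: str = "") -> str:
    """Pick the code page the word list is mapped to (500, 875, or 1026)."""
    special = {"greek": "cp875", "turkish": "cp1026"}
    return special.get(language.lower(), "cp500")


def remap_word(word: bytes, job_codec: str, language: str = "") -> bytes:
    """Re-encode SBCS EBCDIC bytes from the job's code page to the target one."""
    # Assumption: every character of the word exists in the target code page;
    # if one does not, encode() raises UnicodeEncodeError.
    return word.decode(job_codec).encode(target_codec(language))


# "!" is X'5A' in code page 37 but X'4F' in code page 500.
print(remap_word(b"\x5a", "cp037").hex())  # prints 4f
```

Letters and digits keep the same code points between code pages 37 and 500; only some punctuation moves, which is why the remapping matters for delimiter lookup.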


Delimiter Categories

Each character in a code page is assigned to one delimiter category: always a delimiter, never a delimiter, or sometimes a delimiter, as shown in the following tables.

Table 1. Always Delimiters

Category  Context-Dependent Definition  Description          Examples (CP500)
A         --                            Basic delimiter (1)  Blank
D         --                            Other delimiters     { ) &
(1) This category can contain only one character.

Table 2. Never Delimiters

Category  Context-Dependent Definition  Description                                                   Examples (CP500)
L         --                            Lowercase alphabetic characters                               a b c
U         --                            Uppercase alphabetic characters                               A B C
N         --                            Numeric characters, not including superscripts or subscripts  0 1 2

Table 3. Sometimes Delimiters

Category  Context-Dependent Definition  Description                                                                                                     Examples (CP500)
E         { e | a < e < a } (1, 2)      Set of punctuation characters not treated as delimiters when surrounded by alphabetic characters                '
F         { f | n < f < n } (3)         Set of punctuation characters not treated as delimiters when surrounded by numerics                             . , : / -
G         { g | p < g < n } (4)         Set of punctuation characters not treated as delimiters when preceded by punctuation and followed by a numeric  . , $
H         { h | (p < h) and (h = p) }   Set of punctuation characters not treated as delimiters when immediately preceded by an identical character     . , : / - * = # %
I         { i | F and H }               i is an element of F and H                                                                                      . , : / -
J         { j | E and F and H }         j is an element of E and F and H                                                                                None
K         { k | F and G and H }         k is an element of F and G and H                                                                                . ,
Notes:
  1. a = any character in L or U
  2. < means precedes
  3. n = any character in N
  4. p = punctuation (nonalphabetic and nonnumeric)
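
The context rules for categories E through H can be sketched as a small predicate. This is a simplified illustration under stated assumptions: it works on Unicode strings rather than EBCDIC bytes, the sets contain only the CP500 example characters listed in Table 3 (with the apostrophe, X'7D', for E), a blank counts as punctuation because the notes define punctuation as nonalphabetic and nonnumeric, and the special handling of runs of periods, exclamation points, and question marks is not modeled. The names is_delimiter, is_alpha, is_numeric, and is_punct are hypothetical.

```python
DIGITS = frozenset("0123456789")  # N: no superscripts or subscripts

# Sometimes-delimiter sets, limited to the CP500 examples in Table 3.
E = frozenset("'")
F = frozenset(".,:/-")
G = frozenset(".,$")
H = frozenset(".,:/-*=#%")


def is_alpha(c: str) -> bool:  # a: any character in L or U
    return c.isalpha()


def is_numeric(c: str) -> bool:  # n: any character in N
    return c in DIGITS


def is_punct(c: str) -> bool:  # p: nonalphabetic and nonnumeric
    return bool(c) and not is_alpha(c) and not is_numeric(c)


def is_delimiter(text: str, i: int) -> bool:
    """Return True if text[i] acts as a delimiter in its context."""
    c = text[i]
    if is_alpha(c) or is_numeric(c):
        return False  # L, U, N are never delimiters
    prev = text[i - 1] if i > 0 else ""  # "" means no neighbor
    nxt = text[i + 1] if i + 1 < len(text) else ""
    if c in E and is_alpha(prev) and is_alpha(nxt):
        return False  # a < e < a
    if c in F and is_numeric(prev) and is_numeric(nxt):
        return False  # n < f < n
    if c in G and is_punct(prev) and is_numeric(nxt):
        return False  # p < g < n
    if c in H and prev == c:
        return False  # preceded by an identical character
    return True


for sample, index in [("don't", 3), ("3.14", 1), ("a.b", 1)]:
    print(sample, index, is_delimiter(sample, index))
```

Under these rules the apostrophe in "don't" and the period in "3.14" stay inside their tokens, while the period in "a.b" separates two tokens.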


Considerations for the Sometimes Delimiters Table

Categories E through K are usually called possible delimiters because they function as delimiters only in certain contexts.

Characters . (period), ! (exclamation point), and ? (question mark) have a special status. When identical characters from this category occur together in a sequence, the individual characters do not act as delimiters; the entire sequence of characters forms a single token. However, the sequence of characters taken together does act as a delimiter because the sequence forms a token separate from characters that precede and follow it. For example, the text streams:

A simple token table is a 256-element array of unsigned characters. The category of a character is found in the element indexed by that character's code point. Each code point (character) must be assigned exactly one category. If, according to the above definition of sets, a character is a member of more than one category, it should be assigned to the highest-level category (that is, the category whose letter name comes latest in alphabetical order).
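
The "highest-level category" rule can be illustrated by filling part of a 256-element table. This is a sketch only: it assigns categories to the blank and to the example punctuation characters from the sometimes-delimiter sets, leaves every other element at 0 as a placeholder, and uses Python's cp500 codec to obtain code points. The names SETS, COMBINED, categories_of, and build_partial_table are hypothetical.

```python
# Example members of the sometimes-delimiter sets (CP500 examples only).
SETS = {"E": "'", "F": ".,:/-", "G": ".,$", "H": ".,:/-*=#%"}
# Combined categories: a character belongs if it is in every listed set.
COMBINED = {"I": ("F", "H"), "J": ("E", "F", "H"), "K": ("F", "G", "H")}


def categories_of(ch: str) -> set:
    """Return every category letter the character is a member of."""
    cats = {name for name, members in SETS.items() if ch in members}
    for name, parts in COMBINED.items():
        if all(part in cats for part in parts):
            cats.add(name)
    return cats


def build_partial_table(codec: str = "cp500") -> bytearray:
    """Build a 256-element table; 0 marks elements this sketch leaves unset."""
    table = bytearray(256)
    table[" ".encode(codec)[0]] = ord("A")  # the blank is the basic delimiter
    for ch in sorted(set("".join(SETS.values()))):
        code_point = ch.encode(codec)[0]
        # Highest-level category = latest letter in alphabetical order.
        table[code_point] = ord(max(categories_of(ch)))
    return table


table = build_partial_table()
print(chr(table[0x4B]))  # "." is X'4B' in code page 500; prints K
```

The period is a member of F, G, and H, and therefore of I and K, so it is stored as K; the colon is a member of F and H only, so it is stored as I.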

The categories assigned to each character in the three code pages are shown in the following tables. See the topic that shows the code pages in i5/OS globalization to refer to the characters that match these tables.

Table 4. Simple Token Table for Code Page 500

Hex digits: the 1st digit runs across (columns); the 2nd digit runs down (rows).
4- 5- 6- 7- 8- 9- A- B- C- D- E- F-
-0 A D I L U D D D D D D N
-1 D L I U L L D G U U D N
-2 L L U U L L L G U U U N
-3 L L U U L L L E U U U N
-4 L L U U L L L D U U U N
-5 L L U U L L L D U U U N
-6 L L U U L L L D U U U N
-7 L L U U L L L D U U U N
-8 L L U U L L L D U U U N
-9 L L U D L L L D U U U N
-A D D D I D D D D D D D D
-B K G K H D D D D L L U U
-C D H H D L L U D L L U U
-D D D D E L D U D L L U U
-E D D D H L U U E L L U U
-F H D H D D D D D L L U D
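
Reading the grid can be sketched as follows: the first hex digit of a code point selects the column and the second hex digit selects the row. The sixteen rows of Table 4 are taken in order as second digits 0 through F. The names CP500_ROWS and category are hypothetical, and the use of Python's cp500 codec to produce sample bytes is an assumption for illustration.

```python
# One string per second hex digit (0-F); each string holds the
# entries for first hex digits 4 through F, copied from Table 4.
CP500_ROWS = [
    "ADILUDDDDDDN",  # -0
    "DLIULLDGUUDN",  # -1
    "LLUULLLGUUUN",  # -2
    "LLUULLLEUUUN",  # -3
    "LLUULLLDUUUN",  # -4
    "LLUULLLDUUUN",  # -5
    "LLUULLLDUUUN",  # -6
    "LLUULLLDUUUN",  # -7
    "LLUULLLDUUUN",  # -8
    "LLUDLLLDUUUN",  # -9
    "DDDIDDDDDDDD",  # -A
    "KGKHDDDDLLUU",  # -B
    "DHHDLLUDLLUU",  # -C
    "DDDELDUDLLUU",  # -D
    "DDDHLUUELLUU",  # -E
    "HDHDDDDDLLUD",  # -F
]


def category(code_point: int) -> str:
    """Look up the category letter for a code point X'40' through X'FF'."""
    first, second = code_point >> 4, code_point & 0x0F
    if not 0x4 <= first <= 0xF:
        raise ValueError("Table 4 lists only code points X'40' through X'FF'")
    return CP500_ROWS[second][first - 4]


word = "A1.".encode("cp500")  # X'C1', X'F1', X'4B'
print([category(b) for b in word])  # prints ['U', 'N', 'K']
```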

Table 5. Simple Token Table for Code Page 875 Greek Support

Hex digits: the 1st digit runs across (columns); the 2nd digit runs down (rows).
4- 5- 6- 7- 8- 9- A- B- C- D- E- F-
-0 A D I D D D E G D D D N
-1 U U I U L L D L U U   N
-2 U U U U L L L L U U U N
-3 U U U U L L L L U U U N
-4 U U U D L L L L U U U N
-5 U U U U L L L L U U U N
-6 U U U U L L L L U U U N
-7 U U U U L L L L U U U N
-8 U U U U L L L L U U U N
-9 U U U D L L L L U U U N
-A D D D I L L L L D D D D
-B K G K H L L L L L D D D
-C D D H D L L L L L      
-D D D D E L L L L L E    
-E D D D H L L L L D D D D
-F H D H D L L L L D D D D

Table 6. Simple Token Table for Code Page 1026 Turkish Support

Hex digits: the 1st digit runs across (columns); the 2nd digit runs down (rows).
4- 5- 6- 7- 8- 9- A- B- C- D- E- F-
-0 A D I L U D L D L D L N
-1 D L I U L L L G U U D N
-2 L L U U L L L G U U U N
-3 L L U U L L L E U U U N
-4 L L U U L L L D U U U N
-5 L L U U L L L D U U U N
-6 L L U U L L L D U U U N
-7 L L U U L L L D U U U N
-8 D L D U L L L D U U U N
-9 L L U D L L L D U U U N
-A U U L I D D D D D D D D
-B K U K U D D D D L L U U
-C D H H U D L D D D D H D
-D D D D E G D G D L L U U
-E D D D H D U D E L L U U
-F H H H U D D D D L L U D


Notes:

  1. (1) PCFILE is a file assigned the document type of PCFILE (Document Interchange Architecture type of 14) by the Client Access program.

