Regular expressions (regexp)

Many z/OS shell commands match strings of text in text files using a type of pattern known as a regular expression. A regular expression lets you find strings in text files not only by direct match but also by extended matches, which are similar to, but much more powerful than, the file name patterns described in sh.

The newline character at the end of each input line is never explicitly matched by any regular expression or part thereof.

expr and ed take basic regular expressions. grep and sed also take basic regular expressions by default, but accept extended regular expressions if the -E option is used. All other shell commands accept extended regular expressions.

Regular expressions can be made up of normal characters or special characters, sometimes called metacharacters. Basic and extended regular expressions differ only in the metacharacters they can contain.
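The difference between the two flavors can be seen with a single pattern. The following is a minimal sketch, assuming a POSIX-conforming grep; the sample function and the other function names are illustrative and not part of any command.

```shell
# Illustrative input: one line of repeated letters, one line with a literal "+".
sample() {
  printf '%s\n' 'aaa' 'a+b'
}

# Basic regular expression: + is an ordinary character,
# so the pattern matches only the text "a+".
basic_plus() { sample | grep 'a+'; }

# Extended regular expression: + is a metacharacter meaning "one or more",
# so the pattern matches any line containing at least one a.
extended_plus() { sample | grep -E 'a+'; }

basic_plus      # prints: a+b
extended_plus   # prints: aaa and a+b
```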

The basic regular expression metacharacters are:

^ $ . * \( \) [ \{ \} \

The extended regular expression metacharacters are:

| ^ $ . * + ? ( ) [ { } \

These have the following meanings:

.

A dot character matches any single character of the input line.

^

The ^ character does not match any character but represents the beginning of the input line. For example, ^A is a regular expression matching the letter A at the beginning of a line. The ^ character is only special at the beginning of a regular expression, or after a ( or |.

$

This does not match any character but represents the end of the input line. For example, A$ is a regular expression matching the letter A at the end of a line. The $ character is only special at the end of a regular expression, or before a ) or |.
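As a rough illustration of the dot and the two anchors, the sketch below filters three sample lines with grep. The sample data and function names are made up for this example.

```shell
# Illustrative input: three lines, only some of which start or end with A.
sample() {
  printf '%s\n' 'Apple' 'BANANA' 'A'
}

# Lines that begin with A.
starts_with_a() { sample | grep '^A'; }

# Lines that end with A.
ends_with_a() { sample | grep 'A$'; }

# Lines that consist of exactly one character (the newline is not counted).
single_char() { sample | grep '^.$'; }

starts_with_a   # prints: Apple and A
ends_with_a     # prints: BANANA and A
single_char     # prints: A
```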

[bracket-expression]

A bracket expression enclosed in square brackets is a regular expression that matches a single character, or collation element. This bracket expression applies not only to regular expressions, but also to pattern matching as performed by the fnmatch() function (used in file name expansion).

If the initial character is a circumflex (^), then this bracket expression is complemented. It matches any character or collation element except for the expressions specified in the bracket expression. For pattern matching, as performed by the fnmatch() function, this initial character is instead ! (the exclamation mark).
If the first character after any potential circumflex is either a dash (-), or a closing square bracket (]), then that character matches exactly that character—that is, a literal dash or closing square bracket.
You can specify collation sequences by enclosing their name inside square brackets and periods. For example, [.ch.] matches the multicharacter collation sequence ch (if the current language supports that collation sequence). A single character stands for itself; for example, [.a.] matches a. Do not give a collation sequence that is not part of the current locale.
Equivalence classes can be specified by enclosing a character or collation sequence inside square bracket equals. For example, [=a=] matches any character in the same equivalence class as a. This normally expands to all the variants of a in the current locale, for example a, ä, á, and so on. In some locales it might include both the uppercase and lowercase of a given character. In the POSIX locale, this always expands to only the character given.
Within a character class expression (one made with square brackets), the following constructs can be used to represent sets of characters. These constructs are used for globalization and handle the different collation sequences as required by POSIX.
[:alpha:]

Any alphabetic character.

[:lower:]

Any lowercase alphabetic character.

[:upper:]

Any uppercase alphabetic character.

[:digit:]

Any digit character.

[:alnum:]

Any alphanumeric character (alphabetic or digit).

[:space:]

Any white space character (for example, blank, horizontal tab, or vertical tab).

[:graph:]

Any printable character, except the blank character.

[:print:]

Any printable character, including the blank character.

[:punct:]

Any printable character that is not white space or alphanumeric.

[:cntrl:]

Any nonprintable character.
For example, given the character class expression:
```
[:alpha:]
```
you need to enclose the expression within another set of square brackets, as in:
```
/[[:alpha:]]/
```
Character ranges are specified by a dash (-) between two characters or collation sequences. A range indicates all characters or collation sequences that collate between the two endpoints. It does not refer to the native character set. For example, in the POSIX locale, [a-z] means all the lowercase alphabetics, even if they don't agree with the binary machine ordering. However, because many other locales do not collate in this manner, the use of ranges is not recommended, and ranges are not used in strictly conforming POSIX.2 applications. An endpoint of a range can explicitly be a collation sequence; for example, [[.ch.]-[.ll.]] is valid. However, equivalence classes and character classes cannot be endpoints: [[=a=]-z] is not permitted.
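The bracket-expression rules above can be tried out with grep. This is only a sketch: the sample lines and function names are invented for illustration, and LC_ALL=C is set on each grep call so that the character classes behave as in the POSIX locale.

```shell
# Illustrative input: letters, digits, a dash, and a closing square bracket.
sample() {
  printf '%s\n' 'abc' '123' 'a-1' ']'
}

# Character class inside a bracket expression: lines with an alphabetic character.
has_alpha() { sample | LC_ALL=C grep '[[:alpha:]]'; }

# Complemented bracket expression: lines with at least one non-digit.
has_non_digit() { sample | LC_ALL=C grep '[^[:digit:]]'; }

# A closing square bracket placed first is taken literally.
has_close_bracket() { sample | LC_ALL=C grep '[]]'; }

# A dash placed first is taken literally.
has_dash() { sample | LC_ALL=C grep '[-+]'; }

has_alpha           # prints: abc and a-1
has_non_digit       # prints: abc, a-1 and ]
has_close_bracket   # prints: ]
has_dash            # prints: a-1
```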

\

This character turns off the special meaning of metacharacters. For example, \. only matches a dot character. Note that \\ matches a literal \ character. Also note the special case of \d, which is described under its own entry.
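A small sketch of escaping, using grep with made-up sample lines and function names. The patterns are single-quoted so that the shell passes the backslashes to grep unchanged.

```shell
# Illustrative input: a dot, a letter, and a backslash between a and c.
sample() {
  printf '%s\n' 'a.c' 'abc' 'a\c'
}

# Unescaped dot: matches any single character between a and c.
unescaped_dot() { sample | grep 'a.c'; }

# Escaped dot: matches only a literal dot.
escaped_dot() { sample | grep 'a\.c'; }

# Doubled backslash: matches one literal backslash.
literal_backslash() { sample | grep 'a\\c'; }

unescaped_dot       # prints all three lines
escaped_dot         # prints: a.c
literal_backslash   # prints: a\c
```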

\d

For d representing any single decimal digit (from 1 to 9), this pattern is equivalent to the string matching the dth expression enclosed within the ( ) characters (or \( \) for some commands) found at an earlier point in the regular expression. Parenthesized expressions are numbered by counting ( characters from the left.

Constructs of this form can be used in the replacement strings of substitution commands (for example, the sub function of awk), to stand for constructs matched by parts of the regular expression.
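The sketch below shows a back-reference inside a pattern (grep) and inside a replacement string (sed), both using basic regular expression syntax. The sample text and function names are illustrative only.

```shell
# Illustrative input: one line repeats "morty", the other does not.
sample() {
  printf '%s\n' 'morty and morty' 'morty and rick'
}

# \1 must match the same text that the first \( \) group matched.
repeated_word() { sample | grep '\(morty\).*\1'; }

# In a sed replacement, \1 and \2 stand for the text matched by each group.
swap_halves() { printf '%s\n' 'left-right' | sed 's/\(.*\)-\(.*\)/\2-\1/'; }

repeated_word   # prints: morty and morty
swap_halves     # prints: right-left
```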

regexp*

A regular expression regexp followed by * matches a sequence of zero or more strings that match regexp. For example, A* matches A, AA, AAA and so forth. It also matches the null string (zero occurrences of A).

regexp+

A regular expression regexp followed by + matches a sequence of one or more strings that match regexp.

regexp?

A regular expression regexp followed by ? matches zero or one occurrence of a string that matches regexp.
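The three repetition operators can be compared side by side. This sketch uses grep -E so that + and ? are available; the sample words and function names are assumptions made for the example, and each pattern is anchored so that the whole line must match.

```shell
# Illustrative input: c and t separated by zero, one, or two a characters.
sample() {
  printf '%s\n' 'ct' 'cat' 'caat'
}

# a* allows zero or more a characters.
zero_or_more() { sample | grep -E '^ca*t$'; }

# a+ requires at least one a.
one_or_more() { sample | grep -E '^ca+t$'; }

# a? allows at most one a.
zero_or_one() { sample | grep -E '^ca?t$'; }

zero_or_more   # prints: ct, cat and caat
one_or_more    # prints: cat and caat
zero_or_one    # prints: ct and cat
```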

char{n} | char\{n\}

In this expression (and the ones to follow), char is a regular expression that stands for a single character—for example, a literal character or a period (.). Such a regular expression followed by a number in brace brackets stands for that number of repetitions of a character. For example, X\{3\} stands for XXX. In basic regular expressions, in order to reduce the number of special characters, { and } must be escaped by the \ character to make them special, as shown in the second form (and the ones to follow).

char{min,} | char\{min,\}

When a number, min, followed by a comma appears in braces following a single-character regular expression, it stands for at least min repetitions of a character. For example, X\{3,\} stands for at least three repetitions of X.

char{min,max} | char\{min,max\}

When a single-character regular expression is followed by a pair of numbers in braces, it stands for at least min repetitions and no more than max repetitions of a character. For example, X\{3,7\} stands for three to seven repetitions of X.
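The interval forms can be sketched with grep, using the escaped braces of basic regular expressions for the first two patterns and the unescaped braces of extended regular expressions (grep -E) for the third. The sample lines and function names are illustrative.

```shell
# Illustrative input: runs of 2, 3, 5, and 8 X characters.
sample() {
  printf '%s\n' 'XX' 'XXX' 'XXXXX' 'XXXXXXXX'
}

# Exactly three repetitions (basic syntax).
exactly_three() { sample | grep '^X\{3\}$'; }

# At least three repetitions (basic syntax).
at_least_three() { sample | grep '^X\{3,\}$'; }

# Three to seven repetitions (extended syntax).
three_to_seven() { sample | grep -E '^X{3,7}$'; }

exactly_three    # prints: XXX
at_least_three   # prints: XXX, XXXXX and XXXXXXXX
three_to_seven   # prints: XXX and XXXXX
```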

regexp1 | regexp2

This expression matches either regular expression regexp1 or regexp2.

(regexp) | \(regexp\)

This lets you group parts of regular expressions. Except where overridden by parentheses, concatenation has higher precedence than alternation, and the repetition operators have higher precedence than concatenation. In basic regular expressions, in order to reduce the number of special characters, ( and ) must be escaped by the \ character to make them special, as shown in the second form.

Several regular expressions can be concatenated to form a larger regular expression.
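The effect of grouping on alternation can be sketched as follows, using grep -E. Without parentheses the | operator splits the whole pattern in two; with parentheses it applies only to the grouped part. The sample lines and function names are invented for this example.

```shell
# Illustrative input lines.
sample() {
  printf '%s\n' 'abd' 'acd' 'ab' 'cd'
}

# Ungrouped: read as "^ab" or "cd$", because concatenation
# binds more tightly than alternation.
ungrouped() { sample | grep -E '^ab|cd$'; }

# Grouped: a, then b or c, then d, and nothing else on the line.
grouped() { sample | grep -E '^a(b|c)d$'; }

ungrouped   # prints all four lines
grouped     # prints: abd and acd
```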

Summary

The commands that use basic and extended regular expressions are as follows:

Basic: ed, expr, grep, sed
Extended: awk, grep with the -E option, sed with the -E option.

Table 1 summarizes the features that apply to the applicable shell commands.

Table 1. Regular Expression Features (regexp)
Notation	awk	ed	grep -E	expr	sed
.	Yes	Yes	Yes	Yes	Yes
^	Yes	Yes	Yes	No	Yes
$	Yes	Yes	Yes	Yes	Yes
[…]	Yes	Yes	Yes	Yes	Yes
[::]	Yes	Yes	Yes	Yes	Yes
re*	Yes	Yes	Yes	Yes	Yes
re+	Yes	No	Yes	No	No
re?	Yes	No	Yes	No	No
re|re	Yes	No	Yes	No	No
\d	Yes	Yes	Yes	Yes	Yes
(…)	Yes	No	Yes	No	No
\(…\)	No	Yes	No	Yes	Yes
\<	No	No	No	No	No
\>	No	No	No	No	No
\{ \}	Yes	No	Yes	No	Yes

Examples

The following patterns are given as illustrations, along with descriptions of what they match:

abc: Matches any line of text containing the three letters abc in that order.
a.c: Matches any string beginning with the letter a, followed by any character, followed by the letter c.
^.$: Matches any line containing exactly one character (the newline is not counted).
a(b*|c*)d: Matches any string beginning with a letter a, followed by either zero or more of the letter b, or zero or more of the letter c, followed by the letter d.
.* [a-z]+ .*: Matches any line containing a word, consisting of lowercase alphabetic characters, delimited by at least one space on each side.
(morty).*\1 and morty.*morty: These expressions both match lines containing at least two occurrences of the string morty.
[[:space:][:alnum:]]: Matches any character that is either a white space character or alphanumeric.
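Two of the patterns above can be checked with grep -E, as in the following sketch. The input lines and function names are made up for the example, and LC_ALL=C is set so that [a-z] covers exactly the lowercase letters, as it does in the POSIX locale.

```shell
# Illustrative input lines.
sample() {
  printf '%s\n' 'abbd' 'accd' 'abcd' 'say hello there' 'NO WORDS HERE'
}

# a, then a run of b characters or a run of c characters (possibly empty), then d.
# abcd does not match because it mixes b and c between a and d.
b_or_c_run() { sample | LC_ALL=C grep -E 'a(b*|c*)d'; }

# A lowercase word with at least one space on each side.
lowercase_word() { sample | LC_ALL=C grep -E '.* [a-z]+ .*'; }

b_or_c_run       # prints: abbd and accd
lowercase_word   # prints: say hello there
```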