Actions taken by the lexical analyzer

When the lexical analyzer matches one of the extended regular expressions in the rules section of the specification file, it executes the action that corresponds to that expression. Input that no rule matches is copied to standard output unchanged. Therefore, do not create a rule whose only purpose is to copy the input to the output. This default output can help you find gaps in the rules.

When using the lex command to process input for a parser that the yacc command produces, provide rules to match all input strings. Those rules must generate output that the yacc command can interpret.

Null action

To ignore the input associated with an extended regular expression, use a ; (C language null statement) as an action. The following example ignores the three spacing characters (blank, tab, and new line):

[ \t\n] ;

Same as next action

To avoid repeatedly writing the same action, use the | (pipe symbol). This character indicates that the action for this rule is the same as the action for the next rule. For instance, the previous example that ignores blank, tab, and new line characters can also be written as follows:

" "                     |
"\t"                    |
"\n"                    ;

The quotation marks that surround \n and \t are not required.

Printing a matched string

To determine what text matched an expression in the rules section of the specification file, you can include a C language printf subroutine call as one of the actions for that expression. When the lexical analyzer finds a match in the input stream, the program puts the matched string into the external character (char) and wide character (wchar_t) arrays, called yytext and yywtext, respectively. For example, you can use the following rule to print the matched string:

[a-z]+            printf("%s",yytext);

The C language printf subroutine accepts a format argument and data to be printed. In this example, the arguments to the printf subroutine have the following meanings:

%s: A conversion specification that prints the data as a character string
%S: A conversion specification that prints the data as a wide character (wchar_t) string
yytext: The name of the array containing the data to be printed
yywtext: The name of the array containing the wide character (wchar_t) data to be printed

The lex command defines ECHO; as a special action to print the contents of yytext. For example, the following two rules are equivalent:

[a-z]+       ECHO;
[a-z]+       printf("%s",yytext);

You can change the representation of yytext by using either %array or %pointer in the definitions section of the lex specification file, as follows:

%array: Defines yytext as a null-terminated character array. This is the default.
%pointer: Defines yytext as a pointer to a null-terminated character string.

Finding the length of a matched string

To find the number of characters that the lexical analyzer matched for a particular extended regular expression, use the yyleng or the yywleng external variables.

yyleng: Tracks the number of bytes that are matched.
yywleng: Tracks the number of wide characters in the matched string. A multibyte character counts as one wide character in yywleng but contributes more than 1 to yyleng.

To count both the number of words and the number of characters in words in the input, use the following action:

[a-zA-Z]+       {words++;chars += yyleng;}

This action counts each matched word in words and adds the length of that word to chars, so chars holds the total number of characters in all the words matched.

The following expression finds the last character in the string matched:

yytext[yyleng-1]
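
The word-counting action assumes that words and chars are declared where the action can see them. The following is a minimal sketch in C of that bookkeeping. The helper name tally_word and its parameters text and leng are hypothetical stand-ins for the action body, yytext, and yyleng; they are not part of lex.

```c
/* Counters that the action updates. In a specification file these
 * declarations would typically sit between %{ and %} in the definitions
 * section (an assumption about layout, not a requirement). */
static long words = 0;
static long chars = 0;

/* Hypothetical helper: text and leng stand in for yytext and yyleng.
 * Updates both counters and returns the last character of the match.
 * Assumes leng >= 1, which holds for a match of [a-zA-Z]+. */
static char tally_word(const char *text, int leng)
{
    words++;                 /* one more word matched */
    chars += leng;           /* add the length of this word */
    return text[leng - 1];   /* same expression as yytext[yyleng-1] */
}
```

Calling the helper for the matches hello and hi leaves words at 2 and chars at 7.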

Matching strings within strings

The lex command partitions the input stream and does not search for all possible matches of each expression. Each character is accounted for only once. To override this choice and search for items that may overlap or include each other, use the REJECT action. For example, to count all instances of she and he, including the instances of he that are included in she, use the following action:

she              {s++; REJECT;}
he               {h++;}
\n               |
.                ;

After counting an occurrence of she, the REJECT action makes the lexical analyzer discard that match and rescan the same input with the remaining rules, so the he that is contained in she is also counted. Because he does not include she, a REJECT action is not necessary on he.
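
To check the totals that the she and he rules should produce, it can help to compare them with a direct count. The following is a plain C model of the expected result, not the code that the lex command generates. The struct counts type and the count_she_he function are hypothetical names used only for this illustration.

```c
#include <string.h>

/* Tallies that correspond to the s and h counters in the rules. */
struct counts {
    int s;   /* occurrences of "she" */
    int h;   /* occurrences of "he", including those inside "she" */
};

/* Counts every occurrence of "she" and of "he" in text. Each position is
 * tested for both words, which is the overlap that REJECT makes possible. */
static struct counts count_she_he(const char *text)
{
    struct counts c = {0, 0};

    for (const char *p = text; *p != '\0'; p++) {
        if (strncmp(p, "she", 3) == 0)
            c.s++;
        if (strncmp(p, "he", 2) == 0)
            c.h++;
    }
    return c;
}
```

For the input "the sheep", this model reports one she and two occurrences of he: one in the and one inside sheep.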

Adding results to the yytext array

Typically, the next string from the input stream overwrites the current entry in the yytext array. If you use the yymore subroutine, the next string from the input stream is added to the end of the current entry in the yytext array.

For example, the following lexical analyzer looks for strings:

%s instring
%%
<INITIAL>\"     {  /* start of string */
         BEGIN instring;
         yymore();
        }
<instring>\"    {  /* end of string */
         printf("matched %s\n", yytext);
         BEGIN INITIAL;
        }
<instring>.     {
         yymore();
        }
<instring>\n    {
         printf("Error, new line in string\n");
         BEGIN INITIAL;
        }

Even though a string may be recognized by matching several rules, repeated calls to the yymore subroutine ensure that the yytext array will contain the entire string.
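
The accumulation that yymore provides can be pictured as appending each newly matched piece to a buffer instead of overwriting it. The following hand-written C state machine is a rough model of the string rules under that reading. It is not what the lex command generates, and the names first_string, scan_state, ST_INITIAL, and ST_INSTRING are hypothetical.

```c
#include <stddef.h>

enum scan_state { ST_INITIAL, ST_INSTRING };

/* Copies the first complete quoted string in input, quotes included, to
 * out. Returns 1 on success and 0 if no complete string fits in out. */
static int first_string(const char *input, char *out, size_t outsize)
{
    enum scan_state state = ST_INITIAL;
    size_t len = 0;

    if (outsize < 3)                    /* two quotes plus the terminator */
        return 0;

    for (const char *p = input; *p != '\0'; p++) {
        if (state == ST_INITIAL) {
            if (*p == '"') {            /* start of string: keep the quote */
                out[0] = '"';
                len = 1;
                state = ST_INSTRING;
            }
        } else if (*p == '\n') {        /* error rule: drop the partial text */
            len = 0;
            state = ST_INITIAL;
        } else {
            if (len + 1 >= outsize)     /* no room for this char plus '\0' */
                return 0;
            out[len++] = *p;            /* append, as yymore does for yytext */
            if (*p == '"') {            /* end of string */
                out[len] = '\0';
                return 1;
            }
        }
    }
    return 0;
}
```

Each appended character corresponds to one rule match followed by a yymore call, so the buffer grows in the same steps in which yytext would.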

Returning characters to the input stream

To return characters to the input stream, use the following call:

yyless(n)

where n is the number of characters of the current string to keep. Characters in the string beyond this number are returned to the input stream. The yyless subroutine provides the same type of look-ahead function that the / (slash) operator uses, but it allows more control over its usage.

Use the yyless subroutine to process text more than once. For example, when parsing a C language program, an expression such as x=-a is difficult to understand. Does it mean x is equal to minus a, or is it an older representation of x -= a, which means decrease x by the value of a? To treat this expression as x is equal to minus a, but print a warning message, use a rule such as the following:

=-[a-zA-Z]      {
                printf("Operator (=-) ambiguous\n");
                yyless(yyleng-2);  /* keep only =; return - and the letter */
                /* ... action for = ... */
                }
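
The effect of the argument to yyless is easiest to see on a concrete match. The following C function is a small model of how a matched string is divided, assuming the matched text is =-a. The name model_yyless and its parameters are hypothetical; the real yyless subroutine acts on yytext and the input stream directly.

```c
#include <string.h>

/* Model of yyless(n): the first n characters of match stay in kept (the
 * role of yytext) and the rest goes to returned (the characters that are
 * pushed back to the input). Both buffers must be large enough for match. */
static void model_yyless(const char *match, size_t n, char *kept, char *returned)
{
    size_t len = strlen(match);

    if (n > len)                 /* cannot keep more than was matched */
        n = len;
    memcpy(kept, match, n);
    kept[n] = '\0';
    strcpy(returned, match + n);
}
```

With the match =-a, keeping 1 character leaves = and returns -a, which reads the input as x = -a. Keeping 2 characters leaves =- and returns only a, which reads it as the older form of x -= a.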

Input/Output subroutines

The lex program allows a program to use the following input/output (I/O) subroutines:

input(): Returns the next input character
output(c): Writes the character c on the output
unput(c): Pushes the character c back onto the input stream to be read later by the input subroutine
winput(): Returns the next multibyte input character
woutput(C): Writes the multibyte character C to the output stream
wunput(C): Pushes the multibyte character C back onto the input stream to be read by the winput subroutine

The lex program provides these subroutines as macro definitions. The subroutines are coded in the lex.yy.c file. You can override them and provide other versions.

The winput, wunput, and woutput macros are defined to use the yywinput, yywunput, and yywoutput subroutines. For compatibility, the yy subroutines subsequently use the input, unput, and output subroutines to read, replace, and write the necessary number of bytes in a complete multibyte character.

These subroutines define the relationship between external files and internal characters. If you change the subroutines, change them all in the same way. These subroutines should follow these rules:

All subroutines must use the same character set.
The input subroutine must return a value of 0 to indicate end of file.
Do not change the relationship of the unput subroutine to the input subroutine, or the look-ahead functions will not work.

The lex.yy.c file allows the lexical analyzer to back up a maximum of 200 characters.

To read a file containing nulls, create a different version of the input subroutine. In the normal version of the input subroutine, the returned value of 0 (from the null characters) indicates the end of file and ends the input.
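
The following is a minimal sketch in C of a replacement pair for the input and unput subroutines that reads from a string in memory and follows the rules listed above. The names my_input, my_unput, set_source, and PUSHBACK_MAX are hypothetical. How the replacements are wired in (for example, with #undef and #define in the definitions section) depends on the lex implementation and is an assumption here.

```c
#include <stddef.h>

#define PUSHBACK_MAX 200   /* mirrors the 200-character backup limit */

static const char *src = "";          /* hypothetical in-memory input */
static char pushback[PUSHBACK_MAX];   /* characters returned by my_unput */
static size_t npush = 0;

/* Selects the string to read and clears any pushed-back characters. */
static void set_source(const char *s)
{
    src = s;
    npush = 0;
}

/* Returns the next character, or 0 at end of input. Pushed-back
 * characters are returned first, most recent first. */
static int my_input(void)
{
    if (npush > 0)
        return (unsigned char)pushback[--npush];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Pushes c back so that the next my_input call returns it. Returns c,
 * or -1 if the pushback buffer is full. */
static int my_unput(int c)
{
    if (npush >= PUSHBACK_MAX)
        return -1;
    pushback[npush++] = (char)c;
    return c;
}
```

Because both routines share one buffer, a character given to my_unput is always the next one that my_input returns, which is the relationship that look-ahead depends on.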

Character set

The lexical analyzers that the lex command generates process character I/O through the input, output, and unput subroutines. Therefore, to return values in the yytext array, the lex command uses the character representation that these subroutines use. Internally, however, the lex command represents each character with a small integer. When using the standard library, this integer is the value of the bit pattern that the computer uses to represent the character. Normally, the letter a is represented in the same form as the character constant 'a'. If you change this interpretation with different I/O subroutines, put a translation table in the definitions section of the specification file. The translation table begins and ends with a line that contains only the following entry:

%T

The translation table contains additional lines that indicate the value associated with each character. For example:

%T
{integer}       {character string}
{integer}       {character string}
{integer}       {character string}
%T

End-of-file processing

When the lexical analyzer reaches the end of a file, it calls the yywrap library subroutine, which returns a value of 1 to indicate to the lexical analyzer that it should continue with normal wrap-up at the end of input.

However, if the lexical analyzer receives input from more than one source, change the yywrap subroutine. The new function must get the new input and return a value of 0 to the lexical analyzer. A return value of 0 indicates that the program should continue processing.

A new version of the yywrap subroutine can also include code to print summary reports and tables when the lexical analyzer ends. The yywrap subroutine is the only way to force the yylex subroutine to recognize the end of input.
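
The following is a minimal sketch in C of a yywrap replacement that moves on to further input files and prints a short summary at the end. It assumes that the lexical analyzer reads from a FILE pointer named yyin, as traditional lex implementations do; that name does not appear in the text above and is declared here only so that the sketch compiles alone. The pending_files list and the files_opened counter are hypothetical.

```c
#include <stdio.h>

FILE *yyin;                          /* stand-in; a real lex.yy.c defines it */
static const char **pending_files;   /* hypothetical NULL-terminated name list */
static int files_opened = 0;

/* Called at end of input. Returns 0 after switching to the next readable
 * file, or 1 when no files remain and normal wrap-up should proceed. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin) {
        fclose(yyin);                /* finished with the current file */
        yyin = NULL;
    }
    while (pending_files != NULL && *pending_files != NULL) {
        const char *name = *pending_files++;

        yyin = fopen(name, "r");
        if (yyin != NULL) {
            files_opened++;
            return 0;                /* more input: continue processing */
        }
        fprintf(stderr, "cannot open %s\n", name);
    }
    printf("%d additional file(s) read\n", files_opened);   /* summary */
    return 1;                        /* no more input: wrap up */
}
```

A main routine would point pending_files at the remaining file names before calling yylex. Files that cannot be opened are reported and skipped.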