Parsing Strings into Tokens

A token is an atomic unit of a language, consisting of a syntax description and a type name. Every instance of a token is represented by a lexeme: a string whose sequence of characters conforms to the syntax characteristic of the token.

A parser is a utility that takes a string and a set of token definitions as input, scans the string for lexemes, and outputs the token that each lexeme in the string represents. G2 provides two capabilities that can be used together to implement a parser:

The difference between tokens and lexemes is analogous to the difference between numbers and character strings that represent numbers. Where the meaning is clear, strings that represent numbers are often referred to as if they were the numbers themselves. Similarly, lexemes that represent tokens are often referred to informally as if they were the tokens themselves.
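The token/lexeme distinction can be illustrated outside G2. The following is a minimal Python sketch, not G2 code: the `Token` class, its field names, and the `integer` token are all hypothetical, and the regular expression uses Python's syntax rather than G2's.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A token: a type name plus a syntax description (hypothetical model)."""
    type_name: str   # for example "integer"
    syntax: str      # Python regular expression that every lexeme must match

    def is_lexeme(self, text: str) -> bool:
        """Return True if the whole string conforms to this token's syntax."""
        return re.fullmatch(self.syntax, text) is not None


INTEGER = Token("integer", r"[0-9]+")

# "10461" and "7" are two different lexemes of the same token.
print(INTEGER.is_lexeme("10461"))  # True
print(INTEGER.is_lexeme("7"))      # True
print(INTEGER.is_lexeme("Steve"))  # False
```

Here the token is the pair (type name, syntax), while each matching string is one lexeme of that token.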

Specifying the Syntax for Extracting Tokens

To parse a string into tokens, a parser must know:

G2 uses tokenizers to specify this information.

To create a tokenizer:

The class-specific attributes of a tokenizer are:

Attribute             Description
Patterns-definition   Zero or more named regular expressions.
                      Allowable values: pairs of the form name regular-expression
                      Default value: no value
Tokens-definition     One or more regular expressions and the action to take when each is encountered.
                      Allowable values: pairs of the form regular-expression action
                      Default value: no value

Defining Patterns

A pattern is a named regular expression. Patterns allow you to:

A pattern definition is a pair of the form:

    name regular-expression

where name is any symbol.

The Patterns-definition attribute of a tokenizer specifies a set of patterns that are available in a tokenizer. The value of the attribute is zero or more consecutive pattern definitions. No patterns need be defined in a tokenizer; they are strictly a convenience.

G2 reads and compiles pattern definitions in sequential order. Once a pattern has been defined, it is available for use in subsequent pattern definitions. The syntax for referencing a pattern definition is:

    $(name)

The following could be the value of a tokenizer's Patterns-definition attribute:

The rest of the examples in this section assume the preceding pattern definitions.

To specify a tokenizer's pattern definitions:

The scope of a pattern is the tokenizer that specifies it. Hence the same name can be used to represent different expressions in different tokenizers.
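The sequential compilation and scoping rules described in this section can be imitated in Python. The sketch below is an analogy only, not G2's implementation: `compile_patterns` is a hypothetical helper, the pattern names and expressions are invented for illustration, and the regular expressions use Python syntax. It assumes the $(name) reference form and rejects a reference to a pattern that has not yet been defined.

```python
import re


def compile_patterns(definitions):
    """Expand (name, regex) pairs in order; each may reference earlier names."""
    expanded = {}
    for name, regex in definitions:
        def substitute(match):
            ref = match.group(1)
            if ref not in expanded:
                raise ValueError(f"pattern {ref!r} used before it was defined")
            # Wrap in a non-capturing group so operators apply to the whole pattern.
            return "(?:" + expanded[ref] + ")"

        expanded[name] = re.sub(r"\$\(([^)]+)\)", substitute, regex)
    return expanded


# Hypothetical pattern definitions, not the ones from the manual.
patterns = compile_patterns([
    ("digit", r"[0-9]"),
    ("letter", r"[A-Za-z]"),
    ("word", r"$(letter)+"),
    ("number", r"$(digit)+"),
])
print(patterns["word"])    # (?:[A-Za-z])+
print(patterns["number"])  # (?:[0-9])+
```

Each call returns its own dictionary, which mirrors the rule that a pattern name is scoped to the tokenizer that defines it.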

Defining Tokens

A token definition is a pair of the form:

    regular-expression response

where regular-expression specifies the syntax of some class of token, and response specifies what to do when a lexeme that matches regular-expression is encountered. The regular-expression can use any regular expression construct, including patterns referenced via $(name). The possible types of response are:

The meaning of each of these responses is described under Responding to a Match.

The Tokens-definition attribute of a tokenizer specifies one or more token definitions. These define the tokens that the tokenizer is to scan for. The following could be the value of a Tokens-definition attribute.

The rest of the examples in this section assume the preceding token definitions.

To specify a tokenizer's token definitions:

Locating Tokens in a String

Once you have specified a tokenizer's pattern definitions (if any) and token definitions, you can use the tokenizer to locate the tokens that it recognizes. G2 provides a text manipulation function for this purpose: get-next-token.

Text functions in general are described under G2 Text Manipulation Functions and G2 Conventions for Manipulating Text. G2 functions in general are described in Chapter 25, Functions.

Searching for a Token

Get-next-token is similar to find-next-pattern, which is described under Locating a Substring Using a Regular Expression, but is much more general. Get-next-token operates as follows:

Responding to a Match

Every token definition specifies a response, as described under Defining Tokens. The possible types of response, and the meaning of each, are:

where:

Example

If a tokenizer has the pattern and token definitions described earlier in this section, the call:

returns:

and the call:

returns:

Note that start-index is 2 even though start-position was 1. This occurred because the tokenizer specifies that a blank (" ") has a response of do-nothing. Therefore get-next-token skipped over the initial blank at position 1 of " 10461 Steve Street", and continued scanning from position 2.
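The skipping behavior in this example can be reproduced with a small Python sketch. It is an analogy, not G2's get-next-token: the token definitions, the first-match-wins rule, the dictionary return value, and the end-index field are all assumptions made for illustration. Only the do-nothing response, the 1-based positions, and the start-index result come from the text above.

```python
import re

DO_NOTHING = "do-nothing"

# Hypothetical token definitions as (regular expression, response) pairs.
TOKEN_DEFINITIONS = [
    (r" ", DO_NOTHING),
    (r"[0-9]+", "number"),
    (r"[A-Za-z]+", "word"),
]


def get_next_token(definitions, text, start_position):
    """Return the first token at or after start_position (1-based), or None."""
    pos = start_position - 1              # convert to Python's 0-based offsets
    while pos < len(text):
        for regex, response in definitions:
            match = re.compile(regex).match(text, pos)
            if match and match.end() > pos:
                break                     # first definition that matches wins
        else:
            return None                   # no definition matches at this position
        if response == DO_NOTHING:
            pos = match.end()             # skip the lexeme and keep scanning
            continue
        return {
            "token-type": response,
            "start-index": match.start() + 1,  # 1-based
            "end-index": match.end(),          # 1-based, inclusive
        }
    return None


print(get_next_token(TOKEN_DEFINITIONS, " 10461 Steve Street", 1))
# {'token-type': 'number', 'start-index': 2, 'end-index': 6}
```

As in the example, the scan starts at position 1, the leading blank matches a do-nothing definition and is skipped, and the reported start-index is 2.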

Extracting Tokens from a String

The get-next-token function does not return the lexeme that it found, because a token's type is usually all that is needed, and returning the lexeme would add needless overhead. In some cases, however, the lexeme is also needed.

To extract a token identified by get-next-token:
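In Python terms, recovering a lexeme from reported positions amounts to taking a substring. The sketch below is not G2 code: `extract_lexeme` is a hypothetical helper, and it assumes that the positions are 1-based and inclusive and that an end index is available alongside start-index.

```python
def extract_lexeme(text, start_index, end_index):
    """Return the substring covering 1-based, inclusive positions."""
    if not 1 <= start_index <= end_index <= len(text):
        raise ValueError("indices out of range")
    # Python slices are 0-based and exclude the end offset.
    return text[start_index - 1:end_index]


print(extract_lexeme(" 10461 Steve Street", 2, 6))  # 10461
```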

Copyright © 1997 Gensym Corporation, Inc.