A token is an atomic unit of a language, consisting of a syntax description and a type name. Every instance of a token is represented by a lexeme: a string whose sequence of characters conforms to the syntax characteristic of the token.
TOKENIZER: an object that contains regular expressions that define a set of tokens.
get-next-token: a function that uses a TOKENIZER to locate lexemes in strings and return the token that each represents.
To create a tokenizer:
Choose KB Workspace > New Definition > tokenizer
Defining Patterns
A pattern is a named regular expression. A pattern definition has the form:
name regular-expression
The Patterns-definition attribute of a tokenizer specifies the set of patterns that are available in that tokenizer. The value of the attribute is zero or more consecutive pattern definitions. No patterns need be defined in a tokenizer; they are strictly a convenience.
G2 reads and compiles pattern definitions in sequential order. Once a pattern has been defined, it is available for use in subsequent pattern definitions. The syntax for referencing a pattern definition is:
"$(name)"
Example value of a Patterns-definition attribute:
nonzero "{1-9}"
digit "{0-9}"
numseq "$(nonzero)$(digit)*"
int "{\+\-}?$(numseq)"
real "($(int)\.$(digit)*)|({\+\-}?\.$(digit)+)"
name "{A-Z}{a-z}*|{A-Z}\."
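The sequential expansion of pattern references can be illustrated outside G2. The following Python sketch is not G2 code: it assumes that each $(name) reference is replaced by the referenced pattern as a group, and that a G2 character class written {…} corresponds to a Python character class written […]. The function name compile_patterns is hypothetical.

```python
import re

# The pattern definitions from the example above, in definition order.
PATTERN_DEFS = [
    ("nonzero", r"{1-9}"),
    ("digit", r"{0-9}"),
    ("numseq", r"$(nonzero)$(digit)*"),
    ("int", r"{\+\-}?$(numseq)"),
    ("real", r"($(int)\.$(digit)*)|({\+\-}?\.$(digit)+)"),
    ("name", r"{A-Z}{a-z}*|{A-Z}\."),
]

REFERENCE = re.compile(r"\$\(([^)]+)\)")  # matches a $(name) reference


def compile_patterns(defs):
    """Expand $(name) references in order; return name -> Python regex."""
    expanded = {}

    def substitute(match):
        # Only patterns defined earlier are available; an unknown name
        # raises KeyError. Grouping the expansion is an assumption.
        return "(?:" + expanded[match.group(1)] + ")"

    for name, body in defs:
        body = REFERENCE.sub(substitute, body)
        # Assumed mapping of G2 {...} character classes to Python [...].
        expanded[name] = body.replace("{", "[").replace("}", "]")
    return expanded


PATTERNS = compile_patterns(PATTERN_DEFS)
```

Because definitions are processed in order, moving numseq above digit in this sketch would fail, which mirrors the rule that a pattern is available only to subsequent definitions.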
To specify a tokenizer's pattern definitions:
Make the desired definitions the values of the tokenizer's Patterns-definition attribute.
regular-expression response
$(name). The possible types of response are:
do-nothing
a symbol that names the token type, such as address or person
"$(numseq) $(name) Street" address
"$(name) $(name)" person
" " do-nothing
"$(int)" zip-code
To specify a tokenizer's token definitions:
Make the desired definitions the values of the tokenizer's Tokens-definition attribute.
get-next-token (tokenizer: class G2-tokenizer, source-text: text, start-position: integer)
-> structure (token-type: symbol, start-index: integer, end-index: integer)
Searching for a Token
Get-next-token is similar to find-next-pattern, as described under Locating a Substring Using a Regular Expression, but is much more general. The action of get-next-token is as follows:
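If no token matches, return:
structure (token-type: FALSE, start-index: 0, end-index: 0)
If a token matches, the action depends on the response of the matching token definition:
do-nothing: continue scanning from the first character after the end of the matching substring.
any other response (a token type): return: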
structure (token-type: symbol, start-index: integer, end-index: integer)
start-index: The character position where the matching token begins.
end-index: The character position where the matching token ends.
get-next-token (tokenizer, " 10461 Steve Street", 10)
structure (token-type: FALSE, start-index: 0, end-index: 0)
get-next-token (tokenizer, " 10461 Steve Street", 1)
structure (token-type: address, start-index: 2, end-index: 19)
In this result, start-index is 2 even though start-position was 1, because the tokenizer specifies a response of do-nothing for a blank (" "). Get-next-token therefore skipped the initial blank at position 1 of " 10461 Steve Street" and continued scanning from position 2.
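The scanning behavior shown in these examples can be approximated in Python. This is a sketch under stated assumptions, not the G2 implementation: it assumes that a match must begin exactly at the current position, that the longest match wins, and that ties go to the earlier definition. The regular expressions are hand-translated into Python syntax, and the names TOKEN_DEFS, NO_MATCH, and get_next_token are hypothetical Python stand-ins.

```python
import re

# Hand-translated equivalents of the patterns used by the token definitions.
NUMSEQ = r"[1-9][0-9]*"
INT = r"[+-]?" + NUMSEQ
NAME = r"(?:[A-Z][a-z]*|[A-Z]\.)"

# (regular expression, response) pairs, in definition order.
TOKEN_DEFS = [
    (re.compile(NUMSEQ + " " + NAME + " Street"), "address"),
    (re.compile(NAME + " " + NAME), "person"),
    (re.compile(" "), "do-nothing"),
    (re.compile(INT), "zip-code"),
]

NO_MATCH = {"token-type": False, "start-index": 0, "end-index": 0}


def get_next_token(token_defs, source_text, start_position):
    """Return the first token at or after start_position (1-based)."""
    if start_position < 1:
        raise ValueError("start_position is 1-based")
    pos = start_position - 1  # Python strings are 0-based
    while pos < len(source_text):
        best_end, best_response = pos, None
        for regex, response in token_defs:
            m = regex.match(source_text, pos)  # anchored at pos (assumption)
            if m and m.end() > best_end:       # longest match wins (assumption)
                best_end, best_response = m.end(), response
        if best_response is None:
            return dict(NO_MATCH)
        if best_response != "do-nothing":
            # A 0-based exclusive end equals a 1-based inclusive end.
            return {"token-type": best_response,
                    "start-index": pos + 1,
                    "end-index": best_end}
        pos = best_end  # do-nothing: skip the lexeme and keep scanning
    return dict(NO_MATCH)
```

Under these assumptions, calling get_next_token(TOKEN_DEFS, " 10461 Steve Street", 1) skips the leading blank and reports an address from 2 to 19, and starting at position 10 yields the no-match structure, in agreement with the two examples above.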
The get-next-token function does not return the lexeme that it found, because a token's type is usually all that is needed; returning the lexeme would be needless overhead. In some cases, however, the lexeme is also needed.
To extract a token identified by get-next-token:
Use get-next-text as described under Obtaining a Substring.
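The index arithmetic involved in recovering a lexeme can be shown with a short Python sketch. It is not G2 code. It assumes that start-index and end-index are 1-based and inclusive, which is consistent with the example result (2 to 19 for a 19-character string). The function name lexeme_of and the dictionary representation of the structure are hypothetical.

```python
def lexeme_of(source_text, token):
    """Return the substring that a token structure refers to, or None."""
    if token["token-type"] is False:
        return None  # the no-match structure carries no lexeme
    # 1-based inclusive indices -> Python's 0-based, end-exclusive slice.
    return source_text[token["start-index"] - 1:token["end-index"]]


# The structure returned in the earlier example, written as a dictionary.
address_token = {"token-type": "address", "start-index": 2, "end-index": 19}
lexeme = lexeme_of(" 10461 Steve Street", address_token)
```

Here lexeme holds the text of the address without the leading blank that get-next-token skipped.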