Regular Expression Syntax
G2 provides regular expressions that can be used to describe text flexibly and succinctly. The G2 regular expression syntax is similar to that used by UNIX utilities such as lex, grep, sed, and awk. G2 regular expressions are text strings (type text), and therefore appear in quotes when embedded in G2 code.
A regular expression is a sequence of characters and/or meta-characters that specifies a pattern that matches one or more possible character strings. The meta-characters used by G2 regular expressions are:
Characters represent themselves; meta-characters are operators that define expressions. For example, the meta-character '|' represents logical or, so the regular expression a|b matches either the character 'a' or the character 'b'. The following table gives the syntax of G2 regular expressions:
Basic Regular Expression Constructs
| Example | Interpretation |
|---|---|
| `abc` | 'a' followed by 'b' followed by 'c'. |
| `(abc)` | 'a' followed by 'b' followed by 'c'; i.e., a synonym for abc. The difference is that, in larger strings, the meaning of abc can be affected by the context, whereas (abc) always means the same thing. See Precedence. |
| `a\|b` | Either 'a' or 'b'. The vertical bar means alternatives. A string with a vertical bar as the first character is an invalid regular expression, as is a string with a vertical bar as the last character, unless the next-to-last character is '\'. |
| `a*` | Zero or more occurrences of 'a'. A string with '*' as the first character is an invalid regular expression. |
| `a+` | One or more occurrences of 'a'. A string with '+' as the first character is an invalid regular expression. |
| `a?` | Zero or one occurrence of 'a'. A string with '?' as the first character is an invalid regular expression. |
| `.` | Any single character, including the newline character. |
| `\.\|\\\*` | '.' followed by a vertical bar, followed by '\', followed by '*'. The '\' character specifies that the character following the '\' is to be interpreted literally. It is typically used to escape meta-characters, although it works for any character (e.g., "\a" is a synonym for "a"). |
| `["@"]` | This is just an ordinary string, with no meta-characters. It matches the sequence of characters '[' '"' '@' '"' ']'. However, remember that the standard meta-characters of G2 text will apply to strings even if they are intended to be regular expressions, so to enter this string in the editor, one would use the string "@[@"@@@"@]". |
| `$(foo)` | The '$' character is used in the attributes of the Tokenizer class. The construction has no special meaning in other contexts. |
| `^abc` | The sequence of characters 'a', 'b', 'c', but only if found at the start-position given. The caret "anchors" the search. This feature only has meaning for system-defined functions. The Tokenizer's search is always anchored. |
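The `["@"]` entry shows that a regular expression typed into the G2 editor must also survive G2's own text escaping. The Python sketch below mimics that step. The helper name `to_g2_text_literal` is hypothetical, and the set of characters that receive an `@` prefix is an assumption inferred only from the single example in the table; G2 text may treat other characters specially as well.

```python
# Characters assumed to need an '@' prefix inside a G2 text literal.
# This set is inferred from the one example in the table, not from a G2 specification.
G2_TEXT_SPECIALS = '@"[]'


def to_g2_text_literal(regex: str) -> str:
    """Return the regex as it might be typed into the G2 editor, quotes included."""
    escaped = "".join("@" + ch if ch in G2_TEXT_SPECIALS else ch for ch in regex)
    return '"' + escaped + '"'


# The table's example: the five-character pattern [ " @ " ]
print(to_g2_text_literal('["@"]'))  # "@[@"@@@"@]"
```

Patterns without any of these characters pass through unchanged apart from the surrounding quotes.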
Caution: Be careful not to confuse regular expressions with wildcard expressions for designating filenames. The two syntaxes use some of the same meta-characters, but their meanings are somewhat different.
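To make the difference between wildcards and regular expressions concrete, the sketch below runs the same pattern text through Python's `fnmatch` (filename wildcards) and Python's `re` (regular expressions). Both modules are stand-ins chosen for illustration: Python's `re` is not the G2 engine, and the helper `compare` is a hypothetical name. The point is only that shared meta-characters such as `*`, `?`, and `.` behave differently in the two notations.

```python
import fnmatch
import re


def compare(pattern: str, candidate: str):
    """Return (wildcard_result, regex_result).

    regex_result is None when the pattern is not a valid regular expression.
    """
    wildcard_result = fnmatch.fnmatchcase(candidate, pattern)
    try:
        regex_result = re.fullmatch(pattern, candidate) is not None
    except re.error:
        regex_result = None
    return wildcard_result, regex_result


# '*' alone means "anything" as a wildcard, but a regex cannot start with '*'.
print(compare("*.txt", "notes.txt"))  # (True, None)
# '?' is "exactly one character" as a wildcard, "zero or one 'a'" as a regex.
print(compare("a?", "ab"))            # (True, False)
print(compare("a?", ""))              # (False, True)
# '.' is a literal dot as a wildcard, "any single character" as a regex.
print(compare("a.b", "axb"))          # (False, True)
```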
Character Classes
Character classes provide a terse notation for indicating large sets of characters. G2 regular expressions use '{' and '}' as delimiters for character classes. G2 also provides several system-defined character classes, as described under System-Defined Character Classes.
Note: Character classes are unrelated to item classes, which are classes in the object-oriented sense. Character classes are just sets of characters specified with a terse notation.
Character Class Constructs
| Example | Interpretation |
|---|---|
| `{abc}` | Either 'a', or 'b', or 'c'. Essentially, this usage is a shorthand for the notation `(a\|b\|c)`. |
| `{a-z}` | Any character between 'a' and 'z' inclusive. Inside curly braces, the hyphen becomes a meta-character meaning a range of characters. Since "between" refers to the numerical values assigned to the characters in the character encoding, problems may arise with encodings, such as EBCDIC, that intersperse alphabetic and non-alphabetic characters. |
| `{^a-z}` | Any character that is not between 'a' and 'z' inclusive. A caret ('^') immediately following a left curly brace introduces an inverted character class. The inversion refers to the entire class; i.e., the characters following the caret determine a match space, and the match space for the class becomes the set difference between the full alphabet and the computed match space. A caret inside a character class that is not the first character has no special meaning. |
| `{}` | The null string. A null string in a regular expression is legal but has no effect on the meaning of the string. Thus car, {}car, c{}ar, ca{}r and car{} are all equivalent. |
| `<charclass>` | Any alphabetic character in the system-defined character class named by charclass. All characters between a '<' and its corresponding '>' are read as a symbol. If that symbol does not name a system-defined character class, an error results. |
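Readers who know bracket-style classes from other tools may find it helpful to see the curly-brace notation mapped onto them. The following Python sketch translates a G2-style class into the `[...]` form used by Python's `re` module. The function name is hypothetical, Python's engine is only an analogy for G2's, and the sketch assumes the class body contains nothing but letters, digits, '-', and '^'.

```python
import re


def g2_class_to_python(g2_class: str) -> str:
    """Translate a G2-style {...} character class into Python's [...] notation.

    Assumes the body holds only letters, digits, '-' and '^' (no escapes).
    """
    if not (g2_class.startswith("{") and g2_class.endswith("}")):
        raise ValueError("expected a class delimited by '{' and '}'")
    body = g2_class[1:-1]
    if body == "":
        # {} is the null string; an empty non-capturing group matches the empty string.
        return "(?:)"
    # A caret is special only when it immediately follows the opening brace.
    negated = body.startswith("^")
    members = body[1:] if negated else body
    return "[" + ("^" if negated else "") + members + "]"


print(g2_class_to_python("{abc}"))    # [abc]
print(g2_class_to_python("{^a-z}"))   # [^a-z]
print(re.fullmatch("c" + g2_class_to_python("{}") + "ar", "car") is not None)  # True
```

A caret that is not the first character, as in `{a^b}`, is carried over unchanged and is an ordinary member of the class in Python as well.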
Caution: A null string matches anything, so searching for it can cause an infinite loop. Guard against inadvertently creating and searching for it in iterative constructs that assemble regular expressions dynamically.
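The loop hazard described in this caution is not specific to G2. The Python sketch below shows one common guard, using Python's `re` module purely as an analogy: reject an empty pattern outright, and when a match is empty, step forward one character so that the scan always makes progress. The function name `find_all_spans` is hypothetical.

```python
import re


def find_all_spans(pattern: str, text: str) -> list[tuple[int, int]]:
    """Collect match spans from left to right, guaranteeing termination."""
    if pattern == "":
        raise ValueError("refusing to search for the null string")
    compiled = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(text):
        m = compiled.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # An empty match ends where it starts. Without the extra step the
        # next search would find the same empty match forever.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return spans


# 'a*' can match the empty string, so empty spans appear but the loop still ends.
print(find_all_spans("a*", "baa"))  # [(0, 0), (1, 3), (3, 3)]
```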
System-Defined Character Classes
G2 provides several system-defined character classes. Each such class consists of the union of the appropriate characters over all natural languages supported by our character encoding. To refer to a system-defined character class in a regular expression, give the name of the class in angle brackets, as in <charclass>.
The system-defined character classes are as follows. Ranges are specified with semicolons, because no semicolon appears in any character class.
Precedence
The constructs used to indicate G2 regular expressions have a precedence order. This order defines the correct interpretation when constructs appear consecutively. Several levels of precedence exist. Constructs at the same precedence level are evaluated left-to-right.
The following table shows the construct(s) at each precedence level, and gives the correct interpretation of a sample regular expression that uses the construct(s).
Precedence
| Level | Example | Interpretation |
|---|---|---|
| `\` | `\{a-z}` | The sequence of characters '{', 'a', '-', 'z', '}'. |
| `{...}` | `c{ad}+r` | The character 'c', followed by one or more occurrences of either character 'a' or character 'd', followed by character 'r'. |
| `(...)` | `(dog)*` | Zero or more consecutive occurrences of the character sequence 'd', 'o', 'g', e.g. "dogdogdog" or "". |
| `*`, `+`, `?` | `dog*` | 'd' followed by 'o' followed by zero or more 'g' characters, e.g. "dogggg" or "do". |
| Implicit Sequence | `ca\|dr` | Either (1) 'c' followed by 'a' or (2) 'd' followed by 'r'. |
| `\|` (alternation) | `a\|b` | Either 'a' or 'b'. |
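The interpretations in the precedence table can be checked by experiment in any engine that orders these constructs the same way. Python's `re` module does so for repetition, sequencing, and alternation, which makes it a convenient stand-in here. It is not the G2 engine, and it writes character classes with square brackets, so the G2 pattern c{ad}+r appears below as `c[ad]+r`. The helper `matches` is a hypothetical name.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """True if the whole of text matches pattern (Python re semantics)."""
    return re.fullmatch(pattern, text) is not None


# Repetition binds tighter than sequencing: '*' applies only to 'g'.
print(matches(r"dog*", "dogggg"))       # True
print(matches(r"dog*", "do"))           # True
print(matches(r"dog*", "dogdog"))       # False

# Parentheses group first, so '*' applies to the whole sequence.
print(matches(r"(dog)*", "dogdogdog"))  # True
print(matches(r"(dog)*", ""))           # True

# Sequencing binds tighter than alternation: (ca)|(dr), not c(a|d)r.
print(matches(r"ca|dr", "ca"))          # True
print(matches(r"ca|dr", "dr"))          # True
print(matches(r"ca|dr", "car"))         # False

# A character class is a single unit, so '+' repeats the whole class.
print(matches(r"c[ad]+r", "cadr"))      # True
print(matches(r"c[ad]+r", "cr"))        # False
```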
Copyright © 1997 Gensym Corporation, Inc.