Spirit

Copyright © 2001, Joel de Guzman
Isis Technologies. All rights reserved.
Email: isis-tech@usa.net

About Spirit:

Spirit is an object oriented recursive descent parser generator framework implemented using template meta-programming techniques. Expression templates [R1] allow us to approximate the syntax of Extended Backus-Naur Form (EBNF) [R2] completely in C++. The Spirit framework enables a target grammar to be written exclusively in C++. EBNF grammar specifications can mix freely with other C++ code and, thanks to the generative power of C++ templates, are immediately executable. In contrast, conventional compiler-compilers or parser generators have to perform an additional translation step from the source EBNF code to C or C++ code.

A simple EBNF grammar snippet:

group   ::= '(' expr ')'
expr1   ::= integer | group
expr2   ::= expr1 (('*' expr1) | ('/' expr1))*
expr    ::= expr2 (('+' expr2) | ('-' expr2))*

is approximated using Spirit's facilities as seen in this code snippet:

group   = oppar >> expr >> clpar;
expr1   = integer | group;
expr2   = expr1 >> *((times >> expr1) | (divide >> expr1));
expr    = expr2 >> *((plus >> expr2) | (minus >> expr2));

Through the magic of expression templates, this is perfectly valid C++ code that can be executed. The production rule named expr is in fact an object that has a member function Parse(...) that does the work, given source code written in the grammar that we have just declared.

Certainly we have made some modifications to the original EBNF syntax, so as to conform to C++ syntax rules. Most notable is the abundance of shift (>>) operators. Since there are no 'empty' operators in C++, it is simply not possible to write something like:

a b

as seen in mathematical notation, for example, to mean multiplication or, in our case, as in EBNF syntax, to mean sequencing (b should follow a). The framework uses the shift (>>) operator instead for this purpose. We take the (>>) operator to mean "is followed by" [N1].

The alternative operator (|) and the parentheses (grouping) remain as is. The assignment operator (=) is used in place of BNF's (::=). Last but not least, the Kleene star (*), which is a postfix operator in EBNF, becomes a prefix. Instead of:

a*   ... in EBNF syntax,
we write:
*a   ... in Spirit.

There are no postfix stars (*) in C/C++.

And, before I forget, the original literals such as '*', '+', etc., are replaced by identifiers such as times, plus, etc., the definitions of which are obvious and intentionally omitted, for now, for the sake of clarity. We shall deal with these later.

Now what makes Spirit tick?

First, some concepts. A parser object is the most basic entity in the framework. Basically, a parser accepts an input stream and returns a Match object as its result. The Match object evaluates to true if the parser's criteria or algorithm matches a portion of the data at the current stream position. If a match is found, the stream is advanced accordingly.

All parsers inherit from the base template class, Parser:

template <typename Derived>
struct Parser {

    Derived&        GetDerived();
    Derived const&  GetDerived() const;
};

This class is a protocol base class for all parsers. This is essentially an interface contract. The Parser class does not really know how to parse anything but instead relies on its sole template parameter, Derived [N2] (which obviously is assumed to be a subclass), to do the actual parsing. Concrete subclasses inheriting from Parser must have a corresponding member function Parse(...) compatible with the conceptual interface:

    template <typename Scanner>
    Match Parse(Scanner& scanner) const;

The Scanner template parameter is just another form of an iterator that points to the current input stream position. We shall elaborate on this later. For now, think of the scanner as just another STL compliant forward iterator. It is passed by reference to allow the function to 'move' its position accordingly when a match is found. The parse is considered successful if a portion of the input starting at the current scanner position is matched. The Parse function terminates as soon as the scanner finds anything that the parser does not recognize.

A Match object is returned by the Parse function. Basically the Match object reports the success of the parse; i.e. evaluates to true if the Parse function is successful, false otherwise. If the parse is successful, the Match object may also be queried to report the number of characters matched (match.Length()). The length is valid only if the match is successful. Here is a code snippet:

Match     match = parser.Parse(scanner);
bool      success = match;
unsigned  length = match.Length();
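
Putting the Parser contract and the Match object together, here is a minimal sketch of a hypothetical concrete parser. It is not one of the framework's classes; the Match constructors used (a length-reporting constructor and a default 'failure' constructor) are assumptions for illustration, and end-of-input checking is omitted for brevity:

    // A hypothetical concrete parser: matches one specific character.
    // This is NOT the framework's ChLit; it only illustrates the
    // Derived/Parse contract (the Match constructors are assumed).
    struct CharParser : public Parser<CharParser> {

        CharParser(char ch_) : ch(ch_) {}

        template <typename Scanner>
        Match Parse(Scanner& scanner) const
        {
            if (*scanner == ch) {
                ++scanner;          // advance past the matched character
                return Match(1);    // success: one character matched
            }
            return Match();         // failure: stream position unchanged
        }

        char ch;
    };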

Primitives:

The framework predefines some parser primitives. Here are some examples:
ChLit:  Character literal.  Example: ChLit<>('X');
Range:  Character range.    Example: Range<>('0','9');
StrLit: String literal.     Example: StrLit<>("Hello");

At this point, as evident in these examples, it is important to note that, contrary to standard practice, the Spirit framework handles parsing tasks at both the character level and the phrase level [N3]. One may consider a lexical analyzer to be seamlessly integrated into the Spirit framework.

We may now fill in some of the gaps. Going back to our original example, we now define oppar, clpar, times, divide, plus and minus as:

ChLit<> plus   ('+');
ChLit<> minus  ('-');
ChLit<> times  ('*');
ChLit<> divide ('/');
ChLit<> oppar  ('(');
ChLit<> clpar  (')');

Here is the complete list of parser primitives:

Basics:

ChLit: Character literal. Example: ChLit<>('X');
Range: Character range. Example: Range<>('0','9');
StrLit: String literal. Example: StrLit<>("Hello");
   
NCChLit: Case insensitive character literal.
NCRange: Case insensitive character range.
NCStrLit: Case insensitive string literal.
   
  Note: These classes are templates parameterized by the character type, which defaults to char.

Predefined parser primitives:

anychar: Matches any character.
nothing: Matches nothing.
alnum: Alpha-numeric characters.
alpha: Alphabetic characters.
cntrl: Control characters.
digit: Numeric digits.
graph: Non-space printing characters.
lower: Lower case letters.
print: Printable characters.
punct: Punctuation symbols.
space: Space, tab, return etc.
upper: Upper case letters.
xdigit: Hexadecimal digits.
   
  Note: These parsers may be used for both char and wchar_t types.

Operators:

Operators are used as a means for object composition and embedding. Simple parsers may be composed to form composites through operator overloading, crafted to approximate the syntax of an Extended Backus-Naur Form (EBNF) variant. An expression such as:

    a | b

actually yields a new parser type which is a composite of its operands, a and b. Taking this example further, if a and b were of type ChLit<>, the result would have the composite type:

    Alternative<ChLit<>, ChLit<> >

A number of operators facilitate the composition of parsers:
  1. Set-like operators on parsers [N4]:

    a | b Union (match a or b. Also referred to as alternatives) [N5]
    a & b Intersection (match a and b) [N6]
    a - b Difference (match a but not b)
    a ^ b XOR (match a or b, but not both).

  2. Sequencing:

    a >> b Matches a and b in sequence [N1].


  3. Iteration:

    *a Kleene star. Match a zero (0) or more times
    +a Positive. Match a one (1) or more times
    !a Optional. Match a zero (0) or one (1) time.

  4. Operator precedence and grouping:

    Since we are defining our meta-language in C++, we follow C/C++'s operator precedence rules. Groups override this (e.g., *(a | b) reads: match a or b zero (0) or more times).

For binary operators, one of the operands (but not both [N7]) may be a char, wchar_t, char const* or wchar_t const*. Examples:

a | 'x';
a - "Hello World";

A few simple classes, when composed and structured in a hierarchy, form a very powerful object oriented recursive descent parsing engine. These classes provide the infrastructure needed for the construction of more complex parsers. The final parser composite is a non-deterministic top-down parser with infinite look-ahead LL(INF). Top-down descent traverses the hierarchy, checking each object for a match, backtracking and checking all possible alternatives.

The flexibility of object embedding and composition combined with recursion opens up a unique approach to parsing. Subclasses are free to form aggregates and algorithms of arbitrary complexity. Complex parsers can be created with the composition of only a few primitive classes.

The framework is designed to be fully open-ended and extensible. New primitives or composites, from the trivial to the complex, may be added any time. Composition is totally static (happens at compile time). This is possible through the expressive flexibility of C++ template meta-programming.

The Production Rule:

Rule is a class that captures the type and behavior of any primitive or composite parser that is assigned to it either via its assignment operator or its copy constructor. In effect, a rule allows an arbitrarily complex composite parser to be named. Naming of a rule allows it to be referenced in another compositional expression. Essentially, Rule 'IS-A' (inherits from) Parser and adheres to Parser's conceptual interface.

The type and behavior of a parser assigned to a rule are encoded in a conventional polymorphic class (AbstractParser). The rule's assignment operator and copy constructor dynamically create a concrete instance of AbstractParser. Both are member templates parameterized by the type of the parser passed in. Example:

    Rule<>   aRule = *(a | b) & +(c | d | e);

The type and functionality of the right-hand expression, which may be arbitrarily complex, is encoded in the rule named aRule [N8].
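
To give a feel for how a rule can capture an arbitrary parser type at compile time yet hold it behind one fixed type, here is a simplified sketch of the mechanism described above. Only the name AbstractParser comes from the text; the exact interfaces, the scanner parameter and the memory management are assumptions for illustration:

    // Polymorphic base: erases the static type of the captured parser.
    template <typename ScannerT>
    struct AbstractParser {
        virtual ~AbstractParser() {}
        virtual Match DoParse(ScannerT& scanner) const = 0;
    };

    // Concrete wrapper: remembers the real parser type and forwards to it.
    template <typename ScannerT, typename ParserT>
    struct ConcreteParser : AbstractParser<ScannerT> {
        ConcreteParser(ParserT const& p) : parser(p) {}
        virtual Match DoParse(ScannerT& scanner) const
        { return parser.Parse(scanner); }
        ParserT parser;    // the captured composite, stored by value
    };

    // A rule's templated assignment operator can then reduce any
    // right-hand parser expression to a single AbstractParser pointer:
    //
    //     template <typename ParserT>
    //     Rule& operator=(ParserT const& p)
    //     {
    //         delete abstract;
    //         abstract = new ConcreteParser<Scanner<>, ParserT>(p);
    //         return *this;
    //     }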

Forward declarations:

Rules may be declared before being defined to allow cyclic structures typically found in BNF declarations. Example:

    Rule<>   a, b, c;
			
    a = b | a;
    b = c | a;

Recursion:

Redefinition:

Rationale:

The Scanner:

As already mentioned in passing, the Spirit parser compiler, unlike traditional parser generators, handles parsing tasks at both the character and the phrase level [N3]. There is no separate lexical analyzer; the character and phrase levels are seamlessly integrated.

The Scanner wraps an iterator and a white space parser (or filter, if you will). The scanner extracts data from the input stream and filters this data to exclude white spaces as recognized by the white space filter. The scanner invokes this filter when tasked to scan the next character from the input. The scanner conforms to the interface of a standard STL constant (immutable) forward iterator.

The scanner is not a full-blown lexical analyzer. It does not extract tokens such as reserved words and operators. Nor does it extract numbers and literal strings [N10].

Here's an example that demonstrates the scanner in action [N11]:

    Scanner<>  scanner(iter, space);

The identifier iter is an iterator to some data that the scanner will be working on. The identifier space is one of the predefined parser primitives in the framework. The space parser matches white spaces as defined by the std::isspace(ch) standard function. Having defined our scanner, we may now pass it to any Parse(...) function. Doing so will make the parser involved skip white spaces:

    parser.Parse(scanner);    // White spaces are skipped

As mentioned, the scanner conforms to an STL forward iterator. Any forward iterator may be used in place of a scanner and passed to a Parse function. Doing so will force the parser to work on the character level (white spaces are not skipped).
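
The following sketch contrasts the two modes. The variable names and the null-terminated input are assumptions (how the end of the input is detected is not shown here); parser stands for any parser, as in the text above:

    char const* input = "1 + 2";
    char const* iter  = input;

    Scanner<> scanner(iter, space);
    parser.Parse(scanner);      // phrase level: white spaces are skipped

    char const* raw = input;
    parser.Parse(raw);          // character level: white spaces are NOT skipped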

Directives:

Parser directives have the form: Directive[expression].

A directive modifies the behavior of its enclosed expression, essentially 'decorating' it (Decorator Pattern [R5]). The framework predefines a few directives. Clients of the framework are free to define their own directives as needed.

Predefined directives:

  1. Lexeme:

    Turns off white space skipping. By default, the parser ignores white spaces and anything else treated as white space (possibly including comments), as determined by the scanner passed into the parser's Parse(...) member function.

    Situations where we want to work at the character level instead of the phrase level call for a special construct. Rules can be directed to work at the character level by enclosing the pertinent parts of the grammar inside the Lexeme directive.

  2. Longest:

    Alternatives in the Spirit parser compiler are short-circuited [N5]. Sometimes, this is not what is desired. The Longest directive instructs the parser not to short-circuit the alternatives enclosed inside it, but instead to try all of them and choose the one matching the longest portion of the input stream (a small usage sketch follows this list).

  3. Shortest:

    Opposite of the Longest directive.
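
Here is a small, hypothetical sketch of the Longest directive. The rule real is assumed to be defined elsewhere (integer is defined just below); the point is only that, for an input like "123.456", plain short-circuited alternatives stop after the integer part, while Longest keeps the longer match:

    Rule<> naive  = integer | real;            // short-circuited: on "123.456" only "123" matches
    Rule<> number = Longest[ integer | real ]; // tries both, keeps the longer match: "123.456"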

Now, going back to our original example, the observant reader might notice that the integer rule was left undefined. Although its definition is quite obvious, here's how it is actually defined in the context of the framework. Continuing our original example:

    integer = Lexeme[ !(plus | minus) >> +digit ];


where digit is another predefined parser that calls the std::isdigit(ch) standard function.

At last, we have our complete grammar specification:

    ChLit<>   plus   ('+');
    ChLit<>   minus  ('-');
    ChLit<>   times  ('*');
    ChLit<>   divide ('/');
    ChLit<>   oppar  ('(');
    ChLit<>   clpar  (')');

    Rule<>    integer, group, expr1, expr2, expr;

    integer   = Lexeme[ !(plus | minus) >> +digit ];
    group     = oppar >> expr >> clpar;
    expr1     = integer | group;
    expr2     = expr1 >> *((times >> expr1) | (divide >> expr1));
    expr      = expr2 >> *((plus >> expr2) | (minus >> expr2));

Yes, it's a calculator. The production rule expr in our grammar specification, traditionally called the start symbol [N12], accepts inputs such as the following (a possible driver program is sketched after this list):

    12345
    -12345
    +12345
    1 + 2
    1 * 2
    1/2 + 3/4
    1 + 2 + 3 + 4
    1 * 2 * 3 * 4
    (1 + 2) * (3 + 4)
    (-1 + 2) * (3 + -4)
    1 + ((6 * 200) - 20) / 6
    (1 + (2 + (3 + (4 + 5))))
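
Here is a minimal sketch of a driver for the grammar above, pieced together from the interfaces described in this document. The placement of the grammar definitions, the choice of input and the exact Match/Scanner usage are assumptions for illustration:

    #include <Spirit.h>
    #include <iostream>

    int main()
    {
        // ... the ChLit and Rule definitions shown above go here ...

        char const* input = "(1 + 2) * (3 + 4)";
        char const* iter  = input;

        Scanner<>  scanner(iter, space);
        Match      match = expr.Parse(scanner);

        if (match)
            std::cout << "Matched " << match.Length() << " characters\n";
        else
            std::cout << "Parse failed\n";
        return 0;
    }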

Semantic actions:

Semantic actions have the form: action(expression).

Ultimately, after having defined our grammar and generated a corresponding parser, we more often than not need to produce some output or do some work besides syntax analysis; unless, of course, all we want is to check an input for conformance with our grammar, which is seldom the case.

What we need is a mechanism that will instruct the parser on what should be done as it traverses the grammar while in the process of parsing an input stream. This mechanism is put in place through semantic actions.

Semantic actions may enclose any expression at any level within the parser hierarchy. These actions are C/C++ functoids that are called whenever a match is found in a particular context. They serve as hooks into the parser and may be used, for example, to:

  1. Generate output from the parser (ASTs, for example);
  2. Report warnings or errors; or
  3. Manage symbol tables.

Semantic functoids are subclasses of Action.

Action is a protocol base class for all actions. This is essentially an interface contract. The Action class does not really know how to process a match but instead relies on its template parameter, Derived (which, again, is assumed to be a subclass), to do the actual semantic action. Concrete subclasses inheriting from Action must have a member function Process(...) compatible with the conceptual interface:

    template <typename Iterator>
    void Process(Iterator first, Iterator last);

Example:

    struct MyAction : public Action<MyAction> {

        void Process(char const* begin, char const* end) { ... }

    } myAction;

    Rule<>   myRule = myAction(a | b | *(c >> d));

The member function MyAction::Process is called whenever the expression:

    a | b | *(c >> d)

matches a portion of the input stream during parsing. Two iterators (begin and end) are passed into the function. These iterators point to the start and end, respectively, of the portion of the input stream where the match was found.
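
Tying this back to the calculator, here is a hedged sketch of an action attached to the integer rule so that every matched integer is converted and reported. PushInt and the printed output are hypothetical; only the Action/Process contract and the action(expression) form come from the text:

    #include <cstdlib>
    #include <iostream>
    #include <string>

    struct PushInt : public Action<PushInt> {

        void Process(char const* begin, char const* end)
        {
            std::string text(begin, end);        // the matched characters
            long n = std::atol(text.c_str());    // convert to a number
            std::cout << "push " << n << '\n';   // e.g. emit a byte-code op
        }
    } pushInt;

    // The integer rule of the calculator, written with the action attached:
    integer = pushInt( Lexeme[ !(plus | minus) >> +digit ] );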

Wrap up:

Accompanying the framework are two (2) short test programs (test1.cpp and test2.cpp) that follow up on what we have started and demonstrate the use and capabilities of the Spirit framework. These are basically refinements of our calculator example. The second program (test2.cpp) is of particular interest because it implements a byte-code compiler and interpreter for our simple calculator.

The framework itself is small. There are ten (10) header files and two (2) *.cpp files. The <Spirit.h> file is the master header file that includes all the necessary headers in the framework. The framework does not rely on any code or library except the standard C++ libraries.

Spirit.h
Spirit_Action.h
Spirit_Basics.h
Spirit_Composite.h
Spirit_Directives.h
Spirit_Operators.h
Spirit_Parser.h
Spirit_Primitives.h
Spirit_Rule.h
Spirit_Scanner.h

Spirit_Directives.cpp
Spirit_Primitives.cpp

I believe there are a couple of innovations introduced in the Spirit framework worth looking into. The modeling of parsers as hierarchical composite objects, built up by composing simple objects into more complex ones, is a first of its kind. Another innovation is the application of template meta-programming techniques, expression templates in particular, to parsing. Ahh yes, one more tidbit: parser intersection &, difference -, and xor ^ might also prove to be an interesting research area. It has been noted that intersections let us define languages that are not context free [N6]. What about xor ^?

So far we have just scratched the surface of what we can do with the Spirit parser compiler. What we have so far is a prototype that demonstrates the concept. There are still avenues to explore that ultimately lead towards this new frontier. Have fun!

Annotations:

  1. The comma operator, as in a, b, seems to be a better candidate, syntax-wise. The problem is its precedence: it has the lowest precedence in C/C++, which makes it virtually useless.

  2. This technique is known as the "Curiously Recurring Template Pattern" in meta-programming circles [R3].

  3. Typical parsers regard characters (symbols that form words) and phrases (words that form sentences) as separate domains. Entities such as reserved words, operators, literal strings, numerical constants, etc., which constitute the terminals of a grammar are extracted first in a separate lexical analysis stage.

  4. The complement operator ~ was originally under consideration. Yet further thought about its value and meaning leads us to uncertainty. The basic problem stems from the fact that ~a would yield U-a, where U is the universal set of all strings. Might this be another fertile ground for future exploration?

  5. Alternative operands are tried one by one on a first come first served basis starting from the leftmost operand. After a successfully matched alternative is found, the parser concludes its search, essentially short-circuiting the search for other potentially viable candidates. This short-circuiting implicitly gives the highest priority to the leftmost alternative.

    Short-circuiting is done in the same manner as C or C++'s logical expressions; e.g. if (x < 3 || y < 2) where, if x evaluates to be less than 3, the y < 2 test is not done anymore. In addition to providing an implicit priority rule for alternatives which is necessary, given the non-deterministic nature of the Spirit parser compiler, short-circuiting improves the execution time.

  6. Some researchers assert that intersections let us define languages that are not context free [R4]. "The theory of defining a language as the intersection of a finite number of context free languages was developed by Leu and Weiner in 1973".

  7. C++ forbids the overloading of operators on primitives such as ints, chars and pointers (e.g. char*). At least one of the operands should be a user-defined object.

  8. Rule is a template class. The default has the form: Rule<>.

  9. We could have allowed this by making use of the new definitions as alternatives to the old. Yet this comes at the expense of additional checks and template instantiations. We pay for this even if we don't use the feature.

  10. Although the Spirit parser compiler does not need a separate lexical analyzer, there is no reason why we cannot have one. One can always have as many parser layers as needed. In theory, one may create a preprocessor, a lexical analyzer and a parser proper, all using the same framework.

  11. Scanner is a template class. The default has the form: Scanner<>.

  12. Typically, parsers have what is called a start symbol, chosen to be the root of the grammar where parsing starts. The Spirit parser compiler has no notion of a start symbol. Any rule can be a start symbol.

References:

  1. Todd Veldhuizen. "Expression Templates". C++ Report, June 1995.
  2. Peter Naur (ed.). "Report on the Algorithmic Language ALGOL 60". CACM, May 1960.
  3. James Coplien. "Curiously Recurring Template Pattern". C++ Report, Feb. 1995.
  4. Richard J. Botting, Ph.D. "XBNF" (citing Leu-Weiner, 1973). www.csci.csusb.edu/dick/maths/intro_ebnf.html, California State University, San Bernardino, 1998.
  5. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
  6. ISO/IEC. "ISO-EBNF", ISO/IEC 14977: 1996(E).
  7. Dick Grune and Ceriel Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood Ltd.: West Sussex, England, 1990. (electronic copy, 1998)
  8. T. J. Parr, H. G. Dietz, and W. E. Cohen. PCCTS Reference Manual (Version 1.00). School of Electrical Engineering, Purdue University, West Lafayette, August 1991.
  9. Adrian Johnstone and Elizabeth Scott. RDP, A Recursive Descent Compiler Compiler. Technical Report CSD TR 97 25, Dept. of Computer Science, Egham, Surrey, England, Dec. 20, 1997.