Spirit

Version 1.1

Copyright © 2001, Joel de Guzman
Isis Technologies. All rights reserved.
Email: djowel@gmx.co.uk
[ Edited by Sylvia de Guzman ]

About Spirit:

Spirit is an object oriented recursive descent parser generator framework implemented using template meta-programming techniques. Expression templates [R1] allow us to approximate the syntax of Extended Backus-Naur Form (EBNF) [R2] completely in C++.

The Spirit framework enables a target grammar to be written exclusively in C++. Inline EBNF grammar specifications can mix freely with other C++ code and, thanks to the generative power of C++ templates, are immediately executable. In contrast, conventional compiler-compilers or parser generators have to perform an additional translation step from the source EBNF code to C or C++ code.

A simple EBNF grammar snippet:

group   ::= '(' expr ')'
expr1   ::= integer | group
expr2   ::= expr1 (('*' expr1) | ('/' expr1))*
expr    ::= expr2 (('+' expr2) | ('-' expr2))*

is approximated using Spirit's facilities as seen in this code snippet:

group   = '(' >> expr >> ')';
expr1   = integer | group;
expr2   = expr1 >> *(('*' >> expr1) | ('/' >> expr1));
expr    = expr2 >> *(('+' >> expr2) | ('-' >> expr2));

Through the magic of expression templates, this is perfectly valid C++ code that can be executed. The production rule named expr is in fact an object with a member function Parse(...) that does the work, given source code written in the grammar that we have just declared.

Admittedly, we have made some modifications to the original EBNF syntax to conform to C++ syntax rules. Most notably, we see the abundance of shift (>>) operators. Since there are no 'empty' operators in C++, it is simply not possible to write something like:

a b

as seen in Math syntax, for example, to mean multiplication or, in our case, as seen in EBNF syntax to mean sequencing (b should follow a). The framework uses the shift (>>) operator instead for this purpose. We take the (>>) operator, with arrows pointing to the right, to mean "is followed by" [N1].

The alternative operator (|) and the parentheses (grouping) remain as is. The assignment operator (=) is used in place of BNF's (::=). Last but not least, the Kleene star (*), which used to be a postfix operator in EBNF, becomes a prefix. Instead of:

a*   ... in EBNF syntax,
we write:
*a   ... in Spirit.

There are no postfix stars (*) in C/C++.

Now what makes Spirit tick?

First, some concepts. A parser object is the most basic entity in the framework. Basically, a parser accepts an input stream and returns a Match object as its result. The Match object evaluates to true if some criteria or algorithm matches a portion of the data at the current stream position. If a match is found, the stream is advanced accordingly.

All parsers inherit from the base template class, Parser. This class is a protocol base class for all parsers. This is essentially an interface contract. The Parser class does not really know how to parse anything but instead relies on its sole template parameter Derived [N2] (which obviously must be a subclass of Parser) to do the actual parsing. Concrete subclasses inheriting from Parser must have a corresponding member function Parse(...) compatible with the conceptual interface:
template <typename ScannerT>
Match
Parse(ScannerT& scanner) const;
The Scanner template parameter is just another form of an iterator that points to the current input stream position. We shall elaborate on this later. For now, think of the scanner as just another STL-compliant forward iterator. It is passed by reference to allow the function to 'move' its position accordingly when a match is found. The parse is considered successful if a portion of the input starting at the current scanner position is matched. The Parse function terminates as soon as the scanner finds anything that the parser does not recognize.

A parser can be quite simple. Here is a sample parser that accepts all characters:
struct AllChars : public Parser<AllChars> {

    template <typename ScannerT>
    Match
    Parse(ScannerT& scanner) const
    {
        ++scanner;
        return Match(1);
    }
};
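The static dispatch that the Parser base class relies on, the Curiously Recurring Template Pattern [N2], can be sketched in isolation. The following miniature is illustrative only, not the framework's actual code; all names here (ParserBase, ParseVia, DoParse, XParser) are invented for the sketch:

```cpp
#include <cassert>

// Curiously Recurring Template Pattern: the base class is parameterized by
// its own derived class and forwards work to it without virtual functions.
template <typename Derived>
struct ParserBase
{
    // Static dispatch: cast 'this' down to the derived type at compile time.
    int ParseVia(char const*& in) const
    {
        return static_cast<Derived const*>(this)->DoParse(in);
    }
};

// A concrete parser that matches a single 'x' character.
struct XParser : ParserBase<XParser>
{
    int DoParse(char const*& in) const
    {
        if (*in == 'x') { ++in; return 1; }  // matched: advance, length 1
        return -1;                           // no match
    }
};

inline int runXParser(char const* text)
{
    XParser p;
    return p.ParseVia(text);
}
```

The point is that ParseVia resolves the call through the static_cast at compile time; no virtual functions are involved.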
A Match object is returned by the Parse function. Most importantly, the Match object reports the success of the parse; i.e. it evaluates to true if the Parse function is successful, false otherwise. If the parse is successful, the Match object may also be queried for the number of characters matched (match.Length()). The length is non-negative if the match is successful (a successful match may consume zero characters); otherwise a negative value is reported (typically -1). Here is a code snippet:
Match  match = parser.Parse(scanner);
bool   success = match;
int    length = match.Length();
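The contract just described (conversion to bool, Length(), negative length on failure, zero length allowed on success) can be sketched as follows. This is an illustration of the stated behavior only, not Spirit's actual class; matchDemo is an invented helper:

```cpp
#include <cassert>

// A minimal Match-like result class: converts to bool, reports the matched
// length, and holds a negative length (conventionally -1) on failure.
class Match
{
public:
    Match() : len(-1) {}                        // unsuccessful match
    explicit Match(int length) : len(length) {} // successful match
    operator bool() const { return len >= 0; }  // zero length still succeeds
    int Length() const { return len; }
private:
    int len;
};

// Invented helper so the behavior can be exercised directly.
inline Match matchDemo(bool succeed, int n)
{
    return succeed ? Match(n) : Match();
}
```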

Primitives:

The framework predefines some parser primitives. Here are some examples:
ChLit:  Character literal.  Example: ChLit<>('X');
Range:  Character range.    Example: Range<>('0','9');
StrLit: String literal.     Example: StrLit<>("Hello");

At this point, as evident in these examples, it is important to note that, contrary to standard practice, the Spirit framework handles parsing tasks at both the character and the phrase level [N3]. One may consider that a lexical analyzer is seamlessly integrated into the Spirit framework.

We may now fill in some of the gaps. Going back to our original example, the character literals '(', ')', '+', '-', '*' and '/' in the grammar declaration are actually ChLit objects that are implicitly created behind the scenes. One may prefer to declare these explicitly as:

ChLit<> plus   ('+');
ChLit<> minus  ('-');
ChLit<> times  ('*');
ChLit<> divide ('/');
ChLit<> oppar  ('(');
ChLit<> clpar  (')');
Here is the complete list of parser primitives:

Basics:


ChLit Character literal. Example: ChLit<>('X');
Range Character range. Example: Range<>('0','9');
StrLit String literal. Example: StrLit<>("Hello");

NCChLit Case insensitive character literal.
NCRange Case insensitive character range.
NCStrLit Case insensitive string literal.

  Note: These classes are templates parameterized by the character type, which defaults to char.

Predefined parser primitives:


ch_p(ch) Functor version of ChLit
range_p(from, to) Functor version of Range
str_p(str) Functor version of StrLit
ncch_p(ch) Functor version of NCChLit
ncrange_p(from, to) Functor version of NCRange
ncstr_p(str) Functor version of NCStrLit

 

Note: These function objects are designed to be used within expressions [N17]. Example:

    helloworld = str_p("hello") >> str_p("world");

anychar Matches any character
nothing Matches nothing
epsilon The epsilon, always a successful match with 0 length

alnum Alpha-numeric characters
alpha Alphabetic characters
cntrl Control characters
digit Numeric digits
graph Non-space printing characters
lower Lower case letters
print Printable characters
punct Punctuation symbols
space Space, tab, return, etc.
upper Upper case letters
xdigit Hexadecimal digits

  Note: All predefined parser objects may be used for char and wchar_t types

Numerics:

Aside from the predefined primitives listed above, the framework also provides a couple of predefined objects for parsing integers and real numbers.

Here is our short list of numeric parsers [N17]:

uint_p(n) Unsigned integers
int_p(n) Signed integers
real_p(n) Floating point numbers
C/C++ format, e.g. 123.456E78

The parameter n is a reference to a numeric type that will act as a placeholder for the parsed number [N18]. Here's a sample usage:

int       n;
Rule<>    r = '[' >> int_p(n) >> ']';
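The placeholder idea behind int_p can be sketched outside the framework as follows. parseUInt and parsedValue are invented names for this illustration and are not part of Spirit; the sketch handles unsigned decimal only:

```cpp
#include <cassert>
#include <cctype>

// Sketch of the numeric-parser idea: the parser writes the parsed value into
// a caller-supplied reference placeholder and returns the matched length.
inline int parseUInt(char const* in, int& out)
{
    int len = 0;
    int value = 0;
    while (std::isdigit(static_cast<unsigned char>(in[len])))
    {
        value = value * 10 + (in[len] - '0');
        ++len;
    }
    if (len == 0)
        return -1;      // no digits: the match fails, 'out' is untouched
    out = value;        // the placeholder receives the parsed number
    return len;
}

// Convenience wrapper returning the parsed value (or -1 on failure).
inline int parsedValue(char const* in)
{
    int n = -1;
    parseUInt(in, n);
    return n;
}
```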

Operators:

Operators are used as a means for object composition and embedding. Simple parsers may be composed to form composites through operator overloading, crafted to approximate the syntax of an Extended Backus-Naur Form (EBNF) variant. An expression such as:
a | b
actually yields a new parser type which is a composite of its operands, a and b. Taking this example further, if a and b were of type ChLit<>, the result would have the composite type:
Alternative<ChLit<>, ChLit<> >
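To sketch how an overloaded operator can yield such a composite type, here is a drastically simplified stand-alone version. Char, Alternative and the return-length convention here are illustrative stand-ins, not the framework's actual classes:

```cpp
#include <cassert>

// Each parser returns the number of characters matched, or -1 on failure.
struct Char
{
    char ch;
    int Parse(char const* in) const { return (*in == ch) ? 1 : -1; }
};

// The composite's type encodes both operand types, as in Alternative<A, B>.
template <typename A, typename B>
struct Alternative
{
    A a; B b;
    int Parse(char const* in) const
    {
        int r = a.Parse(in);            // try the left operand first
        return (r >= 0) ? r : b.Parse(in);
    }
};

// The overloaded operator builds the composite from its operands.
template <typename A, typename B>
Alternative<A, B> operator|(A a, B b)
{
    Alternative<A, B> alt = { a, b };
    return alt;
}

inline int parseAorB(char const* in)
{
    Char a = { 'a' };
    Char b = { 'b' };
    return (a | b).Parse(in);   // type: Alternative<Char, Char>
}
```

Note how the full type of (a | b) is computed at compile time from the operand types.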
A suite of operators facilitate the composition of parsers:
  1. Set-like operators on parsers [N4]:

    a | b Union (match a or b. Also referred to as alternatives) [N5]
    a & b Intersection (match a and b) [N6]
    a - b Difference (match a but not b)
    a ^ b XOR (match a or b, but not both)

  2. Sequencing:

    a >> b Matches a and b in sequence [N1]
    a && b Sequential-and operator. Same as >>. Matches a and b in sequence [N13]
    a || b Sequential-or. Matches a or b in sequence [N13]

  3. Iteration:

    *a Kleene star. Match a zero (0) or more times
    +a Positive. Match a one (1) or more times
    !a Optional. Match a zero (0) or one (1) time

Operator precedence and grouping:

Since we are defining our meta-language in C++, we follow C/C++'s operator precedence rules. Groups override this (e.g., *(a | b) reads: match a or b zero (0) or more times).

For binary operators, one of the operands (but not both [N7]) may be a char, wchar_t, char const* or wchar_t const*. Examples:
a | 'x';
a - "Hello World";
A few simple classes, when composed and structured in a hierarchy, form a very powerful object oriented recursive descent parsing engine. These classes provide the infrastructure needed for the construction of more complex parsers. The final parser composite is a non-deterministic recursive descent parser with infinite look-ahead LL(INF). Top-down descent traverses the hierarchy, checking each object for a match, backtracking and checking all possible alternatives.

The flexibility of object embedding and composition combined with recursion opens up a unique approach to parsing. Subclasses are free to form aggregates and algorithms of arbitrary complexity. Complex parsers can be created with the composition of only a few primitive classes.

The framework is designed to be fully open-ended and extensible. New primitives or composites, from the trivial to the complex, may be added at any time. Composition is totally static (happens at compile time). This is possible through the expressive flexibility of C++ template meta-programming.

Iterators:

So far we have introduced a couple of EBNF operators that deal with iteration. We have the (+) operator, which matches the preceding symbol one (1) or more times, as well as the Kleene star (*) which matches the preceding symbol zero (0) or more times.

If we look more closely, note that we generalized the optional expression of the form !a above into the same category as iterators. This is logical, considering that the optional closure matches the expression following it zero (0) or one (1) time. Taking this further, we may want a generalized iteration operator. To some, this may seem overkill. Yet there are grammars that are impractical and cumbersome, if not impossible, for the basic EBNF iteration syntax to specify. Examples:
  1. A file name may have at most 255 characters.
  2. A specific bitmap file format holds exactly 4096 RGB color entries.
  3. A 32-bit binary string (1 to 32 1s or 0s).

Other than the Kleene star (*), the Positive closure (+), and the optional (!), a more flexible mechanism for iteration is provided by the framework:

a.Repeat(); Repeat a, zero or more times.
Same as a.Repeat(0, more);
a.Repeat(n); Repeat a, exactly n times.
a.Repeat(n1, n2); Repeat a, at least n1 times and at most n2 times.
a.Repeat(n, more); Repeat a, n or more times.
a(n); Repeat a, exactly n times.
Same as a.Repeat(n);
a(n1, n2); Repeat a, at least n1 times and at most n2 times.
Same as a.Repeat(n1, n2);
a(n, more); Repeat a, n or more times.
Same as a.Repeat(n, more);
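The semantics of bounded repetition can be sketched with a stand-alone helper. repeatDigits is an invented illustration, hard-wired to a digit test, rather than the framework's Repeat:

```cpp
#include <cassert>

// Sketch of Repeat(n1, n2) semantics: match a digit at least n1 and at most
// n2 times, returning the matched length, or -1 if fewer than n1 matched.
inline int repeatDigits(char const* in, int n1, int n2)
{
    int count = 0;
    while (count < n2 && in[count] >= '0' && in[count] <= '9')
        ++count;                        // greedily consume up to n2 digits
    return (count >= n1) ? count : -1;  // fail if the lower bound is not met
}
```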

The Production Rule:

Rule is a class that captures the type and behavior of any primitive or composite parser that is assigned to it either via its assignment operator or its copy constructor. In effect, a rule allows an arbitrarily complex composite parser to be named. Naming of a rule allows it to be referenced in another compositional expression. Essentially, Rule 'IS-A' (inherits from) Parser and adheres to Parser's conceptual interface.

The type and behavior of a parser assigned to a rule is encoded in a conventional polymorphic class (AbstractParser). The rule's assignment operator and copy constructor dynamically create a concrete instance of AbstractParser; both are member templates parameterized by the type of the parser passed. Example:
Rule<>   aRule = *(a | b) & +(c | d | e);
The type and functionality of the right-hand expression, which may be arbitrarily complex, is encoded in the rule named aRule [N8].

Forward declarations:
Rules may be declared before being defined to allow cyclic structures typically found in BNF declarations. Example:
    Rule<>   a, b, c;
			
    a = b | a;
    b = c | a;
Recursion:
The right-hand side of a rule may reference other rules, including itself. The limitation is that direct or indirect left recursion is not allowed [N14] (this is an unchecked run-time error that results in an infinite loop). This is typical of LL(k) parsers. Example:

    a = a | b;    //    infinite loop! 

Redefinition:

Redefinition of rules is not allowed. This could have been enforced statically by not supplying a default constructor or an assignment operator, but then forward-declared rules would not have been possible. The caveat is that redefinition is caught only at run time [N9]:

    a = b | c;
    a = b;        //    run-time error

Undefined Rules:

An undefined rule matches nothing. This can be used to turn on/off extensions to a language at the parser level [N19]. For example:

r = feature_a | feature_b;
if (allow_feature_b)
    feature_b = /*...define feature_b...*/;

If the flag allow_feature_b is false, feature_b will be left undefined and will match nothing when parsing proceeds. So, the rule r will have only one alternative.

Rationale:

Rules straddle the border between static and dynamic C++. In effect, a rule transforms compile-time polymorphism (using templates) into run-time polymorphism (using virtual functions). This is necessary due to C++'s inability to automatically declare a variable of a type deduced from an arbitrarily complex expression in the right-hand side (rhs) of an assignment [N15]. Basically, we want to do something like:

    T   rule = an_arbitrarily_complex_expression;

without having to know or care about the resulting type of the right-hand side of the assignment expression. Apart from this, we also need a facility to forward declare an unknown type:

    T   rule;
    ...
    rule = a | b;       

These limitations lead us to this implementation of rules. This comes at the expense of the overhead of a virtual function call, once through each invocation of a rule.
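The transformation described above, from compile-time to run-time polymorphism, can be sketched as follows. This is a bare-bones illustration of the technique, not Spirit's actual Rule; all names here are invented for the sketch:

```cpp
#include <cassert>

// Abstract base: the run-time face of any parser a rule might hold.
struct AbstractParser
{
    virtual ~AbstractParser() {}
    virtual int Parse(char const* in) const = 0;
};

// Templated wrapper: erases the concrete parser's type behind the base.
template <typename ParserT>
struct ConcreteParser : AbstractParser
{
    ParserT p;
    ConcreteParser(ParserT p_) : p(p_) {}
    int Parse(char const* in) const { return p.Parse(in); }
};

class Rule
{
public:
    Rule() : impl(0) {}
    ~Rule() { delete impl; }
    template <typename ParserT>
    Rule& operator=(ParserT p)          // member template: accepts any parser
    {
        delete impl;
        impl = new ConcreteParser<ParserT>(p);
        return *this;
    }
    // An undefined rule matches nothing, as described in the text.
    int Parse(char const* in) const { return impl ? impl->Parse(in) : -1; }
private:
    AbstractParser* impl;
    Rule(Rule const&);                  // non-copyable in this sketch
};

struct Lit
{
    char ch;
    int Parse(char const* in) const { return (*in == ch) ? 1 : -1; }
};

inline int ruleDemo(char const* in)
{
    Rule r;
    Lit x = { 'x' };
    r = x;                              // the type of Lit is erased here
    return r.Parse(in);
}
```

The templated operator= is what lets a single Rule type absorb any parser type, at the cost of one virtual call per rule invocation.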

The Scanner:

As mentioned in passing, the Spirit parser compiler, unlike traditional parser generators, can handle parsing tasks at both the character and the phrase level [N3]. There is no separate lexical analyzer. There is perfect integration between the character and the phrase levels.

The Scanner wraps an iterator and a white space parser (or filter, if you will). The scanner extracts data from the input stream and filters this data to exclude white spaces as recognized by the white space filter. The scanner invokes this filter when tasked to scan the next character from the input. The scanner conforms to a standard STL constant (immutable) forward iterator.

The scanner is not a full-blown lexical analyzer. It does not extract tokens such as reserved words and operators. Nor does it extract numbers and literal strings [N10].

Here's an example that demonstrates the scanner in action [N11]:

    Scanner<>  scanner(iter, space);

The identifier iter is an iterator to some data that the scanner will be working on. The identifier space is one of the predefined parser primitives in the framework. The space parser matches white spaces as defined in the std::isspace(ch) standard function. After having defined our scanner, we may now pass this to any Parse(...) function. This will make the parser involved skip white spaces:

    parser.Parse(scanner);    // White spaces are skipped

The Scanner class is parameterized by the type of white space production and the type of the iterator that it wraps. Although the type of the scanner's white space production defaults to Space_, one can supply a more specific or elaborate white space skipping scheme for the scanner to use. To illustrate, say we want C/C++ style comments to be treated as white space in addition to spaces, tabs, newlines and carriage returns. We may define a production rule, ignore:

ignore = space | comment;
comment = "//" >> *(anychar - (ch_p('\0') | '\n' | '\r'))
        | "/*" >> *(anychar - (ch_p('\0') | "*/")) >> "*/";

Now we can use this rule to create our scanner. Thus:

Scanner<Rule<> > scanner(iter, ignore);
parser.Parse(scanner); // White spaces and comments are skipped

As mentioned, the scanner conforms to an STL forward iterator. Any forward iterator may be used in place of a scanner and passed to a Parse function. Doing so forces the parser to work at the character level (white spaces are not skipped). If we work solely at the character level, it is best to bypass the scanner this way instead of sprinkling the code with Lexeme directives.

On the other hand, the scanner can wrap any standard conforming iterator (as mentioned, at least a forward iterator is required). This gives us the flexibility to use a string, vector or even a list, for example, as input to our parser, instead of the default char*.
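The scanner's filtering behavior can be sketched with a minimal wrapper over char const*. MiniScanner is an invented illustration, far simpler than the framework's Scanner (which accepts any forward iterator and any white space parser):

```cpp
#include <cassert>
#include <cctype>

// Sketch of a scanner that wraps a character pointer and skips white space
// before each read, so clients only ever see non-space characters.
class MiniScanner
{
public:
    explicit MiniScanner(char const* in) : pos(in) { Skip(); }
    char operator*() const { return *pos; }
    MiniScanner& operator++() { ++pos; Skip(); return *this; }
private:
    void Skip()                         // the white-space "filter"
    {
        while (*pos != '\0' && std::isspace(static_cast<unsigned char>(*pos)))
            ++pos;
    }
    char const* pos;
};

// Count the characters a client of the scanner actually sees.
inline int countNonSpace(char const* text)
{
    int n = 0;
    for (MiniScanner s(text); *s != '\0'; ++s)
        ++n;
    return n;
}
```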

Directives:

Parser directives have the form: Directive[expression].

A directive modifies the behavior of its enclosed expression, essentially 'decorating' it (Decorator Pattern [R5]). The framework pre-defines a few directives. Clients of the framework are free to define their own directives as needed.

Predefined directives:

  1. Lexeme:

    Turns off white space skipping. By default the parser ignores white spaces and all things considered as white spaces, possibly including comments as parameterized by the scanner passed into the parser's Parse(...) member function.

    Situations where we want to work at the character level instead of the phrase level call for a special construct. Rules can be directed to work at the character level by enclosing the pertinent parts of the grammar inside the Lexeme directive.

  2. Longest:

    Alternatives in the Spirit parser compiler are short-circuited [N5]. Sometimes, this is not what is desired. The Longest directive instructs the parser not to short-circuit alternatives enclosed inside this directive, but instead makes the parser try all possible alternatives and choose the one matching the longest portion of the input stream.

  3. Shortest:

    Opposite of the Longest directive.
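The selection rule of the Longest directive (and, by symmetry, Shortest) can be sketched as follows. longestPrefix and its two hard-wired alternatives are invented for this illustration:

```cpp
#include <cassert>
#include <cstring>

// Sketch of Longest's selection rule: try every alternative and keep the
// longest successful match instead of stopping at the first one.
inline int longestPrefix(char const* in)
{
    // Two hypothetical alternatives: the literals "if" and "ifdef".
    int best = -1;
    if (std::strncmp(in, "if", 2) == 0 && best < 2)
        best = 2;
    if (std::strncmp(in, "ifdef", 5) == 0 && best < 5)
        best = 5;
    return best;    // first-match semantics would always have returned 2 here
}
```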

Now, going back to our original example, the observant reader might notice that the integer rule was left undefined. Although its definition is quite obvious, here's how it is actually defined in the context of the framework. Continuing our original example:

    integer = Lexeme[ !(ch_p('+') | '-') >> +digit ];

where digit is another predefined parser that calls the std::isdigit(ch) standard function.

Optionally, we can take advantage of the numeric parser int_p:

    int n;
    integer = int_p(n);

At last, we have our complete grammar specification:

    Rule<>    integer, group, expr1, expr2, expr;

    integer   = Lexeme[ !(ch_p('+') | '-') >> +digit ];
    group     = '(' >> expr >> ')';
    expr1     = integer | group;
    expr2     = expr1 >> *(('*' >> expr1) | ('/' >> expr1));
    expr      = expr2 >> *(('+' >> expr2) | ('-' >> expr2));

Yes, it's a calculator. The production rule expr in our grammar specification, traditionally called the start symbol [N12], can accept inputs such as:

    12345
    -12345
    +12345
    1 + 2
    1 * 2
    1/2 + 3/4
    1 + 2 + 3 + 4
    1 * 2 * 3 * 4
    (1 + 2) * (3 + 4)
    (-1 + 2) * (3 + -4)
    1 + ((6 * 200) - 20) / 6
    (1 + (2 + (3 + (4 + 5))))

Semantic actions:

Semantic actions have the form: expression[action]

Ultimately, after having defined our grammar and generated a corresponding parser, we more often than not need to produce some output or do some work besides syntax analysis, unless, of course, all we want is to check an input for conformance with our grammar, which is seldom the case.

What we need is a mechanism that will instruct the parser on what should be done as it traverses the grammar while in the process of parsing an input stream. This mechanism is put in place through semantic actions.

Semantic actions may be attached to any expression at any level within the parser hierarchy. An action is a C/C++ function or function object that will be called if a match is found in the particular context where it is attached. The action function serves as a hook into the parser and may be used to, for example:

  1. Generate output from the parser (ASTs, for example);
  2. Report warnings or errors; or
  3. Manage symbol tables.

A semantic action can be any free function or function object that is compatible with the conceptual interface:

template <typename Iterator>
void Action(Iterator first, Iterator last);

Example:

    void
    myAction(char const* begin, char const* end)
    {
        std::string str(begin, end);
        std::cout << str << std::endl;
    }

    Rule<>   myRule = (a | b | *(c >> d))[&myAction];

The function myAction will be called whenever the expression:

    a | b | *(c >> d)

matches a portion of the input stream upon parsing. Two iterators (begin and end) are passed into the function. These iterators point to the start and end, respectively, of the portion of input stream where the match is found.

Here now is our calculator enhanced with semantic actions:

void    doInt(char const* str, char const* end)
{
    std::string s(str, end);
    std::cout << s << std::endl;
}

void    doAdd(char const*, char const*)     { std::cout << "ADD\n"; }
void    doSubt(char const*, char const*)    { std::cout << "SUBT\n"; }
void    doMult(char const*, char const*)    { std::cout << "MULT\n"; }
void    doDiv(char const*, char const*)     { std::cout << "DIV\n"; }

Rule<>  expr;
Rule<>  integer = Lexeme[ (!(ch_p('+') | '-') >> +digit)[&doInt] ];
Rule<>  group   = '(' >> expr >> ')';
Rule<>  expr1   = integer | group;
Rule<>  expr2   = expr1 >> *(('*' >> expr1)[&doMult] | ('/' >> expr1)[&doDiv]);
expr            = expr2 >> *(('+' >> expr2)[&doAdd] | ('-' >> expr2)[&doSubt]);

Feeding in the expression "(-1 + 2) * (3 + -4)", for example, to the rule expr will produce the expected output:

-1
2
ADD
3
-4
ADD
MULT

which, by the way, is the Reverse Polish Notation (RPN) of the given expression, reminiscent of some primitive calculators and the language Forth.

Future Directions:

Spirit is an ongoing development. It is designed as a parser for the common programmer. Daily programming chores increasingly involve parsing. Most of the time, sometimes without realizing it, we are coding a parser by hand. Many stay away from full-blown tools such as YACC because of the jumbo-jet syndrome. Spirit's intent is to be a library-based parser that sits between regexes and a full-blown parser generator. Further developments will focus on this prime motivation.

    1. Character and String sets. A pre-release version of Spirit actually had character sets. That was before support for character types other than 8 bits (16 and 32 bits) was considered. While Spirit is usable as is, the performance of the framework will definitely be improved with the addition of character and string sets.

    2. Perfect hash keyword lookup. For fixed string lookup, a perfect hash lookup table will give O(1) performance. This is particularly suited for keywords and operators.

    3. More predefined parsers. Parsers for comments, delimited lists, strings, character literals, identifiers, and other commonly used utilities.

    4. Semantic predicates. Borrowed from PCCTS, this will allow finer algorithmic control over the disambiguation of alternatives. This is much like Spirit's semantic actions but geared towards syntax, rather than semantics.

    5. Symbol tables. Support for contexts, namespaces for the management of identifiers.

    6. ASTs. Abstract syntax tree generation.

    7. Parallel semantic action framework. A semantic action framework based on functional programming techniques in C++. This will be a separate entity and built as a layer on top of Spirit. Spirit will not rely on this nor even know of its existence. The semantic action facility of Spirit is the bare metal where a more extensive foundation will be laid down.

Wrap up:

Accompanying the framework are a couple of test programs that build on what we have started. There are four (4) short programs (calc1.cpp, calc2.cpp, feat.cpp and micro.cpp) that demonstrate the use and capabilities of the Spirit framework. The first two are refinements of our calculator example. The second program (calc2.cpp) is of particular interest because it implements a byte-code compiler and interpreter for our simple calculator. The program feat.cpp is a test program that determines whether the compiled framework is working as expected. The outputs of both calc1.cpp and feat.cpp may be checked against calc1.txt and feat.txt, respectively, to ensure that the framework is working well. Finally, micro.cpp demonstrates the micro-parser functionality and the numeric parsers in action.

The framework itself is small. There are twelve (12) header files (apart from configuration files). The <Spirit.h> file is the master header file that includes all the necessary headers in the framework. The framework does not rely on any code or library except the standard C++ libraries.

Spirit.h
Spirit_Action.h
Spirit_Basics.h
Spirit_Composite.h
Spirit_Directives.h
Spirit_Iterators.h
Spirit_Numerics.h
Spirit_Operators.h
Spirit_Parser.h
Spirit_Primitives.h
Spirit_Rule.h
Spirit_Scanner.h

I believe there are a couple of innovations introduced in the Spirit framework worth looking into. The modeling of parsers as hierarchical composite objects, coupled with the composition of simple objects into more complex ones as applied to parsing, is among the first of its kind. Another innovation is the application of template meta-programming techniques, expression templates in particular, to parsing.

After Spirit version 1.0 was released, I had the chance, thanks to Greg Colvin of Boost, to look into a prior work of Dr. Damian Conway [R10]. It came as a surprise that Spirit shares some similarities with Dr. Conway's work [N16], albeit using modern C++ features and techniques such as static polymorphism through partial template specialization versus dynamic dispatch through virtual functions. Still, in some ways, Spirit could benefit from learning more about Dr. Conway's work, especially its implementation of deferred expressions [R11], which admittedly is superior to Spirit's semantic actions.

What else? Ah yes, one more tidbit: the abundance of operators such as parser intersection (&), difference (-), and xor (^) might also prove to be an interesting research area. It has been noted that intersections allow us to define languages that are not context free [N6]. What about xor (^)?

So far we have just scratched the surface of what we can do with the Spirit parser compiler. What we have so far is a prototype that demonstrates the concept. There are still avenues to explore that ultimately lead towards this new frontier. I invite you now to trek this exciting journey with me. Have fun!

Annotations:

  1. The comma operator as in a, b seems to be a better candidate, syntax-wise. But then the problem is with its precedence. It has the lowest precedence in C/C++, which makes it virtually useless. There was some talk in Boost's mailing list about the possibility of writing a template meta-program that overrides the precedence of the comma operator. Yet I found that there is no way, using any meta-program, to distinguish explicit grouping using parentheses. From the viewpoint of the meta-program, r = a, b | c; is the same as r = a, (b | c);.

    The initial release of Spirit [V1.0] has caused a bit of commotion and controversy over the >> and the prefix * and + operators. While some find the operators liveable, likeable, and even lovable, some respected people in the field find them confusing and unacceptable. Some have strong opinions while some are somewhat neutral but with varied preferences. The author maintains that we cannot possibly view the syntax issue in a strictly logical perspective because subjectivity inevitably enters into the issue. It is not about right or wrong. Ultimately, it is plainly and simply a matter of preference.

  2. This technique is known as the "Curiously Recurring Template Pattern" in meta-programming circles [R3].

  3. Typical parsers regard characters (symbols that form words) and phrases (words that form sentences) as separate domains. Entities such as reserved words, operators, literal strings, numerical constants, etc., which constitute the terminals of a grammar are extracted first in a separate lexical analysis stage.

  4. The complement operator ~ was originally put into consideration. Yet further understanding of its value and meaning leads us to uncertainty. The basic problem stems from the fact that ~a will yield U - a, where U is the universal set of all strings. Might this be another fertile ground for future exploration?

  5. Alternative operands are tried one by one on a first come first served basis starting from the leftmost operand. After a successfully matched alternative is found, the parser concludes its search, essentially short-circuiting the search for other potentially viable candidates. This short-circuiting implicitly gives the highest priority to the leftmost alternative.

    Short-circuiting is done in the same manner as C or C++'s logical expressions; e.g. if (x < 3 || y < 2) where, if x evaluates to be less than 3, the y < 2 test is not done anymore. In addition to providing an implicit priority rule for alternatives which is necessary, given the non-deterministic nature of the Spirit parser compiler, short-circuiting improves the execution time.

  6. Some researchers assert that intersections let us define languages that are not context free [R4]. "The theory of defining a language as the intersection of a finite number of context free languages was developed by Leu and Weiner in 1973".

  7. C++ forbids the overloading of operators on primitives such as ints, chars and pointers (e.g. char*). At least one of the operands should be a user-defined object.

  8. Rule is a template class. The default has the form: Rule<>.

  9. We could have allowed this by making use of the new definitions as alternatives to the old. Yet this comes at the expense of additional checks and template instantiations. We pay for this even if we don't use the feature.

  10. Although the Spirit parser compiler does not need a separate lexical analyzer, there is no reason why we cannot have one. One can always have as many parser layers as needed. In theory, one may create a preprocessor, a lexical analyzer and a parser proper, all using the same framework.

  11. Scanner is a template class. The default has the form: Scanner<>.

  12. Typically, parsers have what is called a start symbol, chosen to be the root of the grammar where parsing starts. The Spirit parser compiler has no notion of a start symbol. Any rule can be a start symbol.

  13. The sequencing operator >> can alternatively be thought of as the sequential-and operator. An expression:
    a && b

    reads as: match a and b in sequence. Continuing this logic, we can also have a sequential-or operator where the expression:

    a || b;

    reads as: match a or b, or both in sequence. That is, if both a and b match, they must match in sequence; this is equivalent to a | b | (a >> b).

  14. There are ways to rewrite the meta-program hierarchy to circumvent this limitation. This will be implemented in a future release of Spirit in the form of a directive.

  15. David Abrahams proposed in comp.std.c++ to reuse the auto keyword for this purpose. Example:

      auto r = an_arbitrarily_complex_expression;

  16. Greg Colvin noted Spirit's similarities to a prior work of Dr. Damian Conway [R10]. Any similarity to Dr. Conway's work is purely coincidental. Yet, after reading Dr. Conway's papers, I have to acknowledge that some similarities do exist.

  17. In reality, these are actually objects that create parsers.

  18. The numeric parsers interface was suggested by Vesa Karvonen. Vesa envisions the use of such a facility in implementing micro-parsers. Such facilities "make implementing micro-parsers an order of magnitude more intuitive and simpler compared to traditional methods".

  19. Adding new productions. Some BNF variants allow:
    r ::= a
    r ::= b

    which is equivalent to:

    r ::= a | b

    Douglas Gregor noted that allowing the addition of new productions would be extremely useful to be able to, for instance, turn on/off extensions to a language at the parser level. Example:

    if (allow_typeof)
        typespec = TYPEOF >> '(' >> expr >> ')'
                 | TYPEOF >> '(' >> type_id >> ')';
    
    if (allow_template_typedefs)
        type_decl = template_header typedef;


    Essentially, the declarative nature of (E)BNF productions, when inlined with imperative C++ statements, yields an uncanny mix. The C++ assignment operator, substituting for BNF's ::= operator, might cause confusion when applied to the addition of new productions.

    The |= operator, as suggested by George Heintzelman, may be used for this purpose. Yet this too might cause confusion as one might expect &=, -= and ^= to be equally applicable.

    Also, after some deliberation, both cases impose a runtime and code-size penalty [N9].

References:

  1. Todd Veldhuizen. "Expression Templates". C++ Report, June 1995.
  2. Peter Naur (ed.). "Report on the Algorithmic Language ALGOL 60". CACM, May 1960.
  3. James Coplien. "Curiously Recurring Template Pattern". C++ Report, Feb. 1995.
  4. Richard J. Botting, Ph.D. "XBNF" (citing Leu-Weiner, 1973). www.csci.csusb.edu/dick/maths/intro_ebnf.html, California State University, San Bernardino, 1998.
  5. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
  6. ISO/IEC. "ISO-EBNF", ISO/IEC 14977: 1996(E).
  7. Dick Grune and Ceriel Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood Ltd.: West Sussex, England, 1990. (electronic copy, 1998)
  8. T. J. Parr, H. G. Dietz, and W. E. Cohen. PCCTS Reference Manual (Version 1.00). School of Electrical Engineering, Purdue University, West Lafayette, August 1991.
  9. Adrian Johnstone and Elizabeth Scott. RDP, A Recursive Descent Compiler Compiler. Technical Report CSD TR 97 25, Dept. of Computer Science, Egham, Surrey, England, Dec. 20, 1997.
  10. Damian Conway. Parsing with C++ Classes. ACM SIGPLAN Notices, 29:1, 1994.
  11. Damian Conway. Parsing with C++ deferred expressions. ACM SIGPLAN Notices, 29:1, 1994.