Spirit

Version 1.1

Copyright © 2001, Joel de Guzman
Isis Technologies. All rights reserved.
Email: djowel@gmx.co.uk
[ Edited by Sylvia de Guzman ]

About Spirit:

Spirit is an object oriented recursive descent parser generator framework implemented using template meta-programming techniques. Expression templates [R1] allow us to approximate the syntax of Extended Backus-Naur Form (EBNF) [R2] completely in C++.

The Spirit framework enables a target grammar to be written exclusively in C++. Inline EBNF grammar specifications can mix freely with other C++ code and, thanks to the generative power of C++ templates, are immediately executable. In contrast, conventional compiler-compilers or parser generators have to perform an additional translation step from the source EBNF code to C or C++ code.

A simple EBNF grammar snippet:

group   ::= '(' expr ')'
expr1   ::= integer | group
expr2   ::= expr1 (('*' expr1) | ('/' expr1))*
expr    ::= expr2 (('+' expr2) | ('-' expr2))*

is approximated using Spirit's facilities as seen in this code snippet:

group   = '(' >> expr >> ')';
expr1   = integer | group;
expr2   = expr1 >> *(('*' >> expr1) | ('/' >> expr1));
expr    = expr2 >> *(('+' >> expr2) | ('-' >> expr2));

Through the magic of expression templates, this is perfectly valid C++ code that can be executed. The production rule named expr is in fact an object with a member function Parse(...) that does the work, given source code written in the grammar that we have just declared.

Admittedly, we have made some modifications to the original EBNF syntax to conform to C++ syntax rules. Most notably, we see the abundance of shift (>>) operators. Since there are no 'empty' operators in C++, it is simply not possible to write something like:

a b

as seen in Math syntax, for example, to mean multiplication or, in our case, as seen in EBNF syntax to mean sequencing (b should follow a). The framework uses the shift (>>) operator instead for this purpose. We take the (>>) operator, with arrows pointing to the right, to mean "is followed by" [N1].

The alternative operator (|) and the parentheses (grouping) remain as is. The assignment operator (=) is used in place of BNF's (::=). Last but not least, the Kleene star (*), which used to be a postfix operator in EBNF, becomes a prefix. Instead of:

a*   ... in EBNF syntax,
we write:
*a   ... in Spirit.

There are no postfix stars (*) in C/C++.

Now what makes Spirit tick?

First, some concepts. A parser object is the most basic entity in the framework. Basically, a parser accepts an input stream and returns a Match object as its result. The Match object evaluates to true if some criteria or algorithm matches a portion of the data at the current stream position. If a match is found, the stream is advanced accordingly.

All parsers inherit from the base template class, Parser. This class is a protocol base class for all parsers. This is essentially an interface contract. The Parser class does not really know how to parse anything but instead relies on its sole template parameter Derived [N2] (which obviously must be a subclass of Parser) to do the actual parsing. Concrete subclasses inheriting from Parser must have a corresponding member function Parse(...) compatible with the conceptual interface:
template <typename ScannerT>
Match
Parse(ScannerT& scanner) const;
The Scanner template parameter is just another form of an iterator that points to the current input stream position. We shall elaborate on this later. For now, think of the scanner as just another STL-compliant forward iterator. It is passed by reference to allow the function to 'move' its position accordingly when a match is found. The parse is considered successful if a portion of the input starting at the current scanner position is matched. The Parse function terminates as soon as the scanner finds anything that the parser does not recognize.

A parser can be quite simple. Here is a sample parser that accepts all characters:
struct AllChars : public Parser<AllChars> {

    template <typename ScannerT>
    Match
    Parse(ScannerT& scanner) const
    {
        ++scanner;
        return Match(1);
    }
};
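The static dispatch that the Parser base class relies on, the Curiously Recurring Template Pattern [N2], can be sketched in isolation. The following miniature is illustrative only, not the framework's actual code; all names here (ParserBase, ParseVia, DoParse, XParser) are invented for the sketch:

```cpp
#include <cassert>

// Curiously Recurring Template Pattern: the base class is parameterized by
// its own derived class and forwards work to it without virtual functions.
template <typename Derived>
struct ParserBase
{
    // Static dispatch: cast 'this' down to the derived type at compile time.
    int ParseVia(char const*& in) const
    {
        return static_cast<Derived const*>(this)->DoParse(in);
    }
};

// A concrete parser that matches a single 'x' character.
struct XParser : ParserBase<XParser>
{
    int DoParse(char const*& in) const
    {
        if (*in == 'x') { ++in; return 1; }  // matched: advance, length 1
        return -1;                           // no match
    }
};

inline int runXParser(char const* text)
{
    XParser p;
    return p.ParseVia(text);
}
```

The point is that ParseVia resolves the call through the static_cast at compile time; no virtual functions are involved.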
A Match object is returned by the Parse function. Most importantly, the Match object reports the success of the parse; i.e. it evaluates to true if the Parse function is successful, false otherwise. If the parse is successful, the Match object may also be queried for the number of characters matched (match.Length()). The length is non-negative if the match is successful (a successful match may consume zero characters); otherwise a negative value is reported (typically -1). Here is a code snippet:
Match  match = parser.Parse(scanner);
bool   success = match;
int    length = match.Length();
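The contract just described (conversion to bool, Length(), negative length on failure, zero length allowed on success) can be sketched as follows. This is an illustration of the stated behavior only, not Spirit's actual class; matchDemo is an invented helper:

```cpp
#include <cassert>

// A minimal Match-like result class: converts to bool, reports the matched
// length, and holds a negative length (conventionally -1) on failure.
class Match
{
public:
    Match() : len(-1) {}                        // unsuccessful match
    explicit Match(int length) : len(length) {} // successful match
    operator bool() const { return len >= 0; }  // zero length still succeeds
    int Length() const { return len; }
private:
    int len;
};

// Invented helper so the behavior can be exercised directly.
inline Match matchDemo(bool succeed, int n)
{
    return succeed ? Match(n) : Match();
}
```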

Primitives:

The framework predefines some parser primitives. Here are some examples:
ChLit:  Character literal.  Example: ChLit<>('X');
Range:  Character range.    Example: Range<>('0','9');
StrLit: String literal.     Example: StrLit<>("Hello");

At this point, as evident in these examples, it is important to note that, contrary to standard practice, the Spirit framework handles parsing tasks at both the character and the phrase level [N3]. One may consider that a lexical analyzer is seamlessly integrated into the Spirit framework.

We may now fill in some of the gaps. Going back to our original example, the character literals '(', ')', '+', '-', '*' and '/' in the grammar declaration are actually ChLit objects that are implicitly created behind the scenes. One may prefer to declare these explicitly as:

ChLit<> plus   ('+');
ChLit<> minus  ('-');
ChLit<> times  ('*');
ChLit<> divide ('/');
ChLit<> oppar  ('(');
ChLit<> clpar  (')');
Here is the complete list of parser primitives:

Basics:


ChLit Character literal. Example: ChLit<>('X');
Range Character range. Example: Range<>('0','9');
StrLit String literal. Example: StrLit<>("Hello");

NCChLit Case insensitive character literal.
NCRange Case insensitive character range.
NCStrLit Case insensitive string literal.

  Note: These classes are templates parameterized by the character type, which defaults to char.

Predefined parser primitives:


ch_p(ch) Functor version of ChLit
range_p(from, to) Functor version of Range
str_p(str) Functor version of StrLit
ncch_p(ch) Functor version of NCChLit
ncrange_p(from, to) Functor version of NCRange
ncstr_p(str) Functor version of NCStrLit

 

Note: These function objects are designed to be used within expressions [N17]. Example:

    helloworld = str_p("hello") >> str_p("world");

anychar Matches any character
nothing Matches nothing
epsilon The epsilon, always a successful match with 0 length

alnum Alpha-numeric characters
alpha Alphabetic characters
cntrl Control characters
digit Numeric digits
graph Non-space printing characters
lower Lower case letters
print Printable characters
punct Punctuation symbols
space Space, tab, return, etc.
upper Upper case letters
xdigit Hexadecimal digits

  Note: All predefined parser objects may be used for char and wchar_t types

Numerics:

Aside from the predefined primitives listed above, the framework also provides a couple of predefined objects for parsing integers and real numbers.

Here is our short list of numeric parsers [N17]:

uint_p(n) Unsigned integers
int_p(n) Signed integers
real_p(n) Floating point numbers
C/C++ format, e.g. 123.456E78

The parameter n is a reference to a numeric type that will act as a placeholder for the parsed number [N18]. Here's a sample usage:

int       n;
Rule<>    r = '[' >> int_p(n) >> ']';
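The placeholder idea behind int_p can be sketched outside the framework as follows. parseUInt and parsedValue are invented names for this illustration and are not part of Spirit; the sketch handles unsigned decimal only:

```cpp
#include <cassert>
#include <cctype>

// Sketch of the numeric-parser idea: the parser writes the parsed value into
// a caller-supplied reference placeholder and returns the matched length.
inline int parseUInt(char const* in, int& out)
{
    int len = 0;
    int value = 0;
    while (std::isdigit(static_cast<unsigned char>(in[len])))
    {
        value = value * 10 + (in[len] - '0');
        ++len;
    }
    if (len == 0)
        return -1;      // no digits: the match fails, 'out' is untouched
    out = value;        // the placeholder receives the parsed number
    return len;
}

// Convenience wrapper returning the parsed value (or -1 on failure).
inline int parsedValue(char const* in)
{
    int n = -1;
    parseUInt(in, n);
    return n;
}
```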

Operators:

Operators are used as a means for object composition and embedding. Simple parsers may be composed to form composites through operator overloading, crafted to approximate the syntax of an Extended Backus-Naur Form (EBNF) variant. An expression such as:
a | b
actually yields a new parser type which is a composite of its operands, a and b. Taking this example further, if a and b were of type ChLit<>, the result would have the composite type:
Alternative<ChLit<>, ChLit<> >
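To sketch how an overloaded operator can yield such a composite type, here is a drastically simplified stand-alone version. Char, Alternative and the return-length convention here are illustrative stand-ins, not the framework's actual classes:

```cpp
#include <cassert>

// Each parser returns the number of characters matched, or -1 on failure.
struct Char
{
    char ch;
    int Parse(char const* in) const { return (*in == ch) ? 1 : -1; }
};

// The composite's type encodes both operand types, as in Alternative<A, B>.
template <typename A, typename B>
struct Alternative
{
    A a; B b;
    int Parse(char const* in) const
    {
        int r = a.Parse(in);            // try the left operand first
        return (r >= 0) ? r : b.Parse(in);
    }
};

// The overloaded operator builds the composite from its operands.
template <typename A, typename B>
Alternative<A, B> operator|(A a, B b)
{
    Alternative<A, B> alt = { a, b };
    return alt;
}

inline int parseAorB(char const* in)
{
    Char a = { 'a' };
    Char b = { 'b' };
    return (a | b).Parse(in);   // type: Alternative<Char, Char>
}
```

Note how the full type of (a | b) is computed at compile time from the operand types.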
A suite of operators facilitate the composition of parsers:
  1. Set-like operators on parsers [N4]:

    a | b Union (match a or b. Also referred to as alternatives) [N5]
    a & b Intersection (match a and b) [N6]
    a - b Difference (match a but not b)
    a ^ b XOR (match a or b, but not both)

  2. Sequencing:

    a >> b Matches a and b in sequence [N1]
    a && b Sequential-and operator. Same as >>. Matches a and b in sequence [N13]
    a || b Sequential-or. Matches a or b in sequence [N13]

  3. Iteration:

    *a Kleene star. Match a zero (0) or more times
    +a Positive. Match a one (1) or more times
    !a Optional. Match a zero (0) or one (1) time

Operator precedence and grouping:

Since we are defining our meta-language in C++, we follow C/C++'s operator precedence rules. Groups override this (e.g., *(a | b) reads: match a or b zero (0) or more times).

For binary operators, one of the operands (but not both [N7]) may be a char, wchar_t, char const* or wchar_t const*. Examples:
a | 'x';
a - "Hello World";
A few simple classes, when composed and structured in a hierarchy, form a very powerful object oriented recursive descent parsing engine. These classes provide the infrastructure needed for the construction of more complex parsers. The final parser composite is a non-deterministic recursive descent parser with infinite look-ahead LL(INF). Top-down descent traverses the hierarchy, checking each object for a match, backtracking and checking all possible alternatives.

The flexibility of object embedding and composition combined with recursion opens up a unique approach to parsing. Subclasses are free to form aggregates and algorithms of arbitrary complexity. Complex parsers can be created with the composition of only a few primitive classes.

The framework is designed to be fully open-ended and extensible. New primitives or composites, from the trivial to the complex, may be added at any time. Composition is totally static (happens at compile time). This is possible through the expressive flexibility of C++ template meta-programming.

Iterators:

So far we have introduced a couple of EBNF operators that deal with iteration. We have the (+) operator, which matches the preceding symbol one (1) or more times, as well as the Kleene star (*) which matches the preceding symbol zero (0) or more times.

If we look more closely, note that we generalized the optional expression of the form !a above into the same category as iterators. This is logical, considering that the optional closure matches the expression following it zero (0) or one (1) time. Taking this further, we may want a generalized iteration operator. To some, this may seem overkill. Yet there are grammars that are impractical and cumbersome, if not impossible, for the basic EBNF iteration syntax to specify. Examples:
  1. A file name may have at most 255 characters.
  2. A specific bitmap file format holds exactly 4096 RGB color entries.
  3. A 32-bit binary string (1 to 32 1s or 0s).

Other than the Kleene star (*), the Positive closure (+), and the optional (!), a more flexible mechanism for iteration is provided by the framework:

a.Repeat(); Repeat a, zero or more times.
Same as a.Repeat(0, more);
a.Repeat(n); Repeat a, exactly n times.
a.Repeat(n1, n2); Repeat a, at least n1 times and at most n2 times.
a.Repeat(n, more); Repeat a, n or more times.
a(n); Repeat a, exactly n times.
Same as a.Repeat(n);
a(n1, n2); Repeat a, at least n1 times and at most n2 times.
Same as a.Repeat(n1, n2);
a(n, more); Repeat a, n or more times.
Same as a.Repeat(n, more);
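The semantics of bounded repetition can be sketched with a stand-alone helper. repeatDigits is an invented illustration, hard-wired to a digit test, rather than the framework's Repeat:

```cpp
#include <cassert>

// Sketch of Repeat(n1, n2) semantics: match a digit at least n1 and at most
// n2 times, returning the matched length, or -1 if fewer than n1 matched.
inline int repeatDigits(char const* in, int n1, int n2)
{
    int count = 0;
    while (count < n2 && in[count] >= '0' && in[count] <= '9')
        ++count;                        // greedily consume up to n2 digits
    return (count >= n1) ? count : -1;  // fail if the lower bound is not met
}
```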

The Production Rule:

Rule is a class that captures the type and behavior of any primitive or composite parser that is assigned to it either via its assignment operator or its copy constructor. In effect, a rule allows an arbitrarily complex composite parser to be named. Naming of a rule allows it to be referenced in another compositional expression. Essentially, Rule 'IS-A' (inherits from) Parser and adheres to Parser's conceptual interface.

The type and behavior of a parser assigned to a rule is encoded in a conventional polymorphic class (AbstractParser). The rule's assignment operator and copy constructor dynamically create a concrete instance of AbstractParser; both are member templates parameterized by the type of the parser passed. Example:
Rule<>   aRule = *(a | b) & +(c | d | e);
The type and functionality of the right-hand expression, which may be arbitrarily complex, is encoded in the rule named aRule [N8].

Forward declarations:
Rules may be declared before being defined to allow cyclic structures typically found in BNF declarations. Example:
    Rule<>   a, b, c;
			
    a = b | a;
    b = c | a;
Recursion:
The right-hand side of a rule may reference other rules, including itself. The limitation is that direct or indirect left recursion is not allowed [N14] (this is an unchecked run-time error that results in an infinite loop). This is typical of LL(k) parsers. Example:

    a = a | b;    //    infinite loop! 

Redefinition:

Redefinition of rules is not allowed. This could have been enforced statically by not supplying a default constructor or an assignment operator, but then forward-declared rules would not have been possible. The caveat is that redefinition is caught only at run time [N9]:

    a = b | c;
    a = b;        //    run-time error

Undefined Rules:

An undefined rule matches nothing. This can be used to turn on/off extensions to a language at the parser level [N19]. For example:

r = feature_a | feature_b;
if (allow_feature_b)
    feature_b = /*...define feature_b...*/;

If the flag allow_feature_b is false, feature_b will be left undefined and will match nothing when parsing proceeds. So, the rule r will have only one alternative.

Rationale:

Rules straddle the border between static and dynamic C++. In effect, a rule transforms compile-time polymorphism (using templates) into run-time polymorphism (using virtual functions). This is necessary due to C++'s inability to automatically declare a variable of a type deduced from an arbitrarily complex expression in the right-hand side (rhs) of an assignment [N15]. Basically, we want to do something like:

    T   rule = an_arbitrarily_complex_expression;

without having to know or care about the resulting type of the right-hand side of the assignment expression. Apart from this, we also need a facility to forward declare an unknown type:

    T   rule;
    ...
    rule = a | b;       

These limitations lead us to this implementation of rules. This comes at the expense of the overhead of a virtual function call, once through each invocation of a rule.
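The transformation described above, from compile-time to run-time polymorphism, can be sketched as follows. This is a bare-bones illustration of the technique, not Spirit's actual Rule; all names here are invented for the sketch:

```cpp
#include <cassert>

// Abstract base: the run-time face of any parser a rule might hold.
struct AbstractParser
{
    virtual ~AbstractParser() {}
    virtual int Parse(char const* in) const = 0;
};

// Templated wrapper: erases the concrete parser's type behind the base.
template <typename ParserT>
struct ConcreteParser : AbstractParser
{
    ParserT p;
    ConcreteParser(ParserT p_) : p(p_) {}
    int Parse(char const* in) const { return p.Parse(in); }
};

class Rule
{
public:
    Rule() : impl(0) {}
    ~Rule() { delete impl; }
    template <typename ParserT>
    Rule& operator=(ParserT p)          // member template: accepts any parser
    {
        delete impl;
        impl = new ConcreteParser<ParserT>(p);
        return *this;
    }
    // An undefined rule matches nothing, as described in the text.
    int Parse(char const* in) const { return impl ? impl->Parse(in) : -1; }
private:
    AbstractParser* impl;
    Rule(Rule const&);                  // non-copyable in this sketch
};

struct Lit
{
    char ch;
    int Parse(char const* in) const { return (*in == ch) ? 1 : -1; }
};

inline int ruleDemo(char const* in)
{
    Rule r;
    Lit x = { 'x' };
    r = x;                              // the type of Lit is erased here
    return r.Parse(in);
}
```

The templated operator= is what lets a single Rule type absorb any parser type, at the cost of one virtual call per rule invocation.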

The Scanner:

As mentioned in passing, the Spirit parser compiler, unlike traditional parser generators, can handle parsing tasks at both the character and the phrase level [N3]. There is no separate lexical analyzer. There is perfect integration between the character and the phrase levels.

The Scanner wraps an iterator and a white space parser (or filter, if you will). The scanner extracts data from the input stream and filters this data to exclude white spaces as recognized by the white space filter. The scanner invokes this filter when tasked to scan the next character from the input. The scanner conforms to a standard STL constant (immutable) forward iterator.

The scanner is not a full-blown lexical analyzer. It does not extract tokens such as reserved words and operators. Nor does it extract numbers and literal strings [N10].

Here's an example that demonstrates the scanner in action [N11]:

    Scanner<>  scanner(iter, space);

The identifier iter is an iterator to some data that the scanner will be working on. The identifier space is one of the predefined parser primitives in the framework. The space parser matches white spaces as defined in the std::isspace(ch) standard function. After having defined our scanner, we may now pass this to any Parse(...) function. This will make the parser involved skip white spaces:

    parser.Parse(scanner);    // White spaces are skipped

The Scanner class is parameterized by the type of white space production and the type of the iterator that it wraps. Although the type of the scanner's white space production defaults to Space_, one can supply a more specific or elaborate white space skipping scheme for the scanner to use. To illustrate, say we want C/C++ style comments to be treated as white space in addition to spaces, tabs, newlines and carriage returns. We may define a production rule, ignore:

ignore = space | comment;
comment = "//" >> *(anychar - (ch_p('\0') | '\n' | '\r'))
        | "/*" >> *(anychar - (ch_p('\0') | "*/")) >> "*/";

Now we can use this rule to create our scanner. Thus:

Scanner<Rule<> > scanner(iter, ignore);
parser.Parse(scanner); // White spaces and comments are skipped

As mentioned, the scanner conforms to an STL forward iterator. Any forward iterator may be used in place of a scanner and passed to a Parse function. Doing so forces the parser to work at the character level (white spaces are not skipped). If we work solely at the character level, it is best to bypass the scanner this way instead of sprinkling the code with Lexeme directives.

On the other hand, the scanner can wrap any standard conforming iterator (as mentioned, at least a forward iterator is required). This gives us the flexibility to use a string, vector or even a list, for example, as input to our parser, instead of the default char*.
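The scanner's filtering behavior can be sketched with a minimal wrapper over char const*. MiniScanner is an invented illustration, far simpler than the framework's Scanner (which accepts any forward iterator and any white space parser):

```cpp
#include <cassert>
#include <cctype>

// Sketch of a scanner that wraps a character pointer and skips white space
// before each read, so clients only ever see non-space characters.
class MiniScanner
{
public:
    explicit MiniScanner(char const* in) : pos(in) { Skip(); }
    char operator*() const { return *pos; }
    MiniScanner& operator++() { ++pos; Skip(); return *this; }
private:
    void Skip()                         // the white-space "filter"
    {
        while (*pos != '\0' && std::isspace(static_cast<unsigned char>(*pos)))
            ++pos;
    }
    char const* pos;
};

// Count the characters a client of the scanner actually sees.
inline int countNonSpace(char const* text)
{
    int n = 0;
    for (MiniScanner s(text); *s != '\0'; ++s)
        ++n;
    return n;
}
```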

Directives:

Parser directives have the form: Directive[expression].

A directive modifies the behavior of its enclosed expression, essentially 'decorating' it (Decorator Pattern [R5]). The framework pre-defines a few directives. Clients of the framework are free to define their own directives as needed.

Predefined directives:

  1. Lexeme:

    Turns off white space skipping. By default the parser ignores white spaces and all things considered as white spaces, possibly including comments as parameterized by the scanner passed into the parser's Parse(...) member function.

    Situations where we want to work at the character level instead of the phrase level call for a special construct. Rules can be directed to work at the character level by enclosing the pertinent parts of the grammar inside the Lexeme directive.

  2. Longest:

    Alternatives in the Spirit parser compiler are short-circuited [N5]. Sometimes, this is not what is desired. The Longest directive instructs the parser not to short-circuit alternatives enclosed inside this directive, but instead makes the parser try all possible alternatives and choose the one matching the longest portion of the input stream.

  3. Shortest:

    Opposite of the Longest directive.
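The selection rule of the Longest directive (and, by symmetry, Shortest) can be sketched as follows. longestPrefix and its two hard-wired alternatives are invented for this illustration:

```cpp
#include <cassert>
#include <cstring>

// Sketch of Longest's selection rule: try every alternative and keep the
// longest successful match instead of stopping at the first one.
inline int longestPrefix(char const* in)
{
    // Two hypothetical alternatives: the literals "if" and "ifdef".
    int best = -1;
    if (std::strncmp(in, "if", 2) == 0 && best < 2)
        best = 2;
    if (std::strncmp(in, "ifdef", 5) == 0 && best < 5)
        best = 5;
    return best;    // first-match semantics would always have returned 2 here
}
```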

Now, going back to our original example, the observant reader might notice that the integer rule was left undefined. Although its definition is quite obvious, here's how it is actually defined in the context of the framework. Continuing our original example:

    integer = Lexeme[ !(ch_p('+') | '-') >> +digit ];

where digit is another predefined parser that calls the std::isdigit(ch) standard function.

Optionally, we can take advantage of the numeric parser int_p:

    int n;
    integer = int_p(n);

At last, we have our complete grammar specification:

    Rule<>    integer, group, expr1, expr2, expr;

    integer   = Lexeme[ !(ch_p('+') | '-') >> +digit ];
    group     = '(' >> expr >> ')';
    expr1     = integer | group;
    expr2     = expr1 >> *(('*' >> expr1) | ('/' >> expr1));
    expr      = expr2 >> *(('+' >> expr2) | ('-' >> expr2));

Yes, it's a calculator. The production rule expr in our grammar specification, traditionally called the start symbol [N12], can accept inputs such as:

    12345
    -12345
    +12345
    1 + 2
    1 * 2
    1/2 + 3/4
    1 + 2 + 3 + 4
    1 * 2 * 3 * 4
    (1 + 2) * (3 + 4)
    (-1 + 2) * (3 + -4)
    1 + ((6 * 200) - 20) / 6
    (1 + (2 + (3 + (4 + 5))))

Semantic actions:

Semantic actions have the form: expression[action]

Ultimately, after having defined our grammar and generated a corresponding parser, we more often than not need to produce some output or do some work besides syntax analysis, unless, of course, all we want is to check an input for conformance with our grammar, which is seldom the case.

What we need is a mechanism that will instruct the parser on what should be done as it traverses the grammar while in the process of parsing an input stream. This mechanism is put in place through semantic actions.

Semantic actions may be attached to any expression at any level within the parser hierarchy. An action is a C/C++ function or function object that will be called if a match is found in the particular context where it is attached. The action function serves as a hook into the parser and may be used to, for example:

  1. Generate output from the parser (ASTs, for example);
  2. Report warnings or errors; or
  3. Manage symbol tables.

A semantic action can be any free function or function object that is compatible with the conceptual interface:

template <typename Iterator>
void Action(Iterator first, Iterator last);

Example:

    void
    myAction(char const* begin, char const* end)
    {
        std::string str(begin, end);
        std::cout << str << std::endl;
    }

    Rule<>   myRule = (a | b | *(c >> d))[&myAction];

The function myAction will be called whenever the expression:

    a | b | *(c >> d)

matches a portion of the input stream upon parsing. Two iterators (begin and end) are passed into the function. These iterators point to the start and end, respectively, of the portion of input stream where the match is found.

Here now is our calculator enhanced with semantic actions:

void    doInt(char const* str, char const* end)
{
    std::string s(str, end);
    std::cout << s << std::endl;
}

void    doAdd(char const*, char const*)     { std::cout << "ADD\n"; }
void    doSubt(char const*, char const*)    { std::cout << "SUBT\n"; }
void    doMult(char const*, char const*)    { std::cout << "MULT\n"; }
void    doDiv(char const*, char const*)     { std::cout << "DIV\n"; }

Rule<>  expr;
Rule<>  integer = Lexeme[ (!(ch_p('+') | '-') >> +digit)[&doInt] ];
Rule<>  group   = '(' >> expr >> ')';
Rule<>  expr1   = integer | group;
Rule<>  expr2   = expr1 >> *(('*' >> expr1)[&doMult] | ('/' >> expr1)[&doDiv]);
expr            = expr2 >> *(('+' >> expr2)[&doAdd] | ('-' >> expr2)[&doSubt]);

Feeding in the expression "(-1 + 2) * (3 + -4)", for example, to the rule expr will produce the expected output:

-1
2
ADD
3
-4
ADD
MULT

which, by the way, is the Reverse Polish Notation (RPN) of the given expression, reminiscent of some primitive calculators and the language Forth.

Future Directions:

Spirit is an ongoing development. It is designed as a parser for the common programmer. Daily programming chores increasingly involve parsing. Most of the time, sometimes without realizing it, we are coding a parser by hand. Many stay away from full-blown tools such as YACC because of the jumbo-jet syndrome. Spirit's intent is to be a library-based parser that sits between regexes and a full-blown parser generator. Further developments will focus on this prime motivation.

    1. Character and String sets. A pre-release version of Spirit actually had character sets. That was before support for character types other than 8 bits (16 and 32 bits) was considered. While Spirit is usable as is, the performance of the framework will definitely be improved with the addition of character and string sets.

    2. Perfect hash keyword lookup. For fixed string lookup, a perfect hash lookup table will give O(1) performance. This is particularly suited for keywords and operators.

    3. More predefined parsers. Parsers for comments, delimited lists, strings, character literals, identifiers, and other commonly used utilities.

    4. Semantic predicates. Borrowed from PCCTS, this will allow finer algorithmic control over the disambiguation of alternatives. This is much like Spirit's semantic actions but geared towards syntax, rather than semantics.

    5. Symbol tables. Support for contexts, namespaces for the management of identifiers.

    6. ASTs. Abstract syntax tree generation.

    7. Parallel semantic action framework. A semantic action framework based on functional programming techniques in C++. This will be a separate entity and built as a layer on top of Spirit. Spirit will not rely on this nor even know of its existence. The semantic action facility of Spirit is the bare metal where a more extensive foundation will be laid down.

Wrap up:

Accompanying the framework are a couple of test programs that build on what we have started. There are four (4) short programs (calc1.cpp, calc2.cpp, feat.cpp and micro.cpp) that demonstrate the use and capabilities of the Spirit framework. The first two are refinements of our calculator example. The second program (calc2.cpp) is of particular interest because it implements a byte-code compiler and interpreter for our simple calculator. The program feat.cpp is a test program that determines whether the compiled framework is working as expected. The outputs of both calc1.cpp and feat.cpp may be checked against calc1.txt and feat.txt, respectively, to ensure that the framework is working well. Finally, micro.cpp demonstrates the micro-parser functionality and the numeric parsers in action.

The framework itself is small. There are twelve (12) header files (apart from configuration files). The <Spirit.h> file is the master header file that includes all the necessary headers in the framework. The framework does not rely on any code or library except the standard C++ libraries.

Spirit.h
Spirit_Action.h
Spirit_Basics.h
Spirit_Composite.h
Spirit_Directives.h
Spirit_Iterators.h
Spirit_Numerics.h
Spirit_Operators.h
Spirit_Parser.h
Spirit_Primitives.h
Spirit_Rule.h
Spirit_Scanner.h

I believe there are a couple of innovations introduced in the Spirit framework worth looking into. The modeling of parsers as hierarchical composite objects, coupled with the composition of simple objects into more complex ones as applied to parsing, is among the first of its kind. Another innovation is the application of template meta-programming techniques, expression templates in particular, to parsing.

After Spirit version 1.0 was released, I had the chance, thanks to Greg Colvin of Boost, to look into a prior work of Dr. Damian Conway [R10]. It came as a surprise that Spirit shares some similarities with Dr. Conway's work [N16], albeit using modern C++ features and techniques such as static polymorphism through partial template specialization versus dynamic dispatch through virtual functions. Still, in some ways, Spirit could benefit from learning more about Dr. Conway's work, especially its implementation of deferred expressions [R11], which admittedly is superior to Spirit's semantic actions.

What else? Ah yes, one more tidbit: the abundance of operators such as parser intersection (&), difference (-), and xor (^) might also prove to be an interesting research area. It has been noted that intersections allow us to define languages that are not context free [N6]. What about xor (^)?

So far we have just scratched the surface of what we can do with the Spirit parser compiler. What we have so far is a prototype that demonstrates the concept. There are still avenues to explore that ultimately lead towards this new frontier. I invite you now to trek this exciting journey with me. Have fun!

Annotations:

  1. The comma operator as in a, b seems to be a better candidate, syntax-wise. But then the problem is with its precedence. It has the lowest precedence in C/C++, which makes it virtually useless. There was some talk in Boost's mailing list about the possibility of writing a template meta-program that overrides the precedence of the comma operator. Yet I found that there is no way, using any meta-program, to distinguish explicit grouping using parentheses. From the viewpoint of the meta-program, r = a, b | c; is the same as r = a, (b | c);.

    The initial release of Spirit [V1.0] has caused a bit of commotion and controversy over the >> and the prefix * and + operators. While some find the operators liveable, likeable, and even lovable, some respected people in the field find them confusing and unacceptable. Some have strong opinions while some are somewhat neutral but with varied preferences. The author maintains that we cannot possibly view the syntax issue in a strictly logical perspective because subjectivity inevitably enters into the issue. It is not about right or wrong. Ultimately, it is plainly and simply a matter of preference.

  2. This technique is known as the "Curiously Recurring Template Pattern" in meta-programming circles [R3].

  3. Typical parsers regard characters (symbols that form words) and phrases (words that form sentences) as separate domains. Entities such as reserved words, operators, literal strings, numerical constants, etc., which constitute the terminals of a grammar are extracted first in a separate lexical analysis stage.

  4. The complement operator ~ was originally put into consideration. Yet further understanding of its value and meaning leads us to uncertainty. The basic problem stems from the fact that ~a will yield U - a, where U is the universal set of all strings. Might this be another fertile ground for future exploration?

  5. Alternative operands are tried one by one on a first come first served basis starting from the leftmost operand. After a successfully matched alternative is found, the parser concludes its search, essentially short-circuiting the search for other potentially viable candidates. This short-circuiting implicitly gives the highest priority to the leftmost alternative.

    Short-circuiting is done in the same manner as C or C++'s logical expressions; e.g. if (x < 3 || y < 2) where, if x evaluates to be less than 3, the y < 2 test is not done anymore. In addition to providing an implicit priority rule for alternatives which is necessary, given the non-deterministic nature of the Spirit parser compiler, short-circuiting improves the execution time.

  6. Some researchers assert that intersections let us define languages that are not context free [R4]. "The theory of defining a language as the intersection of a finite number of context free languages was developed by Leu and Weiner in 1973".

  7. C++ forbids the overloading of operators on primitives such as ints, chars and pointers (e.g. char*). At least one of the operands should be a user-defined object.

  8. Rule is a template class. The default has the form: Rule<>.

  9. We could have allowed this by making use of the new definitions as alternatives to the old. Yet this comes at the expense of additional checks and template instantiations. We pay for this even if we don't use the feature.

  10. Although the Spirit parser compiler does not need a separate lexical analyzer, there is no reason why we cannot have one. One can always have as many parser layers as needed. In theory, one may create a preprocessor, a lexical analyzer and a parser proper, all using the same framework.

  11. Scanner is a template class. The default has the form: Scanner<>.

  12. Typically, parsers have what is called a start symbol, chosen to be the root of the grammar where parsing starts. The Spirit parser compiler has no notion of a start symbol. Any rule can be a start symbol.

  13. The sequencing operator >> can alternatively be thought of as the sequential-and operator. An expression:
    a && b

    reads as: match a and b in sequence. Continuing this logic, we can also have a sequential-or operator where the expression:

    a || b;

    reads as: match a or b, or both in sequence. That is, if both a and b match, they must match in sequence; this is equivalent to a | b | (a >> b).

  14. There are ways to rewrite the meta-program hierarchy to circumvent this limitation. This will be implemented in a future release of Spirit in the form of a directive.

  15. David Abrahams proposed in comp.std.c++ to reuse the auto keyword for this purpose. Example:

      auto r = an_arbitrarily_complex_expression;

  16. Greg Colvin noted Spirit's similarities to a prior work of Dr. Damian Conway [R10]. Any similarity to Dr. Conway's work is purely coincidental. Yet, after reading Dr. Conway's papers, I have to acknowledge that some similarities do exist.

  17. In reality, these are actually objects that create parsers.

  18. The numeric parsers interface was suggested by Vesa Karvonen. Vesa envisions the use of such a facility in implementing micro-parsers. Such facilities "make implementing micro-parsers an order of magnitude more intuitive and simpler compared to traditional methods".

  19. Adding new productions. Some BNF variants allow:
    r ::= a
    r ::= b

    which is equivalent to:

    r ::= a | b

    Douglas Gregor noted that allowing the addition of new productions would be extremely useful to be able to, for instance, turn on/off extensions to a language at the parser level. Example:

    if (allow_typeof)
        typespec = TYPEOF >> '(' >> expr >> ')'
                 | TYPEOF >> '(' >> type_id >> ')';
    
    if (allow_template_typedefs)
        type_decl = template_header typedef;


    Essentially, the declarative nature of (E)BNF productions, when inlined with imperative C++ statements, yields an uncanny mix. The C++ assignment operator, substituting for BNF's ::= operator, might cause confusion when applied to the addition of new productions.

    The |= operator, as suggested by George Heintzelman, may be used for this purpose. Yet this too might cause confusion as one might expect &=, -= and ^= to be equally applicable.

    Also, after some deliberation, both cases impose a runtime and code-size penalty [N9].

References:

  1. Todd Veldhuizen. "Expression Templates". C++ Report, June 1995.
  2. Peter Naur (ed.). "Report on the Algorithmic Language ALGOL 60". CACM, May 1960.
  3. James Coplien. "Curiously Recurring Template Pattern". C++ Report, Feb. 1995.
  4. Richard J. Botting, Ph.D. "XBNF" (citing Leu-Weiner, 1973). www.csci.csusb.edu/dick/maths/intro_ebnf.html, California State University, San Bernardino, 1998.
  5. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
  6. ISO/IEC. "ISO-EBNF", ISO/IEC 14977: 1996(E).
  7. Dick Grune and Ceriel Jacobs. Parsing Techniques: A Practical Guide. Ellis Horwood Ltd.: West Sussex, England, 1990. (electronic copy, 1998)
  8. T. J. Parr, H. G. Dietz, and W. E. Cohen. PCCTS Reference Manual (Version 1.00). School of Electrical Engineering, Purdue University, West Lafayette, August 1991.
  9. Adrian Johnstone and Elizabeth Scott. RDP, A Recursive Descent Compiler Compiler. Technical Report CSD TR 97 25, Dept. of Computer Science, Egham, Surrey, England, Dec. 20, 1997.
  10. Damian Conway. Parsing with C++ Classes. ACM SIGPLAN Notices, 29:1, 1994.
  11. Damian Conway. Parsing with C++ deferred expressions. ACM SIGPLAN Notices, 29:1, 1994.