Iterators and Parsing

Simple parser classes are composed and structured to form more complex and powerful parsers. Such parser structures form the basic infrastructure of Spirit. Behind the scenes, iterators interweave this infrastructure. Iterators extract data from the input, parceling, potentially modifying or filtering, and then finally relegating the result to individual parser elements on demand until the input is exhausted.

The iterator concept and interface adheres to the formal definitions of the C++ Standard Template Library (STL). We can use iterators supplied by STL's containers (e.g. list, vector, string, etc.) as is, or perhaps write our own or use those pre-defined by the Spirit framework. Iterators can be very simple and typically defaults to a pointer (e.g. char const*). Iterators can be moderately complex such as the scanner and the multi_pass; these are iterator adapters that modify the behavior of other iterators that they are enclosing. At the other end of the spectrum, iterators can be quite complex; for instance, an iterator adapter that wraps a lexer such as LEX (In the near future, there is a plan to have a complete lexer iterator).

Input Iterators and multi_pass

Input iterators may still be used with backtracking iterators. The multi_pass iterator transforms an input iterator into a forward iterator.

In general, Spirit is a backtracking parser. This is not a requirement though. In the near future, we shall see more deterministic parsers that require no more than 1 character (token) of lookahead. Such parsers allow us to use input iterators such as the istream_iterator as is.

Generally speaking, Spirit is a backtracking parser. The implication of this is that at some point, the iterator position will have to be saved to allow the parser to backtrack to that point. Thus, for backtracking to work, the framework requires at least a forward iterator.

The parse member function may be called directly. For extremely simple parsing tasks, it is advisable not to use the rule at all. Apart from the fact that there's an unavoidable virtual function call overhead for each rule, the rule tightly couples a parser to an iterator. A rule once declared can only work with a specific type of iterator. Any parser linked to a rule or references a rule will in turn be coupled to the rule's specific iterator type.

Parsers, rules and iterators:

To clarify, let's start with an example: we want to parse a file path (to avoid burying ourselves in details, the following simplistic grammar is meant only as an example to get the message across clearly).

*alnum >> *('/' >> *alnum) >> '.' >> *alnum

This expression evaluates to a parser. Since this parser is not tied yet to a specific rule, it can use just about any iterator. Here's an example using the expression above to parse from C strings.

char const* str = "boost/spirit/spirit.hpp"; // our input
char const* first = str;
char const* last = str + strlen(str);

/***/
       
match hit = (*alnum >> *('/' >> *alnum) >> '.' >> *alnum).parse(first, last);

Here's an example using the same expression above, this time, the input comes from a list<char>:

list<char>              v; // our input
list<char>::iterator    first = v.begin();
list<char>::iterator    last = v.end();

/***/

match hit = (*alnum >> *('/' >> *alnum) >> '.' >> *alnum).parse(first, last);

If we were to use rules in the examples above, we need to explicitly parametize and couple the rule to the iterator type. In the first example, we have:

char const* str = "boost/spirit/spirit.hpp"; // our input
char const* first = str;
char const* last = str + strlen(str);

/***/

typedef char const* iterator_t;
typedef rule<iterator_t> rule_t;
rule_t  path = *alnum >> *('/' >> *alnum) >> '.' >> *alnum;
match   hit = path.parse(first, last);

The rule path is tied to char const* iterator type and cannot be used to parse input from another iterator type. We have to re-declare the rule to use list<char>::iterator to make it usable in our second example:

list<char>              v; // our input
list<char>::iterator    first = v.begin();
list<char>::iterator    last = v.end();

/***/

typedef list<char>::iterator iterator_t;
typedef rule<iterator_t> rule_t;
rule_t  path = *alnum >> *('/' >> *alnum) >> '.' >> *alnum;
match   hit = path.parse(first, last);

Notice the use of typedefs for iterator_t and rule_t. Sooner or later, we will have to use more complex iterators. It is best to use typedefs earlier on. That way, we can easily change the iterator_t type anytime without rewriting the whole grammar.

Wrapping the grammar in a template class:

Another interesting technique that avoids early coupling of the rule to a specific iterator type is to wrap the grammar in a template. Following our trivial example:

template <typename IteratorT>
struct my_grammar {

my_grammar()
{
path = *alnum >> *('/' >> *alnum) >> '.' >> *alnum;
}

rule<IteratorT> path;
};

Although this might seem trivial at first, this approach makes a grammar so much more flexible. Decoupling the iterator type from the rules that form a grammar allows the grammar to be used in different contexts possibly using different iterators. We don't care what iterator we are dealing with.

Also, as the grammar gets quite complicated, it is a good idea to group parts of the grammar into logical modules. For instance, when writing a language, it might be wise put expressions and statements into separate grammar structs (or classes). We may go further and declare the rules as private to the class while allowing const access to only some of the pertinent productions through inline member functions. This approach allows us to conveniently publish grammars in header files.