The Multi Pass

Input Iterators

In general, Spirit is a backtracking parser. This is not a requirement though. In the near future, we shall see more deterministic parsers that require no more than 1 character (token) of lookahead. Such parsers allow us to use input iterators such as the istream_iterator as is.

Backtracking in Spirit requires the use of the following types of iterator: forward, bidirectional, or random access. Because of backtracking, input iterators cannot be used. Therefore, the stardard library classes istreambuf_iterator and istream_iterator, that fall under the category of input iterators, cannot be used. Another input iterator that is of interest is one that wraps a lexer, such as LEX.

Unfortunately, with an input iterator, there is no way to save an iterator position, and thus input iterators will not work with backtracking in Spirit. One solution to this problem is to simply load all the data to be parsed into a container, such as a vector or deque, and then pass the begin and end of the container to Spirit. This method can be too memory intensive for certain applications, which is why the multi_pass iterator was created.

The multi_pass iterator will convert any input iterator into a forward iterator suitable for use with Spirit. multi_pass will buffer data when needed and will discard the buffer when only one copy of the iterator exists.

A grammar must be designed with care if the multi_pass iterator is used. If the start rule contains an alternative, multi_pass will simply slow down the parser with no improvement in memory consumption whatsoever. Any rule that may need to backtrack will cause data to be buffered. The rules that are good to use are sequence and repetition. Sequences of the form a >> b will not buffer data at all. Any rule that repeats, such as kleene_star (*a) or positive such as (+a), will only buffer the data for the current repetition.

In typical grammars, ambiguity and therefore lookahead is often localized . In fact, many well designed languages are fully deterministic and require no lookahead at all. Peeking at the first character from the input will immediately determine the alternative branch to take. Yet, even with highly ambiguous grammars, alternatives are often of the form *(a | b | c | d). The input iterator moves on and is never stuck at the beginning. Let's look at a Pascal snippet for example:

program = 
programHeading >> block >> '.';

block = *( labelDeclarationPart | constantDefinitionPart | typeDefinitionPart | variableDeclarationPart | procedureAndFunctionDeclarationPart ) >> statementPart;

Notice the alternatives inside the Kleene star in the rule block. The rule gobbles the input in a linear manner and throws away the past history with each iteration. As this is fully deterministic LL(1) grammar, each failed alternative only has to peek 1 character (token). The alternative that consumes more than 1 character (token) is definitely a winner. After which, the Kleene star moves on to the next.

Again, following up the example we started to use in the section on the scanner. Here's an example using the multi_pass: This time around we are extracting our input from the input stream using an istreambuf_iterator.

#include <boost/multi_pass.hpp> // the multi_pass has to be included
using boost::multi_pass;
using boost::make_multi_pass; /***/ ifstream in("input_file.txt"); // we get our input from this file
typedef char char_t;
typedef multi_pass<istreambuf_iterator<char_t> > iterator_t;
typedef scanner<iterator_t>     scanner_t;
typedef rule<scanner_t>         rule_t;
typedef parse_info<iterator_t>  parse_info_t;
iterator_t  first(make_multi_pass(istreambuf_iterator<char_t>(in)));
iterator_t  last(make_multi_pass(istreambuf_iterator<char_t>()));
rule_t          n_list = real_p >> *(',' >> real_p);
parse_info_t    info = parse(first, last, n_list, space);