The Scanner

As mentioned in passing, the Spirit parser compiler, unlike traditional parser generators, can handle parsing tasks at both the character and the phrase levels. The lexical analyzer is not a separate concept; the character and phrase levels are fully integrated.

The scanner conforms to a standard STL constant (immutable) forward iterator. The scanner is not a full-blown lexical analyzer. It does not extract tokens such as reserved words and operators. Nor does it extract numbers and literal strings.

The scanner is a template class parameterized by the iterator type (IteratorT, defaults to char const*) and the skipper type (SkipT, defaults to skipper<IteratorT>). It is an iterator adapter that wraps an iterator and a reference to a skipper object. The scanner extracts data from the input, skipping the characters between the words or lexemes that form the sentences and phrases of a language, as directed by the skipper object. Whenever it is asked to scan the next character from the input, the scanner delegates the skipping of characters to the skipper object.

The skipper is a utility class that facilitates the skipping of characters. A skipper is constructed by supplying a skip-parser that does the actual skipping, and an end iterator that points to the end of the input. It has a single member function, skip, which, when invoked, enters a loop that increments the 'current' iterator position until the supplied skip-parser fails to match.
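The relationship between the scanner and the skipper can be illustrated with a small standalone sketch. This is not the framework's implementation: toy_skipper, toy_scanner, is_space_char and collect are hypothetical names, and a plain function pointer stands in for the skip-parser. It only shows the idea of an iterator wrapper that asks a skipper to advance past ignorable characters before every access.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for a skip-parser: true if 'c' should be skipped.
typedef bool (*skip_pred_t)(char);

inline bool is_space_char(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Toy skipper (assumption, not the library class): holds the skip
// predicate and the end of the input.
struct toy_skipper
{
    toy_skipper(skip_pred_t pred_, char const* end_)
        : pred(pred_), end(end_) {}

    // Advance 'current' until the predicate fails or the input ends.
    void skip(char const*& current) const
    {
        while (current != end && pred(*current))
            ++current;
    }

    skip_pred_t pred;
    char const* end;
};

// Toy scanner (assumption, not the library class): wraps an iterator and
// a pointer to a skipper, and delegates skipping before every access.
class toy_scanner
{
public:
    toy_scanner(char const* iter_, toy_skipper const* skipper_)
        : iter(iter_), skipper(skipper_) {}

    // Must only be called when at_end() is false.
    char operator*() const { skipper->skip(iter); return *iter; }

    toy_scanner& operator++() { skipper->skip(iter); ++iter; return *this; }

    bool at_end() const { skipper->skip(iter); return iter == skipper->end; }

private:
    mutable char const* iter;       // skipping does not change what we "see"
    toy_skipper const* skipper;
};

// Collect every character the toy scanner yields from [first, last).
inline std::string collect(char const* first, char const* last)
{
    toy_skipper skip(is_space_char, last);
    toy_scanner scan(first, &skip);
    std::string out;
    while (!scan.at_end())
    {
        out += *scan;
        ++scan;
    }
    return out;
}
```

Calling collect on the text "  a ,\tb\n" yields the characters a, comma and b only; the white space never reaches the caller.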

Specifying another Skip-Parser

Although the examples in this section use space as the skip-parser, one can supply a more specific or elaborate skip-parser for the skipper to use. To illustrate, say we want C/C++ style comments to be treated as white space, in addition to spaces, tabs, newlines and carriage returns. We may define a production rule, ignore:

ignore = space | comment;
// ch_p wraps '\n' so that | builds a parser rather than a bitwise OR of two chars
comment = "//" >> *(anychar - (ch_p('\n') | '\r'))
  | "/*" >> *(anychar - "*/") >> "*/";

Now we can use this rule to create our skipper, then subsequently, our scanner. Thus:

skipper<> my_skipper(ignore, end_iter);
scanner<> my_scanner(iter, &my_skipper);

Here's a basic example that demonstrates the scanner in action, with space as the skip-parser:

skipper<> my_skipper(space, end_iter);
scanner<> my_scanner(iter, &my_skipper);

The identifier iter is an iterator to some data on which the scanner will be working. The end_iter points to the end of the input. The identifier space is one of the predefined parser primitives in the framework that matches white spaces (see Primitives).

In order to initiate parsing, we need to create two scanner objects. One points to the start of input data while another points to the end. This is a requirement of the parser's parse member function (see The Parser). Here's an example:

// Iterators to input data
char const* str_begin;
char const* str_end;    // declared separately so that both are pointers
	
skipper<> skip(space, str_end);
scanner<> first(str_begin, &skip);
scanner<> last(str_end, &skip);

Bypassing the Scanner

As mentioned, the scanner conforms to an STL forward iterator. Any forward iterator may be used in place of a scanner and passed to a parse function. Doing so will force the parser to work on the character level (where white spaces are not skipped). If we will work solely at the character level, it is best to bypass the scanner this way instead of sprinkling the code with lexeme directives (see Directives).

In contrast, once we have defined our scanners first and last as shown earlier, we may pass them to any parse function. The parser involved will then skip white spaces:

match hit = parser.parse(first, last);
// White spaces are skipped

The rule meets the scanner:

We have seen in the previous section that, once declared, the rule is coupled to an iterator. The rule uses a scanner<> by default. For simple char* iterators, the rule, scanner and skipper classes are quite easy to declare and use. Here is complete parser code that extracts a comma-separated list of numbers:

char const* str = "3, 4.5, 6e20, .0001"; // our input
char const* str_begin = str;
char const* str_end = str + strlen(str);

rule<>      n_list = real_p >> *(',' >> real_p);
skipper<>   skip(space, str_end);
scanner<>   first(str_begin, &skip);
scanner<>   last(str_end, &skip);

match       hit = n_list.parse(first, last);

All is well until one day we decide to use a different iterator type, say a wchar_t const*. Our code above will simply fail to compile. We have seen in the previous section that a bit of foresight does not hurt, especially when dealing with rules, the scanner and its skipper. Let us recode the snippet above, this time using typedefs instead of hard-coding the types and relying on the default template parameters:

typedef char const*         iterator_t;
typedef skipper<iterator_t> skipper_t;
typedef scanner<iterator_t> scanner_t;
typedef rule<scanner_t>     rule_t;

rule_t      n_list = real_p >> *(',' >> real_p);
skipper_t   skip(space, str_end);
scanner_t   first(str_begin, &skip);
scanner_t   last(str_end, &skip);

match       hit = n_list.parse(first, last);

Now, changing the iterator type to wchar_t const* will simply involve rewriting a single line:

typedef wchar_t const*  iterator_t;

The observant may ask: "so what good are the default parameters then?". In a nutshell, the default parameters are only good for quick and dirty coding. For practical, real life parsing tasks, it is better not to hard-code the types. Use typedefs. Better yet, wrap the grammar in a template class.
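The advice to wrap the grammar in a template class can be illustrated without the framework. The sketch below is a hand-written stand-in, not Spirit code: uint_list is a hypothetical class that recognizes a comma-separated list of unsigned integers (a simplification of the real-number list used above) and is parameterized by the iterator type, so the same source works for char const* and wchar_t const* input.

```cpp
#include <cstddef>

// Hypothetical hand-written "grammar" for a comma-separated list of
// unsigned integers, parameterized by the iterator type.
template <typename IteratorT>
class uint_list
{
public:
    // Returns the number of items matched. On return, 'first' points just
    // past the last item that was matched.
    static std::size_t parse(IteratorT& first, IteratorT last)
    {
        std::size_t count = 0;
        IteratorT pos = first;
        skip_spaces(pos, last);
        while (digits(pos, last))
        {
            ++count;
            first = pos;                    // commit: matched up to here
            skip_spaces(pos, last);
            if (pos == last || *pos != ',')
                break;
            ++pos;                          // consume the comma
            skip_spaces(pos, last);
        }
        return count;
    }

private:
    static void skip_spaces(IteratorT& it, IteratorT last)
    {
        while (it != last && (*it == ' ' || *it == '\t'))
            ++it;
    }

    // Matches one or more decimal digits; advances 'it' past them.
    static bool digits(IteratorT& it, IteratorT last)
    {
        IteratorT start = it;
        while (it != last && *it >= '0' && *it <= '9')
            ++it;
        return it != start;
    }
};
```

Switching from narrow to wide input then only changes the template argument, for example uint_list<char const*> versus uint_list<wchar_t const*>, which mirrors the single-typedef change described above.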

Free parse functions:

The framework provides a couple of free functions that make it a bit easier to use a parser with (or without) the scanner/skipper combo. These parse functions have two forms. The first form works on the phrase level and asks for a skip-parser. The second needs no skip-parser and works on the character level. In general, two iterators (first/last) should be passed in as usual. There are also convenience functions for char const* and wchar_t const* strings (the strings are assumed to be null-terminated).

The parse_info structure:

These functions return a parse_info structure parameterized by the iterator type passed in. The parse_info struct has these members:

stop: points to the final parse position (i.e. parsing processed the input up to this point).

match: true if parsing is successful.
This may be full: the parser consumed all the input,
or partial: the parser consumed only a portion of the input.

full: true when we have a full match (i.e. the parser consumed all the input).

length: the number of characters consumed by the parser. This is valid only if we have a successful match (either partial or full). A negative value means that the match is unsuccessful.
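How these four members relate to one another may be easier to see in a small standalone sketch. The struct below merely mirrors the members listed above; toy_parse_info and parse_digits are hypothetical names, the member types are assumptions, and this is not the framework's actual definition. The toy "parser" works on the character level and matches one or more decimal digits.

```cpp
#include <cctype>

// Hypothetical mirror of the members described above (types are assumed).
template <typename IteratorT>
struct toy_parse_info
{
    IteratorT stop;     // final parse position
    bool      match;    // true on a partial or a full match
    bool      full;     // true only when all the input was consumed
    int       length;   // characters consumed; negative on failure
};

// Toy character-level parser: matches one or more decimal digits.
inline toy_parse_info<char const*>
parse_digits(char const* first, char const* last)
{
    char const* p = first;
    while (p != last && std::isdigit(static_cast<unsigned char>(*p)))
        ++p;

    toy_parse_info<char const*> info;
    info.stop   = p;
    info.match  = (p != first);
    info.full   = info.match && (p == last);
    info.length = info.match ? static_cast<int>(p - first) : -1;
    return info;
}
```

With the input "123" the result is a full match of length 3; with "12ab" it is a partial match of length 2 whose stop member points at the a; with "ab" match is false and length is negative.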

Generic parse functions:

first form (phrase level):

template <typename IteratorT, typename ParserT, typename SkipT>
parse_info<IteratorT>
parse(
    IteratorT const& first,
    IteratorT const& last,
    ParserT const& parser,
    SkipT const& skip);

second form (character level):

template <typename IteratorT, typename ParserT>
parse_info<IteratorT>
parse(
    IteratorT const& first,
    IteratorT const& last,
    ParserT const& parser);

Parse functions for null terminated strings:

first form (phrase level):

template <typename CharT, typename ParserT, typename SkipT>
parse_info<CharT const*>
parse(
    CharT const* str,
    ParserT const& parser,
    SkipT const& skip);

second form (character level):

template <typename CharT, typename ParserT>
parse_info<CharT const*>
parse(
    CharT const* str,
    ParserT const& parser);

Let us continue our previous example. This time, we will use the first (phrase level) form of the free function for null-terminated strings shown above:

char const* str = "3, 4.5, 6e20, .0001"; // our input

typedef char const*             iterator_t;
typedef scanner<iterator_t>     scanner_t;
typedef rule<scanner_t>         rule_t;
typedef parse_info<iterator_t>  parse_info_t;

rule_t          n_list = real_p >> *(',' >> real_p);
parse_info_t    info = parse(str, n_list, space);

We simply pass the null-terminated string, the rule and a skip-parser to the parse function above. The parse function takes care of the details. After parsing has concluded, the result is passed back to us packaged in a parse_info structure.