The Scanner and Parsing

The Scanner and Parsing

The scanner's task is to feed the sequential input data stream to the parser. The scanner extracts data from the input, parceling, potentially modifying or filtering, and then finally relegating the result to individual parser elements on demand until the input is exhausted. The scanner is composed of two STL conforming forward iterators, first and last, where first is held by reference and last, by value. The first iterator is held by reference to allow it to be re-positioned. The following diagram illustrates what's happening:

The scanner manages various aspects of the parsing process through a set of policies. There are three sets of policies that govern:

Iteration and filtering
Recognition and matching
Handling semantic actions

These policies are mostly hidden from view and users generally need not know about them. Advanced users might however provide their own policies that override the ones that are already in place in order to fine tune the parsing process to fit her own needs. We shall see how this can be done. This will be covered in further detail later.

The scanner is a template class expecting two parameters: IteratorT, the iterator type and PoliciesT, its set of policies. IteratorT defaults to char const* while PoliciesT defaults to scanner_policies<>, a predefined set of scanner policies that we can use straight out of the box.

    template
    <
        typename IteratorT  = char const*,
        typename PoliciesT  = scanner_policies<>
    >
    class scanner;

Spirit uses the same iterator concepts and interface formally defined by the C++ Standard Template Library (STL). We can use iterators supplied by STL's containers (e.g. list, vector, string, etc.) as is, or perhaps write our own. Iterators can be as simple as a pointer (e.g. char const*). At the other end of the spectrum, iterators can be quite complex; for instance, an iterator adapter that wraps a lexer such as LEX.

Forward iterators

In general, the scanner expects at least a standard conforming forward iterator. Forward iterators are needed for backtracking where the iterator needs to be saved and restored later. Generally speaking, Spirit is a backtracking parser. The implication of this is that at some point, the iterator position will have to be saved to allow the parser to backtrack to a previous point. Thus, for backtracking to work, the framework requires at least a forward iterator.

Some parsers might require more specialized iterators (bi-directional or even random access). Perhaps in the future, deterministic parsers when added to the framework, will perform no backtracking and will need just a single token lookahead, hence will require input iterators only.

The Free Parse Functions

The framework provides a couple of free functions to make parsing a snap. These parser functions have two forms. The first form works on the character level. The second works on the phrase level and asks for a skip parser.

The skip parser is just about any parser primitive or composite. Its purpose is to move the scanner's first iterator to valid tokens by skipping what are considered white spaces. In C for instance, the tab '\t', the newline '\n', return '\r', space ' ' and characters inside comments /*...*/ are considered as white spaces.

Character level parsing

    template <typename IteratorT, typename DerivedT>
    parse_info<IteratorT>
    parse
    (
        IteratorT const&        first,
        IteratorT const&        last,
        parser<DerivedT> const& p
    );

    template <typename CharT, typename DerivedT>
    parse_info<CharT const*>
    parse
    (
        CharT const*            str,
        parser<DerivedT> const& p
    );

There are two variants. The first variant accepts a first, last iterator pair like you do STL algorithms. The second variant accepts a null terminated string. The last argument is a parser p which will be used to parse the input.

Phrase level parsing

    template <typename IteratorT, typename ParserT, typename SkipT>
    parse_info<IteratorT>
    parse
    (
        IteratorT const&        first,
        IteratorT const&        last,
        parser<ParserT> const&  p,
        parser<SkipT> const&    skip
    );

    template <typename CharT, typename ParserT, typename SkipT>
    parse_info<CharT const*>
    parse
    (
        CharT const*            str,
        parser<ParserT> const&  p,
        parser<SkipT> const&    skip
    );

Like above, there are two variants. The first variant accepts a first, last iterator pair like you do STL algorithms. The second variant accepts a null terminated string. The argument p is the parser which will be used to parse the input. The last argument skip is the skip parser.

The parse_info structure

The functions above return a parse_info structure parameterized by the iterator type passed in. The parse_info struct has these members:

parse_info
stop	Points to the final parse position (i.e The parser recognized and processed the input up to this point)
hit	True if parsing is successful. This may be full: the parser consumed all the input, or partial: the parser consumed only a portion of the input.
full	True when we have a full match (i.e The parser consumed all the input).
length	The number of characters consumed by the parser. This is valid only if we have a successful match (either partial or full).

Direct parsing with Iterators

On some occassions, we may wish to go low-level and call the parser's parse member function directly. Since we have the free parse functions to begin with, the truth is, we might not even need to know about these scanner intricacies. We might come across it, however, sometime in the future when we need to get under the hood. So, it's nice that we know what we are dealing with when that need comes.

If we wish to work on the character level, the procedure is quite simple:

    scanner<IteratorT> scan(first, last);

    if (p.parse(scan))
    {
        //  Parsed successfully. If first == last, then we have
        //  a full parse, the parser recognized the input in whole.
    }
    else
    {
        //  Parsing failure. The parser failed to recognize the input
    }

Where p is the parser we want to use, and first/last are the iterator pairs referring to the input. We just create a scanner given the iterators. The scanner type we will use here uses the default scanner_policies<>.

The situation is a bit more complex when we wish to work on the phrase level:

    typedef skip_parser_iteration_policy<SkipT> iter_policy_t;
    typedef scanner_policies<iter_policy_t> scanner_policies_t;
    typedef scanner<IteratorT, scanner_policies_t> scanner_t;

    iter_policy_t iter_policy(skip);
    scanner_policies_t policies(iter_policy);
    scanner_t scan(first, last, policies);

    if (p.parse(scan))
    {
        //  Parsed successfully. If first == last, then we have
        //  a full parse, the parser recognized the input in whole.
    }
    else
    {
        //  Parsing failure. The parser failed to recognize the input
    }

Where SkipT is the type of the skip-parser, skip. Again, p is the parser we want to use, and first/last are the iterator pairs referring to the input. Given a skip-parser type SkipT, skip_parser_iteration_policy creates a scanner iteration policy that skips over portions that are recognized by the skip-parser. This may then be used to create a scanner. The scanner_policies class wraps all scanner related policies including the iteration policies.

The Scanner Business

As much as possible, we do not want to complicate matters. However, dealing with some nitty gritty details is sometimes unavoidable. While the free functions above are indeed easy to use, sometimes you'll want to pass in a rule to one of the functions. The problem is that the rule is a template class that is parameterized by the scanner type. This is rather awkward but unavoidable: the rule is tied to a scanner. What's not obvious is that this scanner must be compatible with the scanner that is ultimately passed to the rule's parse member function. Otherwise, the compiler will complain. Many times, the problem is subtle. For example:

    rule<> r = *anychar_p;
    parse("hello world", r);            //  OK  [character level parsing]
    parse("hello world", r, space_p);   //  BAD [attempts phrase level parsing]

Why does the second call to parse not compile? Because of scanner incompatibility. Behind the scenes, the free parse function creates a scanner from the iterators passed in. In the first call to parse, the scanner created is a plain vanilla scanner<>. This is compatible with the default scanner type of rule<> [see default template parameters of the rule]. The second call creates a scanner of type phrase_scanner_t:

    typedef skipper_iteration_policy<>                  iter_policy_t;
    typedef scanner_policies<iter_policy_t>             scanner_policies_t;
    typedef scanner<char const*, scanner_policies_t>    phrase_scanner_t;

Thus, in order for the second call to succeed, the rule must be parameterized as rule<phrase_scanner_t>:

    rule<phrase_scanner_t> r = *anychar_p;
    parse("hello world", r, space_p);       //  OK [phrase level parsing]

Take note however that phrase_scanner_t is compatible only when you are using char const* iterators and space_p as the skip parser. Other than that, you'll have to find the right type of scanner. This is tedious to do correctly. In light of this issue, it is best to avoid rules as arguments to the parse functions. We shall see how this can be done in the next section. Keep in mind that this happens only with rules. The rule is the only parser that has to be tied to a particular scanner type. For instance:

    parse("hello world", *anychar_p);           //  OK  [character level parsing]
    parse("hello world", *anychar_p, space_p);  //  OK  [phrase level parsing]

Copyright © 1998-2002 Joel de Guzman

Permission to copy, use, modify, sell and distribute this document is granted provided this copyright notice appears in all copies. This document is provided "as is" without express or implied warranty, and with no claim as to its suitability for any purpose.