Mar 05

Tracking the Input Position While Parsing

By Peter Schüller Add comments

The following example is about tracking the parsing position with Spirit V2. This is useful for generating error messages which tell the user exactly where an error has occurred. We also show how to use Spirit V2 to parse from an input stream without first reading the whole stream into a std::string.

In our example, we want to parse a list of real numbers and return them in a std::vector and throw an exception if an error occurs. We will start with a naive way of achieving our goal and then improve the program in two steps. This naive way could be the following small program. It provides basic and not very helpful error handling and scales bad for large inputs:

#include <vector>
#include <istream>
#include <sstream>
#include <iostream>
#include <boost/include/spirit/qi.hpp>

namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;

// parse list of doubles from input stream
// throw exception (perhaps including filename) on errors
std::vector<double>
  parse(std::istream& input, const std::string& filename);

// main function
int main(int, char**)
{
  try
  {
    parse(std::cin, "STDIN");
  }
  catch(const std::exception& e)
  {
    std::cerr << "Exception: " << e.what() << std::endl;
    return -1;
  }
  return 0;
}

// implementation
std::vector<double>
  parse(std::istream& input, const std::string& filename)
{
  // get input into string
  std::ostringstream buf;
  buf << input.rdbuf();
  std::string str = buf.str();

  // prepare iterators
  std::string::iterator first = str.begin();
  std::string::iterator last = str.end();

  // parse
  std::vector<double> output;
  bool r = qi::phrase_parse(
    // iterators over input
    first, last,
    // recognize list of doubles
    qi::double_ >> *(',' >> qi::double_) >> qi::eoi,
    // comment skipper
    ascii::space | '#' >> *(ascii::char_ - qi::eol) >> qi::eol,
    // store into this object
    output);

  // error detection
  if(!r || first != last)
    throw std::runtime_error("parse error in " + filename);

  // return result
  return output;
}

On the following input, parsing succeeds (and there is no output):

# good input
1.3,2.3, 5

With the following faulty input

# bad input
123,42.0,a,1.4

we get an exception:

Exception: parse error in STDIN

If you want to try out this version of the program you may download it here: naive_parsing.cpp, the two input files are goodinput.txt and badinput.txt.

To make the above scalable for large files, we now change the example to parse from an STL stream, and not buffer everything into a std::string. For that, we need the boost::spirit::multi_pass class which wraps a stream iterator and exposes an iterator usable by Qi parsers. The reason we need this adaptor is, that a std::istream_iterator is an input iterator, and Qi requires a forward iterator which allows certain backtracking on the input.

So we include the following file

#include <boost/include/spirit/support_multi_pass.hpp>

and replace

   // get input into string
   std::ostringstream buf;
   buf << input.rdbuf();
   std::string str = buf.str();

   // prepare iterators
   std::string::iterator first = str.begin();
   std::string::iterator last = str.end();

by

   // iterate over stream input
   typedef std::istreambuf_iterator<char> base_iterator_type;
   base_iterator_type in_begin(input);

   // convert input iterator to forward iterator, usable by spirit parser
   typedef boost::spirit::multi_pass<base_iterator_type> forward_iterator_type;
   forward_iterator_type fwd_begin =
       boost::spirit::make_default_multi_pass(in_begin);
   forward_iterator_type fwd_end;

Now we only need to rename all occurrences of first to fwd_begin and all occurrences of last to fwd_end and we are done! (The complete second program can be downloaded from here: stream_iterator_parsing.cpp.)

This new program has the same behavior as the first one, but it does not need to buffer the whole input.

The final task is to get more useful error messages if a parsing problem occurs. To that end we wrap the iterator yet another time, using the boost::spirit::classic::position_iterator2 class which records line number, column number, and filename of the parsed input.

First we need one of additional include file:

#include <boost/include/spirit/classic_position_iterator.hpp>
namespace classic = boost::spirit::classic;

This file in fact provides several possibilities for position iterators: boost::spirit::classic::position_iterator is a bit more efficient than boost::spirit::classic::position_iterator2, but the latter allows to extract the whole line of input from an iterator in that line (and we will use this feature). Additionally, one can specify a ‘position iterator policy’ to only store line numbers and not column numbers. This increases efficiency but might decrease usefulness of the position information.

Wrapping the forward iterator with the position iterator is not difficult, we create another typedef and initialize the iterators as follows:

   // wrap forward iterator with position iterator, to record the position
   typedef classic::position_iterator2<forward_iterator_type>
     pos_iterator_type;
   pos_iterator_type position_begin(fwd_begin, fwd_end, filename);
   pos_iterator_type position_end;

Hint: make sure there are no brackets after position_end – this often happens and creates compile errors which are difficult to find: with brackets you do not declare a variable of type pos_iterator_type, but a function returning pos_iterator_type!

We now use the new position iterator for parsing, but the grammar needs to be adjusted a bit. We will replace the following fragment:

  // parse
  bool r = qi::phrase_parse(
    // iterators over input
    fwd_begin, fwd_end,
    // recognize list of doubles
    qi::double_ >> *(',' >> qi::double_) >> qi::eoi,
    // comment skipper
    ascii::space | '#' >> *(ascii::char_ - qi::eol) >> qi::eol,
    // doubles are stored into this object
    output);

  // error detection
  if (!r || fwd_begin != fwd_end)
    throw std::runtime_error("parse error in " + filename);

In our new code, we first call qi::phrase_parse as before, but this time we disallow backtracking by using > instead of >> for certain parser sequence operators in the grammar: if the parser would backtrack over such a sequence, it throws a qi::expectation_failure<pos_iterator_type> exception.

   try
   {
     qi::phrase_parse(
       // iterators over input
       position_begin, position_end,
       // recognize list of doubles
       qi::double_ > *(',' > qi::double_) >> qi::eoi,
       // comment skipper
       ascii::space | '#' >> *(ascii::char_ - qi::eol) >> qi::eol,
       // store into this object
       output);
   }
   catch(const qi::expectation_failure<pos_iterator_type>&amp; e)
   {
   }

To process the caught exception we add the following code:

     const classic::file_position_base<std::string>& pos =
         e.first.get_position();
     std::stringstream msg;
     msg <<
         "parse error at file " << pos.file <<
         " line " << pos.line << " column " << pos.column << std::endl <<
         "'" << e.first.get_currentline() << "'" << std::endl <<
         std::setw(pos.column) << " " << "^- here";
     throw std::runtime_error(msg.str());

(The complete final program can be downloaded from here: stream_iterator_errorposition_parsing.cpp.)

This code creates an error message from the filename, the line and column number and the currently parsed line.

The faulty input now creates the following useful error message:

Exception: parse error at file STDIN line 1 column 10
'123,42.0,a,1.4'
          ^- here

The programs of this example can be downloaded here: naive_parsing.cpp, stream_iterator_parsing.cpp, and stream_iterator_errorposition_parsing.cpp. The two input files are goodinput.txt and badinput.txt.

35 Responses to “Tracking the Input Position While Parsing”

  1. Olaf says:

    stream_iterator_errorposition_parsing.cpp doesn’t compile at least on g++4.1.1

    What’s wrong here?

    • Which version of boost are you using?

      Due to a bug which crept into 1.42, this example will work for 1.41 and for the current trunk, but – unfortunately – not for version 1.42.

      If this does not solve the problem: What is the error message?

      Does the second example “stream_iterator_parsing.cpp” compile?

    • Finally I got myself a g++-4.1.3 to test, and I can confirm that it does not work.

      I will forward this issue to the mailing list.

      It works with g++-4.2.4 though, between these versions I have not tested.

      Is your error message similar to

      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:129:
       error: no type named ‘iterator_category’ in ‘_ForwardIterator2’
      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:130:
       error: no type named ‘value_type’ in ‘_ForwardIterator2’
      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:131:
       error: no type named ‘difference_type’ in ‘_ForwardIterator2’
      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:132:
       error: no type named ‘pointer’ in ‘_ForwardIterator2’
      /usr/include/c++/4.1.3/bits/stl_iterator_base_types.h:133:
       error: no type named ‘reference’ in ‘_ForwardIterator2’
      
  2. Thomas Heller says:

    Hi,
    I was wondering if something like this is possible when using an iterator provided by the lexer?

    • This is a good question – in fact I wrote this example to build something connected to a lexer.

      As a lexer (at the moment) requires a random access iterator and not a forward iterator, the first technique (i.e., parsing from a stream using multi_pass) is not applicable to lexers.

      However, the input position tracking should be possible with a lexer, so the error message output should also be possible.

      Best Regards,
      Peter

      • Hartmut Kaiser says:

        Peter,

        FWIW, the iterator type restriction for a lexer has already been relaxed for Boost V1.41. The lexer requires a forward iterator as it’s underlying data stream, only. Does the documentation still say you need a random access iterator?

        Regards Hartmut

        • Sorry, I recently saw this “random access requirement” on the mailing list, so I assumed it still is in the same state.

          Although, this must have been a post from quite some while ago if it has been fixed in v1.41.

    • Hartmut Kaiser says:

      Thomas,

      I’m planning to write an example demonstrating how this can be done. Stay tuned!

      Regards Hartmut

      • Thomas Heller says:

        Thanks. I am really looking forward to this. As all my experiments with that ended up in long compile errors.

  3. Wieger says:

    Hi Peter,

    your example is much appreciated! When playing with it I noted the following inconsistent behavior (with Visual C++ express 2008):

    ’1, 2 a’ parses OK
    ’1, 2 aa’ results in the error message

    Exception: parse error at file STDIN line 1 column 2
    ’1, 2 a’
    ^- here

    Is there a logical explanation for this?

    Regards,

    Wieger

    • With g++-4.4 it is as follows (and as I would expect it to be):

      ’1, 2 a’ returns the error at the ‘,’
      ’1, 2 aa’ likewise returns the error at the ‘,’

      Both errors say “line 1 column 2″ which is expected in the following way:

      The *(‘,’ > qi::double_) Kleene-star is entered and traversed correctly for the “, 2″ part.

      But there is no subsequent ‘,’, and as ‘a’ does not match sp::eoi there is an attempted backtracking out of the Kleene-star to the previous double_ which fails and produces the above error message.

      Best,
      Peter

      • Thomas says:

        …but consider parsing “1,2,3,4 a”…

        My first (beginners’) expectation (literally, no pun!) would be that the error is reported at the last comma before the “4 a”, since the expression in the kleene star is ‘,’ > qi::double_, and I read this as an expectation that can’t backtrack past the last comma parsed by the kleene star repetition. However, instead I get the error reported at the first comma after the “1″, as if the parser backtracked out of several repetitions of the kleene star, despite the expectation operator inside the kleene star’s expression, only to be stopped by the outer-most expectation right at the beginning of the grammar.

        Can someone explain this to me?

        • In my opinion the parser should not be able to backtrack through a “>”.

          My only explanation is, that your locale specifies “,” as decimal separator. Could this be the case?

          Two ways to find out:

          1) replace double_ by int_ and only use integers in your input. if the parser does less backtracking the “,” is probably interpreted as decimal point of the double.

          2) replace “,” by some character like “x” in the grammar and in the input and see if the parser also backtracks further.

          • Thomas says:

            Thanks for the fast response. Both suggestions make no difference. My locale settings are sane. Were you able to reproduce my situation with your environment? I’m using Boost V1.46 and GCC 4.4.3.

            I have been thinking about it hard though.

            Firstly, as a comparison, the input “1,2,3,a” does give the error in the expected place. In this case I am guessing that when the “a” is read, the double_ parser doesn’t match and the expectation_failure is thrown as the parser tries to backtrack to the comma preceding it.

            So considering the input “1,2,3,4a”, I realised that of course there is no expectation implied by the kleene star operator that a comma follows a double, or else the kleene would only match zero or an infinite number of repetitions. Therefore I must assume that the kleene expression matches fully on “,2,3,4″ and therefore there is no backtracking at this point. Now I guess the parser sees “a” instead of eoi and now starts to backtrack. I think I have been using a confused definition of backtracking. I think now that the parser backtracks up from a dead-end in the parse tree, looking for alternative constructions, rather than backtracking through chars in the input stream. In this case, I believe it is not backtracking through the chars “4″, “,”, “3″, “,”, etc., but rather it is backtracking up the parse tree by first popping the subtree from the successful kleene expression match (“,2,3,4″), and then attempts to pop the initial double_ match (“1″) but is forbidden by the expectation operator there and throws the expectation_failure at that point.

            I would be happy for an expert to validate my analysis because I’m flying by the seat of my pants here!

            I think this confusion came about since backtracking was mentioned in the documentation for the multi_pass iterator with regards to specifying ‘flushing points’ for the buffer.

            As a solution, I replaced the final sequence with an expectation also (replace “>> qi::eoi” with “> qi::eoi”). This causes that expectation to throw instead, as once the parser sees “a” instead of eoi it cannot even pop off the kleene expression subtree due to this new expectation.

            Any comments?

          • I do not think that it should be possible to backtrack over a kleene operator that contains no-backtrack-instructions.
            It may very well be the case that this is possible though.
            I would suggest to post this question to the spirit mailing list spirit-general@lists.sourceforge.net

  4. Wieger says:

    It seem that on Windows the last character gets lost. If I do ‘type input.txt | stream_iterator_errorposition_parsing.exe’ the last character of input.txt is ignored. But even if I modify the program as follows the behavior on Windows and linux is different(!)

    std::string s = “1, 2, a”;
    std::istringstream in(s);
    try
    {
    parse(in, “STDIN”);
    }

    On linux I get

    Exception: parse error at file STDIN line 1 column 2
    ’1, 2 a’
    ^- here

    and on Windows I get

    Exception: parse error at file STDIN line 1 column 7
    ’1, 2, ‘
    ^- here

    • This is really strange. Are you sure that you recompiled your testcase correctly on Linux?

      Because it seems that the second ‘,’ character gets lost on Linux and this would be a very “untypical” kind of error (compared to losing the last character).

  5. Korval says:

    So, how do you combine this with qi::on_error? While you can print the line and even point to the character causing the offense, you can’t describe the actual problem in anything more than the most general terms: something went wrong exactly here.

    qi::on_error is supposed to be the way to produce more descriptive results. But how do you combine on_error with these stateful iterators?

    • I would say that the “>” operator describes exactly the situation which is reported – that something unexpected occured.

      However, you can derive from the iterator class and add new data members, so you could use a semantic action to store additional information like “parsed shopping list, now expecting weather forecast” which could be displayed by the on_error handler as it can access the iterators.

  6. Patrick says:

    Hi Peter,

    Thanks a lot for this nice example. When I try your third example code with the files you posted, everything works as decribed.
    If I introduce now an error before the first number in the file the code simple exits, without any message. Is this behaviour the normal one? And is there a way to catch this error with a meaningfull message aswell?

    Thanks a lot!

    System: Standart Ubuntu 10.04, 64 bit, boost-1.43.0, g++ 4.4.3

    • Hi Patrick,

      I think you found some bug in my example, because I simply removed the following code and replaced it by the catch(…):

        // error detection
        if (!r || fwd_begin != fwd_end)
          throw std::runtime_error("parse error in " + filename);
      

      This “error detection” detects if the parse failed because of a non-complete match. The expectation_failure-exception only happens if a forbidden backtracking occurred.

      So if you add something unparseable at the beginning, you will get a non-complete match, so the (forgotten) code above will throw a generic error.

      If you want to track the position of this problem, I think the easiest way would be to add an epsilon parser “eps_p” at the beginning of your grammar, and afterwards a non-backtracking operator “>”. Then at least the first part of the grammar will match any input and you will get an exception if something wrong is in the input at the beginning.

      However, you will also need the above code, in case something wrong is at the end of your input (after some successful parse).

      Please let me know if this solved your problem. :)

      Cheers,
      Peter

      • Patrick says:

        Hi Peter,

        That was, what I was looking for it works perfect for me!

        Thanks a lot for your fast response
        Patrick

  7. tb says:

    I tried to compile the naive_parsing.cpp and it goes through. The stream_iterator_errorposition_parsing.cpp does not compile with boost 1.44 and g++ 4.1.2.

    There’s a bunch of errors with instantiation of iterators.

    • Unfortunately I cannot test with 4.1.2 at the moment.

      However, I re-tested with boost 1.44 and g++ 4.3.1 and g++ 4.4.5 and clang++ and it succeeded.

      I tend to blame this on the compiler version (although of course such problems should not happen).

  8. maheshb.home says:

    Hi,
    I am beginner, trying to use the boost spirit library, I have a problem which is in some ways similar to tracking input position. I have gigabytes of data over which I run the parser and instead of copying the parsed data to strings I would like to store the start and end of the string to two char * ptrs in a struct. My iterator is a const char * iterator. How do I achieve that. I was trying to look for examples where we do not use std::string but did not find any.
    So I need to store my position at the time a rule is satisfied.

  9. dariomt says:

    I need to know where the failure is in the original sequence (apart from line and column)

    Is it possible to access the underlying iterator inside the position_iterator?

  10. Charlie Ye says:

    Hi, All:

    I have to tell you that I really like the old way in classic spirit for tracking pattern position. See the below code for SQL case statement:

    CRule& whenExpr = m_logic.GetRule();
    CRule& resultExpr = m_pParser->NumExpressionPattern->GetRule(); //m_logic.GetNumExpressionRule();
    CRule& elseExpr = m_pParser->NumExpressionPattern->GetRule(); //m_logic.GetNumExpressionRule();
    
    CStrAction	newCase(boost::bind(&CCaseExpressionPattern::NewCase, this, _1, _2));
    CStrAction	newWhen(boost::bind(&CCaseExpressionPattern::NewWhen, this, _1, _2));
    CStrAction	newElse(boost::bind(&CCaseExpressionPattern::NewElse, this, _1, _2));
    CStrAction	newEnd(boost::bind(&CCaseExpressionPattern::NewEnd, this, _1, _2));
    CStrAction	newCaseObjectName(boost::bind(&CCaseExpressionPattern::NewCaseObjectName, this, _1, _2));
    
    expression =lexeme_d[CASE[newCase] >> !(+space_p >> OBJECT_NAME[newCaseObjectName]) >> +space_p]
    			>> +(WHEN[newWhen] >> whenExpr >> THEN >> resultExpr)
    			>> !(ELSE[newElse] >> elseExpr)
    			>> END[newEnd];
    

    By this way, I can track all of positions of all of key words with classic spirit. How can I do the similar work with QI of the new spirit easily?

    • Vishal says:

      Did you find a way to do this? Even I have to port an existing grammar which keeps track of positions to qi and not understanding how to do the same.

  11. I’m working on parsing a binary piece of data, and need to keep track of positions in the data when I find something of interest. How does one get that byte offset? Can this iterator be used for that?

    • If you use an iterator into a std::string str, it is easy to get the location within the string by simply computing it - str.begin(). I would suspect that it is not too difficult to adapt a stream-based iterator to count the number of bytes it has advanced so far.

  12. James says:

    In this approach the the results reported have the skip parser replied. So the columns, lines, and the get_currentline() properties results are for the modified input with the skipped tokens removed. For my use case I need the skip functionality in the parser but I need the results of these properties to match up positional and contextually with the unmodified stream. Any suggestions?

    • James says:

      After much frustration I found out my original input iterator was skipping whitespace and causing the behavior I was seeing.

Leave a Reply

To use RetinaPost you must register at http://www.RetinaPost.com/register