The Spirit recursive descent parser compiler

Copyright © 1999, 2000, Joel de Guzman

InterXys. All rights reserved.

 

Introduction:

 

Knowledge of parsing is a must for every programmer, not only for compiler writers and computer language designers. A good parser tool is invaluable and applicable to many aspects of software development. Certainly, the most obvious application is creating a parser for a computer language as the front end to a compiler or interpreter. Beyond that, software development involves a great deal of parsing. Most of the time, sometimes without realizing it, we are coding a parser by hand. Whenever we invent a file format, we unknowingly create a parser for it. File I/O, string searching, TCP/IP, HTTP, FTP, HTML, SGML, XML: almost all Internet protocols require some degree of parsing. SGML and XML are special because they require building the parser dynamically, on the fly, from a Document Type Declaration (DTD), which is essentially a grammar defining the underlying document structure. Even the trivial conversion of a string to a number involves parsing.

 

Granted, most of the time we use some tool or library that does the dirty work for us. Yet time and again we find ourselves in situations where no existing tool is sufficient for our needs. Yes, there are tools out there: YACC and LEX come to mind. But these are specialized, full-blown tools geared towards computer language development. At the other end of the spectrum, we have simple regular-expression tools such as regex, geared towards simple string matching and searching. What we need is a parsing tool that falls somewhere in between.

 

The goals are:

 

1.      It should be easy to use, even for those who are not computer scientists or computer language specialists. This requirement abides by the goals of Phoenix, a complete class library and framework of which the Spirit parser compiler is just a part. The mantra we adhere to is: “Less is more”.

 

2.      It should have a broad range: capable of simple regular-expression-type string searches as well as of creating full computer languages. The motivating factor here is the author’s interest in programming-language compilers, interpreters and translators. Also, as part of the broader Phoenix framework, it must be generic and applicable to a broad range of uses.

 

3.      It should be library-based and linkable with a client program. Spirit is just one part of a generic framework.

 

4.      It should be dynamic, with no need for an additional compilation step: the generated parser must be immediately runnable after a grammar is fed in. The author’s work on XML necessitated that Spirit be dynamically extensible, and he foresees more uses (such as the XML case) for dynamic parser compilers in the future. Traditional compiler-compilers such as YACC, which compile to an underlying language such as C, are simply inappropriate in such cases.

 

5.      It should be memory-based as opposed to file-based: the parser acts on in-memory strings. Helper tools will be supplied for working on files or streams. Partly due to goal 2, it should be usable on strings and raw data bytes alike.

 

Memory space is not a concern on current computer architectures, especially with virtual memory. It is quite convenient in many cases to load a source file (typically small to begin with) completely into memory. As an aside, reading input from memory is many times faster than reading from files, and the parser can easily look ahead up to the end of the source.

 

The Spirit parser compiler presented here fits all of the enumerated goals. It has been influenced and inspired by previous work on parsers and regular expressions by 1) Dr. Adrian Johnstone (RDP), 2) Dr. Terence Parr (PCCTS), 3) Dr. Quinn Tyler Jackson (LPM) and 4) Dr. John Maddock (Regex++).

 

The Parser Proper:

 

The C++ recursive descent parser compiler dynamically compiles Extended BNF (EBNF) production rules into a working parser. The parser acts at the character level and thus obviates the need for a separate lexical analyzer stage. Extra care and attention was given to keeping the syntax uncluttered; the syntax is totally separated from the semantics.

 

The complete specification of the parser follows the meta-language:

           

grammar         ::= ( identifier '::=' expr ';' )*;

expr            ::= expr1 ( ('|' | '%') expr1 )*;

expr1           ::= expr2 ( expr2 )*;

expr2           ::= expr3 { ':' identifier } { '*' | '+' };

 

expr3           ::=  '.'

                |    set

                |    string

                |    identifier

                |    group

                |    optional

                |    contiguous;

 

group           ::= '(' expr ')';

optional        ::= '{' expr '}';

contiguous      ::= '<' expr '>';

 

set             ::= { '~' } '[' setItems ']';

setItems        ::= ( setChar {'-' setChar } )*;

setChar         ::= ~[\0\]];

 

string          ::= '\'' <stringChar>* '\'';

stringChar      ::= ~[\0'];

          

identifier      ::= <[a-zA-Z_][a-zA-Z_0-9]*>;

          

whiteSpace      ::= ([ \t\n\r] | comment)*;

comment         ::= ('//'~[\0\n\r]*) | ('/*' .* '*/');
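As a small worked example (not part of the original specification), here is how a grammar for a comma-separated list of signed integers might look in this meta-language; the angle brackets keep the characters of each integer contiguous:

```
intList         ::= integer (',' integer)*;
integer         ::= <{'+' | '-'}[0-9]+>;
```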

 

 

Where:

 

    []              A regular-expression style set. Example: [a-z].

           

    .               Matches any character except the null.

 

    x-y             Matches all characters from x to y, inclusive.

 

    string          A Pascal-style string. Example: 'hello world'.

 

    identifier      A case-sensitive C/C++ style identifier. Example: foo_123.

 

    : identifier    Attaches a semantic action, where 'identifier' names the action. Named semantic actions can be attached at any level in the grammar. These actions are C/C++ hook functions that are called immediately when a match is found.

           

    *               Kleene star operator. Matches the preceding expression zero or more times. For example:

 

The expression 'A'*

Matches “A”, “AAA” or “”.

 

    +               Plus operator. Matches the preceding expression one or more times. For example:

 

The expression 'A'+

Matches “A”, “AAA” but not “”.

 

    {}              Optional construct. Matches the enclosed expression zero or one time. For example:

 

The expression 'A' {'B'}

Matches “AB” or “A”.

 

    |               (Leftmost) alternative operator. Matches either of the operands. Ambiguous alternatives are short-circuit evaluated; the leftmost alternative that matches wins.

 

The expression 'A' | 'B'

Matches  “A” or “B”.

 

    %               Longest-alternative operator. Similar to the '|' operator: matches either of the operands, but alternatives are NOT short-circuit evaluated and the longest alternative that matches wins.

 

    ()              Grouping construct. Orders the evaluation of sub-expressions. For example:

 

The expression ('A' | 'B') 'C' evaluates ('A' | 'B') first.

 

Whereas the expression 'A' | 'B' 'C' evaluates 'B' 'C' first, due to the implicit precedence of operators defined in the grammar: sequencing binds more tightly than alternation, so the expression is equivalent to 'A' | ('B' 'C').

 

    <>              Contiguous construct. Inhibits white-space (and comment) skipping. By default, white space and comments are skipped; enclosing an expression in angle brackets inhibits this. Useful in situations where we want to work at the character level instead of the phrase level.

 

For instance, in parsing numbers such as integers, we enclose the expression to make it contiguous:

 

The expression <{'+' | '-'}[0-9]+>;

matches “123” or “456” but not “1 2 3”.

           

White spaces are not totally invisible: they are still parsed by the same phrase/character-level parser. Semantic actions can be associated with white space, for example to recognize pragmas.

 

 

The back-slash can be used as an escape character in sets and strings: '\' followed by a character matches that character literally, unless it is one of the following special escapes:

 

                        '\n'        new-line NL (LF)

                        '\t'         horizontal tab HT

                        '\r'        carriage return CR

                        '\v'        vertical tab VT

                        '\f'         form feed FF

                        '\a'        alert BEL

                        '\b'        backspace BS

                        '\oooo'  octal number

                        '\xhhh'   hex number

 

           

An EBNF source defining a target language is parsed by the parser compiler, which outputs a fully working parser for that language. The generated parser is a hierarchical structure composed of Match objects. Given a source written in the target language, the top-down descent traverses the hierarchy, checking each object for a match and backtracking to check all possible alternatives. Object aggregation allows access to any of the steps in the recursive descent. In addition, named semantic actions may be attached at any level within the hierarchy. These actions are C/C++ functions that are called when a match is found in a particular context within the grammar.

 

The parser is non-deterministic in nature, allows backtracking and back-referencing, and is capable of infinite look-ahead, LL(inf). It can parse rather ambiguous grammars such as:

 

expr     ::= 'A'  'B'  'C'  'D'  'E'  |  'A'  'B'  'C'  'D'  'F';

           

Left recursion is not suitable for recursive descent parsers in general, and the Spirit parser is no exception. Left recursion is an error and is reported as such; otherwise it would result in an infinite loop when the offending rule is invoked. Example:

 

            a ::= a 'A';        // Error: left recursion
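A left-recursive rule can usually be recast with iteration instead. As a sketch (this rewrite is not from the original text), a left-recursive list with a base case and its iterative equivalent:

```
a ::= a 'A' | 'A';      // error: left recursive
a ::= 'A'+;             // equivalent iterative form
```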

 

Ambiguous alternatives are short-circuit evaluated: the leftmost alternative that matches wins. If this is not desired, the longest-alternative operator '%' may be used; in this case ambiguous alternatives are NOT short-circuit evaluated and the longest alternative that matches wins. Consider the parsing of integers and real numbers:

 

            number  ::= real % integer;

            integer ::= < {'+' | '-'} [0-9]+ >;

            real    ::= < integer { '.' integer } { ('e' | 'E') integer } >;

 

A number can be a 'real' or an 'integer'. Given the input:

 

“12.36e-6”

 

This will match a 'real', since that is the alternative that matches a longer part of the input string. While the '%' operator is called the longest-alternative operator, the standard '|' operator may be called the leftmost-alternative operator.
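The difference between the two operators can be sketched in plain C++. The helper functions below are hypothetical stand-ins (they are not Spirit code); each returns the number of characters matched at the front of the input, or 0 on failure. The alternatives are deliberately listed with 'integer' first, so that the two strategies disagree:

```cpp
#include <cstddef>
#include <string>

// Matches { '+' | '-' } [0-9]+ ; returns length matched, 0 on failure.
std::size_t MatchInteger(const std::string& src) {
    std::size_t i = 0;
    if (i < src.size() && (src[i] == '+' || src[i] == '-')) ++i;
    std::size_t start = i;
    while (i < src.size() && src[i] >= '0' && src[i] <= '9') ++i;
    return i > start ? i : 0;
}

// Matches integer { '.' integer } { ('e' | 'E') integer }.
std::size_t MatchReal(const std::string& src) {
    std::size_t i = MatchInteger(src);
    if (i == 0) return 0;
    if (i < src.size() && src[i] == '.') {
        std::size_t frac = MatchInteger(src.substr(i + 1));
        if (frac) i += 1 + frac;
    }
    if (i < src.size() && (src[i] == 'e' || src[i] == 'E')) {
        std::size_t exp = MatchInteger(src.substr(i + 1));
        if (exp) i += 1 + exp;
    }
    return i;
}

// Leftmost ('|'): alternatives are short-circuit evaluated; with
// 'integer' listed first it wins even though 'real' would match more.
std::size_t LeftmostIntegerOrReal(const std::string& src) {
    std::size_t n = MatchInteger(src);
    return n ? n : MatchReal(src);
}

// Longest ('%'): every alternative is tried and the longest match wins.
std::size_t LongestIntegerOrReal(const std::string& src) {
    std::size_t a = MatchInteger(src);
    std::size_t b = MatchReal(src);
    return b > a ? b : a;
}
```

On the input "12.36e-6", the leftmost strategy stops after matching the integer "12", while the longest strategy matches the entire real number.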

 

The order of rules within a grammar does not matter; rules that reference other rules are resolved just in time, when needed. Redefinition of a rule is not allowed and results in an error.

 

Match Objects:

 

These seemingly simple classes, when composed and structured in a hierarchy, form a very powerful object-oriented recursive descent parsing engine. They provide the infrastructure needed for the construction of parsers.

           

The power of this technique lies in its ability to form rather complex hierarchical parser structures dynamically at runtime, unlike function-based parsers that require compilation to object code. The hierarchical composition of objects, as opposed to the hierarchical composition of functions, defines the top-down structure of the parser. Typical compiler-compilers or parser generators emit code, usually C/C++, that has to be compiled.

 

The Match class declaration (Private members are omitted for brevity):

 

class Match : public Allocator {

public:
                     Match();
     virtual         ~Match();
     virtual bool    Parse(Scanner& str) const = 0;
};

 

As stated, the parser compiler generates a hierarchical structure composed of Match objects; this object hierarchy does the actual parsing. The base Match class has one pure virtual Parse member function that accepts a reference to a Scanner object. The function returns true if a match is found at the current scanner position, in which case the scanner is advanced; otherwise the scanner is left untouched. A match is found if a portion of the string starting at the current scanner position matches a certain pattern or algorithm.
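To make the contract concrete, here is a minimal self-contained sketch. The Scanner and CharMatch classes below are simplified stand-ins, not Spirit's actual declarations (no Allocator base, no white-space handling, plain char instead of Char8); they only illustrate the advance-on-success, untouched-on-failure rule:

```cpp
// Simplified stand-in for Spirit's Scanner: a string plus a position.
class Scanner {
public:
    explicit Scanner(char const* src) : src_(src), pos_(0) {}
    char     Get() const { return src_[pos_]; }
    void     Advance()   { ++pos_; }
    unsigned Pos() const { return pos_; }
private:
    char const* src_;
    unsigned    pos_;
};

class Match {
public:
    virtual ~Match() {}
    // true and scanner advanced on a match; scanner untouched otherwise.
    virtual bool Parse(Scanner& scan) const = 0;
};

// Matches one given character -- the simplest possible Match subclass.
class CharMatch : public Match {
public:
    explicit CharMatch(char ch) : ch_(ch) {}
    virtual bool Parse(Scanner& scan) const {
        if (scan.Get() != ch_)
            return false;        // no match: scanner left untouched
        scan.Advance();          // match: scanner advanced one character
        return true;
    }
private:
    char ch_;
};
```

Because a failed Parse leaves the scanner where it was, the caller is free to try another alternative from the same position, which is what makes backtracking safe.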

 

The flexibility of object embedding and composition, combined with recursion, opens up a unique approach to parsing. Subclasses are free to form aggregates and algorithms of arbitrary complexity. Complex parsers can be created by composing only a few simple classes.

 

Examples of Match Sub-classes:

 

StringMatch         Match subclass. Matches strings of the form ‘string’.

UnaryMatch          Match subclass. Abstract. Has one child (subject).

BinaryMatch         Match subclass. Abstract. Has two children (left and right).

AndMatch            BinaryMatch subclass. Matches expressions of the form: A B

OrMatch             BinaryMatch subclass. Matches expressions of the form: A | B

ZeroOrOneMatch      UnaryMatch subclass. Matches expressions of the form: { A }

 

And the corresponding class declarations (Again, private members are omitted for brevity):

 

class StringMatch : public Match {

//   Matches strings of the form ‘string’

public:
                     StringMatch(
                         Char8 const*    str,
                         UInt            len);

     virtual         ~StringMatch();
     virtual bool    Parse(Scanner& str) const;
     Char8 const*    String() const;
};

 

 

class UnaryMatch : public Match {

public:
                     UnaryMatch(Match const* subject);
     virtual         ~UnaryMatch();
     Match const*    Subject() const;
};

 

 

class BinaryMatch : public Match {

public:
                     BinaryMatch(
                         Match const* left,
                         Match const* right);

     virtual         ~BinaryMatch();
     Match const*    Left() const;
     Match const*    Right() const;
};

 

 

class AndMatch : public BinaryMatch {

//   Matches expressions of the form: A B

public:
                     AndMatch(
                         Match const* left,
                         Match const* right);

     virtual bool    Parse(Scanner& str) const;
};

 

 

class OrMatch : public BinaryMatch {

//   Matches expressions of the form: A | B

public:
                     OrMatch(
                         Match const* left,
                         Match const* right);

     virtual bool    Parse(Scanner& str) const;
};

 

 

class ZeroOrOneMatch : public UnaryMatch {

//   Matches expressions of the form: { A }

public:
                     ZeroOrOneMatch(Match const* subject);
     virtual bool    Parse(Scanner& str) const;
};

 

 

As a simple illustration, given these Match classes, the production prod ::= {'A' | 'B'} 'C'; generates a parser composed of a hierarchical structure with Match objects as its basic building blocks. This hierarchical structure corresponds neatly to the parse tree (AST, or abstract syntax tree) of the given grammar:

 

 

prod ::= {'A' | 'B'} 'C';

                          AndMatch
                         /        \
             ZeroOrOneMatch      StringMatch('C')
                    |
                 OrMatch
                /       \
     StringMatch('A')   StringMatch('B')

The AndMatch object at the top of the hierarchy has two children: a ZeroOrOneMatch object and a StringMatch('C') object. The ZeroOrOneMatch object has one child, an OrMatch, which in turn has two children: StringMatch('A') and StringMatch('B').

 

Invoking the Parse method of the production named 'prod' activates the parser. A more detailed illustration of what happens when a generated parser is at work is given later. For now, suffice it to say that this is a fully working parser that accepts source code and returns true if the source conforms to the grammar prod ::= {'A' | 'B'} 'C'; and false in case of a syntax error.
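The whole arrangement can be exercised end to end with a runnable sketch. The classes below are simplified stand-ins for Spirit's (no Allocator base, no white-space skipping, no memory management), but the composition and backtracking behavior mirror the hierarchy just described:

```cpp
class Scanner {
public:
    explicit Scanner(char const* src) : src_(src), pos_(0) {}
    char     Get() const         { return src_[pos_]; }
    void     Advance()           { ++pos_; }
    unsigned Pos() const         { return pos_; }
    void     Reset(unsigned pos) { pos_ = pos; }
private:
    char const* src_;
    unsigned    pos_;
};

class Match {
public:
    virtual ~Match() {}
    virtual bool Parse(Scanner& scan) const = 0;
};

class StringMatch : public Match {      // matches a literal string
public:
    explicit StringMatch(char const* str) : str_(str) {}
    virtual bool Parse(Scanner& scan) const {
        unsigned const save = scan.Pos();
        for (char const* p = str_; *p; ++p, scan.Advance())
            if (scan.Get() != *p) { scan.Reset(save); return false; }
        return true;
    }
private:
    char const* str_;
};

class AndMatch : public Match {         // A B: both must match in sequence
public:
    AndMatch(Match const* left, Match const* right) : l_(left), r_(right) {}
    virtual bool Parse(Scanner& scan) const {
        unsigned const save = scan.Pos();
        if (l_->Parse(scan) && r_->Parse(scan)) return true;
        scan.Reset(save);               // backtrack if either side fails
        return false;
    }
private:
    Match const* l_; Match const* r_;
};

class OrMatch : public Match {          // A | B: leftmost alternative wins
public:
    OrMatch(Match const* left, Match const* right) : l_(left), r_(right) {}
    virtual bool Parse(Scanner& scan) const {
        return l_->Parse(scan) || r_->Parse(scan);
    }
private:
    Match const* l_; Match const* r_;
};

class ZeroOrOneMatch : public Match {   // { A }: optional, always succeeds
public:
    explicit ZeroOrOneMatch(Match const* subject) : s_(subject) {}
    virtual bool Parse(Scanner& scan) const {
        s_->Parse(scan);   // a failed attempt leaves the scanner untouched
        return true;
    }
private:
    Match const* s_;
};

// Assemble the hierarchy for prod ::= {'A' | 'B'} 'C'; and run it.
bool ParseProd(char const* src) {
    StringMatch a("A"), b("B"), c("C");
    OrMatch aOrB(&a, &b);
    ZeroOrOneMatch opt(&aOrB);
    AndMatch prod(&opt, &c);
    Scanner scan(src);
    return prod.Parse(scan);
}
```

ParseProd accepts "AC", "BC" and "C", and rejects, for example, "AB": the optional part consumes the 'A', StringMatch('C') then fails on 'B', and the AndMatch backtracks and reports failure.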

 

The Scanner Object:

 

Parsing comments (and everything else considered white space, including pragmas and the like) with a BNF parser is a bit tricky. The usual practice is to have a pre-processor or a separate lexical analyzer that suppresses comments, making them invisible to the parser.

 

The scanner presented here handles this task. The scanner is a simple yet integral part of the parser; it works in conjunction with the parser and is not a separate program. It is not a full-blown lexical analyzer: it does not extract tokens such as reserved words and operators, nor does it extract numbers and literal strings.

           

A production named 'whiteSpace' may be defined; this is the rule used to skip comments and white space during the parse. The scanner invokes this rule when tasked by the parser to scan the next character from the source. If the production 'whiteSpace' is not declared, it defaults to whiteSpace ::= []; in which case white space is not skipped.

 

For instance, if we want to skip spaces, tabs, carriage returns and newlines, we may declare the production whiteSpace as:

 

whiteSpace ::= [ \t\n\r]*;

 

As mentioned earlier, white space is parsed by the same phrase/character-level parser. This implies that semantic actions may also be associated with white space, for example to recognize pragmas:

 

whiteSpace ::=

(    [ \t\n\r]

|    comment

|    pragma:doPragma

)*;    

 

The scanner uses the same Match objects that do the actual parsing and, subsequently, the skipping. The scanner is highly optimized: it uses a deferred comment/white-space skipping algorithm to minimize scanning overhead and eliminate redundant skipping when parsing alternatives.

 

By default, white space and comments are skipped. The angle brackets ‘<’ and ‘>’ enclosing an expression may be used to inhibit white-space and comment skipping. This can be useful in situations where we want to work at the character level instead of the phrase level.

 

Semantic Actions:

 

Named semantic actions can be attached at any level in the grammar. These actions are C/C++ functions that are called immediately when a match is found in the particular context where the action identifier is attached. These functions serve as hooks into the parser and may be used to:

 

1.      Perform additional syntax analysis that is otherwise awkward or impossible to do within the confines of the top-down parser.

2.      Generate output from the parser (ASTs, for example).

3.      Report warnings or errors.

4.      Implement symbol tables.

 

The function should have the following interface:

 

bool Action(Client* client, Char8 const* str, UInt len);

 

Where 'client' is a pointer to a user-defined type, ‘str’ is a pointer to the string that matched the expression and ‘len’ is the length of the matched string. The function may do further syntax analysis, and if for some reason it finds the syntax pointed to by ‘str’ erroneous or unacceptable, it may un-commit the parse by returning false. Otherwise, if all is well, the function should return true.
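A minimal sketch of such an action function follows. Char8 and UInt are assumed here to be typedefs for char and unsigned int, and the Client type is a hypothetical user-defined class; none of these bodies are Spirit's own code:

```cpp
#include <string>

// Assumed typedefs (not shown in the original text).
typedef char         Char8;
typedef unsigned int UInt;

// Hypothetical user-defined client type: remembers the last matched text.
struct Client {
    std::string lastMatch;
};

// Called by the parser when a match is found in this action's context.
// Returning false would un-commit the parse; returning true accepts it.
bool Action(Client* client, Char8 const* str, UInt len) {
    client->lastMatch.assign(str, len);
    return true;
}
```

When the parser matches, say, a 'C' in an action's context, it would invoke the hook as Action(&client, "C", 1).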

 

We can modify our previous example by attaching a semantic action to the context of StringMatch(‘C’):

 

{'A' | 'B'} 'C':action

 

Thus, whenever the parser encounters a ‘C’ in the source, the function associated with ‘action’ is called. Uninterestingly, in this case, each time the function is invoked the parameters ‘str’ and ‘len’ will be “C” and 1, respectively.

 

If we attach the ‘action’ such that:

 

{'A' | 'B'}:action 'C'

 

It gets a bit more interesting: the function associated with ‘action’ may now be called with either “A” or “B” as the parameter for ‘str’, depending on what is found in the source fed into the parser.

 

The simplicity of the semantic-action syntax helps keep the source grammar clear and concise. It is the author’s view that the declarative nature of EBNF syntax should not be cluttered with imperative statements, as is evident in many compiler-compilers. We keep the declaration separate from the implementation.