Character Sets

Sparse bit vectors

In order to accomodate 16/32 and 64 bit characters, the chset class is internally implemented as a sparse bit/boolean set. The implementation uses a sorted vector of disjoint ranges (range_runs). The set is constructed from ranges such that adjacent or overlapping ranges are coalesced.

range_runs are very space-economical in situations where there are lots of ranges and a few individual disjoint values. Searching is O(log n) where n is the number of ranges.

In the near future, the implementation can be further optimized to use the standard library's bit vector whenever possible.

The character set chset matches a set of characters over a finite range bounded by the limits of its template parameter CharT. This class is an optimization of a parser that acts on a set of single characters. The template class is parameterized by the character type (CharT) and can work efficiently with 8, 16 and 32 and even 64 bit characters.

chsets are constructed from literals (e.g. 'x'), chlits, ranges, anychar and nothing (see primitives section above) or copy-constructed from another chset. The chset class uses a copy-on-write scheme that enables instances to be passed along easily by value.

Examples:

chset<> s1('x');
chset<> s2(anychar - s1);

Optionally, character sets may also be constructed using a definition string following a syntax that resembles posix style regular expression character sets, except that double quotes delimit the set elements instead of square brackets and there is no special negation '^' character.

Character set definition strings follow the meta-syntax:

range = anychar >> '-' >> anychar;
set = *(range | anychar);

Since we are defining the set using a string, the usual C/C++ literal string syntax rules apply. Examples:

chset<> s1("a-zA-Z");       // alphabetic characters
chset<> s2("0-9a-fA-F");    // hexadecimal characters
chset<> s3("actgACTG");     // DNA identifiers
chset<> s4("\xff\xfe");     // Hexadecimal 0xFF and 0xFE

The standard Spirit set operators apply (see operators section below) plus an additional character-set-specific inverse (negation) operator:

~a      // Set inverse
a | b   // Set union
a & b   // Set intersection
a - b   // Set difference
a ^ b   // Set xor

where operands a and b are both chsets or one of the operand is either a literal character, a chlit, a range, anychar or nothing. Special optimized overloads are provided for anychar or nothing operands. A nothing operand is converted to an empty set, while an anychar operand is converted to a set having elements of the full range of the character type used (e.g. 0-255 for unsigned 8 bit chars).

A special case is ~anychar which yields nothing, but ~nothing is illegal. Inversion of anychar is asymmetrical, a one-way trip comparable to converting T* to a void*.

An assignment and a copy construtor are provided to allow mixed conversion from a chset<A> to a chset<B>. This is possible when type A is convertible to type B. For example if type A is a char and type B is a wchar_t.