BNFC takes as input a file written in Labelled BNF, which is essentially a BNF grammar with a label attached to each production rule. Brackets "[ ]" indicate lists rather than optional items, and there are few extensions from minimal BNF other than pragmas to define lexer tokens and positions.
Writing one grammar file instead of four separate modules is very attractive. Unfortunately BNFC has a serious limitation: only context-free grammars are supported.
LBNF is sufficient to represent context-free grammars, and within this domain the system works very well. But despite the claims of the LBNF Report, many interesting languages are not context-free, and they still need to be lexed and parsed. The most notable example is probably C, in which a string can be lexed as an identifier or a type name, depending on previous typedefs and other things. I present here a small set of extensions to enable support for passing state between the lexing and parsing stages, enabling a wider variety of parsers to be constructed from LBNF.
A darcs repository of BNFC, with state support as described here, is available from http://www.tiresiaspress.us/haskell/bnfc/. These features are only supported in the Haskell/Alex backend (Alex-3 is required).
An initial examination

The approach I took is founded on two simple ideas. First, the lexer needs access to a state table to determine the correct token to produce, and second, the parser needs to update the state table in response to certain rules.
Lexing rules

In LBNF, tokenizing rules are specified like this:

token Ident (alpha [alpha|digit|'_']*) ;

where "Ident" is the type of the token produced, and the remainder is a regular expression that determines valid "Ident" strings.
A stateful token rule is similar. The implementation supports multiple sets of state, so it's necessary to specify which state set the tokenizer should check:

state token TypedefName typedefNameSet [alpha|digit|'_']+ ;

First the lexer matches the regular expression against the input string. It then checks whether the matched string is present in "typedefNameSet". If so, a token of type "TypedefName" is produced; otherwise an "Ident" is produced (1).
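The lookup step can be sketched in Haskell. This is not BNFC's actual generated code; the names "classify" and the example set contents are illustrative, and the real lexer would thread the state set through Alex's user state.

```haskell
import qualified Data.Set as Set

data Token = TypedefName String | Ident String
  deriving (Eq, Show)

-- Hypothetical state-set lookup: strings present in the set lex as
-- TypedefName, everything else falls back to Ident.
classify :: Set.Set String -> String -> Token
classify typedefNameSet s
  | s `Set.member` typedefNameSet = TypedefName s
  | otherwise                     = Ident s

main :: IO ()
main = do
  let typedefNameSet = Set.fromList ["size_t", "FILE"]
  print (classify typedefNameSet "size_t")  -- TypedefName "size_t"
  print (classify typedefNameSet "x")       -- Ident "x"
```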
Parsing Rules

Standard production rules are extended with pragmas that specify state update annotations:

-- after this rule, the final Ident will be lexed as a TypedefName
Etypedef. Declaration ::= "typedef" TypeName Ident <% state typedefNameSet add 3 %> ;

The number "3" specifies that the token string derived from the 3rd item of the production (Ident) is to be added to the set "typedefNameSet" (2).
Similarly, names can be removed from a state set:

-- after this rule, the second token will be lexed as an Ident
UnTydeclarator. Direct_declarator ::= TypeName TypedefName <% state remove 2 %> ;

When removing a token from a set, it's unnecessary to specify the set the token is being removed from.
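The effect of these pragmas on a reduce can be sketched as follows. This is a simplification assuming a single state set; the RHS strings and set contents are hypothetical, not output of the real parser.

```haskell
import qualified Data.Set as Set

-- What <% state ... add N %> / <% state remove N %> do when a rule is
-- reduced: take the token string at RHS position N and update the set.
type StateSet = Set.Set String

stateAdd, stateRemove :: String -> StateSet -> StateSet
stateAdd    = Set.insert
stateRemove = Set.delete

main :: IO ()
main = do
      -- Reducing: Declaration ::= "typedef" TypeName Ident <% ... add 3 %>
  let rhs = ["typedef", "unsigned int", "MyInt"]  -- hypothetical matched items
      s1  = stateAdd (rhs !! (3 - 1)) Set.empty   -- "3" names the Ident
      s2  = stateRemove "MyInt" s1                -- e.g. a later <% state remove 2 %>
  print (Set.member "MyInt" s1)  -- True
  print (Set.member "MyInt" s2)  -- False
```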
Initial members of the state sets can be specified in the grammar via the pragma TODO.
Scoping

The implementation maintains a stack of state values in order to support lexical scoping of changes. Two additional pragmas, "state push" and "state pop", provide access to stack manipulations. A common usage would be:

PushParen. PushParen ::= "(" <% state push %> ;
PopParen.  PopParen  ::= ")" <% state pop %> ;

It is an error to pop the last frame off the state stack.
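The stack discipline can be sketched like this. It assumes that "state push" copies the current frame (so outer definitions remain visible inside the new scope) and that lookups consult only the top frame; the function names are illustrative, not BNFC's.

```haskell
import qualified Data.Set as Set

type Frame = Set.Set String
type Stack = [Frame]

-- "state push": duplicate the top frame to open a new scope.
push :: Stack -> Stack
push (f:fs) = f : f : fs
push []     = error "empty state stack"

-- "state pop": discard the top frame; popping the last frame is an error.
pop :: Stack -> Stack
pop (_:f:fs) = f : fs
pop _        = error "cannot pop the last frame off the state stack"

addTop :: String -> Stack -> Stack
addTop s (f:fs) = Set.insert s f : fs
addTop _ []     = error "empty state stack"

member :: String -> Stack -> Bool
member s (f:_) = Set.member s f
member _ []    = False

main :: IO ()
main = do
  let s0 = [Set.fromList ["size_t"]]
      s1 = addTop "local_t" (push s0)  -- inside "(" ... a new typedef
      s2 = pop s1                      -- after ")"
  print (member "local_t" s1)  -- True:  visible inside the scope
  print (member "local_t" s2)  -- False: discarded with the frame
  print (member "size_t"  s2)  -- True:  outer definitions survive
```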
The implementation only supports a single stack shared by all state sets. It is not currently possible to give different state sets different scoping rules.

A drawback of the current implementation is that production rules such as PushParen and PopParen are not terminals, so the implementation creates abstract-syntax representations for them, even though they convey no more meaning than if they were omitted entirely. I hope to address this in the future.
Deriving the state string
A major question is how to determine which token should be added to the state set for a given production. The cases above are simple. In a production rule like:
Etypedef. Declaration ::= "typedef" TypeName Ident <% state typedefNameSet add 3 %> ;
"Ident" is a token, so it's simple to derive the string that should be added to the set. However, it's frequently desirable to modify state based upon a non-terminal rule, such as:
Rule. SomeProd ::= Declaration <% state stateset add 1 %> ;
How should the token be determined from the Declaration?
The current implementation only supports state adjustments based upon tokens and lists of tokens, e.g. [Ident]. Code generated from rules like "Rule" above currently results in a compile-time error due to missing instances. The instances can be provided manually in a separate file, although creating this file is tedious. I plan to add a pragma to the LBNF grammar that would allow the necessary instances to be generated automatically.
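The missing-instance problem can be made concrete with a sketch. The class and constructor names here are illustrative (they are not BNFC's generated names): tokens and token lists have obvious instances, while a nonterminal like Declaration needs a hand-written one saying which part of the tree carries the name.

```haskell
newtype Ident = Ident String

-- Hypothetical class: extract the string(s) a state update should use.
class StateString a where
  stateStrings :: a -> [String]

instance StateString Ident where
  stateStrings (Ident s) = [s]

-- Lists of tokens, e.g. [Ident], come for free.
instance StateString a => StateString [a] where
  stateStrings = concatMap stateStrings

-- A hypothetical nonterminal: its instance must pick out the Ident.
data Declaration = Declaration String Ident  -- type name, new name

instance StateString Declaration where
  stateStrings (Declaration _ ident) = stateStrings ident

main :: IO ()
main = print (stateStrings [Declaration "unsigned" (Ident "uint")])
```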
Notes

(1) The astute reader will notice that the TypedefName regex matches strictly more identifiers than the built-in "Ident" regex. This match therefore provides a "backdoor" into Ident, allowing strings that don't match the "Ident" regex to actually be produced as "Ident"s. I find this behavior useful, and in any case it can be mitigated simply by using the Ident regex for state tokenizing rules. Another possibility would be to fall back to a user-provided alternative in case the match fails.
(2) An alternative notation would omit the numbers and place the pragma adjacent to the relevant production item.