Parser Generator

The parser generator program pgen can be used to quickly develop a parser. It has quite a few built-in features (which may or may not be a good thing depending on what you're trying to do). Based on an input file, the program will create:

  • A lexical analyzer. The lexical analyzer tokenizes the input, ignores whitespace, automatically recognizes both C and C++ style comments, and understands #line directives from a C pre-processor.
  • A data structure for a parse tree, and the parser that generates the parse tree.
  • Code that traverses the parse tree.

The input to the program specifies the user code that is executed when traversing the parse tree.

The parser generates error messages automatically when it cannot completely parse the input. These messages include the file name, line number, and column number where the parser failed. This file, line, column information is also preserved in the parse tree and can be accessed when traversing the parse tree (e.g. to generate error messages during semantic analysis of the parsed input).

Two mechanisms can be used to communicate information during a parse tree traversal. The first is an implicit argument that is always available in the special variable $0. This variable could (for instance) be a symbol table, or some other structure that is being created, and can be used to propagate information downward during a parse tree traversal. The second mechanism is the value returned after a specific BNF item has been traversed. This is best illustrated by means of a simple example.

  %type[W] {{ int double }};

  body: bitem 
      {{W: return NULL; }}
      ;

  bitem: ID
     {{W:
       printf ("Call me [%s]\n", $1);
       return NULL;
     }}
     | INT
     {{W:
       printf ("value: %d\n", $1);
       return NULL;
     }}
     ;

The first line

 %type[W] {{ int double }} ;

specifies that there is a walk defined named W, and there are two types that it uses. The first, int, is the type that is passed top-down when walking the parse tree. The second, double, is the default return type for any of the BNF items specified in the grammar. The parser generator implicitly uses an int * and a double * as the two types respectively.

The rest of the file specifies the BNF and the actions to be performed on parse tree traversal. The BNF itself is quite simple. Note that INT and ID have not been defined in the file. This is because pgen has a few built-in base BNF items. Note also how the code uses the label W: to indicate that the walk type W is being defined.

The rest of the code should be self-explanatory. The syntax $1 is used to specify the value of the first item on the RHS of the BNF.

BNF Syntax

In addition to the base BNF items that are built-in to pgen, there are a number of short cuts that can be useful when writing complicated grammars. The two major ones are illustrated below:

  body: { bitem ";" }** 
      ;

  bitem: ID [ "[" INT "]" ]
       | INT
       ;

The first line specifies that body is a semi-colon separated list of bitems. A bitem is either an INT, or an ID followed by an optional “[” INT “]”. Note that strings are used to specify keywords, and they are automatically turned into tokens by the lexical analyzer that is auto-generated.

The user must write wrapper functions that are used to support the extended BNF syntax.

Parse Tree Traversal

The body of the code that is executed during parse tree traversal can be inserted in a number of locations. The following code illustrates this:

   bitem: ID
      {{W: printf ("I am here!\n"); }}
          opt_array
      {{W: printf ("After optional array!\n");
           return NULL; 
      }}
       ;

   opt_array: [ "[" INT "]" ] 
      {{W: printf ("in opt_array!\n"); return NULL; }}
      ;

The first print statement is called before the recursive traversal of opt_array.

Return Values

Return values can be specified as follows:

 bitem[int]: ID 
       {{W: return 0; }}
           | INT
       {{W: return 1; }}
       ;

In this example, the value of bitem will be an integer, and its value indicates whether it was an ID or INT.

More generally, the return value can be any type name or a pointer to a type name. This means that you may have to typedef structures and other types so that the return value has the right format.

For every return value, the appropriate wrapper function must also be defined. These wrapper functions have names that are of the form prefix_wrap_W_type for a type, and prefix_wrap_W_type_p for a pointer to a type.

Helper Functions and Error Reporting

The parser automatically generates errors with appropriate line and column number information when parsing fails. The errors point to a line and column number, file name, and a description that is automatically generated from the grammar itself.

However, additional errors might be detected when walking the parse tree (e.g. during semantic analysis). It is usually very helpful to report these errors along with the location of the item in the input file that corresponded to the element in the parse tree. To help with this, pgen supports a few different built-ins that use the $ syntax.

  • $E can be used like printf, except it appends ERROR: as well as the location of the error in the message that it produces on stderr. It also exits.
  • $W can be used like $E, except it appends WARNING: instead of ERROR: and it does not exit.
  • $e is like $E, but it does not exit. This is used to support printing multi-line error messages with multiple fprintf statements (see $f) but that begin with a single ERROR: header.
  • $f is the file pointer to which the errors/warnings are sent. This is stderr (but could change in the future).
  • $A(argument) is an assertion.
  • $$ can be used to include a literal $ if you need it.
  • $l is the current line number.
  • $c is the current column number.
  • $n is the name of the current file (char *).

For instance, including

   $E("Type error, expected int, found bool");

would create an output of the form:

ERROR: File `input.in', line 16, col 7
      Type error, expected int, found bool

External Parsers and Walkers