lambdaYacc Manual

The following page is meant to be a guide for specifying a grammar, generating the corresponding parser, and using the parser in LambdaYacc.


Introduction

The basic steps for creating a parser with LambdaYacc are:
  1. Create a .sig and .mod file specifying the grammar of the language.
  2. Generate the parser with the command lyacc <grammar> <parser>.
  3. Run the parser with tjsim <name of parser>
  4. Parse a file or a line
One can also combine parsers.

Specifying a Grammar

Specifying a grammar has eight parts:
  1. The typing declarations for types in the grammar (located in the .sig file)
  2. Information for the tokenzier
  3. Lists of the terminal and non-terminal grammar symbols
  4. printname predicates
  5. The rules for the grammar
  6. Precedence and associativity predicates
  7. freshcopy clauses
  8. Other predicates for running or using the parser
We will detail each part.

Declarations in all grammars

No matter what grammar you are specifying, your signature and module must accumulate lambdayacc. Therefore, accum_sig lyaccshares,blists. and accum_sig lambdayacc. accumulate lambdayacc. should be in your .sig file and .mod file, respectively.

Typing Declarations

It is necessary to put the typing declarations for all terminals and non-terminals in your .sig file. Terminal symbols all be of the type gs. Non-terminals all have the type <something> -> gs, where something is a kind that you have defined. For instance, the Lambda Prolog parser has the following declarations in lpgram.sig for kinds and types:

kind typ type.
kind knd type.
type typet, kindt, arrowt gs.
type formt, intt, stringt, realt, listt gs.
type akind (knd -> gs).
type atype (typ -> gs).

The typet, kindt, arrowt, formt, etc., are the terminals representing the keywords for types and kinds such as "type", "kind", "->", and "o". akind and atype are the non-terminals representing kinds and types in the language. We will see how they are used in later sections.

Information for the Tokenizer

Two parts are needed for the tokenizer: what characters are valid for tokens and what characters represent comments. The former uses a predicate of the form tokchar A :- ... One can specify the characters manually in combination with a set of pre-defined groups:
letter AAll upper and lower case letters of the alphabet, plus the underscore (_) character
letnum AThe same as letter A, with the addition of the decimal digits 0-9
alltypAll printable characters, including punctuation marks

Note that even when using letnum A or alltyp A, the first character of a token cannot be a number. The tokenizer must recognize tokens that start with a number as an integer.

Examples of the tokchar predicate would be:

All letters and numberstokchar A :- letnum A.
Lambda Prolog's tokens, which is all characters with the exception of ",", ";", "(", ")", "[", "]", ":", etc. tokchar S :- alltyp S, C is string_to_int S, not (C = 46; C = 44; C = 41; C = 40; C = 59; C = 91; C = 93; C = 124; C = 58; C = 92; C = 34; C = 126).

In the second example, we use the ASCII character codes for the characters. It would also be possible to use the characters themselves as a string.

The second part of information for the tokenizer is a list of strings with which single-line comments can start. The tokenizer already supports block comments that start and end with /* */. This predicate takes the form comment_list <List of strings>. The strings can be up to two characters long. The comment predicate for Lambda Prolog looks like:

comment_list ["%"].

For Twelf, which supports comments that start with a "#", or a "%" followed by a space, the predicate would be

comment_list ["#","% "].

Lists of terminal and non-terminal grammar symbols

Every grammar symbol must be included in one of two lists, indicating whether it is a terminal or non-terminal. The two lists are specified by

Terminalsterminal_list <list of terminals>
Non-terminalsnonterm_list <list of non-terminals>

See the Lambda Prolog Grammar for an example. Note that you must include iconst, id, and sconst in order to have integers, strings, and strings in quotations marks recognized.

Update: The non-terminals are now specified by the arity of the term, up to an arity of 4. Note that if you are using an older grammar where the non-terminals are in a list with arguments, you can still use them, however the system will not be able to generate freshcopy clauses for you.

printname Predicates

The printname predicates tell the tokenizer what strings correspond to tokens in a file being parsed. They can include letters, numbers, or punctuation symbols. The predicates are of the form printname _ <string> <token>.

Here are a few examples taken from the Lambda Prolog Grammar.

printname _ "module" modulet.
printname _ "=>" impt.
printname _ "+" addt.

The tokens should all be terminal grammar symbols declared in the .sig file that appear in terminal_list.

Grammar Rules

The grammar rules are the valid productions in the syntax of the language. First of all, one must specify the starting symbol of the grammar. This is done with the start_symbol <symbol>. predicate, where <symbol> is one of the non-terminal symbols of the language.

The productions for the language are specified as a list, cfg <rules>., where a rule takes the form

rule (<Lefthand side> ==> <Righthand side>) (<Action>)

The <Lefthand side> is a non-terminal symbol in the language. The <Righthand side> is a list of terminal and non-terminal symbols in the language. The <Action> is Lambda Prolog code indicating what should be done for this rule. Let us look at a few examples.

The first example is taken from the "on-line" calculator. The rule for addition would be

rule ((ae R4) ==> [(ae A4),plust,(ae B4)]) (R4 is (A4 + B4))

This rule says that a term that has two arithmetic expressions (i.e., integers) with a plust symbol ("+") in the middle should be added together and returned as the arithmetic expression which is their sum.

A more complicate example would be the rule for a Lambda Prolog signature file, which includes a header, preamble, and sign declarations:

rule ((signaturedef SIGNATURE1) ==> [sigheader SH1,sigpreamble SP1,signdecls SD1]) (SIGNATURE1 = definesig SH1 SP1 SD1)

Each of the non-terminals in the righthand side of the rule has its own rule(s). This rule says that the information from those non-terminals should be combined with the predicate definesig and returned as a signaturedef.

Precedence and associativity predicates

The precedence and associativity predicates are used to help resolve shift-reduce conflicts. Both binary infix operators and unary operators have corresponding predicates. Declaring precedence and associativity for a binary operator takes the form binaryop <token> <left argument> <right argument> <assoc> <prec>., where <left argument> and <right argument> are tokens corresponding to the arguments to the operator; <assoc> is "left", "right", or "none"; and <prec> is an integer specifying the precedence level. For example, the binaryop predicate for addition in the "on-line" calculator is

binaryop plust (ae X) (ae Y) "left" 3.

The binaryop declaration for implication in the Lambda Prolog Grammar is

binaryop impt (term A) (term B) "right" 130.

Precedence for a unary operator is very similar, declared with the predicate unaryop <token> <argument> <prec>.

Another class of operators that may need precedence and associativity predicates are the implicit operators, e.g., function application. The predicate is just like the one for binary operators: implicitop <token> <left argument> <right argument> <assoc> <prec>.

freshcopy predicates
NOTE: This part is no longer needed, since LambdaYacc can generate the freshcopy clauses for you.

The freshcopy predicates are necessary in order to avoid binding terms in the non-terminals. There must be a freshcopy predicate for each non-terminal grammar symbol that has an argument. For instance, for the atype non-terminal from the Lambda Prolog Grammar is

freshcopy (atype A) (atype B) :- !.

The variable arguments to the non-terminals MUST be different. Additionally, each must have a cut. The final freshcopy predicate should always be freshcopy X X. This will catch all terminals and non-terminals without arguments.

Eventually, the LambdaYacc will be fixed to generate these predicates automatically from the list of non-terminals.

Other predicates for running or using the parser

In the course of defining the grammar, you may need to define other predicates. For instance, the Lambda Prolog Grammar required predicates to form lambda abstractions and to parse and find infix declarations in a module. You can put any additional clauses you need in the grammar file itself, or put it in another file that you accumulate.


Generating the Parser

Generating the parser is actually a three-step process:

  1. Compile the grammar
  2. Run the grammar in tjsim and generate the parser
  3. Compile the generated parser

A shell script to do these three things is included in LambdaYacc. To use the shell script, type lyacc grammar parser [-fc]. The final, optional argument indicates you want LambdaYacc to generate the freshcopy clauses for you. For example, the parser for Lambda Prolog was created using the command lyacc lpgram lpparser -fc.

If your grammar compiles without any warnings, running lyacc could result in one of four outcomes:

  1. Successful creation of the parser with no conflicts or errors
  2. A reduce-reduce conflict
  3. Shift-reduce conflict(s)
  4. A failure for some reason

In the first case, you can run your parser, as described in the next section. In the second case, LambdaYacc will stop and report the error. In order to create the parser, you must fix the reduce-reduce conflict. If you get one or more shift-reduce conflicts, you can try to fix them in the grammar. Otherwise, you can specify whether to shift or reduce when prompted during the parser generation. After choosing whether to shift or reduce for each conflict, you can run your parser.

Unexpected failures can occur for many reasons. The two most frequent reasons seem to be:

  1. Missing a freshcopy predicate
  2. Missing a symbol in either terminal_list or nonterm_list

Another reason could be that you use the same variable twice in the grammar rules. Since it is a single list, each variable must have a different name, or else they will be considered the same term. Additionally, make sure you defined any predicates you used in the <Action> part of the rules.


Running the Parser

Running the parser is accomplished by running tjsim parser. For instance, the command to run the Lambda Prolog parser is tjsim lpparser.


Parsing

You have to options for using the parser: a line or a file. When parsing a line, you can specify the grammar symbol to consider the start symble. To parse a line, type parseline <start symbol> A. The system will prompt you for a line to parse and then return the parsed line. This is particularly helpful for debugging. For instance, to parse a line that is a term in the Lambda Prolog parser, you would use:

parseline (term T) A.

You can also parse a file with parsefile "<filename>" A.

Note that you can define your own predicates to parse that use these predicates. For instance, the Lambda Prolog parser uses a predicate parselp "<Name of Program>" A. to parse a Lambda Prolog module.


Combining Parsers

With the parser combiner, one can combine parsers made with LambdaYacc into one parser with separate clauses for parsing lines and files for two or more parsers.

To combine two parsers, use the predicate:

combinparsers [["<first grammar name>","<first parser name>"],["<second grammar name>", "<second parser name>"],...] "new parser name".
For instance, to combine the calculator parser (calcgrammar grammar file and calculator parser file) and the Lambda Prolog parser (lpgram and lpparser files) to a new module named bigparser, one would use:
combineparsers [["calcgrammar","calculator"],["lpgram","lpparser"]] "bigparser".
One can also specify any signatures that need accumulated (i.e., if you specified the language's abstract syntax in a separate module) via a third string in the list for each parser. Once a parser is complete, you can either parse a line or file as follows:
plw<parser name> A B, where A is a start symbol and B is the resulting parsing
pfw<parser name> A B, where A is the file name and B is the resulting parsing
For example, parsing a file "test.sig" with the Lambda Prolog parser in bigparser would be done with:
pfwlpparser "test.sig" A.
A combined parser will consist of the following files:
  • A .sig and .mod file for the parser (named with "new parser name")
  • For each parser included
    • <grammar name>_tmp.sig
    • <grammar name>_tmp.mod
    • <parser name>cp.sig
    • <parser name>cp.sig

Once you have created the parser using the combineparsers module, simply compile your new parser and run it.

Note that the parser must already be created for this to work.