The following page is meant to be a guide for specifying a grammar,
generating the corresponding parser, and using the parser in
LambdaYacc.
Specifying a grammar has eight parts:
- The typing declarations for types in the grammar (located in the
.sig file)
- Information for the tokenzier
- Lists of the terminal and non-terminal grammar symbols
printname predicates
- The rules for the grammar
- Precedence and associativity predicates
freshcopy clauses
- Other predicates for running or using the parser
We will detail each part.
Declarations in all grammars
No matter what grammar you are specifying, your signature and module
must accumulate lambdayacc. Therefore, accum_sig
lyaccshares,blists. and accum_sig lambdayacc. accumulate lambdayacc. should be
in your .sig file and .mod file, respectively.
Typing Declarations
It is necessary to put the typing declarations for all terminals
and non-terminals in your .sig file. Terminal symbols all be of
the type gs. Non-terminals all have the type
<something> -> gs, where something is
a kind that you have defined. For instance, the Lambda Prolog parser
has the following declarations in lpgram.sig
for kinds and types:
kind typ type.
kind knd type.
type typet, kindt, arrowt gs.
type formt, intt, stringt, realt, listt gs.
type akind (knd -> gs).
type atype (typ -> gs).
The typet, kindt, arrowt, formt, etc., are the terminals
representing the keywords for types and kinds such as "type", "kind",
"->", and "o". akind and atype are the
non-terminals representing kinds and types in the language. We will
see how they are used in later sections.
Information for the Tokenizer
Two parts are needed for the tokenizer: what characters are valid
for tokens and what characters represent comments. The former uses a
predicate of the form tokchar A :- ... One can specify
the characters manually in combination with a set of pre-defined
groups:
letter A | All upper and lower case letters
of the alphabet, plus the underscore (_) character |
letnum A | The same as letter
A, with the addition of the decimal digits 0-9 |
alltyp | All printable characters,
including punctuation marks |
Note that even when using letnum A or alltyp A, the first character
of a token cannot be a number. The tokenizer must recognize tokens
that start with a number as an integer.
Examples of the tokchar predicate would be:
| All letters and numbers | tokchar A :- letnum
A.
|
| Lambda Prolog's tokens, which is all characters with the
exception of ",", ";", "(", ")", "[", "]", ":", etc. |
tokchar S :- alltyp S, C is string_to_int S,
not (C = 46; C = 44; C = 41; C = 40; C = 59; C = 91; C = 93; C = 124;
C = 58; C = 92; C = 34; C = 126). |
In the second example, we use the ASCII character codes for the
characters. It would also be possible to use the characters
themselves as a string.
The second part of information for the tokenizer is a list of
strings with which single-line comments can start. The tokenizer
already supports block comments that start and end with /* */. This
predicate takes the form comment_list <List of
strings>. The strings can be up to two characters
long. The comment predicate for Lambda Prolog looks like:
comment_list ["%"].
For Twelf, which
supports comments that start with a "#", or a "%" followed by a space,
the predicate would be
comment_list ["#","% "].
Lists of terminal and non-terminal grammar symbols
Every grammar symbol must be included in one of two lists,
indicating whether it is a terminal or non-terminal. The two lists
are specified by
| Terminals | terminal_list <list of
terminals> |
| Non-terminals | nonterm_list <list of
non-terminals> |
See the Lambda Prolog
Grammar for an example. Note that you must include
iconst, id, and sconst in order
to have integers, strings, and strings in quotations marks
recognized.
Update: The non-terminals are now specified by the arity of
the term, up to an arity of 4. Note that if you are using an older
grammar where the non-terminals are in a list with arguments, you can
still use them, however the system will not be able to generate
freshcopy clauses for you.
printname Predicates
The printname predicates tell the tokenizer what
strings correspond to tokens in a file being parsed. They can include
letters, numbers, or punctuation symbols. The predicates are of the
form printname _ <string>
<token>.
Here are a few examples taken from the Lambda
Prolog Grammar.
printname _ "module" modulet.
printname _ "=>" impt.
printname _ "+" addt.
The tokens should all be terminal grammar symbols declared in the
.sig file that appear in terminal_list.
Grammar Rules
The grammar rules are the valid productions in the syntax of the
language. First of all, one must specify the starting symbol of the
grammar. This is done with the start_symbol
<symbol>. predicate, where <symbol> is
one of the non-terminal symbols of the language.
The productions for the language are specified as a list, cfg
<rules>., where a rule takes the form
rule
(<Lefthand side> ==> <Righthand side>)
(<Action>)
The <Lefthand side> is a non-terminal
symbol in the language. The <Righthand
side> is a list of terminal and non-terminal symbols in
the language. The <Action> is Lambda Prolog
code indicating what should be done for this rule. Let us look at a
few examples.
The first example is taken from the "on-line" calculator. The rule for
addition would be
rule ((ae R4) ==> [(ae A4),plust,(ae B4)]) (R4 is (A4 + B4))
This rule says that a term that has two arithmetic expressions
(i.e., integers) with a plust symbol ("+") in the middle should be
added together and returned as the arithmetic expression which is
their sum.
A more complicate example would be the rule for a Lambda Prolog
signature file, which includes a header, preamble, and sign
declarations:
rule ((signaturedef SIGNATURE1) ==> [sigheader SH1,sigpreamble
SP1,signdecls SD1]) (SIGNATURE1 = definesig SH1 SP1 SD1)
Each of the non-terminals in the righthand side of the rule has its
own rule(s). This rule says that the information from those
non-terminals should be combined with the predicate
definesig and returned as a
signaturedef.
Precedence and associativity predicates
The precedence and associativity predicates are used to help
resolve shift-reduce conflicts. Both binary infix operators and unary
operators have corresponding predicates. Declaring precedence and
associativity for a binary operator takes the form binaryop
<token> <left argument> <right
argument> <assoc> <prec>.,
where <left argument> and <right argument>
are tokens corresponding to the arguments to the operator;
<assoc> is "left", "right", or
"none"; and <prec> is an integer specifying
the precedence level. For example, the binaryop
predicate for addition in the "on-line" calculator is
binaryop plust (ae X) (ae Y) "left" 3.
The binaryop declaration for implication in the Lambda
Prolog Grammar is
binaryop impt (term A) (term B) "right" 130.
Precedence for a unary operator is very similar, declared with the
predicate unaryop <token> <argument>
<prec>.
Another class of operators that may need precedence and
associativity predicates are the implicit operators, e.g., function
application. The predicate is just like the one for binary operators: implicitop
<token> <left argument> <right
argument> <assoc> <prec>.
freshcopy predicates
NOTE: This part is no longer needed, since LambdaYacc can generate the freshcopy clauses for you.
The freshcopy predicates are necessary in order to
avoid binding terms in the non-terminals. There must be a
freshcopy predicate for each non-terminal grammar symbol
that has an argument. For instance, for the atype
non-terminal from the Lambda
Prolog Grammar is
freshcopy (atype A) (atype B) :- !.
The variable arguments to the non-terminals MUST be different.
Additionally, each must have a cut. The final freshcopy
predicate should always be freshcopy X X. This will
catch all terminals and non-terminals without arguments.
Eventually, the LambdaYacc will be fixed to generate these
predicates automatically from the list of non-terminals.
Other predicates for running or using the parser
In the course of defining the grammar, you may need to define other
predicates. For instance, the Lambda
Prolog Grammar required predicates to form lambda abstractions and
to parse and find infix declarations in a module. You can put any
additional clauses you need in the grammar file itself, or put it in
another file that you accumulate.
Generating the parser is actually a three-step process:
- Compile the grammar
- Run the grammar in tjsim and generate the parser
- Compile the generated parser
A shell script to do these three things is included in LambdaYacc.
To use the shell script, type lyacc grammar
parser [-fc]. The final, optional argument indicates you
want LambdaYacc to generate the freshcopy clauses for you. For
example, the parser for Lambda Prolog was created using the command
lyacc lpgram lpparser -fc.
If your grammar compiles without any warnings, running lyacc
could result in one of four outcomes:
- Successful creation of the parser with no conflicts or errors
- A reduce-reduce conflict
- Shift-reduce conflict(s)
- A failure for some reason
In the first case, you can run your parser, as described in the
next section. In the second case, LambdaYacc will stop and report the
error. In order to create the parser, you must fix the reduce-reduce
conflict. If you get one or more shift-reduce conflicts, you can try
to fix them in the grammar. Otherwise, you can specify whether to
shift or reduce when prompted during the parser generation. After
choosing whether to shift or reduce for each conflict, you can run
your parser.
Unexpected failures can occur for many reasons. The two most
frequent reasons seem to be:
- Missing a
freshcopy predicate
- Missing a symbol in either
terminal_list or
nonterm_list
Another reason could be that you use the same variable twice in the
grammar rules. Since it is a single list, each variable must have a
different name, or else they will be considered the same term.
Additionally, make sure you defined any predicates you used in the
<Action> part of the rules.