Parsing

This is the second part of the compiler, which receives tokens from lexical analysis. If you are still not familiar with lexical analysis, here is the door.

The goal of parsing

The parser receives a sequence of tokens from the lexer and outputs a parse tree of the program. Here is a very clear example from the course:

if x == y then 1 else 2
# parser input (given by lexer)
IF ID == ID THEN INT ELSE INT
# parser output
        IF-THEN-ELSE
        /     |     \
      ==     INT    INT
     /  \
   ID    ID
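
As a side note (not from the course notes), a parse tree like the one above can be represented in code as a small node structure; the class and label names below are just illustrative:

# A minimal sketch of the parse tree above as a data structure.
# The Node class and the labels are illustrative, not from the course.
class Node:
    def __init__(self, label, children=None):
        self.label = label              # e.g. "IF-THEN-ELSE", "==", "ID", "INT"
        self.children = children or []  # child subtrees, left to right

# Parse tree for: if x == y then 1 else 2
tree = Node("IF-THEN-ELSE", [
    Node("==", [Node("ID"), Node("ID")]),  # condition: x == y
    Node("INT"),                           # then-branch: 1
    Node("INT"),                           # else-branch: 2
])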

To summarize,

Phase    Input              Output
lexer    String of chars    String of tokens
parser   String of tokens   Parse tree

Some compilers also combine the lexer and parser into a single phase, making the parser responsible for lexical analysis as well.

Parsing is the process of constructing a derivation from an input sentence. The parser outputs a derivation for a valid program, or an error message for an invalid one. For an unambiguous language, a derivation is equivalent to a parse tree.

Context-free grammars (CFG)

The parser must distinguish between valid and invalid strings of tokens. Therefore, we need a language for describing the valid strings and separating them from the invalid ones.
Programming languages always have recursive structure, such as if .. then .. else or while ... loops, and CFGs are the natural notation for this recursive structure. Now let us introduce CFGs. A CFG is a set of replacement rules. For example,

A -> B | C;
B -> E | F;
E -> e;

A can be replaced with B or C, B can be replaced with E or F, and E can be replaced with e. e cannot be replaced. A CFG can be used to parse an input just like a regular expression; however, CFGs are more powerful than regular expressions and can describe more complicated grammars.
Here are the definitions of the components of a CFG:

  1. nonterminal N: a symbol that can be replaced; also called a variable.
  2. terminal T: a symbol with no rule to replace it, like e above.
  3. production: a replacement rule; P denotes the set of productions, like the rules above.
  4. start symbol S: the root of the parse tree. S belongs to the nonterminals.

After listing the definitions, here are the steps of a derivation (see the sketch after this list):

  1. Begin with the start symbol S.
  2. Replace any nonterminal N in the string by the right-hand side of one of its productions.
  3. Repeat the second step until no nonterminals remain.
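
To make the replacement process concrete, here is a small sketch (mine, not from the course) that stores the toy grammar above and performs one possible leftmost derivation; the terminal rules for C and F are assumptions added so that every derivation can finish:

import random

# The toy grammar A -> B | C; B -> E | F; E -> e from above, plus assumed
# terminal rules for C and F so that every derivation terminates.
grammar = {
    "A": [["B"], ["C"]],
    "B": [["E"], ["F"]],
    "E": [["e"]],
    "C": [["c"]],  # assumption, not in the original example
    "F": [["f"]],  # assumption, not in the original example
}

def derive(start):
    """Repeatedly replace the leftmost nonterminal until only terminals remain."""
    sentence = [start]
    while any(sym in grammar for sym in sentence):
        i = next(i for i, sym in enumerate(sentence) if sym in grammar)
        rhs = random.choice(grammar[sentence[i]])      # pick one production
        sentence = sentence[:i] + rhs + sentence[i + 1:]
        print(" ".join(sentence))
    return sentence

derive("A")  # one possible run prints: B, then E, then e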

Here is another example showing how to restrict the format of a date to xx year xx month xx day:

S -> Y M D
Y -> YN YT
YN -> DIG1_9 DIG0_9 DIG0_9 DIG0_9
DIG0_9 -> ZERO | DIG1_9
DIG1_9 -> ONE | DIG2_9
DIG2_9 -> TWO | DIG3_9
DIG3_9 -> THREE | DIG4_9
DIG4_9 -> "4" | "5" | "6" | "7" | "8" | "9"
ZERO -> "0"
ONE -> "1"
TWO -> "2"
THREE -> "3"
YT -> "year"
M -> MN MT
MN -> DOZEN  # from 1 to 12
MT -> "month"
D -> DN DT
DT -> "day"
DOZEN -> DIG1_9 | ONE ZERO | ONE ONE | ONE TWO
DN -> DIG1_9 | ONE DIG0_9 | TWO DIG0_9 | THREE ZERO | THREE ONE

This is an interesting example of a CFG; however, it does not check whether the date is valid for each month (for example, there is no 31st of April). Still, this is enough for us to figure out what a CFG is.
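
As a rough sketch of how such a grammar can be checked in practice (this code is not from the original post, and the function names are made up), here is a small recursive-descent recognizer for the date grammar above, working on a list of tokens:

# A hypothetical recursive-descent recognizer for the date grammar above.
# Input is a token list, e.g. ["2", "0", "2", "0", "year", "1", "2", "month", "2", "0", "day"].
def accept_date(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"unexpected token {tok!r} at position {pos}")
        pos += 1
        return tok

    def digit(lo, hi):          # matches one digit between lo and hi
        tok = peek()
        if tok is not None and tok.isdigit() and lo <= int(tok) <= hi:
            return eat()
        raise SyntaxError(f"expected a digit {lo}..{hi}, got {tok!r}")

    def year():                 # Y -> YN "year"; YN -> DIG1_9 DIG0_9 DIG0_9 DIG0_9
        digit(1, 9); digit(0, 9); digit(0, 9); digit(0, 9); eat("year")

    def month():                # M -> DOZEN "month"; DOZEN -> DIG1_9 | ONE (ZERO|ONE|TWO)
        first = digit(1, 9)
        if first == "1" and peek() is not None and peek().isdigit():
            digit(0, 2)
        eat("month")

    def day():                  # D -> DN "day"; DN -> DIG1_9 | (ONE|TWO) DIG0_9 | THREE (ZERO|ONE)
        first = digit(1, 9)
        if first in ("1", "2") and peek() is not None and peek().isdigit():
            digit(0, 9)
        elif first == "3" and peek() is not None and peek().isdigit():
            digit(0, 1)
        eat("day")

    year(); month(); day()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after the date")
    return True

print(accept_date(list("2020") + ["year"] + list("12") + ["month"] + list("20") + ["day"]))  # True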

CFGs look powerful, but they are still not enough. A CFG alone can only tell us whether the received input is valid or not. To build a parse tree, we also need to know how the string is derived in the language, and we must handle errors to give feedback to the programmer. What's more, many grammars generate the same language, but only some of them are accepted by parsing tools, so the form of the grammar is important.

Here is a short summary of CFGs:

/12-20-20/CFG1.png
/12-20-20/CFG2.png

Derivations & Ambiguity

We are not only interested in checking whether a string matches a CFG, but also want to build a parse tree for the string. A derivation looks like this:

E
-> E + E
-> E * E + E
-> ...

It is similar to the concept of replacement rules. A derivation defines a parse tree, but one parse tree can have many derivations. Here is a sample parse tree:
/12-20-20/derivation.png
The reason why one parse tree may have multiple derivations is that we can expand the nonterminals in different orders. Among all the derivations, the leftmost and rightmost derivations are the important ones for parser implementation; a leftmost derivation always replaces the leftmost nonterminal first. There is another concept, ambiguity: a grammar is ambiguous if some string has more than one parse tree. To eliminate the ambiguity, the most direct method is to rewrite the grammar unambiguously. Consider the grammar,

E -> if E then E
   | if E then E else E
   | OTHER

Given an expression if E1 then if E2 then E3 else E4, we can get two parse trees:
/12-20-20/ambiguity.png
We would like to adopt the rule that else matches the closest unmatched then; therefore, the parse tree on the left is not what we want (if E2 then E3 else E4 should be matched together first).
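
To make the two readings concrete, here is a tiny illustration (written by hand, not from the post) of both parse trees for the same token string, as nested tuples:

# The ambiguous string: if E1 then if E2 then E3 else E4
# Reading 1: the else belongs to the inner if (the "closest unmatched then" rule we want).
tree_inner = ("if-then", "E1",
              ("if-then-else", "E2", "E3", "E4"))

# Reading 2: the else belongs to the outer if (the tree we do not want).
tree_outer = ("if-then-else", "E1",
              ("if-then", "E2", "E3"),
              "E4")

# Both trees correspond to the same token string, which is exactly what ambiguity means.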

Takeaway: ambiguity occurs when two or more parse trees exist for some string in the language.

First & Follow set

Before we learn about leftmost and rightmost derivations, we first need to learn about the First set and the Follow set.
/12-20-20/firstandfollow.png

OK! Now we can learn how to use the First and Follow sets in parsing.

Why do we need the First and Follow sets? They act as a pattern: First(X) is the set of terminals that can begin a string derived from X, and Follow(X) is the set of terminals that can appear immediately after X. When the parser looks at the next token, these sets tell it which production to choose next.
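
Since the precise definitions only appear in the image above, here is a rough sketch of the standard fixed-point computation of First and Follow sets, written from memory rather than taken from the post. The grammar below is just an example (the classic non-left-recursive expression grammar), with "eps" standing for the empty string and "$" for the end-of-input marker:

# Fixed-point computation of First and Follow sets for a small example grammar.
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["eps"]],
    "F":  [["(", "E", ")"], ["id"]],
}
nonterminals = set(grammar)

def first_of(seq, first):
    """First set of a sequence of grammar symbols."""
    result = set()
    for sym in seq:
        syms = first[sym] if sym in nonterminals else {sym}
        result |= syms - {"eps"}
        if "eps" not in syms:
            return result
    result.add("eps")  # every symbol in the sequence can derive the empty string
    return result

# First sets: iterate until nothing changes.
first = {n: set() for n in nonterminals}
changed = True
while changed:
    changed = False
    for n, prods in grammar.items():
        for prod in prods:
            new = first_of(prod, first)
            if not new <= first[n]:
                first[n] |= new
                changed = True

# Follow sets: iterate until nothing changes.
follow = {n: set() for n in nonterminals}
follow["E"].add("$")  # the start symbol gets the end marker
changed = True
while changed:
    changed = False
    for n, prods in grammar.items():
        for prod in prods:
            for i, sym in enumerate(prod):
                if sym not in nonterminals:
                    continue
                rest = first_of(prod[i + 1:], first)     # what can follow sym inside this production
                new = rest - {"eps"}
                if "eps" in rest:                        # everything after sym can vanish
                    new |= follow[n]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True

print(first)   # e.g. First(E) = {"(", "id"},  First(E') = {"+", "eps"}
print(follow)  # e.g. Follow(E) = {")", "$"},  Follow(F) = {"*", "+", ")", "$"}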

Leftmost derivations (LL)

Always derive the leftmost nonterminal first. This is also known as LL parsing or top-down parsing. In the name 'LL', the first 'L' means the parser processes the input from left to right, and the second 'L' means it produces a leftmost derivation.
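
As a minimal sketch of top-down parsing (assuming the same example grammar as in the First/Follow sketch above; the parse table is built by hand from those sets and is not from the post), an LL(1) parser keeps a stack and always expands the leftmost nonterminal using the lookahead token:

# A table-driven LL(1) recognizer for the example expression grammar.
# Empty right-hand sides stand for the eps production; "$" marks end of input.
table = {
    ("E",  "("): ["T", "E'"],  ("E",  "id"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T",  "("): ["F", "T'"],  ("T",  "id"): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F",  "("): ["(", "E", ")"], ("F",  "id"): ["id"],
}
nonterminals = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]
    stack = ["$", "E"]                 # start symbol on top of the end marker
    pos = 0
    while stack:
        top = stack.pop()
        if top == tokens[pos] == "$":
            return True                # the whole input was matched
        if top not in nonterminals:    # terminal: must match the lookahead
            if top != tokens[pos]:
                raise SyntaxError(f"expected {top!r}, got {tokens[pos]!r}")
            pos += 1
        else:                          # nonterminal: consult the table, expand leftmost first
            key = (top, tokens[pos])
            if key not in table:
                raise SyntaxError(f"no rule for {key}")
            stack.extend(reversed(table[key]))
    return False

print(ll1_parse(["id", "+", "id", "*", "id"]))  # True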

Rightmost derivations (LR)

Always derive the rightmost nonterminal first. This is also known as LR parsing or bottom-up parsing; an LR parser actually produces a rightmost derivation in reverse. In the name 'LR', the 'L' again means the input is processed from left to right, and the 'R' means rightmost derivation.
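
For contrast, here is a hand-worked shift-reduce trace (my own illustration, using the unambiguous grammar E -> E + T | T, T -> T * F | F, F -> id, which is not the grammar discussed above) showing how a bottom-up parser builds a rightmost derivation in reverse:

# Shift-reduce trace for the input "id + id * id".
# Reading the "reduce" actions from bottom to top gives a rightmost derivation.
trace = [
    # (stack,            remaining input,    action)
    ("$",                "id + id * id $",   "shift"),
    ("$ id",             "+ id * id $",      "reduce F -> id"),
    ("$ F",              "+ id * id $",      "reduce T -> F"),
    ("$ T",              "+ id * id $",      "reduce E -> T"),
    ("$ E",              "+ id * id $",      "shift"),
    ("$ E +",            "id * id $",        "shift"),
    ("$ E + id",         "* id $",           "reduce F -> id"),
    ("$ E + F",          "* id $",           "reduce T -> F"),
    ("$ E + T",          "* id $",           "shift"),
    ("$ E + T *",        "id $",             "shift"),
    ("$ E + T * id",     "$",                "reduce F -> id"),
    ("$ E + T * F",      "$",                "reduce T -> T * F"),
    ("$ E + T",          "$",                "reduce E -> E + T"),
    ("$ E",              "$",                "accept"),
]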