Lexical Analysis

The goal of lexical analysis

                            <Class, string> = token
string -> [Lexical Analyzer] --------------> [Parser]

The goal of lexical analysis is to partition the input string into lexemes (the substring units we want to identify), determine the token class of each lexeme, and pass the result to the parser. In other words, the first part, about lexemes, partitions the input into units, and the second part, about tokens, assigns a class to each unit.

  1. Given an input string, such as foo = 42
  2. The lexical analyzer identifies the tokens <"identifier", "foo">, <"operator", "=">, and <"integer", "42">
  3. The tokens are passed to the parser

There are several kinds of token classes, such as Identifier, Integer, Keyword, Whitespace, and Operator (e.g., \t\n counts as a single whitespace token!). Two interesting classes: one is Identifier, a string of letters or digits that starts with a letter, e.g. A1, B17. Another is Keyword, which covers words like if, else, etc.

To record which value belongs to which token class, a token consists of a token name (the class) and an attribute value (the lexeme). To make a brief summary, an implementation of lexical analysis must do two things: 1. recognize lexemes, 2. identify the token class of each lexeme.

There is an example from the FORTRAN language in the course: the difference between DO 5 I = 1,25 and DO 5 I = 1.25. Whitespace in FORTRAN is insignificant, so we simply ignore it. The first statement is a for-loop, while the second is a plain assignment. In the first one, DO is a keyword, but in the second, DO is just a substring of an identifier. Here is the problem: it makes sense to scan the input from left to right, but how do we decide whether DO is a keyword or part of an identifier while we are still on the left-hand side? We need the right-hand side, where the difference between 1,25 and 1.25 lies; therefore, we need look-ahead.

Another example: the difference between C++ template syntax and stream syntax:

Foo<Bar<Bazz> > var;  // template syntax
cin >> var;           // stream syntax

How does the lexical analyzer tell whether >> closes a nested template or is the stream operator? To sidestep the ambiguity, older C++ only permitted nested templates to be written with a space, as in Foo<Bar<Bazz> >.

  1. The goal is to partition the string. This can be implemented by scanning the input from left to right, recognizing one token at a time.
  2. Look-ahead may be required to decide token boundaries.

Lexical specification

To specify which set of strings belongs to each token class, we can use regular languages, and to define regular languages we use regular expressions. Assume a meaning function L that maps an expression to a set of strings: L(E) = set of strings. The meaning function maps syntax to semantics.

Why is this mapping to semantics important? It lets us treat notation as a separate issue: the same semantics can be expressed by several syntaxes. Sometimes our syntax is not the best fit for the problem, so we can look for another one based on the same semantics.

How to check whether a string belongs to L(R) or not?

  1. List the regular expressions of all token classes
  2. Union them: L(R) = L(token_a) U L(token_b) U L(token_c) ...
  3. Given the input x, check whether some prefix sx belongs to L(R)
  4. If it does, remove sx from x and emit its token. Feed the rest of x back to step 3 until x is empty.

What if multiple token classes match? “Maximal munch”: choose the longest matched prefix. If sx still matches several classes, it belongs to the class with the higher priority. For example, if could be an identifier or a keyword, and keyword has the higher priority.

Lexical implementation

We used regular expressions to write the lexical specification, but how do we implement it? We can use finite automata. A regular expression by itself does not tell us how to scan a string, so we need finite automata as our machine model. There is a brief introduction to automata on geeksforgeeks. From it, we know that an automaton consists of an input alphabet, a finite set of states S, a start state, a set of accepting states, and a set of transitions. A transition means a state receives an input and moves to another state.

What is the difference between a Deterministic Finite Automaton (DFA) and a Nondeterministic Finite Automaton (NFA)? A DFA has one transition per input per state, and no epsilon moves are allowed. An NFA may have multiple transitions for one input in a given state, and epsilon moves are allowed. (An epsilon move is a transition to a different state without consuming any input.)

Previously, we saw that we can use regular expressions for the lexical specification. Regular expressions, NFAs, and DFAs all accept the same class of languages (the regular languages), even though they have totally different definitions. Therefore, to recognize the same strings, we can convert a regular expression to a DFA with simple rules and use it for lexical analysis. In the following steps, we convert regular expressions to an NFA, the NFA to a DFA, and finally the DFA to a table-driven implementation.

Takeaway: why do we need automata? REs, NFAs, and DFAs can all be used to validate strings, but it is difficult to implement an RE directly in code. Automata let us describe the transition behavior between states explicitly.

From regular expression to NFA

So far, we know that we want to implement the lexical specification, which decides which token class each lexeme belongs to.

Lexical specification --> Regular expression --> NFA --> DFA --> Table-driven implementation of DFA

Learn how to transition between states with epsilon moves. Here is an example exercise of converting a regular expression to an NFA. Given the regular expression 1* + 0, which NFA should be picked?
The answer is the first one. What is the difference between the first and the second NFA? In the first NFA, even though I can get back to the previous state, I can no longer choose the path of 0 once I am on the path of 1. In the second NFA, I still have the chance to go down the path of 0 even after taking the path of 1. Therefore, the second NFA corresponds to the regular expression 1*0.

Implement NFA in C

Here is my backtracking implementation of an NFA in C.
I maintain a data structure for the NFA transition table. There are four states, and a state can also transition to another state without consuming any input, which is the epsilon move shown in the rightmost column.
To understand the source code, let's look at the pseudo code first:

bool nfa(cur_state, pos) {
    if (all input processed && cur_state == final_state) return true;
    if (all input processed && cur_state != final_state) return false;

    // transitions that consume the next input symbol
    for next_state in transition[cur_state][input[pos]]
        if ( nfa(next_state, pos + 1) == true ) return true;

    // epsilon moves: change state without consuming input
    for next_state in transition[cur_state][epsilon]
        if ( nfa(next_state, pos) == true ) return true;

    return false;
}
The first loop handles transitions that consume one input symbol, while the second loop handles epsilon moves.

From NFA to DFA

Because an NFA may have multiple next states for a given state and input, it is not convenient for lexical analysis directly. We convert the NFA to a DFA to get a deterministic transition machine. Given an NFA for a(b + c)*,
we calculate an epsilon-closure for each state. For example, n0 transitions to n1 with the input a, and n1 can then reach n1, n2, n3, n4, n6, and n9 through epsilon moves. Let me circle it for you:
imagine the red circle as a single state q1. q1 accepts input b and transitions to {n5, n8, n9, n3, n4, n6}, which can be assigned as q2. q2 then uses specific inputs to transition to the next states.

     a         b
q0 ----->  q1 -----> q2 -----> ...

At the end, we get our DFA. q1 and q2 both include n9; therefore, they are both accepting states. Of course, there are different kinds of algorithms for calculating epsilon-closures.

Table-driven implementation of DFA

DFA can be implemented by a 2D table.

state   input a
  i        k
  j        k

The leftmost column shows the states, and the topmost row shows the inputs. The table means that state i accepts input a and transitions to state k.

while (input[i]) {
    state = A[state][input[i++]];
}

The disadvantage of this strategy for implementing a DFA is that there may be a lot of duplicate rows in the table. Instead of using a 2D table, we can switch to a 1D table of row pointers.

state              a    b
  i  ----+
         +---->  [ k    m ]
  j  ----+

Both state i and state j point to the same physical row, which maps input a to state k and input b to state m. Therefore, we save the space cost of the duplicate rows; however, each move takes a little more time because of the pointer indirection.
Sometimes, the cost of turning an NFA into a DFA is so high that we implement the NFA directly instead. Create a table for the NFA:

state     0      1      epsilon
  …      {}     {B}     {}
  …      {}     {F}     {C, D}

This table has epsilon as an input column. A state receives an input and transitions to a set of states (in a DFA it would be a single state; this is the difference). The table is guaranteed to stay relatively small because the number of states and the size of the input alphabet are bounded, and we can use the same trick to share duplicate rows. The other difference is that simulating an NFA is slower, because we track a set of states at each step rather than a single state.
To summarize, a DFA is faster but less compact (larger tables), while an NFA is slower but more concise. This is a trade-off everyone can decide on their own.


Here comes a handwritten summary.


  1. Introduction to Finite Automata
  2. Compiler Principles: Finite Automata (DFA and NFA)
  3. Converting an NFA to a DFA: the Subset Construction Algorithm