Crafting Interpreters Day 1

This article assumes you are familiar with lexical analysis, which I covered in my previous article. Building on that article, I will implement the scanner for Lox.

Scanner

One important point is that our scanner reads the source code as one very, very long string. Separate lines of source code are not kept apart; they are simply concatenated, with \n between them.
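To make this concrete, here is a minimal sketch of how several lines end up as a single string, and how a line counter can still be recovered from the \n characters. The names SourceDemo, join, and countLines are hypothetical and only for illustration; they are not part of the book's code.

```java
import java.util.List;

class SourceDemo {
    // Join separate lines into the single string a scanner would receive.
    static String join(List<String> lines) {
        return String.join("\n", lines);
    }

    // Count lines the way a scanner tracks them: start at 1, add 1 per '\n'.
    static int countLines(String source) {
        int line = 1;
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) == '\n') line++;
        }
        return line;
    }

    public static void main(String[] args) {
        String source = join(List.of("var a = 1;", "var b = 2;"));
        System.out.println(source.length());
        System.out.println(countLines(source));
    }
}
```

The scanner never sees "lines" as such; it only sees characters, and the line number is just a counter bumped on every \n.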
Before we get to the core implementation of the scanner, we need to define the token type first.
A Token class is created with four properties.

Token(TokenType type, String lexeme, Object literal, int line) {
    this.type = type;
    this.lexeme = lexeme;
    this.literal = literal;
    this.line = line;
}

Each property is described in detail in my comments in Token.java. line is useful when we want to report the location of an error to the user.
For type, see TokenType.
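For readers without the source files at hand, here is a rough sketch of what the two pieces could look like together. The TokenType list is abbreviated on purpose, and the toString() format is an assumption inferred from the output shown later in this article, so treat it as an approximation rather than the exact contents of Token.java.

```java
// Abbreviated: the real TokenType enum lists every kind of Lox token.
enum TokenType {
    LEFT_PAREN, RIGHT_PAREN, SEMICOLON, SLASH,
    EQUAL, EQUAL_EQUAL, LESS, LESS_EQUAL,
    IDENTIFIER, STRING, NUMBER,
    IF, ELSE, PRINT, VAR, TRUE, FALSE,
    EOF
}

class Token {
    final TokenType type;  // kind of token, e.g. NUMBER
    final String lexeme;   // raw text taken from the source
    final Object literal;  // parsed value for literals, otherwise null
    final int line;        // line of the token, used in error reports

    Token(TokenType type, String lexeme, Object literal, int line) {
        this.type = type;
        this.lexeme = lexeme;
        this.literal = literal;
        this.line = line;
    }

    // Assumed format: "TYPE lexeme literal at line: N".
    @Override
    public String toString() {
        return type + " " + lexeme + " " + literal + " at line: " + line;
    }
}
```

For example, `new Token(TokenType.NUMBER, "1", 1.0, 5)` would keep the raw text "1" as the lexeme and the parsed value 1.0 as the literal.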
OK, now we can move on to the core function of Scanner.java.

// line 60: scanToken()
c <- next character

// single-character tokens
// this check is simple
if c is '(', ')', ...
add to token list

// one- or two-character tokens, e.g., '!=', '==', '<=', '>='
four cases: look at c, match() the next character against '=', and add the right token to the list

// comments, e.g., // blablabla
scan through the two slashes
keep scanning until the end of the line
add to token list

// any whitespace
ignore

// newline
line += 1

// string, e.g., "abc"
call scanStr()
scan through until the next "
add abc as the literal of a STRING token

// numbers may have a fractional part, e.g., xx.xx
if c is a digit
call number()
scan through digits until '.'
move past the '.', then scan digits again until the end
add to token list

// identifiers start with '_' or an English letter
call isAlpha()
check whether c is '_' or an English letter
scan through all alphanumerics
also check whether the text is a keyword such as "and", "class", ... defined in a HashMap
add to token list

// otherwise :(
bro, this is an unexpected character!
report an error
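The outline above can be sketched in Java roughly as follows. This is a trimmed-down illustration, not the real Scanner.java: MiniScanner is a hypothetical name, tokens are plain strings instead of Token objects, the keyword table holds only a few entries, and strings and comments are left out to keep it short. The helper names (advance, match, peek, peekNext) follow the style used in the book.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MiniScanner {
    // Only a few keywords, for illustration.
    private static final Map<String, String> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("and", "AND");
        KEYWORDS.put("class", "CLASS");
        KEYWORDS.put("var", "VAR");
        KEYWORDS.put("print", "PRINT");
    }

    private final String source;
    private final List<String> tokens = new ArrayList<>();
    private int start = 0;    // first character of the lexeme being scanned
    private int current = 0;  // next character to be consumed
    private int line = 1;

    MiniScanner(String source) { this.source = source; }

    List<String> scanTokens() {
        while (!isAtEnd()) {
            start = current;
            scanToken();
        }
        tokens.add("EOF");
        return tokens;
    }

    private void scanToken() {
        char c = advance();
        switch (c) {
            case '(': add("LEFT_PAREN"); break;
            case ')': add("RIGHT_PAREN"); break;
            case ';': add("SEMICOLON"); break;
            // one or two characters: peek at the next one with match()
            case '!': add(match('=') ? "BANG_EQUAL" : "BANG"); break;
            case '=': add(match('=') ? "EQUAL_EQUAL" : "EQUAL"); break;
            case '<': add(match('=') ? "LESS_EQUAL" : "LESS"); break;
            case '>': add(match('=') ? "GREATER_EQUAL" : "GREATER"); break;
            case ' ': case '\r': case '\t': break;  // ignore whitespace
            case '\n': line++; break;
            default:
                if (isDigit(c)) number();
                else if (isAlpha(c)) identifier();
                else throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at line " + line);
        }
    }

    private void number() {
        while (isDigit(peek())) advance();
        // consume a fractional part only if a digit follows the '.'
        if (peek() == '.' && isDigit(peekNext())) {
            advance();
            while (isDigit(peek())) advance();
        }
        add("NUMBER");
    }

    private void identifier() {
        while (isAlpha(peek()) || isDigit(peek())) advance();
        String text = source.substring(start, current);
        add(KEYWORDS.getOrDefault(text, "IDENTIFIER"));
    }

    // Consume the next character only if it is the expected one.
    private boolean match(char expected) {
        if (isAtEnd() || source.charAt(current) != expected) return false;
        current++;
        return true;
    }

    private char advance() { return source.charAt(current++); }
    private char peek() { return isAtEnd() ? '\0' : source.charAt(current); }
    private char peekNext() {
        return current + 1 >= source.length() ? '\0' : source.charAt(current + 1);
    }
    private boolean isAtEnd() { return current >= source.length(); }
    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
    private static boolean isAlpha(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    private void add(String type) {
        tokens.add(type + " " + source.substring(start, current) + " at line: " + line);
    }
}
```

Calling `new MiniScanner("a >= 2").scanTokens()` would walk through the identifier, two-character, and number branches in turn. Throwing on an unexpected character is a shortcut for the sketch; a real scanner would more likely report the error and keep scanning.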

Test of results

Now it is time to test our scanner. Here is the test case:

// test lox
print "hello world";

//test comments
var a = 1;
var b = 2;
if(a < b) {
    print true;
} else {
    print false;
}

You should have some answers in mind first. I also slightly modified toString() in the Token class, which serializes each token for printing. Let’s see the results:

SLASH // test lox null at line: 1
PRINT print null at line: 2
STRING "hello world" hello world at line: 2
SEMICOLON ; null at line: 2
SLASH //test comments null at line: 4
VAR var null at line: 5
IDENTIFIER a null at line: 5
EQUAL = null at line: 5
NUMBER 1 1.0 at line: 5
SEMICOLON ; null at line: 5
VAR var null at line: 6
IDENTIFIER b null at line: 6
EQUAL = null at line: 6
NUMBER 2 2.0 at line: 6
SEMICOLON ; null at line: 6
IF if null at line: 7
LEFT_PAREN ( null at line: 7
IDENTIFIER a null at line: 7
LESS < null at line: 7
IDENTIFIER b null at line: 7
RIGHT_PAREN ) null at line: 7
LEFT_BRACE { null at line: 7
PRINT print null at line: 8
TRUE true null at line: 8
SEMICOLON ; null at line: 8
RIGHT_BRACE } null at line: 9
ELSE else null at line: 9
LEFT_BRACE { null at line: 9
PRINT print null at line: 10
FALSE false null at line: 10
SEMICOLON ; null at line: 10
RIGHT_BRACE } null at line: 11
EOF  null at line: 12

The original code in the book does not print a token for comments: it consumes the characters up to the end of the line and discards them. To make comments show up in the output, I moved the addToken(SLASH) call from the else {} branch to the if {} branch. More details can be found in line 93.
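The sketch below isolates the '/' case to show the idea. It is a simplified, standalone illustration rather than the actual Scanner.java, and SlashDemo and scanSlashes are hypothetical names. Note one deliberate difference: it adds the token in both branches, on the assumption that a lone / used for division should still produce a SLASH token.

```java
import java.util.ArrayList;
import java.util.List;

class SlashDemo {
    // Returns a "SLASH lexeme" entry for every comment or lone '/' in source.
    static List<String> scanSlashes(String source) {
        List<String> tokens = new ArrayList<>();
        int current = 0;
        while (current < source.length()) {
            int start = current;
            char c = source.charAt(current++);
            if (c != '/') continue;  // every other character is skipped here

            if (current < source.length() && source.charAt(current) == '/') {
                // Comment: consume up to, but not including, the newline.
                while (current < source.length() && source.charAt(current) != '\n') {
                    current++;
                }
                tokens.add("SLASH " + source.substring(start, current));
            } else {
                // A single '/' is the division operator.
                tokens.add("SLASH " + source.substring(start, current));
            }
        }
        return tokens;
    }
}
```

The newline itself is left unconsumed so that the main loop can still see it and increment the line counter.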

Treat semicolon as terminator or not

The book tells a very interesting story here. Many modern languages use the newline as a statement terminator, which brings its own challenges. Sometimes we want to split a single statement across several lines of code. In that case, how does the language tell whether those lines belong to the same statement? Different programming languages set different rules.
Python treats every newline as significant unless the line ends with a \, in which case the statement continues on the next line. Newlines inside (), [], or {}, however, are ignored.
What’s interesting about JavaScript is that it treats every newline as meaningless until it runs into a parse error. Then it goes back and tries turning the previous newline into a semicolon to get a valid statement. This is the work of “automatic semicolon insertion”, which is weird.
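The Python rule described above is simple enough to sketch in a few lines. The code below is written in Java, like the rest of this series, and counts the newlines that would end a statement. It is a simplification under stated assumptions: NewlineRule and significantNewlines are hypothetical names, and string literals and comments are not handled.

```java
class NewlineRule {
    // Counts newlines that end a statement under a simplified Python-style rule:
    // a newline is ignored inside (), [], {} or right after a backslash.
    static int significantNewlines(String source) {
        int depth = 0;  // how many brackets are currently open
        int count = 0;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '(' || c == '[' || c == '{') {
                depth++;
            } else if (c == ')' || c == ']' || c == '}') {
                depth = Math.max(0, depth - 1);
            } else if (c == '\n') {
                boolean continued = i > 0 && source.charAt(i - 1) == '\\';
                if (depth == 0 && !continued) count++;
            }
        }
        return count;
    }
}
```

A scanner for such a language would emit a newline token only in the cases counted here and treat every other newline as plain whitespace.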

Reference