Dancing with LLVM: A Moonbit Chronicle (Part 1) - Implementing the Frontend
Introduction
Programming language design and compiler implementation have long been considered among the most challenging topics in computer science. The traditional path to learning compilers often requires students to first master a complex set of theoretical foundations:
- Automata Theory: Finite state machines and regular expressions
- Type Theory: The mathematical underpinnings of λ-calculus and type systems
- Computer Architecture: Low-level implementation from assembly language to machine code
However, Moonbit, a functional programming language designed for the modern development landscape, offers a fresh perspective. It not only features a rigorous type system and exceptional memory safety guarantees but, more importantly, its rich syntax and toolchain tailored for the AI era make it an ideal choice for learning and implementing compilers.
Series Overview
This series of articles will delve into the core concepts and best practices of modern compiler implementation by building a small programming language compiler called TinyMoonbit.
- Part 1: Focuses on the implementation of the language frontend, including lexical analysis, parsing, and type checking, ultimately generating an abstract syntax tree with complete type annotations.
- Part 2: Dives into the code generation phase, utilizing Moonbit's official llvm.mbt binding library to convert the abstract syntax tree into LLVM intermediate representation and finally generate RISC-V assembly code.
TinyMoonbit Language Design
TinyMoonbit is a systems-level programming language with an abstraction level comparable to C. Although its syntax heavily borrows from Moonbit, TinyMoonbit is not a subset of the Moonbit language. Instead, it is a simplified version designed to test the feature completeness of llvm.mbt while also serving an educational purpose.
Note: Due to space constraints, the TinyMoonbit implementation discussed in this series is simpler than the actual TinyMoonbit. For the complete version, please refer to TinyMoonbitLLVM.
Core Features
TinyMoonbit provides the fundamental features required for modern systems programming:
- ✅ Low-level Memory Operations: Direct pointer manipulation and memory management
- ✅ Control Flow Structures: Conditional branches, loops, and function calls
- ✅ Type Safety: Static type checking and explicit type declarations
- ✅ Simplified Design: To reduce implementation complexity, advanced features like type inference and closures are not supported.
Syntax Example
Let's demonstrate TinyMoonbit's syntax with a classic implementation of the Fibonacci sequence:
extern fn print_int(x : Int) -> Unit;

// Recursive implementation of the Fibonacci sequence
fn fib(n : Int) -> Int {
  if n <= 1 {
    return n;
  }
  return fib(n - 1) + fib(n - 2);
}

fn main {
  print_int(fib(10));
}
Compilation Target
After the complete compilation process, the above code will generate the following LLVM Intermediate Representation:
; ModuleID = 'tinymoonbit'
source_filename = "tinymoonbit"
define i32 @fib(i32 %0) {
entry:
%1 = alloca i32, align 4
store i32 %0, ptr %1, align 4
%2 = load i32, ptr %1, align 4
%3 = icmp sle i32 %2, 1
br i1 %3, label %4, label %6
4: ; preds = %entry
%5 = load i32, ptr %1, align 4
ret i32 %5
6: ; preds = %entry
%7 = load i32, ptr %1, align 4
%8 = sub i32 %7, 1
%9 = call i32 @fib(i32 %8)
%10 = load i32, ptr %1, align 4
%11 = sub i32 %10, 2
%12 = call i32 @fib(i32 %11)
%13 = add i32 %9, %12
ret i32 %13
}
define void @main() {
entry:
%0 = call i32 @fib(i32 10)
call void @print_int(i32 %0)
ret void
}
declare void @print_int(i32 %0)
Chapter 2: Lexical Analysis
Lexical Analysis is the first stage of the compilation process. Its core mission is to convert a continuous stream of characters into a sequence of meaningful tokens. This seemingly simple conversion process is, in fact, the cornerstone of the entire compiler pipeline.
From Characters to Symbols: Token Design and Implementation
Consider the following code snippet:
let x : Int = 5;
After being processed by the lexer, it will produce the following sequence of tokens:
(Keyword "let") → (Identifier "x") → (Symbol ":") →
(Type "Int") → (Operator "=") → (IntLiteral 5) → (Symbol ";")
This conversion process needs to handle various complex situations:
- Whitespace Filtering: Skipping spaces, tabs, and newlines.
- Keyword Recognition: Distinguishing reserved words from user-defined identifiers.
- Numeric Parsing: Correctly identifying the boundaries of integers and floating-point numbers.
- Operator Handling: Differentiating between single-character and multi-character operators.
Token Type System Design
Based on the TinyMoonbit syntax specification, we classify all possible symbols into the following token types:
pub enum Token {
  Bool(Bool)        // Boolean values: true, false
  Int(Int)          // Integers: 1, 2, 3, ...
  Double(Double)    // Floating-point numbers: 1.0, 2.5, 3.14, ...
  Keyword(String)   // Reserved words: let, if, while, fn, return
  Upper(String)     // Type identifiers: start with an uppercase letter, e.g., Int, Double, Bool
  Lower(String)     // Variable identifiers: start with a lowercase letter, e.g., x, y, result
  Symbol(String)    // Operators and punctuation: +, -, *, :, ;, ->
  Bracket(Char)     // Brackets: (, ), [, ], {, }
  EOF               // End-of-file marker
} derive(Show, Eq)
Leveraging Pattern Matching
Moonbit's powerful pattern matching capabilities allow us to implement the lexer in an unprecedentedly elegant way. Compared to the traditional finite state machine approach, this pattern-matching-based implementation is more intuitive and easier to understand.
Core Analysis Function
pub fn lex(code : String) -> Array[Token] {
  let tokens = Array::new()
  loop code[:] {
    // Skip whitespace characters
    [' ' | '\n' | '\r' | '\t', ..rest] => continue rest
    // Handle single-line comments
    [.."//", ..rest] =>
      continue loop rest {
        ['\n' | '\r', ..rest_str] => break rest_str
        [_, ..rest_str] => continue rest_str
        [] as rest_str => break rest_str
      }
    // Recognize multi-character operators (order is important!)
    [.."->", ..rest] => { tokens.push(Symbol("->")); continue rest }
    [.."==", ..rest] => { tokens.push(Symbol("==")); continue rest }
    [.."!=", ..rest] => { tokens.push(Symbol("!=")); continue rest }
    [.."<=", ..rest] => { tokens.push(Symbol("<=")); continue rest }
    [..">=", ..rest] => { tokens.push(Symbol(">=")); continue rest }
    // Recognize single-character operators and punctuation
    [':' | '.' | ',' | ';' | '+' | '-' | '*' |
     '/' | '%' | '>' | '<' | '=' as c, ..rest] => {
      tokens.push(Symbol("\{c}"))
      continue rest
    }
    // Recognize brackets
    ['(' | ')' | '[' | ']' | '{' | '}' as c, ..rest] => {
      tokens.push(Bracket(c))
      continue rest
    }
    // Recognize identifiers and literals
    ['a'..='z', ..] as code => {
      let (tok, rest) = lex_ident(code)
      tokens.push(tok)
      continue rest
    }
    ['A'..='Z', ..] => { ... }
    ['0'..='9', ..] => { ... }
    // Reached the end of the file
    [] => { tokens.push(EOF); break tokens }
  }
}
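To see the lexer end to end, here is a hedged test sketch (the test name and assertion are illustrative, not from the original article, and it relies on the elided uppercase and digit branches being implemented):

test "lex a simple declaration" {
  let tokens = lex("let x : Int = 5;")
  // Expected sequence, matching the token design above:
  // [Keyword("let"), Lower("x"), Symbol(":"), Upper("Int"),
  //  Symbol("="), Int(5), Symbol(";"), EOF]
  assert_eq(tokens.length(), 8)
}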
Keyword Recognition Strategy
Identifier parsing requires special handling for keyword recognition:
pub fn lex_ident(rest : @string.View) -> (Token, @string.View) {
  // Predefined keyword map
  let keyword_map = Map::from_array([
    ("let", Token::Keyword("let")),
    ("fn", Token::Keyword("fn")),
    ("if", Token::Keyword("if")),
    ("else", Token::Keyword("else")),
    ("while", Token::Keyword("while")),
    ("return", Token::Keyword("return")),
    ("extern", Token::Keyword("extern")),
    ("true", Token::Bool(true)),
    ("false", Token::Bool(false)),
  ])
  let identifier_chars = Array::new()
  let remaining = loop rest {
    ['a'..='z' | 'A'..='Z' | '0'..='9' | '_' as c, ..rest_str] => {
      identifier_chars.push(c)
      continue rest_str
    }
    _ as rest_str => break rest_str
  }
  let ident = String::from_array(identifier_chars)
  let token = keyword_map.get(ident).or_else(() => Token::Lower(ident))
  (token, remaining)
}
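A quick hedged usage sketch (illustrative, not from the article): keywords and ordinary identifiers come back as different token variants:

test "lex_ident distinguishes keywords from identifiers" {
  let (kw, _) = lex_ident("let x"[:])
  assert_eq(kw, Token::Keyword("let"))
  let (ident, _) = lex_ident("count + 1"[:])
  assert_eq(ident, Token::Lower("count"))
}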
💡 In-depth Analysis of Moonbit Syntax Features
The implementation of the lexer above fully demonstrates several outstanding advantages of Moonbit in compiler development:
-
Functional Loop Construct
loop initial_value {
  pattern1 => continue new_value1
  pattern2 => continue new_value2
  pattern3 => break final_value
}
loop is not a traditional loop structure but a functional loop:
- It accepts an initial parameter as the loop state.
- It handles different cases through pattern matching.
- continue passes the new state to the next iteration.
- break terminates the loop and returns the final value.
-
String Views and Pattern Matching
Moonbit's string pattern matching feature greatly simplifies text processing:
// Match a single character
['a', ..rest] =>                // Starts with the character 'a'
// Match a character range
['a'..='z' as c, ..rest] =>     // A lowercase letter, bound to the variable c
// Match a string literal
[.."hello", ..rest] =>          // Equivalent to ['h','e','l','l','o', ..rest]
// Match multiple possible characters
[' ' | '\t' | '\n', ..rest] =>  // Any whitespace character
-
The Importance of Pattern Matching Priority
⚠️ Important Reminder: The order of matching is crucial.
When writing pattern matching rules, you must place more specific patterns before more general ones. For example:
// ✅ Correct order
loop code[:] {
  [.."->", ..rest] => { ... }          // Match multi-character operators first
  ['-' | '>' as c, ..rest] => { ... }  // Then match single characters
}

// ❌ Incorrect order - "->" will never be matched
loop code[:] {
  ['-' | '>' as c, ..rest] => { ... }
  [.."->", ..rest] => { ... }          // This will never be executed
}
By using this pattern-matching-based approach, we not only avoid complex state machine implementations but also achieve a clearer and more maintainable code structure.
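As a further illustration of the functional loop construct discussed above, here is a minimal standalone sketch (not part of TinyMoonbit): it threads two pieces of state, a string view and an accumulator, and sums every decimal digit it encounters:

fn sum_digits(s : String) -> Int {
  loop s[:], 0 {
    // A digit: convert it via its code point and add it to the accumulator
    ['0'..='9' as c, ..rest], acc => continue rest, acc + (c.to_int() - '0'.to_int())
    // Any other character: keep scanning with the accumulator unchanged
    [_, ..rest], acc => continue rest, acc
    // End of input: the accumulator becomes the loop's value
    [], acc => break acc
  }
}

test "sum_digits" {
  assert_eq(sum_digits("a1b2c3"), 6)
}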
Chapter 3: Parsing and Abstract Syntax Tree Construction
Syntactic Analysis (or Parsing) is the second core stage of the compiler. Its task is to reorganize the sequence of tokens produced by lexical analysis into a hierarchical Abstract Syntax Tree (AST). This process not only verifies whether the program conforms to the language's grammatical rules but also provides a structured data representation for subsequent semantic analysis and code generation.
Abstract Syntax Tree Design: A Structured Representation of the Program
Before building the parser, we need to carefully design the structure of the AST. This design determines how the program's syntactic structure is represented and how subsequent compilation stages will process these structures.
1. Core Type System
First, we define the representation of the TinyMoonbit type system in the AST:
pub enum Type {
  Unit    // Unit type, represents no return value
  Bool    // Boolean type: true, false
  Int     // 32-bit signed integer
  Double  // 64-bit double-precision floating-point number
} derive(Show, Eq, ToJson)

pub fn parse_type(type_name : String) -> Type {
  match type_name {
    "Unit" => Type::Unit
    "Bool" => Type::Bool
    "Int" => Type::Int
    "Double" => Type::Double
    _ => abort("Unknown type: \{type_name}")
  }
}
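A quick sanity check, assuming the definitions above (the test itself is illustrative, not from the article):

test "parse_type" {
  assert_eq(parse_type("Int"), Type::Int)
  assert_eq(parse_type("Bool"), Type::Bool)
}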
2. Layered AST Node Design
We use a layered design to clearly represent the different abstraction levels of the program:
-
Atomic Expressions (AtomExpr)
Represent the most basic, indivisible expression units:
pub enum AtomExpr {
  Bool(Bool)                                  // Boolean literal
  Int(Int)                                    // Integer literal
  Double(Double)                              // Floating-point literal
  Var(String, mut ty~ : Type?)                // Variable reference
  Paren(Expr, mut ty~ : Type?)                // Parenthesized expression
  Call(String, Array[Expr], mut ty~ : Type?)  // Function call
} derive(Show, Eq, ToJson)
-
Compound Expressions (Expr)
More complex structures that can contain operators and multiple sub-expressions:
pub enum Expr {
  AtomExpr(AtomExpr, mut ty~ : Type?)          // Wrapper for atomic expressions
  Unary(String, Expr, mut ty~ : Type?)         // Unary operation: -, !
  Binary(String, Expr, Expr, mut ty~ : Type?)  // Binary operation: +, -, *, /, ==, !=, etc.
} derive(Show, Eq, ToJson)
-
Statements (Stmt)
Represent executable units in the program:
pub enum Stmt {
  Let(String, Type, Expr)             // Variable declaration: let x : Int = 5;
  Assign(String, Expr)                // Assignment statement: x = 10;
  If(Expr, Array[Stmt], Array[Stmt])  // Conditional branch: if-else
  While(Expr, Array[Stmt])            // Loop statement: while
  Return(Expr?)                       // Return statement: return expr;
  Expr(Expr)                          // Expression statement
} derive(Show, Eq, ToJson)
-
Top-Level Structures
Function definitions and the complete program:
pub struct Function {
  name : String                   // Function name
  params : Array[(String, Type)]  // Parameter list: [(param_name, type)]
  ret_ty : Type                   // Return type
  body : Array[Stmt]              // Sequence of statements in the function body
} derive(Show, Eq, ToJson)

// The program is defined as a map from function names to function definitions
pub type Program Map[String, Function]
Design Highlight: Mutability of Type Annotations
Notice that each expression node contains a mut ty~ : Type? field. This design allows us to fill in type information during the type-checking phase without having to rebuild the entire AST.
Recursive Descent Parsing: A Top-Down Construction Strategy
Recursive Descent is a top-down parsing method where the core idea is to write a corresponding parsing function for each grammar rule. In Moonbit, pattern matching makes the implementation of this method exceptionally elegant.
Parsing Atomic Expressions
pub fn parse_atom_expr(
  tokens : ArrayView[Token]
) -> (AtomExpr, ArrayView[Token]) raise {
  match tokens {
    // Parse literals
    [Bool(b), ..rest] => (AtomExpr::Bool(b), rest)
    [Int(i), ..rest] => (AtomExpr::Int(i), rest)
    [Double(d), ..rest] => (AtomExpr::Double(d), rest)
    // Parse function calls: func_name(arg1, arg2, ...)
    [Lower(func_name), Bracket('('), ..rest] => {
      let (args, rest) = parse_argument_list(rest)
      match rest {
        [Bracket(')'), ..remaining] =>
          (AtomExpr::Call(func_name, args, ty=None), remaining)
        _ => raise SyntaxError("Expected ')' after function arguments")
      }
    }
    // Parse variable references
    [Lower(var_name), ..rest] =>
      (AtomExpr::Var(var_name, ty=None), rest)
    // Parse parenthesized expressions: (expression)
    [Bracket('('), ..rest] => {
      let (expr, rest) = parse_expression(rest)
      match rest {
        [Bracket(')'), ..remaining] =>
          (AtomExpr::Paren(expr, ty=None), remaining)
        _ => raise SyntaxError("Expected ')' after expression")
      }
    }
    _ => raise SyntaxError("Expected atomic expression")
  }
}
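The parse_argument_list helper is referenced above but never shown. The following is a hedged sketch of what it might look like under the article's definitions (the actual implementation in TinyMoonbitLLVM may differ); it collects comma-separated expressions and stops before the closing ')', which the caller consumes and verifies:

fn parse_argument_list(
  tokens : ArrayView[Token]
) -> (Array[Expr], ArrayView[Token]) raise {
  let args = Array::new()
  loop tokens {
    // Stop before the closing ')' so the caller can check for it
    [Bracket(')'), ..] as rest => break (args, rest)
    rest => {
      let (arg, after_arg) = parse_expression(rest)
      args.push(arg)
      match after_arg {
        // A comma means another argument follows
        [Symbol(","), ..next] => continue next
        _ => break (args, after_arg)
      }
    }
  }
}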
Parsing Statements
Statement parsing needs to dispatch to different handler functions based on the starting keyword:
pub fn parse_stmt(tokens : ArrayView[Token]) -> (Stmt, ArrayView[Token]) raise {
  match tokens {
    // Parse let statements
    [Keyword("let"), Lower(var_name), Symbol(":"), ..] => { /* ... */ }
    // Parse if/while/return statements
    [Keyword("if"), ..rest] => parse_if_stmt(rest)
    [Keyword("while"), ..rest] => parse_while_stmt(rest)
    [Keyword("return"), ..rest] => { /* ... */ }
    // Parse assignment statements
    [Lower(_), Symbol("="), ..] => parse_assign_stmt(tokens)
    // Parse single expression statements
    [Lower(_), ..] => parse_single_expr_stmt(tokens)
    _ => { /* Error handling */ }
  }
}
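The let branch is elided above; here is a hedged sketch of the shape it might take (the helper name and error messages are assumed), reusing parse_expression and parse_type from earlier:

fn parse_let_stmt(tokens : ArrayView[Token]) -> (Stmt, ArrayView[Token]) raise {
  match tokens {
    // let x : Int = expr ;
    [Keyword("let"), Lower(var_name), Symbol(":"), Upper(type_name), Symbol("="), ..rest] => {
      let (expr, rest2) = parse_expression(rest)
      match rest2 {
        [Symbol(";"), ..remaining] =>
          (Stmt::Let(var_name, parse_type(type_name), expr), remaining)
        _ => raise SyntaxError("Expected ';' after let statement")
      }
    }
    _ => raise SyntaxError("Expected let statement")
  }
}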
Challenge: Handling Operator Precedence
The most complex part of expression parsing is handling operator precedence. We need to ensure that 1 + 2 * 3 is correctly parsed as 1 + (2 * 3) and not (1 + 2) * 3; a sketch of one common approach follows.
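One common solution is precedence climbing. The following is a hedged sketch under the definitions above (unary operators are omitted for brevity, and the precedence table is illustrative rather than TinyMoonbit's actual one): each operator gets a numeric precedence, and the parser only consumes operators at or above a minimum level, recursing with a strictly higher level for the right-hand side:

fn precedence(op : String) -> Int {
  match op {
    "==" | "!=" | "<" | "<=" | ">" | ">=" => 1
    "+" | "-" => 2
    "*" | "/" | "%" => 3
    _ => 0 // not a binary operator
  }
}

fn parse_binary_expr(
  tokens : ArrayView[Token],
  min_prec : Int
) -> (Expr, ArrayView[Token]) raise {
  let (atom, rest0) = parse_atom_expr(tokens)
  let mut lhs : Expr = Expr::AtomExpr(atom, ty=None)
  let mut rest = rest0
  while true {
    match rest {
      [Symbol(op), ..op_rest] => {
        let prec = precedence(op)
        if prec == 0 || prec < min_prec {
          break
        }
        // The right-hand side binds at a strictly higher minimum precedence,
        // so 1 + 2 * 3 parses as 1 + (2 * 3) and operators of equal
        // precedence associate to the left.
        let (rhs, next) = parse_binary_expr(op_rest, prec + 1)
        lhs = Expr::Binary(op, lhs, rhs, ty=None)
        rest = next
      }
      _ => break
    }
  }
  (lhs, rest)
}

Calling parse_binary_expr(tokens, 1) then parses a complete expression.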
💡 Application of Advanced Moonbit Features
Automatic Derivation Feature
pub enum Expr {
// ...
} derive(Show, Eq, ToJson)
Moonbit's derive feature automatically generates common implementations for types. Here we use three:
- Show: Provides debugging output functionality.
- Eq: Supports equality comparison.
- ToJson: Serializes to JSON format, which is convenient for debugging and persistence.
These automatically generated features are extremely useful in compiler development, especially during the debugging and testing phases.
Error Handling Mechanism
pub fn parse_expression(
  tokens : ArrayView[Token]
) -> (Expr, ArrayView[Token]) raise {
  // The 'raise' keyword indicates that this function may throw an exception
  ...
}
Moonbit's raise mechanism provides structured error handling, allowing syntax errors to be accurately located and reported.
Through this layered design and recursive descent parsing strategy, we have built a parser that is both flexible and efficient, laying a solid foundation for the subsequent type-checking phase.
Chapter 4: Type Checking and Semantic Analysis
Semantic Analysis is a crucial intermediate stage in compiler design. While parsing ensures the program's structure is correct, it doesn't mean the program is semantically valid. Type Checking, as the core component of semantic analysis, is responsible for verifying the type consistency of all operations in the program, ensuring type safety and runtime correctness.
Scope Management: Building the Environment Chain
The primary challenge in type checking is correctly handling variable scopes. At different levels of the program (global, function, block), the same variable name may refer to different entities. We adopt the classic design of an Environment Chain to solve this problem:
pub struct TypeEnv[K, V] {
  parent : TypeEnv[K, V]?  // Reference to the parent environment
  data : Map[K, V]         // Variable bindings in the current environment
}
The core of the environment chain is the variable lookup algorithm, which follows the rules of lexical scoping:
pub fn TypeEnv::get[K : Eq + Hash, V](self : TypeEnv[K, V], key : K) -> V? {
  match self.data.get(key) {
    Some(value) => Some(value)  // Found in the current environment
    None =>
      match self.parent {
        Some(parent_env) => parent_env.get(key)  // Recursively search the parent environment
        None => None  // Reached the top-level environment, variable not defined
      }
  }
}
Design Principle: Lexical Scoping
This design ensures that variable lookup follows lexical scoping rules:
- First, search in the current scope.
- If not found, recursively search in the parent scope.
- Continue until the variable is found or the global scope is reached.
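Creating a child scope is then just one allocation. Here is a hedged helper sketch (the name sub_env is assumed, not from the article): entering a function body or block builds a fresh environment whose parent pointer is the current one:

fn TypeEnv::sub_env[K : Eq + Hash, V](self : TypeEnv[K, V]) -> TypeEnv[K, V] {
  // The new scope starts empty; lookups fall back to `self` via `parent`
  { parent: Some(self), data: Map::new() }
}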
Type Checker Architecture
Environment management alone is not sufficient to complete the type-checking task. Some operations (like function calls) need to access global program information. Therefore, we design a comprehensive type checker:
pub struct TypeChecker {
  local_env : TypeEnv[String, Type]  // Local variable environment
  current_func : Function            // The function currently being checked
  program : Program                  // Complete program information
}
Type Checking for Expression Nodes
The core of the type checker is to apply the corresponding type rules to different AST nodes. The following is the implementation of expression type checking:
pub fn Expr::check_type(
  self : Self,
  env : TypeEnv[String, Type]
) -> Type raise {
  match self {
    // Type checking for atomic expressions
    AtomExpr(atom_expr, ..) as node => {
      let ty = atom_expr.check_type(env)
      node.ty = Some(ty) // Fill in the type information
      ty
    }
    // Type checking for unary operations
    Unary("-", expr, ..) as node => {
      let ty = expr.check_type(env)
      node.ty = Some(ty)
      ty
    }
    // Type checking for binary operations
    Binary(op, lhs, rhs, ..) as node => {
      let lhs_type = lhs.check_type(env)
      let rhs_type = rhs.check_type(env)
      // Ensure operand types are consistent
      guard lhs_type == rhs_type else {
        raise TypeCheckError(
          "Binary operation requires matching types, got \{lhs_type} and \{rhs_type}"
        )
      }
      let result_type = match op {
        // Comparison operators always return a boolean value
        "==" | "!=" | "<" | "<=" | ">" | ">=" => Type::Bool
        // Arithmetic operators, etc., maintain the operand type
        _ => lhs_type
      }
      node.ty = Some(result_type)
      result_type
    }
  }
}
💡 Moonbit Enum Modification Trick
During the type-checking process, we need to fill in type information for the AST nodes. Moonbit provides an elegant way to modify the mutable fields of enum variants:
pub enum Expr {
AtomExpr(AtomExpr, mut ty~ : Type?)
Unary(String, Expr, mut ty~ : Type?)
Binary(String, Expr, Expr, mut ty~ : Type?)
} derive(Show, Eq, ToJson)
By using the as binding in pattern matching, we can get a reference to the enum variant and modify its mutable fields:
match expr {
AtomExpr(atom_expr, ..) as node => {
    let ty = atom_expr.check_type(env)
node.ty = Some(ty) // Modify the mutable field
ty
}
// ...
}
This design avoids the overhead of rebuilding the entire AST while maintaining a functional programming style.
Complete Compilation Flow Demonstration
After the three stages of lexical analysis, parsing, and type checking, our compiler frontend is now able to convert source code into a fully typed abstract syntax tree. Let's demonstrate the complete process with a simple example:
Source Code Example
fn add(x : Int, y : Int) -> Int {
  return x + y;
}
Compilation Output: Typed AST
Using the derive(ToJson) feature, we can output the final AST in JSON format for inspection:
{
"functions": {
"add": {
"name": "add",
"params": [
["x", { "$tag": "Int" }],
["y", { "$tag": "Int" }]
],
"ret_ty": { "$tag": "Int" },
"body": [
{
"$tag": "Return",
"0": {
"$tag": "Binary",
"0": "+",
"1": {
"$tag": "AtomExpr",
"0": {
"$tag": "Var",
"0": "x",
"ty": { "$tag": "Int" }
},
"ty": { "$tag": "Int" }
},
"2": {
"$tag": "AtomExpr",
"0": {
"$tag": "Var",
"0": "y",
"ty": { "$tag": "Int" }
},
"ty": { "$tag": "Int" }
},
"ty": { "$tag": "Int" }
}
}
]
}
}
}
From this JSON output, we can clearly see:
- Complete Function Signature: Including the parameter list and return type.
- Type-Annotated AST Nodes: Each expression carries type information.
- Structured Program Representation: Provides a clear data structure for the subsequent code generation phase.
Conclusion
In this article, we have delved into the complete implementation process of a compiler frontend. From a stream of characters to a typed abstract syntax tree, we have witnessed the unique advantages of the Moonbit language in compiler construction:
Core Takeaways
- The Power of Pattern Matching: Moonbit's string pattern matching and structural pattern matching greatly simplify the implementation of lexical analysis and parsing.
- Functional Programming Paradigm: The combination of the loop construct, environment chains, and immutable data structures provides a solution that is both elegant and efficient.
- Expressive Type System: Through mutable fields in enums and trait objects, we can build data structures that are both type-safe and flexible.
- Engineering Features: Features like derive, structured error handling, and JSON serialization significantly improve development efficiency.
Looking Ahead to Part 2
Having mastered the implementation of the frontend, the next article will guide us into the more exciting code generation phase. We will:
- Delve into the design philosophy of LLVM Intermediate Representation.
- Explore how to use Moonbit's official llvm.mbt binding library.
- Implement the complete conversion from AST to LLVM IR.
- Generate executable RISC-V assembly code.
Building a compiler is a complex and challenging process, but as we have shown in this article, Moonbit provides powerful and elegant tools for this task. Let's continue this exciting compiler construction journey in the next part.
Recommended Resources