
Dancing with LLVM: A Moonbit Chronicle (Part 1) - Implementing the Frontend

· 16 min read


Introduction​

Programming language design and compiler implementation have long been considered among the most challenging topics in computer science. The traditional path to learning compilers often requires students to first master a complex set of theoretical foundations:

  • Automata Theory: Finite state machines and regular expressions
  • Type Theory: The mathematical underpinnings of λ-calculus and type systems
  • Computer Architecture: Low-level implementation from assembly language to machine code

However, Moonbit, a functional programming language designed for the modern development landscape, offers a fresh perspective. It not only features a rigorous type system and exceptional memory safety guarantees but, more importantly, its rich syntax and toolchain tailored for the AI era make it an ideal choice for learning and implementing compilers.

Series Overview

This series of articles will delve into the core concepts and best practices of modern compiler implementation by building a compiler for a small programming language called TinyMoonbit.

  • Part 1: Focuses on the implementation of the language frontend, including lexical analysis, parsing, and type checking, ultimately generating an abstract syntax tree with complete type annotations.
  • Part 2: Dives into the code generation phase, utilizing Moonbit's official llvm.mbt binding library to convert the abstract syntax tree into LLVM intermediate representation and finally generate RISC-V assembly code.

TinyMoonbit Language Design​

TinyMoonbit is a systems-level programming language with an abstraction level comparable to C. Although its syntax heavily borrows from Moonbit, TinyMoonbit is not a subset of the Moonbit language. Instead, it is a simplified version designed to test the feature completeness of llvm.mbt while also serving an educational purpose.

Note: Due to space constraints, the TinyMoonbit implementation discussed in this series is simpler than the actual TinyMoonbit. For the complete version, please refer to TinyMoonbitLLVM.

Core Features​

TinyMoonbit provides the fundamental features required for modern systems programming:

  • βœ… Low-level Memory Operations: Direct pointer manipulation and memory management
  • βœ… Control Flow Structures: Conditional branches, loops, and function calls
  • βœ… Type Safety: Static type checking and explicit type declarations
  • ❌ Simplified Design: To reduce implementation complexity, advanced features like type inference and closures are not supported.

Syntax Example​

Let's demonstrate TinyMoonbit's syntax with a classic implementation of the Fibonacci sequence:

extern fn print_int(x : Int) -> Unit;

// Recursive implementation of the Fibonacci sequence
fn fib(n : Int) -> Int {
  if n <= 1 {
    return n;
  }
  return fib(n - 1) + fib(n - 2);
}

fn main {
  print_int(fib(10));
}

Compilation Target​

After the complete compilation process, the above code will generate the following LLVM Intermediate Representation:

; ModuleID = 'tinymoonbit'
source_filename = "tinymoonbit"

define i32 @fib(i32 %0) {
entry:
  %1 = alloca i32, align 4
  store i32 %0, ptr %1, align 4
  %2 = load i32, ptr %1, align 4
  %3 = icmp sle i32 %2, 1
  br i1 %3, label %4, label %6

4:                                                ; preds = %entry
  %5 = load i32, ptr %1, align 4
  ret i32 %5

6:                                                ; preds = %4, %entry
  %7 = load i32, ptr %1, align 4
  %8 = sub i32 %7, 1
  %9 = call i32 @fib(i32 %8)
  %10 = load i32, ptr %1, align 4
  %11 = sub i32 %10, 2
  %12 = call i32 @fib(i32 %11)
  %13 = add i32 %9, %12
  ret i32 %13
}

define void @main() {
entry:
  %0 = call i32 @fib(i32 10)
  call void @print_int(i32 %0)
  ret void
}

declare void @print_int(i32 %0)

Chapter 2: Lexical Analysis​

Lexical Analysis is the first stage of the compilation process. Its core mission is to convert a continuous stream of characters into a sequence of meaningful tokens. This seemingly simple conversion process is, in fact, the cornerstone of the entire compiler pipeline.

From Characters to Symbols: Token Design and Implementation​

Consider the following code snippet:

let x : Int = 5;

After being processed by the lexer, it will produce the following sequence of tokens:

(Keyword "let") β†’ (Identifier "x") β†’ (Symbol ":") β†’
(Type "Int") β†’ (Operator "=") β†’ (IntLiteral 5) β†’ (Symbol ";")

This conversion process needs to handle various complex situations:

  1. Whitespace Filtering: Skipping spaces, tabs, and newlines.
  2. Keyword Recognition: Distinguishing reserved words from user-defined identifiers.
  3. Numeric Parsing: Correctly identifying the boundaries of integers and floating-point numbers.
  4. Operator Handling: Differentiating between single-character and multi-character operators.
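
Assuming the `lex` function implemented later in this chapter, the conversion can be exercised with a Moonbit test block. This snippet is illustrative and not from the original article; the expected token names follow the `Token` enum defined in the next section:

```moonbit
test "lex a let binding" {
  let tokens = lex("let x : Int = 5;")
  // Expected sequence:
  // Keyword("let"), Lower("x"), Symbol(":"), Upper("Int"),
  // Symbol("="), Int(5), Symbol(";"), EOF
  println(tokens)
}
```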

Token Type System Design​

Based on the TinyMoonbit syntax specification, we classify all possible symbols into the following token types:

pub enum Token {
  Bool(Bool)      // Boolean values: true, false
  Int(Int)        // Integers: 1, 2, 3, ...
  Double(Double)  // Floating-point numbers: 1.0, 2.5, 3.14, ...
  Keyword(String) // Reserved words: let, if, while, fn, return
  Upper(String)   // Type identifiers: start with an uppercase letter, e.g., Int, Double, Bool
  Lower(String)   // Variable identifiers: start with a lowercase letter, e.g., x, y, result
  Symbol(String)  // Operators and punctuation: +, -, *, :, ;, ->
  Bracket(Char)   // Brackets: (, ), [, ], {, }
  EOF             // End-of-file marker
} derive(Show, Eq)

Leveraging Pattern Matching​

Moonbit's powerful pattern matching capabilities allow us to implement the lexer in an unprecedentedly elegant way. Compared to the traditional finite state machine approach, this pattern-matching-based implementation is more intuitive and easier to understand.

Core Analysis Function​

pub fn lex(code : String) -> Array[Token] {
  let tokens = Array::new()
  loop code[:] {
    // Skip whitespace characters
    [' ' | '\n' | '\r' | '\t', ..rest] => continue rest
    // Handle single-line comments
    [.."//", ..rest] =>
      continue loop rest {
          ['\n' | '\r', ..rest_str] => break rest_str
          [_, ..rest_str] => continue rest_str
          [] as rest_str => break rest_str
        }
    // Recognize multi-character operators (order is important!)
    [.."->", ..rest] => { tokens.push(Symbol("->")); continue rest }
    [.."==", ..rest] => { tokens.push(Symbol("==")); continue rest }
    [.."!=", ..rest] => { tokens.push(Symbol("!=")); continue rest }
    [.."<=", ..rest] => { tokens.push(Symbol("<=")); continue rest }
    [..">=", ..rest] => { tokens.push(Symbol(">=")); continue rest }
    // Recognize single-character operators and punctuation
    [':' | '.' | ',' | ';' | '+' | '-' | '*' | '/' | '%' | '>' | '<' | '=' as c, ..rest] => {
      tokens.push(Symbol("\{c}"))
      continue rest
    }
    // Recognize brackets
    ['(' | ')' | '[' | ']' | '{' | '}' as c, ..rest] => {
      tokens.push(Bracket(c))
      continue rest
    }
    // Recognize identifiers and literals
    ['a'..='z', ..] as code => {
      let (tok, rest) = lex_ident(code)
      tokens.push(tok)
      continue rest
    }
    ['A'..='Z', ..] => { ... }
    ['0'..='9', ..] => { ... }
    // Reached the end of the file
    [] => {
      tokens.push(EOF)
      break tokens
    }
  }
}

Keyword Recognition Strategy​

Identifier parsing requires special handling for keyword recognition:

pub fn lex_ident(rest : @string.StringView) -> (Token, @string.StringView) {
  // Predefined keyword map
  let keyword_map = Map::from_array([
    ("let", Token::Keyword("let")),
    ("fn", Token::Keyword("fn")),
    ("if", Token::Keyword("if")),
    ("else", Token::Keyword("else")),
    ("while", Token::Keyword("while")),
    ("return", Token::Keyword("return")),
    ("extern", Token::Keyword("extern")),
    ("true", Token::Bool(true)),
    ("false", Token::Bool(false)),
  ])
  let identifier_chars = Array::new()
  let remaining = loop rest {
    ['a'..='z' | 'A'..='Z' | '0'..='9' | '_' as c, ..rest_str] => {
      identifier_chars.push(c)
      continue rest_str
    }
    _ as rest_str => break rest_str
  }
  let ident = String::from_array(identifier_chars)
  let token = keyword_map.get(ident).or_else(() => Token::Lower(ident))
  (token, remaining)
}

💡 In-depth Analysis of Moonbit Syntax Features​

The implementation of the lexer above fully demonstrates several outstanding advantages of Moonbit in compiler development:

  1. Functional Loop Construct

    loop initial_value {
      pattern1 => continue new_value1
      pattern2 => continue new_value2
      pattern3 => break final_value
    }
    

    loop is not a traditional loop structure but a functional loop:

    • It accepts an initial parameter as the loop state.
    • It handles different cases through pattern matching.
    • continue passes the new state to the next iteration.
    • break terminates the loop and returns the final value.
  2. String Views and Pattern Matching

    Moonbit's string pattern matching feature greatly simplifies text processing:

    // Match a single character
    ['a', ..rest] => // Starts with the character 'a'
    
    // Match a character range
    ['a'..='z' as c, ..rest] => // A lowercase letter, bound to the variable c
    
    // Match a string literal
    [.."hello", ..rest] => // Equivalent to ['h','e','l','l','o', ..rest]
    
    // Match multiple possible characters
    [' ' | '\t' | '\n', ..rest] => // Any whitespace character
    
  3. The Importance of Pattern Matching Priority

    ⚠️ Important Reminder: The order of matching is crucial.

    When writing pattern matching rules, you must place more specific patterns before more general ones. For example:

    // βœ… Correct order
    loop code[:] {
      [.."->", ..rest] => { ... }     // Match multi-character operators first
      ['-' | '>' as c, ..rest] => { ... }  // Then match single characters
    }
    
    // ❌ Incorrect order - "->" will never be matched
    loop code[:] {
      ['-' | '>' as c, ..rest] => { ... }
      [.."->", ..rest] => { ... }     // This will never be executed
    }
    

By using this pattern-matching-based approach, we not only avoid complex state machine implementations but also achieve a clearer and more maintainable code structure.
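
To make the `loop` semantics concrete, here is a small, self-contained sketch (not from the original article) that sums the integers from 1 to n by threading the state `(i, acc)` through each iteration:

```moonbit
fn sum_to(n : Int) -> Int {
  loop (n, 0) {
    (0, acc) => break acc                 // terminal state: return the accumulator
    (i, acc) => continue (i - 1, acc + i) // pass the new state to the next iteration
  }
}

test "sum_to" {
  inspect(sum_to(10), content="55")
}
```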


Chapter 3: Parsing and Abstract Syntax Tree Construction​

Syntactic Analysis (or Parsing) is the second core stage of the compiler. Its task is to reorganize the sequence of tokens produced by lexical analysis into a hierarchical Abstract Syntax Tree (AST). This process not only verifies whether the program conforms to the language's grammatical rules but also provides a structured data representation for subsequent semantic analysis and code generation.

Abstract Syntax Tree Design: A Structured Representation of the Program​

Before building the parser, we need to carefully design the structure of the AST. This design determines how the program's syntactic structure is represented and how subsequent compilation stages will process these structures.

1. Core Type System​

First, we define the representation of the TinyMoonbit type system in the AST:

pub enum Type {
  Unit   // Unit type, represents no return value
  Bool   // Boolean type: true, false
  Int    // 32-bit signed integer
  Double // 64-bit double-precision floating-point number
} derive(Show, Eq, ToJson)

pub fn parse_type(type_name : String) -> Type {
  match type_name {
    "Unit" => Type::Unit
    "Bool" => Type::Bool
    "Int" => Type::Int
    "Double" => Type::Double
    _ => abort("Unknown type: \{type_name}")
  }
}

2. Layered AST Node Design​

We use a layered design to clearly represent the different abstraction levels of the program:

  1. Atomic Expressions (AtomExpr) represent the most basic, indivisible expression units:

    pub enum AtomExpr {
      Bool(Bool)                                 // Boolean literal
      Int(Int)                                   // Integer literal
      Double(Double)                             // Floating-point literal
      Var(String, mut ty~ : Type?)               // Variable reference
      Paren(Expr, mut ty~ : Type?)               // Parenthesized expression
      Call(String, Array[Expr], mut ty~ : Type?) // Function call
    } derive(Show, Eq, ToJson)
  2. Compound Expressions (Expr) are more complex structures that can contain operators and multiple sub-expressions:

    pub enum Expr {
      AtomExpr(AtomExpr, mut ty~ : Type?)         // Wrapper for atomic expressions
      Unary(String, Expr, mut ty~ : Type?)        // Unary operation: -, !
      Binary(String, Expr, Expr, mut ty~ : Type?) // Binary operation: +, -, *, /, ==, !=, etc.
    } derive(Show, Eq, ToJson)
  3. Statements (Stmt) represent the executable units of the program:

    pub enum Stmt {
      Let(String, Type, Expr)            // Variable declaration: let x : Int = 5;
      Assign(String, Expr)               // Assignment statement: x = 10;
      If(Expr, Array[Stmt], Array[Stmt]) // Conditional branch: if-else
      While(Expr, Array[Stmt])           // Loop statement: while
      Return(Expr?)                      // Return statement: return expr;
      Expr(Expr)                         // Expression statement
    } derive(Show, Eq, ToJson)
  4. Top-Level Structures comprise function definitions and the complete program:

    pub struct Function {
      name : String                  // Function name
      params : Array[(String, Type)] // Parameter list: [(param_name, type)]
      ret_ty : Type                  // Return type
      body : Array[Stmt]             // Sequence of statements in the function body
    } derive(Show, Eq, ToJson)

    // The program is defined as a map from function names to function definitions
    pub type Program Map[String, Function]

Design Highlight: Mutability of Type Annotations

Notice that each expression node contains a mut ty~ : Type? field. This design allows us to fill in type information during the type-checking phase without having to rebuild the entire AST.
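
As a concrete illustration of these node types, the statement `let x : Int = 5;` would be represented by the following AST value (constructed by hand here for illustration; it is not taken from the original article):

```moonbit
// AST for: let x : Int = 5;
let stmt : Stmt = Stmt::Let(
  "x",
  Type::Int,
  Expr::AtomExpr(AtomExpr::Int(5), ty=None),
)
```

After type checking, the `ty=None` annotation would be filled in with `Some(Type::Int)`.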

Recursive Descent Parsing: A Top-Down Construction Strategy​

Recursive Descent is a top-down parsing method where the core idea is to write a corresponding parsing function for each grammar rule. In Moonbit, pattern matching makes the implementation of this method exceptionally elegant.
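
For example, the grammar rule behind `parse_atom_expr` below can be written out as EBNF; each alternative corresponds to one branch of that function's pattern match. This grammar is a reconstruction for illustration, not from the original article:

```
atom_expr ::= "true" | "false"       // Boolean literal
            | INT | DOUBLE           // numeric literals
            | LOWER "(" arg_list ")" // function call
            | LOWER                  // variable reference
            | "(" expr ")"           // parenthesized expression
```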

Parsing Atomic Expressions​

pub fn parse_atom_expr(
  tokens : ArrayView[Token]
) -> (AtomExpr, ArrayView[Token]) raise {
  match tokens {
    // Parse literals
    [Bool(b), ..rest] => (AtomExpr::Bool(b), rest)
    [Int(i), ..rest] => (AtomExpr::Int(i), rest)
    [Double(d), ..rest] => (AtomExpr::Double(d), rest)
    // Parse function calls: func_name(arg1, arg2, ...)
    [Lower(func_name), Bracket('('), ..rest] => {
      let (args, rest) = parse_argument_list(rest)
      match rest {
        [Bracket(')'), ..remaining] =>
          (AtomExpr::Call(func_name, args, ty=None), remaining)
        _ => raise SyntaxError("Expected ')' after function arguments")
      }
    }
    // Parse variable references
    [Lower(var_name), ..rest] => (AtomExpr::Var(var_name, ty=None), rest)
    // Parse parenthesized expressions: (expression)
    [Bracket('('), ..rest] => {
      let (expr, rest) = parse_expression(rest)
      match rest {
        [Bracket(')'), ..remaining] =>
          (AtomExpr::Paren(expr, ty=None), remaining)
        _ => raise SyntaxError("Expected ')' after expression")
      }
    }
    _ => raise SyntaxError("Expected atomic expression")
  }
}

Parsing Statements​

Statement parsing needs to dispatch to different handler functions based on the starting keyword:

pub fn parse_stmt(
  tokens : ArrayView[Token]
) -> (Stmt, ArrayView[Token]) raise {
  match tokens {
    // Parse let statements
    [Keyword("let"), Lower(var_name), Symbol(":"), ..] => { /* ... */ }
    // Parse if/while/return statements
    [Keyword("if"), .. rest] => parse_if_stmt(rest)
    [Keyword("while"), .. rest] => parse_while_stmt(rest)
    [Keyword("return"), .. rest] => { /* ... */ }
    // Parse assignment statements
    [Lower(_), Symbol("="), ..] => parse_assign_stmt(tokens)
    // Parse single expression statements
    [Lower(_), ..] => parse_single_expr_stmt(tokens)
    _ => { /* Error handling */ }
  }
}

Challenge: Handling Operator Precedence

The most complex part of expression parsing is handling operator precedence. We need to ensure that 1 + 2 * 3 is correctly parsed as 1 + (2 * 3) and not (1 + 2) * 3.
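The article's own precedence handling lives in the full TinyMoonbit sources; as a language-agnostic illustration of the idea, here is a minimal precedence-climbing sketch in Python. The token shape, the `PRECEDENCE` table, and the tuple-based AST are all hypothetical simplifications, not TinyMoonbit's actual API:

```python
# Hypothetical sketch of precedence climbing over an already-tokenized input.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_binary(tokens, pos, min_prec=1):
    """Parse tokens[pos:] into a nested tuple AST, honoring precedence."""
    lhs = tokens[pos]  # assume an atom (an int literal in this sketch)
    pos += 1
    while pos < len(tokens) and tokens[pos] in PRECEDENCE \
            and PRECEDENCE[tokens[pos]] >= min_prec:
        op = tokens[pos]
        # The right operand may only absorb operators that bind tighter
        # than `op`, which is exactly what makes 2 * 3 group before +.
        rhs, pos = parse_binary(tokens, pos + 1, PRECEDENCE[op] + 1)
        lhs = (op, lhs, rhs)
    return lhs, pos

ast, _ = parse_binary([1, "+", 2, "*", 3], 0)
print(ast)  # ('+', 1, ('*', 2, 3)) — not ('*', ('+', 1, 2), 3)
```

Raising `min_prec` to `PRECEDENCE[op] + 1` in the recursive call also gives left associativity for operators of equal precedence.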

πŸ’‘ Application of Advanced Moonbit Features​

Automatic Derivation Feature​

pub enum Expr {
  // ...
} derive(Show, Eq, ToJson)

Moonbit's derive feature automatically generates common implementations for types. Here we use three:

  • Show: Provides debugging output functionality.
  • Eq: Supports equality comparison.
  • ToJson: Serializes to JSON format, which is convenient for debugging and persistence.

These automatically generated features are extremely useful in compiler development, especially during the debugging and testing phases.

Error Handling Mechanism​

pub fn parse_expression(
  tokens : ArrayView[Token]
) -> (Expr, ArrayView[Token]) raise {
  // The 'raise' keyword indicates that this function may throw an exception
}

Moonbit's raise mechanism provides structured error handling, allowing syntax errors to be accurately located and reported.

Through this layered design and recursive descent parsing strategy, we have built a parser that is both flexible and efficient, laying a solid foundation for the subsequent type-checking phase.


Chapter 4: Type Checking and Semantic Analysis​

Semantic Analysis is a crucial intermediate stage in compiler design. While parsing ensures the program's structure is correct, it doesn't mean the program is semantically valid. Type Checking, as the core component of semantic analysis, is responsible for verifying the type consistency of all operations in the program, ensuring type safety and runtime correctness.

Scope Management: Building the Environment Chain​

The primary challenge in type checking is correctly handling variable scopes. At different levels of the program (global, function, block), the same variable name may refer to different entities. We adopt the classic design of an Environment Chain to solve this problem:

pub struct TypeEnv[K, V] {
  parent : TypeEnv[K, V]? // Reference to the parent environment
  data : Map[K, V]        // Variable bindings in the current environment
}

The core of the environment chain is the variable lookup algorithm, which follows the rules of lexical scoping:

pub fn TypeEnv::get[K : Eq + Hash, V](self : Self[K, V], key : K) -> V? {
  match self.data.get(key) {
    Some(value) => Some(value) // Found in the current environment
    None =>
      match self.parent {
        Some(parent_env) => parent_env.get(key) // Recursively search the parent environment
        None => None // Reached the top-level environment; the variable is not defined
      }
  }
}

Design Principle: Lexical Scoping

This design ensures that variable lookup follows lexical scoping rules:

  1. First, search in the current scope.
  2. If not found, recursively search in the parent scope.
  3. Continue until the variable is found or the global scope is reached.
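The three lookup rules above translate almost mechanically into any language. As an illustration, here is a minimal Python sketch of the same environment chain; the class and field names mirror the article's `TypeEnv`, but the dictionary-based implementation is otherwise illustrative:

```python
# Minimal sketch of a lexically scoped environment chain.
class TypeEnv:
    def __init__(self, parent=None):
        self.parent = parent  # enclosing scope, or None at the top level
        self.data = {}        # bindings declared in this scope

    def get(self, key):
        if key in self.data:          # 1. search the current scope first
            return self.data[key]
        if self.parent is not None:   # 2. then the parent, recursively
            return self.parent.get(key)
        return None                   # 3. reached the top: undefined

global_env = TypeEnv()
global_env.data["x"] = "Int"
block_env = TypeEnv(parent=global_env)
block_env.data["x"] = "Double"  # shadows the outer x

print(block_env.get("x"))  # Double — the innermost binding wins
print(block_env.get("y"))  # None — not defined in any scope
```

Shadowing falls out for free: an inner binding simply intercepts the lookup before the chain reaches the outer one.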

Type Checker Architecture​

Environment management alone is not sufficient to complete the type-checking task. Some operations (like function calls) need to access global program information. Therefore, we design a comprehensive type checker:

pub struct TypeChecker {
  local_env : TypeEnv[String, Type] // Local variable environment
  current_func : Function           // The function currently being checked
  program : Program                 // Complete program information
}

Implementing Type Checking for Selected Nodes​

The core of the type checker is to apply the corresponding type rules to different AST nodes. The following is the implementation of expression type checking:

pub fn Expr::check_type(
  self : Self,
  env : TypeEnv[String, Type]
) -> Type raise {
  match self {
    // Type checking for atomic expressions
    AtomExpr(atom_expr, ..) as node => {
      let ty = atom_expr.check_type(env)
      node.ty = Some(ty) // Fill in the type information
      ty
    }
    // Type checking for unary operations
    Unary("-", expr, ..) as node => {
      let ty = expr.check_type(env)
      node.ty = Some(ty)
      ty
    }
    // Type checking for binary operations
    Binary(op, lhs, rhs, ..) as node => {
      let lhs_type = lhs.check_type(env)
      let rhs_type = rhs.check_type(env)
      // Ensure operand types are consistent
      guard lhs_type == rhs_type else {
        raise TypeCheckError(
          "Binary operation requires matching types, got \{lhs_type} and \{rhs_type}"
        )
      }
      let result_type = match op {
        // Comparison operators always return a boolean value
        "==" | "!=" | "<" | "<=" | ">" | ">=" => Type::Bool
        // Arithmetic operators, etc., maintain the operand type
        _ => lhs_type
      }
      node.ty = Some(result_type)
      result_type
    }
  }
}

πŸ’‘ Moonbit Enum Modification Trick

During the type-checking process, we need to fill in type information for the AST nodes. Moonbit provides an elegant way to modify the mutable fields of enum variants:

pub enum Expr {
  AtomExpr(AtomExpr, mut ty~ : Type?)
  Unary(String, Expr, mut ty~ : Type?)
  Binary(String, Expr, Expr, mut ty~ : Type?)
} derive(Show, Eq, ToJson)

By using the as binding in pattern matching, we can get a reference to the enum variant and modify its mutable fields:

match expr {
  AtomExpr(atom_expr, ..) as node => {
    let ty = atom_expr.check_type(env)
    node.ty = Some(ty) // Modify the mutable field ty
  }
  // ...
}

This design avoids the overhead of rebuilding the entire AST while maintaining a functional programming style.
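The same annotate-in-place pattern can be mimicked in other languages; here is a small Python sketch for contrast, where the node shape and the `check_var` helper are hypothetical stand-ins for the article's AST and checker:

```python
# Illustrative contrast: filling in a node's type field in place
# instead of rebuilding the node with the new information.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Var:
    name: str
    ty: Optional[str] = None  # left empty by the parser, filled by the checker

def check_var(node, env):
    ty = env[node.name]
    node.ty = ty  # mutate the one field; the rest of the tree is untouched
    return ty

v = Var("x")
check_var(v, {"x": "Int"})
print(v.ty)  # Int
```

For a deep AST this saves allocating a fresh copy of every ancestor node on each annotation pass.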


Complete Compilation Flow Demonstration​

After the three stages of lexical analysis, parsing, and type checking, our compiler frontend is now able to convert source code into a fully typed abstract syntax tree. Let's demonstrate the complete process with a simple example:

Source Code Example​

fn add(x : Int, y : Int) -> Int {
  return x + y;
}

Compilation Output: Typed AST​

Using the derive(ToJson) feature, we can output the final AST in JSON format for inspection:

{
  "functions": {
    "add": {
      "name": "add",
      "params": [
        ["x", { "$tag": "Int" }],
        ["y", { "$tag": "Int" }]
      ],
      "ret_ty": { "$tag": "Int" },
      "body": [
        {
          "$tag": "Return",
          "0": {
            "$tag": "Binary",
            "0": "+",
            "1": {
              "$tag": "AtomExpr",
              "0": {
                "$tag": "Var",
                "0": "x",
                "ty": { "$tag": "Int" }
              },
              "ty": { "$tag": "Int" }
            },
            "2": {
              "$tag": "AtomExpr",
              "0": {
                "$tag": "Var",
                "0": "y",
                "ty": { "$tag": "Int" }
              },
              "ty": { "$tag": "Int" }
            },
            "ty": { "$tag": "Int" }
          }
        }
      ]
    }
  }
}

From this JSON output, we can clearly see:

  1. Complete Function Signature: Including the parameter list and return type.
  2. Type-Annotated AST Nodes: Each expression carries type information.
  3. Structured Program Representation: Provides a clear data structure for the subsequent code generation phase.

Conclusion​

In this article, we have delved into the complete implementation process of a compiler frontend. From a stream of characters to a typed abstract syntax tree, we have witnessed the unique advantages of the Moonbit language in compiler construction:

Core Takeaways​

  1. The Power of Pattern Matching: Moonbit's string pattern matching and structural pattern matching greatly simplify the implementation of lexical analysis and parsing.
  2. Functional Programming Paradigm: The combination of the loop construct, environment chains, and immutable data structures provides a solution that is both elegant and efficient.
  3. Expressive Type System: Through mutable fields in enums and trait objects, we can build data structures that are both type-safe and flexible.
  4. Engineering Features: Features like derive, structured error handling, and JSON serialization significantly improve development efficiency.

Looking Ahead to Part 2​

Having mastered the implementation of the frontend, the next article will guide us into the more exciting code generation phase. We will:

  • Delve into the design philosophy of LLVM Intermediate Representation.
  • Explore how to use Moonbit's official llvm.mbt binding library.
  • Implement the complete conversion from AST to LLVM IR.
  • Generate executable RISC-V assembly code.

Building a compiler is a complex and challenging process, but as we have shown in this article, Moonbit provides powerful and elegant tools for this task. Let's continue this exciting compiler construction journey in the next part.

Recommended Resources