Skip to main content

Dancing with LLVM: A Moonbit Chronicle (Part 2) - LLVM Backend Generation

· 17 min read


Introduction

In the process of programming language design, the frontend is responsible for understanding and verifying the structure and semantics of a program, while the compiler backend takes on the task of translating these abstract concepts into executable machine code. The implementation of the backend not only requires a deep understanding of the target architecture but also mastery of complex optimization techniques to generate efficient code.

LLVM (Low Level Virtual Machine), as a comprehensive modern compiler infrastructure, provides us with a powerful and flexible solution. By converting a program into LLVM Intermediate Representation (IR), we can leverage LLVM's mature toolchain to compile the code to various target architectures, including RISC-V, ARM, and x86.

Moonbit's LLVM Ecosystem

Moonbit officially provides two important LLVM-related projects:

  • llvm.mbt: Moonbit language bindings for the original LLVM, providing direct access to the llvm-c interface. It requires the installation of the complete LLVM toolchain, can only generate for native backends, and requires you to handle compilation and linking yourself, but it can generate IR that is fully compatible with the original LLVM.
  • MoonLLVM: A pure Moonbit implementation of an LLVM-like system. It can generate LLVM IR without external dependencies and supports JavaScript and WebAssembly backends.

This article chooses llvm.mbt as our tool. Its API design is inspired by the highly acclaimed inkwell library in the Rust ecosystem.

In the previous article, "Dancing with LLVM: A Moonbit Chronicle (Part 1) - Implementing the Frontend," we completed the conversion from source code to a typed abstract syntax tree. This article will build on that achievement, focusing on the core techniques and implementation details of code generation.


Chapter 1: Representing the LLVM Type System in Moonbit

Before diving into code generation, we first need to understand how llvm.mbt represents LLVM's various concepts within Moonbit's type system. LLVM's type system is quite complex, containing multiple levels such as basic types, composite types, and function types.

Trait Objects: An Abstract Representation of Types

In the API design of llvm.mbt, you will frequently encounter the core concept of &Type. This is not a concrete struct or enum, but a Trait Object—which can be understood as the functional equivalent of an abstract base class in object-oriented programming.

// &Type is a trait object representing any LLVM type
let 
Unit
some_type
: &Type =
Unit
context
.
() -> Unit
i32_type
()

Type Identification and Conversion

To determine the specific type of a &Type, we need to perform a runtime type check using the as_type_enum interface:

pub fn 
(ty : Unit) -> String
identify_type
(
Unit
ty
: &Type) ->
String
String
{
match
Unit
ty
.
() -> Unit
as_type_enum
() {
(Unit) -> Unit
IntType
(
Unit
int_ty
) => "Integer type with \{
Unit
int_ty
.
() -> Unit
get_bit_width
()} bits"
(_/0) -> Unit
FloatType
(
_/0
float_ty
) => "Floating point type"
(_/0) -> Unit
PointerType
(
_/0
ptr_ty
) => "Pointer type"
(_/0) -> Unit
FunctionType
(
_/0
func_ty
) => "Function type"
(_/0) -> Unit
ArrayType
(
_/0
array_ty
) => "Array type"
(_/0) -> Unit
StructType
(
_/0
struct_ty
) => "Structure type"
(_/0) -> Unit
VectorType
(
_/0
vec_ty
) => "Vector type"
(_/0) -> Unit
ScalableVectorType
(
_/0
svec_ty
) => "Scalable vector type"
(_/0) -> Unit
MetadataType
(
_/0
meta_ty
) => "Metadata type"
} }

Safe Type Conversion Strategies

When we are certain that a &Type has a specific type, there are several conversion methods to choose from:

  1. Direct Conversion (for deterministic scenarios)

    let 
    Unit
    ty
    : &Type =
    Unit
    context
    .
    () -> Unit
    i32_type
    ()
    let
    ?
    i32_ty
    =
    Unit
    ty
    .
    () -> ?
    into_int_type
    () // Direct conversion, errors are handled by llvm.mbt
    let
    ?
    bit_width
    =
    ?
    i32_ty
    .
    () -> ?
    get_bit_width
    () // Call a method specific to IntType
  2. Defensive Conversion (recommended for production environments)

    let 
    Unit
    ty
    : &Type =
    () -> Unit
    get_some_type
    () // An unknown type obtained from somewhere
    guard ty.as_type_enum() is IntType(i32_ty) else { raise CodeGenError("Expected integer type, got \{ty}") } // Now it's safe to use i32_ty let
    ?
    bit_width
    =
    ?
    i32_ty
    .
    () -> ?
    get_bit_width
    ()

Constructing Composite Types

LLVM supports various composite types, which are usually constructed through methods of basic types:

pub fn 
(context : ?) -> Unit
create_composite_types
(
?
context
: @llvm.Context) ->
Unit
Unit
{
let
Unit
i32_ty
=
?
context
.
() -> Unit
i32_type
()
let
Unit
f64_ty
=
?
context
.
() -> Unit
f64_type
()
// Array type: [16 x i32] let
Unit
i32_array_ty
=
Unit
i32_ty
.
(Int) -> Unit
array_type
(16)
// Function type: i32 (i32, i32) let
Unit
add_func_ty
=
Unit
i32_ty
.
(Array[Unit]) -> Unit
fn_type
([
Unit
i32_ty
,
Unit
i32_ty
])
// Struct type: {i32, f64} let
Unit
struct_ty
=
?
context
.
(Array[Unit]) -> Unit
struct_type
([
Unit
i32_ty
,
Unit
f64_ty
])
// Pointer type (all pointers are opaque in LLVM 18+) let
Unit
ptr_ty
=
Unit
i32_ty
.
() -> Unit
ptr_type
()
// Output type information for verification
(input : String) -> Unit

Prints any value that implements the Show trait to the standard output, followed by a newline.

Parameters:

  • value : The value to be printed. Must implement the Show trait.

Example:

  println(42)
  println("Hello, World!")
  println([1, 2, 3])
println
("Array type: \{
Unit
i32_array_ty
}") // [16 x i32]
(input : String) -> Unit

Prints any value that implements the Show trait to the standard output, followed by a newline.

Parameters:

  • value : The value to be printed. Must implement the Show trait.

Example:

  println(42)
  println("Hello, World!")
  println([1, 2, 3])
println
("Function type: \{
Unit
add_func_ty
}") // i32 (i32, i32)
(input : String) -> Unit

Prints any value that implements the Show trait to the standard output, followed by a newline.

Parameters:

  • value : The value to be printed. Must implement the Show trait.

Example:

  println(42)
  println("Hello, World!")
  println([1, 2, 3])
println
("Struct type: \{
Unit
struct_ty
}") // {i32, f64}
(input : String) -> Unit

Prints any value that implements the Show trait to the standard output, followed by a newline.

Parameters:

  • value : The value to be printed. Must implement the Show trait.

Example:

  println(42)
  println("Hello, World!")
  println([1, 2, 3])
println
("Pointer type: \{
Unit
ptr_ty
}") // ptr
}

Important Reminder: Opaque Pointers

Starting with LLVM version 18, all pointer types use the opaque pointer design. This means that regardless of the type they point to, all pointers are represented as ptr in the IR, and the specific type information they point to is no longer visible in the type system.


Chapter 2: The LLVM Value System and the BasicValue Concept

Compared to the type system, LLVM's value system is more complex. llvm.mbt, consistent with inkwell, divides values into two important abstract layers: Value and BasicValue. The difference lies in distinguishing the source of value creation from the way values are used:

  • Value: Focuses on how a value is produced (e.g., constants, instruction results).
  • BasicValue: Focuses on what basic type a value has (e.g., integer, float, pointer).

Practical Application Example

pub fn 
(context : ?, builder : ?) -> Unit
demonstrate_value_system
(
?
context
: Context,
?
builder
: Builder) ->
Unit
Unit
{
let
Unit
i32_ty
=
?
context
.
() -> Unit
i32_type
()
// Create two integer constants - these are directly IntValue let
Unit
const1
=
Unit
i32_ty
.
(Int) -> Unit
const_int
(10) // Value: IntValue, BasicValue: IntValue
let
Unit
const2
=
Unit
i32_ty
.
(Int) -> Unit
const_int
(20) // Value: IntValue, BasicValue: IntValue
// Perform an addition operation - the result is an InstructionValue let
Unit
add_result
=
?
builder
.
(Unit, Unit) -> Unit
build_int_add
(
Unit
const1
,
Unit
const2
)
// In different contexts, we need different perspectives: // As an instruction to check its properties let
Unit
instruction
=
Unit
add_result
.
() -> Unit
as_instruction
()
(input : String) -> Unit

Prints any value that implements the Show trait to the standard output, followed by a newline.

Parameters:

  • value : The value to be printed. Must implement the Show trait.

Example:

  println(42)
  println("Hello, World!")
  println([1, 2, 3])
println
("Instruction opcode: \{
Unit
instruction
.
() -> Unit
get_opcode
()}")
// As a basic value to get its type let
Unit
basic_value
=
Unit
add_result
.
() -> Unit
into_basic_value
()
(input : String) -> Unit

Prints any value that implements the Show trait to the standard output, followed by a newline.

Parameters:

  • value : The value to be printed. Must implement the Show trait.

Example:

  println(42)
  println("Hello, World!")
  println([1, 2, 3])
println
("Result type: \{
Unit
basic_value
.
() -> Unit
get_type
()}")
// As an integer value for subsequent calculations let
Unit
int_value
=
Unit
add_result
.
() -> Unit
into_int_value
()
let
Unit
final_result
=
?
builder
.
(Unit, Unit) -> Unit
build_int_mul
(
Unit
int_value
,
Unit
const1
)
}

Complete Classification of Value Types

  1. ValueEnum: All possible value types

    pub enum ValueEnum {
      
    (?) -> ValueEnum
    IntValue
    (IntValue) // Integer value
    (?) -> ValueEnum
    FloatValue
    (FloatValue) // Floating-point value
    (?) -> ValueEnum
    PointerValue
    (PointerValue) // Pointer value
    (?) -> ValueEnum
    StructValue
    (StructValue) // Struct value
    (?) -> ValueEnum
    FunctionValue
    (FunctionValue) // Function value
    (?) -> ValueEnum
    ArrayValue
    (ArrayValue) // Array value
    (?) -> ValueEnum
    VectorValue
    (VectorValue) // Vector value
    (?) -> ValueEnum
    PhiValue
    (PhiValue) // Phi node value
    (?) -> ValueEnum
    ScalableVectorValue
    (ScalableVectorValue) // Scalable vector value
    (?) -> ValueEnum
    MetadataValue
    (MetadataValue) // Metadata value
    (?) -> ValueEnum
    CallSiteValue
    (CallSiteValue) // Call site value
    (?) -> ValueEnum
    GlobalValue
    (GlobalValue) // Global value
    (?) -> ValueEnum
    InstructionValue
    (InstructionValue) // Instruction value
    } derive(
    trait Show {
      output(Self, &Logger) -> Unit
      to_string(Self) -> String
    }

    Trait for types that can be converted to String

    Show
    )
  2. BasicValueEnum: Values that have a basic type

    pub enum BasicValueEnum {
      
    (?) -> BasicValueEnum
    ArrayValue
    (ArrayValue) // Array value
    (?) -> BasicValueEnum
    IntValue
    (IntValue) // Integer value
    (?) -> BasicValueEnum
    FloatValue
    (FloatValue) // Floating-point value
    (?) -> BasicValueEnum
    PointerValue
    (PointerValue) // Pointer value
    (?) -> BasicValueEnum
    StructValue
    (StructValue) // Struct value
    (?) -> BasicValueEnum
    VectorValue
    (VectorValue) // Vector value
    (?) -> BasicValueEnum
    ScalableVectorValue
    (ScalableVectorValue) // Scalable vector value
    } derive(
    trait Show {
      output(Self, &Logger) -> Unit
      to_string(Self) -> String
    }

    Trait for types that can be converted to String

    Show
    )

💡 Best Practices for Value Conversion

In the actual code generation process, we often need to convert between different value perspectives:

pub fn 
(instruction_result : Unit) -> Unit
value_conversion_patterns
(
Unit
instruction_result
: &Value) ->
Unit
Unit
{
// Pattern 1: I know what type this is, convert directly let
Unit
int_val
=
Unit
instruction_result
.
() -> Unit
into_int_value
()
// Pattern 2: I just need a basic value, I don't care about the specific type let
Unit
basic_val
=
Unit
instruction_result
.
() -> Unit
into_basic_value
()
// Pattern 3: Defensive programming, check before converting match
Unit
instruction_result
.
() -> Unit
as_value_enum
() {
// Handle integer values
(Unit) -> Unit
IntValue
(
Unit
int_val
) =>
(Unit) -> Unit
handle_integer
(
Unit
int_val
)
// Handle float values
(Unit) -> Unit
FloatValue
(
Unit
float_val
) =>
(Unit) -> Unit
handle_float
(
Unit
float_val
)
_ => raise
Error
CodeGenError
("Unexpected value type")
} }

Through this two-layer abstraction, llvm.mbt maintains the integrity of the LLVM value system while providing an intuitive and easy-to-use interface for Moonbit developers.


Chapter 3: Practical LLVM IR Generation

Now that we understand the type and value systems, let's demonstrate how to use llvm.mbt to generate LLVM IR with a complete example. This example will implement a simple muladd function, showing the entire process from initialization to instruction generation.

Infrastructure Initialization

Any LLVM program begins by establishing three core components:

pub fn 
() -> (?, ?, ?)
initialize_llvm
() -> (Context, Module, Builder) {
// 1. Create an LLVM context - a container for all LLVM objects let
?
context
=
() -> ?
@llvm.Context::create
()
// 2. Create a module - a container for functions and global variables let
?
module
=
?
context
.
(String) -> ?
create_module
("demo_module")
// 3. Create an IR builder - used to generate instructions let
?
builder
=
?
context
.
() -> ?
create_builder
()
(
?
context
,
?
module
,
?
builder
)
}

A Simple Function Generation Example

Let's implement a function that calculates (a * b) + c:

pub fn 
() -> String
generate_muladd_function
() ->
String
String
{
// Initialize LLVM infrastructure let (
?
context
,
?
module
,
?
builder
) =
() -> (?, ?, ?)
initialize_llvm
()
// Define the function signature let
Unit
i32_ty
=
?
context
.
() -> Unit
i32_type
()
let
Unit
func_type
=
Unit
i32_ty
.
(Array[Unit]) -> Unit
fn_type
([
Unit
i32_ty
,
Unit
i32_ty
,
Unit
i32_ty
])
let
Unit
func_value
=
?
module
.
(String, Unit) -> Unit
add_function
("muladd",
Unit
func_type
)
// Create the function entry basic block let
Unit
entry_block
=
?
context
.
(Unit, String) -> Unit
append_basic_block
(
Unit
func_value
, "entry")
?
builder
.
(Unit) -> Unit
position_at_end
(
Unit
entry_block
)
// Get the function parameters let
Unit
arg_a
=
Unit
func_value
.
(Int) -> Unit
get_nth_param
(0).
() -> Unit
unwrap
().
() -> Unit
into_int_value
()
let
Unit
arg_b
=
Unit
func_value
.
(Int) -> Unit
get_nth_param
(1).
() -> Unit
unwrap
().
() -> Unit
into_int_value
()
let
Unit
arg_c
=
Unit
func_value
.
(Int) -> Unit
get_nth_param
(2).
() -> Unit
unwrap
().
() -> Unit
into_int_value
()
// Generate calculation instructions let
Unit
mul_result
=
?
builder
.
(Unit, Unit) -> Unit
build_int_mul
(
Unit
arg_a
,
Unit
arg_b
).
() -> Unit
into_int_value
()
let
Unit
add_result
=
?
builder
.
(Unit, Unit) -> Unit
build_int_add
(
Unit
mul_result
,
Unit
arg_c
)
// Generate the return instruction let _ =
?
builder
.
(Unit) -> Unit
build_return
(
Unit
add_result
)
// Output the generated IR
?
module
.
() -> String
dump
()
}

Generated LLVM IR

Running the above code will produce the following LLVM Intermediate Representation:

; ModuleID = 'demo_module'
source_filename = "demo_module"

define i32 @muladd(i32 %0, i32 %1, i32 %2) {
entry:
  %3 = mul i32 %0, %1
  %4 = add i32 %3, %2
  ret i32 %4
}

💡 Code Generation Best Practices

  1. Naming Conventions

    For instructions that return a value, the build interface has a name label argument, which can be used to add a name to the result of the instruction.

    let 
    ?
    mul_result
    =
    Unit
    builder
    .
    (Unit, Unit, String) -> ?
    build_int_mul
    (
    Unit
    lhs
    ,
    Unit
    rhs
    ,
    String
    name
    ="temp_product")
    let
    ?
    final_result
    =
    Unit
    builder
    .
    (?, Unit, String) -> ?
    build_int_add
    (
    ?
    mul_result
    ,
    Unit
    offset
    ,
    String
    name
    ="final_sum")
  2. Error Handling

    Use raise instead of panic for error handling, and manage exceptions for situations that are not easy to determine directly.

    // Check for operations that might fail
    match func_value.get_nth_param(index) {
      Some(param) => param.into_int_value()
      None => raise CodeGenError("Function parameter \{index} not found")
    }
    

Chapter 4: TinyMoonbit Compiler Implementation

Now let's turn our attention to the actual compiler implementation, converting the abstract syntax tree we built in the previous article into LLVM IR.

Type Mapping: From Parser to LLVM

First, we need to establish a mapping between the TinyMoonbit type system and the LLVM type system:

pub struct CodeGen {
  
?
parser_program
: Program // AST representation of the source program
?
llvm_context
: @llvm.Context // LLVM context
?
llvm_module
: @llvm.Module // LLVM module
?
builder
: @llvm.Builder // IR builder
Map[String, ?]
llvm_functions
:
type Map[K, V]

Mutable linked hash map that maintains the order of insertion, not thread safe.

Example

  let map = { 3: "three", 8 :  "eight", 1 :  "one"}
  assert_eq(map.get(2), None)
  assert_eq(map.get(3), Some("three"))
  map.set(3, "updated")
  assert_eq(map.get(3), Some("updated"))
Map
[
String
String
, @llvm.FunctionValue] // Function map
} pub fn
(?, ?) -> Unit raise Error
convert_type
(
?
self
: Self,
?
parser_type
: Type) -> &@llvm.Type raise {
match
?
parser_type
{
Type::
?
Unit
=>
?
self
Unit
.
?
llvm_context
Unit
.
() -> Unit
void_type
Unit
() as &@llvm.Type
Type::
?
Bool
=>
?
self
.
?
llvm_context
.
() -> Unit
bool_type
()
Type::
?
Int
=>
?
self
.
?
llvm_context
.
() -> Unit
i32_type
()
Type::
?
Double
=>
?
self
.
?
llvm_context
.
() -> Unit
f64_type
()
// Can be extended with more types as needed } }

Environment Management: Mapping Variables to Values

During the code generation phase, we need to maintain a mapping from variable names to LLVM values:

pub struct Env {
  
Env?
parent
:
struct Env {
  parent: Env?
  symbols: Map[String, Unit]
  codegen: CodeGen
  parser_function: ?
  llvm_function: ?
}
Env
? // Reference to the parent environment
Map[String, Unit]
symbols
:
type Map[K, V]

Mutable linked hash map that maintains the order of insertion, not thread safe.

Example

  let map = { 3: "three", 8 :  "eight", 1 :  "one"}
  assert_eq(map.get(2), None)
  assert_eq(map.get(3), Some("three"))
  map.set(3, "updated")
  assert_eq(map.get(3), Some("updated"))
Map
[
String
String
, &@llvm.Value] // Local variable map
// Global information
CodeGen
codegen
:
struct CodeGen {
  parser_program: ?
  llvm_context: ?
  llvm_module: ?
  builder: ?
  llvm_functions: Map[String, ?]
}
CodeGen
// Reference to the code generator
?
parser_function
: Function // AST of the current function
?
llvm_function
: @llvm.FunctionValue // LLVM representation of the current function
} pub fn
(?, String) -> Unit?
get_symbol
(
?
self
: Self,
String
name
:
String
String
) -> &@llvm.Value? {
match
?
self
.
Map[String, Unit]
symbols
.
(self : Map[String, Unit], key : String) -> Unit?

Get the value associated with a key.

get
(
String
name
) {
(Unit) -> Unit?
Some
(
Unit
value
) =>
(Unit) -> Unit?
Some
(
Unit
value
)
Unit?
None
=>
match
?
self
.
Env?
parent
{
(Env) -> Env?
Some
(
Env
parent_env
) =>
Env
parent_env
.
(String) -> Unit?
get_symbol
(
String
name
)
Env?
None
=>
Unit?
None
} } }

Variable Handling: Memory Allocation Strategy

As a systems-level language, TinyMoonbit supports variable reassignment. In LLVM IR's SSA (Static Single Assignment) form, we need to use the alloca + load/store pattern to implement mutable variables:

pub fn Stmt::
(?, Env) -> Unit raise Error
emit
(
?
self
: Self,
Env
env
:
struct Env {
  parent: Env?
  symbols: Map[String, Unit]
  codegen: CodeGen
  parser_function: ?
  llvm_function: ?
}
Env
) ->
Unit
Unit
raise {
match
?
self
{
// Variable declaration: e.g., let x : Int = 5;
(String, Unit, Unit) -> ?
Let
(
String
var_name
,
Unit
var_type
,
Unit
init_expr
) => {
// Convert the type and allocate stack space let
Unit
llvm_type
=
Env
env
.
CodeGen
codegen
.
(Unit) -> Unit
convert_type
(
Unit
var_type
)
let
Unit
alloca
=
Env
env
.
CodeGen
codegen
.
?
builder
.
(Unit, String) -> Unit
build_alloca
(
Unit
llvm_type
,
String
var_name
)
// Record the allocated pointer in the symbol table
Env
env
.
Map[String, Unit]
symbols
.
(self : Map[String, Unit], key : String, value : Unit) -> Unit

Set a key-value pair into the hash map.

set
(
String
var_name
,
Unit
alloca
Unit
as &@llvm.Value
)
// Calculate the value of the initialization expression let
Unit
init_value
=
Unit
init_expr
.
(Env) -> Unit
emit
(
Env
env
).
() -> Unit
into_basic_value
()
// Store the initial value into the allocated memory let _ =
Env
env
.
CodeGen
codegen
.
?
builder
.
(Unit, Unit) -> Unit
build_store
(
Unit
alloca
,
Unit
init_value
)
} // Variable assignment: x = 10;
(Unit, Unit) -> ?
Assign
(
Unit
var_name
,
Unit
rhs_expr
) => {
// Get the memory address of the variable from the symbol table guard let
(_/0) -> Unit
Some
(
_/0
var_ptr
) =
Env
env
.
(Unit) -> Unit
get_symbol
(
Unit
var_name
) else {
raise
Error
CodeGenError
("Undefined variable: \{
Unit
var_name
}")
} // Calculate the value of the right-hand side expression let
Unit
rhs_value
=
Unit
rhs_expr
.
(Env) -> Unit
emit
(
Env
env
).
() -> Unit
into_basic_value
()
// Store the new value into the variable's memory let _ =
Env
env
.
CodeGen
codegen
.
?
builder
.
(Unit, Unit) -> Unit
build_store
(
Unit
var_ptr
,
Unit
rhs_value
)
} // Other statement types... _ => { /* Handle other statements */ } } }

Design Decision: Why use alloca?

In functional languages, immutable variables can be directly mapped to SSA values. However, TinyMoonbit supports variable reassignment, which conflicts with the SSA principle of "each variable is assigned only once."

The alloca + load/store pattern is the standard way to handle mutable variables:

  • alloca: Allocates memory space on the stack.
  • store: Writes a value to memory.
  • load: Reads a value from memory.

LLVM's optimization process will automatically convert simple allocas back to value form (the mem2reg optimization).

Expression Code Generation

Expression code generation is relatively straightforward, mainly involving calling the corresponding instruction-building methods based on the expression type:

fn Expr::
(?, Env) -> Unit raise Error
emit
(
?
self
: Self,
Env
env
:
struct Env {
  parent: Env?
  symbols: Map[String, Unit]
  codegen: CodeGen
  parser_function: ?
  llvm_function: ?
}
Env
) -> &@llvm.Value raise {
match
?
self
{
(Unit) -> ?
AtomExpr
(
Unit
atom_expr
, ..) =>
Unit
atom_expr
.
(Env) -> Unit
emit
(
Env
env
)
(String, Unit, _/0) -> ?
Unary
("-",
Unit
expr
,
_/0
ty
=
(_/0) -> _/1
Some
(
_/0
Int
)) => {
let
Unit
value
=
Unit
expr
.
() -> Unit
emit
().
() -> Unit
into_int_value
()
let
Unit
zero
=
Env
env
.
Unit
gen
.
Unit
llvm_ctx
.
() -> Unit
i32_type
().
() -> Unit
const_zero
()
Env
env
.
Unit
gen
.
?
builder
.
(Unit, Unit) -> Unit
build_int_sub
(
Unit
zero
,
Unit
value
)
}
(String, Unit, _/0) -> ?
Unary
("-",
Unit
expr
,
_/0
ty
=
(_/0) -> _/1
Some
(
_/0
Double
)) => {
let
Unit
value
=
Unit
expr
.
() -> Unit
emit
().
() -> Unit
into_float_value
()
Env
env
.
Unit
gen
.
?
builder
.
(Unit) -> Unit
build_float_neg
(
Unit
value
)
}
(String, Unit, Unit, _/0) -> ?
Binary
("+",
Unit
lhs
,
Unit
rhs
,
_/0
ty
=
(_/0) -> _/1
Some
(
_/0
Int
)) => {
let
Unit
lhs_val
=
Unit
lhs
.
() -> Unit
emit
().
() -> Unit
into_int_value
()
let
Unit
rhs_val
=
Unit
rhs
.
() -> Unit
emit
().
() -> Unit
into_int_value
()
Env
env
.
Unit
gen
.
?
builder
.
(Unit, Unit) -> Unit
build_int_add
(
Unit
lhs_val
,
Unit
rhs_val
)
} // ... others } }

Technical Detail: Floating-Point Negation

Note that when handling floating-point negation, we use build_float_neg instead of subtracting the operand from zero. This is because:

  1. IEEE 754 Standard: Floating-point numbers have special values (like NaN, ∞), and simple subtraction might produce incorrect results.
  2. Performance Considerations: Dedicated negation instructions are usually more efficient on modern processors.
  3. Precision Guarantee: Avoids unnecessary rounding errors.

Chapter 5: Implementation of Control Flow Instructions

Control flow is the backbone of program logic, including conditional branches and loop structures. In LLVM IR, control flow is implemented through Basic Blocks and branch instructions. Each basic block represents a sequence of instructions with no internal jumps, and blocks are connected by branch instructions.

Conditional Branches: Implementing if-else Statements

Conditional branches require creating multiple basic blocks to represent different execution paths:

fn Stmt::
(?, Env) -> Unit raise Error
emit
(
?
self
: Self,
Env
env
:
struct Env {
  parent: Env?
  symbols: Map[String, Unit]
  codegen: CodeGen
  parser_function: ?
  llvm_function: ?
}
Env
) ->
Unit
Unit
raise {
let
Unit
ctx
=
Env
env
.
Unit
gen
.
Unit
llvm_ctx
let
Unit
func
=
Env
env
.
Unit
llvm_func
let
?
builder
=
Env
env
.
Unit
gen
.
?
builder
match
?
self
{
(Unit, Unit, Unit) -> ?
If
(
Unit
cond
,
Unit
then_stmts
,
Unit
else_stmts
) => {
let
Unit
cond_val
=
Unit
cond
.
(Env) -> Unit
emit
(
Env
env
).
() -> Unit
into_int_value
()
// Create three basic blocks let
Unit
then_block
=
Unit
ctx
.
(Unit) -> Unit
append_basic_block
(
Unit
llvm_func
)
let
Unit
else_block
=
Unit
ctx
.
(Unit) -> Unit
append_basic_block
(
Unit
llvm_func
)
let
Unit
merge_block
=
Unit
ctx
.
(Unit) -> Unit
append_basic_block
(
Unit
llvm_func
)
// Create the jump instruction let _ =
?
builder
.
(Unit, Unit, Unit) -> Unit
build_conditional_branch
(
Unit
cond_val
,
Unit
then_block
,
Unit
else_block
,
) // Generate code for the then_block
?
builder
.
(Unit) -> Unit
position_at_end
(
Unit
then_block
)
let
Unit
then_env
=
?
self
.
() -> Unit
subenv
()
Unit
then_stmts
.
((Unit) -> Unit) -> Unit
each
(
Unit
s
=>
Unit
s
.
(Unit) -> Unit
emitStmt
(
Unit
then_env
))
let _ =
?
builder
.
(Unit) -> Unit
build_unconditional_branch
(
Unit
merge_block
)
// Generate code for the else_block
?
builder
.
(Unit) -> Unit
position_at_end
(
Unit
else_block
)
let
Unit
else_env
=
?
self
.
() -> Unit
subenv
()
Unit
else_stmts
.
((Unit) -> Unit) -> Unit
each
(
Unit
s
=>
Unit
s
.
(Unit) -> Unit
emitStmt
(
Unit
else_env
))
let _ =
?
builder
.
(Unit) -> Unit
build_unconditional_branch
(
Unit
merge_block
)
// After code generation is complete, the builder's position should be on the merge_block
?
builder
.
(Unit) -> Unit
position_at_end
(
Unit
merge_block
)
} // ... } }

Generated LLVM IR Example

For the following TinyMoonbit code:

if x > 0 {
  y = x + 1;
} else {
  y = x - 1;
}

It will generate LLVM IR similar to this:

  %1 = load i32, ptr %x, align 4
  %2 = icmp sgt i32 %1, 0
  br i1 %2, label %if.then, label %if.else

if.then:                                          ; preds = %0
  %3 = load i32, ptr %x, align 4
  %4 = add i32 %3, 1
  store i32 %4, ptr %y, align 4
  br label %if.end

if.else:                                          ; preds = %0
  %5 = load i32, ptr %x, align 4
  %6 = sub i32 %5, 1
  store i32 %6, ptr %y, align 4
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  ; Subsequent code...

Loop Structures: Implementing while Statements

The implementation of loops requires special attention to the correct connection of the condition check and the loop body:

fn Stmt::
(?, Env) -> Unit raise Error
emit
(
?
self
: Self,
Env
env
:
struct Env {
  parent: Env?
  symbols: Map[String, Unit]
  codegen: CodeGen
  parser_function: ?
  llvm_function: ?
}
Env
) ->
Unit
Unit
raise {
let
Unit
ctx
=
Env
env
.
Unit
gen
.
Unit
llvm_ctx
let
Unit
func
=
Env
env
.
Unit
llvm_func
let
?
builder
=
Env
env
.
Unit
gen
.
?
builder
match
?
self
{
(Unit, Unit) -> ?
While
(
Unit
cond
,
Unit
body
) => {
// Generate three blocks let
Unit
cond_block
=
Unit
ctx
.
(Unit) -> Unit
append_basic_block
(
Unit
llvm_func
)
let
Unit
body_block
=
Unit
ctx
.
(Unit) -> Unit
append_basic_block
(
Unit
llvm_func
)
let
Unit
merge_block
=
Unit
ctx
.
(Unit) -> Unit
append_basic_block
(
Unit
llvm_func
)
// First, unconditionally jump to the cond block let _ =
?
builder
.
(Unit) -> Unit
build_unconditional_branch
(
Unit
cond_block
)
?
builder
.
(Unit) -> Unit
position_at_end
(
Unit
cond_block
)
// Generate code within the cond block, as well as the conditional jump instruction let
Unit
cond_val
=
Unit
cond
.
() -> Unit
emit
().
() -> Unit
into_int_value
()
let _ =
?
builder
.
(Unit, Unit, Unit) -> Unit
build_conditional_branch
(
Unit
cond_val
,
Unit
body_block
,
Unit
merge_block
,
)
?
builder
.
(Unit) -> Unit
position_at_end
(
Unit
body_block
)
// Generate code for the body block, with an unconditional jump to the cond block at the end let
Unit
body_env
=
?
self
.
() -> Unit
subenv
()
Unit
body
.
((Unit) -> Unit) -> Unit
each
(
Unit
s
=>
Unit
s
.
(Unit) -> Unit
emitStmt
(
Unit
body_env
))
let _ =
?
builder
.
(Unit) -> Unit
build_unconditional_branch
(
Unit
cond_block
)
// After code generation is finished, jump to the merge block
?
builder
.
(Unit) -> Unit
position_at_end
(
Unit
merge_block
)
} // ... } }

Generated LLVM IR Example

For the TinyMoonbit code:

while i < 10 {
  i = i + 1;
}

It will generate:

  br label %while.cond

while.cond:                                       ; preds = %while.body, %0
  %1 = load i32, ptr %i, align 4
  %2 = icmp slt i32 %1, 10
  br i1 %2, label %while.body, label %while.end

while.body:                                       ; preds = %while.cond
  %3 = load i32, ptr %i, align 4
  %4 = add i32 %3, 1
  store i32 %4, ptr %i, align 4
  br label %while.cond

while.end:                                        ; preds = %while.cond
  ; Subsequent code...

💡 Control Flow Design Points

  1. Basic Block Naming Strategy

    The append_basic_block function also has a name label argument.

    // Use descriptive block names for easier debugging and understanding
    let 
    ?
    then_block
    =
    Unit
    context
    .
    (Unit, String) -> ?
    append_basic_block
    (
    Unit
    func
    ,
    String
    name
    ="if.then")
    let
    ?
    else_block
    =
    Unit
    context
    .
    (Unit, String) -> ?
    append_basic_block
    (
    Unit
    func
    ,
    String
    name
    ="if.else")
    let
    ?
    merge_block
    =
    Unit
    context
    .
    (Unit, String) -> ?
    append_basic_block
    (
    Unit
    func
    ,
    String
    name
    ="if.end")
  2. Scope Management

    // Create a separate scope for each branch and loop body
    let 
    ?
    branch_env
    =
    Unit
    env
    .
    () -> ?
    sub_env
    ()
    branch_stmts.each( stmt => stmt.emit(branch_env) }
  3. Builder Position Management

    At the end, be sure to place the instruction builder on the correct basic block.

    // Always ensure the builder points to the correct basic block
    builder.position_at_end(merge_block)
    // Generate instructions in this block...
    

Chapter 6: From LLVM IR to Machine Code

After generating the complete LLVM IR, we need to convert it into assembly code for the target machine. Although llvm.mbt provides a complete target machine configuration API, for learning purposes, we can use a simpler method.

Compiling with the llc Toolchain

The most direct method is to output the generated LLVM IR to a file and then use the LLVM toolchain to compile it:

Call the dump function of the Module, or you can use the println function.

let 
CodeGen
gen
:
struct CodeGen {
  parser_program: ?
  llvm_context: ?
  llvm_module: ?
  builder: ?
  llvm_functions: Map[String, ?]
}
CodeGen
= ...
let
?
prog
=
CodeGen
gen
.
?
llvm_prog
prog.dump() // dump is recommended as it will be slightly faster than println, with the same effect // or println(prog)

Complete Compilation Flow Example

Let's look at a complete compilation flow from source code to assembly code:

  1. TinyMoonbit Source Code

    fn 
    (n : Int) -> Int
    factorial
    (
    Int
    n
    :
    Int
    Int
    ) ->
    Int
    Int
    {
    if
    Int
    n
    (self_ : Int, other : Int) -> Bool
    <=
    1 {
    return 1; } return
    Int
    n
    (self : Int, other : Int) -> Int

    Multiplies two 32-bit integers. This is the implementation of the * operator for Int.

    Parameters:

    • self : The first integer operand.
    • other : The second integer operand.

    Returns the product of the two integers. If the result overflows the range of Int, it wraps around according to two's complement arithmetic.

    Example:

      inspect(42 * 2, content="84")
      inspect(-10 * 3, content="-30")
      let max = 2147483647 // Int.max_value
      inspect(max * 2, content="-2") // Overflow wraps around
    *
    (n : Int) -> Int
    factorial
    (
    Int
    n
    (self : Int, other : Int) -> Int

    Performs subtraction between two 32-bit integers, following standard two's complement arithmetic rules. When the result overflows or underflows, it wraps around within the 32-bit integer range.

    Parameters:

    • self : The minuend (the number being subtracted from).
    • other : The subtrahend (the number to subtract).

    Returns the difference between self and other.

    Example:

      let a = 42
      let b = 10
      inspect(a - b, content="32")
      let max = 2147483647 // Int maximum value
      inspect(max - -1, content="-2147483648") // Overflow case
    -
    1);
    } fn main() -> Unit { let
    Int
    result
    :
    Int
    Int
    =
    (n : Int) -> Int
    factorial
    (5);
    (Int) -> Unit
    print_int
    (
    Int
    result
    );
    }
  2. Generated LLVM IR

    ; ModuleID = 'tinymoonbit'
    source_filename = "tinymoonbit"
    
    define i32 @factorial(i32 %0) {
    entry:
      %1 = alloca i32, align 4
      store i32 %0, ptr %1, align 4
      %2 = load i32, ptr %1, align 4
      %3 = icmp sle i32 %2, 1
      br i1 %3, label %4, label %6
    
    4:                                                ; preds = %entry
      ret i32 1
    
    6:                                                ; preds = %entry
      %7 = load i32, ptr %1, align 4
      %8 = load i32, ptr %1, align 4
      %9 = sub i32 %8, 1
      %10 = call i32 @factorial(i32 %9)
      %11 = mul i32 %7, %10
      ret i32 %11
    }
    
    define void @main() {
    entry:
      %0 = alloca i32, align 4
      %1 = call i32 @factorial(i32 5)
      store i32 %1, ptr %0, align 4
      %2 = load i32, ptr %0, align 4
      call void @print_int(i32 %2)
      ret void
    }
    
    declare void @print_int(i32 %0)
    
  3. Generating RISC-V Assembly with llc

    # Generate llvm ir
    moon run main --target native > fact.ll
    
    # Generate RISC-V 64-bit assembly code
    llc -march=riscv64 -mattr=+m -o fact.s fact.ll
    
  4. Generated RISC-V Assembly Snippet

    factorial:
    .Lfunc_begin0:
    	.cfi_startproc
    	addi	sp, sp, -32
    	.cfi_def_cfa_offset 32
    	sd	ra, 24(sp)
    	.cfi_offset ra, -8
    	sd	s0, 16(sp)
    	.cfi_offset s0, -16
    	addi	s0, sp, 32
    	.cfi_def_cfa s0, 0
    	sw	a0, -20(s0)
    	lw	a0, -20(s0)
    	li	a1, 1
    	blt	a1, a0, .LBB0_2
    	li	a0, 1
    	j	.LBB0_3
    .LBB0_2:
    	lw	a0, -20(s0)
    	lw	a1, -20(s0)
    	addi	a1, a1, -1
    	sw	a0, -24(s0)
    	mv	a0, a1
    	call	factorial
    	lw	a1, -24(s0)
    	mul	a0, a1, a0
    .LBB0_3:
    	ld	ra, 24(sp)
    	ld	s0, 16(sp)
    	addi	sp, sp, 32
    	ret
    

Conclusion

Through this two-part series, we have completed a fully functional, albeit simple, compiler implementation. From the lexical analysis of a character stream to the construction of an abstract syntax tree, and finally to the generation of LLVM IR and machine code output.

Review

Part 1:

  • An elegant lexer based on pattern matching
  • Implementation of a recursive descent parser
  • A complete type-checking system
  • Scope management with an environment chain

Part 2:

  • A deep dive into the LLVM type and value systems
  • Variable management strategies in SSA form
  • Correct implementation of control flow instructions
  • A complete code generation pipeline

Moonbit's Advantages in Compiler Development

Through this practical project, we have gained a deep appreciation for Moonbit's unique value in the field of compiler construction:

  1. Expressive Pattern Matching: Greatly simplifies the complexity of AST processing and type analysis.
  2. Functional Programming Paradigm: Immutable data structures and pure functions make the compiler logic clearer and more reliable.
  3. Modern Type System: Trait objects, generics, and error handling mechanisms provide ample abstraction capabilities.
  4. Excellent Engineering Features: Features like derive and JSON serialization significantly improve development efficiency.

Final Words

Compiler technology represents the perfect combination of computer science theory and engineering practice. With a modern tool like Moonbit, we can explore this ancient yet vibrant field in a more elegant and efficient way.

We hope this series of articles will provide readers with a powerful aid on their journey into compiler design.

Recommended Learning Resources