Dancing with LLVM: A Moonbit Chronicle (Part 2) - LLVM Backend Generation
Introduction
In the process of programming language design, the frontend is responsible for understanding and verifying the structure and semantics of a program, while the compiler backend takes on the task of translating these abstract concepts into executable machine code. The implementation of the backend not only requires a deep understanding of the target architecture but also mastery of complex optimization techniques to generate efficient code.
LLVM (Low Level Virtual Machine), as a comprehensive modern compiler infrastructure, provides us with a powerful and flexible solution. By converting a program into LLVM Intermediate Representation (IR), we can leverage LLVM's mature toolchain to compile the code to various target architectures, including RISC-V, ARM, and x86.
Moonbit's LLVM Ecosystem
Moonbit officially provides two important LLVM-related projects:
llvm.mbt
: Moonbit language bindings for the original LLVM, providing direct access to the llvm-c interface. It requires the installation of the complete LLVM toolchain, can only generate for native backends, and requires you to handle compilation and linking yourself, but it can generate IR that is fully compatible with the original LLVM.MoonLLVM
: A pure Moonbit implementation of an LLVM-like system. It can generate LLVM IR without external dependencies and supports JavaScript and WebAssembly backends.This article chooses
llvm.mbt
as our tool. Its API design is inspired by the highly acclaimedinkwell
library in the Rust ecosystem.
In the previous article, "Dancing with LLVM: A Moonbit Chronicle (Part 1) - Implementing the Frontend," we completed the conversion from source code to a typed abstract syntax tree. This article will build on that achievement, focusing on the core techniques and implementation details of code generation.
Chapter 1: Representing the LLVM Type System in Moonbit
Before diving into code generation, we first need to understand how llvm.mbt
represents LLVM's various concepts within Moonbit's type system. LLVM's type system is quite complex, containing multiple levels such as basic types, composite types, and function types.
Trait Objects: An Abstract Representation of Types
In the API design of llvm.mbt
, you will frequently encounter the core concept of &Type
. This is not a concrete struct
or enum
, but a Trait Object—which can be understood as the functional equivalent of an abstract base class in object-oriented programming.
// &Type is a trait object representing any LLVM type
let Unit
some_type: &Type = Unit
context.() -> Unit
i32_type()
Type Identification and Conversion
To determine the specific type of a &Type
, we need to perform a runtime type check using the as_type_enum
interface:
pub fn (ty : Unit) -> String
identify_type(Unit
ty: &Type) -> String
String {
match Unit
ty.() -> Unit
as_type_enum() {
(Unit) -> Unit
IntType(Unit
int_ty) => "Integer type with \{Unit
int_ty.() -> Unit
get_bit_width()} bits"
(_/0) -> Unit
FloatType(_/0
float_ty) => "Floating point type"
(_/0) -> Unit
PointerType(_/0
ptr_ty) => "Pointer type"
(_/0) -> Unit
FunctionType(_/0
func_ty) => "Function type"
(_/0) -> Unit
ArrayType(_/0
array_ty) => "Array type"
(_/0) -> Unit
StructType(_/0
struct_ty) => "Structure type"
(_/0) -> Unit
VectorType(_/0
vec_ty) => "Vector type"
(_/0) -> Unit
ScalableVectorType(_/0
svec_ty) => "Scalable vector type"
(_/0) -> Unit
MetadataType(_/0
meta_ty) => "Metadata type"
}
}
Safe Type Conversion Strategies
When we are certain that a &Type
has a specific type, there are several conversion methods to choose from:
-
Direct Conversion (for deterministic scenarios)
let
ty: &Type =Unit
context.Unit
i32_type() let() -> Unit
i32_ty =?
ty.Unit
into_int_type() // Direct conversion, errors are handled by llvm.mbt let() -> ?
bit_width =?
i32_ty.?
get_bit_width() // Call a method specific to IntType() -> ?
-
Defensive Conversion (recommended for production environments)
let
ty: &Type =Unit
get_some_type() // An unknown type obtained from somewhere guard ty.as_type_enum() is IntType(i32_ty) else { raise CodeGenError("Expected integer type, got \{ty}") } // Now it's safe to use i32_ty let() -> Unit
bit_width =?
i32_ty.?
get_bit_width()() -> ?
Constructing Composite Types
LLVM supports various composite types, which are usually constructed through methods of basic types:
pub fn (context : ?) -> Unit
create_composite_types(?
context: @llvm.Context) -> Unit
Unit {
let Unit
i32_ty = ?
context.() -> Unit
i32_type()
let Unit
f64_ty = ?
context.() -> Unit
f64_type()
// Array type: [16 x i32]
let Unit
i32_array_ty = Unit
i32_ty.(Int) -> Unit
array_type(16)
// Function type: i32 (i32, i32)
let Unit
add_func_ty = Unit
i32_ty.(Array[Unit]) -> Unit
fn_type([Unit
i32_ty, Unit
i32_ty])
// Struct type: {i32, f64}
let Unit
struct_ty = ?
context.(Array[Unit]) -> Unit
struct_type([Unit
i32_ty, Unit
f64_ty])
// Pointer type (all pointers are opaque in LLVM 18+)
let Unit
ptr_ty = Unit
i32_ty.() -> Unit
ptr_type()
// Output type information for verification
(input : String) -> Unit
Prints any value that implements the Show
trait to the standard output,
followed by a newline.
Parameters:
value
: The value to be printed. Must implement the Show
trait.
Example:
println(42)
println("Hello, World!")
println([1, 2, 3])
println("Array type: \{Unit
i32_array_ty}") // [16 x i32]
(input : String) -> Unit
Prints any value that implements the Show
trait to the standard output,
followed by a newline.
Parameters:
value
: The value to be printed. Must implement the Show
trait.
Example:
println(42)
println("Hello, World!")
println([1, 2, 3])
println("Function type: \{Unit
add_func_ty}") // i32 (i32, i32)
(input : String) -> Unit
Prints any value that implements the Show
trait to the standard output,
followed by a newline.
Parameters:
value
: The value to be printed. Must implement the Show
trait.
Example:
println(42)
println("Hello, World!")
println([1, 2, 3])
println("Struct type: \{Unit
struct_ty}") // {i32, f64}
(input : String) -> Unit
Prints any value that implements the Show
trait to the standard output,
followed by a newline.
Parameters:
value
: The value to be printed. Must implement the Show
trait.
Example:
println(42)
println("Hello, World!")
println([1, 2, 3])
println("Pointer type: \{Unit
ptr_ty}") // ptr
}
Important Reminder: Opaque Pointers
Starting with LLVM version 18, all pointer types use the opaque pointer design. This means that regardless of the type they point to, all pointers are represented as
ptr
in the IR, and the specific type information they point to is no longer visible in the type system.
Chapter 2: The LLVM Value System and the BasicValue Concept
Compared to the type system, LLVM's value system is more complex. llvm.mbt
, consistent with inkwell
, divides values into two important abstract layers: Value
and BasicValue
. The difference lies in distinguishing the source of value creation from the way values are used:
- Value: Focuses on how a value is produced (e.g., constants, instruction results).
- BasicValue: Focuses on what basic type a value has (e.g., integer, float, pointer).
Practical Application Example
pub fn (context : ?, builder : ?) -> Unit
demonstrate_value_system(?
context: Context, ?
builder: Builder) -> Unit
Unit {
let Unit
i32_ty = ?
context.() -> Unit
i32_type()
// Create two integer constants - these are directly IntValue
let Unit
const1 = Unit
i32_ty.(Int) -> Unit
const_int(10) // Value: IntValue, BasicValue: IntValue
let Unit
const2 = Unit
i32_ty.(Int) -> Unit
const_int(20) // Value: IntValue, BasicValue: IntValue
// Perform an addition operation - the result is an InstructionValue
let Unit
add_result = ?
builder.(Unit, Unit) -> Unit
build_int_add(Unit
const1, Unit
const2)
// In different contexts, we need different perspectives:
// As an instruction to check its properties
let Unit
instruction = Unit
add_result.() -> Unit
as_instruction()
(input : String) -> Unit
Prints any value that implements the Show
trait to the standard output,
followed by a newline.
Parameters:
value
: The value to be printed. Must implement the Show
trait.
Example:
println(42)
println("Hello, World!")
println([1, 2, 3])
println("Instruction opcode: \{Unit
instruction.() -> Unit
get_opcode()}")
// As a basic value to get its type
let Unit
basic_value = Unit
add_result.() -> Unit
into_basic_value()
(input : String) -> Unit
Prints any value that implements the Show
trait to the standard output,
followed by a newline.
Parameters:
value
: The value to be printed. Must implement the Show
trait.
Example:
println(42)
println("Hello, World!")
println([1, 2, 3])
println("Result type: \{Unit
basic_value.() -> Unit
get_type()}")
// As an integer value for subsequent calculations
let Unit
int_value = Unit
add_result.() -> Unit
into_int_value()
let Unit
final_result = ?
builder.(Unit, Unit) -> Unit
build_int_mul(Unit
int_value, Unit
const1)
}
Complete Classification of Value Types
-
ValueEnum: All possible value types
pub enum ValueEnum {
IntValue(IntValue) // Integer value(?) -> ValueEnum
FloatValue(FloatValue) // Floating-point value(?) -> ValueEnum
PointerValue(PointerValue) // Pointer value(?) -> ValueEnum
StructValue(StructValue) // Struct value(?) -> ValueEnum
FunctionValue(FunctionValue) // Function value(?) -> ValueEnum
ArrayValue(ArrayValue) // Array value(?) -> ValueEnum
VectorValue(VectorValue) // Vector value(?) -> ValueEnum
PhiValue(PhiValue) // Phi node value(?) -> ValueEnum
ScalableVectorValue(ScalableVectorValue) // Scalable vector value(?) -> ValueEnum
MetadataValue(MetadataValue) // Metadata value(?) -> ValueEnum
CallSiteValue(CallSiteValue) // Call site value(?) -> ValueEnum
GlobalValue(GlobalValue) // Global value(?) -> ValueEnum
InstructionValue(InstructionValue) // Instruction value } derive((?) -> ValueEnum
Show)trait Show { output(Self, &Logger) -> Unit to_string(Self) -> String }
Trait for types that can be converted to
String
-
BasicValueEnum: Values that have a basic type
pub enum BasicValueEnum {
ArrayValue(ArrayValue) // Array value(?) -> BasicValueEnum
IntValue(IntValue) // Integer value(?) -> BasicValueEnum
FloatValue(FloatValue) // Floating-point value(?) -> BasicValueEnum
PointerValue(PointerValue) // Pointer value(?) -> BasicValueEnum
StructValue(StructValue) // Struct value(?) -> BasicValueEnum
VectorValue(VectorValue) // Vector value(?) -> BasicValueEnum
ScalableVectorValue(ScalableVectorValue) // Scalable vector value } derive((?) -> BasicValueEnum
Show)trait Show { output(Self, &Logger) -> Unit to_string(Self) -> String }
Trait for types that can be converted to
String
💡 Best Practices for Value Conversion
In the actual code generation process, we often need to convert between different value perspectives:
pub fn (instruction_result : Unit) -> Unit
value_conversion_patterns(Unit
instruction_result: &Value) -> Unit
Unit {
// Pattern 1: I know what type this is, convert directly
let Unit
int_val = Unit
instruction_result.() -> Unit
into_int_value()
// Pattern 2: I just need a basic value, I don't care about the specific type
let Unit
basic_val = Unit
instruction_result.() -> Unit
into_basic_value()
// Pattern 3: Defensive programming, check before converting
match Unit
instruction_result.() -> Unit
as_value_enum() {
// Handle integer values
(Unit) -> Unit
IntValue(Unit
int_val) => (Unit) -> Unit
handle_integer(Unit
int_val)
// Handle float values
(Unit) -> Unit
FloatValue(Unit
float_val) => (Unit) -> Unit
handle_float(Unit
float_val)
_ => raise Error
CodeGenError("Unexpected value type")
}
}
Through this two-layer abstraction, llvm.mbt
maintains the integrity of the LLVM value system while providing an intuitive and easy-to-use interface for Moonbit developers.
Chapter 3: Practical LLVM IR Generation
Now that we understand the type and value systems, let's demonstrate how to use llvm.mbt
to generate LLVM IR with a complete example. This example will implement a simple muladd
function, showing the entire process from initialization to instruction generation.
Infrastructure Initialization
Any LLVM program begins by establishing three core components:
pub fn () -> (?, ?, ?)
initialize_llvm() -> (Context, Module, Builder) {
// 1. Create an LLVM context - a container for all LLVM objects
let ?
context = () -> ?
@llvm.Context::create()
// 2. Create a module - a container for functions and global variables
let ?
module = ?
context.(String) -> ?
create_module("demo_module")
// 3. Create an IR builder - used to generate instructions
let ?
builder = ?
context.() -> ?
create_builder()
(?
context, ?
module, ?
builder)
}
A Simple Function Generation Example
Let's implement a function that calculates (a * b) + c
:
pub fn () -> String
generate_muladd_function() -> String
String {
// Initialize LLVM infrastructure
let (?
context, ?
module, ?
builder) = () -> (?, ?, ?)
initialize_llvm()
// Define the function signature
let Unit
i32_ty = ?
context.() -> Unit
i32_type()
let Unit
func_type = Unit
i32_ty.(Array[Unit]) -> Unit
fn_type([Unit
i32_ty, Unit
i32_ty, Unit
i32_ty])
let Unit
func_value = ?
module.(String, Unit) -> Unit
add_function("muladd", Unit
func_type)
// Create the function entry basic block
let Unit
entry_block = ?
context.(Unit, String) -> Unit
append_basic_block(Unit
func_value, "entry")
?
builder.(Unit) -> Unit
position_at_end(Unit
entry_block)
// Get the function parameters
let Unit
arg_a = Unit
func_value.(Int) -> Unit
get_nth_param(0).() -> Unit
unwrap().() -> Unit
into_int_value()
let Unit
arg_b = Unit
func_value.(Int) -> Unit
get_nth_param(1).() -> Unit
unwrap().() -> Unit
into_int_value()
let Unit
arg_c = Unit
func_value.(Int) -> Unit
get_nth_param(2).() -> Unit
unwrap().() -> Unit
into_int_value()
// Generate calculation instructions
let Unit
mul_result = ?
builder.(Unit, Unit) -> Unit
build_int_mul(Unit
arg_a, Unit
arg_b).() -> Unit
into_int_value()
let Unit
add_result = ?
builder.(Unit, Unit) -> Unit
build_int_add(Unit
mul_result, Unit
arg_c)
// Generate the return instruction
let _ = ?
builder.(Unit) -> Unit
build_return(Unit
add_result)
// Output the generated IR
?
module.() -> String
dump()
}
Generated LLVM IR
Running the above code will produce the following LLVM Intermediate Representation:
; ModuleID = 'demo_module'
source_filename = "demo_module"
define i32 @muladd(i32 %0, i32 %1, i32 %2) {
entry:
%3 = mul i32 %0, %1
%4 = add i32 %3, %2
ret i32 %4
}
💡 Code Generation Best Practices
-
Naming Conventions
For instructions that return a value, the build interface has a
name
label argument, which can be used to add a name to the result of the instruction.let
mul_result =?
builder.Unit
build_int_mul((Unit, Unit, String) -> ?
lhs,Unit
rhs,Unit
name="temp_product") letString
final_result =?
builder.Unit
build_int_add((?, Unit, String) -> ?
mul_result,?
offset,Unit
name="final_sum")String
-
Error Handling
Use
raise
instead ofpanic
for error handling, and manage exceptions for situations that are not easy to determine directly.// Check for operations that might fail match func_value.get_nth_param(index) { Some(param) => param.into_int_value() None => raise CodeGenError("Function parameter \{index} not found") }
Chapter 4: TinyMoonbit Compiler Implementation
Now let's turn our attention to the actual compiler implementation, converting the abstract syntax tree we built in the previous article into LLVM IR.
Type Mapping: From Parser to LLVM
First, we need to establish a mapping between the TinyMoonbit type system and the LLVM type system:
pub struct CodeGen {
?
parser_program : Program // AST representation of the source program
?
llvm_context : @llvm.Context // LLVM context
?
llvm_module : @llvm.Module // LLVM module
?
builder : @llvm.Builder // IR builder
Map[String, ?]
llvm_functions : type Map[K, V]
Mutable linked hash map that maintains the order of insertion, not thread safe.
Example
let map = { 3: "three", 8 : "eight", 1 : "one"}
assert_eq(map.get(2), None)
assert_eq(map.get(3), Some("three"))
map.set(3, "updated")
assert_eq(map.get(3), Some("updated"))
Map[String
String, @llvm.FunctionValue] // Function map
}
pub fn (?, ?) -> Unit raise Error
convert_type(?
self : Self, ?
parser_type : Type) -> &@llvm.Type raise {
match ?
parser_type {
Type::?
Unit => ?
selfUnit
.?
llvm_contextUnit
.() -> Unit
void_typeUnit
() as &@llvm.Type
Type::?
Bool => ?
self.?
llvm_context.() -> Unit
bool_type()
Type::?
Int => ?
self.?
llvm_context.() -> Unit
i32_type()
Type::?
Double => ?
self.?
llvm_context.() -> Unit
f64_type()
// Can be extended with more types as needed
}
}
Environment Management: Mapping Variables to Values
During the code generation phase, we need to maintain a mapping from variable names to LLVM values:
pub struct Env {
Env?
parent : struct Env {
parent: Env?
symbols: Map[String, Unit]
codegen: CodeGen
parser_function: ?
llvm_function: ?
}
Env? // Reference to the parent environment
Map[String, Unit]
symbols : type Map[K, V]
Mutable linked hash map that maintains the order of insertion, not thread safe.
Example
let map = { 3: "three", 8 : "eight", 1 : "one"}
assert_eq(map.get(2), None)
assert_eq(map.get(3), Some("three"))
map.set(3, "updated")
assert_eq(map.get(3), Some("updated"))
Map[String
String, &@llvm.Value] // Local variable map
// Global information
CodeGen
codegen : struct CodeGen {
parser_program: ?
llvm_context: ?
llvm_module: ?
builder: ?
llvm_functions: Map[String, ?]
}
CodeGen // Reference to the code generator
?
parser_function : Function // AST of the current function
?
llvm_function : @llvm.FunctionValue // LLVM representation of the current function
}
pub fn (?, String) -> Unit?
get_symbol(?
self : Self, String
name : String
String) -> &@llvm.Value? {
match ?
self.Map[String, Unit]
symbols.(self : Map[String, Unit], key : String) -> Unit?
Get the value associated with a key.
get(String
name) {
(Unit) -> Unit?
Some(Unit
value) => (Unit) -> Unit?
Some(Unit
value)
Unit?
None =>
match ?
self.Env?
parent {
(Env) -> Env?
Some(Env
parent_env) => Env
parent_env.(String) -> Unit?
get_symbol(String
name)
Env?
None => Unit?
None
}
}
}
Variable Handling: Memory Allocation Strategy
As a systems-level language, TinyMoonbit supports variable reassignment. In LLVM IR's SSA (Static Single Assignment) form, we need to use the alloca + load/store pattern to implement mutable variables:
pub fn Stmt::(?, Env) -> Unit raise Error
emit(?
self : Self, Env
env : struct Env {
parent: Env?
symbols: Map[String, Unit]
codegen: CodeGen
parser_function: ?
llvm_function: ?
}
Env) -> Unit
Unit raise {
match ?
self {
// Variable declaration: e.g., let x : Int = 5;
(String, Unit, Unit) -> ?
Let(String
var_name, Unit
var_type, Unit
init_expr) => {
// Convert the type and allocate stack space
let Unit
llvm_type = Env
env.CodeGen
codegen.(Unit) -> Unit
convert_type(Unit
var_type)
let Unit
alloca = Env
env.CodeGen
codegen.?
builder.(Unit, String) -> Unit
build_alloca(Unit
llvm_type, String
var_name)
// Record the allocated pointer in the symbol table
Env
env.Map[String, Unit]
symbols.(self : Map[String, Unit], key : String, value : Unit) -> Unit
Set a key-value pair into the hash map.
set(String
var_name, Unit
allocaUnit
as &@llvm.Value)
// Calculate the value of the initialization expression
let Unit
init_value = Unit
init_expr.(Env) -> Unit
emit(Env
env).() -> Unit
into_basic_value()
// Store the initial value into the allocated memory
let _ = Env
env.CodeGen
codegen.?
builder.(Unit, Unit) -> Unit
build_store(Unit
alloca, Unit
init_value)
}
// Variable assignment: x = 10;
(Unit, Unit) -> ?
Assign(Unit
var_name, Unit
rhs_expr) => {
// Get the memory address of the variable from the symbol table
guard let (_/0) -> Unit
Some(_/0
var_ptr) = Env
env.(Unit) -> Unit
get_symbol(Unit
var_name) else {
raise Error
CodeGenError("Undefined variable: \{Unit
var_name}")
}
// Calculate the value of the right-hand side expression
let Unit
rhs_value = Unit
rhs_expr.(Env) -> Unit
emit(Env
env).() -> Unit
into_basic_value()
// Store the new value into the variable's memory
let _ = Env
env.CodeGen
codegen.?
builder.(Unit, Unit) -> Unit
build_store(Unit
var_ptr, Unit
rhs_value)
}
// Other statement types...
_ => { /* Handle other statements */ }
}
}
Design Decision: Why use
alloca
?In functional languages, immutable variables can be directly mapped to SSA values. However, TinyMoonbit supports variable reassignment, which conflicts with the SSA principle of "each variable is assigned only once."
The alloca + load/store pattern is the standard way to handle mutable variables:
alloca
: Allocates memory space on the stack.store
: Writes a value to memory.load
: Reads a value from memory.LLVM's optimization process will automatically convert simple
alloca
s back to value form (themem2reg
optimization).
Expression Code Generation
Expression code generation is relatively straightforward, mainly involving calling the corresponding instruction-building methods based on the expression type:
fn Expr::(?, Env) -> Unit raise Error
emit(?
self: Self, Env
env: struct Env {
parent: Env?
symbols: Map[String, Unit]
codegen: CodeGen
parser_function: ?
llvm_function: ?
}
Env) -> &@llvm.Value raise {
match ?
self {
(Unit) -> ?
AtomExpr(Unit
atom_expr, ..) => Unit
atom_expr.(Env) -> Unit
emit(Env
env)
(String, Unit, _/0) -> ?
Unary("-", Unit
expr, _/0
ty = (_/0) -> _/1
Some(_/0
Int)) => {
let Unit
value = Unit
expr.() -> Unit
emit().() -> Unit
into_int_value()
let Unit
zero = Env
env.Unit
gen.Unit
llvm_ctx.() -> Unit
i32_type().() -> Unit
const_zero()
Env
env.Unit
gen.?
builder.(Unit, Unit) -> Unit
build_int_sub(Unit
zero, Unit
value)
}
(String, Unit, _/0) -> ?
Unary("-", Unit
expr, _/0
ty = (_/0) -> _/1
Some(_/0
Double)) => {
let Unit
value = Unit
expr.() -> Unit
emit().() -> Unit
into_float_value()
Env
env.Unit
gen.?
builder.(Unit) -> Unit
build_float_neg(Unit
value)
}
(String, Unit, Unit, _/0) -> ?
Binary("+", Unit
lhs, Unit
rhs, _/0
ty=(_/0) -> _/1
Some(_/0
Int)) => {
let Unit
lhs_val = Unit
lhs.() -> Unit
emit().() -> Unit
into_int_value()
let Unit
rhs_val = Unit
rhs.() -> Unit
emit().() -> Unit
into_int_value()
Env
env.Unit
gen.?
builder.(Unit, Unit) -> Unit
build_int_add(Unit
lhs_val, Unit
rhs_val)
}
// ... others
}
}
Technical Detail: Floating-Point Negation
Note that when handling floating-point negation, we use
build_float_neg
instead of subtracting the operand from zero. This is because:
- IEEE 754 Standard: Floating-point numbers have special values (like NaN, ∞), and simple subtraction might produce incorrect results.
- Performance Considerations: Dedicated negation instructions are usually more efficient on modern processors.
- Precision Guarantee: Avoids unnecessary rounding errors.
Chapter 5: Implementation of Control Flow Instructions
Control flow is the backbone of program logic, including conditional branches and loop structures. In LLVM IR, control flow is implemented through Basic Blocks and branch instructions. Each basic block represents a sequence of instructions with no internal jumps, and blocks are connected by branch instructions.
Conditional Branches: Implementing if-else Statements
Conditional branches require creating multiple basic blocks to represent different execution paths:
fn Stmt::(?, Env) -> Unit raise Error
emit(?
self: Self, Env
env: struct Env {
parent: Env?
symbols: Map[String, Unit]
codegen: CodeGen
parser_function: ?
llvm_function: ?
}
Env) -> Unit
Unit raise {
let Unit
ctx = Env
env.Unit
gen.Unit
llvm_ctx
let Unit
func = Env
env.Unit
llvm_func
let ?
builder = Env
env.Unit
gen.?
builder
match ?
self {
(Unit, Unit, Unit) -> ?
If(Unit
cond, Unit
then_stmts, Unit
else_stmts) => {
let Unit
cond_val = Unit
cond.(Env) -> Unit
emit(Env
env).() -> Unit
into_int_value()
// Create three basic blocks
let Unit
then_block = Unit
ctx.(Unit) -> Unit
append_basic_block(Unit
llvm_func)
let Unit
else_block = Unit
ctx.(Unit) -> Unit
append_basic_block(Unit
llvm_func)
let Unit
merge_block = Unit
ctx.(Unit) -> Unit
append_basic_block(Unit
llvm_func)
// Create the jump instruction
let _ = ?
builder.(Unit, Unit, Unit) -> Unit
build_conditional_branch(
Unit
cond_val, Unit
then_block, Unit
else_block,
)
// Generate code for the then_block
?
builder.(Unit) -> Unit
position_at_end(Unit
then_block)
let Unit
then_env = ?
self.() -> Unit
subenv()
Unit
then_stmts.((Unit) -> Unit) -> Unit
each(Unit
s => Unit
s.(Unit) -> Unit
emitStmt(Unit
then_env))
let _ = ?
builder.(Unit) -> Unit
build_unconditional_branch(Unit
merge_block)
// Generate code for the else_block
?
builder.(Unit) -> Unit
position_at_end(Unit
else_block)
let Unit
else_env = ?
self.() -> Unit
subenv()
Unit
else_stmts.((Unit) -> Unit) -> Unit
each(Unit
s => Unit
s.(Unit) -> Unit
emitStmt(Unit
else_env))
let _ = ?
builder.(Unit) -> Unit
build_unconditional_branch(Unit
merge_block)
// After code generation is complete, the builder's position should be on the merge_block
?
builder.(Unit) -> Unit
position_at_end(Unit
merge_block)
}
// ...
}
}
Generated LLVM IR Example
For the following TinyMoonbit code:
if x > 0 {
y = x + 1;
} else {
y = x - 1;
}
It will generate LLVM IR similar to this:
%1 = load i32, ptr %x, align 4
%2 = icmp sgt i32 %1, 0
br i1 %2, label %if.then, label %if.else
if.then: ; preds = %0
%3 = load i32, ptr %x, align 4
%4 = add i32 %3, 1
store i32 %4, ptr %y, align 4
br label %if.end
if.else: ; preds = %0
%5 = load i32, ptr %x, align 4
%6 = sub i32 %5, 1
store i32 %6, ptr %y, align 4
br label %if.end
if.end: ; preds = %if.else, %if.then
; Subsequent code...
Loop Structures: Implementing while Statements
The implementation of loops requires special attention to the correct connection of the condition check and the loop body:
fn Stmt::(?, Env) -> Unit raise Error
emit(?
self: Self, Env
env: struct Env {
parent: Env?
symbols: Map[String, Unit]
codegen: CodeGen
parser_function: ?
llvm_function: ?
}
Env) -> Unit
Unit raise {
let Unit
ctx = Env
env.Unit
gen.Unit
llvm_ctx
let Unit
func = Env
env.Unit
llvm_func
let ?
builder = Env
env.Unit
gen.?
builder
match ?
self {
(Unit, Unit) -> ?
While(Unit
cond, Unit
body) => {
// Generate three blocks
let Unit
cond_block = Unit
ctx.(Unit) -> Unit
append_basic_block(Unit
llvm_func)
let Unit
body_block = Unit
ctx.(Unit) -> Unit
append_basic_block(Unit
llvm_func)
let Unit
merge_block = Unit
ctx.(Unit) -> Unit
append_basic_block(Unit
llvm_func)
// First, unconditionally jump to the cond block
let _ = ?
builder.(Unit) -> Unit
build_unconditional_branch(Unit
cond_block)
?
builder.(Unit) -> Unit
position_at_end(Unit
cond_block)
// Generate code within the cond block, as well as the conditional jump instruction
let Unit
cond_val = Unit
cond.() -> Unit
emit().() -> Unit
into_int_value()
let _ = ?
builder.(Unit, Unit, Unit) -> Unit
build_conditional_branch(
Unit
cond_val, Unit
body_block, Unit
merge_block,
)
?
builder.(Unit) -> Unit
position_at_end(Unit
body_block)
// Generate code for the body block, with an unconditional jump to the cond block at the end
let Unit
body_env = ?
self.() -> Unit
subenv()
Unit
body.((Unit) -> Unit) -> Unit
each(Unit
s => Unit
s.(Unit) -> Unit
emitStmt(Unit
body_env))
let _ = ?
builder.(Unit) -> Unit
build_unconditional_branch(Unit
cond_block)
// After code generation is finished, jump to the merge block
?
builder.(Unit) -> Unit
position_at_end(Unit
merge_block)
}
// ...
}
}
Generated LLVM IR Example
For the TinyMoonbit code:
while i < 10 {
i = i + 1;
}
It will generate:
br label %while.cond
while.cond: ; preds = %while.body, %0
%1 = load i32, ptr %i, align 4
%2 = icmp slt i32 %1, 10
br i1 %2, label %while.body, label %while.end
while.body: ; preds = %while.cond
%3 = load i32, ptr %i, align 4
%4 = add i32 %3, 1
store i32 %4, ptr %i, align 4
br label %while.cond
while.end: ; preds = %while.cond
; Subsequent code...
💡 Control Flow Design Points
-
Basic Block Naming Strategy
The
append_basic_block
function also has aname
label argument.// Use descriptive block names for easier debugging and understanding let
then_block =?
context.Unit
append_basic_block((Unit, String) -> ?
func,Unit
name="if.then") letString
else_block =?
context.Unit
append_basic_block((Unit, String) -> ?
func,Unit
name="if.else") letString
merge_block =?
context.Unit
append_basic_block((Unit, String) -> ?
func,Unit
name="if.end")String
-
Scope Management
// Create a separate scope for each branch and loop body let
branch_env =?
env.Unit
sub_env() branch_stmts.each( stmt => stmt.emit(branch_env) }() -> ?
-
Builder Position Management
At the end, be sure to place the instruction builder on the correct basic block.
// Always ensure the builder points to the correct basic block builder.position_at_end(merge_block) // Generate instructions in this block...
Chapter 6: From LLVM IR to Machine Code
After generating the complete LLVM IR, we need to convert it into assembly code for the target machine. Although llvm.mbt
provides a complete target machine configuration API, for learning purposes, we can use a simpler method.
Compiling with the llc
Toolchain
The most direct method is to output the generated LLVM IR to a file and then use the LLVM toolchain to compile it:
Call the dump
function of the Module
, or you can use the println
function.
let CodeGen
gen : struct CodeGen {
parser_program: ?
llvm_context: ?
llvm_module: ?
builder: ?
llvm_functions: Map[String, ?]
}
CodeGen = ...
let ?
prog = CodeGen
gen.?
llvm_prog
prog.dump() // dump is recommended as it will be slightly faster than println, with the same effect
// or println(prog)
Complete Compilation Flow Example
Let's look at a complete compilation flow from source code to assembly code:
-
TinyMoonbit Source Code
fn
factorial((n : Int) -> Int
n:Int
Int) ->Int
Int { ifInt
nInt
<= 1 { return 1; } return(self_ : Int, other : Int) -> Bool
nInt
*(self : Int, other : Int) -> Int
Multiplies two 32-bit integers. This is the implementation of the
*
operator forInt
.Parameters:
self
: The first integer operand.other
: The second integer operand.
Returns the product of the two integers. If the result overflows the range of
Int
, it wraps around according to two's complement arithmetic.Example:
inspect(42 * 2, content="84") inspect(-10 * 3, content="-30") let max = 2147483647 // Int.max_value inspect(max * 2, content="-2") // Overflow wraps around
factorial((n : Int) -> Int
nInt
- 1); } fn main() -> Unit { let(self : Int, other : Int) -> Int
Performs subtraction between two 32-bit integers, following standard two's complement arithmetic rules. When the result overflows or underflows, it wraps around within the 32-bit integer range.
Parameters:
self
: The minuend (the number being subtracted from).other
: The subtrahend (the number to subtract).
Returns the difference between
self
andother
.Example:
let a = 42 let b = 10 inspect(a - b, content="32") let max = 2147483647 // Int maximum value inspect(max - -1, content="-2147483648") // Overflow case
result:Int
Int =Int
factorial(5);(n : Int) -> Int
print_int((Int) -> Unit
result); }Int
-
Generated LLVM IR
; ModuleID = 'tinymoonbit' source_filename = "tinymoonbit" define i32 @factorial(i32 %0) { entry: %1 = alloca i32, align 4 store i32 %0, ptr %1, align 4 %2 = load i32, ptr %1, align 4 %3 = icmp sle i32 %2, 1 br i1 %3, label %4, label %6 4: ; preds = %entry ret i32 1 6: ; preds = %entry %7 = load i32, ptr %1, align 4 %8 = load i32, ptr %1, align 4 %9 = sub i32 %8, 1 %10 = call i32 @factorial(i32 %9) %11 = mul i32 %7, %10 ret i32 %11 } define void @main() { entry: %0 = alloca i32, align 4 %1 = call i32 @factorial(i32 5) store i32 %1, ptr %0, align 4 %2 = load i32, ptr %0, align 4 call void @print_int(i32 %2) ret void } declare void @print_int(i32 %0)
-
Generating RISC-V Assembly with
llc
# Generate llvm ir moon run main --target native > fact.ll # Generate RISC-V 64-bit assembly code llc -march=riscv64 -mattr=+m -o fact.s fact.ll
-
Generated RISC-V Assembly Snippet
factorial: .Lfunc_begin0: .cfi_startproc addi sp, sp, -32 .cfi_def_cfa_offset 32 sd ra, 24(sp) .cfi_offset ra, -8 sd s0, 16(sp) .cfi_offset s0, -16 addi s0, sp, 32 .cfi_def_cfa s0, 0 sw a0, -20(s0) lw a0, -20(s0) li a1, 1 blt a1, a0, .LBB0_2 li a0, 1 j .LBB0_3 .LBB0_2: lw a0, -20(s0) lw a1, -20(s0) addi a1, a1, -1 sw a0, -24(s0) mv a0, a1 call factorial lw a1, -24(s0) mul a0, a1, a0 .LBB0_3: ld ra, 24(sp) ld s0, 16(sp) addi sp, sp, 32 ret
Conclusion
Through this two-part series, we have completed a fully functional, albeit simple, compiler implementation. From the lexical analysis of a character stream to the construction of an abstract syntax tree, and finally to the generation of LLVM IR and machine code output.
Review
Part 1:
- An elegant lexer based on pattern matching
- Implementation of a recursive descent parser
- A complete type-checking system
- Scope management with an environment chain
Part 2:
- A deep dive into the LLVM type and value systems
- Variable management strategies in SSA form
- Correct implementation of control flow instructions
- A complete code generation pipeline
Moonbit's Advantages in Compiler Development
Through this practical project, we have gained a deep appreciation for Moonbit's unique value in the field of compiler construction:
- Expressive Pattern Matching: Greatly simplifies the complexity of AST processing and type analysis.
- Functional Programming Paradigm: Immutable data structures and pure functions make the compiler logic clearer and more reliable.
- Modern Type System: Trait objects, generics, and error handling mechanisms provide ample abstraction capabilities.
- Excellent Engineering Features: Features like
derive
and JSON serialization significantly improve development efficiency.
Final Words
Compiler technology represents the perfect combination of computer science theory and engineering practice. With a modern tool like Moonbit, we can explore this ancient yet vibrant field in a more elegant and efficient way.
We hope this series of articles will provide readers with a powerful aid on their journey into compiler design.
Recommended Learning Resources