Two Approaches to Regex Engines: Derivative and Thompson VM
Regular expression engines can be implemented using fundamentally different approaches, each with distinct trade-offs in performance, memory usage, and implementation complexity. This article explores two mathematically equivalent but practically different methods for regex matching: Brzozowski derivatives and Thompson's virtual machine approach.
Both methods operate on the same abstract syntax tree representation, providing a unified foundation for direct performance comparison. The key insight is how these seemingly different approaches solve identical problems through different computational strategies—one through algebraic transformation, the other through program execution.
Conventions & Definitions
To establish a common foundation, both regex engines start with a shared AST representation that captures the essential structure of regular expressions in a tree format:
enum Ast {
(Char) -> Ast
Chr(Char
Char)
(Ast, Ast) -> Ast
Seq(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast, enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast)
(Ast, Int?) -> Ast
Rep(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast, Int
Int?)
(Ast) -> Ast
Opt(enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast)
} derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait Hash {
hash_combine(Self, Hasher) -> Unit
hash(Self) -> Int
}
Trait for types that can be hashed
The hash
method should return a hash value for the type, which is used in hash tables and other data structures.
The hash_combine
method is used to combine the hash of the current value with another hash value,
typically used to hash composite types.
When two values are equal according to the Eq
trait, they should produce the same hash value.
The hash
method does not need to be implemented if hash_combine
is implemented,
When implemented separately, hash
does not need to produce a hash value that is consistent with hash_combine
.
Hash, trait Eq {
equal(Self, Self) -> Bool
op_equal(Self, Self) -> Bool
}
Trait for types whose elements can test for equality
Eq)
Additionally, we provide smart constructors to simplify regex construction:
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(chr : Char) -> Ast
chr(Char
chr : Char
Char) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast {
(Char) -> Ast
Chr(Char
chr)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(self : Ast, other : Ast) -> Ast
seq(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast, Ast
other : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast {
(Ast, Ast) -> Ast
Seq(Ast
self, Ast
other)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(self : Ast, n? : Int) -> Ast
rep(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast, Int?
n? : Int
Int) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast {
(Ast, Int?) -> Ast
Rep(Ast
self, Int?
n)
}
fn enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(self : Ast) -> Ast
opt(Ast
self : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast {
Unit
@fs.
(Ast) -> Ast
Opt(Ast
self)
}
The AST defines four fundamental regex operations:
Chr(Char)
matches a single literal character.Seq(Ast, Ast)
matches one pattern followed by another through concatenation.Rep(Ast, Int?)
repeats a pattern either unlimited times whenNone
or exactly n times whenSome(n)
.Opt(Ast)
makes a pattern optional, equivalent topattern?
in standard regex syntax.
For example, we can build the regex (ab*)?
—an optional sequence of 'a' followed by zero or more 'b's—as:
Ast::chr('a').seq(Ast::chr('b').rep()).opt()
Brzozowski Derivative
The derivative-based approach transforms regular expressions algebraically using formal language theory. For each input character, it computes the "derivative" of the regex by asking: "what remains to be matched after consuming this character?" This creates a new regex representing the remaining pattern.
We extend the basic Ast
type to represent derivatives and nullability explicitly:
enum Exp {
Exp
Nil
Exp
Eps
(Char) -> Exp
Chr(Char
Char)
(Exp, Exp) -> Exp
Alt(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp)
(Exp, Exp) -> Exp
Seq(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp)
(Exp) -> Exp
Rep(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp)
} derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait Hash {
hash_combine(Self, Hasher) -> Unit
hash(Self) -> Int
}
Trait for types that can be hashed
The hash
method should return a hash value for the type, which is used in hash tables and other data structures.
The hash_combine
method is used to combine the hash of the current value with another hash value,
typically used to hash composite types.
When two values are equal according to the Eq
trait, they should produce the same hash value.
The hash
method does not need to be implemented if hash_combine
is implemented,
When implemented separately, hash
does not need to produce a hash value that is consistent with hash_combine
.
Hash, trait Eq {
equal(Self, Self) -> Bool
op_equal(Self, Self) -> Bool
}
Trait for types whose elements can test for equality
Eq, trait Compare {
compare(Self, Self) -> Int
}
Trait for types whose elements are ordered
The return value of [compare] is:
- zero, if the two arguments are equal
- negative, if the first argument is smaller
- positive, if the first argument is greater
Compare)
The constructors in Exp
represent:
Nil
represents an impossible pattern that can never match anything.Eps
matches the empty string.Chr(Char)
matches a single character.Alt(Exp, Exp)
represents alternation, providing choice between patterns.Seq(Exp, Exp)
represents concatenation of two patterns.Rep(Exp)
represents repetition of a pattern.
We use the Exp::of_ast
function to convert the Ast
into the more expressive Exp
format:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
ast : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp {
match Ast
ast {
(Char) -> Ast
Chr(Char
c) => (Char) -> Exp
Chr(Char
c)
(Ast, Ast) -> Ast
Seq(Ast
a, Ast
b) => (Exp, Exp) -> Exp
Seq(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a), enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
b))
(Ast, Int?) -> Ast
Rep(Ast
a, Int?
None) => (Exp) -> Exp
Rep(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a))
(Ast, Int?) -> Ast
Rep(Ast
a, (Int) -> Int?
Some(Int
n)) => {
let Exp
sec = enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a)
let mut Exp
exp = Exp
sec
for _ in Int
1..<Int
n {
Exp
exp = (Exp, Exp) -> Exp
Seq(Exp
exp, Exp
sec)
}
Exp
exp
}
(Ast) -> Ast
Opt(Ast
a) => (Exp, Exp) -> Exp
Alt(enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a), Exp
Eps)
}
}
We also provide smart constructors for Exp
to simplify pattern building:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(a : Exp, b : Exp) -> Exp
seq(Exp
a : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, Exp
b : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp) -> enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp {
match (Exp
a, Exp
b) {
(Exp
Nil, _) | (_, Exp
Nil) => Exp
Nil
(Exp
Eps, Exp
b) => Exp
b
(Exp
a, Exp
Eps) => Exp
a
(Exp
a, Exp
b) => (Exp, Exp) -> Exp
Seq(Exp
a, Exp
b)
}
}
However, the smart constructor for Alt
is strictly necessary—it ensures that the constructed Exp
is normalized to "similarity" as mentioned in the original paper by Brzozowski. Two regexes are similar if one can be reduced to the other by applying the following rules:
Therefore, we normalize the Alt
construction to always use the same associativity and order of alternatives:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(a : Exp, b : Exp) -> Exp
alt(Exp
a : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, Exp
b : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp) -> enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp {
match (Exp
a, Exp
b) {
(Exp
Nil, Exp
b) => Exp
b
(Exp
a, Exp
Nil) => Exp
a
((Exp, Exp) -> Exp
Alt(Exp
a, Exp
b), Exp
c) => Exp
a.(a : Exp, b : Exp) -> Exp
alt(Exp
b.(a : Exp, b : Exp) -> Exp
alt(Exp
c))
(Exp
a, Exp
b) => {
if Exp
a (Exp, Exp) -> Bool
automatically derived
== Exp
b {
Exp
a
} else if Exp
a (self_ : Exp, other : Exp) -> Bool
> Exp
b {
(Exp, Exp) -> Exp
Alt(Exp
b, Exp
a)
} else {
(Exp, Exp) -> Exp
Alt(Exp
a, Exp
b)
}
}
}
}
The nullable
function determines if a pattern can match the empty string without consuming input:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(self : Exp) -> Bool
nullable(Exp
self : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp) -> Bool
Bool {
match Exp
self {
Exp
Nil => false
Exp
Eps => true
(Char) -> Exp
Chr(_) => false
(Exp, Exp) -> Exp
Alt(Exp
l, Exp
r) => Exp
l.(self : Exp) -> Bool
nullable() (Bool, Bool) -> Bool
|| Exp
r.(self : Exp) -> Bool
nullable()
(Exp, Exp) -> Exp
Seq(Exp
l, Exp
r) => Exp
l.(self : Exp) -> Bool
nullable() (Bool, Bool) -> Bool
&& Exp
r.(self : Exp) -> Bool
nullable()
(Exp) -> Exp
Rep(_) => true
}
}
The deriv
function computes the derivative of a pattern with respect to a character, transforming the pattern based on the rules defined in the Brzozowski derivative. We have reordered the rules to match the order in the deriv
function:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(self : Exp, c : Char) -> Exp
deriv(Exp
self : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, Char
c : Char
Char) -> enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp {
match Exp
self {
Exp
Nil => Exp
self
Exp
Eps => Exp
Nil
(Char) -> Exp
Chr(Char
d) if Char
d (self : Char, other : Char) -> Bool
Compares two characters for equality.
Parameters:
self
: The first character to compare.
other
: The second character to compare.
Returns true
if both characters represent the same Unicode code point,
false
otherwise.
Example:
let a = 'A'
let b = 'A'
let c = 'B'
inspect(a == b, content="true")
inspect(a == c, content="false")
== Char
c => Exp
Eps
(Char) -> Exp
Chr(_) => Exp
Nil
(Exp, Exp) -> Exp
Alt(Exp
l, Exp
r) => Exp
l.(self : Exp, c : Char) -> Exp
deriv(Char
c).(a : Exp, b : Exp) -> Exp
alt(Exp
r.(self : Exp, c : Char) -> Exp
deriv(Char
c))
(Exp, Exp) -> Exp
Seq(Exp
l, Exp
r) => {
let Exp
dl = Exp
l.(self : Exp, c : Char) -> Exp
deriv(Char
c)
if Exp
l.(self : Exp) -> Bool
nullable() {
Exp
dl.(a : Exp, b : Exp) -> Exp
seq(Exp
r).(a : Exp, b : Exp) -> Exp
alt(Exp
r.(self : Exp, c : Char) -> Exp
deriv(Char
c))
} else {
Exp
dl.(a : Exp, b : Exp) -> Exp
seq(Exp
r)
}
}
(Exp) -> Exp
Rep(Exp
e) => Exp
e.(self : Exp, c : Char) -> Exp
deriv(Char
c).(a : Exp, b : Exp) -> Exp
seq(Exp
self)
}
}
To simplify our implementation, we only perform strict matching—the pattern must match the entire input string. Therefore, we only check for nullability after the entire input has been consumed:
fn enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(self : Exp, s : String) -> Bool
matches(Exp
self : enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, String
s : String
String) -> Bool
Bool {
loop (Exp
self, String
s.(self : String, start_offset? : Int, end_offset? : Int) -> @string.View
Creates a View
into a String
.
Example
let str = "Hello🤣🤣🤣"
let view1 = str.view()
inspect(view1, content=
"Hello🤣🤣🤣"
)
let start_offset = str.offset_of_nth_char(1).unwrap()
let end_offset = str.offset_of_nth_char(6).unwrap() // the second emoji
let view2 = str.view(start_offset~, end_offset~)
inspect(view2, content=
"ello🤣"
)
view()) {
(Exp
Nil, _) => {
return false
}
(Exp
e, []) => {
return Exp
e.(self : Exp) -> Bool
nullable()
}
(Exp
e, @string.View
[Char
c@string.View
, .. s]) => {
continue (Exp
e.(self : Exp, c : Char) -> Exp
deriv(Char
c), @string.View
s)
}
}
}
Virtual Machine
The VM approach compiles regular expressions into bytecode instructions for a simple virtual machine. This method transforms the pattern-matching problem into program execution, where the VM simulates all possible paths through a non-deterministic finite automaton simultaneously.
Ken Thompson's 1968 paper described a regex engine that compiled patterns into IBM 7094 machine code. The key insight was to avoid exponential backtracking by maintaining multiple execution threads that advance through input in lockstep, processing one character at a time across all possible matching paths.
Instruction Set and Program Representation
The VM operates on four fundamental instructions that correspond to NFA operations:
enum Ops {
Ops
Done
(Char) -> Ops
Char(Char
Char)
(Int) -> Ops
Jump(Int
Int)
(Int) -> Ops
Fork(Int
Int)
} derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show)
Each instruction serves a specific purpose in NFA simulation. Done
marks successful completion of pattern matching, equivalent to Thompson's original match
. Char(c)
consumes input character c
and advances to the next instruction. Jump(addr)
provides unconditional jump to instruction at address addr
(Thompson's jmp
). Fork(addr)
creates two execution paths—one continues to the next instruction, another jumps to addr
(Thompson's split
).
The Fork
instruction is crucial for handling non-determinism in patterns like alternation and repetition, where multiple execution paths must be explored simultaneously. This maps directly to NFA ε-transitions, where execution can spontaneously branch without consuming input.
We define a Prg
that wraps an array of instructions with convenience methods for building and manipulating bytecode programs.
struct Prg(type Array[T]
An Array
is a collection of values that supports random access and can
grow in size.
Array[enum Ops {
Done
Char(Char)
Jump(Int)
Fork(Int)
} derive(Show)
Ops]) derive(trait Show {
output(Self, &Logger) -> Unit
to_string(Self) -> String
}
Trait for types that can be converted to String
Show)
fn type Prg Array[Ops] derive(Show)
Prg::(self : Prg, inst : Ops) -> Unit
push(Prg
self : type Prg Array[Ops] derive(Show)
Prg, Ops
inst : enum Ops {
Done
Char(Char)
Jump(Int)
Fork(Int)
} derive(Show)
Ops) -> Unit
Unit {
Prg
self.Array[Ops]
0.(self : Array[Ops], value : Ops) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
let v = []
v.push(3)
push(Ops
inst)
}
fn type Prg Array[Ops] derive(Show)
Prg::(self : Prg) -> Int
length(Prg
self : type Prg Array[Ops] derive(Show)
Prg) -> Int
Int {
Prg
self.Array[Ops]
0.(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:
array
: The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
let arr = [1, 2, 3]
inspect(arr.length(), content="3")
let empty : Array[Int] = []
inspect(empty.length(), content="0")
length()
}
fn type Prg Array[Ops] derive(Show)
Prg::(self : Prg, index : Int, inst : Ops) -> Unit
op_set(Prg
self : type Prg Array[Ops] derive(Show)
Prg, Int
index : Int
Int, Ops
inst : enum Ops {
Done
Char(Char)
Jump(Int)
Fork(Int)
} derive(Show)
Ops) -> Unit
Unit {
Prg
self.Array[Ops]
Sets the element at the specified index in the array to a new value. The
original value at that index is overwritten.
Parameters:
array
: The array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws an error if index
is negative or greater than or equal to the length
of the array.
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
0(Array[Ops], Int, Ops) -> Unit
Sets the element at the specified index in the array to a new value. The
original value at that index is overwritten.
Parameters:
array
: The array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws an error if index
is negative or greater than or equal to the length
of the array.
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
[index] = Ops
inst
}
AST Compilation to Bytecode
The Prg::of_ast
function translates AST patterns into VM instructions using standard NFA construction techniques:
-
Seq(a, b)
:code for a code for b
-
Rep(a, None)
(unbounded repetition):Fork L1, L2 L1: code for a Jump L1 L2:
-
Rep(a, Some(n))
(fixed repetition):code for a code for a ... (n times) ...
-
Opt(a)
(optional):Fork L1, L2 L1: code for a L2:
Note that the Fork
constructor only accepts one address, because we always want to proceed to the next instruction after the Fork
.
fn type Prg Array[Ops] derive(Show)
Prg::(ast : Ast) -> Prg
of_ast(Ast
ast : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> type Prg Array[Ops] derive(Show)
Prg {
fn (Prg, Ast) -> Unit
compile(Prg
prog : type Prg Array[Ops] derive(Show)
Prg, Ast
ast : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> Unit
Unit {
match Ast
ast {
(Char) -> Ast
Chr(Char
chr) => Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Char) -> Ops
Char(Char
chr))
(Ast, Ast) -> Ast
Seq(Ast
l, Ast
r) => {
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
l)
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
r)
}
(Ast, Int?) -> Ast
Rep(Ast
e, Int?
None) => {
let Int
fork = Prg
prog.(self : Prg) -> Int
length()
Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Fork(0))
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Jump(Int
fork))
Prg
prog(Prg, Int, Ops) -> Unit
[fork] = (Int) -> Ops
Fork(Prg
prog.(self : Prg) -> Int
length())
}
(Ast, Int?) -> Ast
Rep(Ast
e, (Int) -> Int?
Some(Int
n)) =>
for _ in Int
0..<Int
n {
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
}
(Ast) -> Ast
Opt(Ast
e) => {
let Int
fork_inst = Prg
prog.(self : Prg) -> Int
length()
Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Fork(0))
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
Prg
prog(Prg, Int, Ops) -> Unit
[fork_inst] = (Int) -> Ops
Fork(Prg
prog.(self : Prg) -> Int
length())
}
}
}
let Prg
prog : type Prg Array[Ops] derive(Show)
Prg = []
(Prg, Ast) -> Unit
compile(Prg
prog, Ast
ast)
Prg
prog.(self : Prg, inst : Ops) -> Unit
push(Ops
Done)
Prg
prog
}
VM Execution Loop
In Rob Pike's implementation, the VM executes one-past the end of the input string to handle the final acceptance state. To make this explicit, our matches
function implements the core VM execution loop using a two-phase approach:
Phase 1 handles character processing. For each input character, it processes all active threads in the current context. Char
instructions that match the current character create new threads in the next context. Jump
and Fork
instructions immediately spawn new threads in the current context. After processing all threads, it swaps contexts and continues with the next character.
Phase 2 handles final acceptance. After consuming all input, it processes remaining threads looking for Done
instructions. It handles any final Jump
/Fork
instructions that don't consume input. It returns true
if any thread reaches a Done
instruction.
fn type Prg Array[Ops] derive(Show)
Prg::(self : Prg, data : @string.View) -> Bool
matches(Prg
self : type Prg Array[Ops] derive(Show)
Prg, @string.View
data : #builtin.valtype
type @string.View
A @string.View
represents a view of a String that maintains proper Unicode
character boundaries. It allows safe access to a substring while handling
multi-byte characters correctly.
@string.View) -> Bool
Bool {
let (Array[Ops]) -> Prg
Prg(Array[Ops]
prog) = Prg
self
let mut Ctx
curr = struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Array[Ops]
prog.(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:
array
: The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
let arr = [1, 2, 3]
inspect(arr.length(), content="3")
let empty : Array[Int] = []
inspect(empty.length(), content="0")
length())
let mut Ctx
next = struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Array[Ops]
prog.(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:
array
: The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
let arr = [1, 2, 3]
inspect(arr.length(), content="3")
let empty : Array[Int] = []
inspect(empty.length(), content="0")
length())
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(0)
for Char
c in @string.View
data {
while Ctx
curr.(self : Ctx) -> Int?
pop() is (Int) -> Int?
Some(Int
pc) {
match Array[Ops]
prog(Array[Ops], Int) -> Ops
Retrieves an element from the array at the specified index.
Parameters:
array
: The array to get the element from.
index
: The position in the array from which to retrieve the element.
Returns the element at the specified index.
Throws a panic if the index is negative or greater than or equal to the
length of the array.
Example:
let arr = [1, 2, 3]
inspect(arr[1], content="2")
[pc] {
Ops
Done => ()
(Char) -> Ops
Char(Char
char) if Char
char (self : Char, other : Char) -> Bool
Compares two characters for equality.
Parameters:
self
: The first character to compare.
other
: The second character to compare.
Returns true
if both characters represent the same Unicode code point,
false
otherwise.
Example:
let a = 'A'
let b = 'A'
let c = 'B'
inspect(a == b, content="true")
inspect(a == c, content="false")
== Char
c => {
Ctx
next.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:
self
: The first integer operand.
other
: The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
inspect(42 + 1, content="43")
inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
}
(Int) -> Ops
Jump(Int
jump) =>
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
jump)
(Int) -> Ops
Fork(Int
fork) => {
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
fork)
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:
self
: The first integer operand.
other
: The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
inspect(42 + 1, content="43")
inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
}
_ => ()
}
}
let Ctx
temp = Ctx
curr
Ctx
curr = Ctx
next
Ctx
next = Ctx
temp
Ctx
next.(self : Ctx) -> Unit
reset()
}
while Ctx
curr.(self : Ctx) -> Int?
pop() is (Int) -> Int?
Some(Int
pc) {
match Array[Ops]
prog(Array[Ops], Int) -> Ops
Retrieves an element from the array at the specified index.
Parameters:
array
: The array to get the element from.
index
: The position in the array from which to retrieve the element.
Returns the element at the specified index.
Throws a panic if the index is negative or greater than or equal to the
length of the array.
Example:
let arr = [1, 2, 3]
inspect(arr[1], content="2")
[pc] {
Ops
Done => return true
(Int) -> Ops
Jump(Int
x) => Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
x)
(Int) -> Ops
Fork(Int
x) => {
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
x)
Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:
self
: The first integer operand.
other
: The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
inspect(42 + 1, content="43")
inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
}
_ => ()
}
}
false
}
In the original blog post, Rob Pike uses a recursive function to handle Fork
and Jump
instructions so that threads are executed according to their priorities. Instead, we use a stack-like structure to manage all threads of execution, which naturally respects thread priority:
struct Ctx {
@deque.Deque[Int]
deque : type @deque.Deque[A]
@deque.Deque[Int
Int]
FixedArray[Bool]
visit : type FixedArray[A]
FixedArray[Bool
Bool]
}
fn struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Int
length : Int
Int) -> struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx {
{ @deque.Deque[Int]
deque: (capacity? : Int) -> @deque.Deque[Int]
Creates a new empty deque with an optional initial capacity.
Parameters:
capacity
: The initial capacity of the deque. If not specified, defaults
to 0 and will be automatically adjusted as elements are added.
Returns a new empty deque of type T[A]
where A
is the type of elements
the deque will hold.
Example
let dq : @deque.Deque[Int] = @deque.new()
inspect(dq.length(), content="0")
inspect(dq.capacity(), content="0")
let dq : @deque.Deque[Int] = @deque.new(capacity=10)
inspect(dq.length(), content="0")
inspect(dq.capacity(), content="10")
@deque.new(), FixedArray[Bool]
visit: type FixedArray[A]
FixedArray::(len : Int, init : Bool) -> FixedArray[Bool]
Creates a new fixed-size array with the specified length, initializing all
elements with the given value.
Parameters:
length
: The length of the array to create. Must be non-negative.
initial_value
: The value used to initialize all elements in the array.
Returns a new fixed-size array of type FixedArray[T]
with length
elements, where each element is initialized to initial_value
.
Throws a panic if length
is negative.
Example:
let arr = FixedArray::make(3, 42)
inspect(arr[0], content="42")
inspect(arr.length(), content="3")
WARNING: A common pitfall is creating with the same initial value, for example:
let two_dimension_array = FixedArray::make(10, FixedArray::make(10, 0))
two_dimension_array[0][5] = 10
assert_eq(two_dimension_array[5][5], 10)
This is because all the cells reference to the same object (the FixedArray[Int] in this case).
One should use makei() instead which creates an object for each index.
make(Int
length, false) }
}
fn struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(self : Ctx, pc : Int) -> Unit
add(Ctx
self : struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx, Int
pc : Int
Int) -> Unit
Unit {
if Bool
!Ctx
selfBool
.FixedArray[Bool]
visit(FixedArray[Bool], Int) -> Bool
Retrieves an element at the specified index from a fixed-size array. This
function implements the array indexing operator []
.
Parameters:
array
: The fixed-size array to access.
index
: The position in the array from which to retrieve the element.
Returns the element at the specified index.
Throws a runtime error if the index is out of bounds (negative or greater
than or equal to the length of the array).
Example:
let arr = FixedArray::make(3, 42)
inspect(arr[1], content="42")
[Bool
pc] {
Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int], value : Int) -> Unit
Adds an element to the back of the deque.
If the deque is at capacity, it will be reallocated.
Example
let dv = @deque.of([1, 2, 3, 4, 5])
dv.push_back(6)
assert_eq(dv.back(), Some(6))
push_back(Int
pc)
Ctx
self.FixedArray[Bool]
visit(FixedArray[Bool], Int, Bool) -> Unit
Sets a value at the specified index in a fixed-size array. The original value
at that index is overwritten.
Parameters:
array
: The fixed-size array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws a runtime error if the index is out of bounds (less than 0 or greater
than or equal to the length of the array).
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
[pc] = true
}
}
fn struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(self : Ctx) -> Int?
pop(Ctx
self : struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx) -> Int
Int? {
match Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int]) -> Int?
Removes a back element from a deque and returns it, or None
if it is empty.
Example
let dv = @deque.of([1, 2, 3, 4, 5])
assert_eq(dv.pop_back(), Some(5))
pop_back() {
(Int) -> Int?
Some(Int
pc) => {
Ctx
self.FixedArray[Bool]
visit(FixedArray[Bool], Int, Bool) -> Unit
Sets a value at the specified index in a fixed-size array. The original value
at that index is overwritten.
Parameters:
array
: The fixed-size array to modify.
index
: The position in the array where the value will be set.
value
: The new value to assign at the specified index.
Throws a runtime error if the index is out of bounds (less than 0 or greater
than or equal to the length of the array).
Example:
let arr = [1, 2, 3]
arr[1] = 42
inspect(arr, content="[1, 42, 3]")
[pc] = false
(Int) -> Int?
Some(Int
pc)
}
Int?
None => Int?
None
}
}
fn struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx::(self : Ctx) -> Unit
reset(Ctx
self : struct Ctx {
deque: @deque.Deque[Int]
visit: FixedArray[Bool]
}
Ctx) -> Unit
Unit {
Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int]) -> Unit
Clears the deque, removing all values.
This method has no effect on the allocated capacity of the deque, only setting the length to 0.
Example
let dv = @deque.of([1, 2, 3, 4, 5])
dv.clear()
inspect(dv.length(), content="0")
clear()
Ctx
self.FixedArray[Bool]
visit.(self : FixedArray[Bool], value : Bool, start? : Int, end? : Int) -> Unit
Fill the array with a given value.
This method fills all or part of a FixedArray with the given value.
Parameters
value
: The value to fill the array with
start
: The starting index (inclusive, default: 0)
end
: The ending index (exclusive, optional)
If end
is not provided, fills from start
to the end of the array.
If start
equals end
, no elements are modified.
Panics
- Panics if
start
is negative or greater than or equal to the array length
- Panics if
end
is provided and is less than start
or greater than array length
- Does nothing if the array is empty
Example
// Fill entire array
let fa : FixedArray[Int] = [0, 0, 0, 0, 0]
fa.fill(3)
inspect(fa, content="[3, 3, 3, 3, 3]")
// Fill from index 1 to 3 (exclusive)
let fa2 : FixedArray[Int] = [0, 0, 0, 0, 0]
fa2.fill(9, start=1, end=3)
inspect(fa2, content="[0, 9, 9, 0, 0]")
// Fill from index 2 to end
let fa3 : FixedArray[String] = ["a", "b", "c", "d"]
fa3.fill("x", start=2)
inspect(fa3, content=(
#|["a", "b", "x", "x"]
))
fill(false)
}
The visit
array is used to drop low-priority threads. When a new thread is added, we first check if it is already in the deque
using the visit
array. If it is, we drop it; otherwise, we add it to the deque
and mark it as visited. This mechanism is necessary to avoid infinite loops or exponential blowup when the regex contains patterns that can be expanded indefinitely, such as (a?)*
.
Benchmarks and Performance Analysis
The benchmark demonstrates both approaches on a pathological case that challenges many regex implementations:
test (@bench.T
b : type @bench.T
@bench.T) {
let Int
n = 15
let String
txt = "a".(self : String, n : Int) -> String
Returns a new string with self
repeated n
times.
repeat(Int
n)
let Ast
chr = enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(chr : Char) -> Ast
chr('a')
let Ast
ast : enum Ast {
Chr(Char)
Seq(Ast, Ast)
Rep(Ast, Int?)
Opt(Ast)
} derive(Show, Hash, Eq)
Ast = Ast
chr.(self : Ast) -> Ast
opt().(self : Ast, n~ : Int) -> Ast
rep(Int
n~).(self : Ast, other : Ast) -> Ast
seq(Ast
chr.(self : Ast, n~ : Int) -> Ast
rep(Int
n~))
let Exp
exp = enum Exp {
Nil
Eps
Chr(Char)
Alt(Exp, Exp)
Seq(Exp, Exp)
Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
ast)
@bench.T
b.(self : @bench.T, name~ : String, f : () -> Unit, count? : UInt) -> Unit
Run a benchmark in batch mode
bench(String
name="derive", () => Exp
exp.(self : Exp, s : String) -> Bool
matches(String
txt) |> (t : Bool) -> Unit
Evaluates an expression and discards its result. This is useful when you want
to execute an expression for its side effects but don't care about its return
value, or when you want to explicitly indicate that a value is intentionally
unused.
Parameters:
value
: The value to be ignored. Can be of any type.
Example:
let x = 42
ignore(x) // Explicitly ignore the value
let mut sum = 0
ignore([1, 2, 3].iter().each((x) => { sum = sum + x })) // Ignore the Unit return value of each()
ignore())
let Prg
tvm = type Prg Array[Ops] derive(Show)
Prg::(ast : Ast) -> Prg
of_ast(Ast
ast)
@bench.T
b.(self : @bench.T, name~ : String, f : () -> Unit, count? : UInt) -> Unit
Run a benchmark in batch mode
bench(String
name="thompson", () => Prg
tvm.(self : Prg, data : @string.View) -> Bool
matches(String
txt) |> (t : Bool) -> Unit
Evaluates an expression and discards its result. This is useful when you want
to execute an expression for its side effects but don't care about its return
value, or when you want to explicitly indicate that a value is intentionally
unused.
Parameters:
value
: The value to be ignored. Can be of any type.
Example:
let x = 42
ignore(x) // Explicitly ignore the value
let mut sum = 0
ignore([1, 2, 3].iter().each((x) => { sum = sum + x })) // Ignore the Unit return value of each()
ignore())
}
This pattern (a?){n}a{n}
represents a classical exponential blowup case for backtracking engines. The pattern allows n different ways to match n 'a' characters, creating exponential search spaces in naive implementations.
name time (mean ± σ) range (min … max)
derive 41.78 µs ± 0.14 µs 41.61 µs … 42.13 µs in 10 × 2359 runs
thompson 12.79 µs ± 0.04 µs 12.74 µs … 12.84 µs in 10 × 7815 runs
The benchmark results show that the VM approach is significantly faster than the derivative-based approach for this case. The derivative method frequently allocates intermediate regex structures, leading to higher overhead and slower performance. In contrast, the VM executes a fixed set of instructions and rarely allocates new structures once the deque grows to its full size.
However, the derivative approach is easier to reason about. We can easily prove termination of the algorithm, as the number of derivatives to be computed is bounded by the size of the AST and strictly decreases with each recursive application of the deriv
function. The VM approach, on the other hand, can potentially run indefinitely if the input Prg
contains infinite loops, and requires careful handling of thread priority to avoid infinite loops and exponential blowup in the number of threads.