Two Approaches to Regex Engines: Derivative and Thompson VM

September 10, 2025 · 11 min read

Regular expression engines can be implemented using fundamentally different approaches, each with distinct trade-offs in performance, memory usage, and implementation complexity. This article explores two mathematically equivalent but practically different methods for regex matching: Brzozowski derivatives and Thompson's virtual machine approach.

Both methods operate on the same abstract syntax tree representation, providing a unified foundation for direct performance comparison. The key insight is how these seemingly different approaches solve identical problems through different computational strategies—one through algebraic transformation, the other through program execution.

Conventions & Definitions

To establish a common foundation, both regex engines start with a shared AST representation that captures the essential structure of regular expressions in a tree format:

enum Ast {
  (Char) -> Ast
Chr(Char
Char)
  (Ast, Ast) -> Ast
Seq(enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast, enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast)
  (Ast, Int?) -> Ast
Rep(enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast, Int
Int?)
  (Ast) -> Ast
Opt(enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast)
} derive(trait Show {
  output(Self, &Logger) -> Unit
  to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait Hash {
  hash_combine(Self, Hasher) -> Unit
  hash(Self) -> Int
}
Trait for types that can be hashed
The hash method should return a hash value for the type, which is used in hash tables and other data structures.
The hash_combine method is used to combine the hash of the current value with another hash value,
typically used to hash composite types.
When two values are equal according to the Eq trait, they should produce the same hash value.
The hash method does not need to be implemented if hash_combine is implemented,
When implemented separately, hash does not need to produce a hash value that is consistent with hash_combine.
Hash, trait Eq {
  equal(Self, Self) -> Bool
  op_equal(Self, Self) -> Bool
}
Trait for types whose elements can test for equality
Eq)

Additionally, we provide smart constructors to simplify regex construction:

fn enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(chr : Char) -> Ast
chr(Char
chr : Char
Char) -> enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast {
  (Char) -> Ast
Chr(Char
chr)
}

fn enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(self : Ast, other : Ast) -> Ast
seq(Ast
self : enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast, Ast
other : enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast {
  (Ast, Ast) -> Ast
Seq(Ast
self, Ast
other)
}

fn enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(self : Ast, n? : Int) -> Ast
rep(Ast
self : enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast, Int?
n? : Int
Int) -> enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast {
  (Ast, Int?) -> Ast
Rep(Ast
self, Int?
n)
}

fn enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(self : Ast) -> Ast
opt(Ast
self : enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast {
  Unit
@fs.
  (Ast) -> Ast
Opt(Ast
self)
}

The AST defines four fundamental regex operations:

Chr(Char) matches a single literal character.
Seq(Ast, Ast) matches one pattern followed by another through concatenation.
Rep(Ast, Int?) repeats a pattern either unlimited times when None or exactly n times when Some(n).
Opt(Ast) makes a pattern optional, equivalent to pattern? in standard regex syntax.

For example, we can build the regex (ab*)?—an optional sequence of 'a' followed by zero or more 'b's—as:

Ast::chr('a').seq(Ast::chr('b').rep()).opt()

Brzozowski Derivative

The derivative-based approach transforms regular expressions algebraically using formal language theory. For each input character, it computes the "derivative" of the regex by asking: "what remains to be matched after consuming this character?" This creates a new regex representing the remaining pattern.

We extend the basic Ast type to represent derivatives and nullability explicitly:

enum Exp {
  Exp
Nil
  Exp
Eps
  (Char) -> Exp
Chr(Char
Char)
  (Exp, Exp) -> Exp
Alt(enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp)
  (Exp, Exp) -> Exp
Seq(enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp)
  (Exp) -> Exp
Rep(enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp)
} derive(trait Show {
  output(Self, &Logger) -> Unit
  to_string(Self) -> String
}
Trait for types that can be converted to String
Show, trait Hash {
  hash_combine(Self, Hasher) -> Unit
  hash(Self) -> Int
}
Trait for types that can be hashed
The hash method should return a hash value for the type, which is used in hash tables and other data structures.
The hash_combine method is used to combine the hash of the current value with another hash value,
typically used to hash composite types.
When two values are equal according to the Eq trait, they should produce the same hash value.
The hash method does not need to be implemented if hash_combine is implemented,
When implemented separately, hash does not need to produce a hash value that is consistent with hash_combine.
Hash, trait Eq {
  equal(Self, Self) -> Bool
  op_equal(Self, Self) -> Bool
}
Trait for types whose elements can test for equality
Eq, trait Compare {
  compare(Self, Self) -> Int
}
Trait for types whose elements are ordered
The return value of [compare] is:

zero, if the two arguments are equal
negative, if the first argument is smaller
positive, if the first argument is greater
Compare)

The constructors in Exp represent:

Nil represents an impossible pattern that can never match anything.
Eps matches the empty string.
Chr(Char) matches a single character.
Alt(Exp, Exp) represents alternation, providing choice between patterns.
Seq(Exp, Exp) represents concatenation of two patterns.
Rep(Exp) represents repetition of a pattern.

We use the Exp::of_ast function to convert the Ast into the more expressive Exp format:

fn enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
ast : enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp {
  match Ast
ast {
    (Char) -> Ast
Chr(Char
c) => (Char) -> Exp
Chr(Char
c)
    (Ast, Ast) -> Ast
Seq(Ast
a, Ast
b) => (Exp, Exp) -> Exp
Seq(enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a), enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
b))
    (Ast, Int?) -> Ast
Rep(Ast
a, Int?
None) => (Exp) -> Exp
Rep(enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a))
    (Ast, Int?) -> Ast
Rep(Ast
a, (Int) -> Int?
Some(Int
n)) => {
      let Exp
sec = enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a)
      let mut Exp
exp = Exp
sec
      for _ in Int
1..<Int
n {
        Exp
exp = (Exp, Exp) -> Exp
Seq(Exp
exp, Exp
sec)
      }
      Exp
exp
    }
    (Ast) -> Ast
Opt(Ast
a) => (Exp, Exp) -> Exp
Alt(enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
a), Exp
Eps)
  }
}

We also provide smart constructors for Exp to simplify pattern building:

fn enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(a : Exp, b : Exp) -> Exp
seq(Exp
a : enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, Exp
b : enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp) -> enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp {
  match (Exp
a, Exp
b) {
    (Exp
Nil, _) | (_, Exp
Nil) => Exp
Nil
    (Exp
Eps, Exp
b) => Exp
b
    (Exp
a, Exp
Eps) => Exp
a
    (Exp
a, Exp
b) => (Exp, Exp) -> Exp
Seq(Exp
a, Exp
b)
  }
}

However, the smart constructor for Alt is strictly necessary—it ensures that the constructed Exp is normalized to "similarity" as mentioned in the original paper by Brzozowski. Two regexes are similar if one can be reduced to the other by applying the following rules:

\begin{align} & A \mid \emptyset &&\rightarrow A \\ & A \mid B &&\rightarrow B \mid A \\ & A \mid (B \mid C) &&\rightarrow (A \mid B) \mid C \end{align}

Therefore, we normalize the Alt construction to always use the same associativity and order of alternatives:

fn enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(a : Exp, b : Exp) -> Exp
alt(Exp
a : enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, Exp
b : enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp) -> enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp {
  match (Exp
a, Exp
b) {
    (Exp
Nil, Exp
b) => Exp
b
    (Exp
a, Exp
Nil) => Exp
a
    ((Exp, Exp) -> Exp
Alt(Exp
a, Exp
b), Exp
c) => Exp
a.(a : Exp, b : Exp) -> Exp
alt(Exp
b.(a : Exp, b : Exp) -> Exp
alt(Exp
c))
    (Exp
a, Exp
b) => {
      if Exp
a (Exp, Exp) -> Bool
automatically derived
== Exp
b {
        Exp
a
      } else if Exp
a (self_ : Exp, other : Exp) -> Bool
> Exp
b {
        (Exp, Exp) -> Exp
Alt(Exp
b, Exp
a)
      } else {
        (Exp, Exp) -> Exp
Alt(Exp
a, Exp
b)
      }
    }
  }
}

The nullable function determines if a pattern can match the empty string without consuming input:

fn enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(self : Exp) -> Bool
nullable(Exp
self : enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp) -> Bool
Bool {
  match Exp
self {
    Exp
Nil => false
    Exp
Eps => true
    (Char) -> Exp
Chr(_) => false
    (Exp, Exp) -> Exp
Alt(Exp
l, Exp
r) => Exp
l.(self : Exp) -> Bool
nullable() (Bool, Bool) -> Bool
|| Exp
r.(self : Exp) -> Bool
nullable()
    (Exp, Exp) -> Exp
Seq(Exp
l, Exp
r) => Exp
l.(self : Exp) -> Bool
nullable() (Bool, Bool) -> Bool
&& Exp
r.(self : Exp) -> Bool
nullable()
    (Exp) -> Exp
Rep(_) => true
  }
}

The deriv function computes the derivative of a pattern with respect to a character, transforming the pattern based on the rules defined in the Brzozowski derivative. We have reordered the rules to match the order in the deriv function:

\begin{align} D_{a} \emptyset &= \emptyset \\ D_{a} \epsilon &= \emptyset \\ D_{a} a &= \epsilon \\ D_{a} b &= \emptyset & \text{ for }(a \neq b) \\ D_{a} (P \mid Q) &= (D_{a} P) \mid (D_{a} Q) \\ D_{a} (P \cdot Q) &= (D_{a} P \cdot Q) \mid (\nu(P) \cdot D_{a} Q) \\ D_{a} (P\ast) &= D_{a} P \cdot P\ast \\ \end{align}

fn enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(self : Exp, c : Char) -> Exp
deriv(Exp
self : enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, Char
c : Char
Char) -> enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp {
  match Exp
self {
    Exp
Nil => Exp
self
    Exp
Eps => Exp
Nil
    (Char) -> Exp
Chr(Char
d) if Char
d (self : Char, other : Char) -> Bool
Compares two characters for equality.
Parameters:

self : The first character to compare.
other : The second character to compare.
Returns true if both characters represent the same Unicode code point,
false otherwise.
Example:
  let a = 'A'
  let b = 'A'
  let c = 'B'
  inspect(a == b, content="true")
  inspect(a == c, content="false")
== Char
c => Exp
Eps
    (Char) -> Exp
Chr(_) => Exp
Nil
    (Exp, Exp) -> Exp
Alt(Exp
l, Exp
r) => Exp
l.(self : Exp, c : Char) -> Exp
deriv(Char
c).(a : Exp, b : Exp) -> Exp
alt(Exp
r.(self : Exp, c : Char) -> Exp
deriv(Char
c))
    (Exp, Exp) -> Exp
Seq(Exp
l, Exp
r) => {
      let Exp
dl = Exp
l.(self : Exp, c : Char) -> Exp
deriv(Char
c)
      if Exp
l.(self : Exp) -> Bool
nullable() {
        Exp
dl.(a : Exp, b : Exp) -> Exp
seq(Exp
r).(a : Exp, b : Exp) -> Exp
alt(Exp
r.(self : Exp, c : Char) -> Exp
deriv(Char
c))
      } else {
        Exp
dl.(a : Exp, b : Exp) -> Exp
seq(Exp
r)
      }
    }
    (Exp) -> Exp
Rep(Exp
e) => Exp
e.(self : Exp, c : Char) -> Exp
deriv(Char
c).(a : Exp, b : Exp) -> Exp
seq(Exp
self)
  }
}

To simplify our implementation, we only perform strict matching—the pattern must match the entire input string. Therefore, we only check for nullability after the entire input has been consumed:

fn enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(self : Exp, s : String) -> Bool
matches(Exp
self : enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp, String
s : String
String) -> Bool
Bool {
  loop (Exp
self, String
s.(self : String, start_offset? : Int, end_offset? : Int) -> StringView
Creates a View into a String.
Example
  let str = "Hello🤣🤣🤣"
  let view1 = str.view()
  inspect(view1, content=
   "Hello🤣🤣🤣"
  )
  let start_offset = str.offset_of_nth_char(1).unwrap()
  let end_offset = str.offset_of_nth_char(6).unwrap() // the second emoji
  let view2 = str.view(start_offset~, end_offset~)
  inspect(view2, content=
   "ello🤣"
  )
view()) {
    (Exp
Nil, _) => {
      return false
    }
    (Exp
e, []) => {
      return Exp
e.(self : Exp) -> Bool
nullable()
    }
    (Exp
e, StringView
[Char
cStringView
, .. s]) => {
      continue (Exp
e.(self : Exp, c : Char) -> Exp
deriv(Char
c), StringView
s)
    }
  }
}

Virtual Machine

The VM approach compiles regular expressions into bytecode instructions for a simple virtual machine. This method transforms the pattern-matching problem into program execution, where the VM simulates all possible paths through a non-deterministic finite automaton simultaneously.

Ken Thompson's 1968 paper described a regex engine that compiled patterns into IBM 7094 machine code. The key insight was to avoid exponential backtracking by maintaining multiple execution threads that advance through input in lockstep, processing one character at a time across all possible matching paths.

Instruction Set and Program Representation

The VM operates on four fundamental instructions that correspond to NFA operations:

enum Ops {
  Ops
Done
  (Char) -> Ops
Char(Char
Char)
  (Int) -> Ops
Jump(Int
Int)
  (Int) -> Ops
Fork(Int
Int)
} derive(trait Show {
  output(Self, &Logger) -> Unit
  to_string(Self) -> String
}
Trait for types that can be converted to String
Show)

Each instruction serves a specific purpose in NFA simulation. Done marks successful completion of pattern matching, equivalent to Thompson's original match. Char(c) consumes input character c and advances to the next instruction. Jump(addr) provides unconditional jump to instruction at address addr (Thompson's jmp). Fork(addr) creates two execution paths—one continues to the next instruction, another jumps to addr (Thompson's split).

The Fork instruction is crucial for handling non-determinism in patterns like alternation and repetition, where multiple execution paths must be explored simultaneously. This maps directly to NFA ε-transitions, where execution can spontaneously branch without consuming input.

We define a Prg that wraps an array of instructions with convenience methods for building and manipulating bytecode programs.

struct Prg(type Array[T]
An Array is a collection of values that supports random access and can
grow in size.
Array[enum Ops {
  Done
  Char(Char)
  Jump(Int)
  Fork(Int)
} derive(Show)
Ops]) derive(trait Show {
  output(Self, &Logger) -> Unit
  to_string(Self) -> String
}
Trait for types that can be converted to String
Show)

fn type Prg Array[Ops] derive(Show)
Prg::(self : Prg, inst : Ops) -> Unit
push(Prg
self : type Prg Array[Ops] derive(Show)
Prg, Ops
inst : enum Ops {
  Done
  Char(Char)
  Jump(Int)
  Fork(Int)
} derive(Show)
Ops) -> Unit
Unit {
  Prg
self.Array[Ops]
0.(self : Array[Ops], value : Ops) -> Unit
Adds an element to the end of the array.
If the array is at capacity, it will be reallocated.
Example
  let v = []
  v.push(3)
push(Ops
inst)
}

fn type Prg Array[Ops] derive(Show)
Prg::(self : Prg) -> Int
length(Prg
self : type Prg Array[Ops] derive(Show)
Prg) -> Int
Int {
  Prg
self.Array[Ops]
0.(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:

array : The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
  let arr = [1, 2, 3]
  inspect(arr.length(), content="3")
  let empty : Array[Int] = []
  inspect(empty.length(), content="0")
length()
}

fn type Prg Array[Ops] derive(Show)
Prg::(self : Prg, index : Int, inst : Ops) -> Unit
op_set(Prg
self : type Prg Array[Ops] derive(Show)
Prg, Int
index : Int
Int, Ops
inst : enum Ops {
  Done
  Char(Char)
  Jump(Int)
  Fork(Int)
} derive(Show)
Ops) -> Unit
Unit {
  Prg
self.Array[Ops]
Sets the element at the specified index in the array to a new value. The
original value at that index is overwritten.
Parameters:

array : The array to modify.
index : The position in the array where the value will be set.
value : The new value to assign at the specified index.
Throws an error if index is negative or greater than or equal to the length
of the array.
Example:
  let arr = [1, 2, 3]
  arr[1] = 42
  inspect(arr, content="[1, 42, 3]")
0[Int
index] = Ops
inst
}

AST Compilation to Bytecode

The Prg::of_ast function translates AST patterns into VM instructions using standard NFA construction techniques:

Seq(a, b):
```
code for a
code for b
```

Rep(a, None) (unbounded repetition):

    Fork L1, L2
L1: code for a
    Jump L1
L2:

Rep(a, Some(n)) (fixed repetition):

code for a
code for a
... (n times) ...

Opt(a) (optional):
```
    Fork L1, L2
L1: code for a
L2:
```

Note that the Fork constructor only accepts one address, because we always want to proceed to the next instruction after the Fork.

fn type Prg Array[Ops] derive(Show)
Prg::(ast : Ast) -> Prg
of_ast(Ast
ast : enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> type Prg Array[Ops] derive(Show)
Prg {
  fn (Prg, Ast) -> Unit
compile(Prg
prog : type Prg Array[Ops] derive(Show)
Prg, Ast
ast : enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast) -> Unit
Unit {
    match Ast
ast {
      (Char) -> Ast
Chr(Char
chr) => Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Char) -> Ops
Char(Char
chr))
      (Ast, Ast) -> Ast
Seq(Ast
l, Ast
r) => {
        (Prg, Ast) -> Unit
compile(Prg
prog, Ast
l)
        (Prg, Ast) -> Unit
compile(Prg
prog, Ast
r)
      }
      (Ast, Int?) -> Ast
Rep(Ast
e, Int?
None) => {
        let Int
fork = Prg
prog.(self : Prg) -> Int
length()
        Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Fork(0))
        (Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
        Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Jump(Int
fork))
        Prg
prog(Prg, Int, Ops) -> Unit
[fork] = (Int) -> Ops
Fork(Prg
prog.(self : Prg) -> Int
length())
      }
      (Ast, Int?) -> Ast
Rep(Ast
e, (Int) -> Int?
Some(Int
n)) =>
        for _ in Int
0..<Int
n {
          (Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
        }
      (Ast) -> Ast
Opt(Ast
e) => {
        let Int
fork_inst = Prg
prog.(self : Prg) -> Int
length()
        Prg
prog.(self : Prg, inst : Ops) -> Unit
push((Int) -> Ops
Fork(0))
        (Prg, Ast) -> Unit
compile(Prg
prog, Ast
e)
        Prg
prog(Prg, Int, Ops) -> Unit
[fork_inst] = (Int) -> Ops
Fork(Prg
prog.(self : Prg) -> Int
length())
      }
    }
  }

  let Prg
prog : type Prg Array[Ops] derive(Show)
Prg = []
  (Prg, Ast) -> Unit
compile(Prg
prog, Ast
ast)
  Prg
prog.(self : Prg, inst : Ops) -> Unit
push(Ops
Done)
  Prg
prog
}

VM Execution Loop

In Rob Pike's implementation, the VM executes one-past the end of the input string to handle the final acceptance state. To make this explicit, our matches function implements the core VM execution loop using a two-phase approach:

Phase 1 handles character processing. For each input character, it processes all active threads in the current context. Char instructions that match the current character create new threads in the next context. Jump and Fork instructions immediately spawn new threads in the current context. After processing all threads, it swaps contexts and continues with the next character.

Phase 2 handles final acceptance. After consuming all input, it processes remaining threads looking for Done instructions. It handles any final Jump/Fork instructions that don't consume input. It returns true if any thread reaches a Done instruction.

fn type Prg Array[Ops] derive(Show)
Prg::(self : Prg, data : StringView) -> Bool
matches(Prg
self : type Prg Array[Ops] derive(Show)
Prg, StringView
data : type StringView
StringView represents a view of a String that maintains proper Unicode
character boundaries. It allows safe access to a substring while handling
multi-byte characters correctly.
@string.View) -> Bool
Bool {
  let (Array[Ops]) -> Prg
Prg(Array[Ops]
prog) = Prg
self
  let mut Ctx
curr = struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Array[Ops]
prog.(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:

array : The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
  let arr = [1, 2, 3]
  inspect(arr.length(), content="3")
  let empty : Array[Int] = []
  inspect(empty.length(), content="0")
length())
  let mut Ctx
next = struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Array[Ops]
prog.(self : Array[Ops]) -> Int
Returns the number of elements in the array.
Parameters:

array : The array whose length is to be determined.
Returns the number of elements in the array as an integer.
Example:
  let arr = [1, 2, 3]
  inspect(arr.length(), content="3")
  let empty : Array[Int] = []
  inspect(empty.length(), content="0")
length())
  Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(0)
  for Char
c in StringView
data {
    while Ctx
curr.(self : Ctx) -> Int?
pop() is (Int) -> Int?
Some(Int
pc) {
      match Array[Ops]
prog[Int
pc] {
        Ops
Done => ()
        (Char) -> Ops
Char(Char
char) if Char
char (self : Char, other : Char) -> Bool
Compares two characters for equality.
Parameters:

self : The first character to compare.
other : The second character to compare.
Returns true if both characters represent the same Unicode code point,
false otherwise.
Example:
  let a = 'A'
  let b = 'A'
  let c = 'B'
  inspect(a == b, content="true")
  inspect(a == c, content="false")
== Char
c => {
          Ctx
next.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:

self : The first integer operand.
other : The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
  inspect(42 + 1, content="43")
  inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
        }
        (Int) -> Ops
Jump(Int
jump) =>
          Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
jump)
        (Int) -> Ops
Fork(Int
fork) => {
          Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
fork)
          Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:

self : The first integer operand.
other : The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
  inspect(42 + 1, content="43")
  inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
        }
        _ => ()
      }
    }
    let Ctx
temp = Ctx
curr
    Ctx
curr = Ctx
next
    Ctx
next = Ctx
temp
    Ctx
next.(self : Ctx) -> Unit
reset()
  }
  while Ctx
curr.(self : Ctx) -> Int?
pop() is (Int) -> Int?
Some(Int
pc) {
    match Array[Ops]
prog[Int
pc] {
      Ops
Done => return true
      (Int) -> Ops
Jump(Int
x) => Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
x)
      (Int) -> Ops
Fork(Int
x) => {
        Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
x)
        Ctx
curr.(self : Ctx, pc : Int) -> Unit
add(Int
pc (self : Int, other : Int) -> Int
Adds two 32-bit signed integers. Performs two's complement arithmetic, which
means the operation will wrap around if the result exceeds the range of a
32-bit integer.
Parameters:

self : The first integer operand.
other : The second integer operand.
Returns a new integer that is the sum of the two operands. If the
mathematical sum exceeds the range of a 32-bit integer (-2,147,483,648 to
2,147,483,647), the result wraps around according to two's complement rules.
Example:
  inspect(42 + 1, content="43")
  inspect(2147483647 + 1, content="-2147483648") // Overflow wraps around to minimum value
+ 1)
      }
      _ => ()
    }
  }
  false
}

In the original blog post, Rob Pike uses a recursive function to handle Fork and Jump instructions so that threads are executed according to their priorities. Instead, we use a stack-like structure to manage all threads of execution, which naturally respects thread priority:

struct Ctx {
  @deque.Deque[Int]
deque : type @deque.Deque[A]
@deque.Deque[Int
Int]
  FixedArray[Bool]
visit : type FixedArray[A]
FixedArray[Bool
Bool]
}

fn struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx::(length : Int) -> Ctx
new(Int
length : Int
Int) -> struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx {
  { @deque.Deque[Int]
deque: (capacity? : Int) -> @deque.Deque[Int]
Creates a new empty deque with an optional initial capacity.
Parameters:

capacity : The initial capacity of the deque. If not specified, defaults
to 0 and will be automatically adjusted as elements are added.
Returns a new empty deque of type T[A] where A is the type of elements
the deque will hold.
Example
  let dq : @deque.Deque[Int] = @deque.new()
  inspect(dq.length(), content="0")
  inspect(dq.capacity(), content="0")

  let dq : @deque.Deque[Int] = @deque.new(capacity=10)
  inspect(dq.length(), content="0")
  inspect(dq.capacity(), content="10")
@deque.new(), FixedArray[Bool]
visit: type FixedArray[A]
FixedArray::(len : Int, init : Bool) -> FixedArray[Bool]
Creates a new fixed-size array with the specified length, initializing all
elements with the given value.
Parameters:

length : The length of the array to create. Must be non-negative.
initial_value : The value used to initialize all elements in the array.
Returns a new fixed-size array of type FixedArray[T] with length
elements, where each element is initialized to initial_value.
Throws a panic if length is negative.
Example:
  let arr = FixedArray::make(3, 42)
  inspect(arr[0], content="42")
  inspect(arr.length(), content="3")
WARNING: A common pitfall is creating with the same initial value, for example:
  let two_dimension_array = FixedArray::make(10, FixedArray::make(10, 0))
  two_dimension_array[0][5] = 10
  assert_eq(two_dimension_array[5][5], 10)
This is because all the cells reference to the same object (the FixedArray[Int] in this case).
One should use makei() instead which creates an object for each index.
make(Int
length, false) }
}

fn struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx::(self : Ctx, pc : Int) -> Unit
add(Ctx
self : struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx, Int
pc : Int
Int) -> Unit
Unit {
  if Bool
!Ctx
selfBool
.FixedArray[Bool]
visitBool
[Int
pcBool
] {
    Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int], value : Int) -> Unit
Adds an element to the back of the deque.
If the deque is at capacity, it will be reallocated.
Example
  let dv = @deque.of([1, 2, 3, 4, 5])
  dv.push_back(6)
  assert_eq(dv.back(), Some(6))
push_back(Int
pc)
    Ctx
self.FixedArray[Bool]
visit[Int
pc] = true
  }
}

fn struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx::(self : Ctx) -> Int?
pop(Ctx
self : struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx) -> Int
Int? {
  match Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int]) -> Int?
Removes a back element from a deque and returns it, or None if it is empty.
Example
  let dv = @deque.of([1, 2, 3, 4, 5])
  assert_eq(dv.pop_back(), Some(5))
pop_back() {
    (Int) -> Int?
Some(Int
pc) => {
      Ctx
self.FixedArray[Bool]
visit[Int
pc] = false
      (Int) -> Int?
Some(Int
pc)
    }
    Int?
None => Int?
None
  }
}

fn struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx::(self : Ctx) -> Unit
reset(Ctx
self : struct Ctx {
  deque: @deque.Deque[Int]
  visit: FixedArray[Bool]
}
Ctx) -> Unit
Unit {
  Ctx
self.@deque.Deque[Int]
deque.(self : @deque.Deque[Int]) -> Unit
Clears the deque, removing all values.
This method has no effect on the allocated capacity of the deque, only setting the length to 0.
Example
  let dv = @deque.of([1, 2, 3, 4, 5])
  dv.clear()
  inspect(dv.length(), content="0")
clear()
  Ctx
self.FixedArray[Bool]
visit.(self : FixedArray[Bool], value : Bool, start? : Int, end? : Int) -> Unit
Fill the array with a given value.
This method fills all or part of a FixedArray with the given value.
Parameters

value: The value to fill the array with
start: The starting index (inclusive, default: 0)
end: The ending index (exclusive, optional)
If end is not provided, fills from start to the end of the array.
If start equals end, no elements are modified.
Panics

Panics if start is negative or greater than or equal to the array length
Panics if end is provided and is less than start or greater than array length
Does nothing if the array is empty
Example
// Fill entire array
let fa : FixedArray[Int] = [0, 0, 0, 0, 0]
fa.fill(3)
inspect(fa, content="[3, 3, 3, 3, 3]")

// Fill from index 1 to 3 (exclusive)
let fa2 : FixedArray[Int] = [0, 0, 0, 0, 0]
fa2.fill(9, start=1, end=3)
inspect(fa2, content="[0, 9, 9, 0, 0]")

// Fill from index 2 to end
let fa3 : FixedArray[String] = ["a", "b", "c", "d"]
fa3.fill("x", start=2)
inspect(fa3, content=(
  #|["a", "b", "x", "x"]
))
fill(false)
}

The visit array is used to drop low-priority threads. When a new thread is added, we first check if it is already in the deque using the visit array. If it is, we drop it; otherwise, we add it to the deque and mark it as visited. This mechanism is necessary to avoid infinite loops or exponential blowup when the regex contains patterns that can be expanded indefinitely, such as (a?)*.

Benchmarks and Performance Analysis

The benchmark demonstrates both approaches on a pathological case that challenges many regex implementations:

test (@bench.Bench
b : type @bench.Bench
@bench.T) {
  let Int
n = 15
  let String
txt = "a".(self : String, n : Int) -> String
Returns a new string with self repeated n times.
repeat(Int
n)
  let Ast
chr = enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast::(chr : Char) -> Ast
chr('a')
  let Ast
ast : enum Ast {
  Chr(Char)
  Seq(Ast, Ast)
  Rep(Ast, Int?)
  Opt(Ast)
} derive(Show, Hash, Eq)
Ast = Ast
chr.(self : Ast) -> Ast
opt().(self : Ast, n~ : Int) -> Ast
rep(Int
n~).(self : Ast, other : Ast) -> Ast
seq(Ast
chr.(self : Ast, n~ : Int) -> Ast
rep(Int
n~))
  let Exp
exp = enum Exp {
  Nil
  Eps
  Chr(Char)
  Alt(Exp, Exp)
  Seq(Exp, Exp)
  Rep(Exp)
} derive(Show, Hash, Eq, Compare)
Exp::(ast : Ast) -> Exp
of_ast(Ast
ast)
  @bench.Bench
b.(self : @bench.Bench, name~ : String, f : () -> Unit, count? : UInt) -> Unit
Run a benchmark in batch mode
bench(String
name="derive", () => Exp
exp.(self : Exp, s : String) -> Bool
matches(String
txt) |> (t : Bool) -> Unit
Evaluates an expression and discards its result. This is useful when you want
to execute an expression for its side effects but don't care about its return
value, or when you want to explicitly indicate that a value is intentionally
unused.
Parameters:

value : The value to be ignored. Can be of any type.
Example:
  let x = 42
  ignore(x) // Explicitly ignore the value
  let mut sum = 0
  ignore([1, 2, 3].iter().each((x) => { sum = sum + x })) // Ignore the Unit return value of each()
ignore())
  let Prg
tvm = type Prg Array[Ops] derive(Show)
Prg::(ast : Ast) -> Prg
of_ast(Ast
ast)
  @bench.Bench
b.(self : @bench.Bench, name~ : String, f : () -> Unit, count? : UInt) -> Unit
Run a benchmark in batch mode
bench(String
name="thompson", () => Prg
tvm.(self : Prg, data : StringView) -> Bool
matches(String
txt) |> (t : Bool) -> Unit
Evaluates an expression and discards its result. This is useful when you want
to execute an expression for its side effects but don't care about its return
value, or when you want to explicitly indicate that a value is intentionally
unused.
Parameters:

value : The value to be ignored. Can be of any type.
Example:
  let x = 42
  ignore(x) // Explicitly ignore the value
  let mut sum = 0
  ignore([1, 2, 3].iter().each((x) => { sum = sum + x })) // Ignore the Unit return value of each()
ignore())
}

This pattern (a?){n}a{n} represents a classical exponential blowup case for backtracking engines. The pattern allows n different ways to match n 'a' characters, creating exponential search spaces in naive implementations.

name     time (mean ± σ)         range (min … max)
derive     41.78 µs ±   0.14 µs    41.61 µs …  42.13 µs  in 10 ×   2359 runs
thompson   12.79 µs ±   0.04 µs    12.74 µs …  12.84 µs  in 10 ×   7815 runs

The benchmark results show that the VM approach is significantly faster than the derivative-based approach for this case. The derivative method frequently allocates intermediate regex structures, leading to higher overhead and slower performance. In contrast, the VM executes a fixed set of instructions and rarely allocates new structures once the deque grows to its full size.

However, the derivative approach is easier to reason about. We can easily prove termination of the algorithm, as the number of derivatives to be computed is bounded by the size of the AST and strictly decreases with each recursive application of the deriv function. The VM approach, on the other hand, can potentially run indefinitely if the input Prg contains infinite loops, and requires careful handling of thread priority to avoid infinite loops and exponential blowup in the number of threads.

Conventions & Definitions​

Brzozowski Derivative​

Example

Virtual Machine​

Instruction Set and Program Representation​

Example

AST Compilation to Bytecode​

VM Execution Loop​

Example

Example

Example

Example

Parameters

Panics

Example

Benchmarks and Performance Analysis​

Conventions & Definitions

Brzozowski Derivative

Virtual Machine

Instruction Set and Program Representation

AST Compilation to Bytecode

VM Execution Loop

Benchmarks and Performance Analysis