Compiler: Implementation
Before we get into eval and apply, let's consider how we will support variables and lexical scoping.
Variables and Scopes
As seen in the previous chapter, variable accesses come in three types, as far as the compiler and VM are concerned: local, nonlocal and global. Each access uses a different bytecode operation, and so the compiler must be able to determine what operations to emit at compile time.
Given that we have named function parameters and `let`, we have syntax for explicit variable declaration within function definitions. This means that we can easily keep track of whether a variable reference is local, nonlocal or global.

If a variable wasn't declared as a parameter or in a `let` block, it must be global, and global variables are accessed dynamically by name.
As far as local and nonlocal variables are concerned, the VM does not care about or consider their names. At the VM level, local and nonlocal variables are numbered registers. That is, each function's local variables are mapped to a register numbered between 2 and 255. The compiler must generate the mapping from variable names to register numbers.
For generating and maintaining mappings, we need data structures for keeping track of:
- function local variables and their mappings to register numbers
- references to nonlocal variables and their relative stack offsets
- nested scopes within functions
Named variables
Our first data structure will define a register-based variable:

```rust
/// A variable is a named register. It has compile time metadata about how it is used by closures.
struct Variable {
    register: Register,
    closed_over: Cell<bool>,
}
```
For every named, non-global variable (created by defining function parameters and `let` blocks) a `Variable` instance is created in the compiler. The member `closed_over` defaults to `false`. If the compiler detects that the variable must escape the stack as part of a closure, this flag will be flipped to `true` (it cannot be set back to `false`).
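The use of `Cell<bool>` matters here: it lets the compiler flip the flag through a shared `&Variable` reference, without needing mutable access to the whole scope structure. A minimal sketch of this one-way transition (the method name `close_over` is an illustration, not necessarily the real compiler's API):

```rust
use std::cell::Cell;

type Register = u8;

struct Variable {
    register: Register,
    closed_over: Cell<bool>,
}

impl Variable {
    fn new(register: Register) -> Variable {
        Variable {
            register,
            closed_over: Cell::new(false),
        }
    }

    // Mark the variable as escaping the stack. Note &self, not &mut self:
    // Cell provides the interior mutability. There is deliberately no
    // method to unset the flag.
    fn close_over(&self) {
        self.closed_over.set(true);
    }
}

fn main() {
    let var = Variable::new(2);
    assert!(!var.closed_over.get());
    var.close_over();
    assert!(var.closed_over.get());
    println!("register {} is closed over", var.register);
}
```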
Scope structure
The data structures that manage nesting of scopes and looking up a `Variable` by name are defined here:
```rust
/// A Scope contains a set of local variable to register bindings
struct Scope {
    /// symbol -> variable mapping
    bindings: HashMap<String, Variable>,
}

/// A nonlocal reference will turn into an Upvalue at VM runtime.
/// This struct stores the non-zero frame offset and register values of a parent function call
/// frame where a binding will be located.
struct Nonlocal {
    upvalue_id: u8,
    frame_offset: u8,
    frame_register: u8,
}

/// A Variables instance represents a set of nested variable binding scopes for a single function
/// definition.
struct Variables<'parent> {
    /// The parent function's variables.
    parent: Option<&'parent Variables<'parent>>,
    /// Nested scopes, starting with parameters/arguments on the outermost scope and let scopes on
    /// the inside.
    scopes: Vec<Scope>,
    /// Mapping of referenced nonlocal nonglobal variables and their upvalue indexes and where to
    /// find them on the stack.
    nonlocals: RefCell<HashMap<String, Nonlocal>>,
    /// The next upvalue index to assign when a new nonlocal is encountered.
    next_upvalue: Cell<u8>,
}
```
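The essential behaviour of a `Scope` is assigning sequential registers to newly declared names. The method shown below (`push_bindings`, its signature, and the `lookup` helper) is a sketch for illustration, not the compiler's exact API:

```rust
use std::cell::Cell;
use std::collections::HashMap;

type Register = u8;

struct Variable {
    register: Register,
    closed_over: Cell<bool>,
}

struct Scope {
    bindings: HashMap<String, Variable>,
}

impl Scope {
    fn new() -> Scope {
        Scope {
            bindings: HashMap::new(),
        }
    }

    // Bind each name to the next free register, returning the new
    // next-free register number (hypothetical signature).
    fn push_bindings(&mut self, names: &[&str], mut next_reg: Register) -> Register {
        for name in names {
            self.bindings.insert(
                name.to_string(),
                Variable {
                    register: next_reg,
                    closed_over: Cell::new(false),
                },
            );
            next_reg += 1;
        }
        next_reg
    }

    fn lookup(&self, name: &str) -> Option<Register> {
        self.bindings.get(name).map(|v| v.register)
    }
}

fn main() {
    let mut params = Scope::new();
    // registers 0 and 1 are reserved, so locals begin at register 2
    let next = params.push_bindings(&["x", "y"], 2);
    assert_eq!(next, 4);
    assert_eq!(params.lookup("x"), Some(2));
    assert_eq!(params.lookup("y"), Some(3));
    assert_eq!(params.lookup("z"), None);
}
```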
For every function defined, the compiler maintains an instance of the type `Variables`.

Each function's `Variables` has a stack of `Scope` instances, each of which has its own set of name to `Variable` register number mappings. The outermost `Scope` contains the mapping of function parameters to registers.

When a nested function refers to a nesting function's variable, the nested function's `Variables` builds a mapping of nonlocal variable name to the relative stack position of that variable. This is a `Nonlocal` - a relative stack frame offset and the register number within that stack frame of the variable.
In summary, under these definitions:

- A `Nonlocal` instance caches the relative stack location of a nonlocal variable for compiling upvalues
- A `Scope` manages the mapping of a variable name to the `Variable` register number within a single scope
- A `Variables` instance maintains all the nested scopes for a function during compilation and caches all the nonlocal references. It also keeps a reference to a parent nesting function, if there is one, in order to handle lexically scoped lookups.
Retrieving named variables
Whenever a variable is referenced in source code, the mapping to its register must be looked up. The result of a lookup is `Option<Binding>`.
```rust
/// A binding can be either local or via an upvalue depending on how a closure refers to it.
#[derive(Copy, Clone, PartialEq)]
enum Binding {
    /// A local variable is local to a function scope
    Local(Register),
    /// An Upvalue is an indirection for pointing at a nonlocal variable on the stack
    Upvalue(UpvalueId),
}
```
The lookup process checks the local function scopes first. If the variable is declared there, the `Some(Binding::Local)` variant is returned. In terms of bytecode, this will translate to a direct register reference.

Next, any outer function scopes are searched. If the variable is found in an outer scope, the `Some(Binding::Upvalue)` variant is returned. The compiler will emit instructions to copy the value referred to by the upvalue into a function-local temporary register.

If the lookup returns `None`, a global lookup instruction is emitted that will dynamically look up the variable name in the global namespace and copy the result into a function-local temporary register, or raise an error if the binding does not exist.
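The whole three-step lookup can be sketched with a simplified model, where plain register maps stand in for `Scope` and `Variables`. This is an illustration only: the real `lookup_binding()` also records a `Nonlocal` entry and assigns a fresh upvalue id, where here the id `0` is a stand-in:

```rust
use std::collections::HashMap;

type Register = u8;

#[derive(Debug, PartialEq)]
enum Binding {
    Local(Register),
    Upvalue(u8),
}

// A stand-in for Variables: per-function nested scopes plus a parent link.
struct Frame<'p> {
    parent: Option<&'p Frame<'p>>,
    scopes: Vec<HashMap<String, Register>>,
}

impl<'p> Frame<'p> {
    fn lookup(&self, name: &str) -> Option<Binding> {
        // 1. search this function's scopes, innermost first
        for scope in self.scopes.iter().rev() {
            if let Some(&reg) = scope.get(name) {
                return Some(Binding::Local(reg));
            }
        }
        // 2. search nesting functions: a hit becomes an upvalue reference
        let mut parent = self.parent;
        while let Some(frame) = parent {
            for scope in frame.scopes.iter().rev() {
                if scope.contains_key(name) {
                    return Some(Binding::Upvalue(0));
                }
            }
            parent = frame.parent;
        }
        // 3. None: the compiler falls back to a dynamic global lookup
        None
    }
}

fn main() {
    let mut outer_scope = HashMap::new();
    outer_scope.insert("n".to_string(), 2);
    let outer = Frame { parent: None, scopes: vec![outer_scope] };

    let mut inner_scope = HashMap::new();
    inner_scope.insert("x".to_string(), 2);
    let inner = Frame { parent: Some(&outer), scopes: vec![inner_scope] };

    assert_eq!(inner.lookup("x"), Some(Binding::Local(2)));
    assert_eq!(inner.lookup("n"), Some(Binding::Upvalue(0)));
    assert_eq!(inner.lookup("undeclared"), None);
}
```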
Evaluation
We've now described much of what happens at the lower levels of eval. Let's finish the job and put eval in a code context. Here is the definition of the function compilation data structure:
```rust
struct Compiler<'parent> {
    bytecode: CellPtr<ByteCode>,
    /// Next available register slot.
    next_reg: Register,
    /// Optional function name
    name: Option<String>,
    /// Function-local nested scopes bindings list (including parameters at outer level)
    vars: Variables<'parent>,
}
```
The two interesting members are:

- `bytecode`, an instance of `ByteCode`
- `vars`, an instance of `Variables`, which we've described above. This instance will be the outermost scope of the `let` or function block being compiled.
The main entrypoint to this structure is the function `compile_function()`:
```rust
fn compile_function<'guard>(
    mut self,
    mem: &'guard MutatorView,
    name: TaggedScopedPtr<'guard>,
    params: &[TaggedScopedPtr<'guard>],
    exprs: &[TaggedScopedPtr<'guard>],
) -> Result<ScopedPtr<'guard, Function>, RuntimeError> {
    ...
}
```
This function will set up a `Variables` scope with the given parameters and call into the function `compile_eval()` for each expression in the function. The full definition of `compile_eval()` is below, and we'll go into the details of `compile_function()` later.
```rust
fn compile_eval<'guard>(
    &mut self,
    mem: &'guard MutatorView,
    ast_node: TaggedScopedPtr<'guard>,
) -> Result<Register, RuntimeError> {
    match *ast_node {
        Value::Pair(p) => self.compile_apply(mem, p.first.get(mem), p.second.get(mem)),

        Value::Symbol(s) => {
            match s.as_str(mem) {
                "nil" => {
                    let dest = self.acquire_reg();
                    self.push(mem, Opcode::LoadNil { dest })?;
                    Ok(dest)
                }

                "true" => self.push_load_literal(mem, mem.lookup_sym("true")),

                // Search scopes for a binding; if none do a global lookup
                _ => {
                    match self.vars.lookup_binding(ast_node)? {
                        Some(Binding::Local(register)) => Ok(register),

                        Some(Binding::Upvalue(upvalue_id)) => {
                            // Retrieve the value via Upvalue indirection
                            let dest = self.acquire_reg();
                            self.push(
                                mem,
                                Opcode::GetUpvalue {
                                    dest,
                                    src: upvalue_id,
                                },
                            )?;
                            Ok(dest)
                        }

                        None => {
                            // Otherwise do a late-binding global lookup
                            let name = self.push_load_literal(mem, ast_node)?;
                            let dest = name; // reuse the register
                            self.push(mem, Opcode::LoadGlobal { dest, name })?;
                            Ok(dest)
                        }
                    }
                }
            }
        }

        _ => self.push_load_literal(mem, ast_node),
    }
}
```
Note that the return type is `Result<Register, RuntimeError>`. That is, a successful eval will return a register where the result will be stored at runtime.
In the function body, the match branches fall into three categories:

- keyword literals (`nil`, `true`)
- all other literals
- named variables, represented by `Symbol`s
What's in the evaluation of the `Symbol` AST type? Locals, nonlocals and globals!

We can see the generation of special opcodes for retrieving nonlocal and global values here, whereas a local resolves directly to an existing register without the need to generate any additional opcodes.
Application
To evaluate a function call, we switch over to apply:

```rust
match *ast_node {
    ...
    Value::Pair(p) => self.compile_apply(mem, p.first.get(mem), p.second.get(mem)),
    ...
}
```
This is the evaluation of the `Pair` AST type. Visually, it represents the syntax `(function_name arg1 arg2 argN)`, which is, of course, a function call. Eval cannot tell us the value of a function call; the function must be applied to its arguments first. Into apply we recurse.
The first argument to `compile_apply()` is the function name `Symbol`; the second argument is the list of function arguments. Since we included the full `compile_eval()` function earlier, it wouldn't be fair to leave out the definition of `compile_apply()`. Here it is:
```rust
fn compile_apply<'guard>(
    &mut self,
    mem: &'guard MutatorView,
    function: TaggedScopedPtr<'guard>,
    args: TaggedScopedPtr<'guard>,
) -> Result<Register, RuntimeError> {
    match *function {
        Value::Symbol(s) => match s.as_str(mem) {
            "quote" => self.push_load_literal(mem, value_from_1_pair(mem, args)?),
            "atom?" => self.push_op2(mem, args, |dest, test| Opcode::IsAtom { dest, test }),
            "nil?" => self.push_op2(mem, args, |dest, test| Opcode::IsNil { dest, test }),
            "car" => self.push_op2(mem, args, |dest, reg| Opcode::FirstOfPair { dest, reg }),
            "cdr" => self.push_op2(mem, args, |dest, reg| Opcode::SecondOfPair { dest, reg }),
            "cons" => self.push_op3(mem, args, |dest, reg1, reg2| Opcode::MakePair {
                dest,
                reg1,
                reg2,
            }),
            "cond" => self.compile_apply_cond(mem, args),
            "is?" => self.push_op3(mem, args, |dest, test1, test2| Opcode::IsIdentical {
                dest,
                test1,
                test2,
            }),
            "set" => self.compile_apply_assign(mem, args),
            "def" => self.compile_named_function(mem, args),
            "lambda" => self.compile_anonymous_function(mem, args),
            "\\" => self.compile_anonymous_function(mem, args),
            "let" => self.compile_apply_let(mem, args),
            _ => self.compile_apply_call(mem, function, args),
        },

        // Here we allow the value in the function position to be evaluated dynamically
        _ => self.compile_apply_call(mem, function, args),
    }
}
```
The `function` parameter is expected to be a `Symbol` - that is, to have a name represented by a `Symbol`. Thus, the function is `match`ed on that `Symbol`.
Calling `nil?`
Let's follow the compilation of a simple function: `nil?`. This is where we'll start seeing some of the deeper details of compilation, such as register allocation.

```rust
...
"nil?" => self.push_op2(mem, args, |dest, test| Opcode::IsNil { dest, test }),
...
```
The function `nil?` takes a single argument and returns:

- the `Symbol` for `true` if the value of the argument is `nil`
- `nil` if the argument is not `nil`
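At runtime, the `IsNil` opcode that gets emitted for `nil?` conceptually performs this test. The sketch below assumes a simplified `Value` type with a `String`-carrying `Symbol` variant; the real VM works with tagged pointers and an interned Symbol, but the register-in, register-out shape is the same:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Nil,
    Symbol(String),
    Number(i64),
}

// Sketch of executing `IsNil { dest, test }`: read the test register,
// write the result to the dest register.
fn is_nil(registers: &mut [Value], dest: usize, test: usize) {
    registers[dest] = match registers[test] {
        Value::Nil => Value::Symbol("true".to_string()),
        _ => Value::Nil,
    };
}

fn main() {
    let mut regs = vec![Value::Nil, Value::Number(42), Value::Nil];
    is_nil(&mut regs, 0, 1); // (nil? 42) -> nil
    assert_eq!(regs[0], Value::Nil);
    is_nil(&mut regs, 0, 2); // (nil? nil) -> true
    assert_eq!(regs[0], Value::Symbol("true".to_string()));
}
```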
In compiling this function call, a single bytecode opcode will be pushed on to the `ByteCode` array. This is done in the `Compiler::push_op2()` function. It is named `push_op2` because the opcode takes two operands: an argument register and a result destination register. This function is used to compile all simple function calls that follow the pattern of one argument, one result value. Here is `push_op2()`:
```rust
fn push_op2<'guard, F>(
    &mut self,
    mem: &'guard MutatorView,
    params: TaggedScopedPtr<'guard>,
    f: F,
) -> Result<Register, RuntimeError>
where
    F: Fn(Register, Register) -> Opcode,
{
    let result = self.acquire_reg();
    let reg1 = self.compile_eval(mem, value_from_1_pair(mem, params)?)?;
    self.bytecode.get(mem).push(mem, f(result, reg1))?;
    Ok(result)
}
```
Let's break the function body down, line by line:

- `let result = self.acquire_reg();`
  - `self.acquire_reg()` is called to get an unused register. In this case, we need a register to store the result value in. Register acquisition follows a stack approach: registers are acquired (pushed on to the stack window) as new variables are declared within a scope, and popped when the scope is exited.
  - The type of `result` is `Register`, which is an alias for `u8` - an unsigned int from 0 to 255.
- `let reg1 = self.compile_eval(mem, value_from_1_pair(mem, params)?)?;`
  - `value_from_1_pair(mem, params)?` inspects the argument list and returns the argument if there is a single one, otherwise returns an error.
  - `self.compile_eval(mem, <arg>)?` recurses into the argument to compile it down to something that can be applied to the function call.
  - `let reg1 = <value>;`, where `reg1` will be the argument register to the opcode.
- `self.bytecode.get(mem).push(mem, f(result, reg1))?;`
  - `f(result, reg1)` calls the function `f`, which returns the opcode with operands applied in `ByteCode` format.
  - In the case of calling `nil?`, the argument `f` is `|dest, test| Opcode::IsNil { dest, test }`.
  - `self.bytecode.get(mem).push(mem, <opcode>)?;` gets the `ByteCode` reference and pushes the opcode on to the end of the bytecode array.
- `Ok(result)`
  - the result register is returned to the `compile_apply()` function

... and `compile_apply()` itself returns the result register to its caller.
The pattern for compiling function application, more generally, is this:
- acquire a result register
- acquire any temporary intermediate result registers
- recurse into arguments to compile them first
- emit bytecode for the function, pushing opcodes on to the bytecode array and putting the final result in the result register
- release any intermediate registers
- return the result register number
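The acquire/release discipline above amounts to a bump counter that moves like a stack pointer within the function's register window. The sketch below illustrates the idea; the struct and method names (`RegisterStack`, `acquire`, `release_to`) are assumptions for illustration, not the compiler's actual API:

```rust
type Register = u8;

struct RegisterStack {
    next_reg: Register,
}

impl RegisterStack {
    // Acquire the next unused register in the window.
    fn acquire(&mut self) -> Register {
        let reg = self.next_reg;
        self.next_reg += 1;
        reg
    }

    // Release temporaries back down to a saved watermark.
    fn release_to(&mut self, watermark: Register) {
        self.next_reg = watermark;
    }
}

fn main() {
    let mut regs = RegisterStack { next_reg: 2 }; // 0 and 1 are reserved
    let result = regs.acquire(); // the result register stays live
    let watermark = regs.next_reg;
    let tmp1 = regs.acquire(); // temporaries for argument evaluation
    let tmp2 = regs.acquire();
    assert_eq!((result, tmp1, tmp2), (2, 3, 4));
    regs.release_to(watermark); // temporaries released...
    assert_eq!(regs.next_reg, 3); // ...but the result register is not
}
```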
Compiling `nil?` was hopefully quite simple. Let's look at something much more involved now.
Compiling anonymous functions
An anonymous function is defined, syntactically, as:

```
(lambda (param1 param2)
  (expr1)
  (expr2)
  (return-expr))
```
There are 0 or more parameters and 1 or more expressions in the body of the function. The last expression of the body provides the return value.
Function compilation is initiated by apply. This is because a function is a compound expression and cannot be reduced to a value by a single eval. Here's the line in `compile_apply()` that calls anonymous function compilation:

```rust
...
"lambda" => self.compile_anonymous_function(mem, args),
...
```
Let's look at the type signature of `compile_anonymous_function()`:

```rust
fn compile_anonymous_function<'guard>(
    &mut self,
    mem: &'guard MutatorView,
    params: TaggedScopedPtr<'guard>,
) -> Result<Register, RuntimeError> {
```

The `params` parameter is expected to be a `Pair` list: firstly, a list of parameter names, followed by the function body expressions.

The return type is the same as in all the other compilation functions so far: `Result<Register, RuntimeError>`. The compiled code will return a pointer to the function object in a register.
Here is the function in full:
```rust
fn compile_anonymous_function<'guard>(
    &mut self,
    mem: &'guard MutatorView,
    params: TaggedScopedPtr<'guard>,
) -> Result<Register, RuntimeError> {
    let items = vec_from_pairs(mem, params)?;

    if items.len() < 2 {
        return Err(err_eval(
            "An anonymous function definition must have at least (lambda (params) expr)",
        ));
    }

    // a function consists of (name (params) expr1 .. exprn)
    let fn_params = vec_from_pairs(mem, items[0])?;
    let fn_exprs = &items[1..];

    // compile the function to a Function object
    let fn_object = compile_function(mem, Some(&self.vars), mem.nil(), &fn_params, fn_exprs)?;

    // load the function object as a literal
    let dest = self.push_load_literal(mem, fn_object)?;

    // if fn_object has nonlocal refs, compile a MakeClosure instruction in addition, replacing
    // the Function register with a Partial with a closure environment
    match *fn_object {
        Value::Function(f) => {
            if f.is_closure() {
                self.push(
                    mem,
                    Opcode::MakeClosure {
                        function: dest,
                        dest,
                    },
                )?;
            }
        }

        // 's gotta be a function
        _ => unreachable!(),
    }

    Ok(dest)
}
```
After converting the `Pair` lists to `Vec`s for convenience (wherein parameter names and function body expressions are separated) the process calls into the function `compile_function()`, which brings us full circle to eval.
In `compile_function()`, below:

- a `Scope` is instantiated and the parameters are pushed on to this outermost scope
- the function body expressions are iterated over, eval-ing each one
- any upvalues that will be closed over as the compiled function exits and goes out of scope have upvalue instructions generated
- a `Function` object is returned with all the details necessary to run the function in the VM environment
Here is `compile_function()`:

```rust
fn compile_function<'guard>(
    mut self,
    mem: &'guard MutatorView,
    name: TaggedScopedPtr<'guard>,
    params: &[TaggedScopedPtr<'guard>],
    exprs: &[TaggedScopedPtr<'guard>],
) -> Result<ScopedPtr<'guard, Function>, RuntimeError> {
    // validate function name
    self.name = match *name {
        Value::Symbol(s) => Some(String::from(s.as_str(mem))),
        Value::Nil => None,
        _ => {
            return Err(err_eval(
                "A function name may be nil (anonymous) or a symbol (named)",
            ))
        }
    };
    let fn_name = name;

    // validate arity
    if params.len() > 254 {
        return Err(err_eval("A function cannot have more than 254 parameters"));
    }
    // put params into a list for the Function object
    let fn_params = List::from_slice(mem, params)?;

    // also assign params to the first level function scope and give each one a register
    let mut param_scope = Scope::new();
    self.next_reg = param_scope.push_bindings(params, self.next_reg)?;
    self.vars.scopes.push(param_scope);

    // validate expression list
    if exprs.len() == 0 {
        return Err(err_eval("A function must have at least one expression"));
    }

    // compile expressions
    let mut result_reg = 0;
    for expr in exprs.iter() {
        result_reg = self.compile_eval(mem, *expr)?;
    }

    // pop parameter scope
    let closing_instructions = self.vars.pop_scope();
    for opcode in &closing_instructions {
        self.push(mem, *opcode)?;
    }

    // finish with a return
    let fn_bytecode = self.bytecode.get(mem);
    fn_bytecode.push(mem, Opcode::Return { reg: result_reg })?;

    let fn_nonlocals = self.vars.get_nonlocals(mem)?;

    Ok(Function::alloc(
        mem,
        fn_name,
        fn_params,
        fn_bytecode,
        fn_nonlocals,
    )?)
}
```
Note that in addition to generating upvalue instructions as the compiled function goes out of scope, the calling compiler function `compile_anonymous_function()` will issue a `MakeClosure` opcode such that a closure object is put in the return register rather than a direct `Function` object reference.

In our language, a closure object is represented by the `Partial` data structure - a struct that represents a `Function` object pointer plus closed over values and/or partially applied arguments. This data structure was described in the chapter Virtual Machine: Implementation.
Thus ends our tour of the interpreter.
Concluding remarks
In this section, we've looked at a ground-up compiler and virtual machine implementation within a memory-safe allocation system.
There is, of course, much more to explore in the VM and compiler source code. The reader is encouraged to experiment with running and modifying the source.