Drawing from the VM design, the compiler must support the following language constructs:
- function definitions
- anonymous functions
- function calls
- lexical scoping
- local variables
- global variables
This is a minimally critical set of features that any further language constructs can be built on while ensuring that our compiler remains easy to understand for the purposes of this book.
Our parser, recall, reads in s-expression syntax
and produces a nested
Symbol based abstract syntax tree. Adding
other types - integers, strings, arrays etc - is mostly a matter of expanding
the parser. The compiler as described here, being for a dynamically typed
language, will support them without refactoring.
Our compiler design is based on the eval/apply pattern.
In this pattern we recursively descend into the
Pair AST, calling eval on
the root node of the expression to be compiled.
Eval is, of course, short for "evaluate" - we want to evaluate the given expression. In the case of a compiler, we don't want the result yet, rather the sequence of instructions that will generate the result.
More concretely, eval looks at the node in the AST it is given and if it resolves to fetching a value for a variable, it generates that instruction; otherwise if it is a compound expression, the arguments are evaluated and then the function and arguments are passed to apply, which generates appropriate function call instructions.
Designing an Eval function
Eval looks at the given node and attempts to generate an instruction for it that would resolve the node to a value - that is, evaluate it.
If the node is a special symbol, such as
true, then it is treated as
a literal and an instruction is generated to load that literal symbol into the
next available register.
Otherwise if the node is any other symbol, it is assumed to be bound to a value (it must be a variable) and an instruction is generated for fetching the value into a register.
Variables come in three kinds: local, nonlocal or global.
Local: the symbol has been declared earlier in the expression (either it is
a function parameter or it was declared using
let) and the compiler already
has a record of it. The symbol is already associated with a local register
index and a simple register copy instruction is generated.
Nonlocal: the symbol has been bound in a parent nesting function. Again, the compiler already has a record of the declaration, which register is associated with the symbol and which relative call frame will contain that register. An upvalue lookup instruction is generated.
Global: if the symbol isn't found as a local binding or a nonlocal binding, it is assumed to be a global, and a late-binding global lookup instruction is generated. In the event the programmer has misspelled a variable name, this is possibly the instruction that will be generated and the programmer will see an unknown-variable error at runtime.
Expressions and function calls
When eval is passed a
Pair, this represents the beginning of an expression,
a function call. A composition of things.
In s-expression syntax, all expressions and function calls looks like
(function_name arg1 arg2). That is parsed into a
Pair tree, which takes
Pair( Symbol(function_name), Pair( Symbol(arg1), Pair( Symbol(arg2), nil ) ) )
It is apply's job to handle this case, so eval extracts the first and
second values from the outermost
Pair and passes them into apply. In more
general terms, eval calls apply with the function name and the argument
list and leaves the rest up to apply.
Designing an Apply function
Apply takes a function name and a list of arguments. First it recurses into eval for each argument expression, then generates instructions to call the function with the argument results.
Functions are either built into to the language and VM or are library/user-defined functions composed of other functions.
In every case, the simplified pattern for function calls is:
- allocate a register to write the return value into
- eval each of the arguments in sequence, allocating their resulting values into consequent registers
- compile the function call opcode, giving it the number of argument registers it should expect
Compiling a call to a builtin function might translate directly to a dedicated
bytecode operation. For example, querying whether a value is
nil with builtin
nil? compiles 1:1 to a bytecode operation that directly represents
Compiling a call to a user defined function is a more involved. In it's more general form, supporting first class functions and closures, a function call requires two additional pointers to be placed in registers. The complete function call register allocation looks like this:
|0||reserved for return value|
|1||reserved for closure environment pointer|
If a closure is called, the closure object itself contains a pointer to it's
environment and the function to call and those pointers can be copied over to
registers. Otherwise, the closure environment pointer will be a
The VM, when entering a new function, will represent the return value register always as the zeroth register.
When the function call returns, all registers except the return value are discarded.
Let's look at a simple function definition:
(def is_true (x) (is? x true))
This function has a name
is_true, takes one argument
x and evaluates one
(is? x true).
The same function may be written without a name:
(lambda (x) (is? x true))
Compiling a function requires a few inputs:
- an optional reference to a parent nesting function
- an optional function name
- a list of argument names
- a list of expressions that will compute the return value
The desired output is a data structure that combines:
- the optional function name
- the argument names
- the compiled bytecode
First, a scope structure is established. A scope is a lexical block in which variables are bound and unbound. In the compiler, this structure is simply a mapping of variable name to the register number that contains the value.
The first variables to be bound in the function's scope are the argument names. The compiler, given the list of argument names to the function and the order in which the arguments are given, associates each argument name with the register number that will contain it's value. As we saw above, these are predictably and reliably registers 2 and upward, one for each argument.
A scope may have a parent scope if the function is defined within another function. This is how nonlocal variable references will be looked up. We will go further into that when we discuss closures.
The second step is to eval each expression in the function, assigning the result to register 0, the preallocated return value register. The result of compiling each expression via eval is bytecode.
Thirdly and finally, a function object is instantiated, given it's name, the argument names and the bytecode.
During compilation of the expressions within a function, if any of those expressions reference nonlocal variables (that is, variables not declared within the scope of the function) then the function object needs additional data to describe how to access those nonlocal variables at runtime.
In the below example, the anonymous inner function references the parameter
n to the outer function,
n. When the inner function is returned, the value
n must be carried with it even after the stack scope of the outer function
is popped and later overwritten with values for other functions.
(def make_adder (n) (lambda (x) (+ x n)) )
Eval, when presented with a symbol to evaluate that has not been declared in
the function scope, searches outer scopes next. If a binding is found in an
outer scope, a nonlocal reference is added to the function's local scope
that points to the outer scope and a
GetUpvalue instruction is compiled.
This nonlocal reference is a combination of two values: a count of stack frames to skip over to find the outer scope variable and the register offset in that stack frame.
Non-local references are added to the function object that is returned by the function compiler. The VM will use these to identify the absolute location on the stack where a nonlocal variable should be read from and create upvalue objects at runtime when a variable is closed over.
Let is the declaration of variables and assigning values: the binding of values, or the results of expressions, to symbols. Secondly, it provides space to evaluate expressions that incorporate those variables.
Here we bind the result of
(make_adder 3) - a function - to the symbol
add_3 and then call
add_3 with argument
(let ((add_3 (make_adder 3))) (add_3 4))
The result of the entire
let expression should be
let simply introduces additional scopes within a function scope.
That is, instead of a function containing a single scope for all it's
variables, scopes are nested. A stack of scopes is needed, with the parameters
occupying the outermost scope.
First a new scope is pushed on to the scope stack and each symbol being bound is added to the new scope.
To generate code, a result register is reserved and a register for each binding is reserved.
Finally, each expression is evaluated and the scope is popped, removing the bindings from view.
A function call may make use of no more than 256 registers. Recall from earlier that the 0th register is reserved for the function return value and subsequent registers are reserved for the function arguments.
Beyond these initial registers the compiler uses a simple strategy in register
allocation: if a variable (a parameter or a
let binding) is declared, it is
allocated a register based on a stack discipline. Thus, variables are
essentially pushed and popped off the register stack as they come into and out
This strategy primarily ensures code simplicity - there is no register allocation optimization.
That covers the VM and compiler design at an overview level. We've glossed over a lot of detail but the next chapters will expose the implementation detail. Get ready!