Allocating objects and dereferencing safely

In this chapter we'll build some safe Rust abstractions over the allocation API defined in the Sticky Immix crate.

Let's first recall this interface:

pub trait AllocRaw {
    /// An implementation of an object header type
    type Header: AllocHeader;

    /// Allocate a single object of type T.
    fn alloc<T>(&self, object: T) -> Result<RawPtr<T>, AllocError>
    where
        T: AllocObject<<Self::Header as AllocHeader>::TypeId>;

    /// Allocating an array allows the client to put anything in the resulting data
    /// block but the type of the memory block will simply be 'Array'. No other
    /// type information will be stored in the object header.
    /// This is just a special case of alloc<T>() for T=u8 but a count > 1 of u8
    /// instances.  The caller is responsible for the content of the array.
    fn alloc_array(&self, size_bytes: ArraySize) -> Result<RawPtr<u8>, AllocError>;

    /// Given a bare pointer to an object, return the expected header address
    fn get_header(object: NonNull<()>) -> NonNull<Self::Header>;

    /// Given a bare pointer to an object's header, return the expected object address
    fn get_object(header: NonNull<Self::Header>) -> NonNull<()>;
}

These are the functions we'll be calling. When we allocate an object, we'll get back a RawPtr<T> which has no safe way to dereference it. This is impractical, we very much do not want to wrap every dereferencing in unsafe { ... }. We'll need a layer over RawPtr<T> where we can guarantee safe dereferencing.

Pointers

In safe Rust, mutable (&mut) and immutable (&) references are passed around to access objects. These reference types are compile-time constrained pointers where the constraints are

the mutability of the access
the lifetime of the access

For our layer over RawPtr<T> we'll have to consider both these constraints.

Mutability

This constraint is concerned with shared access to an object. In other words, it cares about how many pointers there are to an object at any time and whether they allow mutable or immutable access.

The short of it is:

Either only one &mut reference may be held in a scope
Or many & immutable references may be held in a scope

The compiler must be able to determine that a &mut reference is the only live reference in it's scope that points at an object in order for mutable access to that object to be safe of data races.

In a runtime memory managed language such as the interpreter we are building, we will not have compile time knowledge of shared access to objects. We won't know at compile time how many pointers to an object we may have at any time. This is the normal state of things in languages such as Python, Ruby or Javascript.

This means that we can't allow &mut references in our safe layer at all!

If we're restricted to & immutable references everywhere, that then means we must apply the interior mutability pattern everywhere in our design in order to comply with the laws of safe Rust.

Lifetime

The second aspect to references is their lifetime. This concerns the duration of the reference, from inception until it goes out of scope.

The key concept to think about now is "scope."

In an interpreted language there are two major operations on the objects in memory:

fn run_mutator() {
    parse_source_code();
    compile();
    execute_bytecode();
}

and

fn run_garbage_collection() {
   trace_objects();
   free_dead_objects();
}

A few paragraphs earlier we determined that we can't have &mut references to objects in our interpreter.

By extension, we can't safely hold a mutable reference to the entire heap as a data structure.

Except, that is exactly what garbage collection requires. The nature of garbage collection is that it views the entire heap as a single data structure in it's own right that it needs to traverse and modify. It wants the heap to be &mut.

Consider, especially, that some garbage collectors move objects, so that pointers to moved objects, wherever they may be, must be modified by the garbage collector without breaking the mutator! The garbage collector must be able to reliably discover every single pointer to moved objects to avoid leaving invalid pointers scattered around¹.

Thus we have two mutually exclusive interface requirements, one that must only hold & object references and applies interior mutability to the heap and the other that wants the whole heap to be &mut.

For this part of the book, we'll focus on the use of the allocator and save garbage collection for a later part.

This mutual exclusivity constraint on the allocator results in the statements:

When garbage collection is running, it is not safe to run the mutator²
When garbage collection is not running, it is safe to run the mutator

Thus our abstraction must encapsulate a concept of a time when "it is safe to run the mutator" and since we're working with safe Rust, this must be a compile time concept.

Scopes and lifetimes are perfect for this abstraction. What we'll need is some way to define a lifetime (that is, a scope) within which access to the heap by the mutator is safe.

Some pointer types

First, let's define a simple pointer type that can wrap an allocated type T in a lifetime:

pub struct ScopedPtr<'guard, T: Sized> {
    value: &'guard T,
}

This type will implement Clone, Copy and Deref - it can be passed around freely within the scope and safely dereferenced.

As you can see we have a lifetime 'guard that we'll use to restrict the scope in which this pointer can be accessed. We need a mechanism to restrict this scope.

The guard pattern is what we'll use, if the hint wasn't strong enough.

We'll construct some types that ensure that safe pointers such as ScopedPtr<T>, and access to the heap at in any way, are mediated by an instance of a guard type that can provide access.

We will end up passing a reference to the guard instance around everywhere. In most cases we won't care about the instance type itself so much as the lifetime that it carries with it. As such, we'll define a trait for this type to implement that so that we can refer to the guard instance by this trait rather than having to know the concrete type. This'll also allow other types to proxy the main scope-guarding instance.

pub trait MutatorScope {}

You may have noticed that we've jumped from RawPtr<T> to ScopedPtr<T> with seemingly nothing to bridge the gap. How do we get a ScopedPtr<T>?

We'll create a wrapper around RawPtr<T> that will complete the picture. This wrapper type is what will hold pointers at rest inside any data structures.

#[derive(Clone)]
pub struct CellPtr<T: Sized> {
    inner: Cell<RawPtr<T>>,
}

This is straightforwardly a RawPtr<T> in a Cell to allow for modifying the pointer. We won't allow dereferencing from this type either though.

Remember that dereferencing a heap object pointer is only safe when we are in the right scope? We need to create a ScopedPtr<T> from a CellPtr<T> to be able to use it.

First we'll add a helper function to RawPtr<T> in our interpreter crate so we can safely dereference a RawPtr<T>. This code says that, given an instance of a MutatorScope-implementing type, give me back a reference type with the same lifetime as the guard that I can safely use. Since the _guard parameter is never used except to define a lifetime, it should be optimized out by the compiler!

pub trait ScopedRef<T> {
    fn scoped_ref<'scope>(&self, guard: &'scope dyn MutatorScope) -> &'scope T;
}

impl<T> ScopedRef<T> for RawPtr<T> {
    fn scoped_ref<'scope>(&self, _guard: &'scope dyn MutatorScope) -> &'scope T {
        unsafe { &*self.as_ptr() }
    }
}

We'll use this in our CellPtr<T> to obtain a ScopedPtr<T>:

impl<T: Sized> CellPtr<T> {
    pub fn get<'guard>(&self, guard: &'guard dyn MutatorScope) -> ScopedPtr<'guard, T> {
        ScopedPtr::new(guard, self.inner.get().scoped_ref(guard))
    }
}

Thus, anywhere (structs, enums) that needs to store a pointer to something on the heap will use CellPtr<T> and any code that accesses these pointers during the scope-guarded mutator code will obtain ScopedPtr<T> instances that can be safely dereferenced.

The heap and the mutator

The next question is: where do we get an instance of MutatorScope from?

The lifetime of an instance of a MutatorScope will define the lifetime of any safe object accesses. By following the guard pattern, we will find we have:

a heap struct that contains an instance of the Sticky Immix heap
a guard struct that proxies the heap struct for the duration of a scope
a mechanism to enforce the scope limit

A heap struct

Let's make a type alias for the Sticky Immix heap so we aren't referring to it as such throughout the interpreter:

pub type HeapStorage = StickyImmixHeap<ObjectHeader>;

The let's put that into a heap struct, along with any other interpreter-global storage:

struct Heap {
    heap: HeapStorage,
    syms: SymbolMap,
}

We'll discuss the SymbolMap type in the next chapter.

Now, since we've wrapped the Sticky Immix heap in our own Heap struct, we'll need to impl an alloc() method to proxy the Sticky Immix allocation function.

impl Heap {
    fn alloc<T>(&self, object: T) -> Result<RawPtr<T>, RuntimeError>
    where
        T: AllocObject<TypeList>,
    {
        Ok(self.heap.alloc(object)?)
    }
}

A couple things to note about this function:

It returns RuntimeError in the error case, this type converts From the Sticky Immix crate's error type.
The where constraint is similar to that of AllocRaw::alloc() but in now we have a concrete TypeList type to bind to. We'll look at TypeList in the next chapter along with SymbolMap.

A guard struct

This next struct will be used as a scope-limited proxy for the Heap struct with one major difference: function return types will no longer be RawPtr<T> but ScopedPtr<T>.

pub struct MutatorView<'memory> {
    heap: &'memory Heap,
}

Here in this struct definition, it becomes clear that all we are doing is borrowing the Heap instance for a limited lifetime. Thus, the lifetime of the MutatorView instance will be the lifetime that all safe object access is constrained to.

A look at the alloc() function now:

impl<'memory> MutatorView<'memory> {
    pub fn alloc<T>(&self, object: T) -> Result<ScopedPtr<'_, T>, RuntimeError>
    where
        T: AllocObject<TypeList>,
    {
        Ok(ScopedPtr::new(
            self,
            self.heap.alloc(object)?.scoped_ref(self),
        ))
    }
}

Very similar to Heap::alloc() but the return type is now a ScopedPtr<T> whose lifetime is the same as the MutatorView instance.

Enforcing a scope limit

We now have a Heap and a guard, MutatorView, but we want one more thing: to prevent an instance of MutatorView from being returned from anywhere - that is, enforcing a scope within which an instance of MutatorView will live and die. This will make it easier to separate mutator operations and garbage collection operations.

First we'll apply a constraint on how a mutator gains heap access: through a trait.

pub trait Mutator: Sized {
    type Input;
    type Output;

    fn run(&self, mem: &MutatorView, input: Self::Input) -> Result<Self::Output, RuntimeError>;

    // TODO
    // function to return iterator that iterates over roots
}

If a piece of code wants to access the heap, it must implement this trait!

Secondly, we'll apply another wrapper struct, this time to the Heap type. This is so that we can borrow the heap member instance.

pub struct Memory {
    heap: Heap,
}

This Memory struct and the Mutator trait are now tied together with a function:

impl Memory {
    pub fn mutate<M: Mutator>(&self, m: &M, input: M::Input) -> Result<M::Output, RuntimeError> {
        let mut guard = MutatorView::new(self);
        m.run(&mut guard, input)
    }

}

The key to the scope limitation mechanism is that this mutate function is the only way to gain access to the heap. It creates an instance of MutatorView that goes out of scope at the end of the function and thus can't leak outside of the call stack.

An example

Let's construct a simple example to demonstrate these many parts. This will omit defining a TypeId and any other types that we didn't discuss above.

struct Stack {}

impl Stack {
    fn say_hello(&self) {
        println!("I'm the stack!");
    }
}

struct Roots {
    stack: CellPtr<Stack>
}

impl Roots {
    fn new(stack: ScopedPtr<'_, Stack>) -> Roots {
        Roots {
            stack: CellPtr::new_with(stack)
        }
    }
}

struct Interpreter {}

impl Mutator for Interpreter {
    type Input: ();
    type Output: Roots;

    fn run(&self, mem: &MutatorView, input: Self::Input) -> Result<Self::Output, RuntimeError> {
        let stack = mem.alloc(Stack {})?;   // returns a ScopedPtr<'_, Stack>
        stack.say_hello();

        let roots = Roots::new(stack);

        let stack_ptr = roots.stack.get(mem);  // returns a ScopedPtr<'_, Stack>
        stack_ptr.say_hello();

        Ok(roots)
    }
}

fn main() {
    ...
    let interp = Interpreter {};

    let result = memory.mutate(&interp, ());

    let roots = result.unwrap();

    // no way to do this - compile error
    let stack = roots.stack.get();
    ...
}

In this simple, contrived example, we instantiated a Stack on the heap. An instance of Roots is created on the native stack and given a pointer to the Stack instance. The mutator returns the Roots object, which continues to hold a pointer to a heap object. However, outside of the run() function, the stack member can't be safely accesed.

Up next: using this framework to implement parsing!

This is the topic of discussion in Felix Klock's series GC and Rust which is recommended reading.

while this distinction exists at the interface level, in reality there are multiple phases in garbage collection and not all of them require exclusive access to the heap. This is an advanced topic that we won't bring into consideration yet.

Writing Interpreters in Rust: a Guide