Symbols and Pairs

To bootstrap our compiler, we'll parse s-expressions into Symbol ad Pair types, where a Pair is essentially a Lisp cons cell.

The definition of Symbol is just the raw components of a &str:

#[derive(Copy, Clone)]
pub struct Symbol {
    name_ptr: *const u8,
    name_len: usize,
}

Why this is how Symbol is defined and how we handle these raw components will be covered in just a bit. First though, we'll delve into the Pair type.

Pairs of pointers

The definition of Pair is

#[derive(Clone)]
pub struct Pair {
    pub first: TaggedCellPtr,
    pub second: TaggedCellPtr,
    // Possible source code positions of the first and second values
    pub first_pos: Cell<Option<SourcePos>>,
    pub second_pos: Cell<Option<SourcePos>>,
}

The type of first and second is TaggedCellPtr, as seen in the previous chapter. This pointer type can point at any dynamic type. By the end of this chapter we'll be able to build a nested linked list of Pairs and Symbols.

Since this structure will be used for parsing and compiling, the Pair struct has a couple of extra members that optionally describe the source code line and character number of the values pointed at by first and second. These will be useful for reporting error messages. We'll come back to these in the chapter on parsing.

To instantiate a Pair function with first and second set to nil, let's create a new() function:

impl Pair {
    pub fn new() -> Pair {
        Pair {
            first: TaggedCellPtr::new_nil(),
            second: TaggedCellPtr::new_nil(),
            first_pos: Cell::new(None),
            second_pos: Cell::new(None),
        }
    }
}

That function, as it's not being allocated into the heap, doesn't require the lifetime guard. Let's look at a more interesting function: cons(), which assigns a value to first and second and puts the Pair on to the heap:

pub fn cons<'guard>(
    mem: &'guard MutatorView,
    head: TaggedScopedPtr<'guard>,
    rest: TaggedScopedPtr<'guard>,
) -> Result<TaggedScopedPtr<'guard>, RuntimeError> {
    let pair = Pair::new();
    pair.first.set(head);
    pair.second.set(rest);
    mem.alloc_tagged(pair)
}

Here we have the lifetime 'guard associated with the MutatorView instance which grants access to the allocator alloc_tagged() method and the getter and setter on TaggedScopedPtr.

The other two args, head and rest are required to share the same 'guard lifetime as the MutatorView instance, or rather, 'guard must at least be a subtype of their lifetimes. Their values, of type TaggedScopedPtr<'guard>, can be written directly to the first and second members of Pair with the setter TaggedCellPtr::set().

We'll also add a couple impl methods for appending an object to a Pair in linked-list fashion:

impl Pair {
    pub fn append<'guard>(
        &self,
        mem: &'guard MutatorView,
        value: TaggedScopedPtr<'guard>,
    ) -> Result<TaggedScopedPtr<'guard>, RuntimeError> {
        let pair = Pair::new();
        pair.first.set(value);

        let pair = mem.alloc_tagged(pair)?;
        self.second.set(pair);

        Ok(pair)
    }
}

This method, given a value to append, creates a new Pair whose member first points at the value, then sets the second of the &self Pair to that new Pair instance. This is in support of s-expression notation (a b) which describes a linked-list of Pairs arranged, in pseudo-Rust:

Pair {
    first: a,
    second: Pair {
        first: b,
        second: nil,
    },
}

The second method is for directly setting the value of the second for s-expression dot-notation style: (a . b) is represented by first pointing at a, dotted with b which is pointed at by second. In our pseudo representation:

Pair {
    first: a,
    second: b,
}

The implementation is simply:

impl Pair {
    pub fn dot<'guard>(&self, value: TaggedScopedPtr<'guard>) {
        self.second.set(value);
    }
}

The only other piece to add, since Pair must be able to be passed into our allocator API, is the AllocObject impl for Pair:

impl AllocObject<TypeList> for Pair {
    const TYPE_ID: TypeList = TypeList::Pair;
}

This impl pattern will repeat for every type in TypeList so it'll be a great candidate for a macro.

And that's it! We have a cons-cell style Pair type and some elementary methods for creating and allocating them.

Now, back to Symbol, which seems like it should be even simpler, but as we'll see has some nuance to it.

Symbols and pointers

Let's recap the definition of Symbol and that it is the raw members of a &str:

#[derive(Copy, Clone)]
pub struct Symbol {
    name_ptr: *const u8,
    name_len: usize,
}

By this definition, a symbol has a name string, but does not own the string itself. What means this?

Symbols are in fact pointers to interned strings. Since each symbol points to a unique string, we can identify a symbol by it's pointer value rather than needing to look up the string itself.

However, symbols do need to be discovered by their string name, and symbol pointers must dereference to return their string form. i.e. a we need a bidirectional mapping of string to pointer and pointer to string.

In our implementation, we use a HashMap<String, RawPtr<Symbol>> to map from name strings to symbol pointers, while the Symbol object itself points back to the name string.

This is encapsulated in a SymbolMap struct:

pub struct SymbolMap {
    map: RefCell<HashMap<String, RawPtr<Symbol>>>,
    arena: Arena,
}

where we use RefCell to wrap operations in interior mutability, just like all other allocator functionality.

The second struct member Arena requires further explanation: since symbols are unique strings that can be identified and compared by their pointer values, these pointer values must remain static throughout the program lifetime. Thus, Symbol objects cannot be managed by a heap that might perform object relocation. We need a separate heap type for objects that are never moved or freed unil the program ends, the Arena type.

The Arena type is simple. It, like Heap, wraps StickyImmixHeap but unlike Heap, it will never run garbage collection.

pub struct Arena {
    heap: StickyImmixHeap<ArenaHeader>,
}

The ArenaHeader is a simple object header type to fulfill the allocator API requirements but whose methods will never be needed.

Allocating a Symbol will use the Arena::alloc() method which calls through to the StickyImmixHeap instance.

We'll add a method for getting a Symbol from it's name string to the SymbolMap at the allocator API level:

impl SymbolMap {
    pub fn lookup(&self, name: &str) -> RawPtr<Symbol> {
        {
            if let Some(ptr) = self.map.borrow().get(name) {
                return *ptr;
            }
        }

        let name = String::from(name);
        let ptr = self.arena.alloc(Symbol::new(&name)).unwrap();
        self.map.borrow_mut().insert(name, ptr);
        ptr
    }
}

Then we'll add wrappers to the Heap and MutatorView impls to scope-restrict access:

impl Heap {
    fn lookup_sym(&self, name: &str) -> TaggedPtr {
        TaggedPtr::symbol(self.syms.lookup(name))
    }
}

and

impl<'memory> MutatorView<'memory> {
    pub fn lookup_sym(&self, name: &str) -> TaggedScopedPtr<'_> {
        TaggedScopedPtr::new(self, self.heap.lookup_sym(name))
    }
}

This scope restriction is absolutely necessary, despite these objects never being freed or moved during runtime. This is because Symbol, as a standalone struct, remains unsafe to use with it's raw &str components. These components can only safely be accessed when there is a guarantee that the backing Hashmap is still in existence, which is only when the MutatorView is accessible.

Two methods on Symbol guard access to the &str, one unsafe to reassemble the &str from raw components, the other safe when given a MutatorScope guard instance.

impl Symbol {
    pub unsafe fn unguarded_as_str<'desired_lifetime>(&self) -> &'desired_lifetime str {
        let slice = slice::from_raw_parts(self.name_ptr, self.name_len);
        str::from_utf8(slice).unwrap()
    }

    pub fn as_str<'guard>(&self, _guard: &'guard dyn MutatorScope) -> &'guard str {
        unsafe { self.unguarded_as_str() }
    }
}

Finally, to make Symbols allocatable in the Sticky Immix heap, we need to implement AllocObject for it:

impl AllocObject<TypeList> for Symbol {
    const TYPE_ID: TypeList = TypeList::Symbol;
}

Moving on swiftly

Now we've got the elemental pieces of s-expressions, lists and symbols, we can move on to parsing s-expression strings.

Since the focus of this book is the underlying mechanisms of memory management in Rust and the details of runtime implementation, parsing will receive less attention. We'll make it quick!