Symbols and Pairs
To bootstrap our compiler, we'll parse s-expressions into Symbol ad Pair
types, where a Pair is essentially a Lisp cons cell.
The definition of Symbol is just the raw components of a &str:
#[derive(Copy, Clone)]
pub struct Symbol {
name_ptr: *const u8,
name_len: usize,
}
Why this is how Symbol is defined and how we handle these raw components will
be covered in just a bit. First though, we'll delve into the Pair type.
Pairs of pointers
The definition of Pair is
#[derive(Clone)]
pub struct Pair {
pub first: TaggedCellPtr,
pub second: TaggedCellPtr,
// Possible source code positions of the first and second values
pub first_pos: Cell<Option<SourcePos>>,
pub second_pos: Cell<Option<SourcePos>>,
}
The type of first and second is TaggedCellPtr, as seen in the previous
chapter. This pointer type can point at any dynamic type. By the
end of this chapter we'll be able to build a nested linked list of Pairs
and Symbols.
Since this structure will be used for parsing and compiling, the Pair
struct has a couple of extra members that optionally describe the source
code line and character number of the values pointed at by first and
second. These will be useful for reporting error messages. We'll come back
to these in the chapter on parsing.
To instantiate a Pair function with first and second set to nil, let's
create a new() function:
impl Pair {
pub fn new() -> Pair {
Pair {
first: TaggedCellPtr::new_nil(),
second: TaggedCellPtr::new_nil(),
first_pos: Cell::new(None),
second_pos: Cell::new(None),
}
}
}
That function, as it's not being allocated into the heap, doesn't require the
lifetime guard. Let's look at a more interesting function: cons(), which
assigns a value to first and second and puts the Pair on to the heap:
pub fn cons<'guard>(
mem: &'guard MutatorView,
head: TaggedScopedPtr<'guard>,
rest: TaggedScopedPtr<'guard>,
) -> Result<TaggedScopedPtr<'guard>, RuntimeError> {
let pair = Pair::new();
pair.first.set(head);
pair.second.set(rest);
mem.alloc_tagged(pair)
}
Here we have the lifetime 'guard associated with the MutatorView instance
which grants access to the allocator alloc_tagged() method and the getter
and setter on TaggedScopedPtr.
The other two args, head and rest are required to share the same 'guard
lifetime as the MutatorView instance, or rather, 'guard must at least be
a subtype of their lifetimes. Their values, of type TaggedScopedPtr<'guard>,
can be written directly to the first and second members of Pair with
the setter TaggedCellPtr::set().
We'll also add a couple impl methods for appending an object to a Pair
in linked-list fashion:
impl Pair {
pub fn append<'guard>(
&self,
mem: &'guard MutatorView,
value: TaggedScopedPtr<'guard>,
) -> Result<TaggedScopedPtr<'guard>, RuntimeError> {
let pair = Pair::new();
pair.first.set(value);
let pair = mem.alloc_tagged(pair)?;
self.second.set(pair);
Ok(pair)
}
}
This method, given a value to append, creates a new Pair whose member first
points at the value, then sets the second of the &self Pair to that new
Pair instance. This is in support of s-expression notation (a b) which
describes a linked-list of Pairs arranged, in pseudo-Rust:
Pair {
first: a,
second: Pair {
first: b,
second: nil,
},
}
The second method is for directly setting the value of the second for
s-expression dot-notation style: (a . b) is represented by first pointing
at a, dotted with b which is pointed at by second. In our pseudo
representation:
Pair {
first: a,
second: b,
}
The implementation is simply:
impl Pair {
pub fn dot<'guard>(&self, value: TaggedScopedPtr<'guard>) {
self.second.set(value);
}
}
The only other piece to add, since Pair must be able to be passed into
our allocator API, is the AllocObject impl for Pair:
impl AllocObject<TypeList> for Pair {
const TYPE_ID: TypeList = TypeList::Pair;
}
This impl pattern will repeat for every type in TypeList so it'll be a great
candidate for a macro.
And that's it! We have a cons-cell style Pair type and some elementary
methods for creating and allocating them.
Now, back to Symbol, which seems like it should be even simpler, but as we'll
see has some nuance to it.
Symbols and pointers
Let's recap the definition of Symbol and that it is the raw members of a
&str:
#[derive(Copy, Clone)]
pub struct Symbol {
name_ptr: *const u8,
name_len: usize,
}
By this definition, a symbol has a name string, but does not own the string itself. What means this?
Symbols are in fact pointers to interned strings. Since each symbol points to a unique string, we can identify a symbol by it's pointer value rather than needing to look up the string itself.
However, symbols do need to be discovered by their string name, and symbol pointers must dereference to return their string form. i.e. a we need a bidirectional mapping of string to pointer and pointer to string.
In our implementation, we use a HashMap<String, RawPtr<Symbol>> to map from
name strings to symbol pointers, while the Symbol object itself points back
to the name string.
This is encapsulated in a SymbolMap struct:
pub struct SymbolMap {
map: RefCell<HashMap<String, RawPtr<Symbol>>>,
arena: Arena,
}
where we use RefCell to wrap operations in interior mutability, just like
all other allocator functionality.
The second struct member Arena requires further explanation: since symbols are
unique strings that can be identified and compared by their pointer values,
these pointer values must remain static throughout the program lifetime.
Thus, Symbol objects cannot be managed by a heap that might perform object
relocation. We need a separate heap type for objects that are never
moved or freed unil the program ends, the Arena type.
The Arena type is simple. It, like Heap, wraps StickyImmixHeap but
unlike Heap, it will never run garbage collection.
pub struct Arena {
heap: StickyImmixHeap<ArenaHeader>,
}
The ArenaHeader is a simple object header type to fulfill the allocator
API requirements but whose methods will never be needed.
Allocating a Symbol will use the Arena::alloc() method which calls through
to the StickyImmixHeap instance.
We'll add a method for getting a Symbol from it's name string to the
SymbolMap at the allocator API level:
impl SymbolMap {
pub fn lookup(&self, name: &str) -> RawPtr<Symbol> {
{
if let Some(ptr) = self.map.borrow().get(name) {
return *ptr;
}
}
let name = String::from(name);
let ptr = self.arena.alloc(Symbol::new(&name)).unwrap();
self.map.borrow_mut().insert(name, ptr);
ptr
}
}
Then we'll add wrappers to the Heap and MutatorView impls to scope-restrict
access:
impl Heap {
fn lookup_sym(&self, name: &str) -> TaggedPtr {
TaggedPtr::symbol(self.syms.lookup(name))
}
}
and
impl<'memory> MutatorView<'memory> {
pub fn lookup_sym(&self, name: &str) -> TaggedScopedPtr<'_> {
TaggedScopedPtr::new(self, self.heap.lookup_sym(name))
}
}
This scope restriction is absolutely necessary, despite these objects never
being freed or moved during runtime. This is because Symbol, as a standalone
struct, remains unsafe to use with it's raw &str components. These components
can only safely be accessed when there is a guarantee that the backing
Hashmap is still in existence, which is only when the MutatorView is
accessible.
Two methods on Symbol guard access to the &str, one unsafe to reassemble
the &str from raw components, the other safe when given a MutatorScope
guard instance.
impl Symbol {
pub unsafe fn unguarded_as_str<'desired_lifetime>(&self) -> &'desired_lifetime str {
let slice = slice::from_raw_parts(self.name_ptr, self.name_len);
str::from_utf8(slice).unwrap()
}
pub fn as_str<'guard>(&self, _guard: &'guard dyn MutatorScope) -> &'guard str {
unsafe { self.unguarded_as_str() }
}
}
Finally, to make Symbols allocatable in the Sticky Immix heap, we need to
implement AllocObject for it:
impl AllocObject<TypeList> for Symbol {
const TYPE_ID: TypeList = TypeList::Symbol;
}
Moving on swiftly
Now we've got the elemental pieces of s-expressions, lists and symbols, we can move on to parsing s-expression strings.
Since the focus of this book is the underlying mechanisms of memory management in Rust and the details of runtime implementation, parsing will receive less attention. We'll make it quick!