Dicts

The implementation of dicts, or hash tables, is going to combine a reuse of the RawArray type and closely follow the Crafting Interpreters design:

  • open addressing
  • linear probing
  • FNV hashing

Go read the corresponding chapter in Crafting Interpreters and then come back here. We won't duplicate much of Bob's excellent explanation of the above terms and we'll assume you are familiar with his chapter when reading ours.

Code design

A Dict in our interpreter will allow any hashable value as a key and any type as a value. We'll store pointers to the key and the value together in a struct DictItem.

Here, we'll also introduce the single diversion from Crafting Interpreters' implementation in that we'll cache the hash value and use it as part of a tombstone indicator. This adds an extra word per entry but we will also take the stance that if two keys have the same hash value then the keys are equal. This simplifies our implementation as we won't need to implement object equality comparisons just yet.

#[derive(Clone)]
pub struct DictItem {
    key: TaggedCellPtr,
    value: TaggedCellPtr,
    hash: u64,
}

The Dict itself mirrors Crafting Interpreters' implementation of a count of used entries and an array of entries. Since tombstones are counted as used entries, we'll add a separate length that excludes tombstones so we can accurately report the number of items in a dict.

pub struct Dict {
    /// Number of items stored
    length: Cell<ArraySize>,
    /// Total count of items plus tombstones
    used_entries: Cell<ArraySize>,
    /// Backing array for key/value entries
    data: Cell<RawArray<DictItem>>,
}

Hashing

To implement our compiler we will need to be able to hash the Symbol type and integers (inline in tagged pointers.)

The Rust standard library defines trait std::hash::Hash that must be implemented by types that want to be hashed. This trait requires the type to implement method fn hash<H>(&self, state: &mut H) where H: Hasher.

This signature requires a reference to the type &self to access it's data. In our world, this is insufficient: we also require a &MutatorScope lifetime to access an object. We will have to wrap std::hash::Hash in our own trait that extends, essentially the same signature, with this scope guard parameter. This trait is named Hashable:

/// Similar to Hash but for use in a mutator lifetime-limited scope
pub trait Hashable {
    fn hash<'guard, H: Hasher>(&self, _guard: &'guard dyn MutatorScope, hasher: &mut H);
}

We can implement this trait for Symbol - it's a straightforward wrap of calling Hash::hash():

impl Hashable for Symbol {
    fn hash<'guard, H: Hasher>(&self, guard: &'guard dyn MutatorScope, h: &mut H) {
        self.as_str(guard).hash(h)
    }
}

Then finally, because this is all for a dynamically typed interpreter, we'll write a function that can take any type - a TaggedScopedPtr - and attempt to return a 64 bit hash value from it:

fn hash_key<'guard>(
    guard: &'guard dyn MutatorScope,
    key: TaggedScopedPtr<'guard>,
) -> Result<u64, RuntimeError> {
    match *key {
        Value::Symbol(s) => {
            let mut hasher = FnvHasher::default();
            s.hash(guard, &mut hasher);
            Ok(hasher.finish())
        }
        Value::Number(n) => Ok(n as u64),
        _ => Err(RuntimeError::new(ErrorKind::UnhashableError)),
    }
}

Now we can take a Symbol or a tagged integer and use them as keys in our Dict.

Finding an entry

The methods that a dictionary typically provides, lookup, insertion and deletion, all hinge around one internal function, find_entry().

This function scans the internal RawArray<DictItem> array for a slot that matches the hash value argument. It may find an exact match for an existing key-value entry; if it does not, it will return the first available slot for the hash value, whether an empty never-before used slot or the tombstone entry of a formerly used slot.

A tombstone, remember, is a slot that previously held a key-value pair but has been deleted. These slots must be specially marked so that when searching for an entry that generated a hash for an earlier slot but had to be inserted at a later slot, we know to keep looking rather than stop searching at the empty slot of a deleted entry.

SlotContent
n - 1empty
nX: hash % capacity == n
n + 1tombstone
n + 2Y: hash % capacity == n
n + 3empty

For example, in the above table:

  • Key X's hash maps to slot n.
  • At some point another entry was inserted at slot n + 1.
  • Then Y, with hash mapping also to slot n, was inserted, but had to be bumped to slot n + 2 because the previous two slots were occupied.
  • Then the entry at slot n + 1 was deleted and marked as a tombstone.

If slot n + 1 was simply marked as empty after it's occupant was deleted, then when searching for Y we wouldn't know to keep searching and find Y in slot n + 2. Hence, deleted entries are marked differently to empty slots.

Here is the code for the Find Entry function:

/// Given a key, generate the hash and search for an entry that either matches this hash
/// or the next available blank entry.
fn find_entry<'guard>(
    _guard: &'guard dyn MutatorScope,
    data: &RawArray<DictItem>,
    hash: u64,
) -> Result<&'guard mut DictItem, RuntimeError> {
    // get raw pointer to base of array
    let ptr = data
        .as_ptr()
        .ok_or(RuntimeError::new(ErrorKind::BoundsError))?;

    // calculate the starting index into `data` to begin scanning at
    let mut index = (hash % data.capacity() as u64) as ArraySize;

    // the first tombstone we find will be saved here
    let mut tombstone: Option<&mut DictItem> = None;

    loop {
        let entry = unsafe { &mut *(ptr.offset(index as isize) as *mut DictItem) as &mut DictItem };

        if entry.hash == TOMBSTONE && entry.key.is_nil() {
            // this is a tombstone: save the first tombstone reference we find
            if tombstone.is_none() {
                tombstone = Some(entry);
            }
        } else if entry.hash == hash {
            // this is an exact match slot
            return Ok(entry);
        } else if entry.key.is_nil() {
            // this is a non-tombstone empty slot
            if let Some(earlier_entry) = tombstone {
                // if we recorded a tombstone, return _that_ slot to be reused
                return Ok(earlier_entry);
            } else {
                return Ok(entry);
            }
        }

        // increment the index, wrapping back to 0 when we get to the end of the array
        index = (index + 1) % data.capacity();
    }
}

To begin with, it calculates the index in the array from which to start searching. Then it iterates over the internal array, examining each entry's hash and key as it goes.

  • The first tombstone that is encountered is saved. This may turn out to be the entry that should be returned if an exact hash match isn't found by the time a never-before used slot is reached. We want to reuse tombstone entries, of course.
  • If no tombstone was found and we reach a never-before used slot, return that slot.
  • If an exact match is found, return that slot of course.

The external API

Just as we defined some conainer traits for Array<T> to define access to arrays based on stack or indexed style access, we'll define a container trait for Dict:

pub trait HashIndexedAnyContainer {
    /// Return a pointer to to the object associated with the given key.
    /// Absence of an association should return an error.
    fn lookup<'guard>(
        &self,
        guard: &'guard dyn MutatorScope,
        key: TaggedScopedPtr,
    ) -> Result<TaggedScopedPtr<'guard>, RuntimeError>;

    /// Associate a key with a value.
    fn assoc<'guard>(
        &self,
        mem: &'guard MutatorView,
        key: TaggedScopedPtr<'guard>,
        value: TaggedScopedPtr<'guard>,
    ) -> Result<(), RuntimeError>;

    /// Remove an association by its key.
    fn dissoc<'guard>(
        &self,
        guard: &'guard dyn MutatorScope,
        key: TaggedScopedPtr,
    ) -> Result<TaggedScopedPtr<'guard>, RuntimeError>;

    /// Returns true if the key exists in the container.
    fn exists<'guard>(
        &self,
        guard: &'guard dyn MutatorScope,
        key: TaggedScopedPtr,
    ) -> Result<bool, RuntimeError>;
}

This trait contains the external API that Dict will expose for managing keys and values. The implementation of each of these methods will be in terms of the find_entry() function described above. Let's look at a couple of the more complex examples, assoc() and dissoc().

assoc

impl HashIndexedAnyContainer for Dict {
    fn assoc<'guard>(
        &self,
        mem: &'guard MutatorView,
        key: TaggedScopedPtr<'guard>,
        value: TaggedScopedPtr<'guard>,
    ) -> Result<(), RuntimeError> {
        let hash = hash_key(mem, key)?;

        let mut data = self.data.get();
        // check the load factor (what percentage of the capacity is or has been used)
        if needs_to_grow(self.used_entries.get() + 1, data.capacity()) {
            // create a new, larger, backing array, and copy all existing entries over
            self.grow_capacity(mem)?;
            data = self.data.get();
        }

        // find the slot whose entry matches the hash or is the nearest available entry
        let entry = find_entry(mem, &data, hash)?;

        // update counters if necessary
        if entry.key.is_nil() {
            // if `key` is nil, this entry is unused: increment the length
            self.length.set(self.length.get() + 1);
            if entry.hash == 0 {
                // if `hash` is 0, this entry has _never_ been used: increment the count
                // of used entries
                self.used_entries.set(self.used_entries.get() + 1);
            }
        }

        // finally, write the key, value and hash to the entry
        entry.key.set(key);
        entry.value.set(value);
        entry.hash = hash;

        Ok(())
    }
}

dissoc

impl HashIndexedAnyContainer for Dict {
    fn dissoc<'guard>(
        &self,
        guard: &'guard dyn MutatorScope,
        key: TaggedScopedPtr,
    ) -> Result<TaggedScopedPtr<'guard>, RuntimeError> {
        let hash = hash_key(guard, key)?;

        let data = self.data.get();
        let entry = find_entry(guard, &data, hash)?;

        if entry.key.is_nil() {
            // a nil key means the key was not found in the Dict
            return Err(RuntimeError::new(ErrorKind::KeyError));
        }

        // decrement the length but not the `used_entries` count
        self.length.set(self.length.get() - 1);

        // write the "tombstone" markers to the entry
        entry.key.set_to_nil();
        entry.hash = TOMBSTONE;

        // return the value that was associated with the key
        Ok(entry.value.get(guard))
    }
}

As you can see, once find_entry() is implemented as a separate function, these methods become fairly easy to comprehend.

Conclusion

If you haven't read Bob Nystron's chapter on hash tables in Crafting Interpreters we encourage you to do so: it will help make sense of this chapter.

Now, we'll transition to some compiler and virtual machine design before we continue with code implementation.