MAL - Preliminaries

1. Overview
2. Object Model = Tagged Union + Metadata
- 2.1. Helper Functions
- 2.2. Object Model References
3. Atoms
4. Linked Lists
5. Hash Tables

1. Overview

This contains a discussion of the implementation of data structures for a Lisp interpreter. The focus will be on the data structures used for the data types provided by the Lisp interpreter.

2. Object Model = Tagged Union + Metadata

The basic data type for a simple Lisp implementation in C amounts to a tagged union. Well, we stick the tag plus other metadata in an "Object header" (borrowing the term from other languages, links below). The idea is we have a "base class" type implemented as a struct

/* types.h */
typedef enum ObjectType {
    // ...
} ObjectType;

typedef struct Object {
    ObjectType type;
    // gc metadata
    // other metadata
} Object;

Note: we could (should) use bitfields to make the Object 1 word (4 bytes on 32-bit machines, 8 bytes on 64-bit machines). Otherwise, it could end up multiple words in size (i.e., too big).

We then make the first field, in any type in our "object model", an Object instance:

struct Example {
    Object header;
    // ...
};

This takes advantage of C standards permitting casting a pointer to a struct to (and from) a pointer to its first field. In other words, we get a standards-compliant form of "polymorphism".

2.1. Helper Functions

There are a number of functions we'll use for S-expressions generically, like computing the hash code for an S-expression. (We could, further, cache this as metadata in the struct Object.)

/* types.h */
typedef int hash_t;

hash_t hashCode(Object *object);
bool equals(Object *lhs, Object *rhs);

The contracts we demand of these functions:

Let Object *x, *y be non-null. If equals(x, y) is true, then hashCode(x) == hashCode(y).
- We do not require the inverse of this proposition — i.e., we do not require if equals(x, y) is false, then hashCode(x) != hashCode(y).
The equals() function is an equivalence relation, so
- Reflexivity: for any non-null Object *x, we have equals(x, x)
- Symmetry: for non-null Object *x, *y, if equals(x, y), then equals(y, x)
- Transitivity: for non-null Object *x, *y, *z, if equals(x, y) and equals(y, z), then we have equals(x, z)
Further, for any non-null Object *x and Object *null = NULL, we have equals(x, NULL) and Object(x, null) both return false
We have consistency:
- equals(): for any non-null Object *x, *y we have equals(x, y) return the same value if we compute it repeatedly without any other computation in between
- hashCode(): for any non-null Object *x, we have hashCode(x) return the same value if called repeatedly and no other computation occurs in between these calls

2.2. Object Model References

3. Atoms

There are two types of atoms: symbols and primitive values. The primitive values are numbers, strings, booleans, or characters. Symbols are associated to some value.

3.1. Primitive Values

Primitive values would naively be represented by structures like:

/* types.h */
struct Int {
    Object header;
    int value;
};

struct Char {
    Object header;
    char codepoint;
};

struct Boolean {
    Object header;
    bool value;
};

These will be two words in size. There are tricks to compactify their memory usage (NaN boxing, tagged object pointers since x64 uses 48-bit addresses giving us 16 bits of metadata).

3.2. Strings

There are several different ways to represent a string. C uses a null-terminated sequence of characters. Pascal uses length-prefixed strings — by convention, the first k bytes are interpreted as an integer, encoding the length of the string. C++ (and, I guess, Java) encodes a string as a record (well, Java uses classes). We'll follow the structure encoding:

/* types.h */
typedef struct String String;

String* string_new(char *str, int length);
void string_free(String **string);
bool string_isEqual(String *this, String *other);
hash_t string_hashCode(String *this);

The constructor string_new takes an additional parameter, length, which tells us how many characters to take from the source. The special value of length < 0 will copy until a null character is encountered.

The string API can be extended as necessary.

The data structure for a string would be implemented as:

/* types.c */
struct String {
    Object header;
    size_t length;
    char *text;
};

3.2.1. Constructor

The constructor copies the data from the given pointer. There are basically three key moments to this constructor:

Allocate the memory for the string data structure
Initialize the length (which needs to be computed if the length parameter is negative)
Allocate and initialize/copy the char[] array of text

/* types.c */

String* string_new(char *str, int length) {
    if (NULL == src) return NULL;

    String *string = alloc(sizeof(*string));
    string->header.type = TYPE_STRING;

    /* initialize length */
    if (length < 0) {
        string->length = strlen(str);
    } else {
        string->length = (size_t)length;
    }

    /* allocate and initialize text */
    string->text = alloc_array(string->length, sizeof(char));
    if (0 == string->length) {
        string->text = NULL;
    } else {
        memcpy(string->text, text, string->length);
    }
    return string;
}

3.2.2. Destructor

The destructor for a string needs to also free the underlying array of characters. Then we free the data and metadata for the string object. Finally we make sure to update the pointer to the string to point to NULL.

void string_free(String **this) {
    if ((NULL == this) || (NULL == *this)) return;

    if (NULL != (*this)->text) {
        free((*this)->text);
    }
    free((*this));
    *this = NULL;
}

3.2.3. String Equality

We can use a couple shortcuts to compute equality. First, if both arguments are NULL pointers, then return true. But if only one is NULL, return false.

The default case then moves on to non-null pointers. In that case, we compare the lengths of the strings to rule out unequal strings quickly. Should the strings have the same length, we compare the contents of the underlying character array.

bool string_equals(String *lhs, String *rhs) {
    if (lhs == rhs) return true;
    if (NULL == lhs) return false;
    if (NULL == rhs) return false;
    if ((lhs->length) != (rhs->length)) return false;

    return 0 == memcmp(lhs->text, rhs->text, lhs->length);
}

3.2.4. Computing Hash Code

There are a number of ways to compute the hash code for a string. Following the generic strategy of computing the hash code for a list (discussed below), treating the String as a list of characters. (If we wanted to borrow, e.g., Fowler–Noll–Vo hashing, we would replace h*31 by h*FNV_prime, and the addition by an XOR. The initial value of the hash would be the FNV_offset_basis.)

hash_t string_hashCode(String *string) {
    if (NULL == string) return 0;
    if (0 == string->length) return 0;
    if (NULL == string->text) return 0;
    hash_t h = 0;
    for (size_t i = 0; i < string->length; i++) {
        h = h*31 + (hash_t)(string->text[i]);
    }
    return h;
}

3.2.5. References

Java's String.java (jdk8-b120, jdk7-b24)

3.3. Symbols

We intern symbols to make at most one object which corresponds with a symbol name. This could be done using a Hash Table, a binary tree, or some other data structure.

/* types.h */
typedef struct Symbol Symbol;

Symbol* intern(char *name);
void symbol_free(Symbol **symbol);
hash_t symbol_hashCode(Symbol *this);

We make public the symbol_free() function, though we may want to hide it, and make some method to register it with the garbage collector hidden from the user.

Again, we are using opaque pointers (since it's good style):

/* types.c */

struct Symbol {
    Object header;
    String *name;
};

The contract is, for any non-empty strings x and y such that they are equal (in the sense that 0 == strcmp(x, y)), then we have intern(x) == intern(y) and specifically intern(x) != NULL.

3.3.1. Constructor

The trick to this is we will use a Hash Table to lookup the symbol with a given name. If it exists, we return that; otherwise, we create a new Symbol object, save it in the Hash Table, and return the Symbol object.

One word of warning: if we extend our interpreter to include multithreading, then we must make sure our HashMap is concurrent. Otherwise, we may end up with race conditions regarding the symbol_table.

static HashMap *symbol_table = NULL;

static void init_symbol_table() {
    if (NULL == symbol_table) {
        symbol_table = hashmap_new();
    }
}

static Symbol* symbol_new(String *name) {
    Symbol *symbol = alloc(sizeof(*symbol));
    symbol->header.type = TYPE_SYMBOL;
    symbol->name = name;
    return symbol;
}

Symbol* intern(char *name) {
    init_symbol_table();
    String *key = string_new(name, -1);
    Symbol *symbol = hashmap_get(&symbol_table, key);

    if (NULL == symbol) {
        symbol = symbol_name(key);
        hashmap_put(&symbol_table, key, symbol);
    } else {
        // cache hit
        string_free(&key);
    }
    return symbol;
}

3.3.2. Destructor

I'm not sure I need this, exactly. But I'll vacate the symbol table, and delete the symbol and its underlying string.

void symbol_free(Symbol **symbol) {
    if ((NULL == symbol) || (NULL == *symbol)) return;
    hashmap_remove(symbol_table, (*symbol)->name);
    string->free(&((*symbol)->name));
    free(*symbol);
    (*symbol) = NULL;
}

3.3.3. Hash Code

The hash code for a symbol amounts to computing the hash code for the associated string.

hash_t symbol_hashCode(Symbol *symbol) {
    if (NULL == symbol) return 0;
    return string_hashCode(symbol->name);
}

3.3.4. Equality

Since we're interning symbols, we could use pointer equality. But ostensibly we'll extend symbols to a multithreaded environment…so we could end up in a tricky situation where pointer comparison does not work.

bool symbol_equals(Symbol *lhs, Symbol *rhs) {
    if (lhs == rhs) return true;
    if (NULL == lhs) return false;
    if (NULL == rhs) return true;

#if defined(MULTITHREADED_PARANOID)
#if MULTITHREADED_PARANOID
    return string_equals(lhs->name, rhs->name);
#endif /* MULTITHREADED_PARANOID */
#else
    return false;
#endif /* defined(MULTITHREADED_PARANOID) */
}

4. Linked Lists

Also called "cons cells", linked lists are what you'd expect. They're the only data type with pointers, so far.

/* types.h */
typedef struct Cons Cons;

Cons* cons_new(Object *car, Object *cdr);
void cons_free(Cons **this);
hash_t cons_hashCode(Cons *this);

The opaque pointer refers to:

/* types.c */
struct Cons {
    Object header;
    Object **car;
    Object **cdr;
};
#define safe_car(cell) (NULL == (cell)->car ? NULL : *((cell)->car))
#define safe_cdr(cell) (NULL == (cell)->cdr ? NULL : *((cell)->cdr))

Why use double pointers? Well, really, we should think of typedef struct Object* ObjectRef, and a cons-cell consists of a pointer to an ObjectRef. When we free the memory for the object pointed by the ObjectRef, we update the ObjectRef to be null. This requires using double pointers for this update to persist.

4.1. Constructor

The constructor works as we would expect: allocate the memory for the header and addresses. If one of the arguments is NULL, then its double pointer will point to NULL.

/* types.c */
Cons* cons_new(Object *car, Object *cdr) {
    Cons *cell = alloc(sizeof(*cell));
    cell->header.type = TYPE_CONS;
    cell->car = (NULL == car) ? NULL : &car;
    cell->cdr = (NULL == cdr) ? NULL : &cdr;
    return cell;
}

4.2. Destructor

This requires a bit more care. We'll iterate through the list, recursively freeing the car and cdr, before finally freeing the wrapper and metadata for the Cons cell itself.

/* types.c */
void cons_free(Cons **this) {
    if ((NULL == this) || (NULL == *this)) return;
    obj_free((*this)->car);
    //@ assert (\null == \old(*this)->car) || (\null == *((*this)->car));
    obj_free((*this)->cdr);
    //@ assert (\null == \old(*this)->cdr) || (\null == *((*this)->cdr));
    free(*this);
    (*this) = NULL;
}

4.3. Hash Code

The basic computation amounts to

hash_t list_hashCode(List *this) {
    hash_t hashCode = 1;
    for (Elt e : this)
        hashCode = 31*hashCode + ((NULL == e) ? 0 : hashCode(e));
    return hashCode;
}

The actual source code implementation needs to handle the special case where cell == nil, when it should be zero.

/* types.c */
hash_t cons_hashCode(Cons *cell) {
    if (NULL == cell) return 0;

    hash_t h = 1;
    Object *car = safe_car(cell);
    Object *cdr = safe_cdr(cell);

    if ((NULL == car) && (NULL == cdr)) {
        h = 0;
    } else {
        h = 31*h + hashCode(car);
        h = 31*h + hashCode(cdr);
    }
    return h;
}

4.4. Equality

We test for equality by checking the car of the left-hand side equals the car of the right-hand side, and the cdr of both sides are equal. Component-wise equality, in other words.

/* types.c */
bool cons_equals(Cons *lhs, Cons *rhs) {
    if (lhs == rhs) return true;
    if (NULL == lhs) return false;
    if (NULL == rhs) return false;

    if (equals(safe_car(lhs), safe_car(rhs))) {
        return equals(safe_cdr(lhs), safe_cdr(rhs));
    }
    return false;
}

5. Hash Tables

There are many different implementations. Basically the difference boils down to using "chaining" (the buckets are a linked list — Java did this prior to version 8) or "open addressing".

Some implementations don't require hash tables, but there are two reasons I'm including them. The first, Common Lisp provides them. The second, we can use them as the data structure for the environment (dictionary binding symbols to their values).

Arguably, we could implement hash tables using cons-cells and atomic types. In fact, that's what most Common Lisp implementations do.

We will implement a Hash Table in C.

/* types.h */
typedef struct HashMap HashMap;

HashMap* hashmap_new();
void hashmap_free(HashMap **this);
hash_t hashmap_hashCode(HashMap *this);
bool hashmap_equals(HashMap **lhs, HashMap **rhs);
Object* hashmap_get(HashMap *this, Object *key);
Object* hashmap_put(HashMap *this, Object *key, Object *value);
void hashmap_remove(HashMap *this, Object *key);
size_t hashmap_size(HashMap *this);

Arguably, there may be other functions to consider implementing.

5.1. Data Structures

We will use simple chaining. The idea is to use an array of buckets (a bucket is a linked list of "Entries" — key-value pairs plus some extra metadata), associating a key-value pair to exactly one bucket. When there are too many entries, we increase the number of buckets, then rebalance the entries.

5.1.1. Entry

An entry is a data structure with some extra metadata, namely the hash code for the key (so we compute it once).

/* types.c */

typedef struct Entry {
    Object *key;
    Object *value;
    hash_t hash;
    struct Entry *next;
} Entry;

5.1.1.1. Constructor

We have the usual boiler plate code for it, namely a constructor and destructor.

/* types.c */

Entry* entry_new(Object *key, Object *value, hash_t hash) {
    Entry *entry = alloc(sizeof(*entry));
    entry->key = key;
    entry->value = value;
    entry->next = NULL;
    entry->hash = hash;
    return entry;
}

5.1.1.2. Destructor

The only nuance to the Entry data structure is, unlike a Cons cell, its destructor will not recursively destroy its next Entry.

/* types.c */

void entry_free(Entry **entry) {
    if ((NULL == entry) || (NULL == *entry)) return;

    free(*entry);
    (*entry) = NULL;
}

5.1.1.3. Sorted Insertion

Given a pointer to (a pointer to) an entry, and an entry we wish to insert into the linked list, insert the new entry in a manner that the linked list is sorted by hash code.

/* types.c */
void entry_insert(Entry **list, Entry *entry) {
    assert(NULL != entry);
    if (NULL == list) return;
    if (NULL == *list) {
        (*list) = entry;
        entry->next = NULL;
        return;
    } else if ((entry->hash) <= ((*list)->hash)) {
        entry->next = (*list);
        (*list) = entry;
        return;
    } else {
        assert((entry->hash) > ((*list)->hash));
        Entry *prev = *item;
        Entry *e = NULL;
        for (e = prev->next; NULL != *e; e = e->next) {
            // if (prev < entry <= e)
            if (((prev->hash) < (entry->hash)) &&
                ((entry->hash) <= (e->hash))) {
                prev->next = entry;
                entry->next = e;
                return;
            }
            prev = e;
        }
        // append at the end of the linked list
        assert(NULL == e);
        assert(NULL != prev);
        prev->next = entry;
        entry->next = NULL;
    }
}

5.1.2. Hash Table

The Hash Map consists of an array of Entry pointers. We track how many entries belong to the hash map in its size parameter; and when it exceeds the capacity, we grow the number of buckets and rebalance.

/* types.c */

struct HashMap {
    Object header;
    Entry **table;
    size_t capacity;
    size_t size;
};

5.2. Constructor

The constructor expects no parameters. The default capacity for a hash map would be 16, its initial size 0.

/* types.c */

#define INITIAL_HASH_MAP_CAPACITY 16 /* must be power of 2 */

HashMap* hashmap_new() {
    HashMap *map = alloc(sizeof(*map));
    map->header.type = TYPE_HASH_MAP;
    map->capacity = INITIAL_HASH_MAP_CAPACITY;
    map->size = 0;
    map->table = alloc_array(sizeof(struct Entry*), map->capacity);
    return map;
}

5.3. Destructor

When we destroy a HashMap object, we also have to iteratively destroy the table entries.

/* types.c */

void hashmap_free(HashMap **this) {
    if ((NULL == this) || (NULL == *this)) return;

    for (size_t row = 0; row < (*this)->capacity; row++) {
        Entry *iter = (*this)->table[row];
        while (NULL != iter) {
            Entry *next = iter->next;
            entry_free(iter);
            iter = next;
        }
    }

    free(*this);
    (*this) = NULL;
}

5.4. Getting a Value for a Key

We then need to isolate a helper function to get the index for a given hash code. Since our code makes sure there are 2ⁿ buckets in the table, we can use the fact hash % length == hash & (length - 1). Using bitwise-and is faster than the modulo operator.

/* types.c */

static size_t indexFor(hash_t hash, size_t length) {
    return hash & (length - 1);
}

Now the process for obtaining the value for a given key is quite straightforward traversing the linked list ("bucket") selected by that index, and when finding that key, returning the associated value:

/* types.c */

Object* hashmap_get(HashMap *this, Object *key) {
    assert (NULL != this);

    hash_t h = hashCode(key);
    size_t index = indexFor(h, this->capacity);
    for (Entry *e = this->table[index]; e != NULL; e = e->next) {
        if (equals(key, e->key)) {
            return e->value;
        }
    }
    return NULL;
}

5.5. Associating a Value for a Key

We associate a value to a key in a given hash map by finding the bucket associated for the hash code for that key. First, we try finding the key in that bucket of entries by simple traversal. If we're lucky, it's already associated with a value. In that case, we just update the entry to point to the new value, and return the old value.

But if that bucket has not been started (so it's a null pointer) or if the bucket does not contain the key, then we create a new entry, and prepend it to that row of the table. Since no value is associated to the key, we return a null pointer.

/* types.c */

#define CAPACITY_FACTOR 0.75

static void resize(HashMap *this, size_t new_capacity);

Object* hashmap_put(HashMap *this, Object *key, Object *value) {
    assert(NULL != this);

    hash_t h = hashCode(key);
    size_t index = indexFor(h, this->capacity);
    Entry *entry = this->table[index];
    if (NULL != entry) {
        // try updating the value associated to key, if key is present
        Entry *e = entry;
        while (e != NULL) {
            if ((e->hash == h) && equals(keys, e->key)) {
                Object *old_val = e->value;
                e->value = value;
                return old_val;
            }
            e = e->next;
        }
        // fall-through: no key associated with that value, then prepend entry
    }
    // key is not present, so add entry
    Entry *new_entry = entry_new(key, value, h);
    // entry_insert(&(this->table[index]), new_entry);
    new_entry->next = this->table[index];
    this->table[index] = new_entry;
    this->size++;

    if ((this->size) >= (size_t)((this->capacity)*CAPACITY_FACTOR)) {
        resize(this, 2*(this->capacity));
    }
    return NULL;
}

5.5.1. Resizing the Entry Table

The basic idea is to resize the hash map's underlying table by creating a new array of entry pointers, then transfer the contents of the array, free the old array of pointers (just the array). We'll also need to update the table pointer, and the hash map's capacity.

/* types.c */

static void transfer(Entry **old_table, Entry **new_table,
                     size_t length, size_t new_length);

void resize(HashMap *this, size_t new_capacity) {
    if ((this->capacity) >= new_capacity) return;

    Entry **new_table = alloc_array(sizeof(struct Entry*), new_capacity);
    transfer(this->table, new_table, this->capacity, new_capacity);
    free(this->table);
    this->table = new_table;
    this->capacity = new_capacity;
}

5.5.2. Transferring Table Contents

Transferring the contents amounts to transferring rows in bucket i, for i=0 to the number of buckets.

/* types.c */

static void transfer_rows(Entry *row, Entry **new_table, size_t new_length);

void transfer(Entry **old_table, Entry **new_table,
              size_t length, size_t new_length) {
    for(size_t i = 0; i < length; i++) {
        transfer_rows(old_table[i]; new_table, new_length);
    }
}

void transfer_rows(Entry *row, Entry **new_table, size_t new_length) {
    Entry *e = row;
    while (NULL != e) {
        Entry *next = e->next;
        size_t index = indexFor(e->hash, new_length);
        // entry_insert(&new_table[index], e);
        if (NULL == new_table[index]) {
            e->next = NULL;
        } else {
            e->next = new_table[index];
        }
        new_table[index] = e;
        e = next;
    }
}

5.6. Removing a key from the Hash Map

If we want to disassociate a key from the hash map altogether (assuming the hash map contains it), then we simply update the bucket containing the associated hash.

Yes, I know about Linus Torvalds's commentary about using a double pointer here, but I found it less clear after a second glance.

/* types.c */

void hashmap_remove(HashMap *this, Object *key) {
    if (NULL == this) return;
    hash_t h = hashCode(key);
    size_t index = indexFor(h, this->capacity);
    Entry *entry = this->table[index];
    Entry *prev = NULL;

    while (NULL != entry) {
        if ((entry->hash == h) && equals(entry->key, key)) {
            if (NULL == prev) {
                this->table[index] = entry->next;
            } else {
                prev->next = entry->next;
            }
            entry_free(&entry);
            this->size--;
            return;
        }
    }
}

5.7. Size of the Hash Map

The size of the hash map has been abstracted away to a helper function. It just accesses the size field. For *this null pointers, it defaults to zero.

/* types.c */

size_t hashmap_size(HashMap *this) {
    if (NULL == this) return 0;

    return this->size;
}

5.8. Hash Code

We can implement the hashcode computation as treating it as a list of key-value entries.

(N.B. Java computes the hash code of a HashMap by iterating over the entry set, and using their hash code computation for a generic list. See, e.g., AbstractMap::hashCode in JDK8).

/* types.c */

static hash_t entry_hashCode(Entry *entry) {
    if (NULL == entry) return 0;
    else return hashCode(entry->key) ^ hashCode(entry->value);
}

hash_t hashmap_hashCode(HashMap *this) {
    hash_t h = 0;
    for (size_t row = 0; row < (this->capacity); row++) {
        for (Entry *e = (this->table[row]); NULL != e; e = e->next) {
            h += entry_hashCode(e);
        }
    }
    return h;
}

5.9. Equality of Hash Maps

This is trickier, but basically checks if every key in the lhs map appears in the rhs map, and if that key is associated to the same values. And if they have the same size, then the two hash maps are equal, and the method returns true. All other scenarios return false.

/* types.c */

bool hashmap_equals(HashMap *lhs, HashMap *rhs) {
    if (lhs == rhs) return true;
    if (NULL == lhs) return false;
    if (NULL == rhs) return false;
    if ((lhs->size) != (rhs->size)) return false;

    for (size_t row = 0; row < (lhs->capacity); row++) {
        for (Entry *e = (lhs->table[row]); NULL != e; e = e->next) {
            if (!equals(e->value, hashmap_get(rhs, e->key))) {
                return false;
            }
        }
    }
    return true;
}

5.10. Hash Table References

How does a HashMap work in JAVA
- OpenJDK's implementation changed from a linked list (jdk8-b01) to use a binary tree for buckets with 8 or more elements (by jdk8-b120)
Intel's concurrent_unordered_map
Attractive Chaos's khash
- Implementation Notes
- Comparison of Hash Table Libraries
An Analysis of Hash Map Implementations in Popular Languages
Common Lisp implementations all seem to use chaining
- SBCL
- CCL
- ABCL