书名：Hands-On Concurrency with Rust
作者名：Brian L. Troutwine
本章字数：970字
更新时间：2021-06-25 21:11:52

Standard library HashMap

Let's poke around in HashMap's internals. A good starting place, I find, for inspecting unfamiliar Rust data structures is jumping into the source to the struct definition itself. Especially in the Rust codebase, there will be public rustdoc comments and private comments explaining implementation ambitions. The reader is warmly encouraged to inspect the HashMap comments for themselves. In this book, we're inspecting Rust at SHA da569fa9ddf8369a9809184d43c600dc06bd4b4d. The comments of src/libstd/collections/hash/map.rs explain that the HashMap is implemented with linear probing Robin Hood bucket stealing. Linear probing implies that we'll find HashMap implemented in terms of a cell storage—probably a contiguous memory structure for reasons we'll discuss shortly—and should be able to understand the implementation pretty directly, linear probing being a common method of implementing associative arrays. Robin Hood bucket stealing is maybe less common, but the private comments describe it as the main performance trick in this hashmap and then goes on to quote Pedro Celis' 1986 Robin Hood Hashing:

"If an insertion collides with an existing element, and that element's "probe distance" (how far away the element is from its ideal location) is higher than how far we've already probed, swap the elements."

If you pull the thesis yourself, you'll further find:

"(Robin Hood Hashing) leads to a new search algorithm which requires less than 2.6 probes on average to perform a successful search even when the table is nearly full. Unsuccessful searches require only O(ln n) probes."

Not bad. Here's HashMap:

pub struct HashMap<K, V, S = RandomState> {
    // All hashes are keyed on these values, to prevent 
    // hash collision attacks.
    hash_builder: S,

    table: RawTable<K, V>,

    resize_policy: DefaultResizePolicy,
}

Okay, so, it interacts with RawTable<K, V> and we can jump to that definition:

pub struct RawTable<K, V> {
    capacity_mask: usize,
    size: usize,
    hashes: TaggedHashUintPtr,

    // Because K/V do not appear directly in any of the
    // types in the struct, inform rustc that in fact 
    // instances of K and V are reachable from here.
    marker: marker::PhantomData<(K, V)>,
}

But it's maybe not the most illuminating struct definition either. Common operations on a collection are often a good place to dig in, so let's look at HashMap::insert:

    pub fn insert(&mut self, k: K, v: V) -> Option<V> {
        let hash = self.make_hash(&k);
        self.reserve(1);
        self.insert_hashed_nocheck(hash, k, v)
    }

And then on into HashMap::reserve:

    pub fn reserve(&mut self, additional: usize) {
        let remaining = self.capacity() - self.len(); // this can't 
        overflow
        if remaining < additional {
            let min_cap = 
            self.len().checked_add(additional)
                      .expect("reserve overflow");
            let raw_cap = self.resize_policy.raw_capacity(min_cap);
            self.resize(raw_cap);
        } else if self.table.tag() && remaining <= self.len() {
            // Probe sequence is too long and table is half full,
            // resize early to reduce probing length.
            let new_capacity = self.table.capacity() * 2;
            self.resize(new_capacity);
        }
    }

Whether reserve expands the underlying storage capacity because we've got near the current total capacity of that storage or to reduce probe lengths, HashMap::resize is called. That's:

    #[inline(never)]
    #[cold]
    fn resize(&mut self, new_raw_cap: usize) {
        assert!(self.table.size() <= new_raw_cap);
        assert!(new_raw_cap.is_power_of_two() || new_raw_cap == 0);

        let mut old_table = replace(&mut self.table, 
        RawTable::new(new_raw_cap));
        let old_size = old_table.size();

        if old_table.size() == 0 {
            return;
        }

        let mut bucket = Bucket::head_bucket(&mut old_table);

        // This is how the buckets might be laid out in memory:
        // ($ marks an initialized bucket)
        //  ________________
        // |$$$_$$$$$$_$$$$$|
        //
        // But we've skipped the entire initial cluster of buckets
        // and will continue iteration in this order:
        //  ________________
        //     |$$$$$$_$$$$$
        //                  ^ wrap around once end is reached
        //  ________________
        //  $$$_____________|
        //    ^ exit once table.size == 0
        loop {
            bucket = match bucket.peek() {
                Full(bucket) => {
                    let h = bucket.hash();
                    let (b, k, v) = bucket.take();
                    self.insert_hashed_ordered(h, k, v);
                    if b.table().size() == 0 {
                        break;
                    }
                    b.into_bucket()
                }
                Empty(b) => b.into_bucket(),
            };
            bucket.next();
        }

        assert_eq!(self.table.size(), old_size);
    }

We're back at RawTable and have a lead on a structure called Bucket:

pub struct Bucket<K, V, M> {
    raw: RawBucket<K, V>,
    table: M,
}

Okay, we're going in some circles here. The bucket parameterizes the table on M rather than the table holding some collection of buckets. The RawBucket is described as an unsafe view of a RawTable bucket of which there are two variants—EmptyBucket and FullBucket. These variants are used to populate a BucketState enumeration:

pub enum BucketState<K, V, M> {
    Empty(EmptyBucket<K, V, M>),
    Full(FullBucket<K, V, M>),
}

Here, we can see it being used back in HashMap::insert_hashed_ordered:

    fn insert_hashed_ordered(&mut self, hash: SafeHash, k: K, v: V) {
        let mut buckets = Bucket::new(&mut self.table, hash);
        let start_index = buckets.index();

        loop {
            // We don't need to compare hashes for value swap.
            // Not even DIBs for Robin Hood.
            buckets = match buckets.peek() {
                Empty(empty) => {
                    empty.put(hash, k, v);
                    return;
                }
                Full(b) => b.into_bucket(),
            };
            buckets.next();
            debug_assert!(buckets.index() != start_index);
        }
    }

If we explore down into empty.push(hash, k, v), we find ourselves at the following:

    pub fn put(mut self, hash: SafeHash, key: K, value: V) 
        -> FullBucket<K, V, M> 
    {
        unsafe {
            *self.raw.hash() = hash.inspect();
            ptr::write(self.raw.pair(), (key, value));

            self.table.borrow_table_mut().size += 1;
        }

        FullBucket {
            raw: self.raw,
            table: self.table,
        }
    }

Tidy. ptr::write(self.raw.pair(), (key, value)) demonstrates that we're working with a structure built out of raw memory pointer manipulation, befitting a structure that will see a lot of use in critical paths. self.raw.pair(), which returns the appropriate offset to move (key, value) into, matching the HashMap::insert move semantics we're already familiar with. Take a look at the definition of RawBucket:

pub struct RawBucket<K, V> {
    hash_start: *mut HashUint,
    // We use *const to ensure covariance with respect to K and V
    pair_start: *const (K, V),
    idx: usize,
    _marker: marker::PhantomData<(K, V)>,
}

Here, we've got two raw pointers: hash_start and pair_start. The former is the first location in memory of our stored pairs' hashes, the latter is the first location in the memory of the pairs. What this ends up being is contiguous storage in memory of two sets of data, which is about what you'd expect. The table module documentation refers to this approach as unzipped arrays. RawTable holds the capacity and the data, kind of. In reality, the data held by RawTable is carefully placed in memory, as we've seen, but there's no owner as the Rust type system understands it. That's where marker: marker::PhantomData<(K, V)> comes in. PhantomData instructs the compiler to behave as if RawTable<K, V> owns pairs of (K, V), even though with all the unsafe pointer manipulation we're doing that can't actually be proven by Rust. We human observers can determine by inspection that RawTable owns its data via RawTable::raw_bucket_at as it computes where in memory the data exists:

    fn raw_bucket_at(&self, index: usize) -> RawBucket<K, V> {
        let hashes_size = self.capacity() * size_of::<HashUint>();
        let pairs_size = self.capacity() * size_of::<(K, V)>();

        let (pairs_offset, _, oflo) =
            calculate_offsets(hashes_size, pairs_size, 
                              align_of::<(K, V)>());
        debug_assert!(!oflo, "capacity overflow");

        let buffer = self.hashes.ptr() as *mut u8;
        unsafe {
            RawBucket {
                hash_start: buffer as *mut HashUint,
                pair_start: buffer.offset(pairs_offset as isize) 
                              as *const (K, V),
                idx: index,
                _marker: marker::PhantomData,
            }
        }
    }

Well, by inspection and testing, as you can see at the bottom of the module.