- Hands-On Concurrency with Rust
- Brian L. Troutwine
Naive HashMap
How else could we implement an associative array in Rust, and how would it compare to the standard library's HashMap? The standard library HashMap is clever, clearly, storing just enough information to keep total probes near their ideal offsets by sharing information between modifications to the underlying table. In fact, the comments in the table module assert that the design we've just worked through is a lot faster than a table structured as Vec<Option<(u64, K, V)>> (where u64 is the key hash), presumably still using Robin Hood hashing and linear probing. What if we went even simpler? We'll support only two operations, insert and lookup, to keep things straightforward for ourselves. We'll also keep more or less the same type constraints as HashMap so we compare similar things.
Start a new Rust project called naive_hashmap:
```shell
> cargo new naive_hashmap
     Created library `naive_hashmap` project
```
Edit naive_hashmap/Cargo.toml to look like so:
```toml
[package]
name = "naive_hashmap"
version = "0.1.0"

[dependencies]
rand = "0.4.2"

[dev-dependencies]
quickcheck = "0.6"
criterion = "0.1"

[[bench]]
name = "naive"
harness = false

[[bin]]
name = "naive_interpreter"
doc = false

[[bin]]
name = "standard"
doc = false

[[bin]]
name = "naive"
doc = false

[[bench]]
name = "specialized"
harness = false

[[bin]]
name = "specialized_interpreter"
doc = false

[[bin]]
name = "specialized"
doc = false
```
Don't worry too much right now about some of the development dependencies; they'll be explained in due course. Please note that in the following discussion we will run compilation commands against the project before all the code has been discussed. If you are following along with the book's pre-written source code open already, you should have no issues. If you are writing the source out as the book goes along, you will need to comment targets out. Now, open naive_hashmap/src/lib.rs and add the following preamble:
```rust
#[cfg(test)]
extern crate quickcheck;

use std::hash::{BuildHasher, Hash, Hasher};
use std::borrow::Borrow;
use std::collections::hash_map::RandomState;
use std::{cmp, mem, ptr};
```
Most crate roots begin with a fairly long preamble, and naive_hashmap is no exception. Next up, our HashMap struct:
```rust
#[derive(Default)]
pub struct HashMap<K, V, S = RandomState>
where
    K: Eq,
    V: ::std::fmt::Debug,
{
    hash_builder: S,
    data: Vec<(u64, K, V)>,
}
```
How does our naive HashMap differ from the standard library's HashMap we've seen so far? The naive implementation maintains the parameterized hasher and type constraints, plus a constraint on V to implement the Debug trait for ease of fiddling. The primary difference, in terms of underlying structure, is the use of a Vec, as called out in the comments of the standard library HashMap, but without an Option wrapper. We're not implementing a delete operation, so there's no reason to have an Option; even if we were, the plan for this implementation would be to rely solely on Vec::remove. It should be well understood that Vec::remove is less than ideal, shifting every element to the right of the removal index one slot to the left. Folks, this won't be a fast implementation. Now:
```rust
impl<K, V> HashMap<K, V, RandomState>
where
    K: Eq + Hash,
    V: ::std::fmt::Debug,
{
    pub fn new() -> HashMap<K, V> {
        HashMap {
            hash_builder: RandomState::new(),
            data: Vec::new(),
        }
    }
}

fn make_hash<T: ?Sized, S>(hash_builder: &S, t: &T) -> u64
where
    T: Hash,
    S: BuildHasher,
{
    let mut state = hash_builder.build_hasher();
    t.hash(&mut state);
    state.finish()
}

impl<K, V, S> HashMap<K, V, S>
where
    K: Eq + Hash,
    S: BuildHasher,
    V: ::std::fmt::Debug,
{
    pub fn with_hasher(hash_builder: S) -> HashMap<K, V, S> {
        HashMap {
            hash_builder: hash_builder,
            data: Vec::new(),
        }
    }
```
Our naive implementation is following along with the standard library here, implementing a HashMap that is parameterized on RandomState (so users don't have to think about the underlying hasher) and in which the hasher is swappable, via HashMap::with_hasher. The Rust team chose to implement RandomState in terms of a keyed hash function designed to resist HashDoS attacks, befitting a language that is intended for use on the public internet. Some users won't desire this property, opting instead for a much faster, potentially vulnerable hash, and will swap RandomState out for something else. Our naive HashMap retains this ability.
Let's examine insertion into our naive HashMap:
```rust
    pub fn insert(&mut self, k: K, v: V) -> Option<V> {
        let hash = make_hash(&self.hash_builder, &k);
        let end = self.data.len();
        for idx in 0..end {
            match self.data[idx].0.cmp(&hash) {
                cmp::Ordering::Greater => {
                    self.data.insert(idx, (hash, k, v));
                    return None;
                }
                cmp::Ordering::Less => continue,
                cmp::Ordering::Equal => {
                    let old = mem::replace(&mut self.data[idx].2, v);
                    return Some(old);
                }
            }
        }
        self.data.push((hash, k, v));
        None
    }
```
Our naive implementation maintains the API of the standard library HashMap, but that's about it. The key is hashed and then a linear search is done through the entire data store to find an index in that store where one of two conditions holds:
- Our new hash is greater than some hash in the store, in which case we can insert our key/value pair
- Our new hash is equal to some hash in the store, in which case we can replace the existing value
Key/value pairs are stored in terms of their ordered hashes. The expense of an insert includes:
- Hashing the key
- Searching for the insert index, a linear operation to the number of stored key/value pairs
- Potentially shifting key/value pairs in memory to accommodate a new key to the store
Lookup follows a similar scheme:
```rust
    pub fn get<Q: ?Sized>(&mut self, k: &Q) -> Option<&V>
    where
        K: Borrow<Q> + ::std::fmt::Debug,
        Q: Hash + Eq + ::std::fmt::Debug,
    {
        let hash = make_hash(&self.hash_builder, k);
        for &(bucket_hash, _, ref v) in &self.data {
            if hash == bucket_hash {
                return Some(v);
            }
        }
        None
    }
```
The key is hashed and a linear search is done for that exact hash in storage. Here, we only pay the cost for:
- Hashing the key
- Searching for the retrieval offset, if one exists
Let's take a moment, before asking ourselves whether our program is fast, and consider whether it is correct. The usual way of demonstrating fitness for purpose of software is through unit testing, a minimal setup for which is built right into the Rust language. Unit testing is two processes wrapped into a single method. When writing unit tests, a programmer will:
- Produce example data that exercises some code path in the software
- Write further code to demonstrate that, when the example data is applied to the system under test, some desirable property or properties hold
This is a good testing methodology for demonstrating that the happy path of a system under test works as expected. The weak point of unit testing is the first process, where the programmer must think very hard and produce an example dataset that exercises the correct code paths, demonstrates the absence of edge cases, and so on. Human beings are poor at this task: it is tedious, we are biased toward demonstrating that a thing made by our own hands functions, and we have chronic blind spots. What we are pretty good at is cooking up high-level properties for our systems that must always hold, or hold in particular situations. Fortunately for us programmers, computers are exceptionally good at doing tedious, repetitive tasks, and the generation of example data for tests is exactly such a thing.