Hash Tables
Hash Functions and Hash Tables
- A hash function h maps keys of a given type to integers in a fixed interval [0, N − 1]
- Example:
h(x) = x mod N
is a hash function for integer keys - The integer h(x) is called the hash value of key x
- A hash table for a given key type consists of
- Hash function h
- Array (called table) of size N
- When implementing a dictionary with a hash table, the goal is to store item (k, o) at index i = h(k)
Example
- We design a hash table for a dictionary storing items (SSN, Name), where SSN (social security number) is a nine-digit positive integer
- Our hash table uses an array of size N = 10,000 and the hash function h(x) = last four digits of x
Hash Functions
- A hash function is usually specified as the composition of two functions:
- Hash code map: h1: keys → integers
- Compression map: h2: integers → [0, N − 1]
- The hash code map is applied first, and the compression map is applied next on the result, i.e., h(x) = h2(h1(x))
- The goal of the hash function is to “disperse” the keys in an apparently random way
Hash Code Maps
- Memory address:
- We reinterpret the memory address of the key object as an integer (default hash code of all Java objects)
- Good in general, except for numeric and string keys
- Integer cast:
- We reinterpret the bits of the key as an integer
- Suitable for keys of length less than or equal to the number of bits of the integer type (e.g., byte, short, int and float in Java)
- Component sum:
- We partition the bits of the key into components of fixed length (e.g., 16 or 32 bits) and we sum the components (ignoring overflows)
- Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type (e.g., long and double in Java)
- Polynomial accumulation:
- We partition the bits of the key into a sequence of components of fixed length (e.g., 8, 16 or 32 bits) a0 a1 …an−1
- We evaluate the polynomial p(z) = a0 + a1 z + a2 z2 + … …+ an−1zn−1 at a fixed value z, ignoring overflows
- Especially suitable for strings (e.g., the choice z = 33 gives at most 6 collisions on a set of 50,000 English words)
- Polynomial p(z) can be evaluated in O(n) time using Horner’s rule:
- The following polynomials are successively computed, each from the previous one in O(1) time
p0(z) = an−1
pi (z) = an−i−1 + zpi−1(z)
(i = 1, 2, …, n −1)
- The following polynomials are successively computed, each from the previous one in O(1) time
- We have p(z) = pn−1(z)
Compression Maps
- Division:
- h2 (y) = y mod N
- The size N of the hash table is usually chosen to be a prime
- The reason has to do with number theory and is beyond the scope of this course
- Multiply, Add and Divide (MAD):
- h2 (y) = (ay + b) mod N
- a and b are nonnegative integers such that a mod N ≠ 0
- Otherwise, every integer would map to the same value b
Collision Handling
- Collisions occur when different elements are mapped to the same cell
- Chaining:
- let each cell in the table point to a linked list of elements that map there
- Chaining is simple, but requires additional memory outside the table
- Linear Probing
- Open addressing: the colliding item is placed in a different cell of the table
- Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell
- Each table cell inspected is referred to as a “probe”
- Colliding items lump together, causing future collisions to cause a longer sequence of probes
- Example: h(x) = x mod 13; Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
- Double Hashing
- Double hashing uses a secondary hash function d(k) and handles collisions by placing an item in the first available cell of the series (i + jd(k)) mod N for j = 0, 1, … , N − 1
- The secondary hash function d(k) cannot have zero values
- The table size N must be a prime to allow probing of all the cells
- Common choice of compression map for the secondary hash function: d2(k) = q − k mod q where q < N q is a prime
- The possible values for d2(k) are 1, 2, … , q
Search with Linear Probing
- Consider a hash table A that uses linear probing
- findElement(k)
- We start at cell h(k)
- We probe consecutive locations until one of the following occurs
- An item with key k is found, or
- An empty cell is found, or
- N cells have been unsuccessfully probed
Algorithm findElement(k)
i ← h(k)
p ← 0
repeat
c ← A[i]
if c = ∅
return NO_SUCH_KEY
else if c.key () = k
return c.element()
else
i ← (i + 1) mod N
p ← p + 1
until p = N
return NO_SUCH_KEY
Updates with Linear Probing
- To handle insertions and deletions, we introduce a special object, called AVAILABLE, which replaces deleted elements
- removeElement(k)
- We search for an item with key k
- If such an item (k, o) is found, we replace it with the special item AVAILABLE and we return element o
- Else, we return NO_SUCH_KEY
- insert Item(k, o)
- We throw an exception if the table is full
- We start at cell h(k)
- We probe consecutive cells until one of the following occurs
- A cell i is found that is either empty or stores AVAILABLE, or
- N cells have been unsuccessfully probed
- We store item (k, o) in cell i
Example of Double Hashing
Consider a hash table storing integer keys that handles collision with double hashing
N = 13
h(k) = k mod 13
d(k) = 7 − k mod 7
Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
Performance of Hashing
- In the worst case, searches, insertions and removals on a hash table take O(n) time
- The worst case occurs when all the keys inserted into the dictionary collide
- The load factor α = n/N affects the performance of a hash table
- Assuming that the hash values are like random numbers, it can be shown that the expected number of probes for an insertion with open addressing is 1 / (1 − α)
- The expected running time of all the dictionary ADT operations in a hash table is O(1)
- In practice, hashing is very fast provided the load factor is not close to 100%
- Applications of hash tables: small databases, compilers, browser caches
Universal Hashing
- A family of hash functions is universal if, for any 0<i,j<M-1, Pr(h(j)=h(k)) < 1/N.
- Choose p as a prime between M and 2M.
- Randomly select 0<a<p and 0<b<p, and define h(k)=(ak+b mod p) mod N
- Theorem: The set of all functions, h, as defined here, is universal.
TODO: Proof of Universality