Hash Tables

Hash Functions and Hash Tables

A hash function h maps keys of a given type to integers in a fixed interval [0, N − 1]
Example:
h(x) = x mod N
is a hash function for integer keys
The integer h(x) is called the hash value of key x
A hash table for a given key type consists of
- Hash function h
- Array (called table) of size N
When implementing a dictionary with a hash table, the goal is to store item (k, o) at index i = h(k)

Example

We design a hash table for a dictionary storing items (SSN, Name), where SSN (social security number) is a nine-digit positive integer
Our hash table uses an array of size N = 10,000 and the hash function h(x) = last four digits of x

Hash Functions

A hash function is usually specified as the composition of two functions:
- Hash code map: h1: keys → integers
- Compression map: h2: integers → [0, N − 1]
The hash code map is applied first, and the compression map is applied next on the result, i.e., h(x) = h2(h1(x))
The goal of the hash function is to “disperse” the keys in an apparently random way

Hash Code Maps

Memory address:
- We reinterpret the memory address of the key object as an integer (default hash code of all Java objects)
- Good in general, except for numeric and string keys
Integer cast:
- We reinterpret the bits of the key as an integer
- Suitable for keys of length less than or equal to the number of bits of the integer type (e.g., byte, short, int and float in Java)
Component sum:
- We partition the bits of the key into components of fixed length (e.g., 16 or 32 bits) and we sum the components (ignoring overflows)
- Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type (e.g., long and double in Java)
Polynomial accumulation:
- We partition the bits of the key into a sequence of components of fixed length (e.g., 8, 16 or 32 bits) a0 a1 …an−1
- We evaluate the polynomial p(z) = a0 + a1 z + a2 z2 + … …+ an−1zn−1 at a fixed value z, ignoring overflows
- Especially suitable for strings (e.g., the choice z = 33 gives at most 6 collisions on a set of 50,000 English words)
- Polynomial p(z) can be evaluated in O(n) time using Horner’s rule:
  - The following polynomials are successively computed, each from the previous one in O(1) time
    p0(z) = an−1
    pi (z) = an−i−1 + zpi−1(z)
    (i = 1, 2, …, n −1)
- We have p(z) = pn−1(z)

Compression Maps

Division:
- h2 (y) = y mod N
- The size N of the hash table is usually chosen to be a prime
- The reason has to do with number theory and is beyond the scope of this course
Multiply, Add and Divide (MAD):
- h2 (y) = (ay + b) mod N
- a and b are nonnegative integers such that a mod N ≠ 0
- Otherwise, every integer would map to the same value b

Collision Handling

Collisions occur when different elements are mapped to the same cell
Chaining:
- let each cell in the table point to a linked list of elements that map there
- Chaining is simple, but requires additional memory outside the table
Linear Probing
- Open addressing: the colliding item is placed in a different cell of the table
- Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell
- Each table cell inspected is referred to as a “probe”
- Colliding items lump together, causing future collisions to cause a longer sequence of probes
- Example:􀂄 h(x) = x mod 13; 􀂄 Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
Double Hashing
- Double hashing uses a secondary hash function d(k) and handles collisions by placing an item in the first available cell of the series (i + jd(k)) mod N for j = 0, 1, … , N − 1
- The secondary hash function d(k) cannot have zero values
- The table size N must be a prime to allow probing of all the cells
- Common choice of compression map for the secondary hash function: d2(k) = q − k mod q where􀂄 q < N􀂄 q is a prime
- The possible values for d2(k) are 1, 2, … , q

Search with Linear Probing

Consider a hash table A that uses linear probing
findElement(k)
- We start at cell h(k)
- We probe consecutive locations until one of the following occurs
  - An item with key k is found, or
  - An empty cell is found, or
  - N cells have been unsuccessfully probed

Algorithm findElement(k)
  i ← h(k)
  p ← 0
  repeat
    c ← A[i]
    if c = ∅
      return NO_SUCH_KEY
    else if c.key () = k
      return c.element()
    else
      i ← (i + 1) mod N
      p ← p + 1
  until p = N
  return NO_SUCH_KEY

Updates with Linear Probing

To handle insertions and deletions, we introduce a special object, called AVAILABLE, which replaces deleted elements
removeElement(k)
- We search for an item with key k
- If such an item (k, o) is found, we replace it with the special item AVAILABLE and we return element o
- Else, we return NO_SUCH_KEY
insert Item(k, o)
- We throw an exception if the table is full
- We start at cell h(k)
- We probe consecutive cells until one of the following occurs
  - A cell i is found that is either empty or stores AVAILABLE, or 􀂊
  - N cells have been unsuccessfully probed
- We store item (k, o) in cell i

Example of Double Hashing

Consider a hash table storing integer keys that handles collision with double hashing N = 13
h(k) = k mod 13
d(k) = 7 − k mod 7

Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order

Performance of Hashing

In the worst case, searches, insertions and removals on a hash table take O(n) time
The worst case occurs when all the keys inserted into the dictionary collide
The load factor α = n/N affects the performance of a hash table
Assuming that the hash values are like random numbers, it can be shown that the expected number of probes for an insertion with open addressing is 1 / (1 − α)
The expected running time of all the dictionary ADT operations in a hash table is O(1)
In practice, hashing is very fast provided the load factor is not close to 100%
Applications of hash tables: small databases, compilers, browser caches

Universal Hashing

A family of hash functions is universal if, for any 0<i,j<M-1, Pr(h(j)=h(k)) < 1/N.
Choose p as a prime between M and 2M.
Randomly select 0<a<p and 0<b<p, and define h(k)=(ak+b mod p) mod N
Theorem: The set of all functions, h, as defined here, is universal.

TODO: Proof of Universality