Hash Tables

Hash Functions and Hash Tables

  • A hash function h maps keys of a given type to integers in a fixed interval [0, N − 1]
  • Example:
    h(x) = x mod N
    is a hash function for integer keys
  • The integer h(x) is called the hash value of key x
  • A hash table for a given key type consists of
    • Hash function h
    • Array (called table) of size N
  • When implementing a dictionary with a hash table, the goal is to store item (k, o) at index i = h(k)

Example

  • We design a hash table for a dictionary storing items (SSN, Name), where SSN (social security number) is a nine-digit positive integer
  • Our hash table uses an array of size N = 10,000 and the hash function h(x) = last four digits of x

Hash Functions

  • A hash function is usually specified as the composition of two functions:
    • Hash code map: h1: keys → integers
    • Compression map: h2: integers → [0, N − 1]
  • The hash code map is applied first, and the compression map is applied next on the result, i.e., h(x) = h2(h1(x))
  • The goal of the hash function is to “disperse” the keys in an apparently random way

Hash Code Maps

  • Memory address:
    • We reinterpret the memory address of the key object as an integer (default hash code of all Java objects)
    • Good in general, except for numeric and string keys
  • Integer cast:
    • We reinterpret the bits of the key as an integer
    • Suitable for keys of length less than or equal to the number of bits of the integer type (e.g., byte, short, int and float in Java)
  • Component sum:
    • We partition the bits of the key into components of fixed length (e.g., 16 or 32 bits) and we sum the components (ignoring overflows)
    • Suitable for numeric keys of fixed length greater than or equal to the number of bits of the integer type (e.g., long and double in Java)
  • Polynomial accumulation:
    • We partition the bits of the key into a sequence of components of fixed length (e.g., 8, 16 or 32 bits) a0 a1 …an−1
    • We evaluate the polynomial p(z) = a0 + a1 z + a2 z2 + … …+ an−1zn−1 at a fixed value z, ignoring overflows
    • Especially suitable for strings (e.g., the choice z = 33 gives at most 6 collisions on a set of 50,000 English words)
    • Polynomial p(z) can be evaluated in O(n) time using Horner’s rule:
      • The following polynomials are successively computed, each from the previous one in O(1) time
        p0(z) = an−1
        pi (z) = an−i−1 + zpi−1(z)
        (i = 1, 2, …, n −1)
    • We have p(z) = pn−1(z)

Compression Maps

  • Division:
    • h2 (y) = y mod N
    • The size N of the hash table is usually chosen to be a prime
    • The reason has to do with number theory and is beyond the scope of this course
  • Multiply, Add and Divide (MAD):
    • h2 (y) = (ay + b) mod N
    • a and b are nonnegative integers such that a mod N ≠ 0
    • Otherwise, every integer would map to the same value b

Collision Handling

  • Collisions occur when different elements are mapped to the same cell
  • Chaining:
    • let each cell in the table point to a linked list of elements that map there
    • Chaining is simple, but requires additional memory outside the table
  • Linear Probing
    • Open addressing: the colliding item is placed in a different cell of the table
    • Linear probing handles collisions by placing the colliding item in the next (circularly) available table cell
    • Each table cell inspected is referred to as a “probe”
    • Colliding items lump together, causing future collisions to cause a longer sequence of probes
    • Example:􀂄 h(x) = x mod 13; 􀂄 Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order
  • Double Hashing
    • Double hashing uses a secondary hash function d(k) and handles collisions by placing an item in the first available cell of the series (i + jd(k)) mod N for j = 0, 1, … , N − 1
    • The secondary hash function d(k) cannot have zero values
    • The table size N must be a prime to allow probing of all the cells
    • Common choice of compression map for the secondary hash function: d2(k) = q − k mod q where􀂄 q < N􀂄 q is a prime
    • The possible values for d2(k) are 1, 2, … , q

Search with Linear Probing

  • Consider a hash table A that uses linear probing
  • findElement(k)
    • We start at cell h(k)
    • We probe consecutive locations until one of the following occurs
      • An item with key k is found, or
      • An empty cell is found, or
      • N cells have been unsuccessfully probed
Algorithm findElement(k)
  i ← h(k)
  p ← 0
  repeat
    c ← A[i]
    if c = ∅
      return NO_SUCH_KEY
    else if c.key () = k
      return c.element()
    else
      i ← (i + 1) mod N
      p ← p + 1
  until p = N
  return NO_SUCH_KEY

Updates with Linear Probing

  • To handle insertions and deletions, we introduce a special object, called AVAILABLE, which replaces deleted elements
  • removeElement(k)
    • We search for an item with key k
    • If such an item (k, o) is found, we replace it with the special item AVAILABLE and we return element o
    • Else, we return NO_SUCH_KEY
  • insert Item(k, o)
    • We throw an exception if the table is full
    • We start at cell h(k)
    • We probe consecutive cells until one of the following occurs
      • A cell i is found that is either empty or stores AVAILABLE, or 􀂊
      • N cells have been unsuccessfully probed
    • We store item (k, o) in cell i

Example of Double Hashing

Consider a hash table storing integer keys that handles collision with double hashing N = 13
h(k) = k mod 13
d(k) = 7 − k mod 7

Insert keys 18, 41, 22, 44, 59, 32, 31, 73, in this order

Performance of Hashing

  • In the worst case, searches, insertions and removals on a hash table take O(n) time
  • The worst case occurs when all the keys inserted into the dictionary collide
  • The load factor α = n/N affects the performance of a hash table
  • Assuming that the hash values are like random numbers, it can be shown that the expected number of probes for an insertion with open addressing is 1 / (1 − α)
  • The expected running time of all the dictionary ADT operations in a hash table is O(1)
  • In practice, hashing is very fast provided the load factor is not close to 100%
  • Applications of hash tables: small databases, compilers, browser caches

Universal Hashing

  • A family of hash functions is universal if, for any 0<i,j<M-1, Pr(h(j)=h(k)) < 1/N.
  • Choose p as a prime between M and 2M.
  • Randomly select 0<a<p and 0<b<p, and define h(k)=(ak+b mod p) mod N
  • Theorem: The set of all functions, h, as defined here, is universal.

TODO: Proof of Universality