Pattern Matching

Strings

  • A string is a sequence of characters
  • Examples of strings: Java program, HTML document, DNA sequence, Digitized image
  • An alphabet Σ is the set of possible characters for a family of strings
  • Example of alphabets: ASCII, Unicode, {0, 1}, {A, C, G, T}
  • Let P be a string of size m
    • A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j
    • A prefix of P is a substring of the type P[0 .. i]
    • A suffix of P is a substring of the type P[i ..m − 1]
  • Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P
  • Applications: Text editors, Search engines, Biological research

Brute-Force Algorithm

  • The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either 􀂄
    • a match is found,
    • or 􀂄 all placements of the pattern have been tried
  • Brute-force pattern matching runs in time O(nm)
  • Example of worst case:
    • T = aaa … ah
    • P = aaah
    • may occur in images and DNA sequences
    • unlikely in English text
Algorithm BruteForceMatch(T, P)
  Input text T of size n and pattern P of size m
  Output starting index of a substring of T equal to P or −1 if no such substring exists
  for i ← 0 to n − m
    { test shift i of the pattern }
    j ← 0
  while j < m ∧ T[i + j] = P[j]
    j ← j + 1
  if j = m
    return i {match at i}
  else
    break while loop {mismatch}
  return -1 {no match anywhere}

Boyer-Moore Heuristics

  • The Boyer-Moore’s pattern matching algorithm is based on two heuristics
    • Looking-glass heuristic: Compare P with a subsequence of T moving backwards
    • Character-jump heuristic: When a mismatch occurs at T[i] = c
      • If P contains c, shift P to align the last occurrence of c in P with T[i]
      • Else, shift P to align P[0] with T[i + 1]
  • Example Boyer_Moore

Last-Occurrence Function

  • Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L mapping Σ to integers, where L(c) is defined as 􀂄
    • the largest index i such that P[i] = c or
    • −1 if no such index exists
  • Example:
    • Σ = {a, b, c, d}
    • P = abacab

Boyer_Moore_Last_Occurrence

  • The last-occurrence function can be represented by an array indexed by the numeric codes of the characters
  • The last-occurrence function can be computed in time O(m + s), where m is the size of P and s is the size of Σ

The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
  L ← lastOccurenceFunction(P, Σ )
  i ← m − 1
  j ← m − 1
  repeat
    if T[i] = P[j]
      if j = 0
        return i { match at i }
      else
        i ← i − 1
        j ← j − 1
    else
      { character-jump }
      l ← L[T[i]]
      i ← i + m – min(j, 1 + l)
      j ← m − 1
  until i > n − 1
  return1 { no match }

Algrithm

Analysis

  • Boyer-Moore’s algorithm runs in time O(nm + s)
  • Example of worst case:
    • T = aaa … a
    • P = baaa
  • The worst case may occur in images and DNA sequences but is unlikely in English text
  • Boyer-Moore’s algorithm is significantly faster than the brute-force algorithm on English text

Boyer_Moore_Analysis

Knuth-Morris-Pratt’s algorithm

  • Knuth-Morris-Pratt’s algorithm compares the pattern to the text in left-to-right, but shifts the pattern more intelligently than the brute-force algorithm.
  • When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
  • Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]

KMP

Example

KPM