Version 05/18/2019 Notes Index ↑ 

4 String Matching

Two string algorithms we study in this section have many applications in diverse fields. We adopt the inspiration from computational biology where they apply to DNA sequences. But applications in other areas abound, especially when we consider them as "sequence" algorithms which could apply to sequences of objects of any type with the appropriate comparison operators defined, held in any sequential data structure (array, deque, list, vector).

The first of these is the Longest Common Subsequence algorithm which, given a pair of strings (sequences) seeks a maximum length sequence that is a subsequence of each. The second is the Edit Distance algorithm which, given two sequences, asks for the minimum cost set of edit operations that transform one into the other. These two algorithms are related by both outcome and solution technique. The Longest Common Subsequence problem is equivalent to the Edit Distance problem with two operations insert and delete each of cost 1. And both problems yield to a dynamic programming solution.

Caution - substring v subsequence. There is a distinction in usage for the two terms substring and subsequence. A substring is defined to be a subsequence that is contiguous, with no gaps. Whereas a subsequence may skip through the containing sequence. The previous section dealt with the problem of finding a substring (pattern) within a string (search space). The LCS algorithm in this section deals with the problem of finding subsequences in common to two strings.

4.1 Longest Common Subsequence

Given two sequences s of length m and t of length n a common subsequence is a sequence that appears in both s and t. (Note that a subsequence need not be contiguous.) For example, the two strings s = "BDACADBC" and t = "ADBCABAD" have "DCB" in common:

s = BDACADBC
t = ADBCABAD

The longest common subsequence (LCS) problem is: given two sequences s and t, calculate their longest common subsequence: produce a longest sequence of items that appear left-to-right (but not necessarily in a contiguous block) in both sequences. Note that we say "a" not "the" because there may be more than one LCS in a pair of sequences.

In the example above, DCAB is a solution to LCS(s,t):

s = BDACADBC
t = ADBCABAD

A legitimate interpretation of the LCS is to provide an optimal way to "align" the two sequences:

s = BDACADBC-
t = ADBCA-BAD

where we use '-' as a place-holder for a "missing" item. Note by the way that there is another LCS for these two sequences:

s = BDACADBC = --BDACADBC
t = ADBCABAD = ADBCABAD--

We will use strings using the alphabet {A,B,C,D,E,...,Z} (upper case letters) in illustrations. The most well-known area of application is DNA sequences, which are modelled with the 4-character alphabet {A,C,T,G}. Here is one more example:

s = ABADEFGAXBCDYBCDZAABD = ABADEFGAXBCDYBCD-ZAABD
t = HJXKLYKKKKZJJJ        = HJ------XKL-YKKKKZJJJ-

We will use Dynamic Programming to solve this problem. As usual, we first derive a solution for finding the length L of the longest common subsequence. Then finding the actual subsequence can be calculated using backtracing in the sub-problem space. Let

L(i,j) = length of the LCS of s[0..i-1] = s[0..i) and t[0..j-1] = t[0..j)

Note that L = L(m,n) is the solution. Consider the situation where we want to calculate L(i,j) from solutions on shorter initial segments.

Case 1: s[i-1] = t[j-1]

Then we can create a longer common subsequence by appending this item. If a longest common subsequence of s[0..i-1) and t[0..j) uses t[j], then the last matching item in s can be moved forward to s[i-1], and similarly for LCS(i,j-1). Therefore in this case

L(i,j) = L(i-1,j-1) + 1

Case 2: s[i-1] ≠ t[j-1]

Then any common subsequence has to ignore either s[i-1] or t[j-1], so we conclude

L(i,j) = Max ( L(i,j-1) , L(i-1,j) )

This analysis leads to the following bottom-up calculation:

void LCS (s,t)
{
  fsu::Matrix<size_t> L (m+1,n+1,0);
  for (size_t i = 1; i <= m; ++i) 
  {
    for (size_t j = 1; j <= n; ++j)
    {
      if (s[i-1] == t[j-1])
        L[i,j] = 1 + L[i-1,j-1];
      else
        L[i,j] = Max ( L[i-1,j] , L[i,j-1] );
    }
  }
  // L[m,n] is the length of a LCS
}

Because indexing of sequences in practice is [0,n) = [0,n-1], and also because we need the 0-index row and column of L for initialization, the matrix indices are [0,m]×[0,n] and the sequence indices are offset by 1. The matrix indices track one higher than the string indices because it is convenient to initialize the process with M[i,0] = M[0,j] = 0.

Constructing the actual LCS (in both s and t) can be done with a backtracing recurrence subroutine of the executor, operating on the matrix L after LCS has been executed to populate L with subproblem solutions, using BitVectors bvs and bvt:

// backtracing subroutine
{
  bvs.Unset(); bvt.Unset();
  for (size_t i = m, j = n; i > 0, j > 0; ) 
  {
    if (s[i-1] == t[j-1])
    {
      bvs.Set(i-1); bvt.Set(j-1); // indicates s[i-1] == t[j-1] is in the LCS
      --i; --j;
    }
    else if (L[i,j] == L[i-1,j])
    {
      --i;
    }
    else // L[i,j] == L[i,j-1]
    {
      --j;
    }
  }
}

The overall code organization is similar to the bottom-up code for the Knapsack problem (illustrated for C-strings s,t):

size_t LCS(const char* s, size_t m, const char* t, size_t n, fsu::BitVector& bvs, fsu::BitVector& bvt); // executor, called by client
void   LCS(const char* s, size_t m, const char* t, size_t n, fsu::Matrix<size_t>& L); // subsidiary called by executor; sets values in L

size_t LCS(const char* s, size_t m, const char* t, size_t n, fsu::BitVector& bvs, fsu::BitVector& bvt)
{
  Matrix <size_t> L(1+m,1+n,0);
  LCS(s, m, t, n, L); // call to matrix calculator sets values of L
  {
    // subroutine setting bvs and bvt (as above)
  }
  return L[m][n];
}

Proposition 1. The Longest Common Subsequence calculator LCS(m,n) has runtime Θ(m×n) and runspace +Θ(m×n). The backtracing subroutine has runtime Θ(m+n) and operates in existing space.

To make optimal practical use of LCS it is best practice to display the two sequences aligned with vertical bars indicating matches (rather than color), as in this sample output from LCS_i.x:

>LCS.x CCGATGATCATTGCCAGTCCACCGATTGTGAGAACGACAGCGACTCCAGC CCGATGACTTTTGCAGCTCCACCGATTTTGGTCCAGC
  Length of LCS:  33
  ...
  optimal alignment:
   s = CCGATGA--TCATTGCCAG-TCCACCGA-TTGTGAGAACGACAGCGACTCCAGC
       |||||||  |  ||| ||| |||||||| || |          | |  ||||||
   t = CCGATGACTT--TTG-CAGCTCCACCGATTT-T----------G-G--TCCAGC

(Other information, of interest to the algorithm developer but not necessarily to the user, is also displayed by LCS_i.x.)

Exercise 1. Locate the directory LIB/notes_support in your course library and copy/execute the demonstration program:

LCS_i.x   # Longest Common Substring

(As usual, the "_i" indicates Linux/Intel executable. Entering the command without arguments prints some documentation to screen.)

Use the example code above for the LCS calculator, devise and implement a display process that uses the results of a call to the LCS algorithm (that is, the return value and the bitvectors) and produces the enhanced output as in the demo program. This will require an alignment process that displays the aligned strings (as illusrated above). After development, your program should behave identically to LCS_i.x.

As always, type the code using the eyes->brain->fingers->keyboard->screen->eyes loop (or equivalent). Do not copy/paste. ∎

4.2 Edit Distance

The edit distance between two strings is important in widely diverse applications ranging from computational biology (for example, finding the distance between two DNA sequences), information extraction, speech recognition, and other areas where comparison of two data sequences is important. Here is an excellent introduction with a definition of edit distance (and Levenshtein distance in particular) and an overview of the calculation of edit distance using dynamic programming:

Minimum Edit Distance - Explained! by Dan Jurafsky, Stanford University.

We begin with the same input requirements as LCS, but with modified results: return the edit distance between two strings s along with an "edit transcript" that can be used to transform s into t. Define 3 edit operations and their cost as follows:

To understanding edit distance, consider the first element of s and compare with the first element of t. We have four choices:

  1. If the elements are equal we can do nothing; the edit transcript label is M.
  2. If the elements are not equal there are 3 possible ways to proceed:
  3. We can substitute one for the other; the edit transcript label is S.
  4. We can insert a slot for a new character; the edit transcript label is I.
  5. We can delete a character; the edit transcript label is D.

The process continues until elements of the two sequences are exhausted. Total cost of the transformation is obtained by adding up the costs of the edits at each step, using this cost table:

actionsymbol    cost
matchM0
deleteD1
insertI1
substitute    Ssub_cost = 1, 2, or ? (1 or 2 is typical. The choice of 2 here defines the Levenshtein Distance.)

The edit distance is the least cost sequence of edit actions transforming s to t. The problem of finding the edit distance and an optimal edit transcript (ED) is very similar to the longest common subsequence problem (LCS). ED has some properties that make it more useful however. The first is that edit distance satisfies the mathematical properties required for a metric:

Definition. Given a space X, a mapping d : X×X→[0,∞) is a metric iff the following three conditions hold:

1. d(x,x) = 0 for all x∈X // distance from a point to itself is 0
2. d(x,y) = d(y,x) for all x,y∈X // symmetry - distance from x to y is the same as the distance from y to x
3. d(x,z) ≤ d(x,y) + d(y,z) for all x,y,z∈X    // triangle inequality - it's shorter to travel in a straight line than to take a detour

Proposition 2. Edit distance is a metric on the space of sequences.

Proof. Suppose that x, y, and z are sequences and denote by ed the edit distance mapping. Since no edits are required to transform x to itself, ed(x,x) = 0 (axiom 1). If an optimal sequence of edits transforms x to y, then by replacing I with D and D with I (noting that I and D have equal cost) we obtain an equal-cost sequence of edits transforming y back to x. Moreover, this latter edit transcript must be optimal, otherwise we could "flip" a less costly sequence of edits into a less costly version of the original, which is optimal. Therefore ed(x,y) = ed(y,x) (axiom 2).

Suppose that Txy, Tyz, Txz are optimal edit transcripts taking x to y, y to z, and x to z, respectively. If Txy+Tyz provides a less costly way to edit x to z then Txz must have been non-optimal because we can replace it with Txy+Tyz. Therefore

ed(x,z) = cost(Txz) ≤ cost(Txy+Tyz) = cost(Txy) + cost(Tyz) = ed(x,y) + ed(y,z)

which verifies axiom 3. ∎

Corollary. Edit distance satisfies the Bellman Optimal Substructure Property.

With all that background, let's move toward develop a dynamic programming algorithm. Begin as in the LCS problem with two sequences

s = s0, s1,s2, ... , sm-1
t = t0, t1,t2, ... , tn-1

We want to calculate the edit distance ed(s,t) using a bottom-up Dynamic Programming approach. Let

D(i,j) = edit distance from the partial sequence s[0..i-1] = s[0..i) to the partial sequence t[0..j-1] = t[0..j)

We can initialize the calculation by observing that:

D(0,0) = 0 // zero edits are required to move the empty sequence to itself
D(i,0) = i // i deletes take s[0..i) to the empty sequence, and
D(0,j) = j // j inserts take the empty sequence to t[0..j)

and then proceed to populate the entire matrix using a recurrence:

D(i,j) = Min { D(i-1,j-1) + δ , D(i-1,j) + 1 , D(i,j-1) + 1 }
for i = 1 .. m and j = 1 .. n.

where

δ = 0 if si = tj else δ = sub_cost

In words, D(i,j) is the minimum of three previously calculated values, representing three kinds of edits:

  1. The edit distance taking s[0..i-1) to t[0..j-1) plus either 0 (in the case of a match) or sub_cost (in case of substitution).
  2. The edit distance taking s[0..i-1) to t[0..j) plus 1 for insertion at si-1
  3. The edit distance taking s[0..i) to t[0..j-1) plus 1 for deletion of si-1

Example 1. ED( ABCDE , ABDDD )

                    action         cost
     s: ABCDE
                    sub at s[2]       2
        ABDDE
                    sub at s[3]       2
     t: ABDDD

     optimal alignment:

        s = ABCDE
            || | 
        t = ABDDD

     ed: 2 + 2 = 4

Example 2. ED( ABCDE , ABDDDE )

                    action         cost
     s: ABCDE
                    ins at s[2]       1
        AB-CDE
                    sub at s[3]       2
     t: ABDDDE

     optimal alignment:
        s = AB-CDE
            ||  ||
        t = ABDDDE

     ed: 1 + 2 = 3

In addition to calculating the edit distance, we want detailed information comparing the two sequences: an edit transcript that provides a sequence of edits transforming s to t and an optimal alignment that shows the two strings (with gaps at insert locations) aligned according to the returned optimal edit transcript. The optimal alignment can be obtained by post-processing using the edit transcript.

The optimal edit transcript is obtained by backtrace of the algorithm itself. One way to accomplish such a backtrace is to keep an ancillary matrix of "parent" directions for each value of D(i,j). The "parent" direction is one of three possibilities, depending on which of the three values is used in the Min operation:

D(i,j) = Min { D(i-1,j-1) + δ , D(i-1,j) + 1 , D(i,j-1) + 1 }

  1. If D(i-1,j-1) + δ is used, the "parent" direction is "Diagonal" and the corresponding edit operation is either "Match" or "Substitute".
  2. If D(i-1,j) + 1 is used, the "parent" direction is "Up" and the corresponding edit operation is "Delete".
  3. If D(i,j-1) + 1 is used, the "parent" direction is "Left" and the corresponding edit operation is "Insert".

Using D,U,L for "Diagonal", "Up", and "Left" the auxilliary parent matrix P provides signposts to construct an optimal path from D(m,n) to D(0,0).

The edit process is done from left to right in s and t and corresponds to a path from D(0,0) to D(m,n) in the distance matrix. An optimal edit is obtained by reversing the optimal parent path. The following table summarizes the processes:

Parent DirectionEdit DirectionEdit Action s→tEdit Action t→s
Up ↑Down ↓DeleteInsert
Left ←Right →InsertDelete
Diag ↖Diag ↘Match | SubstituteMatch | Substitute

Example 3. ED( ABCDE , ABDDDDDDE ). The matrices D and P calculate to:


     ---------------------          ---------------------
    | 0 1 2 3 4 5 6 7 8 9 |        | - L L L L L L L L L |
    | 1 0 1 2 3 4 5 6 7 8 |        | U D L L L L L L L L |
D = | 2 1 0 1 2 3 4 5 6 7 |    P = | U U D L L L L L L L |
    | 3 2 1 2 3 4 5 6 7 8 |        | U U U D D D D D D D |
    | 4 3 2 1 2 3 4 5 6 7 |        | U U U D D D D D D L |
    | 5 4 3 2 3 4 5 6 7 6 |        | U U U U D D D D D D |
     ---------------------          ---------------------

The backtrace used to construct an optimal st edit transcript is highlighted. The results from the algorithm are:


  Edit Distance:     6 // Levenshtein - substitution cost = 2
                  s: ABCDE
   s > t transcript: MMIIIISMM
   t > s transcript: MMDDDDSMM
                  t: ABDDDDDDE
  optimal alignment:
   s = AB----CDE
       ||     ||
   t = ABDDDDDDE

Example 4. ED( CCGATGATCATTGCCAGTCCACTTGTGAGAACGACAGCGACTCCAGC , CCGATGACTTTTGCAGCTCCACTTTTGGTCCAGC ). The matrices D and P calculate to:


 -----------------------------------------------------------------------------------------------------------
|  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34  |
|  1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33  |
|  2  1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32  |
|  3  2  1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31  |
|  4  3  2  1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30  |
|  5  4  3  2  1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29  |
|  6  5  4  3  2  1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28  |
|  7  6  5  4  3  2  1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27  |
|  8  7  6  5  4  3  2  1  1  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26  |
|  9  8  7  6  5  4  3  2  1  2  2  3  4  5  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25  |
| 10  9  8  7  6  5  4  3  2  2  3  3  4  5  6  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  |
| 11 10  9  8  7  6  5  4  3  2  2  3  3  4  5  6  6  7  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  |
| 12 11 10  9  8  7  6  5  4  3  2  2  3  4  5  6  7  7  7  8  9 10 11 11 12 13 14 15 16 17 18 19 20 21 22  |
| 13 12 11 10  9  8  7  6  5  4  3  3  3  3  4  5  6  7  8  8  9 10 11 12 12 13 14 14 15 16 17 18 19 20 21  |
| 14 13 12 11 10  9  8  7  6  5  4  4  4  4  3  4  5  6  7  8  8  9 10 11 12 13 14 15 15 16 16 17 18 19 20  |
| 15 14 13 12 11 10  9  8  7  6  5  5  5  5  4  4  5  5  6  7  8  9  9 10 11 12 13 14 15 16 16 16 17 18 19  |
| 16 15 14 13 12 11 10  9  8  7  6  6  6  6  5  4  5  6  6  7  8  8  9 10 11 12 13 14 15 16 17 17 16 17 18  |
| 17 16 15 14 13 12 11 10  9  8  7  7  7  6  6  5  4  5  6  7  8  9  9 10 11 12 13 13 14 15 16 17 17 16 17  |
| 18 17 16 15 14 13 12 11 10  9  8  7  7  7  7  6  5  5  5  6  7  8  9  9 10 11 12 13 14 14 15 16 17 17 17  |
| 19 18 17 16 15 14 13 12 11 10  9  8  8  8  7  7  6  5  6  5  6  7  8  9 10 11 12 13 14 15 14 15 16 17 17  |
| 20 19 18 17 16 15 14 13 12 11 10  9  9  9  8  8  7  6  6  6  5  6  7  8  9 10 11 12 13 14 15 14 15 16 17  |
| 21 20 19 18 17 16 15 14 13 12 11 10 10 10  9  8  8  7  7  7  6  5  6  7  8  9 10 11 12 13 14 15 14 15 16  |
| 22 21 20 19 18 17 16 15 14 13 12 11 11 11 10  9  9  8  8  7  7  6  5  6  7  8  9 10 11 12 13 14 15 15 15  |
| 23 22 21 20 19 18 17 16 15 14 13 12 11 12 11 10 10  9  8  8  8  7  6  5  6  7  8  9 10 11 12 13 14 15 16  |
| 24 23 22 21 20 19 18 17 16 15 14 13 12 12 12 11 11 10  9  9  9  8  7  6  5  6  7  8  9 10 11 12 13 14 15  |
| 25 24 23 22 21 20 19 18 17 16 15 14 13 12 13 12 11 11 10 10 10  9  8  7  6  6  7  7  8  9 10 11 12 13 14  |
| 26 25 24 23 22 21 20 19 18 17 16 15 14 13 13 13 12 12 11 11 11 10  9  8  7  6  6  7  8  8  9 10 11 12 13  |
| 27 26 25 24 23 22 21 20 19 18 17 16 15 14 14 14 13 13 12 12 12 11 10  9  8  7  7  6  7  8  9 10 11 11 12  |
| 28 27 26 25 24 23 22 21 20 19 18 17 16 15 15 14 14 14 13 13 13 12 11 10  9  8  8  7  7  8  9 10 10 11 12  |
| 29 28 27 26 25 24 23 22 21 20 19 18 17 16 16 15 14 15 14 14 14 13 12 11 10  9  9  8  7  8  9 10 11 10 11  |
| 30 29 28 27 26 25 24 23 22 21 20 19 18 17 17 16 15 15 15 15 15 14 13 12 11 10 10  9  8  8  9 10 10 11 11  |
| 31 30 29 28 27 26 25 24 23 22 21 20 19 18 18 17 16 16 16 16 16 15 14 13 12 11 11 10  9  9  9 10 10 11 12  |
| 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 18 17 16 17 16 16 16 15 14 13 12 12 11 10 10  9  9 10 11 11  |
| 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 19 18 17 17 17 17 17 16 15 14 13 13 12 11 11 10 10 10 10 11  |
| 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 19 18 18 18 18 17 17 16 15 14 14 13 12 12 11 11 10 11 11  |
| 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 20 19 19 18 18 18 17 17 16 15 15 14 13 13 12 11 11 11 11  |
| 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 21 20 20 19 19 18 18 18 17 16 16 15 14 14 13 12 11 12 12  |
| 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 21 21 20 20 19 19 19 18 17 17 16 15 15 14 13 12 11 12  |
| 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 22 21 20 20 19 20 19 18 18 17 16 16 15 14 13 12 11  |
| 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 22 22 21 21 20 20 20 19 19 18 17 17 16 15 14 13 12  |
| 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 23 23 22 21 21 21 21 20 20 19 18 18 17 16 15 14 13  |
| 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 24 23 23 22 21 22 22 21 21 20 19 19 18 17 16 15 14  |
| 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 24 24 23 22 21 22 22 21 21 20 19 19 18 17 16 15  |
| 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 24 24 23 22 22 23 22 22 21 20 19 19 18 17 16  |
| 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 25 24 23 23 23 23 23 22 21 20 19 19 18 17  |
| 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 25 24 24 24 24 24 23 22 21 20 19 19 18  |
| 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 25 25 25 25 25 24 24 23 22 21 20 19 19  |
| 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 26 26 26 26 25 25 24 23 22 21 20 19  |
 -----------------------------------------------------------------------------------------------------------

and


 -----------------------------------------------------------------------
| - L L L L L L L L L L L L L L L L L L L L L L L L L L L L L L L L L L |
| U D D L L L L L D L L L L L D L L D L D D L D L L L L L L L D D L L D |
| U D D L L L L L D L L L L L D L L D L D D L D L L L L L L L D D L L D |
| U U U D L L D L L L L L L D L L D L L L L L L L L L L D D L L L L D L |
| U U U U D L L D L L L L L L L D L L L L L D L L L L L L L L L L D L L |
| U U U U U D L L L D D D D L L L L L D L L L L D D D D L L D L L L L L |
| U U U D U U D L L L L L L D L L D L L L L L L L L L L D D L L L L D L |
| U U U U D U U D L L L L L L L D L L L L L D L L L L L L L L L L D L L |
| U U U U U D U U D D D D D L L L L L D L L L L D D D D L L D L L L L L |
| U D D U U U U U D D D D D D D L L D L D D L D L L L L L L L D D L L D |
| U U U U D U U D U D D D D D D D L L L L L D L L L L L L L L L L D L L |
| U U U U U D U U U D D D D L L U D D D L L L L D D D D L L D L L L L L |
| U U U U U D U U U D D D D D D D D D D D D D D D D D D L L D L L L L L |
| U U U D U U D U U U U D D D L L D L D D D D D D D D D D D L L L L D L |
| U D D U U U U U D U U D D D D L L D L D D L D L L D D D D D D D L L D |
| U D D U U U U U D U U D D D D D D D L D D D D L L L L L L D D D L L D |
| U U U U D U U D U U U D D D U D D D D D D D L D D D D D D D D D D L L |
| U U U D U U D U U U U D D D U U D L L D D D D D D D D D D L L L U D L |
| U U U U U D U U U D D D D U D U U D D L L L L D D D D L D D L L L U D |
| U D D U U U U U D U U U D D D U U D D D D L D L D D D D D D D D L L D |
| U D D U U U U U D U U U D D D D U D D D D L D L L L L L L L D D L L D |
| U U U U D U U D U U U U D D U D U U D D U D L L L L L L L L L U D L L |
| U D D U U U U U D U U U D D D U D D D D D U D L L L L L L L D D U D D |
| U U U U U D U U U D D D D D U U D U D U D U U D D D D L L D L L L L D |
| U U U U U D U U U D D D D D U U D U D D D U U D D D D L L D L L L L L |
| U U U D U U D U U U U U U D D U D U U D D U U U U D D D D L L L L D L |
| U U U U U D U U U D D D D U D U U D D D D U U D D D D L D D L L L L L |
| U U U D U U D U U U U U U D D D D D U D D U U U U U D D D L D D D D L |
| U U U U D U U D U U U U U U D D U D U D D D U U U U D U D D D D D L D |
| U U U D U U D U U U U U U D D U D D U D D U U U U U D D D D D D D D L |
| U U U U D U U D U U U U U U D D U D U D D D U U U U D U U D D D D U D |
| U U U U D U U D U U U U U U D D U D D D D D U U U U D U U D D D D D D |
| U D D U U U U U D U U U U U D U U D D D D U D U U U D U U D D D L D D |
| U U U D U U D U U U U U U D U D D U D U D D U U U U D D D D U D D D L |
| U U U U D U U D U U U U U U U D U U D D D D U U U U D U U D U D D D D |
| U D D U U U U U D U U U U U D U D D D D D U D U U U D U U D D D U D D |
| U U U U D U U D U U U U U U U D D U D U D D U D U U D U U D U U D D D |
| U U U D U U D U U U U U U D U U D U D U D U D D U U D D D D U U U D L |
| U D D U U U U U D U U U U U D U U D D D D U D D U U D U U D D D U U D |
| U U U D U U D U U U U U U D U U D U D U U D U D U U D D D D U U U D U |
| U U U U D U U D U U U U U U U D U U D D U D U D D U D U U D U U D U U |
| U D D U U U U U D U U U U U D U U D D D D U D D D U D U U D D D U U D |
| U U U U U D U U U D D D D U U U U U D U D U U D D D D U U D U U U U U |
| U D D U U U U U D U U U U U D U U D U D D U D U D D U D U U D D U U D |
| U D D U U U U U D U U U U U D U U D U D D D D U D D U D U U D D U U D |
| U U U U D U U D U U U U U U U D U U U U U D U U D D D D U U U U D U U |
| U U U D U U D U U U U U U D U U D U U U U U D U D D D D D U U U U D U |
| U D D U U U U U D U U U U U D U U D U D D U D D D D D U D U D D U U D |
 -----------------------------------------------------------------------

respectively. Again the backtrace used to construct an optimal st edit transcript is highlighted. The results from the algorithm are:

  Edit Distance:     19 // substitution cost = 1
                  s: CCGATGATCATTGCCAGTCCACTTGTGAGAACGACAGCGACTCCAGC
   s > t transcript: MMMMMMMDMSMMSSMMMIMMMMMMMDMDDDDDDDDDSMDMDDMMMMMM
   t > s transcript: MMMMMMMIMSMMSSMMMDMMMMMMMIMIIIIIIIIISMIMIIMMMMMM
                  t: CCGATGACTTTTGCAGCTCCACTTTTGGTCCAGC
  optimal alignment:
   s = CCGATGATCATTGCCAG-TCCACTTGTGAGAACGACAGCGACTCCAGC
       ||||||| | ||  ||| ||||||| |          | |  ||||||
   t = CCGATGA-CTTTTGCAGCTCCACTT-T---------TG-G--TCCAGC

Proposition 3. The Edit Distance calculator ED(m,n) has runtime Θ(m×n) and runspace +Θ(m×n). The backtrace subroutine has runtime Θ(m+n) and operates in +Θ(m×n) space.

Exercise 2. Show that:

  1. Every non-increasing path from P[0][0] to P[m][n] represents an edit transcript taking s to t.
  2. The length L of any edit transcript taking s to t satisfies Max(m,n) ≤ Lm + n.

Exercise 3. Locate the directory LIB/notes_support in your course library and copy/execute the demonstration program:

ED_i.x   # Edit Distance

(As usual, the "_i" indicates Linux/Intel executable. Entering the command without arguments prints some documentation to screen.)

Using the algorithms developed above, devise and implement an ED calculator. Behavior should be indentical to that of ED_i.x. ∎