Longest Common Subsequence

X = <x1, x2, x3, ...,xm> is a sequence

Z = <z1, z2, z3, ...,zk> is a subsequence of X if z1 = xi1, z2= xi2, ...zk = xik  0 < i1 < i2... < ik  £  m

Given sequences X and Y, Z is a common subsequence if Z is a subsequence to both X and Y. If Z is the longest possible subsequence of both X and Y then it is the longest common subsequence.

X = < A, G, C, G, T, A, G >

Y = < G, T, C, A, G, A >

a common subsequence is < G, C, A > also < G, C, G, A > and < G, T, A, A > since no common subsequence of length 5 exists there are 2 LCG's: < G, C, G, A > and < G, T, A, G >.

Finding the LCS:


Could enumerate all subsequences of X (length m) ® 2mand Y (length n) ® 2n must search for hits and sort by length. This is an exponential algorithm. LCS has an optimal substructure property based on prefixes. If X = <x1, ...,xm> then Xi = <x1, ...,xi> is the ith prefix of X and X0 is empty.

Theorem 16.1 ( Optimal substructure for LCS)

 If X = <x1, ...,xm> and  if Y = <y1, ...,yn> are sequences, let Z = <z1, ...,zk> be some LCS of x and y.

  1. If xm =  yn  then zk = xmand Zk-1 is an LCS of Xm-1 and Yn-1
  2. If xm ¹ yn  then zk ¹xm Þ Z is an LCS of Xm-1 and Y
  3. If xm ¹ yn  then zk ¹ym Þ Z is an LCS of X and Yn-1

Proof:

  1. If zk ¹xm then we could add xm =  yn to Z to get an LCS of length k + 1.   By contradiction it must be that zk = xm =  yn .    |zk-1 | = k - 1 and is an LCS of  Xm-1 and Yn-1 .   It is an LCS, if not then $ W CS of Xm-1 and Yn-1 with | W | > k - 1 and so by appending xm =  yn we get a CS of X  and Y of length greater than k, a contradiction.
  2. If zk ¹xm then z is a cs of Xm-1 and Y. If  $ W a CS with | W | > k, then W would be a CS of X and Y, a contradiction
  3. Proof by reversing x and y

This means that to find the LCS of X and Y:

if xm =  yn find LCS of Xm-1 and Ym-1

If   xm ¹ y

  1. find LCS of Xm-1and Y
  2. find LCS of X and Yn-1

and take the larger of 'a.' or 'b.'. Thus we start with small problem, find LCS and grow our solution:

Let c[i,j] = | W |, W is LCS of Xi and Yj

c[i,j] = 0 if i*j = 0

         = c[i-1,j-1] + 1, if i*j > 0 and xi = yj

         = max (c[i, j-1]. c[i-1,j]) if i,j > 0 and  xi ¹ yj

Could use this to write an exponential algorithm via recursion. However, there are only Q (m n) sub-problems, so we use DP. Store c[i,j] and b[i,j] which points to the optimal sub-problem chosen. c[m,n] contains the length of the LCS and b[m,n] directions used to build it.

LCS-Length ( X, Y )

  1. m ¬ length[X]
  2. n ¬ length[Y]
  3. for i ¬ 1 to m
  4.    do c[i,0] ¬ 0
  5. for j ¬ 0 to n
  6.    do c[0,j] ¬ 0
  7. for i ¬ 1 to m
  8.    do for j ¬ 1 to n
  9.       do if  xi = yj
  10.             then c[i,j] ¬ c[i-1, j-1] + 1
  11.                     b[i,j] ¬ "\"
  12.             else if c[i - 1, j] ³ c[i, j-1]
  13.                then c[i,j] ¬ c[i-1, j]
  14.                        b[i,j] ¬ "Ý"
  15.                else c[i,j] ¬ c[i, j-1]
  16.                       b[i,j] ¬ "¬"
  17. return c and b

Running Time = O (m n) since each table entry takes O (1) time.

Note:

"\" means both the same

"Ý" means c[i - 1, j] ³ c[i, j-1]

"¬" means c[i - 1, j] < c[i, j-1]

The "\" diagonal arrows lengthen the LCS

Print-LCS (b, X, i, j)

  1. If i = 0 or j = 0
  2.    then return
  3. if b[i,j] = "\"
  4.    then Print-LCS (b, X, i-1, j-1)
  5.       print
  6. else if b[i,j] = "Ý"
  7.    then Print-LCS ( b, X, i-1, j)
  8. else Print-LCS (b, X, i, j-1)

This takes O ( m + n ) since either i or j is decremented in the recursion.

Dynamic Programming - 3 of 3