Longest Common Subsequence

X = <x1, x2, x3, ...,xm> is a sequence

Z = <z1, z2, z3, ...,zk> is a subsequence of X if z1 = xi1, z2= xi2, ...zk = xik  0 < i1 < i2... < ik  £  m

Given sequences X and Y, Z is a common subsequence if Z is a subsequence to both X and Y. If Z is the longest possible subsequence of both X and Y then it is the longest common subsequence.

X = < A, G, C, G, T, A, G >

Y = < G, T, C, A, G, A >

a common subsequence is < G, C, A > also < G, C, G, A >, < G, T, A, G > and < G, C, A, G >, since no common subsequence of length 5 exists there are 3 LCS's: < G, C, G, A > , < G, T, A, G > and < G, C, A, G >.

Finding the LCS:


Could enumerate all subsequences of X (length m) ® 2mand Y (length n) ® 2n must search for hits and sort by length. This is an exponential algorithm. LCS has an optimal substructure property based on prefixes. If X = <x1, ...,xm> the Xi = <x1, ...,xi> is the ith prefix of X and X0 is empty.

Theorem 16.1 ( Optimal substructure for LCS)

 If X = <x1, ...,xm> and  If Y = <y1, ...,yn> are sequences and let Z = <z1, ...,zk> be some LCS of X and Y.

  1. If xm =  yn  then zk = xmand Zk-1 is an LCS of Xm-1 and Yn-1

  2. If xm ¹ yn  then zk ¹xm Þ Z is an LCS of Xm-1 and Y

  3. If xm ¹ yn  then zk ¹yn Þ Z is an LCS of X and Y n-1

Proof:

  1. If zk ¹xm then we could add xm =  yn to Z to get an LCS of length k + 1. By contradiction it must be that zk = xm =  yn . |zk-1 | = k - 1 and is an LCS of  Xm-1 and Yn-1 . It is an LCS, if not then $ W CS of Xm-1 and Yn-1 with | W | > k - 1 and so by appending xm =  yn we get a CS of X  and Y of length greater than k, a contradiction.

  2. If zk ¹xm then z is a CS of Xm-1 and Y. If  $ W a CS with | W | > k, then W would be a CS of X and Y, a contradiction

  3. Proof by reversing x and y

This means that to find the LCS of X and Y:

if xm =  yn find LCS of and

If   xm ¹ yn

  1. find LCS of Xm-1and Y

  2. find LCS of X and Yn-1

and take the larger of 'a.' or 'b.'. Thus we start with small problem, find LCS and grow our solution:

Let c[i, j] = | W |, W is LCS of Xi and Yj

c[i, j] = 0                     if i*j = 0

         = c[i-1,j-1] + 1,                 if i*j > 0 and xi = yj

         = max (c[i, j-1]. c[i-1, j])    if i,j > 0 and  xi ¹ yj

Could use this to write an exponential algorithm via recursion. However, there are only Q (m n) sub-problems, so we use DP. Store c[i, j] and b[i, j] which points to the optimal sub-problem chosen. c[m, n] contains the length of the LCS and b[m, n] directions used to build it.

LCS-Length ( X, Y )

  1. m ¬ length[X]

  2. n ¬ length[Y]

  3. for i ¬ 1 to m

  4.    do c[i,0] ¬ 0

  5. for j ¬ 0 to n

  6.    do c[0, j] ¬ 0

  7. for i ¬ 1 to m

  8.    do for j ¬ 1 to n

  9.       do if  xi = yj

  10.             then c[i, j] ¬ c[i-1, j-1] + 1

  11.                     b[i, j] ¬ "\"

  12.             else if c[i - 1, j] ³ c[i, j-1]

  13.                then c[i, j] ¬ c[i-1, j]

  14.                        b[i, j] ¬ "Ý"

  15.                else c[i, j] ¬ c[i, j-1]

  16.                       b[i,j] ¬ "¬"

  17. return c and b

Running Time = O (m n) since each table entry takes O (1) time.

Note:

"\" means both the same

"Ý" means c[i - 1, j] ³ c[i, j-1]

"¬" means c[i - 1, j] < c[i, j-1]

The "\" diagonal arrows lengthen the LCS

Print-LCS (b, X, i, j)

  1. If i = 0 or j = 0

  2.    then return

  3. if b[i,j] = "\"

  4.    then Print-LCS (b, X, i-1, j-1)

  5.       print xi

  6. else if b[i,j] = "Ý"

  7.    then Print-LCS ( b, X, i-1, j)

  8. else Print-LCS (b, X, i, j-1)

This takes O ( m + n ) since either i or j is decremented in the recursion.

Dynamic Programming - 3 of 3