Eugene Myers' Diff Algorithm: Finding the Longest Common Subsequence of "A" and "B" -
i've been reviewing eugene myers' diff algorithm paper. algorithm implemented in popular diff
program.
on page 12 of paper, presents pseudo-code algorithm find longest common sub-sequence of a
, b
:
lcs(a, n, b, m) if n > 0 , m > 0 find middle snake , length of optimal path , b. suppose (x, y) (u, v). if d > 1 lcs(a[1..x], x, b[1..y], y) output a[x+1..u] lcs(a[u+1..n], n-u, b[v+1..m], m-v) else if m > n output a[1..n]. else output b[1..m].
suppose = "a" , b = "b". in case, n = 1 , m = 1. middle snake (x, y) = (0, 1) , (u, v) = (0, 1) because there no diagonals. in case d = 1 because algorithm has taken 1 step.
the algorithm says thing in scenario output b[1..m]
, equal "b", because n > 0, m > 0, d = 1, , m = n. seems wrong, because there no common sub-sequence between "a" , "b". paper's commentary "if d <= 1 b obtained either deleting or inserting @ 1 symbol" incorrect because "a" must removed , "b" added.
what misinterpreting here?
you misunderstanding definition of d-path , snake.
from page 4:
let d-path path starting @ (0,0) has d non-diagonal edges. 0-path must consist solely of diagonal edges. simple induction, follows d-path must consist of (d − 1)-path followed non-diagonal edge , a possibly empty sequence of diagonal edges called snake
so, in example = "a" , b = "b", optimal path 2-path (one horizontal , 1 vertical), , middle snake empty string. know inspection lcs empty string, we'd show algorithm working prove it.
first of need find middle snake. if follow algorithm find middle snake on page 11, see length of shortest edit script 2 , (x,y) = (u,v) = (1,0) or (0,1). in other words, empty snake in middle of path.
the algorithm pseudocode has non-obvious notational conventions:
- a[m..n] empty string if n < m.
- in recursive calls lcs, first call uses ceiling(d/2) d , second call uses floor(d/2). made more clear in text above algorithm on page 12.
so, in example case, assume first call lcs finds middle snake (x,y) = (u,v) = (1,0), since d=2, result expansion of:
lcs(a[1..1], 1, b[1..0], 0) // output nothing since m = 0 output a[2..1] // output nothing since empty string. lcs(a[2..1], 0, b[1..1], 1) // output nothing since n = 0
this correct since these strings have no common sub-sequence.
Comments
Post a Comment