Edit Distance

When a spell checker encounters a possible misspelling, it looks in its dictionary for other words that are close by. What is the appropriate notion of closeness in this case?
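A natural measure of the distance between two strings is the extent to which they can be aligned, or matched up. Technically, an alignment is simply a way of writing the strings one above the other, where "_" marks a gap and any number of gaps can be placed in either string. The cost of an alignment is the number of columns in which the letters differ. For instance, here are two possible alignments of SNOWY and SUNNY: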

S _ N O W Y
S U N N _ Y
Cost: 3

_ S N O W _ Y
S U N _ _ N Y
Cost: 5

The edit distance between two strings is the cost of their best possible alignment. Do you see that there is no better alignment of SNOWY and SUNNY than the one shown here with a cost of 3?
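The two costs quoted above can be checked mechanically. The following is a minimal sketch, assuming the cost of an alignment is the number of columns in which the two rows differ and that "_" marks a gap; the function name `alignment_cost` is chosen here for illustration only.

```python
def alignment_cost(top: str, bottom: str) -> int:
    """Count the columns in which two aligned rows differ.

    Both rows must have the same length; "_" marks a gap, so a gap
    facing a letter counts as a differing column.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for a, b in zip(top, bottom) if a != b)


print(alignment_cost("S_NOWY", "SUNN_Y"))    # 3
print(alignment_cost("_SNOW_Y", "SUN__NY"))  # 5
```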

Edit distance is so named because it can also be thought of as the minimum number of edits (insertions, deletions, and substitutions of characters) needed to transform the first string into the second. For instance, the first alignment shown above corresponds to three edits: insert U, substitute O → N, and delete W.
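To make the correspondence between alignments and edits concrete, here is a small sketch that reads an alignment column by column and lists the edit each differing column stands for. The helper name `edits_from_alignment` and the wording of the edit labels are assumptions made for this example.

```python
def edits_from_alignment(top: str, bottom: str) -> list:
    """List the edits that turn the top string into the bottom one.

    A gap ("_") on top means a letter is inserted, a gap on the bottom
    means a letter is deleted, and two different letters mean a substitution.
    """
    edits = []
    for a, b in zip(top, bottom):
        if a == b:
            continue  # matching column: no edit needed
        if a == "_":
            edits.append(f"insert {b}")
        elif b == "_":
            edits.append(f"delete {a}")
        else:
            edits.append(f"substitute {a} -> {b}")
    return edits


print(edits_from_alignment("S_NOWY", "SUNN_Y"))
# ['insert U', 'substitute O -> N', 'delete W']
```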

When solving a problem by dynamic programming, the most crucial question is: what are the subproblems? Once they are chosen, it is an easy matter to write down the algorithm: iteratively solve one subproblem after the other, in order of increasing size.
Our goal is to find the edit distance between two strings x[1..m] and y[1..n].

As a running example, take
E X P O N E N T I A L
P O L Y N O M I A L

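We have to find the minimum number of edits needed to convert one string into the other. A good subproblem is the edit distance between some prefix of the first string, x[1..i], and some prefix of the second, y[1..j]. Call this subproblem E(i, j); our final objective is then to compute E(m, n).
For this to work, we need to somehow express E(i, j) in terms of smaller subproblems.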
Let’s see: what do we know about the best alignment between x[1..i] and y[1..j]? Well, its rightmost column can only be one of three things:
x[i]         _          x[i]
 _     or   y[j]   or   y[j]

The first case incurs a cost of 1 for this particular column, and it remains to align x[1..i-1] with y[1..j]. But this is exactly the subproblem E(i-1, j)! We seem to be getting somewhere. In the second case, also with cost 1, we still need to align x[1..i] with y[1..j-1]. This is again another subproblem, E(i, j-1). And in the final case, which costs either 1 (if x[i] != y[j]) or 0 (if x[i] = y[j]), what’s left is the subproblem E(i-1, j-1). In short, we have expressed E(i, j) in terms of three smaller subproblems E(i-1, j), E(i, j-1), and E(i-1, j-1). We have no idea which of them is the right one, so we need to try them all and pick the best:
E(i, j) = min{1 + E(i-1, j), 1 + E(i, j-1), diff(i, j) + E(i-1, j-1)},
where for convenience diff(i, j) is defined to be 0 if x[i] = y[j] and 1 otherwise.
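The recurrence can be transcribed almost literally into a memoized recursive function. This is a minimal sketch rather than a definitive implementation: it assumes the base cases E(i, 0) = i and E(0, j) = j (aligning a prefix with the empty string takes one edit per letter), and the names `edit_distance` and `E` are chosen for this example. Python strings are 0-indexed, so x[i] in the text becomes `x[i - 1]` in the code.

```python
from functools import lru_cache


def edit_distance(x: str, y: str) -> int:
    """Edit distance between x and y via the three-way recurrence."""

    @lru_cache(maxsize=None)
    def E(i: int, j: int) -> int:
        # Base cases: one of the prefixes is empty.
        if i == 0:
            return j
        if j == 0:
            return i
        # diff(i, j) is 0 when the last letters match, 1 otherwise.
        diff = 0 if x[i - 1] == y[j - 1] else 1
        return min(1 + E(i - 1, j),          # x[i] above a gap
                   1 + E(i, j - 1),          # gap above y[j]
                   diff + E(i - 1, j - 1))   # x[i] above y[j]

    return E(len(x), len(y))


print(edit_distance("SNOWY", "SUNNY"))             # 3
print(edit_distance("EXPONENTIAL", "POLYNOMIAL"))  # 6
```

Memoization ensures each pair (i, j) is solved only once. For very long strings the recursion depth can become a problem, which is one reason to fill in the subproblems iteratively instead.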

For instance, in computing the edit distance between EXPONENTIAL and POLYNOMIAL, subproblem E(4, 3) corresponds to the prefixes EXPO and POL. The rightmost column of their best alignment must be one of the following:
O        _        O
_   or   L   or   L

Thus, E(4, 3) = min{1 + E(3, 3), 1 + E(4, 2), 1 + E(3, 2)}.

So, the pseudocode:
Here, m is the number of letters in EXPONENTIAL and n is the number of letters in POLYNOMIAL.
[code]
for i = 0, 1, 2, ..., m:
    E(i, 0) = i
for j = 1, 2, ..., n:
    E(0, j) = j
for i = 1, 2, ..., m:
    for j = 1, 2, ..., n:
        E(i, j) = min{E(i-1, j) + 1, E(i, j-1) + 1, E(i-1, j-1) + diff(i, j)}
return E(m, n)
[/code]
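For readers who want something runnable, the pseudocode translates into Python roughly as follows. This is a sketch under a few assumptions: the table is stored as a list of lists, the function name `edit_distance_table` is illustrative, and the whole table is returned (rather than just the final entry) so that intermediate subproblems can be inspected.

```python
def edit_distance_table(x: str, y: str) -> list:
    """Fill in the (m+1) x (n+1) table of subproblems E[i][j]."""
    m, n = len(x), len(y)
    E = [[0] * (n + 1) for _ in range(m + 1)]

    # Base cases: aligning a prefix with the empty string.
    for i in range(m + 1):
        E[i][0] = i
    for j in range(1, n + 1):
        E[0][j] = j

    # Solve the remaining subproblems row by row.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diff = 0 if x[i - 1] == y[j - 1] else 1
            E[i][j] = min(E[i - 1][j] + 1,
                          E[i][j - 1] + 1,
                          E[i - 1][j - 1] + diff)
    return E


table = edit_distance_table("EXPONENTIAL", "POLYNOMIAL")
print(table[11][10])  # 6, the edit distance of the full strings
print(table[4][3])    # 3, the subproblem for the prefixes EXPO and POL
```

Each of the (m+1)(n+1) entries takes constant time to fill, so the running time is proportional to m times n.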

And in our example, the edit distance turns out to be 6:
E X P O N E N _ T I A L
_ _ P O L Y N O M I A L
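An alignment like this one can be recovered by walking backwards through the table of subproblems, at each step choosing a move whose cost explains the current entry. The sketch below is one possible way to do this; the function name `best_alignment` and the tie-breaking order are assumptions, and it presumes the input strings do not themselves contain "_". Since several alignments can share the minimum cost, the one it returns may differ from the one shown above while having the same cost.

```python
def best_alignment(x: str, y: str) -> tuple:
    """Return one minimum-cost alignment of x and y as two rows."""
    m, n = len(x), len(y)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        E[i][0] = i
    for j in range(1, n + 1):
        E[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diff = 0 if x[i - 1] == y[j - 1] else 1
            E[i][j] = min(E[i - 1][j] + 1, E[i][j - 1] + 1,
                          E[i - 1][j - 1] + diff)

    # Trace back from E[m][n] to E[0][0], building columns right to left.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        diff = 1 if i == 0 or j == 0 or x[i - 1] != y[j - 1] else 0
        if i > 0 and j > 0 and E[i][j] == E[i - 1][j - 1] + diff:
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and E[i][j] == E[i - 1][j] + 1:
            top.append(x[i - 1]); bottom.append("_")
            i -= 1
        else:
            top.append("_"); bottom.append(y[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


top, bottom = best_alignment("EXPONENTIAL", "POLYNOMIAL")
print(" ".join(top))
print(" ".join(bottom))
```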
