Notes on Levenshtein Distance
Saturday, March 27, 2021
I was recently working on a project that involved migrating several thousand articles from from one CMS system to another. The article migration involved some manual work from a team of editors and as a result some of the articles had small naming errors after the migration. I needed to make sure that all of the articles were properly named in the new CMS system.
Since there were too many articles to manually inspect all of their names I decided to automate some of the work. It was easy enough to compare the names of the old articles with the names of the new articles to see which names where wrong, but then I needed a way to match the incorrect names to the correct ones. Since some of the naming errors were at the start of the article it wasn’t sufficient to just compare them alphabetically. I decided to use Levenshtein distance to find the articles that were similarly named.
What is Levenshtein Distance
The Levenshtein distance between two strings is the number of single character edits that need to be performed in order to transform one string into the other string.
For example, the strings "house"
and "mouse"
have a Levenshtein distance of
just one because only the first character is changed. Likewise, "house"
and
"houses"
also has a distance of one because only a single edit—adding a
character—needs to be performed to transform one into the other. The distance
between "apple"
and "banana"
is five.
Interactive Example
A Naïve Implementation
It turns out that the naïve implementation is a pretty straightforward recursive function. Here’s a simple implementation in JavaScript:
// Helper function to drop the first character from a string.
const tail = (str) => str.slice(1, str.length);
const naiveLevenshteinDistance = (a, b) => {
if (b.length === 0) {
return a.length;
}
if (a.length === 0) {
return b.length;
}
if (a[0] === b[0]) {
return naiveLevenshteinDistance(tail(a), tail(b));
}
return (
1 +
Math.min(
naiveLevenshteinDistance(tail(a), b),
naiveLevenshteinDistance(a, tail(b)),
naiveLevenshteinDistance(tail(a), tail(b))
)
);
};
Unfortunately this simple algorithm is not optimal since it compares many of the
same prefixes multiple times. However, it does nicely illustrate how the
algorithm works by recursively comparing subsequences of the strings. So for
example when comparing "cat"
and "map"
it would compare the strings like
this:
- First it sees that
"c"
and"m"
are different, so it knows that the distance is at least1
. - Next it compares the following pairs of substrings:
"at"
and"map"
"cat"
and"ap"
"at"
and"ap"
- Now its going to run this again three times and comparing the results of
those comparisons:
"at"
and"map"
is going to compare:"t"
and"map"
"at"
and"ap"
"t"
and"ap"
"cat"
and"ap"
is going to compare:"at"
and"ap"
"cat"
and"p"
"at"
and"p"
"at"
and"ap"
is going to compare:"t"
and"ap"
"at"
and"p"
"t"
and"p"
I’m not going to run through the entire call graph here, but it should be
obvious that not only are there a lot of combinations to compare, but there is
also a fair bit of redundancy. For instance it can be seen that "at"
and
"ap"
are going to be compared multiple times.
An Implementation with Caching
In order to make this function finish in a reasonable time on strings of any significant length a caching strategy needs to be implemented. A standard technique for this is to build up a matrix of the distances between all the prefixes.
const levenshteinDistance = (a, b) => {
const prefixes = [];
for (let i = 0; i <= a.length; i += 1) {
prefixes[i] = [i];
}
for (let j = 1; j <= b.length; j += 1) {
prefixes[0][j] = j;
}
for (let j = 1; j <= b.length; j += 1) {
for (let i = 1; i <= a.length; i += 1) {
prefixes[i][j] = Math.min(
prefixes[i - 1][j] + 1,
prefixes[i][j - 1] + 1,
prefixes[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
);
}
}
return prefixes[a.length][b.length];
};
This version constructs a matrix of all the prefixes iteratively and is therefore able to avoid redoing that work. This isn’t the most optimal solution: it’s possible to improve memory usage by only using two rows of the table since older rows don’t need to be revisited. However, this version still has the same big-O complexity so I won’t go into the details of the other implementation here.
Using Levenshtein Distance to Find Title Mismatches
Now with a way to compare strings by how similar they are I was set to find potentially mis-transcribed article titles. I had a list of the titles from the original CMS and a list of the titles from the new CMS.
I’m going to start by eliminating correct titles from the lists and producing
two new lists of titles: a list of misnamed titles, which I’ll call
incorrectTitles
and a list of titles that are absent in the new titles called
missingTitles
. By eliminating the correctly matched titles up front it will
reduce the total number of titles for which Levenshtein distance must be
calculated.
const incorrectTitles = newTitles.filter((title) => !oldTitles.includes(title));
const missingTitles = oldTitles.filter((title) => !newTitles.includes(title));
Given these lists it is a simple matter to calculate the Levenshtein distances between each incorrect title and all of the missing titles. Then sort by that distance and the best match should be near the top of the list.
const nearMisses = (oldTitles, newTitles) => {
const incorrectTitles = newTitles.filter(
(title) => !oldTitles.includes(title)
);
const missingTitles = oldTitles.filter((title) => !newTitles.includes(title));
return incorrectTitles.map((incorrectTitle) => [
incorrectTitle,
missingTitles
.map((missingTitle) => [
missingTitle,
levenshteinDistance(incorrectTitle, missingTitle),
])
.sort(([, a], [, b]) => a - b)
.map(([title]) => title),
]);
};
We can run this on some sample data and then look at the output.
const oldArticles = [
"apples taste good",
"banana splits are delicious",
"cherries are sweet",
];
const newArticles = [
"apple taste good",
"banana splits are delicious",
"cherrys are sweet",
];
nearMisses(oldArticles, newArticles);
// [
// ['apple taste good', ['apples taste good', 'cherries are sweet']],
// ['cherrys are sweet', ['cherries are sweet', 'apples taste good']],
// ]
This produces some correct output, but a little extra work can improve the formatting:
const printNearMisses = (oldTitles, newTitles) => {
for (const [wrong, [bestMatch, ...rest]] of nearMisses(
oldTitles,
newTitles
)) {
console.log(`Incorrect title: ${wrong}`);
console.log(`\tBest match: ${bestMatch}`);
console.log(`\tOther matches: ${rest.join(", ")}`);
}
};
printNearMisses(oldArticles, newArticles);
// Prints:
// Incorrect title: apple taste good
// Best match: apples taste good
// Other matches: cherries are sweet
// Incorrect title: cherrys are sweet
// Best match: cherries are sweet
// Other matches: apples taste good
As you can see, this has given us a nice list of articles that were misnamed and the most likely matches.