Optimizing trie-based spelling correction algorithms at Constructor.io

At Constructor.io, we take pride in not only serving relevant, typo-tolerant results with ranking learned from user behavior, but also in doing so quickly. This is no mean feat — it’s not unusual for a dataset to have tens of millions of items with an average of 10 words per item, and if results are to update in real-time as the user types, we have maybe 100ms, including network latency, to go from user input to ranked results. This includes performing any spelling corrections, phonetic and morphological analysis, re-ordering, ranking analysis, and all sorts of other bells and whistles that go into providing your users with the results that they are most likely looking for. In this post, I want to talk just about the spelling correction step, and how we took ideas from various existing spelling correction algorithms that did not quite meet our needs to design one that does.

Whatever approach we come up with has to meet some constraints. As mentioned before, it has to be fast. The data structures involved should take up as little memory as possible. Building those data structures should be fast and memory-efficient. If at all possible, it would be best to re-use data structures that we already have.

The trie-based solution

It’s no secret that the standard data structure for predictive text search is a trie [Wiki], and we keep all the individual tokens, or keywords, of a customer dataset in a trie-like structure that’s been tweaked to best suit our needs. A quick refresher: a trie is basically a tree with one character per node. The children of a node containing character X are the characters that can come after X in the corpus, so it’s easy to see what words in your corpus could start with a given prefix.

There’s also a standard, well-known way to build a spell checker using a trie, described for example here. To summarize, you assemble candidate words one character at a time as you traverse the trie in a breadth-first search, throwing out branches when it becomes evident that they differ from the user input too strongly. This approach usually uses the Damerau-Levenshtein distance [Wiki] as a comparison metric, which just counts the number of insertions, deletions, transpositions, and substitutions needed to change one string into another.

And when we were just starting out, that’s exactly what we did. Except instead of Levenshtein distance, we used our own much more advanced string comparison metrics that take into account things like language-specific phonetics, term relevance, keyboard distance, etc. On the face of it, this was perfect. It used no new data structures, runtime memory use was a non-issue, and as long as the datasets we were dealing with had fewer than 10,000 or so unique tokens, the entire process could be done in single milliseconds or faster. However, as the size of our customers’ datasets grew, so did the computation time. When we got our first customer with hundreds of thousands of unique tokens, this was still fast enough, but barely, and it became apparent that we would eventually need to find another solution if we wanted to eventually serve customers with millions of unique tokens.

I wanted to check how this algorithm would have performed on such a large dataset if we had kept using it. Here’s a quick timing check through a python wrapper on an AWS C4 instance:

In [7]: get_spelling_suggestions('optimizashun')
Out[7]: ['optimization']

In [8]: timeit get_spelling_suggestions(u'optimizashun')
10 loops, best of 3: 131 ms per loop

And that’s a relatively fast one.

Moving to precomputation

When runtime computation is too slow, one natural question to ask is how much can be precomputed. Not wanting to reinvent any wheels, we came across an approach called a symmetric-delete dictionary. This is a Damerau-Levenshtein-based approach which begins with the observation that if two strings \(s_1\) and \(s_2\) have Damerau-Levenshtein distance \(D\), then some string \(s’_1\) is equal to some string \(s’_2\), where each \(s’_i\) is derived from \(s_i\) by \(d\) deletions, where \(0\leq d\leq D\). Thus, if you precompute a multimap whose values are strings which lead to their key after \(d\) deletions, the runtime computation of all strings within a certain Damerau Levenshtein distance of a given input becomes very fast.

For us, this comes with a few problems. First, the precomputation step is cumbersome and slow, which translates to longer index update times for our customers. And second, as I mentioned above, our approach to evaluating string comparisons is not based on Damerau distance! Meaning we had to set \(D\) to a sufficiently high number, and then apply our in-house comparisons to winnow down the results. But it gave a noticeable performance improvement over the trie-based solution. However, a length-\(n\) word has \({n\choose D}\) possible distance-\(D\) deletions, so for a dataset with \(N\) tokens, the multimap size grows as \(Nn^D\). Once we got to work with datasets with millions of unique tokens, this approach once again started to become unsustainable: the size of the multimap grew into the gigabytes both on disk and in memory, the precomputation of this multimap alone could take over an hour, and some input strings could have potentially thousands of Damerau-close strings which then had to all be scored and ranked by our own comparison algorithms.

So it was back to the drawing board. Was there a way to combines these two approaches?

The Middle Way

The most computationally expensive part of the trie-based spelling correction, it turns out, is unwrapping the first few characters during the breadth-first search: with short prefixes, there isn’t enough information present to confidently throw out many of the bad branches, and the uncertainty in phonetic interpretation of just a few characters makes our own comparison metrics slower. But what if we could just precompute prefixes of potential spelling corrections for prefixes of user input queries, and use those to bypass the most expensive part of the computation? That is, given a 5-character prefix, we would be precomputing the set of 3-character prefixes for all possible spelling corrections of all possible user inputs that begin with that 5-character prefix.

This approach has a number of bonuses. Unlike the second approach, where one string could affect the values of hundreds of keys in the precomputed multimap (“optimization” has 298 possible distance-1, 2, and 3 deletions), these prefixes could be computed for one value at a time, so the precomputation routine was low on memory use and relatively easy to parallelize. Since we’re only precomputing these values for short prefixes of frequent user-input tokens, the number of values is greatly limited, and does not grow with dataset size or token length: even for our largest customers, this precomputation takes less than 10 minutes and uses a few hundred kilobytes of memory. Finally, we’re using actual user data to dictate what gets precomputed — yet another way that, when you use Constructor.io, your search learns from your users.

And by the way, it’s plenty fast again.

In [9]: timeit get_spelling_suggestions_precomputed_prefixes('optimizashun')
100 loops, best of 3: 4.35 ms per loop

And that’s a relatively slow one.