Here comes a little insight (and bragging!) into the internals of Skeyer – at least the way it works as of now…
If I give you a grid of keys – a fairly general assumption about most keyboards – and ask you to swipe the word “hey” on it, what would you do? You start at the H key, then swipe till you reach the E key, and then swipe till you reach the Y key. If you were to do that on your QWERTY keyboard, the keys you would most probably have pressed are: “HGFTRERTY”. Let’s call this the swipeHint of a word…
If you were to write a program to compute this, it would probably look like:
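function swipeHint( word )
{
    // Nothing to swipe between for empty or single-letter words
    if( word.length < 2 )
        return word
    var result = ''
    for( var i = 0; i < word.length - 1; i++ ){
        // Each letter, followed by the keys the finger crosses on the way to the next one
        result += word[i] + keys_between( word[i], word[i+1] )
    }
    result += word[word.length-1]
    return result
}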
Pretty straightforward, isn’t it? Now, the problem Skeyer is trying to solve boils down to a dictionary search for the input pattern. Instead of searching for words, we search for swipeHint(word), and then return the words those swipeHints belong to.
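The swipeHint function above leans on keys_between, which isn't shown. Here is a rough sketch of what it could look like (not Skeyer's actual code): it assumes a made-up QWERTY grid where each row is shifted by a guessed offset, walks a straight line between the two key centres, and collects the keys it passes over. The names ROWS, OFFSETS, KEYS, keyAt and nearestKey are all hypothetical, and a real finger path will not match a straight line exactly.

```javascript
// Hypothetical layout: three QWERTY rows, each shifted right by a guessed offset (in key widths)
var ROWS = ['qwertyuiop', 'asdfghjkl', 'zxcvbnm']
var OFFSETS = [0, 0.5, 1.5]

// Centre point of every key on the grid
var KEYS = []
ROWS.forEach(function (row, r) {
    row.split('').forEach(function (ch, c) {
        KEYS.push({ ch: ch, x: c + OFFSETS[r], y: r })
    })
})

function keyAt(ch) {
    for (var i = 0; i < KEYS.length; i++)
        if (KEYS[i].ch === ch) return KEYS[i]
    throw new Error('key not on layout: ' + ch)
}

// The key whose centre is closest to the point (x, y)
function nearestKey(x, y) {
    var best = null, bestDist = Infinity
    KEYS.forEach(function (key) {
        var d = (key.x - x) * (key.x - x) + (key.y - y) * (key.y - y)
        if (d < bestDist) { best = key; bestDist = d }
    })
    return best.ch
}

// Keys crossed on a straight line from one key to another, endpoints excluded
function keys_between(from, to) {
    var a = keyAt(from), b = keyAt(to)
    var path = []
    var steps = 40 // sampling resolution, an arbitrary choice
    for (var i = 0; i <= steps; i++) {
        var t = i / steps
        var ch = nearestKey(a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t)
        if (path[path.length - 1] !== ch) path.push(ch)
    }
    return path.slice(1, -1).join('')
}
```

With this sketch, keys_between('q', 'r') walks along the top row and picks up the two keys in between. Lowercase a to z only; anything else throws.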
I tried using an existing spelling corrector (hunspell) for this task… But it was of no use. Its main problems were:
- It was VERY slow… (Notice the VERY in caps!)
- It had no clue about which words are more “probable” than others. Someone told me that I use the word “like” a lot. So if I mistype that as “loke”, it suggests “luke” instead of “like”.
- It was very noisy… It had no clue about the layout of the keyboard, so it ended up suggesting many words which are totally unrelated to the word you swiped/intended to type. For example, if I mistyped “meat” as “meay”, it suggested “mean” instead of “meat”.
So I dug around on the internet a little and found Norvig’s Spell Checker, which is basically unusable if you want to find words which differ by more than 3 characters. So, I wrote a very straightforward implementation of such a matcher myself, something along the lines of:
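function findSimilarWords( input, dictionary, count, threshold )
{
    var results = []
    // Assumes dictionary is an object keyed by word
    for( var word in dictionary ){
        // Bayes' theorem: P(word | input) is proportional to P(input | word) * P(word)
        // Our dictionary can also provide probability(word), based on the user's usage...
        var score = similarity( input, swipeHint(word) ) * probability(word)
        if( score > threshold )
            results.push({ word: word, score: score })
    }
    // Best matches first, and no more than count of them
    results.sort(function( a, b ){ return b.score - a.score })
    return results.slice( 0, count )
}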
Now the question is: How do we compute similarity( string1, string2 )?
Once again, the internet came to the rescue, and I found: Edit distance … Nikita’s blog explains it a lot better.
I implemented an editDistance method whose cost function for substituting one character with another returns the distance between those two characters on the keyboard. The only problem was that it was too… slow.
The dynamic programming approach for it is an O(n*m) algorithm, where n and m are the lengths of the patterns we are trying to compare.
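For reference, the textbook dynamic-programming edit distance with a pluggable substitution cost can be sketched like this. It is not Skeyer's code: the key grid (ROWS, OFFSETS) is a guess, and keyboardCost, which scales the distance between key centres so that horizontal neighbours cost 0.5 and far-apart keys are capped at 1, is just one assumed way of turning key distance into a cost.

```javascript
// Classic O(n*m) edit distance, keeping only two rows in memory.
// substCost(x, y) is the substitution cost; it defaults to 0 for equal characters, 1 otherwise.
function editDistance(a, b, substCost) {
    substCost = substCost || function (x, y) { return x === y ? 0 : 1 }
    var prev = []
    for (var j = 0; j <= b.length; j++) prev[j] = j
    for (var i = 1; i <= a.length; i++) {
        var curr = [i]
        for (j = 1; j <= b.length; j++) {
            curr[j] = Math.min(
                prev[j] + 1,                                 // delete a[i-1]
                curr[j - 1] + 1,                             // insert b[j-1]
                prev[j - 1] + substCost(a[i - 1], b[j - 1])  // substitute (or match)
            )
        }
        prev = curr
    }
    return prev[b.length]
}

// Hypothetical key grid: QWERTY rows with guessed horizontal offsets
var ROWS = ['qwertyuiop', 'asdfghjkl', 'zxcvbnm']
var OFFSETS = [0, 0.5, 1.5]

function keyPosition(ch) {
    for (var r = 0; r < ROWS.length; r++) {
        var c = ROWS[r].indexOf(ch)
        if (c >= 0) return { x: c + OFFSETS[r], y: r }
    }
    throw new Error('key not on layout: ' + ch)
}

// Assumed cost: distance between key centres, halved and capped at 1
function keyboardCost(x, y) {
    if (x === y) return 0
    var p = keyPosition(x), q = keyPosition(y)
    var d = Math.sqrt((p.x - q.x) * (p.x - q.x) + (p.y - q.y) * (p.y - q.y))
    return Math.min(1, d / 2)
}
```

Under these assumptions, editDistance('meay', 'meat', keyboardCost) comes out smaller than editDistance('meay', 'mean', keyboardCost), because Y sits right next to T but far away from N. That is exactly the layout awareness hunspell lacked.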
To put things into perspective, my dictionary has 150,000 words, with an average word length of 30. Even if I were to search only 10% of it (you know… words starting with the same character as the first character of the input), that would be: 15000*30*30 = 13,500,000 operations, which is definitely waaay too much for your little phone.
Now a series of neat little tricks to make this work:
- Since we are checking against a threshold, as per Nikita’s blog, I stopped computing the similarity after finding the threshold number of differences. So similarity(string1, string2) now became O(k*min(m,n)). If we are looking for words not differing by more than 50%, for an average word length of 30, k would be 15, and the total number of operations would be 15000*15*30 = 6750000. That’s still waaay too much.
- We are interested in finding no more than 5 words (even that’s too much if you ask me, but let’s keep it at 5 for now). So if you have already found 5 words whose score is 0.8, there is no point computing the similarity of the input with words whose probability of occurrence is 50%: since score = probability*similarity and similarity is at most 1, such a word can never score above 0.5. So skip computing the edit distance for a word if its probability < minScore(result). For the word “beautiful”, whose swipeHint is “bvgfdrewsasdfghyuyttyuiuhgfgyuiol”, this brings down the cost to around 2700000.
- As we already know the minimum score our word needs to enter the results, why still keep k at 50% of the input length? Why not less? We are computing the edit distance for no more than k differences anyway:
// we know that
similarity = 1.0 - editDistance( string1, string2, k )/maxLength(string1,string2)
// and that
score = similarity(input, word, k)*probability(word)
// As we want a score > results' minScore
k = min( k, ( maxLength(input, word)*( 1.0 - results.minScore()/probability(word) ) ) )
This brings down the cost of “beautiful” to ~1700000 operations (900msec) …
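The cut-off at k differences and the tightening of k can be sketched as follows. This is a simplified illustration, not Skeyer's implementation: it uses plain unit costs instead of keyboard distances, and the names boundedEditDistance, at and tightenK are made up for the example. It only fills the cells within k of the diagonal and gives up as soon as a whole row exceeds k, returning k + 1 to mean "more than k".

```javascript
// Read a cell that may lie outside the band; missing cells count as unreachable
function at(row, j) {
    return row[j] === undefined ? Infinity : row[j]
}

// Unit-cost edit distance, capped: returns the distance if it is <= k, otherwise k + 1.
// Only cells with |i - j| <= k are computed, so the work is roughly O(k * min(n, m)).
function boundedEditDistance(a, b, k) {
    k = Math.floor(k)
    if (k < 0) return 0 // nothing can qualify; callers should skip the word instead
    var TOO_FAR = k + 1
    if (Math.abs(a.length - b.length) > k) return TOO_FAR
    var prev = []
    for (var j = 0; j <= Math.min(b.length, k); j++) prev[j] = j
    for (var i = 1; i <= a.length; i++) {
        var curr = []
        if (i <= k) curr[0] = i
        var rowMin = i <= k ? i : TOO_FAR
        var lo = Math.max(1, i - k), hi = Math.min(b.length, i + k)
        for (j = lo; j <= hi; j++) {
            curr[j] = Math.min(
                at(prev, j) + 1,                                  // delete
                at(curr, j - 1) + 1,                              // insert
                at(prev, j - 1) + (a[i - 1] === b[j - 1] ? 0 : 1), // substitute or match
                TOO_FAR
            )
            if (curr[j] < rowMin) rowMin = curr[j]
        }
        if (rowMin > k) return TOO_FAR // every path already costs more than k
        prev = curr
    }
    return Math.min(at(prev, b.length), TOO_FAR)
}

// The k from the formula above: the largest number of differences a word can have
// and still beat the current minimum score in the results.
function tightenK(k, maxLen, minScore, probability) {
    return Math.min(k, Math.floor(maxLen * (1 - minScore / probability)))
}
```

A caller would first skip any word whose probability is below the current minimum score, then compute tightenK for the rest and pass the result to boundedEditDistance. The better the results already found, the smaller k gets and the earlier the comparison bails out.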
- Okay, so even that’s too expensive. On my 8 year old laptop with a Core 2 Duo processor, that gives me results in about 900msec… and I assume it would take about the same amount of time on your Jolla/Ubuntu Touch devices… You don’t want to swipe 1 word per second, do you? If the word lengths differ by more than k, it is obvious their edit distance would be more than k. So skip all the words whose lengths differ by more than k. As it turns out, this helps, but not by much. “beautiful” now costs ~1500000 operations (700msec).
- I tried looking at other distance metrics for quickly estimating the similarity of two strings. The first nice one was Hamming Distance ( http://en.wikipedia.org/wiki/Hamming_distance ), but sadly it only applies to strings of equal length. So I tried computing the Hamming Distance for the “signature” of two strings, i.e., if a string A has the letters “C”, “A”, “T”, its
signature = 1 << index(C) | 1 << index(A) | 1 << index(T)
As it turns out, this absolutely sucks as a measure of similarity for two normal strings. But the insight it gave me was that if the per-letter character counts of the two strings differ by more than k, then their editDistance would obviously be more than k, so don’t bother computing the edit distance for them… and this check can be done in O(n).
“beautiful” now costs about 132370 operations or ~101msec, with a lazy implementation of this, which isn’t too bad for a brute force matcher, don’t you think? From 5 secs to 101 msec without too much effort 😀
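The two cheap pre-checks (word lengths and letter counts) could look roughly like this. Again a sketch under assumptions, not the code Skeyer runs: the function names are invented, and the bound is only guaranteed for unit-cost edits. Every letter that one string has "too many" of needs at least one delete or substitute, and every letter it is missing needs at least one insert or substitute, so the larger of those two totals can never exceed the edit distance. With fractional keyboard costs the threshold would have to be scaled accordingly.

```javascript
// How many times each character occurs in s
function letterCounts(s) {
    var counts = {}
    for (var i = 0; i < s.length; i++)
        counts[s[i]] = (counts[s[i]] || 0) + 1
    return counts
}

// A lower bound on the unit-cost edit distance between a and b, computed in O(n + m)
function countLowerBound(a, b) {
    var ca = letterCounts(a), cb = letterCounts(b)
    var surplus = 0 // letters a has more of than b
    var deficit = 0 // letters b has more of than a
    var ch, d
    for (ch in ca) {
        d = ca[ch] - (cb[ch] || 0)
        if (d > 0) surplus += d
    }
    for (ch in cb) {
        d = cb[ch] - (ca[ch] || 0)
        if (d > 0) deficit += d
    }
    return Math.max(surplus, deficit)
}

// true if the pair might still be within k edits and deserves the real comparison
function worthComparing(a, b, k) {
    if (Math.abs(a.length - b.length) > k) return false // cheapest check first
    return countLowerBound(a, b) <= k
}
```

Only the pairs that pass worthComparing need to go through the expensive edit distance. The check can wrongly let a pair through (an anagram such as "cat" and "act" has a bound of 0), but under the unit-cost assumption it never throws away a pair that is really within k edits.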
Now, I’m getting back to some more… less interesting tasks from my Todo list for Skeyer (Trello).
Btw, I am still looking for a job, so if you know anyone who is hiring for a role that you think suits me ( My Resume ), please let me know/let them know 🙂