c # – LSH for a word list

I am trying to write a spell checker, I have a huge list of words (at least 500K, due to the nature of the language). The performance would suffer greatly if I obtained the distance of lavenshtein from all the words in the list of words to the current word.

So I'm trying to distribute the list of words in groups based on the characters of the word. In this way, I just need to get the Levenshtein distance from the words in the specific cube instead of the entire word list.

So I was wondering if there is any good local-sensitive hash algorithm for my problem? I have written a very simple algorithm that seems to work well and the performance is at least 20 times better. Although the accuracy suffers a lot.

This is what I found:

static string GetHash (string word)
{
word = word.ToUpperInvariant ();
var lettersToTake = word.Length - (int) (word.Length * 0.70);

var chars = word.GroupBy (c => c)
.OrderByDescending (c => c.Count ())
.ThenBy (c => GetFrequency (c.Key))
.Take (lettersToTake)
.Select (g => g.Key)
.To list();

return new string (chars.ToArray ());
}

Here are the steps of the algorithm:

  1. Capitalize all the characters
  2. Sort the characters in the word by the frequency of the letter in the word
  3. Then order the characters by the frequency of the word in the whole list of words (GetFrequency)
  4. Take 30% of the characters of the word.