Note
In Whoosh 1.9 the old spelling system based on a separate N-gram index was replaced with this significantly more convenient and powerful implementation.
Whoosh can quickly suggest replacements for mis-typed words by returning a list of words from the index (or a dictionary) that are close to the mis-typed word:
with ix.searcher() as s:
    corrector = s.corrector("text")
    for mistyped_word in mistyped_words:
        print(corrector.suggest(mistyped_word, limit=3))
See the whoosh.spelling.Corrector.suggest() method documentation for information on the arguments.
Currently the suggestion engine is more like a “typo corrector” than a real “spell checker” since it doesn’t do the kind of sophisticated phonetic matching or semantic/contextual analysis a good spell checker might. However, it is still very useful.
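To make the "typo corrector" idea concrete, here is a minimal pure-Python sketch of suggesting words by edit distance. It is not Whoosh's implementation: the functions edit_distance and suggest_by_edit_distance, the maxdist parameter, and the sample vocabulary are all hypothetical names chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def suggest_by_edit_distance(word, vocabulary, limit=3, maxdist=2):
    """Return up to `limit` vocabulary words within `maxdist` edits of `word`.

    Hypothetical helper for illustration only; closest words come first.
    """
    scored = []
    for candidate in set(vocabulary):
        dist = edit_distance(word, candidate)
        if dist <= maxdist:
            scored.append((dist, candidate))
    scored.sort()
    return [candidate for dist, candidate in scored[:limit]]


# Made-up vocabulary standing in for the terms of an index field
vocabulary = ["search", "searcher", "schema", "spelling", "suggest"]
suggestions = suggest_by_edit_distance("serch", vocabulary)
print(suggestions)
```

A purely character-based measure like this is why the engine catches typos such as a dropped letter but cannot find words that only sound alike or fit the context better.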
There are two main strategies for correcting words: use the terms from an index field, or use the words from a word list.
To enable spell checking on the contents of a field, use the spelling=True keyword argument on the field in the schema definition:
schema = Schema(text=TEXT(spelling=True))
(If you have an existing index you want to enable spelling for, you can alter the schema in-place using the whoosh.writing.add_spelling() function to create the missing word graph files.)
Tip
You can get suggestions for fields without the spelling attribute, but calculating the suggestions will be slower.
You can then use the whoosh.searching.Searcher.corrector() method to get a corrector for a field:
corrector = searcher.corrector("content")
The advantage of using the contents of an index field is that when you are spell checking queries on that index, the suggestions are tailored to the contents of the index. The disadvantage is that if the indexed documents contain spelling errors, then the spelling suggestions will also be erroneous.
There are plenty of word lists available on the internet you can use to populate the spelling dictionary.
(In the following examples, word_list can be a list of unicode strings, or a file object with one word on each line.)
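Because both a list of strings and a file object yield one word per iteration, the same cleanup code can handle either form. The sketch below uses a hypothetical helper, load_word_list, which is not part of Whoosh. Lowercasing and de-duplicating are assumptions that suit an index whose analyzer lowercases terms; adjust them to match your own analyzer.

```python
import io


def load_word_list(source):
    """Return a sorted list of unique, lowercased words.

    `source` may be a list of strings or a file-like object with one
    word on each line. Hypothetical helper, not a Whoosh function.
    """
    words = set()
    for line in source:
        word = line.strip().lower()
        if word:  # skip blank lines
            words.add(word)
    return sorted(words)


# The same helper accepts a plain list...
from_list = load_word_list(["render", "Animation", "render"])
# ...or a file object (io.StringIO stands in for an open text file here)
from_file = load_word_list(io.StringIO("alpha\nbeta\n\nAlpha\n"))
print(from_list, from_file)
```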
To create a whoosh.spelling.Corrector object from a word list:
from whoosh.spelling import GraphCorrector
corrector = GraphCorrector.from_word_list(word_list)
Creating a corrector directly from a word list can be slow for large word lists, so you can save a corrector’s graph to a more efficient on-disk form like this:
graphfile = myindex.storage.create_file("words.graph")
# to_file() automatically closes the file when it's finished
corrector.to_file(graphfile)
To open the graph file again very quickly:
graphfile = myindex.storage.open_file("words.graph")
corrector = GraphCorrector.from_graph_file(graphfile)
You can combine suggestions from two sources (for example, the contents of an index field and a word list) using a whoosh.spelling.MultiCorrector:
from whoosh.spelling import GraphCorrector, MultiCorrector
c1 = searcher.corrector("content")
c2 = GraphCorrector.from_graph_file(wordfile)
corrector = MultiCorrector([c1, c2])
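The idea of combining sources can be sketched in plain Python. This is only an illustration of one way to merge ranked suggestions, not a description of how MultiCorrector works internally. The function merge_suggestions and the (distance, word) pairs are hypothetical.

```python
def merge_suggestions(sources, limit=3):
    """Merge (distance, word) pairs from several sources.

    Keeps the smallest distance seen for each word, then returns the
    closest `limit` words, breaking ties alphabetically.
    """
    best = {}
    for pairs in sources:
        for dist, word in pairs:
            if word not in best or dist < best[word]:
                best[word] = dist
    ranked = sorted(best.items(), key=lambda item: (item[1], item[0]))
    return [word for word, dist in ranked[:limit]]


# Made-up suggestions for the typo "rendr" from two sources
from_index = [(1, "render"), (2, "renders")]
from_wordlist = [(1, "render"), (1, "fender"), (2, "tender")]
merged = merge_suggestions([from_index, from_wordlist])
print(merged)
```

Merging this way lets a general word list fill in words the index has never seen, while a word found in both sources is only suggested once.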
You can spell-check a user query using the whoosh.searching.Searcher.correct_query() method:
from whoosh import qparser
# Parse the user query string
qp = qparser.QueryParser("content", myindex.schema)
q = qp.parse(qstring)
# Try correcting the query
with myindex.searcher() as s:
    corrected = s.correct_query(q, qstring)
    if corrected.query != q:
        print("Did you mean:", corrected.string)
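The "did you mean" flow can also be illustrated without an index. The sketch below is a simplified stand-in, not Whoosh's algorithm: it splits the query on whitespace rather than working on a parsed query, and it swaps in difflib.get_close_matches from the standard library for the suggestion step. The function correct_query_string and the vocabulary are hypothetical.

```python
import difflib


def correct_query_string(qstring, vocabulary):
    """Replace each unknown token with its closest vocabulary word.

    Tokens that are already known, or that have no close match, are
    left unchanged.
    """
    corrected_tokens = []
    for token in qstring.split():
        if token.lower() in vocabulary:
            corrected_tokens.append(token)
            continue
        matches = difflib.get_close_matches(token.lower(), vocabulary, n=1)
        corrected_tokens.append(matches[0] if matches else token)
    return " ".join(corrected_tokens)


# Made-up vocabulary standing in for the indexed terms
vocabulary = ["render", "shading", "texture"]
original = "rendr shading"
corrected_string = correct_query_string(original, vocabulary)
if corrected_string != original:
    print("Did you mean:", corrected_string)
```

As in the Whoosh example, the suggestion is only shown when the corrected form differs from what the user typed.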
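The correct_query method returns an object whose attributes include query (the corrected query object) and string (the corrected query string), both used in the example above.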
You can use a whoosh.highlight.Formatter object to format the corrected query string. For example, use the HtmlFormatter to format the corrected string as HTML:
from whoosh import highlight
hf = highlight.HtmlFormatter()
corrected = s.correct_query(q, qstring, formatter=hf)
See the documentation for whoosh.searching.Searcher.correct_query() for information on the defaults and arguments.