3. GSpell
GSpell is a spelling suggestion tool that uses a mix of algorithms to retrieve close neighbors. This application is best suited to applications that index at the word or term level of tokenization.
Java API's are provided for developers to embed this application in their applications. Two java main programs are provided, GSpell, and CommonMisspellings. GSpell is the general spelling suggestion tool. The auxiliary CommonMispellings program is provided as mechanism for inserting and updating common misspelling by correct spelling pairs. Once installed, the listed shell scripts around these two java programs are available.
See the output section for an explanation of the the output from the GSpell program.
3.1 Usage
GSpell.[sh|bat]
  
--index|--update|--find|--export 
   [--dictionaryDir=someFullPath]
    --dictionaryName=NameOfDictionary 
   [--inputFile=YYYYY] [--outputFile=ZZZZZ]
   [--fieldedText] [--termField=X]

   [--truncate=N] [--considerNCandidates=N]
   [--maxEditDistance=N] [--correctField=Y]
   [--wordLengthHeuristic=true|false]
   [--reportTime] [--help]
GSpellIndex.[sh|bat]
   [--dictionaryDir=someFullPath]
    --dictionaryName=NameOfDictionary 
   [--inputFile=YYYYY] [--outputFile=ZZZZZ]
   [--fieldedText] [--termField=X]
   [--reportTime] [--help]     
GSpellFind.[sh|bat]
   [--dictionaryDir=someFullPath]
    --dictionaryName=NameOfDictionary 
   [--inputFile=YYYYY] [--outputFile=ZZZZZ]
   [--fieldedText] [--termField=X]

   [--truncate=N] [--considerNCandidates=N]
   [--maxEditDistance=N] [--correctField=Y]
   [--wordLengthHeuristic=true|false]
   [--reportTime] [--help]
GSpellUpdate.[sh|bat]
[--dictionaryDir=someFullPath]
    --dictionaryName=NameOfDictionary 
   [--inputFile=YYYYY] [--outputFile=ZZZZZ]
   [--fieldedText] [--termField=X]

   [--truncate=N] [--considerNCandidates=N]
   [--maxEditDistance=N] [--correctField=Y]
   [--wordLengthHeuristic=true|false]
   [--reportTime] [--help]
3.2 Options

--index
Create a new dictionary. Take input from either the file specified by the --inputFile parameter, or from standard input. Input terms are expected to be of the format, one term per new line separated line. Lines starting with # are taken to be comments are are ignored. Existing dictionaries of the same name will be overwritten.
A note about memory. This application requires more than the standard amount of memory when indexing. The total amount of memory used during indexing is reported with the --reportTime option.  It has been observed that 128MB was needed to load a set of 300,000 terms. Please use the java options -MX and -MS to run with the appropriate amount of memory when indexing.
--update
Add terms to an existing dictionary. This option is similar to --index option, except that if the dictionary exists, terms are added to it, rather than having the dictionary overwritten. Take input from either the file specified by the --inputFile parameter, or  from standard input. Input terms are expected to be of the format, one term per new line separated line. Lines starting with # are taken to be comments are are ignored.
--find
Retrieve close suggestions to input terms. Input is either taken from the standard input [the default], or from a file specified via the --inputFile. By default the input is to be in the format of one term per newline delimited line. However, the application can take fielded input (see the --fieldedText and --    termField options).
The output returned contains the input plus four fields. The first of these fields is the suggestion, the second is a ranking, the third field is the method that returned this suggestion, and the last field, will indicate whether or not a term was found, or was found to be correct (in a case insensitive way).
--export
Spew out to the contents of the indexes to standard output, or to the file specified by the -- outputFile option. Caution: this option spews out lots of data. This option is useful really only for debugging purposes. This option may disappear in future releases.
--dictionaryName=
The name of the dictionary to be created or to be referenced. Since a directory is created based on this name, it should be named compliant with the naming rules for directories.
Each dictionary is placed in the dictionaries directory. The dictionaries directory is specified by the --dictionaryDir=  path found in the $GSPELL_ROOT/config/gspellRegistry.cfg file. The $GSPELL_ROOT directory is the base directory where GSpell was installed.
--dictionaryDir=
Dictionary directories containing the indexes are placed in the in the $DictionaryDir. $DictionaryDir is specified in the $GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overruled by the -- dictionaryDir=aFullPath command line option. The $GSPELL_ROOT directory is the base directory where GSpell was installed.
A note to those who would like to have dictionaries placed in a location other than $GSPELL_ROOT/dictionaries: specify  the --dictionaryDir= and -- dictionaryName= command line options.              
--inputFile=
Specify the (full path) name of an input file. The semantics of this file refer to being the contents of what is to be index when this option is used with the --index and -- update options. The contents of this file is taken as input for retrieval when the --find option is used.
--outputFile=
Specify the name of the output file.
--fieldedText
Indicate that the input is Pipe delimited input, where one of the fields contains the significant term, and the rest of the terms are passed through. This option must be used with the --termField option to indicate the field to retrieve. This option is only relevant with the --find option.
--termField=X
Indicate the field to pick up the term to be looked up. The first field is taken to be 1, not zero.
--truncate=N
Return the top N suggestions.
--considerNCandidates=N
Limit the total number of candidates to be evaluated to N. This parameter has a direct relation to both the speed and accuracy of the application. The more candidates evaluated, the better the suggestions, but the slower the response is.
--maxEditDistance=N
Limit the returned suggestions to those who have edit distances that are less than or equal to N.
--correctField=Y
This option is for debugging and benchmarking purposes. When present, it indicates the field that contains the correct spelling. Further, benchmark statistics are computed when this option is present. Those statistics include the number of responses where the correct term was found in the first, the first five, the first ten, the first hundred suggestions along with how many responses completely missed the mark. The output stats file is labeled dictionaryName.stats and is placed in the directory where the application is kicked off.
--reportTime
This option reports status messages to standard error during processing. It is useful when indexing a large set of terms to determine how long it takes to process 10,000 terms, and how much memory was needed.
--wordLengthHeurstic=
true|false
Limit the candidates to be evaluated to those which are with +/-4 characters of the input term. The nGram index is segmented into bins by the size of input words.
When this heuristic is turned on, the number of candidates is effectively pruned to not even consider words that are known to be outside reasonable edit distances already. This allows additional candidates that might not have been able to be evaluated to be evaluated. This heuristic has two beneficial aspects to it. It speeds up suggestions, and it suggests additional close neighbors.
This heuristic does increase the size of the resulting index by about 50% over an index that did not use this heuristic. Making a non- wordlengthHeurstic index is not an option. This heuristic prunes multi word neighbors that may share a word in common.
The size of target words range from 1 to over 100 characters, and are distributed in a Zipf like distribution (at least for our dictionary), where there are a few 1 character words, some more two character words, a fair amount of three character words, and a huge number of words between four characters and ten characters. The numbers start trailing off slowly after that. There is a precipitous drop off in the number of words that have more than 20 characters. This option is turned on by default.
--timeout=true|false
Turns on a timeout feature. Used in conjunction with --timeoutInHours= variable. When true, GSpell will close down and reload after the specified timeout period. This was added at the request of the RxNav project, where they get updates to the files once a week. New indexes are generated, and to be seen, the application needs to reload.  This variable is false by default.
--timeoutInHours=
The time out period, in hours.
--version
--CVSVersion
--history
Retrieve the version of this build
Retrieve the CVS tagged version of this build
Retrieve the history notes of this build.
--help
Help
3.3 Output
GSpell's output for an input term contains the following fields:
Input term
Suggestion
Edit Distance
Rank from 0 to 1
(1.0 is an exact match)
[1]
Method used
Other messages
such as
Correct
Frequency from corpus
[2]
Weighted rank
[3]
Here's an example of the output for the misspelled term anonomous.
anonomous
anonomous|anonymous|1.0|0.87|NGrams||402|1999598
anonomous|autonomous|2.0|0.58|NGrams||664|2999336
anonomous|allonomous|2.0|0.58|NGrams||-1|3000001
anonomous|analogous|3.0|0.3|NGrams||1008|3998992
anonomous|anomalous|3.0|0.3|NGrams||890|3999110
anonomous|anonymously|3.0|0.3|NGrams||54|3999946
anonomous|anadromous|3.0|0.3|NGrams||1|3999999
anonomous|anonymes|3.0|0.3|Metaphone||-1|4000001
anonomous|anonyms|3.0|0.3|Metaphone||-1|4000001
anonomous|Axonopus|3.0|0.3|NGrams||-1|4500001

Here is an example for the correctly spelled word disease
disease
disease|disease|0.0|1.0|NGrams|Correct|94306|905694
disease|diseases|1.0|0.87|NGrams||18485|1981515
disease|diseased|1.0|0.87|NGrams||855|1999145
disease|dispase|1.0|0.87|NGrams||27|1999973
disease|discase|1.0|0.87|NGrams||3|1999997
disease|diverse|2.0|0.58|NGrams||3134|2996866
disease|disuse|2.0|0.58|NGrams||84|2999916
disease|disperse|2.0|0.58|NGrams||66|2999934
disease|diastase|2.0|0.58|NGrams||25|2999975
disease|dispense|2.0|0.58|NGrams||24|2999976

[1] The rank is the normalized ranking with a value between 0 and 1 where 1 is an exact match and 0 is no match at all. The distribution for this normalized score is intended to have edit distances of .5 rank about the .90. The distribution curve is half the bell curve, and intended to allow developers relying on edit distance as the sorting metric to have a metric that ranks words beyond 2 edit distances as quickly getting worse as the curve extends out.
[2] The corpus frequency is taken from the 1999 Medline word frequencies. See the simpleCorpusStatistics notes for details on how to embed other corpus frequencies into this package.
[3] The weighted rank is currently the following:
     
      [ ( the edit distance * the method number)  * 1000 )  ]    + [ MAX FREQ - the frequency of the word found in the corpus]

 
    Where
  • The edit distance is the Levenshtein edit distance
  • The frequency of the word found in the corpus comes from the 1999 Medline word frequency list, modified to throw out words with frequencies less than 4 occurrences, as an efficiency to keep the list manageable.
  • The MAX FREQ currently is a constant == 1,000,000.  This number was intended to be the number of occurrences of words in the corpus, but this proved hard to compute and pass on to the ranking method. A suitably large number was chosen as a surrogate.
  • The method numbers were intended to weight homophones and corrections to common misspellings higher than other methods.
Method
Method Number
NGrams
1
Metaphone
1
Homophones
.5
Common Misspellings
.5
bagOWordsPlus
1
default
1

 
                 
3.4 Tunable Parameters
There are tuning parameters that are specific to each dictionary created. These parameters are stored in a configuration file in the dictionary directory. 
Performance tuning variables
--considerNCandidates
The number of candidates that can be evaluated for any given input. The higher the number the better the suggestions are, but the longer it takes to do. A reasonable rate on a sun is about 2000. This parameter alters the behavior of the NGrams algorithm.
--cacheSize
This is the number of grams to keep in memory during processing. The more grams in memory, the faster the response is. This makes a real difference for indexing, where if you can keep all the grams in memory, the faster indexing will go. The downside is more memory is used. For indexing, this may mean bumping up the memory to something along the lines of 100mb for dictionaries of 100,000 terms or more. This variable is used to keep the number of grams in memory to this size. The hash of grams does not grow beyond this number. Ideally, this size should not be set bigger than the number of grams you have. This parameter alters the behavior of the NGrams algorithm.

Variables that can be changed to give specific behaviors
--truncate
This variable limits the number of suggestions returned for a given input.
--maxEditDistance
This variable limits suggestions to those which have an edit distance equal to or less than this parameter.
--wordLengthHeuristic
This heuristic retrieves candidate terms that are +/- 4 characters in length larger or smaller than the query term. For example, if you put in a term that is two characters long, the longest term retrieved would be 6 characters.  This heuristic is turned on by default.  This heuristic only affects the behavior of the NGram algorithm.
3.5 API's

API builders should view the gspell package. It should be possible to embed this package into other applications using the following constructors and methods for the class GSpell.
GSpell instances can be embedded into single threaded applications with no problem. It can be embedded into threaded applications with the following advice:
Embedding read-only instances of GSpell objects is fine as long as there are no read/write instances present. As of Version 0.0.36, when in a multi-threaded environment, care should be taken if the GSpell instance is a singleton, or if the instance is a static instance. Several hashtables and arrayLists get reused for each term processed.  If you have one of these cases, put a synchronize around the find call to avoid problems. 
It is recommended that in cases where there are updates or read/writes, this object gets implemented as a singleton instance per dictionary, that is shared across threads and processes.
There is NO writer lock and unlock around the update/index method, nor around the cleanup or flush methods to insure that only one writer is alive at any given time, and that there are no readers alive. It is up to the application developer to insure that access is controlled around GSpell Object instances that do writes or updates.
Create or update an index
The new dictionary name is specified as part of the specification to the GSpell constructor. There is a constructor specifically with this parameter.  If just a name is given, the location is placed in the nls/gspell/dictionaries directory. If you want to specify a different place, then use the full path as part of the dictionary name.
An alternative method of specifying the dictionary name is to place the command line argument "--dictionaryName=someName" on the args parameter to the constructor that takes the command line arguments.


When indexing is complete, call the save or flush(), then call the cleanup()  method.
Retrieve suggestions for terms
Methods from the Candidate class.
GSpell GlobalBehavior and Find Options
Example Applications

Examples that illustrate embedding this in an application are provided in the examples directory.
Constructors for Apps that cannot set the classpath
For those applications that cannot set the classpath, use the following constructor:
GSpell(java.lang.String pConfigPath,java.lang.String pDictionaryName, int pMode)

The pConfigPath parameter should specify the full path to the ___GSPELL_ROOT/config/GSpellConfigpath.
The dictionaryName should reflect the name of a directory from the ___GSPELL_ROOT/dictionaries/dictionaryNamedirectory, or it should be the full path to the dictionary to be opened.
This constructor is used for those applications that cannot set the classpath, such as java servlet apps.
3.6 Example Scripts
3.7 Dictionaries
3.8 Known Issues
The application is driven by some heuristics that in general work well, but have consequences.
There is a consequence to the bathtub heuristic. If the spelling error occurs in the first or last character of the string, the suggestions, if any will more than likely be wrong. In a future release, two additional techniques may be employed to ameliorate this: a left and right truncation retrieval, and a conservative stemming heuristic. A centroid approach may be looked into as well, but that approach has a high cost of training associated with it.
The quality of the returned suggested terms are directly related to the number of candidates evaluated. The more candidates evaluated, the better, but the cost is the amount of time it takes to return suggestions.
Nick Ide has suggested several improvements that should be adopted:
    • Have the suggestions be weighted by the likelihood that the transformation between the bad and good word was caused by some understandable heuristic. For instance it is common to see vowel transformations, where it is less common to see a vowel to consonant transformation. Likewise typo's caused by keyboard configuration are common.
This suggests instituting a confusion matrix a-la Tom Lasco within the edit distance. Tom Lasco augmented gspell with a confusion matrix two years ago for an OCR application.
    • Institute a logging function to capture good/bad spelling pairs.
3.9 Future Directions
Performance enhancements
    • Create threads for each of the spelling techniques. Each is independent of the other, and this would speed the application up by about 20%.
Quality Enhancements
    • Investigate whether removing plural sembients from incoming terms helps or hurts. I have observed that some noticeable portion of misses were caused by the incoming term containing a trailing s, but where only the singular form of the word was in the target.
Misc. Enhancements
    • Gather and distribute application specific dictionaries.