Real-word Spelling (1-To-1)
This page describes the processes for real-word spelling (1-to-1) detection and correction.
I. Processes
- Detector: RealWord1To1Detector.java
  - Not corrected previously in the CSpell pipeline.
  - real-word: a valid word (in checkDic)
  - Not an exception: digit, punctuation, digit/punctuation, URL, email, empty string, measurement, proper noun, abbreviation/acronym
  - word has a context score
  - word WC >= 65 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_WC)
  - word length >= 2 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH)
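The detector checks above can be sketched as a single predicate. This is a minimal sketch, not the actual CSpell API: the method name `isDetect`, the dictionary/corpus arguments, and the in-lined exception test are illustrative, and only the digit exception is shown; the thresholds mirror the defaults of CS_DETECTOR_RW_1TO1_WORD_MIN_WC and CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH.

```java
import java.util.Map;
import java.util.Set;

public class RealWord1To1DetectorSketch {
    static final int MIN_WC = 65;     // CS_DETECTOR_RW_1TO1_WORD_MIN_WC default
    static final int MIN_LENGTH = 2;  // CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH default

    // checkDic, wordCounts, and contextWords stand in for CSpell's dictionaries
    // and word2vec vocabulary (illustrative, not the real interfaces).
    static boolean isDetect(String word, Set<String> checkDic,
                            Map<String, Integer> wordCounts,
                            Set<String> contextWords) {
        if (word.length() < MIN_LENGTH) return false;       // too short
        if (!checkDic.contains(word)) return false;         // not a valid word: non-word case
        if (word.matches(".*\\d.*")) return false;          // exception: contains a digit
        // (other exception checks - punctuation, URL, email, measurement,
        //  proper noun, abbreviation/acronym - are omitted in this sketch)
        if (!contextWords.contains(word)) return false;     // no word2vec context score
        return wordCounts.getOrDefault(word, 0) >= MIN_WC;  // frequent enough
    }
}
```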
- Candidates: RealWord1To1Candidates.java
  - Max. length of real-word <= 10 (configurable: CS_CAN_RW_1TO1_WORD_MAX_LENGTH)
    Real-word 1-to-1 candidates are generated only for words shorter than this value, to prevent over-generation and slow performance. If this number is too small, recall decreases (with faster speed).
  - Generate all possible candidates, as in the non-word 1-to-1 correction.
  - Filter out invalid candidates (IsValid1To1Cand):
    => Ideally, we only correct a real word with candidates that are very similar to the inWord, that is, candidates that look (orthographic) and sound (phonetic) alike. If we loosen this restriction, real-word correction relies mainly on the context score (word2vec). In this version, our word2vec corpus is relatively small, so it generates too much noise [FP] and results in low precision and F1. The restriction to look- and sound-alike candidates also helps (a little) with run-time performance (fewer context-score calculations in ranking).
    - in suggDic (valid word)
    - has context score (word2Vec)
    - WC >= 1 (has word count, configurable: CS_CAN_RW_1TO1_CAND_MIN_WC)
    - length >= 2 (configurable: CS_CAN_RW_1TO1_CAND_MIN_LENGTH)
    - candidate is not an inflectional variant of inWord
      In this version, we do not correct grammar, so inflectional variants (such as plural nouns, 3rd-person-singular verbs, etc.) are not corrected.
  - Heuristic rules for looks- and sounds-alike:
    - sounds alike: both phonetic codes, double metaphone and refined soundex, must be the same
      - same double metaphone code (pmDist = 0)
      - same refined soundex code (prDist = 0)
    - looks alike: small edit distance with similar sounds
      - leadDist + endDist + lengthDist + pmDist + prDist < 3
      - editDist + pmDist + prDist < 4
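The sounds-/looks-alike rules above can be sketched as follows. This is a hedged sketch: pmDist and prDist are assumed to be precomputed (in CSpell they come from double-metaphone and refined-soundex comparisons, e.g. via a phonetic-coding library), and the definitions of leadDist/endDist (first/last-character mismatch) and lengthDist (length difference) are assumptions for illustration, not confirmed from the source.

```java
public class OneToOneCandFilterSketch {
    // Classic Levenshtein edit distance (the editDist term in the rules above).
    static int editDist(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    // sounds alike: both phonetic codes must match exactly.
    static boolean soundsAlike(int pmDist, int prDist) {
        return pmDist == 0 && prDist == 0;
    }

    // looks alike: small edit distance with similar sounds.
    static boolean looksAlike(String word, String cand, int pmDist, int prDist) {
        int leadDist = word.charAt(0) == cand.charAt(0) ? 0 : 1;  // assumed definition
        int endDist = word.charAt(word.length() - 1)
                == cand.charAt(cand.length() - 1) ? 0 : 1;        // assumed definition
        int lengthDist = Math.abs(word.length() - cand.length());
        return leadDist + endDist + lengthDist + pmDist + prDist < 3
            && editDist(word, cand) + pmDist + prDist < 4;
    }
}
```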
- Key size of the HashMap that stores real-word 1-to-1 candidates in memory: 1,000,000,000 (configurable: CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE)
  Run time is slow because there are many real words and candidates, and generating all possible candidates on the fly is expensive. To resolve this issue, we save the generated candidates (values) with their real word (key) in memory (in a HashMap). Our test showed the elapsed time improved from 25+ min. to 3.5 min. on the training set, because:
  - many real words are repeated
  - the candidates of a repeated real word are the same
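The caching scheme can be sketched with a plain HashMap. The class and method names below are illustrative, not the actual RealWord1To1Candidates implementation; only the cap on the number of keys mirrors CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CandCacheSketch {
    static final long MAX_KEY_SIZE = 1_000_000_000L; // CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE

    private final Map<String, Set<String>> cache = new HashMap<>();

    // generator stands in for the expensive on-the-fly candidate generation.
    Set<String> getCandidates(String realWord, Function<String, Set<String>> generator) {
        Set<String> cached = cache.get(realWord);
        if (cached != null) return cached;            // repeated real word: reuse candidates
        Set<String> cands = generator.apply(realWord);
        if (cache.size() < MAX_KEY_SIZE) cache.put(realWord, cands); // cap memory use
        return cands;
    }
}
```

Because the same real words recur in the input and their candidates never change, most lookups hit the cache and the generator runs only once per distinct word.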
- Ranker: RankRealWord1To1ByCSpell.java
  - Find the top-ranked candidate
    Sort the candidates by CSpellScoreRw1To1Comparator.java (OrthographicScoreComparator).
    The top-ranked candidate (highest orthographic score) must also have the highest of the following scores in the candidate list:
    - FrequencyScore
    - EditDistSimilarityScore
    - PhoneticSimilarityScore
    - OverlapSimilarityScore
  - Validate the top-ranked candidate
    Use the context score to validate the top-ranked candidate (IsTopCandValid):
    - context radius = 2 (configurable: CS_RW_1TO1_CONTEXT_RADIUS)
    - Set RealWord_1To1_Confidence_Factor = 0.0 (configurable: CS_RANKER_RW_1TO1_C_FAC) for a stricter restriction that avoids false-positive candidates
    - orgScore < 0
      - & topScore > 0
        - Context Score Check (on min., distance, and ratio)
          - Min: topScore > rw1To1CandMinCs (0.00, configurable: CS_RANKER_RW_1TO1_CAND_MIN_CS)
          - Dist: topScore - orgScore > rw1To1CandCsDist (0.085, configurable: CS_RANKER_RW_1TO1_CAND_CS_DIST)
          - Ratio: (topScore / -orgScore) > rw1To1CandCsFactor (0.1, configurable: CS_RANKER_RW_1TO1_CAND_CS_FAC)
          - Min: orgScore > rw1To1WordMinCs (-0.085, configurable: CS_RANKER_RW_1TO1_WORD_MIN_CS)
        - Frequency Score Check (on min., distance, and ratio)
          - Min: topFScore > rw1To1CandMinFs (0.0006, configurable: CS_RANKER_RW_1TO1_CAND_MIN_FS)
          - Dist: topFScore > orgFScore or (orgFScore - topFScore) < rw1To1CandFsDist (0.02, configurable: CS_RANKER_RW_1TO1_CAND_FS_DIST)
          - Ratio: (topFScore / orgFScore) > rw1To1CandFsFactor (0.035, configurable: CS_RANKER_RW_1TO1_CAND_FS_FAC)
      - & topScore < 0 & topScore * RealWord_1To1_Confidence_Factor > orgScore
    - orgScore > 0
      - & topScore * RealWord_1To1_Confidence_Factor > orgScore
        => Never happens, because RealWord_1To1_Confidence_Factor is 0.0
    - orgScore = 0
      - No real-word 1-to-1 correction, because such words are excluded by the detector (no word2Vec information on the inspected word)
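The validation tree above can be sketched as nested conditionals. A minimal sketch: the class and method names are illustrative, and the frequency-score check for the orgScore < 0 && topScore > 0 branch is omitted for brevity; the thresholds mirror the CS_RANKER_RW_1TO1_* defaults listed above.

```java
public class TopCandValidatorSketch {
    static final double CAND_MIN_CS = 0.00;    // CS_RANKER_RW_1TO1_CAND_MIN_CS
    static final double CAND_CS_DIST = 0.085;  // CS_RANKER_RW_1TO1_CAND_CS_DIST
    static final double CAND_CS_FAC = 0.1;     // CS_RANKER_RW_1TO1_CAND_CS_FAC
    static final double WORD_MIN_CS = -0.085;  // CS_RANKER_RW_1TO1_WORD_MIN_CS
    static final double C_FAC = 0.0;           // CS_RANKER_RW_1TO1_C_FAC

    // Context Score Check: min (candidate), distance, ratio, min (focus word).
    static boolean contextScoreCheck(double orgScore, double topScore) {
        return topScore > CAND_MIN_CS
            && topScore - orgScore > CAND_CS_DIST
            && topScore / -orgScore > CAND_CS_FAC
            && orgScore > WORD_MIN_CS;
    }

    static boolean isTopCandValid(double orgScore, double topScore) {
        if (orgScore < 0.0) {
            // (the Frequency Score Check that accompanies this branch is omitted here)
            if (topScore > 0.0) return contextScoreCheck(orgScore, topScore);
            return topScore < 0.0 && topScore * C_FAC > orgScore;
        }
        if (orgScore > 0.0) {
            return topScore * C_FAC > orgScore; // never true with C_FAC = 0.0
        }
        return false; // orgScore == 0: already excluded by the detector
    }
}
```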
- Corrector: OneToOneCorrector.java
  - Update the focused (inspected) token with the top-ranked candidate.
  - Update the process history to real-word-1To1.
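The corrector step is small enough to show in full. This sketch uses a stand-in token class; CSpell's actual token object and OneToOneCorrector have richer interfaces, so the names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class OneToOneCorrectorSketch {
    // Stand-in for CSpell's token object (illustrative).
    static final class Token {
        String tokenStr;
        final List<String> processHistory = new ArrayList<>();
        Token(String s) { tokenStr = s; }
    }

    static void correct(Token focusToken, String topCand) {
        focusToken.tokenStr = topCand;                   // replace with top-ranked candidate
        focusToken.processHistory.add("real-word-1To1"); // record the process history
    }
}
```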
II. Development Tests
Different real-word 1-to-1 factors were tested on the revised gold standard (with real-word errors included) from the training set. Each test takes about 3~5 min. (depending on the computer and memory size).
- Detector (check on focus token):

  Function      Min. Length  Min. WC  Raw data     Performance
  NW (All)      N/A          N/A      607|777|964  0.7812|0.6297|0.6973
  NW + RW_1To1  1            65       612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  2            65       612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  3            65       612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  4            65       612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  5            65       611|783|964  0.7803|0.6338|0.6995
  NW + RW_1To1  6            65       609|781|964  0.7798|0.6317|0.6980
  NW + RW_1To1  7            65       608|778|964  0.7815|0.6307|0.6980
  NW + RW_1To1  8            65       607|777|964  0.7812|0.6297|0.6973
  NW + RW_1To1  2            1        612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  2            10       612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  2            65       612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  2            100      611|785|964  0.7783|0.6338|0.6987
  NW + RW_1To1  2            500      610|784|964  0.7781|0.6328|0.6979
  NW + RW_1To1  2            1000     610|782|964  0.7801|0.6328|0.6987
  NW + RW_1To1  2            10000    608|778|964  0.7815|0.6307|0.6980

  - Test on Min. length:
    - Increase it for better precision, worse recall.
    - Using a small number does not increase precision further.
    - The TPs start to drop after 5; this might result in a better or worse F1.
    - No TPs by RW_1To1 when it is 8 (>= 8), because all corrections in the development set are shorter than 8 characters.
    - Choose 2 for more recall with the same F1 and precision. This means a target word of length 1 is not a valid real-word for 1-to-1 correction.
  - Test on Min. WC (word count):
    - Increase it for better precision, worse recall, and faster run time.
    - Using a small number does not increase precision.
    - Choose 1 for more recall with the same F1 and precision.
- Candidates (check on candidates):

  Function      Min. Length  Min. WC  Raw data     Performance
  NW (All)      N/A          N/A      607|777|964  0.7812|0.6297|0.6973
  NW + RW_1To1  1            1        612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  2            1        612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  3            1        612|787|964  0.7776|0.6349|0.6990
  NW + RW_1To1  4            1        612|785|964  0.7796|0.6349|0.6998
  NW + RW_1To1  5            1        612|785|964  0.7796|0.6349|0.6998
  NW + RW_1To1  6            1        609|779|964  0.7818|0.6317|0.6988
  NW + RW_1To1  7            1        608|778|964  0.7815|0.6307|0.6980
  NW + RW_1To1  2            1        612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  2            10       612|787|964  0.7776|0.6349|0.6990
  NW + RW_1To1  2            100      612|791|964  0.7737|0.6349|0.6974
  NW + RW_1To1  2            1000     611|791|964  0.7724|0.6338|0.6963
  NW + RW_1To1  2            10000    608|782|964  0.7775|0.6307|0.6964

  - Candidate Min. length:
    - Increase it for better precision, worse recall.
    - Past a certain threshold, both recall and precision drop.
    - Best F1 at 4-5, because all TPs have length >= 4 (see the examples below).
    - This number must be coordinated with the min. focus-token length.
    - Choose 2 (a candidate with length 1 is not a valid candidate).
  - Candidate Min. WC:
    - Increase it for better precision, worse recall.
    - Choose 1 (corrections might have a small WC).
- Rankers - confidence factor for selecting and validating the top candidate:

  Function      C Factor  C Score                 F Score            Raw data     Performance
  NW (All)      N/A       N/A                     N/A                607|777|964  0.7812|0.6297|0.6973
  NW + RW_1To1  0.00      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  0.01      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|789|964  0.7757|0.6349|0.6982
  NW + RW_1To1  0.10      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|813|964  0.7528|0.6349|0.6888
  NW + RW_1To1  0.50      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|998|964  0.6132|0.6349|0.6239
  NW + RW_1To1  0.00      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964  0.7786|0.6349|0.6994
  NW + RW_1To1  0.00      0.10|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964  0.7786|0.6349|0.6994
  ... TBD ...

  - Confidence Factor:
    - A very strict restriction on the confidence factor is needed to eliminate FPs.
    - Choose a C factor of 0.00 (the top candidate is only valid when the focus token has a negative context score and the top candidate has a positive one).
III. Observations from the Development Test Set (F1 = 0.6994)
- [TP] real-word 1-To-1 corrections:

  ID    Source  Detected Word  Corrected Word  Text
  TP-1  11225   weather        whether         from one Person to another. Weather it can happen or
  TP-2  11597   bowl           bowel           irregular bowl movements.
  TP-3  12748   effect         affect          what is TSD/Clubfoot, and how does it effect a baby
  TP-4  13922   their          there           in the Chicago area hospitals is their a surgeon familiar with the shoudice
  TP-5  17713   small          smell           lost ability to taste and small, and who is profoundly depressed

  Example for TP-5 (smell vs. small), context phrases from the corpus:
  - taste and small, foul small, bad small, small an odor, sense of small
  - smell size, smell amounts, a smell sip of water, smeller amounts, smell intestine
- [FP] real-word 1-To-1:

  ID    Source  Detected Word  Corrected Word  Text
  FP-1  10349   please         place           ...give me good advice please
  FP-3  18855   head           had             ... backalso inner head pain.com
  FP-4  2       causes         cases           What are some causes of anorexia

  - FP-3: The corpus has more "also and had" than "inner head".
  - FP-4: For "some causes of anorexia" alone, "causes" is left unchanged, but once "are" is added, "causes" is corrected to "cases". The correction would be OK for "What are some causes of pain" or "What are causes of anorexia".
- [FN] real-word 1-To-1:

  ID     Source  Focus Word  Corrected Word  Text
  FN-1   32      then        than
  FN-2   51      thing       think
  FN-3   10138   know        now
  FN-4   10375   tried       tired
  FN-5   10934   specially   especially
  FN-6   11186   repot       report
  FN-7   11378   then        than            Is Radioiodine treatment better then surgery for me?
  FN-8   16734   weather     whether         I was particularly interested in learning weather parents should be worried about cribs death
  FN-9   12286   lesson      lessen          What can I do to lesson the severity of the adema
  FN-10  12757   pregnancy   pregnant
  FN-11  12788   leave       live
  FN-12  15759   tent        tend
  FN-13  16256   access      excess
  FN-14  16297   loosing     losing

  - FN-9: "lesson" is not in the word2Vec corpus.
    => Only "lessons" is in it. Maybe use inflVars for detection.
    => A much bigger corpus is needed for the word2Vec model.
    => The word2vec approach is very good on precision. However, the corpus used for training has to include the needed information (words and their contexts).
