Dictionary Functions - Check Valid Word
I. Introduction
In cSpell, all tokens used for spelling error detection are single words. Thus, only single words need to be in the dictionary. This page describes which dictionary should be used for spelling error detection.
II. Algorithm
Both the whole token and its core-term are checked for valid spelling (Is-Valid-Word):
- Check whether the whole token or its core-term is a dictionary word (case-insensitive)
- Remove the possessive ('s) and check whether the original word is a dictionary word (case-insensitive)
- Check that all words in an or-slash-term are valid words
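The three checks above can be sketched as follows. This is a minimal illustration, not the cSpell implementation: the toy `DICTIONARY`, the helper names, and the definition of the core-term as "the token with leading and trailing punctuation stripped" are all assumptions made for the example.

```python
import string

# Hypothetical toy dictionary, for illustration only.
DICTIONARY = {"diabetes", "patient", "doctor", "and", "or", "he", "she"}


def in_dictionary(word: str) -> bool:
    """Case-insensitive dictionary lookup."""
    return word.lower() in DICTIONARY


def core_term(token: str) -> str:
    """Assumed definition: strip leading/trailing punctuation from the token."""
    return token.strip(string.punctuation)


def strip_possessive(word: str) -> str:
    """Remove a trailing possessive 's, if present."""
    return word[:-2] if word.lower().endswith("'s") else word


def is_valid_word(token: str) -> bool:
    core = core_term(token)
    # Steps 1 and 2: whole token and core-term, with and without possessive.
    for form in (token, core):
        if not form:
            continue
        if in_dictionary(form) or in_dictionary(strip_possessive(form)):
            return True
    # Step 3: every word of an or-slash-term must be a dictionary word.
    parts = core.split("/")
    return len(parts) > 1 and all(p and in_dictionary(p) for p in parts)
```

With this toy dictionary, `is_valid_word("patient's")`, `is_valid_word("(doctor),")`, and `is_valid_word("he/she")` all return `True`, while `is_valid_word("diabetis")` returns `False`.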
III. Results
Test cSpell with different dictionaries:
- Is Valid (Spell Checker):
  - Dictionary.IsValidWord
  - Possessives
  - Or-slash terms
  - Hyphenated words (TBD)
- Exceptions:
  - IsDigit
  - IsPunc
  - IsDigitPunc
  - IsMeasurements (units, digit + units)
  - IsUrl
  - IsEmail
  - IsEmptyString
  - IsProperNoun (dictionary based)
  - IsAbbAcr (dictionary based)
- Dictionary.IsValidWord
- Candidates:
  - 1-to-1: mainDic
  - Split:
    - if whole split is a multiword
    - if all split words are digit, unit, or noAaDic*

* noAaDic: En + Pn
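The pattern-based exception checks listed above can be illustrated with a short sketch. The regular expressions, the `UNITS` set, and all function names here are simplified assumptions for illustration, not the rules cSpell actually uses. The dictionary-based checks (IsProperNoun, IsAbbAcr) are plain lookups and are omitted.

```python
import re
import string

# Hypothetical unit list, for illustration only.
UNITS = {"mg", "ml", "kg", "cm"}


def is_empty_string(tok: str) -> bool:
    return tok.strip() == ""


def is_digit(tok: str) -> bool:
    return tok.isdigit()


def is_punc(tok: str) -> bool:
    return tok != "" and all(c in string.punctuation for c in tok)


def is_digit_punc(tok: str) -> bool:
    return tok != "" and all(c.isdigit() or c in string.punctuation for c in tok)


def is_measurement(tok: str) -> bool:
    # A bare unit ("mg") or a number directly followed by a unit ("5mg").
    m = re.fullmatch(r"\d+(?:\.\d+)?([A-Za-z]+)", tok)
    return tok.lower() in UNITS or (m is not None and m.group(1).lower() in UNITS)


def is_url(tok: str) -> bool:
    return re.fullmatch(r"(?:https?://|www\.)\S+", tok, re.IGNORECASE) is not None


def is_email(tok: str) -> bool:
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", tok) is not None


EXCEPTION_CHECKS = [is_empty_string, is_digit, is_punc, is_digit_punc,
                    is_measurement, is_url, is_email]


def is_exception(tok: str) -> bool:
    """A token matching any exception is not flagged as a spelling error."""
    return any(check(tok) for check in EXCEPTION_CHECKS)
```

Under these assumptions, tokens such as `"12/25"`, `"5mg"`, and `"www.nih.gov"` are treated as exceptions, while `"diabetis"` is not and would go on to the dictionary check.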
eng_medical.dic:
- element words from UMLS Strings
- does not include proper nouns
- does not include abbreviations or acronyms
- does not include spelling variants
Lexicon.dic:
- all: all words
- sw: single words
- mw: multiwords
- ew: element words (unigram)
- aa: abbreviations and acronyms
- pn: proper noun
- sv: spelling variants
- noAa: English words (all, excluding aa)
- en: English words (all, excluding aa and pn)
- swEn: single English words (single words, excluding aa and pn)
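The derived subsets above are set differences over the full word list. The sketch below uses a tiny made-up lexicon to show the relationships. The sample words and variable names are assumptions. Only the subset definitions come from the list above.

```python
# Hypothetical mini-lexicon, for illustration only.
all_words = {"diabetes", "heart attack", "NIH", "Bethesda", "colour", "color"}

sw = {w for w in all_words if " " not in w}   # single words
mw = all_words - sw                           # multiwords
aa = {"NIH"}                                  # abbreviations and acronyms
pn = {"Bethesda"}                             # proper nouns

no_aa = all_words - aa        # noAa: all, excluding aa
en = all_words - aa - pn      # en: all, excluding aa and pn
sw_en = sw - aa - pn          # swEn: single words, excluding aa and pn
```

In this sample, `no_aa` equals `en | pn` (matching the "noAaDic: En + Pn" note, as long as no proper noun is also an abbreviation), and `sw_en` equals `sw & en`.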
IV. Tests:
Test-1: Tests on Baseline + Lexicon (not used, results are included from above)
- These tests use only 1 dictionary for both check and suggestion
- The pn, aa, and sv checks are implemented in the algorithm
- The results show that element words (ew: for better spelling error detection) combined with English words (en: for better candidate suggestion) perform best.
- These results lead us to use two dictionaries: check and suggest
| Dictionary | TP | Ret | Rel | Precision | Recall | F1 | Notes |
|---|---|---|---|---|---|---|---|
| Lexicon (single-word + multiwords) | | | | | | | |
| | 535 | 858 | 814 | 0.6235 | 0.6572 | 0.6400 | |
| | 530 | 877 | 814 | 0.6043 | 0.6511 | 0.6268 | |
| Lexicon (single-words) | | | | | | | |
| | 531 | 808 | 814 | 0.6572 | 0.6523 | 0.6547 | |
| | 535 | 858 | 814 | 0.6235 | 0.6572 | 0.6400 | |
| | 530 | 877 | 814 | 0.6043 | 0.6511 | 0.6268 | |
| Combined (10 spVar are included in Lexicon) | | | | | | | |
| | 529 | 740 | 814 | 0.7149 | 0.6499 | 0.6808 | |
| | 533 | 745 | 814 | 0.7154 | 0.6548 | 0.6838 | |
| | 537 | 745 | 814 | 0.7208 | 0.6597 | 0.6889 | |
| TBD | | | | | | | |
| | 549 | 745 | 814 | 0.7369 | 0.6744 | 0.7043 | |
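The Precision, Recall, and F1 columns follow from the three counts in each row. The sketch below assumes TP is the number of true positives, Ret the number of retrieved items, and Rel the number of relevant items, a reading that reproduces the tabulated values. The function name is illustrative.

```python
def precision_recall_f1(tp: int, ret: int, rel: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from TP, retrieved, and relevant counts."""
    precision = tp / ret
    recall = tp / rel
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# First data row of the table above: TP=535, Ret=858, Rel=814.
p, r, f1 = precision_recall_f1(535, 858, 814)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # prints "0.6235 0.6572 0.6400"
```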
Test-2: Tests on split dictionaries
- Implemented two dictionaries: check and suggestion
- Results show:
  - the check dictionary should be validated element words (Lexicon.ew)
  - the suggestion dictionary should be the one in the testing domain, excluding pn, aa, sv? (more tests on UMLS/MedLine, TBD)
| Dictionary | TP | Ret | Rel | Precision | Recall | F1 | Notes |
|---|---|---|---|---|---|---|---|
| Use Baseline Dictionary for check and suggest | | | | | | | |
| | 546 | 820 | 814 | 0.6659 | 0.6708 | 0.6683 | |
| | 547 | 810 | 814 | 0.6753 | 0.6720 | 0.6736 | Add 10 files for spVars |
| | 548 | 765 | 814 | 0.7163 | 0.6732 | 0.6941 | Check proper nouns from Lexicon |
| | 547 | 804 | 814 | 0.6803 | 0.6720 | 0.6761 | Check Abb/Acr from Lexicon |
| | 548 | 759 | 814 | 0.7220 | 0.6732 | 0.6968 | Check proper nouns/Abb/Acr from Lexicon |
| | 544 | 747 | 814 | 0.7282 | 0.6683 | 0.6970 | Add SpVar from Lexicon |
| | 543 | 746 | 814 | 0.7279 | 0.6671 | 0.6962 | Replace 10 files by Lexicon.spVar |
| | 543 | 749 | 814 | 0.7250 | 0.6671 | 0.6948 | Adding SpVar to the dictionary decreases F1 because it is also used for suggestion (needs a better ranking system) |
| | 543 | 746 | 814 | 0.7279 | 0.6671 | 0.6962 | Add numbers; no change because of the data |
| Implement 2 Dictionaries: Check + Suggest | | | | | | | |
| Find the Check dictionary | | | | | | | |
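The two-dictionary design can be sketched as below: one dictionary decides whether a token is misspelled, and a separate one supplies correction candidates. The word lists and function names are hypothetical, and `difflib.get_close_matches` from the Python standard library stands in for cSpell's own candidate generation and ranking, which this page does not describe.

```python
import difflib

# Hypothetical dictionaries, for illustration only.
CHECK_DIC = {"diabetes", "insulin", "dose"}                # used for detection
SUGGEST_DIC = {"diabetes", "insulin", "dose", "diabetic"}  # used for suggestion


def is_misspelled(token: str) -> bool:
    """Detection uses only the check dictionary."""
    return token.lower() not in CHECK_DIC


def suggest(token: str, n: int = 3) -> list[str]:
    """Candidates come only from the suggestion dictionary."""
    return difflib.get_close_matches(token.lower(), sorted(SUGGEST_DIC),
                                     n=n, cutoff=0.6)


def correct(token: str) -> str:
    if not is_misspelled(token):
        return token
    candidates = suggest(token)
    return candidates[0] if candidates else token
```

Keeping the two lookups separate means the check dictionary can be tuned for detection and the suggestion dictionary for the target domain, without one change affecting the other.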
Test-3: Tests on the suggestion dictionary
- Need a dictionary that focuses on the domain
- The performance is coupled with the candidate and ranking algorithms; an analysis tool is needed for better understanding.
| Dictionary | TP | Ret | Rel | Precision | Recall | F1 | Notes |
|---|---|---|---|---|---|---|---|
| Find the Suggest dictionary | | | | | | | |
