Real-word Split
This page describes the processes for real-word split detection and correction.
I. Processes
- Detector: RealWordDetector.java
  - Not corrected previously in the CSpell pipeline
  - real-word: a valid word (in checkDic)
  - not an exception: digit, punctuation, digit/punctuation, email, URL, empty string, measurement, Aa, or proper noun
  - focus token has word2Vec information
  - focus token has length >= 4 (configurable: CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH)
  - focus token has WC >= 200 (configurable: CS_DETECTOR_RW_SPLIT_WORD_MIN_WC)
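The detector gates above can be sketched as follows. This is a minimal illustration; the class name, method name, and parameters are assumptions, not the actual API of RealWordDetector.java.

```java
// Hedged sketch of the real-word split detector gates; isDetected and its
// parameters are illustrative stand-ins, not CSpell's actual API.
public class RealWordSplitDetector {
    static final int MIN_WORD_LENGTH = 4; // CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH
    static final int MIN_WC = 200;        // CS_DETECTOR_RW_SPLIT_WORD_MIN_WC

    // A focus token is flagged for real-word split detection only if it is a
    // valid word (in checkDic), is not an exception type, has word2Vec
    // information, and passes the length and word-count (WC) thresholds.
    public static boolean isDetected(String token, boolean inCheckDic,
            boolean isException, boolean hasWord2Vec, int wordCount) {
        return inCheckDic
                && !isException
                && hasWord2Vec
                && token.length() >= MIN_WORD_LENGTH
                && wordCount >= MIN_WC;
    }
}
```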
- Candidates: SplitCandidates.java
  - Get splitSet from all possible splits, as in the non-word split correction
  - SplitNo <= 2 (configurable: CS_CAN_RW_MAX_SPLIT_NO)
- A split candidate is accepted if it is a Lexicon multiword
- If it is not a multiword, check whether it is a valid split candidate:
- Check short split words in the split candidate
  - number of short split words <= 2 (configurable: CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO); the total number of short split words must be <= maxShortSplitWordNo (2)
  - length of a short split word <= 3 (configurable: CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH); by default, a short split word is a word with length of 3 or less
  - Heuristic rules are used to avoid splitting into too many invalid short split words. For example:

| Src | Candidate | Notes |
|---|---|---|
| 17942.txt | "something" -> "so me thing" | both "so" and "me" are short split words; two of them means it is not a valid split |
| 16369.txt | "suggestion" -> "suggest i on" | both "i" and "on" are short split words; two of them means it is not a valid split |
| 60.txt | "upon" -> "up on" | |
| 30.txt | "soon" -> "so on" | |
| 12353.txt | "another" -> "a not her", "an other" | "a not her" is an invalid candidate, "an other" is a valid candidate |
| 15721.txt | "anyone" -> "any one" | |

  - Keep: "away" -> "a way", "along" -> "a long", etc.
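The basic count gate of this rule can be sketched as below. The class and method names are illustrative, and the examples above show that additional heuristics are applied on top of this simple count.

```java
import java.util.List;

// Hedged sketch of the short-split-word count gate: a split candidate fails
// when it contains more than maxShortSplitWordNo short words, where a short
// word has length <= shortSplitWordLength. Names are assumptions, not
// CSpell's actual API.
public class ShortSplitWordCheck {
    static final int MAX_SHORT_SPLIT_WORD_NO = 2; // CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO
    static final int SHORT_SPLIT_WORD_LENGTH = 3; // CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH

    public static boolean passesShortWordCount(List<String> splitWords) {
        long shortNo = splitWords.stream()
                .filter(w -> w.length() <= SHORT_SPLIT_WORD_LENGTH)
                .count();
        return shortNo <= MAX_SHORT_SPLIT_WORD_NO;
    }
}
```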
- Check all split words (element words) in the split candidate
  - in splitDic (not pure Aa)
  - has a context score (word2Vec)
  - WC > min. threshold of 200 (configurable: CS_CAN_RW_SPLIT_CAND_MIN_WC)
    - example: ploytension -> poly tension
  - not a unit; examples:

| Src | Candidate | Notes |
|---|---|---|
| 17536.txt | "inversion" -> "in version" | "in" is a unit, short for "inch" |
| 10136.txt | "everyday" -> "every day" | "day" is a unit |

  - not a proper noun; examples:

| Src | Candidate | Notes |
|---|---|---|
| 16661.txt | "human" -> "hu man" | "Hu" is a proper noun |
| 16481.txt | "children" -> "child ren" | "Ren" is a proper noun |
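The splitSet generation step can be sketched as follows: enumerate every way to insert up to maxSplitNo spaces into the focus token. The class and method names are hypothetical, not the actual SplitCandidates.java API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of building the splitSet: all candidates obtained by
// inserting up to maxSplitNo spaces into the token, as in the non-word
// split candidate step. Names are assumptions, not CSpell's API.
public class SplitSetGenerator {
    public static List<String> getSplitSet(String token, int maxSplitNo) {
        List<String> splitSet = new ArrayList<>();
        collect(token, maxSplitNo, "", splitSet);
        return splitSet;
    }

    private static void collect(String rest, int splitsLeft, String prefix,
            List<String> out) {
        // Try every split point in the remaining substring, then recurse on
        // the tail with one fewer split allowed.
        for (int i = 1; i < rest.length() && splitsLeft > 0; i++) {
            String head = prefix + rest.substring(0, i) + " ";
            out.add(head + rest.substring(i)); // candidate with this split applied
            collect(rest.substring(i), splitsLeft - 1, head, out);
        }
    }
}
```

For example, `getSplitSet("upon", 1)` yields the three one-split candidates "u pon", "up on", and "upo n"; the candidate checks above then filter these down.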
- Ranker: RankRealWordSplitByContext.java
  - Rank split candidates by context scores
  - context radius = 2 (configurable: CS_RW_SPLIT_CONTEXT_RADIUS)
- Validate the top-ranked candidate: compare it to the original token to decide whether to correct:
  - orgScore < 0
    - correct if topScore > 0
    - correct if topScore < 0 and topScore * RealWord_Split_Confidence_Factor > orgScore
  - orgScore > 0
    - correct if topScore * RealWord_Split_Confidence_Factor > orgScore
  - orgScore = 0
    - no real-word split correction, because there is no word2Vec information on the original word; this case is filtered out in the detection
  - where:
    - orgScore: the context score of the original token
    - topScore: the context score of the top candidate
    - RealWord_Split_Confidence_Factor = 0.01 (configurable: CS_RANKER_RW_SPLIT_C_FAC)
- TBD: the ranking could be improved if n-gram frequencies were available; frequency combined with context would be a better ranking source for split candidates
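The validation rule above can be condensed into one predicate. This is a minimal sketch; the class and method names are assumptions, not CSpell's API.

```java
// Hedged sketch of the top-candidate validation rule. orgScore and topScore
// are word2Vec context scores of the original token and the top-ranked
// split candidate; the method name is illustrative.
public class RealWordSplitJudge {
    static final double CONFIDENCE_FACTOR = 0.01; // CS_RANKER_RW_SPLIT_C_FAC

    public static boolean acceptTopCandidate(double orgScore, double topScore) {
        if (orgScore < 0.0) {
            // A positive top score always wins over a negative original score;
            // two negative scores are compared through the confidence factor.
            return topScore > 0.0
                    || (topScore < 0.0 && topScore * CONFIDENCE_FACTOR > orgScore);
        }
        if (orgScore > 0.0) {
            return topScore * CONFIDENCE_FACTOR > orgScore;
        }
        // orgScore == 0: no word2Vec information; filtered out in detection.
        return false;
    }
}
```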
- Corrector: ProcRealWordSplit.java
  - FlatMap the split word (OneToOneSplitCorrector.AddOneToOneSplitCorrection)
  - Update process history to real-word-split
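The flatMap step replaces the single focus token in the token sequence with its split words. The sketch below mirrors the spirit of OneToOneSplitCorrector.AddOneToOneSplitCorrection; the class, method name, and signature are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative flatMap of an accepted split correction into the token list:
// the focus token at the given index becomes several split tokens.
public class SplitCorrectionFlatMap {
    public static List<String> applySplit(List<String> tokens, int index,
            String splitStr) {
        List<String> out = new ArrayList<>(tokens.subList(0, index));
        for (String word : splitStr.split(" ")) {
            out.add(word); // one original token expands to its split words
        }
        out.addAll(tokens.subList(index + 1, tokens.size()));
        return out;
    }
}
```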
II. Development Tests
Tested different real-word split configurations (confidence factor, context radius, max. SplitNo) on the revised real-word-included gold standard from the training set with the following setup:
CS_DETECTOR_RW_SPLIT_WORD_MIN_LENGTH=4
CS_CAN_RW_SHORT_SPLIT_WORD_LENGTH=3
CS_CAN_RW_MAX_SHORT_SPLIT_WORD_NO=2
| Function | Confidence Factor | Context Radius | Max. SplitNo | Raw data | Performance (Precision/Recall/F1) |
|---|---|---|---|---|---|
| NW (1-to-1, Split, Merge) | N/A | N/A | 2 | 604/775/964 | 0.7794/0.6266/0.6947 |
| NW + RW_SPLIT | 0.00 | 2 | 5 | 605/789/964 | 0.7668/0.6276/0.6902 |
| NW + RW_SPLIT | 0.01 | 2 | 5 | 605/789/964 | 0.7668/0.6276/0.6902 |
| NW + RW_SPLIT | 0.02 | 2 | 5 | 605/790/964 | 0.7658/0.6276/0.6899 |
| NW + RW_SPLIT | 0.03 | 2 | 5 | 605/790/964 | 0.7658/0.6276/0.6899 |
| NW + RW_SPLIT | 0.05 | 2 | 5 | 605/791/964 | 0.7649/0.6276/0.6895 |
| NW + RW_SPLIT | 0.10 | 2 | 5 | 605/792/964 | 0.7639/0.6276/0.6891 |
| NW + RW_SPLIT | 0.20 | 2 | 5 | 605/792/964 | 0.7639/0.6276/0.6891 |
| NW + RW_SPLIT | 0.40 | 2 | 5 | 605/809/964 | 0.7478/0.6276/0.6825 |
| NW + RW_SPLIT | 0.60 | 2 | 5 | 607/835/964 | 0.7269/0.6297/0.6748 |
| NW + RW_SPLIT | 0.80 | 2 | 5 | 608/875/964 | 0.6949/0.6307/0.6612 |
| NW + RW_SPLIT | 0.01 | 9 | 0 | 604/775/964 | 0.7794/0.6266/0.6947 |
| NW + RW_SPLIT | 0.01 | 9 | 1 | 606/777/964 | 0.7799/0.6286/0.6962 |
| NW + RW_SPLIT | 0.01 | 9 | 2 | 606/777/964 | 0.7799/0.6286/0.6962 |
| NW + RW_SPLIT | 0.01 | 9 | 3 | 606/777/964 | 0.7799/0.6286/0.6962 |
| NW + RW_SPLIT | 0.01 | 9 | 4 | 606/777/964 | 0.7799/0.6286/0.6962 |
| NW + RW_SPLIT | 0.01 | 9 | 5 | 606/777/964 | 0.7799/0.6286/0.6962 |
- A bigger confidence factor increases both [TP] and [FP]; a value of 0.01 seems to reach the best F1.
- A bigger context radius first decreases [TP] and [FP], then increases them; a value of 9 seems to reach the best F1.
  => Real-word split involves understanding the meaning of the text, so the software needs more context for better precision.
- The value of max. SplitNo does not seem to have much impact on F1. Use the empirical value of 2 as the default. It is unlikely that a merger of many words happens to be a real word, so using 2 (instead of a bigger number) saves running time and increases speed.
III. Observations from Training Set
- [TP] real-word split:

| ID | Source | Original Word | Split Words |
|---|---|---|---|
| TP-1 | 10349 | along | a long |
| TP-2 | 10349 | along | a long |
| TP-3 | 13165 | iam | i am |
| TP-4 | 18669 | iam | i am |

  - 10349.txt: "sounding in my ear every time for along time."
  - TP-3 and TP-4 are done in the ND splitter
- [FP] real-word split:

| ID | Source | Original Word | Split Words |
|---|---|---|---|
| FP-1 | 10349 | along | a long |
| FP-2 | 10061 | however | how ever |
| FP-3 | 39 | without | with out |
| FP-4 | 39 | because | be cause |
| FP-5 | 41 | anywhere | any where |

- [FN] real-word split:

| ID | Source | Original Word | Split Words |
|---|---|---|---|
| FN-3 | 13864 | apart | a part |

  - FN-3: The original input text is "... I donate my self to be apart of this study." The word2Vec model needs to be improved with a bigger corpus. This split case is very sensitive to context, as shown below:

| Input | Output | Notes |
|---|---|---|
| apart | apart | |
| apart of | a part of | |
| apart of this | apart of this | |
| apart of this study | apart of this study | |
| apart of this group | a part of this group | Good |
| apart of this process | a part of this process | Good |
| apart of this effect | a part of this effect | Good |
| be apart | be apart | |
| be apart of | be a part of | Good |
| to be apart of | to be apart of | |
| not be apart of | not be a part of | Good |
| weeks apart of | weeks apart of | Good |
| weeks apart of 160 mg | weeks apart of 160 mg | Good |
| distance apart of | distance apart of | Good |
| distance apart of the | distance apart of the | Good |
| apart from | apart from | Good |
