Real-word Merge
This page describes the processes for real-word merge detection and correction.
I. Processes
- Detector: RealWordMergeDetector.java
  - Not corrected previously in the CSpell pipeline
  - real-word: valid word (in splitDic)
  - Not an exception: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
- Candidates: MergeCandidates.java
  - mergeNo <= 2 (configurable: CS_CAN_RW_MAX_MERGE_NO)
  - merge with hyphen is false (configurable: CS_CAN_RW_MERGE_WITH_HYPHEN): only merge with space " " (no merge with hyphen "-")
  - context (adjacent tokens) is not an exception (url, email, ...)
  - orgWords (the words before the merge) are not a multiword (not in mwDic)
  - candidate is a valid word (in suggDic), not an abbreviation or acronym (not in aaDic)
  - candidate has a context score (not zero)
  - Word count of candidate >= 15 (configurable: CS_CAN_RW_SPLIT_CAND_MIN_WC)
  - Not a short word merge
    - a short word has a length of less than 3
    - the total number of short words should be less than 2
    - Examples:

      | Input text | Candidate | Notes |
      |---|---|---|
      | me at | meat | invalid candidate: 2 short words (me and at); source: 80.txt and 16734.txt |
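The short-word rule above can be sketched as follows. This is only an illustration, not the code in MergeCandidates.java: the class name `ShortWordMergeCheck`, the method `isAllowed`, and both constants are hypothetical names chosen for this example.

```java
import java.util.List;

/** Hypothetical helper illustrating the "not a short word merge" rule. */
final class ShortWordMergeCheck {
    // Assumed constants mirroring the rule text:
    // a word is "short" when its length is less than 3,
    // and a merge must contain fewer than 2 short words.
    static final int SHORT_WORD_LENGTH = 3;
    static final int MAX_SHORT_WORD_COUNT = 2;

    /** Returns true when the original words are NOT a short-word merge. */
    static boolean isAllowed(List<String> orgWords) {
        long shortWords = orgWords.stream()
                .filter(w -> w.length() < SHORT_WORD_LENGTH)
                .count();
        return shortWords < MAX_SHORT_WORD_COUNT;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(List.of("me", "at")));   // false: 2 short words
        System.out.println(isAllowed(List.of("on", "set")));  // true: only "on" is short
    }
}
```

Under this reading, "me at" is rejected as a candidate for "meat", while "on set" remains eligible for "onset".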
- Ranker: RankRealWordMergeByContext.java
  - Rank merge candidates by context scores
    - context radius = 2 (configurable: CS_RW_MERGE_CONTEXT_RADIUS)
  - Validate the top-ranked candidate: compare the top-ranked candidate to the original token to decide on a correction:
    - orgScore < 0
      - & topScore > 0
      - & topScore < 0 & topScore * RealWord_Merge_Confidence_Factor > orgScore
    - orgScore > 0
      - & topScore * RealWord_Merge_Confidence_Factor > orgScore
    - orgScore = 0
      - No real-word merge correction because there is no word2Vec information on the original word

    where:
    - orgScore: the context score of the original token
    - topScore: the context score of the top candidate
    - RealWord_Merge_Confidence_Factor = 0.60 (configurable: CS_RANKER_RW_MERGE_C_FAC)
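The validation rules for the top-ranked candidate can be written as a single predicate. The sketch below is an assumption-laden illustration, not the code in RankRealWordMergeByContext.java: `RealWordMergeValidator` and `isCorrection` are hypothetical names, and the case topScore = 0 (which the rules do not list, since candidates must have a non-zero context score) is assumed to yield no correction.

```java
/** Hypothetical sketch of the top-candidate validation rules. */
final class RealWordMergeValidator {
    // Default taken from the text (CS_RANKER_RW_MERGE_C_FAC).
    static final double CONFIDENCE_FACTOR = 0.60;

    /** Returns true when the merge candidate should replace the original token. */
    static boolean isCorrection(double orgScore, double topScore, double factor) {
        if (orgScore < 0.0) {
            if (topScore > 0.0) {
                return true;
            }
            if (topScore < 0.0) {
                return topScore * factor > orgScore;
            }
            return false; // assumption: topScore == 0 is not covered by the rules
        }
        if (orgScore > 0.0) {
            return topScore * factor > orgScore;
        }
        return false; // orgScore == 0: no word2Vec information on the original word
    }

    public static void main(String[] args) {
        System.out.println(isCorrection(-0.5, 0.1, CONFIDENCE_FACTOR)); // true
        System.out.println(isCorrection(0.3, 0.4, CONFIDENCE_FACTOR));  // false: 0.24 <= 0.3
        System.out.println(isCorrection(0.0, 0.9, CONFIDENCE_FACTOR));  // false
    }
}
```

When both scores are positive, a factor below 1 requires the candidate to beat the original by a margin (topScore > orgScore / factor). When both are negative, the same factor shrinks the candidate's score toward zero, so it loosens the test.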
- Corrector: MergeCorrector.java
  - Reconstruct the text by updating the whole inTokenList with all mergeObjs
  - Update the process history to real-word-merge
  - The corrector needs to handle contained and overlapping cases among all mergeObjs before the merge operation. This is a quick fix; the best way is to correct right after each merge (TBD). Also, the current merge operation is first come, first served; the sequential order of merges and other spelling corrections might be improved by using frequency or other scoring systems.
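One possible reading of the first-come, first-served handling of contained and overlapping merges is sketched below. Everything here is an assumption for illustration: `MergeSpan` is a hypothetical stand-in for a mergeObj, the token list holds words only (no space tokens), and a later merge that clashes with an accepted one is simply skipped. MergeCorrector.java may resolve these cases differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: apply merges in arrival order, skipping clashes. */
final class MergeOverlapSketch {
    /** Stand-in for a mergeObj: tokens start..end (inclusive) become mergedWord. */
    static final class MergeSpan {
        final int start;
        final int end;
        final String mergedWord;

        MergeSpan(int start, int end, String mergedWord) {
            this.start = start;
            this.end = end;
            this.mergedWord = mergedWord;
        }

        /** True for both overlapping and contained spans. */
        boolean clashesWith(MergeSpan other) {
            return start <= other.end && other.start <= end;
        }
    }

    static List<String> applyMerges(List<String> tokens, List<MergeSpan> merges) {
        // First come, first served: keep a merge only if it clashes with none kept so far.
        List<MergeSpan> accepted = new ArrayList<>();
        for (MergeSpan m : merges) {
            boolean clash = false;
            for (MergeSpan a : accepted) {
                if (m.clashesWith(a)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                accepted.add(m);
            }
        }
        // Rebuild the token list, replacing each accepted span with its merged word.
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            MergeSpan hit = null;
            for (MergeSpan a : accepted) {
                if (a.start == i) {
                    hit = a;
                    break;
                }
            }
            if (hit != null) {
                out.add(hit.mergedWord);
                i = hit.end + 1;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("some", "what", "under", "developed");
        List<MergeSpan> merges = List.of(
                new MergeSpan(0, 1, "somewhat"),
                new MergeSpan(1, 2, "whatunder"),       // clashes with the first merge
                new MergeSpan(2, 3, "underdeveloped"));
        System.out.println(applyMerges(tokens, merges)); // [somewhat, underdeveloped]
    }
}
```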
II. Development Tests
Different real-word merge confidence factors, context radii, and max. merge numbers were tested on the revised gold standard (with real-word errors included) from the training set.
| Function | Confidence Factor | Context Radius | Max. MergeNo | Raw data (TP / retrieved / relevant) | Performance (precision / recall / F1) |
|---|---|---|---|---|---|
| NW (1-to-1, Split, Merge) | N/A | N/A | 2 | 604 / 775 / 964 | 0.7794 / 0.6266 / 0.6947 |
| NW + RW_MERGE | 0.20 | 2 | 2 | 609 / 783 / 964 | 0.7778 / 0.6317 / 0.6972* |
| NW + RW_MERGE | 0.25 | 2 | 2 | 610 / 785 / 964 | 0.7771 / 0.6328 / 0.6975 |
| NW + RW_MERGE | 0.30 | 2 | 2 | 610 / 783 / 964 | 0.7791 / 0.6328 / 0.6983 |
| NW + RW_MERGE | 0.33 | 2 | 2 | 610 / 785 / 964 | 0.7771 / 0.6328 / 0.6975 |
| NW + RW_MERGE | 0.40 | 2 | 2 | 610 / 783 / 964 | 0.7791 / 0.6328 / 0.6983 |
| NW + RW_MERGE | 0.50 | 2 | 2 | 610 / 786 / 964 | 0.7761 / 0.6328 / 0.6971 |
| NW + RW_MERGE | 0.55 | 2 | 2 | 612 / 787 / 964 | 0.7776 / 0.6349 / 0.6990 |
| NW + RW_MERGE | 0.60 | 2 | 2 | 613 / 786 / 964 | 0.7799 / 0.6359 / 0.7006 |
| NW + RW_MERGE Fixed LC on W2V | 0.60 | 2 | 2 | 614 / 788 / 964 | 0.7792 / 0.6369 / 0.7009 |
| NW + RW_MERGE | 0.70 | 2 | 2 | 613 / 790 / 964 | 0.7759 / 0.6359 / 0.6990 |
| NW + RW_MERGE | 0.80 | 2 | 2 | 614 / 791 / 964 | 0.7762 / 0.6369 / 0.6997 |
| NW + RW_MERGE | 0.90 | 2 | 2 | 614 / 792 / 964 | 0.7753 / 0.6369 / 0.6993 |
| NW + RW_MERGE | 1.00 | 2 | 2 | 615 / 794 / 964 | 0.7746 / 0.6380 / 0.6997 |
| NW + RW_MERGE | 0.60 | 1 | 2 | 610 / 783 / 964 | 0.7791 / 0.6328 / 0.6983 |
| NW + RW_MERGE | 0.60 | 2 | 2 | 613 / 786 / 964 | 0.7799 / 0.6359 / 0.7006 |
| NW + RW_MERGE | 0.60 | 3 | 2 | 611 / 784 / 964 | 0.7793 / 0.6338 / 0.6991 |
| NW + RW_MERGE | 0.60 | 4 | 2 | 609 / 783 / 964 | 0.7778 / 0.6317 / 0.6972 |
| NW + RW_MERGE | 0.60 | 5 | 2 | 608 / 782 / 964 | 0.7775 / 0.6307 / 0.6964 |
| NW + RW_MERGE | 0.60 | 6 | 2 | 610 / 784 / 964 | 0.7781 / 0.6328 / 0.6979 |
| NW + RW_MERGE | 0.60 | 7 | 2 | 607 / 779 / 964 | 0.7792 / 0.6297 / 0.6965 |
| NW + RW_MERGE | 0.60 | 8 | 2 | 607 / 778 / 964 | 0.7802 / 0.6297 / 0.6969 |
| NW + RW_MERGE | 0.60 | 9 | 2 | 607 / 779 / 964 | 0.7792 / 0.6297 / 0.6965 |
| NW + RW_MERGE | 0.60 | 10 | 2 | 606 / 778 / 964 | 0.7789 / 0.6286 / 0.6958 |
| NW + RW_MERGE | 0.60 | 2 | 1 | 613 / 786 / 964 | 0.7799 / 0.6359 / 0.7006 |
| NW + RW_MERGE | 0.60 | 2 | 2 | 613 / 786 / 964 | 0.7799 / 0.6359 / 0.7006 |
| NW + RW_MERGE | 0.60 | 2 | 3 | 613 / 786 / 964 | 0.7799 / 0.6359 / 0.7006 |
| NW + RW_MERGE | 0.60 | 2 | 4 | 613 / 786 / 964 | 0.7799 / 0.6359 / 0.7006 |
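The three raw-data figures in each row are consistent with true positives, retrieved (corrections made), and relevant (corrections in the gold standard), and the three performance figures with precision, recall, and F1. That reading is inferred from the arithmetic, not stated on the page. A minimal sketch of the calculation follows, with `PrfSketch` and its methods as hypothetical names:

```java
import java.util.Locale;

/** Hypothetical sketch: derive precision, recall, and F1 from the raw counts. */
final class PrfSketch {
    static double precision(int tp, int retrieved) {
        return (double) tp / retrieved;
    }

    static double recall(int tp, int relevant) {
        return (double) tp / relevant;
    }

    static double f1(double precision, double recall) {
        return 2.0 * precision * recall / (precision + recall);
    }

    /** Formats the three scores the way the table lists them. */
    static String format(int tp, int retrieved, int relevant) {
        double p = precision(tp, retrieved);
        double r = recall(tp, relevant);
        return String.format(Locale.ROOT, "%.4f|%.4f|%.4f", p, r, f1(p, r));
    }

    public static void main(String[] args) {
        // Row with confidence factor 0.60, context radius 2, max. mergeNo 2.
        System.out.println(format(613, 786, 964)); // 0.7799|0.6359|0.7006
    }
}
```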
- A bigger confidence factor increases both [TP] and [FP]. A value of 0.6 seems to reach the best F1.
- A bigger context radius decreases both [TP] and [FP]. A value of 2 seems to reach the best F1. We trained word2vec with a window size of 5, which corresponds to a context radius of 2 (1 token + 2 adjacent tokens on each side). It is best to use the same specification for training and application.
  - If the relevance of the global context in the article is of interest, we suggest using a larger window size in training and the equivalent window in the application.
- The value of max. merge No. does not seem to have much impact on F1. A bigger max. merge No. slows processing. The empirical value of 2 is used as the default.
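The relation between context radius and word2vec window size (window = 2 * radius + 1) can be illustrated with a small sketch. It assumes a single target position in a list of word tokens; `ContextWindowSketch`, `windowSize`, and `context` are hypothetical names, not CSpell APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: context tokens within a given radius of a target token. */
final class ContextWindowSketch {
    /** Window size = the target token plus `radius` tokens on each side. */
    static int windowSize(int radius) {
        return 2 * radius + 1;
    }

    /** Returns up to `radius` tokens on each side of tokens[index], excluding the target. */
    static List<String> context(List<String> tokens, int index, int radius) {
        List<String> out = new ArrayList<>();
        int from = Math.max(0, index - radius);
        int to = Math.min(tokens.size() - 1, index + radius);
        for (int i = from; i <= to; i++) {
            if (i != index) {
                out.add(tokens.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("he", "has", "lifelong", "mild", "asthma");
        System.out.println(windowSize(2));         // 5
        System.out.println(context(tokens, 2, 2)); // [he, has, mild, asthma]
    }
}
```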
III. Observations from Development test set
- [TP] real-word merge:

  | ID | Source | Original Words | Merged Word |
  |---|---|---|---|
  | TP-1 | 1 | on set | onset |
  | TP-2 | 39 | under developed | underdeveloped |
  | TP-3 | 39 | some what | somewhat |
  | TP-4 | 62 | life long | lifelong |
  | TP-5 | 11579 | anti psychotic | antipsychotic |
  | TP-6 | 13645 | non prescription | nonprescription |
  | TP-7 | 13864 | my self | myself |
  | TP-8 | 14296 | some one | someone |
  | TP-9 | 15759 | anti depresants | antidepressants |
  | TP-10 | 16974 | non drug | nondrug |
  | TP-11 | 18766 | some times | sometimes |
  | TP-12 | 12745 | extra corporeal | extracorporeal |

  - TP-9: "depresants" is first corrected to "depressants" by nw_1-to-1, then merged into "antidepressants" by rw_merge (the only merge candidate).
- [FP] real-word merge:

  | ID | Source | Original Words | Merged Word |
  |---|---|---|---|
  | FP-2 | 12261 | a while | awhile |
  | FP-3 | 16481 | me anyt | meant |
  | FP-5 | 18903 | over time | overtime |
  | FP-6 | 12630 | every day | everyday |

  - FP-1 and FP-4 are caused by different annotations between brat ([CONTACT]) and the corpus Word2Vec ([EMAIL]).
  - TBD: Check the Word2Vec scores; a bigger corpus might have better recall to cover these cases.
- [FN] real-word merge:

  | ID | Source | Original Words | Merged Word |
  |---|---|---|---|
  | FN-1 | 24 | some thing | something |
  | FN-2 | 30 | there after | thereafter |
  | FN-3 | 33 | web site | website |
  | FN-4 | 74 | great full | grateful |
  | FN-5 | 74 | use full | useful |
  | FN-6 | 11225 | over read | overread |
  | FN-7 | 11435 | some time | sometime |
  | FN-8 | 11579 | with out | without |
  | FN-9 | 11579 | worth while | worthwhile |
  | FN-10 | 11757 | care taker | caretaker |
  | FN-11 | 12271 | in to | into |
  | FN-12 | 12520 | post menopause | postmenopause |
  | FN-13 | 12646 | what ever | whatever |
  | FN-14 | 12800 | through out | throughout |
  | FN-15 | 13287 | grand child | grandchild |
  | FN-16 | 16823 | after noon | afternoon |
  | FN-17 | 16829 | grand father | grandfather |
  | FN-18 | 19818 | boy friend | boyfriend |

  - FN-4 and FN-5 involve more than a real-word merge correction.
  - TBD: Check the Word2Vec scores; a bigger corpus might have better recall to cover these cases.
