Corpora in CSpell
I. Introduction
The corpus is used to:
- Calculate word frequency scores (word counts) and noisy channel scores.
- Generate word vectors and use them to train word2vec, which generates the IM and OM.
- Generate the n-gram set (not used directly in CSpell).
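
As a rough illustration of the first use, the sketch below counts words in a small corpus and derives a relative frequency from the counts. The tokenization rule, the function names `word_counts` and `relative_frequency`, and the sample sentences are assumptions made for this example; they are not CSpell's actual implementation or scoring formula.

```python
import re
from collections import Counter


def word_counts(lines):
    """Count lowercased word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Assumed tokenization: runs of letters, digits, and apostrophes.
        counts.update(re.findall(r"[a-z0-9']+", line.lower()))
    return counts


def relative_frequency(word, counts):
    """Return count(word) / total tokens, or 0.0 for an empty corpus."""
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0


# Hypothetical two-line corpus, for illustration only.
corpus = [
    "Diabetes is a chronic disease.",
    "Insulin helps control diabetes.",
]
counts = word_counts(corpus)
```

A candidate correction that occurs more often in the corpus would receive a higher relative frequency under this simple scheme; unseen words score 0.0.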
II. Corpora Tested in CSpell
Three corpora were tested for comparison:
| | Baseline | Consumer Health Corpus | Medline N-gram Set |
|---|---|---|---|
| Resources | 7 web sites | 20 (16) web sites | The Medline N-gram Set |
| Statistics | | | |
PS.
- Total words in CSpell Suggesting Dictionary: 566,914
- lexicon.enEwLc.dic.addRm
- customerDic.data
- NRVAR.1.uSort.data
shell> ${PRE_PROCESS}/bin/RunCorpus
shell> ${PRE_PROCESS}/bin/RunPreProc
Set up the dictionary and corpus in CSpell for test471
III. Corpora
- Consumer Health Corpus
- Ensemble Corpus
  - MedlinePlus Medical Encyclopedia
  - MedlinePlus Drug
  - Genetics Home Reference
  - Genetic and Rare Disease frequently asked questions
  - NHLBI Health Topics
  - NINDS Disorders
  - NIH Senior Health
- Corpus from MEDLINE (2017)
  - Use unigrams from the MEDLINE N-gram Set
- Daniel Davis's Crawler (TBD)
IV. Development Tests
- Compare the three corpora above on the word frequency test in CSpell:
- Setup: Non-word, Revised GoldStd, unigram model for word frequency, new rank sorting algorithm
| Model | | Ensemble Corpus | Consumer Health Corpus | Medline.2017 |
|---|---|---|---|---|
| Frequency Only | Frequency - Halil | 438\|770\|774 | 438\|770\|774 | 404\|770\|774 |
| | | 0.5688\|0.5659\|0.5674 | 0.5688\|0.5659\|0.5674 | 0.5247\|0.5220\|0.5233 |
| | Frequency - cSpell-Dev-1 | 536\|769\|774 | 534\|770\|774 | 521\|770\|774 |
| | | 0.6970\|0.6925\|0.6948 | 0.6935\|0.6899\|0.6917 | 0.6766\|0.6731\|0.6749 |
| | Frequency - cSpell-Dev-2 | 536\|769\|774 | 534\|770\|774 | 522\|770\|774 |
| | | 0.6970\|0.6925\|0.6948 | 0.6935\|0.6899\|0.6917 | 0.6779\|0.6744\|0.6762 |
| Combined method | Noisy Channel | 552\|769\|774 | 551\|770\|774 | 523\|770\|774 |
| | | 0.7178\|0.7132\|0.7155 | 0.7156\|0.7119\|0.7137 | 0.6792\|0.6757\|0.6775 |
| | CSpell Combined Orthographic and Frequency | 598\|769\|774 | 598\|769\|774 | 597\|769\|774 |
| | | 0.7776\|0.7726\|0.7751 | 0.7776\|0.7726\|0.7751 | 0.7763\|0.7713\|0.7738 |
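
The source does not label the two triples reported for each model and corpus. The arithmetic is consistent with the first triple being correct results | returned results | gold-standard instances, and the second being precision | recall | F1. The sketch below recomputes the second triple from the first under that assumption; the function name `precision_recall_f1` and its parameter names are chosen for this example only.

```python
def precision_recall_f1(true_positives, returned, relevant):
    """Compute precision, recall, and F1 from raw counts.

    Assumed meaning of the counts (not stated in the source):
    true_positives = correct results, returned = results produced,
    relevant = instances in the gold standard.
    """
    precision = true_positives / returned if returned else 0.0
    recall = true_positives / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Frequency - Halil on the Ensemble Corpus: 438|770|774
p, r, f = precision_recall_f1(438, 770, 774)
print(f"{p:.4f}|{r:.4f}|{f:.4f}")  # prints 0.5688|0.5659|0.5674
```

The same calculation on 598|769|774 gives 0.7776|0.7726|0.7751, matching the reported CSpell combined result.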
V. Notes
- The corpus is used for word2vec. It needs to be large enough (for recall) to cover the words and their frequencies. The current corpus needs to be enhanced for better context ranking.
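
One simple way to gauge whether a corpus is "big enough" is to measure how much of a target word list it covers. The sketch below is a minimal illustration under assumed inputs: the function `coverage`, the `min_count` threshold, and the sample data are hypothetical and are not part of CSpell.

```python
def coverage(vocabulary, counts, min_count=1):
    """Fraction of vocabulary words seen at least min_count times in the corpus."""
    vocab = set(vocabulary)
    if not vocab:
        return 0.0
    covered = sum(1 for word in vocab if counts.get(word, 0) >= min_count)
    return covered / len(vocab)


# Hypothetical corpus counts and dictionary words, for illustration only.
corpus_counts = {"diabetes": 12, "insulin": 5, "glucose": 1}
dictionary_words = ["diabetes", "insulin", "glucose", "metformin"]

print(coverage(dictionary_words, corpus_counts))               # prints 0.75
print(coverage(dictionary_words, corpus_counts, min_count=2))  # prints 0.5
```

Raising `min_count` approximates the requirement that a word appear often enough for its frequency, or its learned vector, to be reliable.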
