Corpora in CSpell

I. Introduction

The corpus is used to:

  • Calculate word frequency scores (word counts) and noisy channel scores.
  • Generate word vectors: the corpus is used to train word2vec, which produces the input matrix (IM) and output matrix (OM).
  • Generate the n-gram set (not used directly in CSpell).
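
As a rough illustration of the first and last uses above, the sketch below counts unigrams and n-grams over a tiny in-memory corpus. The tokenizer, the function names, and the relative-frequency score are assumptions made for this example only; the scoring formula CSpell actually uses is not described in this section.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberate simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def word_counts(sentences):
    """Unigram word count (WC) over all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def ngram_counts(sentences, n):
    """Count n-grams without crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def frequency_score(word, counts):
    """Relative frequency of a word; a stand-in, not CSpell's own score."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Hypothetical three-sentence corpus, for illustration only.
corpus = [
    "The patient has diabetes.",
    "Diabetes is a chronic disease.",
    "The disease is chronic.",
]
wc = word_counts(corpus)
bigrams = ngram_counts(corpus, 2)
```

A word that never occurs in the corpus gets a score of 0.0 here, which is one reason corpus size matters for recall.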

II. Corpora Tested in CSpell

Three corpora were tested for comparison:

                        Baseline              Consumer Health Corpus  Medline N-gram Set
Resources               7 web sites           20 (16) web sites       The Medline N-gram Set
Statistics
  Articles              8,590                 17,139                  26,759,399
  Sentences             -                     550,193                 163,021,640
  Tokens                5,771,363             10,228,699              3,386,661,350
  Unique Word           51,190                192,818                 976,872 (WC > 30)
  Unique CoreTerm.Lc    -                     109,175                 496,388
  Dic Words in Corpus   35,781|6.3115%        48,690|8.5886%          214,581|37.8507%
  Dic Words WC          5,637,436|98.4708%    9,979,195|97.6123%      3,224,585,163|97.1439%

PS.

  • Total words in CSpell Suggesting Dictionary: 566,914
    • lexicon.enEwLc.dic.addRm
    • customerDic.data
    • NRVAR.1.uSort.data
  • shell> ${PRE_PROCESS}/bin/RunCorpus
  • shell> ${PRE_PROCESS}/bin/RunPreProc
    Set up the dictionary and corpus in CSpell for testing
    4
    71
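
The "Dic Words in Corpus" percentages in Section II are consistent with dividing the number of dictionary words found in each corpus by the total dictionary size given above (566,914). The following minimal sketch reproduces that arithmetic; the function name is hypothetical and the interpretation is inferred from the numbers, not stated by the source.

```python
DICTIONARY_SIZE = 566_914  # total words in the CSpell suggesting dictionary

# Dictionary words found in each corpus, taken from the Section II statistics.
DIC_WORDS_IN_CORPUS = {
    "Baseline": 35_781,
    "Consumer Health Corpus": 48_690,
    "Medline N-gram Set": 214_581,
}


def dictionary_coverage(dic_words, dictionary_size=DICTIONARY_SIZE):
    """Percentage of dictionary words that occur in the corpus (assumed meaning)."""
    return 100.0 * dic_words / dictionary_size


for name, dic_words in DIC_WORDS_IN_CORPUS.items():
    print(f"{name}: {dic_words:,}|{dictionary_coverage(dic_words):.4f}%")
```

The "Dic Words WC" percentages do not follow from the token counts in the same direct way, so they are not reproduced here.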

III. Corpora

  • Consumer Health Corpus
  • Ensemble Corpus
    • MedlinePlus Medical Encyclopedia
    • MedlinePlus Drug
    • Genetics Home Reference
    • Genetic and Rare Disease frequently asked questions
    • NHLBI Health Topics
    • NINDS Disorders
    • NIH Senior Health
  • Corpus from MEDLINE (2017)

  • Daniel Davis's Crawler (TBD)

IV. Development Tests

  • Compare the above three corpora in the word frequency test on CSpell:
    • Setup: Non-word, Revised GoldStd, unigram model for word frequency, new rank sorting algorithm

      Model                         Ensemble Corpus         Consumer Health Corpus  Medline.2017

      Frequency Only
      Frequency - Halil             438|770|774             438|770|774             404|770|774
                                    0.5688|0.5659|0.5674    0.5688|0.5659|0.5674    0.5247|0.5220|0.5233
      Frequency - cSpell-Dev-1      536|769|774             534|770|774             521|770|774
                                    0.6970|0.6925|0.6948    0.6935|0.6899|0.6917    0.6766|0.6731|0.6749
      Frequency - cSpell-Dev-2      536|769|774             534|770|774             522|770|774
                                    0.6970|0.6925|0.6948    0.6935|0.6899|0.6917    0.6779|0.6744|0.6762

      Combined method
      Noisy Channel                 552|769|774             551|770|774             523|770|774
                                    0.7178|0.7132|0.7155    0.7156|0.7119|0.7137    0.6792|0.6757|0.6775
      CSpell Combined               598|769|774             598|769|774             597|769|774
      Orthographic and Frequency    0.7776|0.7726|0.7751    0.7763|0.7713|0.7738    0.7763|0.7713|0.7738
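
The source does not label the two triples in each result cell, but the numbers are consistent with the first triple being correct|returned|gold counts and the second being precision|recall|F1. The sketch below reproduces the second triple from the first under that assumption; the function names are hypothetical.

```python
def precision_recall_f1(correct, returned, gold):
    """Compute precision, recall and F1 from raw counts.

    Assumed meaning of the counts (inferred from the table, not stated there):
      correct  - corrections that match the gold standard
      returned - corrections the system produced
      gold     - corrections in the gold standard
    """
    precision = correct / returned if returned else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def format_cell(correct, returned, gold):
    """Render the scores in the same p|r|f layout as the table cells."""
    p, r, f = precision_recall_f1(correct, returned, gold)
    return f"{p:.4f}|{r:.4f}|{f:.4f}"


# "Frequency - Halil" on the Ensemble Corpus and the combined method on Medline.2017
print(format_cell(438, 770, 774))
print(format_cell(597, 769, 774))
```

Read this way, the differences between corpora come almost entirely from the first count, since the returned and gold totals barely change across rows.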

V. Notes

  • The corpus is used for word2vec. It needs to be big enough (for recall) to cover the words and their frequencies. The current corpus needs to be enhanced for better context ranking.