CSpell Dictionary - Medical Terms

I. Introduction

In addition to the terms in the Lexicon, terms are retrieved from UMLS as medical terms for better coverage. The Lexicon and these retrieved medical terms are used together as the dictionary in cSpell.

II. Algorithm for Medical Terms

Medical terms are generated by the following steps:

  • Terms from the latest UMLS:
    • Go through MRCONSO.RRF
    • English String (LAT = ENG)
    • Preferred Term (TS = P, STT = PF, ISPREF = Y)
    • Not Obsolete (not used)
    • Not abbreviation/acronym (not used)
    • Semantic type match (from STDEF: St_abb -> STI, MRSTY.RRF: STI -> CUI)
      Category        Semantic Types
      Problem         acab, anab, bact, cgab, dsyn, inpo, mobd, neop, patf, sosy
      Interventions   diap, lbpr, topp
      Drugs           aapp, antb, clnd, drdd, phsu, vita
                      nsba, strd (removed after 2014AA-)
      Anatomy         bdsy, blor, bpoc, bsoj, gngm, orga, orgf, phsf, tisu
      Population      aggp, famg, podg, popg
    • Only lowercase terms (mixed-case and uppercase terms contain many abbreviations/acronyms and mostly overlap with the lowercase-only terms)

    • Program:

      shell>cd ${PRE_PROCESS}/bin
      2017AB
      3
      35

    • output:
      ${PRE_PROCESS}/data/Umls/${RELEASE_AX}/outData/umlsDicBySt.data.ewLc
      (ew: element word, Lc: lowercase)
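
The filtering criteria above can be sketched in Python as follows. This is only an illustration, not the program run from ${PRE_PROCESS}/bin: the function names are hypothetical, the column positions follow the standard pipe-delimited MRCONSO.RRF and MRSTY.RRF layouts (where the string-type column is named STT), and the caller is assumed to have already mapped semantic type abbreviations to semantic type IDs (for instance, dsyn corresponds to T047).

```python
# Column positions in pipe-delimited MRCONSO.RRF rows.
CUI, LAT, TS, STT, ISPREF, STR = 0, 1, 2, 4, 6, 14
# Column positions in pipe-delimited MRSTY.RRF rows.
MRSTY_CUI, MRSTY_TUI = 0, 1


def cuis_for_types(mrsty_lines, wanted_type_ids):
    """Collect the CUIs whose semantic type ID is one of wanted_type_ids."""
    cuis = set()
    for line in mrsty_lines:
        fields = line.rstrip("\n").split("|")
        if fields[MRSTY_TUI] in wanted_type_ids:
            cuis.add(fields[MRSTY_CUI])
    return cuis


def select_terms(mrconso_lines, wanted_cuis):
    """Return the lowercase-only English preferred strings of the wanted CUIs."""
    terms = set()
    for line in mrconso_lines:
        f = line.rstrip("\n").split("|")
        if (f[LAT] == "ENG"                      # English string
                and f[TS] == "P"                 # preferred term status
                and f[STT] == "PF"               # preferred form
                and f[ISPREF] == "Y"             # preferred atom
                and f[CUI] in wanted_cuis        # semantic type match
                and f[STR] == f[STR].lower()):   # no uppercase letters
            terms.add(f[STR])
    return sorted(terms)
```

In use, the two functions would be chained: open MRSTY.RRF and pass its lines to cuis_for_types, then pass the resulting set together with the lines of MRCONSO.RRF to select_terms.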

  • Terms from the useful legacy data (from Gopher and the problem list). These terms are static and are retrieved from the baseline dictionary (4 files):
    • Input:
      • noCui.data.expo
      • noCui.data.prob
    • Program:

      shell>cd ${PRE_PROCESS}/bin
      2017
      4
      45
      46

    • output:
      ${PRE_PROCESS}/data/Baseline/outData/noCui.ewLc.data

  • Retrieve unigrams from UMLS resources and combine them with the legacy data:
    • Algorithm:
      • Tokenize terms into unigrams
      • coreTerm: remove leading and trailing punctuation
      • Filter out digits/punctuation, numbers, and units/measurements
      • Filter out terms/possessives already in the Lexicon (or a general English dictionary)
      • Apply the customized dictionary (add and remove terms)
    • Program:

      shell>cd ${PRE_PROCESS}/bin
      2017
      5
      53
      54

    • input (${PRE_PROCESS}/data/cSpellDic/${YEAR}/inData/):
      • ewToBeAdded.data
      • ewToBeRemoved.data
      • umlsDicBySt.data.ewLc
      • noCui.ewLc.data
      • lexicon.enEwLc.dic.addRm
    • output (${PRE_PROCESS}/data/cSpellDic/${YEAR}/outData/):
      • Med.l.dic (l: for using lexicon as English dictionary)
      • EngMed.l.dic
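
The unigram algorithm above can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the actual cSpell pre-processing code: the function names, the two regular expressions (what counts as a number or a measurement), the possessive handling, and the order in which the add and remove lists are applied are all guesses made for the example.

```python
import re
import string

# Assumed patterns, for illustration only.
NUMBER_OR_PUNCT = re.compile(r"[\d\W_]+")        # e.g. "12", "1-2", "--"
MEASUREMENT = re.compile(r"\d+(\.\d+)?[a-z%]+")  # e.g. "5mg", "0.5ml"


def core_term(token):
    """Strip punctuation at the leading and ending positions only."""
    return token.strip(string.punctuation)


def strip_possessive(word):
    """Drop a trailing 's so that possessives can be matched to their base."""
    return word[:-2] if word.endswith("'s") else word


def build_med_dic(terms, lexicon, to_add, to_remove):
    """Derive sorted medical unigrams that the lexicon does not already cover."""
    words = set()
    for term in terms:
        for token in term.split():               # tokenize into unigrams
            word = core_term(token)
            if not word:
                continue
            if NUMBER_OR_PUNCT.fullmatch(word) or MEASUREMENT.fullmatch(word):
                continue
            if word in lexicon or strip_possessive(word) in lexicon:
                continue
            words.add(word)
    # Customized dictionary: add first, then remove (assumed order).
    return sorted((words | set(to_add)) - set(to_remove))
```

Here terms would hold the lines of umlsDicBySt.data.ewLc and the legacy file, lexicon the words of lexicon.enEwLc.dic.addRm, and to_add / to_remove the contents of ewToBeAdded.data and ewToBeRemoved.data.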

  • Finalize the words from the Med.l.dic derived above:
    • Get words that are in both Med.l.dic and medline.dic
      56
    • Exclude words that are in the Lexicon to generate the dictionary
      55
      57

    • output (${PRE_PROCESS}/data/cSpellDic/${YEAR}/outData/):
      • Med.cm-l.dic (cm: consumer and medline)
      • EngMed.cm-l.dic (-l: exclude lexicon)
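
The finalization step amounts to a set intersection followed by a set difference. A minimal Python sketch, assuming each dictionary file holds one word per line (the function names are hypothetical):

```python
def read_words(lines):
    """Read one word per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def finalize(med_words, medline_words, lexicon_words):
    """Keep words found in both Med.l.dic and medline.dic, minus Lexicon words."""
    return sorted((med_words & medline_words) - lexicon_words)
```

For example, with med_words = {"stent", "statin", "zzz"}, medline_words = {"stent", "statin"} and lexicon_words = {"stent"}, only "statin" would remain.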

  • Generate LexBuild Candidate list from Med.cm-l.dic:
    • Generate candidate list from Med.cm-l.dic
      60
      Output: ${PreProcess}/data/LexBuild/${YEAR}/outData/cCandidates.data
    • Sort by grouping base and plural forms of element words
      61
      Output: ${PreProcess}/data/LexBuild/${YEAR}/outData/cCandidates.data.gbp
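
One possible reading of "grouping base and plural forms" is sketched below in Python. The source does not say how plurals are detected, so the suffix rules here are a deliberately naive assumption, and the function names are hypothetical. A word is treated as a plural only when stripping a suffix yields another word in the same list.

```python
# Naive plural suffix rules (assumption): (suffix, replacement).
PLURAL_RULES = (("ies", "y"), ("es", ""), ("s", ""))


def base_form(word, vocabulary):
    """Return the base form if one is present in vocabulary, else the word."""
    for suffix, replacement in PLURAL_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return word


def group_base_plural(words):
    """Sort words so each base form is immediately followed by its plurals."""
    vocabulary = set(words)

    def sort_key(word):
        base = base_form(word, vocabulary)
        return (base, word != base, word)  # base first, then its plurals

    return sorted(vocabulary, key=sort_key)
```

A plain alphabetical sort would place "stentor" between "stent" and "stents"; sorting by the key above keeps "stent" and "stents" adjacent.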

III. Test Results

Please see the performance test on dictionaries.