Inclusive Filter: N-Gram with CUIs

I. Introduction

A LexMultiWord (LMW) must have a meaning (concept). If an n-gram has a concept unique identifier (CUI) in the Metathesaurus, it is a good LMW candidate.

II. Procedure
The following procedure finds valid multiwords among n-grams that have a Metathesaurus CUI:

  • Dir: ${MEDLINE_WORDS}/bin
  • Program:

    shell> 09.MatcherCui ${YEAR}

    Step | Description | Inputs | Outputs | Notes
    Get CUI stats on Lexicon
    Step 1: Add CUI to Lexicon (InflVars)
    • Description:
      • shell> cd ${STMT_DIR}/bin
      • smt.AA
      • Time: 1 hr. 10 min.
    • Inputs:
      • ${IN_DIR}inflVars.data.f1
    • Outputs:
      • inflVars.data.cui
    • Notes:
      • Use smt to get CUIs for all terms from inflVars.data
    Step 2: Analyze and get stats of CUI in Lexicon
    • Description:
      • AnalyzeCuiMapping.java
    • Inputs:
      • inflVars.data.cui
    • Outputs:
      • inflVars.data.cui.rpt
    • Notes:
      • Get stats for single words|multiwords with CUIs
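The kind of statistics step 2 reports can be sketched as follows. This is only an illustration, not the code of AnalyzeCuiMapping.java: the record layout `term|CUI` (with an empty CUI field when no concept was found), the class name `CuiStatsSketch`, and the sample CUIs are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

final class CuiStatsSketch {

    /** Returns {singleWords, singleWordsWithCui, multiwords, multiwordsWithCui}. */
    static int[] count(List<String> lines) {
        int[] counts = new int[4];
        for (String line : lines) {
            // Assumed (hypothetical) layout: term|CUI, CUI field empty when none was found.
            String[] fields = line.split("\\|", -1);
            if (fields.length < 2 || fields[0].trim().isEmpty()) {
                continue; // skip malformed records
            }
            // A term with more than one whitespace-delimited token counts as a multiword.
            boolean isMultiword = fields[0].trim().split("\\s+").length > 1;
            int base = isMultiword ? 2 : 0;
            counts[base]++;
            if (!fields[1].trim().isEmpty()) {
                counts[base + 1]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up records; the CUIs are placeholders, not real identifiers.
        List<String> sample = Arrays.asList(
            "heart|C0000001", "heart attack|C0000002", "xyzzy|", "foo bar|");
        int[] c = count(sample);
        System.out.println("single words with CUI: " + c[1] + " of " + c[0]);
        System.out.println("multiwords with CUI:   " + c[3] + " of " + c[2]);
    }
}
```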
    Get CUI stats on MEDLINE n-gram set
    Step 10: Add CUI to nGram
    • Description:
      • shell> cd ${STMT_DIR}/bin
      • smt.AA
      • Time: 16 hr.
    • Inputs:
      • distilledNGram.${YEAR}.core.lc
        => run 06.NGramUtil ${YEAR}
        11
      • distilledNGram.${YEAR}.core.f2
    • Outputs:
      • distilledNGram.${YEAR}.core.cui
    • Notes:
      • Get the term from the distilled nGrams (field 2)
      • Use smt to get CUIs for all terms from nGrams
    Step 11: Filter out nGrams without CUI
    • Description:
      • FilterCuiFromFile.java
    • Inputs:
      • distilledNGram.${YEAR}.core
      • distilledNGram.${YEAR}.core.cui
    • Outputs:
      • distilledNGram.${YEAR}.core.cui.out
    • Notes:
      • Filter out nGrams without CUIs (including 1,2,3 substitutions)
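The core of step 11 can be sketched as a two-pass filter. The sketch below is not FilterCuiFromFile.java: it assumes a hypothetical `term|CUI` layout for the CUI file and a pipe-delimited n-gram record with the term in field 2, and it does not attempt the "1,2,3 substitutions" mentioned in the notes, whose exact behavior the source does not describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class CuiFilterSketch {

    /** Collects the terms that have a non-empty CUI (assumed layout: term|CUI). */
    static Set<String> termsWithCui(List<String> cuiLines) {
        Set<String> terms = new HashSet<>();
        for (String line : cuiLines) {
            String[] f = line.split("\\|", -1);
            if (f.length >= 2 && !f[1].trim().isEmpty()) {
                terms.add(f[0].trim());
            }
        }
        return terms;
    }

    /** Keeps n-gram records whose term (assumed to be field 2) has a CUI. */
    static List<String> keepWithCui(List<String> nGramLines, Set<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String line : nGramLines) {
            String[] f = line.split("\\|", -1);
            if (f.length >= 2 && terms.contains(f[1].trim())) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data; the CUI is a placeholder.
        Set<String> terms = termsWithCui(Arrays.asList("heart attack|C0000001", "of the|"));
        List<String> nGrams = Arrays.asList("120|heart attack", "900|of the");
        keepWithCui(nGrams, terms).forEach(System.out::println);
    }
}
```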
    Step 12: Tag results of step 11
    • Description:
      • TagCuiTerm.java
    • Inputs:
      • distilledNGram.${YEAR}.core.cui.out
      • Max. WC (2000000)
      • Min. WC (0)
      • inflVars.data.${YEAR}
      • inflVars.data.current
      • notMwFromCuiTerm.data.${YEAR}
      • notMwFromCuiTerm.data.current
    • Outputs:
      • distilledNGram.${YEAR}.core.lc.cui.out.stats (the stats between init year and current)
      • distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tag.${MIN_WC}-${MAX_WC}
      • distilledNGram.${YEAR}.core.lc.cui.out.${YEAR}.tbd.${MIN_WC}-${MAX_WC}
      • distilledNGram.${YEAR}.core.lc.cui.out.current.tag.${MIN_WC}-${MAX_WC}
      • distilledNGram.${YEAR}.core.lc.cui.out.current.tbd.${MIN_WC}-${MAX_WC}
    • Notes: Tag and calculate precision:
      • Send distilledNGram.${YEAR}.core.cui.out.current.tbd.${MIN_WC}-${MAX_WC} to a linguist:
        • Tag yes|no|exp
        • Add valid MWs to the Lexicon
      • Update files from the "yes|no" tag results:
        • Update inflVars.data.current from the Lexicon
        • Update notMwFromCuiTerm.data.current from the no-tags
      • Rerun this step until current.tbd has 0 entries
      • Check precision
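The precision check at the end of step 12 can be illustrated with a small calculation over the linguist's tags. The formula used here, yes / (yes + no), is an assumption: the source does not say how `exp` tags are counted, so this sketch simply leaves them out. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class TagPrecisionSketch {

    /**
     * Precision over tagged terms, taken as yes / (yes + no).
     * Tags other than "yes" and "no" (for example "exp") are ignored, which is an assumption.
     * Returns 0.0 when no term has been tagged yes or no.
     */
    static double precision(List<String> tags) {
        int yes = 0;
        int no = 0;
        for (String tag : tags) {
            if ("yes".equals(tag)) {
                yes++;
            } else if ("no".equals(tag)) {
                no++;
            }
        }
        int judged = yes + no;
        return judged == 0 ? 0.0 : (double) yes / judged;
    }

    public static void main(String[] args) {
        // Made-up tag sequence for illustration.
        List<String> tags = Arrays.asList("yes", "yes", "no", "exp");
        System.out.printf("precision = %.3f%n", precision(tags));
    }
}
```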
    Step 20: Apply Matcher-Cui on nGram TBD
    • distilledNGram.${YEAR}.core
    Generate multiword candidates from n-gram set
    Step 30: Proc: Apply filter of Lexicon on distilled nGram (core)
    • Inputs:
      • distilledNGram.${YEAR}.core
      • ${IN_DIR}/inflVars.data
    • Outputs:
      • 30.disNGram.Core.lexicon.out
    • Notes:
      • Use the core-term of the n-gram
    Step 31: PreProc: Get unique English strings from UMLS - MRCONSO.RRF.ENG
    • Inputs:
      • MRCONSO.RRF.ENG
        => link to MRCONSO.RRF.ENG.${YEAR}AB
    • Outputs:
      • umlsStr.data
    • Notes:
      • Preprocess to get English UMLS strings for step 32
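The preprocessing in step 31 can be sketched as below. MRCONSO.RRF is a pipe-delimited file in which the language (LAT) is the second field and the string (STR) is the fifteenth; the sketch relies on that layout. Everything else is an assumption for illustration: the class name, the choice to keep strings in first-seen order, and the sample rows, which are synthetic and do not come from a real release.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class UmlsEnglishStringsSketch {

    private static final int LAT = 1;  // language of term
    private static final int STR = 14; // string

    /** Returns the unique STR values of rows whose LAT is ENG, in first-seen order. */
    static Set<String> englishStrings(List<String> rows) {
        Set<String> strings = new LinkedHashSet<>();
        for (String row : rows) {
            String[] f = row.split("\\|", -1);
            if (f.length > STR && "ENG".equals(f[LAT]) && !f[STR].isEmpty()) {
                strings.add(f[STR]);
            }
        }
        return strings;
    }

    public static void main(String[] args) {
        // Synthetic rows in MRCONSO.RRF column order; identifiers are placeholders.
        List<String> rows = Arrays.asList(
            "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||SRC|PT|X1|Heart attack|0|N||",
            "C0000001|ENG|S|L0000001|PF|S0000001|Y|A0000002||||SRC|SY|X1|Heart attack|0|N||",
            "C0000001|SPA|P|L0000002|PF|S0000002|Y|A0000003||||SRC|PT|X1|Infarto|0|N||");
        englishStrings(rows).forEach(System.out::println);
    }
}
```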
    Step 32: Proc: Apply matcher of UMLS-Str on distilled nGram (must run step 31 first)
    • Inputs:
      • 30.disNGram.Core.lexicon.out
      • umlsStr.data
    • Outputs:
      • 32.disNGram.Core.umlsStr.out
    • Notes:
      • A simple hash table lookup to match n-grams to UMLS strings
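The hash table lookup described in step 32 might look roughly like this. The sketch assumes that matching n-grams are kept and that the comparison is case-insensitive; neither detail is stated in the source, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class UmlsStrMatcherSketch {

    /** Keeps the n-grams that equal some UMLS string, ignoring case (an assumption). */
    static List<String> match(List<String> nGrams, Collection<String> umlsStrings) {
        // Build the lookup table once; each membership test is then O(1) on average.
        Set<String> lookup = new HashSet<>();
        for (String s : umlsStrings) {
            lookup.add(s.toLowerCase(Locale.ROOT));
        }
        List<String> matched = new ArrayList<>();
        for (String nGram : nGrams) {
            if (lookup.contains(nGram.toLowerCase(Locale.ROOT))) {
                matched.add(nGram);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<String> nGrams = Arrays.asList("heart attack", "in the study", "stem cell");
        List<String> umls = Arrays.asList("Heart attack", "Stem cell");
        match(nGrams, umls).forEach(System.out::println);
    }
}
```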
    Step 33: Proc: Apply matcher of Multiword on nGram (core)
    • Inputs:
      • 32.disNGram.Core.umlsStr.out
    • Outputs:
      • 33.disNGram.Core.multiword.out
    • Notes:
      • Remove single words
    Step 34: Proc: Apply matcher of EndWord (top 33) on nGram (core)
    • Inputs:
      • 33.disNGram.Core.multiword.out
      • endWords.top33.data
        => Must run 10.MatchEndWord first, option 1, to get the top endword list
        => cp endWords.top${NN}.data
        => link endWords.top.data
    • Outputs:
      • 34.disNGram.Core.endword.out
    • Notes:
      • Use the top endWords for the matcher
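One plausible reading of the EndWord matcher in step 34 is sketched below: an n-gram passes when its last whitespace-delimited token appears in the top endword list. This interpretation, the class name, and the sample endwords are assumptions; the actual list comes from 10.MatchEndWord.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EndWordMatcherSketch {

    /** Returns the last whitespace-delimited token of the n-gram ("" for a blank input). */
    static String endWord(String nGram) {
        String[] tokens = nGram.trim().split("\\s+");
        return tokens[tokens.length - 1];
    }

    /** Keeps n-grams whose end word is in the given endword set. */
    static List<String> match(List<String> nGrams, Set<String> topEndWords) {
        List<String> matched = new ArrayList<>();
        for (String nGram : nGrams) {
            if (topEndWords.contains(endWord(nGram))) {
                matched.add(nGram);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Made-up endwords; the real list is produced by 10.MatchEndWord.
        Set<String> topEndWords = new HashSet<>(Arrays.asList("syndrome", "disease"));
        List<String> nGrams = Arrays.asList("down syndrome", "heart disease", "heart rate");
        match(nGrams, topEndWords).forEach(System.out::println);
    }
}
```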
    Step 35: Proc: Apply filter of previous candidates
    • Inputs:
      • 34.disNGram.Core.endword.out
      • prevCandidateList.data
        • put all 36.disNGram.Core.endword.new.gsp.${PREV_YEARS} together
    • Outputs:
      • 35.disNGram.Core.endword.new.out
    • Notes:
      • Filter out candidates that are in the previous years' lists
    Step 36: Post-Proc: Rearrange the candidate list by grouping singulars/plurals
    • Inputs:
      • 35.disNGram.Core.endword.new.out
    • Outputs:
      • 36.disNGram.Core.endword.new.out.gsp
        => This file is used for the annual candidate list
    • Notes:
      • Re-sort and group the list to put singular and plural forms together
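The regrouping in step 36 can be illustrated with a sort on a normalized key. The sketch uses a deliberately naive rule, treating a trailing "s" (but not "ss") as a plural marker, which is an assumption made for the example and not the rule used by the actual program; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SingularPluralGroupSketch {

    /** Naive grouping key: drops one trailing "s" unless the term ends in "ss". */
    static String key(String term) {
        if (term.length() > 1 && term.endsWith("s") && !term.endsWith("ss")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    /** Sorts by key, then by the term itself, so a singular sits next to its plural. */
    static List<String> group(List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> {
            int byKey = key(a).compareTo(key(b));
            return byKey != 0 ? byKey : a.compareTo(b);
        });
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up candidates; a plain alphabetical sort would put
        // "stem cell line" between "stem cell" and "stem cells".
        List<String> candidates = Arrays.asList("stem cells", "stem cell line", "stem cell");
        group(candidates).forEach(System.out::println);
    }
}
```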
    Future Usage
    Step 40: PreProc: Get nGram spVar from the result of 8.MatcherSpVar
    • Inputs:
      • medline.2.byM2CES.2.out.30.spVars.2016
    • Outputs:
      • nGramSpVars.data
    • Notes:
      • Get the n-grams that match spVar patterns
    Step 41: Proc: Apply matcher of nGram-SpVar on nGram
    • Inputs:
      • 34.disNGram.Core.endword.out
      • nGramSpVars.data
    • Outputs:
      • 36.disNGram.Core.spVar
    • Notes:
      • Loses recall; not used for now