Matcher - SpVar Pattern

I. Introduction

If the n-gram set contains n-grams that match the spelling-variant (spVar) pattern of a term, those n-grams are good candidates for MWEs. The spVar model was developed, tested on LRSPL (recall), tested on the Lexicon (recall and precision), and then applied to MEDLINE.

II. Models
The following models are used to retrieve n-grams that match spVar patterns:

  • AMIA.2016 paper (May, 2016)
    • The distilled MEDLINE n-gram set, WC > 150

    • Group SpVar ByNorm
    • Group by MES: maxEditDist = 2
    • Group by ES: maxEditDist = 1
    • Group by MES: maxEditDist = 3
    • Group by ES: maxEditDist = 2
    • Group by MES: maxEditDist = 4

    • This model works well. However, further development is needed to reduce the processing time and to improve performance (precision and recall on the Lexicon).
  • HealthInf.2017 paper
    • The distilled MEDLINE n-gram set, WC > 150

    • Group SpVar ByNorm
    • Group by M2CES: maxEditDist = 2

    • Lower the WC threshold (see the WC: 100, 50, and 30 runs under step 50)
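
The "Group by ...: maxEditDist = N" passes above can be pictured with a minimal sketch. It assumes that a term joins a group when its Levenshtein edit distance to any existing member is at most maxEditDist. The class name `EditDistanceGrouper` and both method names are hypothetical. Only the edit-distance threshold is modeled; whatever phonetic or sorted-distance checks the MES, ES, and M2CES passes apply are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: group terms whose edit distance is within a limit. */
class EditDistanceGrouper {

    /** Classic Levenshtein distance (insert, delete, substitute each cost 1). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int best = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(best, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumption: a term joins the first group holding a member within maxEditDist. */
    static List<List<String>> group(List<String> terms, int maxEditDist) {
        List<List<String>> groups = new ArrayList<>();
        for (String term : terms) {
            List<String> home = null;
            for (List<String> candidate : groups) {
                for (String member : candidate) {
                    if (editDistance(term, member) <= maxEditDist) {
                        home = candidate;
                        break;
                    }
                }
                if (home != null) {
                    break;
                }
            }
            if (home == null) {
                home = new ArrayList<>();
                groups.add(home);
            }
            home.add(term);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("antitumour", "antitumor", "lookup", "look-up", "aspirin");
        System.out.println(group(terms, 1));
    }
}
```

Every term is compared against the members of all earlier groups, so the cost grows roughly quadratically with the number of terms. That is one plausible reason why lowering the WC threshold lengthens the run so sharply.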

III. Processes

  • directory: ${MULTIWORDS_DIR}/bin
  • Run program: shell> ./08.MatcherSpVar ${YEAR}
  • Processes:

    Columns: Step, Description, Inputs, Outputs, Notes - Examples
    Unit Tests for SpVar software components
    Step 1. Unit Test: SpVar Norm
    SpVarNorm.java
    Inputs: None; Outputs: normStr of sample Str; Notes: Unit test on spVar norm
    • No need to run for candidate list
    Step 2. Unit Test: Metaphone
    Metaphone.java
    Inputs: None; Outputs: metaphone of sample Str; Notes: Unit test on metaphone
    • No need to run for candidate list
    Step 3. Unit Test: Edit Distance
    EditDistance.java
    Inputs: None; Outputs: edit distance of 2 sample Strs; Notes: Unit test on edit distance
    • No need to run for candidate list
    Step 4. Unit Test: Sorted Distance
    SortedDistance.java
    Inputs: None; Outputs: sorted distance between sample Strs in a set of input terms; Notes: Unit test on sorted distance
    • No need to run for candidate list
    Utility Tests for SpVar software components
    Step 10. SpVar on norm|Metaphone|Edit distance ..
    TestSpVarOnNormMpEd.java
    Inputs: None; Outputs: norm|Metaphone|EditDistance; Notes: Show results of two inStrs
    • Not needed for candidate list
    Step 11. GroupSpVarByNorm ..
    GetStdAndSpVarsFromLRSPL.java
    GroupSpVarByNorm.java
    • ./inData/LRSPL
    • LRSPL.std
      SpVar-1|SpVar-2|SpVar-3|...
    • LRSPL.data
      SpVar-1
      SpVar-2
      SpVar-3
      ...

    • ./unitTest/LRSPL.data.1.byNorm.out
    Retrieve LMW from LRSPL by spVarNorm
    • Not needed for candidate list
    Step 12. GroupSpVarByMES ..
    GroupSpVarByMES.java
    • ./unitTest/LRSPL.data.1.byNorm.out
    • ./unitTest/LRSPL.data.2.byMES.2.out
    Retrieve LMW from results of step-11 by MES
    • Not needed for candidate list
    Step 13. GroupSpVarByES ..
    GroupSpVarByES.java
    • ./unitTest/LRSPL.data.2.byMES.2.out
    • ./unitTest/LRSPL.data.3.byES.1.out
    Retrieve LMW from results of step-12 by ES
    • Not needed for candidate list
    Step 14. PrintOutSpVars ..
    PrintOutSpVars.java
    • ./unitTest/LRSPL.data.3.byES.1.out
    Split the results into singles and spVars and print them out
    • ./unitTest/LRSPL.data.3.byES.1.out.std (= single + spVars)
    • ./unitTest/LRSPL.data.3.byES.1.out.notSpVars (single without spVar)
    • ./unitTest/LRSPL.data.3.byES.1.out.spVars (have spVars)
    • Not needed for candidate list
    Step 15. GetNormSpVarsTable ..
    GetNormSpVarsTable.java
    • ./unitTest/LRSPL.data
    • ./unitTest/LRSPL.data.notSpVar
    • ./unitTest/LRSPL.data.spVar
    Get a norm spVar|spVars table from a file (terms or n-grams)
    • Not needed for candidate list
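
The grouping in step 11 and the split in step 14 can be sketched as follows. The norm used here (lowercase, then drop every character that is not a letter or digit) is an assumption for illustration only and is not taken from SpVarNorm.java. The class name `NormGrouper` and its methods are hypothetical. The output lines follow the `SpVar-1|SpVar-2|SpVar-3|...` layout shown for LRSPL.std.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of grouping terms by a normalized key. */
class NormGrouper {

    /** Assumed norm: lowercase and drop every non-alphanumeric character. */
    static String norm(String term) {
        return term.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
    }

    /** Returns one pipe-joined line per group, in first-seen order. */
    static List<String> groupByNorm(List<String> terms) {
        Map<String, List<String>> byKey = new LinkedHashMap<>();
        for (String term : terms) {
            byKey.computeIfAbsent(norm(term), k -> new ArrayList<>()).add(term);
        }
        List<String> lines = new ArrayList<>();
        for (List<String> group : byKey.values()) {
            lines.add(String.join("|", group));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> std = groupByNorm(
                List.of("lookup", "look up", "look-up", "AcG", "ACG", "aspirin"));
        for (String line : std) {
            // A line with a pipe holds spelling variants; any other line is a single.
            String bucket = line.contains("|") ? "spVars" : "notSpVars";
            System.out.println(bucket + "\t" + line);
        }
    }
}
```

A norm this simple catches the space, case, and punctuation differences, but not genitive or British/American differences, which is consistent with the later passes existing at all.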
    Analysis: Test on LRSPL
    Step 20. Convert file format from LRSP ..
    Step 21. Get SpVar from LRSPL - ByNorm ..
    Step 22. Get SpVar from LRSPL - ByMES (ED:2) ..
    Step 23. Get SpVar from LRSPL - ByES (ED:1) ..
    Step 24. Get SpVar from LRSPL - ByMES (ED:3) ..
    Step 25. Get SpVar from LRSPL - ByES (ED:2) ..
    Step 26. Get SpVar from LRSPL - ByMES (ED:4) ..
    Step 27. Print result of step 26 - ByMES (ED:4) ..
    Step 28. Test SpVar matcher on LRSPL - (Steps: 21-26) ..
    Step 29. Analysis: GetSpVarTypeFromLRSPL ..
    GetSpVarTypeFromLRSPL.java
    • ./inData/LRSPL
    • spVars.type
      = GENITIVE + NON_GENITIVE

    • spVars.type.GENITIVE
    • spVars.type.NON_GENITIVE
    • spVars.type.NON_GENITIVE.GENITIVE
    Analyze types of spVars:
      Type                 Examples
      SVT_SPACE            lookup|look up
      SVT_CASE             AcG|ACG
      SVT_PUNC_DASH        lookup|look-up
      SVT_PUNC_PERIOD      AAMD|A.A.M.D.
      SVT_PUNC_OTHERS      anti-HB(s)|antiHBs
      SVT_GENITIVE         Addisons|Addison's
      SVT_GENITIVE_S       Alzheimer|Alzheimer's
      SVT_GENITIVE_P       Alzheimer|Alzheimers'
      SVT_GENITIVE_SS      Alzheimer|Alzheimer'S
      SVT_GENITIVE_PP      Alzheimer|AlzheimerS'
      SVT_NUMBER           3|three
      SVT_RANK             2nd|second
      SVT_SYNONYM          St.|Saint
      SVT_SPVAR            antitumour|antitumor
      SVT_UNICODE          æcidium|aecidium
      SVT_TBD              advertize|advertiseo

    • No need to run for candidate list
    Analysis: Test on Lexicon (inflVars.data) for AMIA paper Table-2
    Step 30. Get inflectional SpVar from Lexicon
    • GetInflSpVarsFromLexicon.java
    Step 31. Get gold standard for SpVars from Lexicon
    • GetGoldStdFromLex.java
    Step 32. Get SpVar from Lex - ByNorm
    • GroupSpVarByNorm.java
    Step 33. Get SpVar from Lex - ByMES (ED:2)
    • GroupSpVarByMES.java
    • ED = 2
    • Time: hr.
    Steps 33A, B, C, D, E. Get SpVar from Lex - ByM2ES, M3ES, C2ES, M2CES, M3CES (ED:2)
    • GroupSpVarByXXXX.java
    • ED = 2
    • Time: ~ 2 hr.
    Step 34. Get SpVar from Lex - ByES (ED:1)
    • GroupSpVarByES.java
    • ED = 1
    • Time: hr.
    Step 35. Get SpVar from Lex - ByMES (ED:3)
    • GroupSpVarByMES.java
    • ED = 3
    • Time: hr.
    Step 36. Get SpVar from Lex - ByES (ED:2)
    • GroupSpVarByES.java
    • ED = 2
    • Time: hr.
    Step 37. Get SpVar from Lex - ByMES (ED:4)
    • GroupSpVarByMES.java
    • ED = 4
    • Time: hr.
    Step 37. Get PRF for above tests (Must complete steps: 31-36)
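
The PRF step can be illustrated with a small sketch. It assumes that PRF means precision, recall, and F1, and that both the gold standard and the matcher output are reduced to sets of `term|variant` pair strings. The class name `PrfScore` and the sample pairs are made up for illustration and are not results from the Lexicon.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: precision, recall, and F1 over sets of spVar pairs. */
class PrfScore {

    /** Returns {precision, recall, f1}; each value is 0.0 when undefined. */
    static double[] prf(Set<String> gold, Set<String> predicted) {
        Set<String> hits = new HashSet<>(predicted);
        hits.retainAll(gold); // true positives: predicted pairs found in the gold standard
        double tp = hits.size();
        double precision = predicted.isEmpty() ? 0.0 : tp / predicted.size();
        double recall = gold.isEmpty() ? 0.0 : tp / gold.size();
        double sum = precision + recall;
        double f1 = (sum == 0.0) ? 0.0 : 2.0 * precision * recall / sum;
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        // Made-up sample pairs, for illustration only.
        Set<String> gold = Set.of(
                "antitumour|antitumor", "lookup|look-up", "AcG|ACG", "3|three");
        Set<String> predicted = Set.of(
                "antitumour|antitumor", "lookup|look-up", "aspirin|asprin");
        double[] score = prf(gold, predicted);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", score[0], score[1], score[2]);
    }
}
```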
    Pre-Process
    Step 40. Group nGrams by core-term
    GroupByCoreTerm.java
    Same as using option 11 of 6.NGramUtil
    • ${NGRAM_DIR}distilledNGram.${YEAR}
    • ./Candidates/distilledNGram.${YEAR}.core
    • ./Candidates/distilledNGram.${YEAR}.detail
    • Group distilled n-gram by core-term
    • Must finish the distilled n-gram set
    Step 41. Get terms from nGramSet (filtered by WC, sorted)
    NGramWcTermFilter.java
    • distilledNGram.${YEAR}.core
    • distilledNGram.${YEAR}.core.${WC}.sort
    • distilledNGram.${YEAR}.core.${WC}.sort.term
    Filter by WC (default 150) and sort
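
The WC filter in step 41 can be sketched as below. The file layout is an assumption: each line is taken to be pipe-delimited with the WC in field 1 and the term in field 2. Keeping lines with WC at or above the threshold and sorting by descending WC are also assumptions. The class name `WcTermFilter` and the sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: filter n-gram lines by word count (WC) and sort them. */
class WcTermFilter {

    /** Assumption: field 1 of a pipe-delimited line is the WC. */
    static int wc(String line) {
        return Integer.parseInt(line.substring(0, line.indexOf('|')).trim());
    }

    /** Keeps lines with WC >= minWc, sorted by WC from high to low. */
    static List<String> filterAndSort(List<String> lines, int minWc) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (wc(line) >= minWc) {
                kept.add(line);
            }
        }
        kept.sort((x, y) -> Integer.compare(wc(y), wc(x)));
        return kept;
    }

    /** Extracts field 2 (the term) from each line. */
    static List<String> terms(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.split("\\|")[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the assumed "WC|term" layout.
        List<String> lines = List.of(
                "1200|heart attack", "90|look up table",
                "150|blood pressure", "300|side effect");
        List<String> sorted = filterAndSort(lines, 150);
        System.out.println(sorted);
        System.out.println(terms(sorted));
    }
}
```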
    Process: Apply spVar on MEDLINE
    Step 50. Apply SpVar on Medline - ByNorm
    • 1 min (AMIA.init, WC: 150)
    • 2 min (HealthInf, WC: 100)
    • 3 min (HealthInf, WC: 50)
    • 5 min (HealthInf, WC: 30)
    • distilledNGram.${YEAR}.core.WC.sort (manually remove lines whose WC is below the threshold)
    • distilledNGram.${YEAR}.core.WC.sort.term (field 2 of the above file)
    ./${YEAR}/outData/08.MatcherSpVar/Medline/
    • medline.1.byNorm.out.150
    • medline.1.byNorm.out.100
    • medline.1.byNorm.out.50
    • medline.1.byNorm.out.30
    Step 51. Apply SpVar on Medline - ByM2CES
    • 2 hr (Amia, WC:150)
    • 23 hr (2015, WC:100)
    • 4 days 18 hr (2015, WC:50)
    • 14 days 8 hr (2015, WC:30)
    • 16 days 23 hr (2016, WC:30)
    ./${YEAR}/outData/08.MatcherSpVar/Medline/
    • medline.2.byM2CES.2.out.150
    • medline.2.byM2CES.2.out.100
    • medline.2.byM2CES.2.out.50
    • medline.2.byM2CES.2.out.30
    Step 51A. Print out SpVarClass results of Step-51 to files
    • 1 min (HealthInf)
    ./${YEAR}/outData/08.MatcherSpVar/Medline/
    • medline.2.byM2CES.2.out.30.notSpVars
    • medline.2.byM2CES.2.out.30.spVars
    • medline.2.byM2CES.2.out.30.std
    Step 52. Apply SpVar on Medline - ByMES (ED:2)
    • 11 hr (AMIA-lexdev1)
    Step 53. Apply SpVar on Medline - ByES (ED:1)
    • 5 days 6 hr (AMIA-lexdev1)
    Step 54. Apply SpVar on Medline - ByMES (ED:3)
    • 30 min. (AMIA-lexdev1)
    Step 55. Apply SpVar on Medline - ByES (ED:2)
    • 5 days 5 hr (AMIA-lexdev1)
    Step 56. Apply SpVar on Medline - ByMES (ED:4)
    • 20 min. (AMIA-lexdev1)
    Results
    Step 57. Apply SpVar on Medline (Step 51 - 56)
    • 11~12 days

    GetSpVarClassByTerm.java
    • distilledNGram.${YEAR}.core.${WC}.sort.term
    Generating the spVar classes (WC >= 150) takes 12 days; these classes are the LMW candidates:
    • norm
    • MES 1: maxEditDist=2
    • ES 1: maxEditDist=1
    • MES 2: maxEditDist=3
    • ES 2: maxEditDist=2
    • MES 3: maxEditDist=4
    • PrintOut results
    Step 58. Print out SpVarClass results to files
    • distilledNGram.${YEAR}.core.${WC}.sort.term.std
    • distilledNGram.${YEAR}.core.${WC}.sort.term.notSpVars
    • distilledNGram.${YEAR}.core.${WC}.sort.term.spVars
    Results
    Process: Generate LMW candidates from SpVar, Cui, Wc
    Step 60. Get LMW candidates from spVar file with CUI
    • GetCanFromSpVarCui.java
    • 90 min.
    • ./Medline/medline.2.byM2CES.2.out.30.spVars
    • ${MED_DIR}/distilledNGram.2015.core.30 (for WC)
    • ${STMT_DIR}/data/Config/smt.properties (for SMT to get CUI)
    • ./Candidates/medline.2.byM2CES.2.out.30.spVars.cui
    • All fields are tokenized into terms.
    • Check if the term has CUI information from SMT
    • Filter out terms without CUI in the spVar class
    • Remove the spVar class if only one term in the class
    • Retrieve WC information
    Step 61. Apply WC to LMW candidates from above spVar-CUI file
    • ApplyWcToSpVarCuiCanFile.java
    • ./Candidates/medline.2.byM2CES.2.out.30.spVars.cui
    • WC_BASE
    • UP_LIMIT
    • DOWN_LIMIT
    • ./Candidates/medline.2.byM2CES.2.out.30.spVars.cui.raw
    • ./*.spVars.cui.raw.WC_BASE.UP_LIMIT.DOWN_LIMIT
    Step 61A. Apply 5 WC sets to LMW candidates
    • ApplyWcToSpVarCuiCanFile.java
    • 15 sec.
    • medline.2.byM2CES.2.out.30.spVars.cui
    • *.raw.100.0.500
    • *.raw.1000.0.500
    • *.raw.10000.0.500
    • *.raw.100000.0.500
    • *.raw.1000000.0.500
    • WC: 1000K, 100K, 10K, 1K, 100
    • UP_LIMIT: 0
    • DOWN_LIMIT: 500
    Step 62. Auto tag spVar-CUI-WC file
    • TagSpVarWcCandidateFile.java
    • Use 62A instead
    Step 62A. Auto tag 5 WC sets of spVar-CUI files
    • TagSpVarWcCandidateFile.java
    • 15 sec.
    • *.raw.100.0.500
    • *.raw.100.50.50.tag.${YEAR}
    • *.raw.100.50.50.tag.${YEAR}.no
    • *.raw.100.50.50.tag.${YEAR}.tbd
    • *.raw.100.50.50.tag.${YEAR}.yes
    • *.raw.100.50.50.tag.${YEAR}.yesNo
    • *.raw.100.50.50.tag.${YEAR}.can (candidate file)
      => Same as tag file except:
      • remove tag if tag is TBD
      • remove the spVar class if all candidates are auto-tagged
    • WC:1000K, 100K, 10K, 1K, 100
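
The CUI filter described under step 60 (drop terms without a CUI, then drop any spVar class left with only one term) can be sketched as follows. The map `cuiByTerm` is a hypothetical stand-in for the SMT lookup, and the CUI values in the example are placeholders, not real identifiers. The sketch assumes the one-term check is applied after the CUI filter. The class name `SpVarCuiFilter` is also hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: keep only spVar classes with two or more CUI-backed terms. */
class SpVarCuiFilter {

    /**
     * Splits a pipe-delimited spVar class into terms, keeps terms that have a
     * CUI in the lookup, and returns empty when fewer than two terms remain.
     */
    static Optional<String> filterClass(String spVarClass, Map<String, String> cuiByTerm) {
        List<String> kept = new ArrayList<>();
        for (String term : spVarClass.split("\\|")) {
            if (cuiByTerm.containsKey(term)) {
                kept.add(term);
            }
        }
        if (kept.size() < 2) {
            return Optional.empty();
        }
        return Optional.of(String.join("|", kept));
    }

    public static void main(String[] args) {
        // Placeholder CUIs; a real run would query SMT instead of a map.
        Map<String, String> cuiByTerm = Map.of(
                "antitumour", "C0000001",
                "antitumor", "C0000001",
                "lookup", "C0000002");
        System.out.println(filterClass("antitumour|antitumor|antitumr", cuiByTerm));
        System.out.println(filterClass("lookup|look-up", cuiByTerm));
    }
}
```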
    Post-Process: TBD
    Step 70. Add word count to SpVar class
    AddWcToSpVarClass.java
    • distilledNGram.${YEAR}.core.${WC}.sort.term.spVars
    • distilledNGram.${YEAR}.core.${WC}.sort
    • distilledNGram.${YEAR}.core.${WC}.sort.term.spVars.wc
    Add WC back to spVar1|spVar2|..
    Step 71. Sort by reversed string ... (requested by Lynn)
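
A sort by reversed string can be sketched as below. The sketch assumes the items being sorted are plain terms and that the order is ordinary string comparison on each term read backwards, which places terms with a shared ending next to each other. The class name `ReversedStringSort` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: order terms by their reversed spelling. */
class ReversedStringSort {

    /** Returns the characters of s in reverse order. */
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Returns a new list sorted by the reversed form of each term. */
    static List<String> sortByReversed(List<String> terms) {
        List<String> out = new ArrayList<>(terms);
        out.sort(Comparator.comparing(ReversedStringSort::reverse));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortByReversed(List.of("antitumor", "tumour", "humor", "colour")));
    }
}
```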