Matcher: Parenthetic Acronyms (ACR)

I. Introduction

Acronym expansions are good multiword candidates. For example, "Wolf Motor Function Test" is the expansion of the acronym "WMFT" and is also a valid multiword in the Lexicon (LexMultiWord, LMW). On the other hand, some acronym expansions are not LMWs; for example, "GLP|good laboratory practice" and "SOC|sense of coherence" are not valid LMWs. More than 7,000 valid expansions of acronyms and abbreviations in the Lexicon are invalid LMWs.

If an n-gram contains a parenthetic acronym pattern (ACR), the term before the left parenthesis can be the expansion of the acronym. The parenthetic acronym pattern is an uppercase term enclosed in parentheses, written here as (ACR). This pattern is used in exclusive filters to remove invalid terms. The expansion can also be used to generate multiword candidates.
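As a rough illustration of the pattern described above, the sketch below looks for an uppercase term in parentheses and returns the text before the left parenthesis as a possible expansion. The regular expression, the two-letter minimum, the optional plural "s", and the name find_par_acr are assumptions made for this example; they are not taken from the actual filter.

```python
import re

# Assumed pattern: two or more uppercase letters in parentheses, optionally
# followed by a plural "s", e.g. "(WMFT)" or "(RCTs)".
PAR_ACR = re.compile(r"\(([A-Z]{2,})s?\)")


def find_par_acr(ngram):
    """Return (text_before_parenthesis, acronym), or None if no (ACR) pattern."""
    match = PAR_ACR.search(ngram)
    if match is None:
        return None
    before = ngram[:match.start()].strip()
    return before, match.group(1)
```

An n-gram for which find_par_acr returns None has no (ACR) pattern under this assumed definition; the others yield a candidate expansion and its acronym.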

II. Procedure
The following procedure is used to find valid multiwords from n-grams by parenthetic acronym pattern:

  • Dir: ${MEDLINE_WRODS}/bin
  • Program:

    shell> 07.MatcherParAcr ${YEAR}

    Step | Description | Inputs | Outputs | Notes
    Get Candidates from acronym expansions
    1. Get n-grams that match the pattern [acronym expansion (ACR)]
    • GetParAcrFromNGram.java
    • nGram.${YEAR}
    • n-gram.${YEAR}.parAcr.rpt
    • n-gram.${YEAR}.parAcr.exp
    • n-gram.${YEAR}.parAcr.pass
    • n-gram.${YEAR}.parAcr.trap
      => matches (ACR)
    • Uses the raw nGram Set (not distilled)
    • Uses FilterParentheticAcronym to trap terms with (ACR) pattern from n-gram set
    2. Get acronym|acronym expansion from n-grams with the (ACR) pattern and identify illegal acronym expansions by:
    • Initial character of first and last word of expansion matches the first and last character of acronym
    • Word size of acronym expansion > 1
    • Word size of acronym expansion matches number of characters of acronym? (TBD)
    • GetAcronymFromParAcrFile.java
    • n-gram.${YEAR}.parAcr.trap
    • acronymExp.pattern.acr
      => lines match acr
    • acronymExp.pattern.notAcr
      => lines not match acr
    • acronymExp.pattern
      => acronym|acronym expansion, unified
    • find the (ACR) or (ACRs) from n-grams
    • get the ACR
    • get singular form ACR from plural acronym (ACRs)
    • get the expansion
    • check if valid acronym:
      • first and last initials of expansion match acronym (ignore case)
      • the word no. of expansion > 1 (must be multiword)
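The validity checks listed in this step can be sketched as follows. This is a minimal illustration that assumes the acronym and its expansion have already been separated; the function names are hypothetical and the real GetAcronymFromParAcrFile.java may handle more cases.

```python
def singular_acronym(acronym):
    """Strip a trailing plural "s" from an acronym such as "RCTs"."""
    if acronym.endswith("s") and acronym[:-1].isupper():
        return acronym[:-1]
    return acronym


def is_valid_expansion(acronym, expansion):
    """Apply the two checks listed above, ignoring case."""
    acr = singular_acronym(acronym).lower()
    words = expansion.split()
    # The expansion must be a multiword and the acronym must not be empty.
    if not acr or not len(words) > 1:
        return False
    first_ok = words[0][0].lower() == acr[0]
    last_ok = words[-1][0].lower() == acr[-1]
    return first_ok and last_ok
```

Passing these checks only makes an expansion a candidate. For example, "good laboratory practice" passes for "GLP" even though it is not a valid LMW, so tagging is still required.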
    3. Filter out (exclude) sub-terms of expansions
    • ExcludeSubTermExpFromParAcrFil.java
    • acronymExp.pattern
    • acronymExp.subterm.raw
      => candidates (ACR expansion)
    • acronymExp.subterm.pass
      => candidates (ACR|ACR expansion)
    • acronymExp.subterm.trap
      => subterm of others, invalid
    • if an expansion is a sub-term of another expansion with the same acronym, it is not a valid expansion and should be excluded
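The sub-term rule in this step can be sketched as below. The sketch assumes that "sub-term" means a shorter, contiguous, case-insensitive word sequence inside a longer expansion of the same acronym; that reading, the function names, and the sample pairs in the usage note are assumptions for illustration only.

```python
from collections import defaultdict


def contains_words(long_words, short_words):
    """True if short_words occurs as a contiguous run inside a longer long_words."""
    n = len(short_words)
    if not len(long_words) > n:
        return False
    return any(long_words[i:i + n] == short_words
               for i in range(len(long_words) - n + 1))


def split_subterms(pairs):
    """Split (acronym, expansion) pairs into (passed, trapped) lists."""
    by_acronym = defaultdict(set)
    for acronym, expansion in pairs:
        by_acronym[acronym].add(expansion)
    passed, trapped = [], []
    for acronym in sorted(by_acronym):
        expansions = sorted(by_acronym[acronym])
        for exp in expansions:
            words = exp.lower().split()
            is_sub = any(contains_words(other.lower().split(), words)
                         for other in expansions if other != exp)
            (trapped if is_sub else passed).append((acronym, exp))
    return passed, trapped
```

With the made-up pairs ("CT", "computed tomography") and ("CT", "cardiac computed tomography"), the shorter expansion is trapped because it is contained in the longer one for the same acronym.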
    Pre-process: core-term, valid/invalid files
    4. Get core-terms for the candidates input file
    • acronymExp.subterm.raw
    • acronymExp.subterm.raw.core
    • Get lowercase core-term of candidates
    5. Prepare valid/invalid MW files for auto-tagging. All known valid and invalid MWs should be updated:
    • Initial data (${MULTIWORDS}/data/current/inData/), used to calculate precision and recall
      • valid:
        • inflVars.data.${YEAR}
          => known valid MW in Lexicon
      • invalid (invalidMwForParAcr.data.${PREV_YEAR}):
        • ${LEX_CHECK}/data/Files/notBaseForm.data.${PREV_YEAR}
          => Known invalid ACR/ABB expansion from LexCheck
        • invalidMwFromParAcrTag.data.${PREV_YEAR}
          => invalid tag from (ACR) pattern from previous tagging result
          => none if it is the first time (no 2013 data for 2014)
          => Need to be updated in the previous year post-process by running through step 7-9

    • Final data (used to calculate precision and recall)
      • valid:
        • inflVars.data.final
          => known valid MWs, a snapshot from the latest Lexicon. By default, it is automatically linked to the auto-generated file (created at 4:00 AM) at ${BACKUP}/Routine.lexBuild/Lexico/${YEAR}/InflVars. If a more recent version is needed, run ${LB_DIR}/Tools/LoadDb/GenScript (2).
      • invalid (invalidMwForParAcr.data.final):
        • ${LEX_CHECK}/data/Files/notBaseForm.data.${YEAR}
          => Known invalid ACR/ABB expansion from LexCheck
        • invalidMwFromParAcrTag.data.${YEAR}
          => invalid tag from (ACR) pattern from current tagging result
          => need to complete all TBD n-grams
          => Need to be updated by running through step 7-9
        • ln -sf ./invalidMwForParAcr.data.${YEAR} invalidMwForParAcr.data.final
    • invalidMwForParAcr.data.${PREV_YEAR}
    • invalidMwForParAcr.data.${YEAR}
    • invalidMwForParAcr.data.final
    This file should be updated according to annual data.
    Process: Generate Candidates and Stats
    6. Generate LMW candidate list for linguists:
    • candidates: acronymExp.subterm.raw.core

    • valid MW (${MULTIWORDS}/data/current/inData/):
      • inflVars.data.${INIT_YEAR}
        => initial Lexicon year, default: 2014
      • inflVars.data.current
        => final (current): generate the latest version from LexBuild

    • invalid MW (${MULTIWORDS}/data/current/inData/):
      • invalidMwForParAcr.data.${PREV_YEAR}
        => initial: known invalid MW for ACR expansion
        • notBaseForm.data.${PREV_YEAR}.f1
          => initial: not base form from LexCheck (only field 1)
        • invalidMwFromParAcrTag.data.${PREV_YEAR}
          => initial: invalid MW form from ParAcr Tag (previous year)
      • invalidMwForParAcr.data.${YEAR}
        => Final(current): known invalid MW for ACR expansion
        • notBaseForm.data.${YEAR}.f1
          => Final: not base form from LexCheck (only field 1)
        • invalidMwFromParAcrTag.data.${YEAR}
          => Final: invalid MW form from ParAcr Tag (current latest)
    • Tag results from initial Lexicon ${YEAR}
      • acronymExp.tag.data.${YEAR}
      • acronymExp.tag.data.${YEAR}.yes
      • acronymExp.tag.data.${YEAR}.no
      • acronymExp.tag.data.${YEAR}.tbd

    • Tag results from final Lexicon (current)
      • acronymExp.tag.data.final
      • acronymExp.tag.data.final.yes
      • acronymExp.tag.data.final.no
      • acronymExp.tag.data.final.tbd
        => shell> cp -r acronymExp.tag.data.final.tbd acronymExp.tag.data.final.tbd.${YEAR}
        => Candidate list, sent to linguists for tagging

        => Make sure to complete the previous year's ACR/ABB candidate list before generating this one.

    • Tag results of the new candidates (init-TBD)
      • acronymExp.tag.data.tag.new.yesNo (with yes/no tag for further analysis)
      • acronymExp.tag.data.tag.new.yes
      • acronymExp.tag.data.tag.new.no
      • acronymExp.tag.data.tag.new.tbd

    • acronymExp.tag.data.stats
      => summary: stats and precision

    Algorithm:

    • Initial tag: tag candidates based on the data of ${YEAR}
    • Final tag: tag candidates based on the data of final
    • New candidates: TBD from the initial tag
      • if valid
        => added to Lexicon
      • if no
        => from LexCheck (notBaseForm.data) or tag results (invalidMwFromParAcrTag.data, continuously updates)
        => tagged results are at ${OUT_DATA}/Tagged
      • Others: tag as TBD in the final set, sent to linguist
    • Repeat steps 6-9 until no. of "Final-TBD" is 0
    • Get the precision when "final-TBD" is 0
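The tagging algorithm above can be sketched as a lookup against the known valid and invalid sets. This is a minimal illustration that assumes all three inputs hold lowercase core-terms, and that precision is the share of [y] among all tagged candidates once no TBD remains; the function names and that precision formula are assumptions, not the TagCandidateFile.java implementation.

```python
def tag_candidates(candidates, valid, invalid):
    """Tag each candidate as "y" (known valid), "n" (known invalid), or "tbd"."""
    tags = {}
    for term in candidates:
        if term in valid:
            tags[term] = "y"
        elif term in invalid:
            tags[term] = "n"
        else:
            tags[term] = "tbd"
    return tags


def precision(tags):
    """Return yes / (yes + no), or None while any candidate is still "tbd"."""
    values = list(tags.values())
    if not values or "tbd" in values:
        return None
    return values.count("y") / len(values)
```

Re-running tag_candidates after the valid and invalid sets are updated shrinks the "tbd" group, which mirrors repeating the steps until the number of final TBD terms is 0.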
    Post-Tagging process: Update invalid MWs file
    7. Validate linguist's tagged file (to get invalid LMWs)
    • CandidateUtil.GetInvalidMwFromTagFile
      • Get invalid tag from linguist's tag file
    • lexAccessLb -n -i:input -o:output
      • Check linguist's invalid tags by comparing to Lexicon
    • dir: ${DATA}/current/tagData/07.MatcherParAcr/
    • ParAcr.tagged.data (link to the tagged candidate file)
      => Manually update ParAcr.tagged.data file
      • get tagged files from linguists, convert to *.txt
      • append them to the end of the previous ParAcr.tagged.data
      • uSort
      • link ParAcr.tagged.data to the new uSort file
    • ParAcr.tagged.data.yes
      => ParAcr.tagged.data.yes.out
    • ParAcr.tagged.data.no
      => ParAcr.tagged.data.no.out
    • Steps 7~9 are needed to update data after new tags are done!

    • [y]: valid expansion and LMW in Lexicon
    • [n]: invalid LMW (not in Lexicon), valid expansion
      -------------------------------------------------------
    • [o]: invalid expansion (not used after 2016+) => converted to [n] in the program automatically
    • [e]: valid expansion in Lexicon (not used after 2016+) => converted to [y] in the program automatically
    • Follow the instruction from the screen result to find invalid (y|n) tags, and send to linguists for revision.
    • This step is not used after 2016 due to limited resources. Instead of validating, the Lexicon is used to automatically assign the tags [y] and [n] in Step 7.1.
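The automatic conversion of the legacy tags can be sketched as a small mapping. This is only an illustration of the rule stated above ([o] becomes [n], [e] becomes [y]); the function name and the handling of brackets and case are assumptions.

```python
# Legacy tags and their replacements, as described in the notes above.
LEGACY_TAG_MAP = {"o": "n", "e": "y"}


def normalize_tag(tag):
    """Strip brackets, lowercase the tag, and convert legacy tags."""
    cleaned = tag.strip().strip("[]").lower()
    return LEGACY_TAG_MAP.get(cleaned, cleaned)
```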
    7.1. Assign tags to ACR/ABB candidate file
    • CandidateUtil.AssignTagToTermFile
      • Update and assign tags [y] and [n] to the ACR/ABB candidate file
      • Run this until the acronymExp.tag.data.tag.final.tbd.${YEAR} is done (in the Lexicon)
    • dir: ${DATA}/current/tagData/07.MatcherParAcr/TagData/${YEAR}
    • ParAcr.data.${YEAR}
      • Manually generate ParAcr.data.${YEAR}
        • shell>cp -rp ../${PRE_YEAR}/ParAcr.data.${PRE_YEAR}.tag .
        • get acronymExp.tag.data.tag.final.tbd.${YEAR}
        • combine above two files:
          cat ParAcr.data.${PRE_YEAR}.tag acronymExp.tag.data.tag.final.tbd.${YEAR} > ParAcr.data.${YEAR}
      • Make sure there are no duplicate candidates
        • sort -u ParAcr.data.${YEAR} > ParAcr.data.${YEAR}.uSort
    • ParAcr.data.${YEAR}.tag
    • ParAcr.data.${YEAR}.yes
    • ParAcr.data.${YEAR}.no
    Steps 7.1~9 are the post-process that should be done after the ACR/ABB candidate list is completed and before generating the next ACR/ABB candidate list.
    8. Get unique lowercased core-terms for no-tag (invalidMw) file
    • CandidateUtil.ToCoreTerm
    • dir: ${DATA}/current/tagData/07.MatcherParAcr/
    • ParAcr.tagged.data.no
      ln -sf ./ParAcr.data.2017.no ParAcr.tagged.data.no
    • ParAcr.tagged.data.no.core
    • Get the core-term form of invalid MW from tag-file
    9. Update invalid MWs file
    • invalidMwFromParAcrTag.data.${YEAR}
      • invalidMwFromParAcrTag.data.${PREV_YEAR}
      • ParAcr.tagged.data.no.core
    • invalidMwForParAcr.data.final
      => Link to invalidMwForParAcr.data.${YEAR}
      • notBaseForm.data.${YEAR}.f1
      • invalidMwFromParAcrTag.data.${YEAR}
    • ParAcr.tagged.data.no.core
    • invalidMwFromParAcrTag.data.${PREV_YEAR}

    • notBaseForm.data.${YEAR}
    • invalidMwForParAcr.data.final
    • invalidMwFromParAcrTag.data.${YEAR}
    • Update invalidMwForParAcr.data.final
      => Run Steps 7.1-9 first, then go back to Step 5 and re-run Step 6 to generate candidates for ${YEAR}
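The file updates in this step can be read as two unions of term lists. The sketch below assumes each file holds one term per line and that updating means a sorted, de-duplicated union; that interpretation and the helper name merge_terms are assumptions, not a description of the actual program.

```python
def merge_terms(*term_lists):
    """Union several term lists into one sorted list without duplicates."""
    merged = set()
    for terms in term_lists:
        merged.update(t.strip() for t in terms if t.strip())
    return sorted(merged)


def update_invalid_mws(prev_year_tag, tagged_no_core, not_base_form_f1):
    """Return (invalidMwFromParAcrTag terms, invalidMwForParAcr terms) for the year."""
    from_tag = merge_terms(prev_year_tag, tagged_no_core)
    for_par_acr = merge_terms(not_base_form_f1, from_tag)
    return from_tag, for_par_acr
```

Here prev_year_tag stands for invalidMwFromParAcrTag.data.${PREV_YEAR}, tagged_no_core for ParAcr.tagged.data.no.core, and not_base_form_f1 for notBaseForm.data.${YEAR}.f1.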
    Further Analysis on newly tagged candidates: pre-process
    10. Add WC to tagged candidates
    • UpdateWcToTaggedCandidates.java
    • acronymExp.tag.data.tag.new.yesNo
    • ./2.NGram/nGrams/nGramSet.2014.30.core
      => Must run coreterm on n-gram set from option 10 of 6.NGramUtil
    • acronymExp.tag.data.tag.new.yesNo.Wc
    • Update WC information from nGramSet.2014.30.core
    11. Get CUIs for tagged candidates
    • UpdateCuiToTaggedCandidates.java
    • acronymExp.tag.data.tag.new.yesNo.Wc
    • SMT configuration file
      => ${STMT_DIR}data/Config/smt.properties
      => Use the latest installed STMT
    • acronymExp.tag.data.tag.new.yesNo.Wc.Cui
    • Update CUI information from SMT:
      • 0: found CUIs with 0 subterm substitutions
      • 1: found CUIs with 1 subterm substitution
      • 2: found CUIs with 2 subterm substitutions
      • 3: no CUI found within 3 substitutions
    12. Tag Distilled for tagged candidates
    • UpdateDistToTaggedCandidates.java
    • acronymExp.tag.data.tag.new.yesNo.Wc.Cui
    • ./2.NGram/nGrams/distilledNGram.2014.core
      => Must run coreterm on distilledNGram set from option 11 of 6.NGramUtil
    • acronymExp.tag.data.tag.new.yesNo.WcCui.Dist
    • Update Distilled information (true|false) from distilledNGram.2014.core
    13. Tag SpVar for tagged candidates
    • UpdateSpVarToTaggedCandidates.java
    • acronymExp.tag.data.tag.new.yesNo.Wc.Cui.Dist
    • TBD ./08.Matcher/nGrams/distilledNGram.2014.core
      => TBD. Must run spVar 6.NGramUtil
    • acronymExp.tag.data.tag.new.yesNo.WcCui.Dist.SpVar
    • Update SpVar information (true|false) from TBD
    Further Analysis on newly tagged candidates: process
    15. Analyze precision, recall, and F1 on candidates
    • CandidateUtil.AnalyzeCandidatePRF
    • acronymExp.tag.data.tag.new.yesNo.rpt
    • Developed, not used for analysis yet
    16. Analyze WC histogram on candidates
    • CandidateUtil.AnalyzeCandidateHistogram
    • acronymExp.tag.data.tag.new.yesNo.his
    • Developed, not used for analysis yet
    17. Analyze WC histogram details (smaller range on lower frequency) on candidates
    • CandidateUtil.AnalyzeCandidateHistogram
    • acronymExp.tag.data.tag.new.yesNo.his.min-max.sec.csv
    • Developed, not used for analysis yet
    Precision and Recall Analysis for AMIA full paper
    30. Tag Matcher-(ACR): baseline to be used as gold standard
    • Must get the latest inflVars.data from Lexicon
    • Must run Step 7-9 first (for invalidMwForParAcr.data.final)
    • TagCandidateFile.java auto-tag:
      • [y]: if it is in Lexicon (inflVars.data)
      • [n]: invalidMwForParAcr.data.final
      • [tbd]: otherwise
    • inFile: acronymExp.subterm.raw.core
    • validFile: inflVars.data.current
    • invalidFile: invalidMwForParAcr.data.current
    • acronymExp.subterm.raw.core.tag.${YEAR}
    • acronymExp.subterm.raw.core.tag.${YEAR}.no
    • acronymExp.subterm.raw.core.tag.${YEAR}.tbd
    • acronymExp.subterm.raw.core.tag.${YEAR}.yes
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
      => Used as the gold standard for precision and recall
    • Must run step 7-9 first (to update invalidMwForParAcr.data.current)
    • Must update the inflVars.data.current from Lexicon (approve all submit records)
    31. Get precision, recall, and F1 for baseline (acronym expansion)
    • GetPRF.java
    • Must run step-30 first
    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • test: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • Output (PRF) is on the screen
    • not goldStd No: must be 0
    • err tag No: must be 0
    • Must finish step 30 first
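The precision, recall, and F1 calculation against the gold standard can be sketched as follows. The sketch assumes the gold standard maps each term to "y" or "n" and that the test file is a list of terms proposed as valid; terms missing from the gold standard are ignored here, whereas the real GetPRF.java reports them and expects that count to be 0. The function name and data layout are assumptions.

```python
def prf(gold, predicted_yes):
    """Return (precision, recall, f1) of predicted_yes against gold tags."""
    predicted = set(predicted_yes)
    tp = sum(1 for t in predicted if gold.get(t) == "y")  # proposed and valid
    fp = sum(1 for t in predicted if gold.get(t) == "n")  # proposed but invalid
    fn = sum(1 for t, tag in gold.items()
             if tag == "y" and t not in predicted)        # valid but not proposed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For the baseline, every candidate in the gold standard is proposed as valid, so recall is 1.0 and precision equals the share of "y" tags. Later steps that apply an additional filter pass a smaller predicted_yes set to the same calculation.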
    32. Tag (ACR) + Distilled set, PRF
    • CandidateUtil.ApplyDistToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • dist: nGrams/distilledNGram.${YEAR}.core

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.dist
    • Must finish steps 30 and 31 first
    33. Tag (ACR) + SpVar, PRF
    • CandidateUtil.ApplySpVarToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • spVar: Candidates/distilledNGram.2014.core.150.sort.term.spVars.latest

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.spVar
    • Must finish steps 30 and 31 first
    34. Tag (ACR) + CUI, PRF
    • CandidateUtil.ApplyCuiToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • smt: data/Config/smt.properties

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui
    • Must finish steps 30 and 31 first
    35. Tag (ACR) + EndWord, PRF
    • CandidateUtil.ApplyEndWordToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1
    • endWord: inFilterEndWord.data.used

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.endWord
    • Must finish steps 30 and 31 first
    36. Tag (ACR) + CUI + SpVar, PRF
    • CandidateUtil.ApplySpVarToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui
    • spVar: Candidates/distilledNGram.2014.core.150.sort.term.spVars.latest

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar
    • Must finish steps 30, 31, and 34 first
    37. Tag (ACR) + CUI + SpVar + EndWord, PRF
    • CandidateUtil.ApplyEndWordToFile
    • CandidateUtil.GetPRF
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar
    • endWord: inFilterEndWord.data.used

    • goldStd: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.f1.cui.spVar.endWord
    • Must finish steps 30, 31, 34, and 36 first
    Frequency (WC) Analysis on (ACR) for AMIA poster paper
    40. Add WC to GoldStd
    • CandidateUtil.AddWcToTermTagFile
    • in: acronymExp.subterm.raw.core.tag.${YEAR}.yesNo
    • ngram_wc: nGrams/nGramSet.${YEAR}.30.core
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc
    • Must finish n-gram core-term processing first
    • Must finish step 30 first
    41. Get histogram of GoldStd
    • CandidateUtil.GetPRFHistogram
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc
    • acronymExp.subterm.raw.core.tag.${YEAR}.yesNo.wc.minWc-maxWc.increment.prfHis.csv
    • Should run this one time to get the Max. WC, then use it as input
    Frequency (WC) Analysis on LEXICON for AMIA poster paper
    45. Add WC to LMWs and LSWs
    • CandidateUtil.GetSwMwFromLexicon
      => get LMWs and LSWs from Lexicon (inflVars.data)
    • CandidateUtil.AddWcToTermFile
      => Add WC to LSWs
    • CandidateUtil.AddWcToTermFile
      => Add WC to LMWs
    • inflVars.data

    • nGrams/nGramSet.${YEAR}.30.core
    • ./10.LexWords/inflVars.data.lsw
    • ./10.LexWords/inflVars.data.lmw

    • ./10.LexWords/inflVars.data.lsw.wc
    • ./10.LexWords/inflVars.data.lmw.wc
    • This data is used in Figure-1 WC spectrum: no. of terms vs. WC class
    46. Get histogram of LSWs
    • CandidateUtil.GetHistogram
    • ./10.LexWords/inflVars.data.lsw.wc
    • ./10.LexWords/inflVars.data.lsw.wc/minWc-maxWc.incWc.his.csv
    • Should run this one time to get the Max. WC, then use it as input
    47. Get histogram of LMWs
    • ./10.LexWords/inflVars.data.lmw.wc
    • ./10.LexWords/inflVars.data.lmw.wc/minWc-maxWc.incWc.his.csv
    • Should run this one time to get the Max. WC, then use it as input
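The WC histograms in these steps can be sketched as a count of terms per WC class. The sketch assumes integer word counts and fixed-width classes starting at min_wc, with counts above max_wc left out; the function name, the class boundaries, and the sample values in the usage note are assumptions and are not taken from CandidateUtil.GetHistogram.

```python
def wc_histogram(wcs, min_wc, max_wc, inc_wc):
    """Return [(class_start, count), ...] for classes of width inc_wc."""
    n_bins = (max_wc - min_wc) // inc_wc + 1
    counts = [0] * n_bins
    for wc in wcs:
        # Only word counts between min_wc and max_wc (inclusive) are counted.
        if wc in range(min_wc, max_wc + 1):
            counts[(wc - min_wc) // inc_wc] += 1
    return [(min_wc + i * inc_wc, c) for i, c in enumerate(counts)]
```

A first pass with max(wcs) gives the maximum WC, which can then be supplied as max_wc, in line with the note about running the program once to obtain the maximum. For example, the made-up counts [30, 35, 40, 99, 100, 250] with min_wc=30, max_wc=100, and inc_wc=10 put two terms in the class starting at 30 and skip the value 250.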