Precision for New LMW Candidates from (ACR) Model

I. Introduction

The matchers produce a list of LMW (LexMultiWord) candidates. This list is sent to linguists, who add the valid candidates to the Lexicon. An algorithm has been developed to calculate the precision automatically in this process, where precision = no. of valid candidates / no. of candidates = retrieved-relevant / retrieved.

II. Format and Tag

Ideally, no tag is needed because the tag can be retrieved from the Lexicon. However, we would like linguists to tag the LMW candidates manually, in the format below, to ensure that the precision is correct. For example, if a linguist misses a candidate, it will be treated as an invalid MW when there is no manual tag. With manual tagging, any missed tag can be identified.

  • LMW candidates format:
    LMW candidate*Tag

    * LMW candidates are lowercase core-terms

  • Tag:

    Three types of tags are assigned automatically, as listed below. The algorithm checks for valid first, then invalid. TBD covers everything that meets neither of these two conditions. Filter.Lexicon is used for the check, so matching includes exact match, exact match after lowercasing, match after removing leading and trailing punctuation, etc.

    yes   known valid LMW from Lexicon
    no    known invalid LMW from previous tag
            • (from Lexicon - valid expansion)
    tbd   to be done (untagged candidates)
            • LMW candidates file
    o     not a valid acronym expansion (invalid MW); legacy tag, not used
    e     a valid expansion that exists in Lexicon (valid MW); legacy tag, not used
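
The check order described above (valid first, then invalid, otherwise TBD) can be sketched in Python as follows. This is only an illustration, not the actual Filter.Lexicon logic: the function names, the sample terms, and the three matching forms (exact, lowercased, leading and trailing punctuation removed) are assumptions, and the valid and invalid sets are assumed to hold lowercase terms.

```python
import string


def match_forms(term):
    """Return the forms tried for a candidate (an assumed stand-in for Filter.Lexicon)."""
    lowered = term.lower()
    stripped = lowered.strip(string.punctuation + " ")
    return [term, lowered, stripped]


def auto_tag(candidate, valid, invalid):
    """Return 'yes', 'no', or 'tbd'; the valid set is checked before the invalid set."""
    forms = match_forms(candidate)
    if any(form in valid for form in forms):
        return "yes"
    if any(form in invalid for form in forms):
        return "no"
    return "tbd"


# Hypothetical sample data, for illustration only.
valid = {"heart attack", "blood pressure"}
invalid = {"of the patient"}

for cand in ["Heart Attack", "(blood pressure)", "of the patient", "ace inhibitor therapy"]:
    print(cand, "->", auto_tag(cand, valid, invalid))
```

Because valid is checked first, a term that appears in both sets is tagged yes in this sketch.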

    The tagging results are used to update the invalidMw file. The following checks are used:

    • yes: should be in Lexicon
      =>lexAccessLb -n -i:yes -o:yes.out
      =>fgrep "|No Result Found-" yes.out |wc -l ... should be 0
    • no: should not be in Lexicon
      =>lexAccessLb -n -i:no -o:no.out
      =>fgrep "|No Result Found-" no.out |wc -l ... should be the same as the number of lines in no.out
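
The same two checks can be sketched in Python once the output files exist. The only thing taken from the commands above is the marker string "|No Result Found-"; the function names and the sample output lines are hypothetical, and the layout of a line for a term that is found is not specified here, so a placeholder is used.

```python
NOT_FOUND = "|No Result Found-"


def count_not_found(output_lines):
    """Count lines containing the marker, like fgrep piped to wc -l."""
    return sum(1 for line in output_lines if NOT_FOUND in line)


def check_yes(output_lines):
    """Every 'yes' term should be in Lexicon, so no line may carry the marker."""
    return count_not_found(output_lines) == 0


def check_no(output_lines):
    """No 'no' term should be in Lexicon, so every non-empty line carries the marker."""
    lines = [line for line in output_lines if line.strip()]
    return count_not_found(lines) == len(lines)


# Hypothetical output lines, for illustration only.
yes_out = ["heart attack|<record placeholder>", "blood pressure|<record placeholder>"]
no_out = ["of the patient|No Result Found-", "in order to|No Result Found-"]

print(check_yes(yes_out), check_no(no_out))
```

In practice the lists would be read from yes.out and no.out, for example with open("yes.out").read().splitlines().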

III. Process to get precision on new LMWs from candidate list

  • Candidates from MEDLINE n-gram.${NGRAM_YEAR}
  • Stats from LEXICON.${INIT_LEX_YEAR}

0. Prepare valid and invalid files:
  • Initial set:
    • valid:${YEAR}
    • invalid:${YEAR-1}
  • Final set (current):
    • valid: (latest)
    • invalid: (current)
      => Need to update${YEAR} until tagging on TBD is completed
      =>${YEAR} updates${YEAR}
      => link to the latest ${YEAR}
1. Run TagXXX to auto-tag until no TBD remains in the final data:
  • Candidates: TBD from initial data, send to linguists to
    • add LMW to Lexicon
    • tag yes|no
  • Update from tagged file (no Tag)
    • Rerun this step until no TBD in the final data
    • Link to the latest${YEAR}
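
The rerun loop in this step can be sketched in Python as below. This is a simplified model under stated assumptions, not the TagXXX program: the valid set stands in for the Lexicon, the invalid set stands in for the invalidMw file, and each entry in rounds is a hypothetical batch of linguist tags.

```python
def update_sets(tbd, linguist_tags, valid, invalid):
    """Apply one batch of linguist tags and return the candidates that are still TBD."""
    still_tbd = []
    for cand in tbd:
        tag = linguist_tags.get(cand, "tbd")
        if tag == "yes":
            valid.add(cand)      # stands in for adding the LMW to Lexicon
        elif tag == "no":
            invalid.add(cand)    # stands in for updating the invalidMw file
        else:
            still_tbd.append(cand)
    return still_tbd


# Hypothetical data: three TBD candidates resolved over two tagging rounds.
valid, invalid = set(), set()
tbd = ["a b", "c d", "e f"]
rounds = [{"a b": "yes"}, {"c d": "no", "e f": "yes"}]

passes = 0
for reply in rounds:
    tbd = update_sets(tbd, reply, valid, invalid)
    passes += 1
    if not tbd:          # stop once no TBD is left in the final data
        break

print(passes, tbd, sorted(valid), sorted(invalid))
```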

  • Precision is calculated after the number of TBD in the final data is 0:
    • All "yes" from "TBD" are already in Lexicon
    • All "no" from "TBD" are updated in ${NEXT_YEAR} (current)
    • precision (of new candidates) = no. of "yes" from "TBD" / total no. of "TBD"

    • Precision should improve over the years (after the first release in 2014) because more invalidMw entries are collected.
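
The precision formula for new candidates can be illustrated with a short Python sketch. The function name and the sample tags are hypothetical; the input is assumed to be the final yes/no tag of every candidate that started as TBD.

```python
def precision_of_new_candidates(resolved_tags):
    """Return (no. of 'yes') / (total no. of former TBD candidates).

    Raises ValueError if any candidate is still unresolved, since precision
    is only meaningful once no TBD remains. Returns None for an empty list.
    """
    tags = list(resolved_tags)
    if any(tag not in ("yes", "no") for tag in tags):
        raise ValueError("all TBD candidates must be tagged yes or no first")
    if not tags:
        return None
    return tags.count("yes") / len(tags)


# Hypothetical example: 3 of 4 former TBD candidates were tagged yes.
print(precision_of_new_candidates(["yes", "yes", "no", "yes"]))
```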