Frequency Analysis on 5 WC ranges: 100, 1K, 10K, 100K, 1M

I. Introduction

Frequency strategy is important for LMW acquisition. It is applied to LMW candidates obtained from filters and matchers for better precision. This page describes a frequency analysis on 5 word count (WC) ranges (100, 1K, 10K, 100K, 1M).

II. Details

  • Directory:
    • ${MULTIWORDS}/bin/08.MatcherSpVar
    • ${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/
    • ${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/Candidates/tag.2017.good
  • Model:
    • Input Data: 2015 Distilled MEDLINE N-gram Set
    • Process:
      • Step 51: Use SpVar model of M2CES to get SpVar List
        medline.2.byM2CES.2.out.30.spVars (min_ed >= 2, WC >= 30)
      • Step 60: Apply CUI filter
        medline.2.byM2CES.2.out.30.spVars.cui
      • Step 61A: Retrieve 500 LMW candidates at each of the 5 WC ranges
        The algorithm only counts the 500 multiwords below each WC:
        • 100
        • 1000
        • 10000
        • 100000
        • 1000000
      • Tag them:
        Tag       Description
        AUTO_YES  Automatically tagged by computer if term is in Lexicon
        AUTO_NO   Automatically tagged by computer if term is in Lexicon
        Y         Manually tagged by linguists if term is LMW, then add to Lexicon
        N         Manually tagged by linguists if term is not LMW, then add to invalid LMW List
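
The selection in Step 61A can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes the candidates are available as (term, WC) pairs, and it reads "500 below the WC" as taking up to 500 candidates whose WC does not exceed the threshold, nearest the threshold first. The function name select_candidates and the sample terms and counts are hypothetical.

```python
from typing import Iterable, List, Tuple

# The five WC thresholds listed in Step 61A.
WC_THRESHOLDS = [100, 1_000, 10_000, 100_000, 1_000_000]


def select_candidates(
    candidates: Iterable[Tuple[str, int]],
    threshold: int,
    limit: int = 500,
) -> List[Tuple[str, int]]:
    """Return up to `limit` (term, WC) pairs with WC <= threshold.

    Assumptions (not confirmed by the source): the threshold itself is
    included, and candidates closest to the threshold are taken first.
    """
    eligible = [(term, wc) for term, wc in candidates if wc <= threshold]
    # Highest WC first; ties broken alphabetically for a stable result.
    eligible.sort(key=lambda pair: (-pair[1], pair[0]))
    return eligible[:limit]


# Hypothetical sample data, for illustration only.
sample = [
    ("heart attack", 950),
    ("blood pressure", 120_000),
    ("in the", 99),
    ("cell line", 1_000),
]

picked = select_candidates(sample, 1_000, limit=2)
print(picked)
```

In practice the same call would be repeated for each value in WC_THRESHOLDS with the default limit of 500, and the selected terms would then be tagged as described above.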

III. Results

Frequency  Precision (New Terms)   Precision (Total Terms)
100        19.81% (= 104/525)      21.60% (= 116/537)
1K         36.77% (= 196/533)      42.42% (= 249/587)
10K        47.73% (= 263/551)      67.56% (= 604/894)
100K       35.72% (= 384/1075)     68.38% (= 1516/2217)
1M         36.77% (= 556/1512)     71.16% (= 2396/3367)

Total precision increases with frequency, from 21.60% at WC 100 to 71.16% at WC 1M. Thus, we should acquire LMW from the highest-frequency n-grams. Detailed data are available at: ${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/Candidates/tag.2017.good/*.rpt
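
As a quick check on the arithmetic, the precision figures in the Results table can be recomputed from the reported counts. The sketch below assumes each numerator is the number of candidates judged valid and each denominator is the number of candidates examined. The counts are copied from the table, while the names RESULTS and precision are illustrative only.

```python
# Counts copied from the Results table:
# (valid new, all new, valid total, all total) per WC threshold.
RESULTS = {
    "100": (104, 525, 116, 537),
    "1K": (196, 533, 249, 587),
    "10K": (263, 551, 604, 894),
    "100K": (384, 1075, 1516, 2217),
    "1M": (556, 1512, 2396, 3367),
}


def precision(valid: int, total: int) -> float:
    """Precision as a percentage: valid candidates / all candidates."""
    return 100.0 * valid / total


total_precisions = []
for wc, (new_ok, new_all, tot_ok, tot_all) in RESULTS.items():
    p_new = precision(new_ok, new_all)
    p_tot = precision(tot_ok, tot_all)
    total_precisions.append(p_tot)
    print(f"{wc:>5}  new: {p_new:6.2f}%  total: {p_tot:6.2f}%")

# True when total precision never decreases as the WC threshold grows.
is_monotonic = total_precisions == sorted(total_precisions)
print("Total precision non-decreasing:", is_monotonic)
```

Note that only the total-term precision rises steadily; by these counts, the new-term precision is highest at 10K.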