Exclusive Filter: A Term with min. Document Count and Word Count

  • Description:
    If a term's document count (DC) or word count (WC) is less than the specified minimum DC or WC, it is not a valid (or commonly used) multiword. Such terms are filtered out of the MEDLINE n-gram set. For example, the following terms are invalid multiwords:
    • n2oniliforme
    • m2per
    • protocol, recruitment
    • embryos (61.9%)

    The MEDLINE n-gram set is used to retrieve the DC and WC. It uses 30 as the minimum WC and 1 as the minimum DC. Many multiwords in the Lexicon are not in the n-gram set because:

    • They are spelling variants: their spVars exist in the n-gram set, but they themselves don't have enough WC (30).
    • The n-gram may appear in different forms, such as different case and punctuation, so they don't have enough WC (30).
    • The Lexicon records some multiwords that have a small occurrence (WC).
    Thus, we think this is not the right (frequency) filter to use in generating multiwords. Instead, we should use normalized word count (NWC) along with DC|WC for a better result from the frequency test. Please note that normalized document count (NDC) can't be calculated.
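
    To illustrate why a normalized count helps with the case and punctuation problem described above, here is a minimal sketch. The exact definition of NWC is not given here, so this assumes that normalization means lowercasing, replacing punctuation with spaces, and collapsing whitespace, and that NWC is the sum of WC over all forms sharing one normalized form. The class and method names are hypothetical, and the counts are made up.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NormalizedWordCountSketch {
    // Assumed normalization (not taken from the source): lowercase,
    // replace ASCII punctuation with spaces, collapse whitespace.
    static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT)
                   .replaceAll("\\p{Punct}", " ")
                   .trim()
                   .replaceAll("\\s+", " ");
    }

    // Sum the WC of every n-gram form that shares a normalized form.
    static Map<String, Long> normalizedWordCounts(Map<String, Long> wcByNGram) {
        Map<String, Long> nwc = new HashMap<>();
        for (Map.Entry<String, Long> e : wcByNGram.entrySet()) {
            nwc.merge(normalize(e.getKey()), e.getValue(), Long::sum);
        }
        return nwc;
    }

    public static void main(String[] args) {
        // Made-up counts: each form alone is below a minimum WC of 30.
        Map<String, Long> wc = new HashMap<>();
        wc.put("Heart attack", 20L);
        wc.put("heart-attack", 15L);
        wc.put("heart attack", 10L);
        System.out.println(normalizedWordCounts(wc).get("heart attack")); // prints 45
    }
}
```

    In this toy input no single form reaches a WC of 30, but the combined count does, which is the situation the second bullet above describes.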

  • Filter Algorithm:
    • Logic:

      get DC|WC from n-gram                    FT_TBD
      if not in the MEDLINE n-gram set         FT_WC_DC_NOT_FOUND
      • Exceptions: valid terms with no DC|WC found in the n-gram set
      check if (dc < minDc) or (wc < minWc)    FT_WC_DC_INV_LESS
      • Filtered: invalid terms with DC|WC less than the minimum

    • source code: FilterDcWc.java
    • FilterType: FilterType.FT_WC_DC_INV_LESS
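
    The logic above can be sketched as follows. This is a minimal illustration, not the code in FilterDcWc.java: the class DcWcFilterSketch, the Counts holder, and the classify method are hypothetical, the n-gram set is assumed to be preloaded into a map from term to its DC and WC, and the terms and counts in main are made up.

```java
import java.util.HashMap;
import java.util.Map;

final class DcWcFilterSketch {
    // Outcome names follow the FilterType values listed above.
    enum FilterType { FT_TBD, FT_WC_DC_NOT_FOUND, FT_WC_DC_INV_LESS }

    // Hypothetical holder for one n-gram's document and word counts.
    static final class Counts {
        final long dc;
        final long wc;
        Counts(long dc, long wc) { this.dc = dc; this.wc = wc; }
    }

    static FilterType classify(String term, Map<String, Counts> nGrams,
                               long minDc, long minWc) {
        Counts c = nGrams.get(term);
        if (c == null) {
            // Not in the n-gram set: treated as an exception, not filtered.
            return FilterType.FT_WC_DC_NOT_FOUND;
        }
        if (c.dc < minDc || c.wc < minWc) {
            // Below either minimum: filtered out as invalid.
            return FilterType.FT_WC_DC_INV_LESS;
        }
        // Passed this filter; left undecided for later filters.
        return FilterType.FT_TBD;
    }

    public static void main(String[] args) {
        Map<String, Counts> nGrams = new HashMap<>();
        nGrams.put("common term", new Counts(500, 900)); // made-up counts
        nGrams.put("rare term", new Counts(2, 12));      // made-up counts
        for (String t : new String[] {"common term", "rare term", "missing term"}) {
            System.out.println(t + " -> " + classify(t, nGrams, 1, 30));
        }
    }
}
```

    With the minimums used in this document (DC >= 1, WC >= 30), "rare term" is filtered, "missing term" is reported as not found and kept as an exception, and "common term" stays undecided.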

  • Accuracy Test on Lexicon (DC >= 1; WC >= 30):
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/02.NGram/nGrams/n-gram.${YEAR}
    • Result:

      Lexicon  Filter             Sample No  Pass No  Trap No  Exp No  Pass-Rate
      2018     FT_WC_DC_INV_LESS  955564     955564   0        625175  100.0000%
      2016     FT_WC_DC_INV_LESS  915583     915583   0        618966  100.0000%
      2015     FT_WC_DC_INV_LESS  896213     896213   0        612316  100.0000%
      2014     FT_WC_DC_INV_LESS  875090     875090   0        603592  100.0000%