Exclusive Filter: Term Ends with a Valid-End-Term and Matches the Pattern of No SpVar (VETP)

  • Description:
    If a term ends with a valid-end-term and has no spelling variant co-existing in the n-gram set, it is not a valid multiword. Such terms are filtered out of the MEDLINE n-gram set. The spelling-variant patterns include hyphen (built in|built-in), non-space (built in|builtin), case (Built in|built in), and combinations of these (Built-In|built in).
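
    The spelling-variant patterns above can be illustrated with a small normalization sketch. This is not the project's code: the class SpVarKey and its methods are hypothetical names, and the sketch assumes that two terms count as spelling variants when they differ only in hyphens, spaces, or case.

```java
import java.util.Locale;

class SpVarKey {
    // Hypothetical helper (not from the source): collapse hyphen, space,
    // and case differences into one normalized key.
    static String normalize(String term) {
        return term.toLowerCase(Locale.ROOT).replaceAll("[-\\s]+", "");
    }

    // Two distinct surface forms are treated as spelling variants
    // when they share the same normalized key.
    static boolean areSpVars(String a, String b) {
        return !a.equals(b) && normalize(a).equals(normalize(b));
    }
}
```

    Under this assumption, "built in", "built-in", "builtin", and "Built-In" all share one key, while "built in" and "built up" do not.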

  • Examples:
    • analysis framework for
    • channel blockers should be
    • lotion in the treatment of
    • design was used to
    • lymph nodes up

    The valid-end-terms are derived from the Lexicon. Some end-terms from the invalid lead-end-term candidate list are valid-end-terms and are checked against the spVar pattern, such as "after", "for", "worth", etc. N-grams that end with any of these valid-end-terms and have no spelling variant co-existing in the n-gram set are most likely not valid multiwords. In 2014, the program found 37 valid-end-terms. 10 of them were removed, so only 27 valid-end-terms are used for the pattern of no spVar. The terms "I", "W", "all", "bar", "may", "mine", "minus", "need", "one", and "other" have valid MWEs in the Lexicon without spVar, and thus they are removed. Please refer to the design documents of End-Term Types for details.

  • Input Term: core-term
  • Filter Algorithm:
    • Logics:

      Get invalid terms (FT_TBD):
      • EndTermPatObj.java
        • HashSet<String> endTermSet_
        • HashSet<String> natureTermSet_
        • HashSet<String> orgTermSet_
      • Collect all terms that end with a valid-end-term
      • Save all terms with the same normSpVar (hyphenated or non-spaced)
        • lowercase terms if the valid-end-term is not upper case
        • lowercase terms if the terms are not all upper case
        • save to a hashMap(normSpVar, EndTermPatObj) if the nature-term ends with a valid-end-term
      • Convert hashMap(normSpVar, EndTermPatObj) to the invalid term list
        • Find each EndTermPatObj that matches the valid-end-term pattern
        • no spVar (checks EndTermPatObj.natureTermSet_.size() <= 1)
        • not an invalid-end-term candidate

      Check if a term is an invalid term (FT_END_TERM_INV_PAT):
      • Use the invalid-term list from the above step
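
      The two steps above can be sketched roughly as follows. This is a simplified illustration, not the code in FilterEndTermPat.java: the class EndTermPatSketch and its methods are hypothetical, a plain Set<String> stands in for EndTermPatObj.natureTermSet_, all n-grams (not only those ending with a valid-end-term) are grouped by key so that hyphenated and non-spaced variants can be found, and the conditional lowercasing rules and the invalid-end-term candidate check are left out.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class EndTermPatSketch {
    // Hypothetical normalization: lowercase, then drop hyphens and spaces.
    static String normSpVar(String term) {
        return term.toLowerCase(Locale.ROOT).replaceAll("[-\\s]+", "");
    }

    // Last space-delimited word of a term, lowercased.
    static String endWord(String term) {
        String t = term.trim();
        return t.substring(t.lastIndexOf(' ') + 1).toLowerCase(Locale.ROOT);
    }

    // Step 1: build the invalid-term list from the n-gram set.
    static Set<String> getInvalidTerms(Collection<String> nGrams,
                                       Set<String> validEndTerms) {
        // Group surface forms by normalized key (assumption: stands in
        // for hashMap(normSpVar, EndTermPatObj)).
        Map<String, Set<String>> formsByKey = new HashMap<>();
        for (String term : nGrams) {
            formsByKey.computeIfAbsent(normSpVar(term), k -> new HashSet<>())
                      .add(term);
        }
        Set<String> invalid = new LinkedHashSet<>();
        for (String term : nGrams) {
            boolean endsWithValidEndTerm = validEndTerms.contains(endWord(term));
            // "no spVar": the group holds at most one surface form.
            boolean noSpVar = formsByKey.get(normSpVar(term)).size() <= 1;
            if (endsWithValidEndTerm && noSpVar) {
                invalid.add(term);
            }
        }
        return invalid;
    }

    // Step 2: a term is trapped by the filter if it is in the invalid-term list.
    static boolean isInvalid(String term, Set<String> invalidTerms) {
        return invalidTerms.contains(term);
    }
}
```

      With illustrative input such as "analysis framework for", "lymph nodes up", "follow up", "follow-up", and "followup", and the end-terms "for" and "up", the first two terms end with an end-term and have no variant, so they land in the invalid-term list; "follow up" is kept because its hyphenated and non-spaced forms co-exist in the set.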

    • source code: FilterEndTermPat.java
    • FilterType: FilterType.FT_END_TERM_INV_PAT

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/03.LeadEndTerm/validEndTerms.data.pat
      • ${OUT_DATA}/03.LeadEndTerm/invalidLeadEndTermCandidates.data
    • Result:

      Lexicon  Filter                   Sample No  Pass No  Trap No  Exp No  Pass-Rate
      2018     FT_END_TERM_INV_PATTERN  955564     955535   29       0       99.9970%
      2017     FT_END_TERM_INV_PATTERN  935276     935247   29       0       99.9969%
      2016     FT_END_TERM_INV_PATTERN  915583     915554   29       0       99.9968%
      2015     FT_END_TERM_INV_PATTERN  896213     896190   23       0       99.9974%
      2014     FT_END_TERM_INV_PATTERN  875090     875068   22       0       99.9975%
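
      The reported pass rates are consistent with dividing the number of passed terms by the sample size. The sketch below shows that arithmetic; the class PassRate and its method are hypothetical names, and the formula is inferred from the figures rather than stated in the source.

```java
import java.util.Locale;

class PassRate {
    // Assumed formula: pass rate = passed terms / sample size, as a
    // percentage rounded to four decimal places.
    static String passRate(long sampleNo, long passNo) {
        return String.format(Locale.ROOT, "%.4f%%", 100.0 * passNo / sampleNo);
    }
}
```

      For the 2018 Lexicon, passRate(955564, 955535) gives 99.9970%, matching the reported figure; the 29 remaining terms are the ones trapped by the filter.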