Exclusive Filter: A Term Leads with Absolute Invalid-Lead-Terms (ILT)

  • Description:
    If a term leads with an absolute invalid-lead-term (ILT), it is not a valid multiword. Such terms are filtered out from the MEDLINE n-gram set.

  • Examples:
    • away from that
    • as to support
    • but a simple

    The absolute invalid-lead-terms (ILT) are derived from Lexicon. Some lead-terms from the invalid lead-end-term candidate list are absolute invalid lead-terms, such as "about", "across", "across from", etc. N-grams that start with any of these absolute invalid lead-terms are not valid multiwords. In 2014, 381 absolute invalid lead-terms were derived by a computer program. "the" was moved manually from valid-lead-terms to invalid-lead-terms because its original classification was an error in Lexicon. The final file used for this model contains 382 absolute invalid lead-terms. Please refer to the design documents of Lead-Terms Types and Lead-End-Terms Model for details.

  • Input Term: core-term
  • Filter Algorithm:
    • Logics:

      Norm: strip punctuation except for '/.-FT_TBD
      • Optional
      • (ABB) of: => abb of
      • Norm case: go through all abs-inv-lead-terms (AILTs)
        • Case-1.1: if AILT is not upper case and inTerm is upper case
        • Case-1.2: if AILT is not upper case and inTerm is mixed case and lead-word is not upper case
          => lowercase, use inTerm.lc
        • Case-2.1: if AILT is upper case
        • Case-2.2: if AILT is not upper case and inTerm is lower case
        • Case-2.3: if AILT is not upper case and inTerm is mixed case and lead-word is upper case
          => use inTerm (no change in case)
      Case  AILT  inTerm        inTerm converted
      Lowercase
      1.1   his   HIS PROBLEM   his problem
      1.2   his   His problem   his problem
      Keep case
      2.1   W/O   W/O problems  W/O problems
      2.2   his   his problem   his problem
      2.3   nor   NOR mice      NOR mice
      • Check if inTerm is abs-inv-lead-terms (AILT)
      • Exceptions: AILTs are valid terms
      • Check if inTerm leads with AILT + " "
      • his problem => invalid

    • source code: FilterLeadTermAbs.java
    • FilterType: FilterType.FT_LEAD_TERM_INV_ABS
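
The case normalization and lead-term check described above can be sketched in Python. This is an illustrative sketch only, not the code in FilterLeadTermAbs.java: the names `SAMPLE_AILTS`, `normalize_case`, and `is_trapped` are hypothetical, the AILT sample contains only terms quoted in this document, and the punctuation-stripping and "(ABB) of" steps are omitted. It also assumes that the exact-match exception is checked against all AILTs before the lead check.

```python
# Hypothetical sample; the real list holds 382 terms (invalidLeadTerms.data.abs).
SAMPLE_AILTS = ("about", "across", "across from", "his", "nor", "W/O")


def normalize_case(ailt: str, in_term: str) -> str:
    """Return in_term in the case used when comparing it against one AILT."""
    if ailt.isupper():
        return in_term              # Case 2.1: AILT is upper case, keep case
    if in_term.isupper():
        return in_term.lower()      # Case 1.1: inTerm is upper case, lowercase
    if in_term.islower():
        return in_term              # Case 2.2: inTerm is lower case, keep case
    lead_word = in_term.split(" ", 1)[0]
    if lead_word.isupper():
        return in_term              # Case 2.3: mixed case, upper-case lead-word
    return in_term.lower()          # Case 1.2: mixed case, other lead-word


def is_trapped(in_term: str, ailts=SAMPLE_AILTS) -> bool:
    """True if in_term leads with an AILT and is therefore filtered out."""
    pairs = [(ailt, normalize_case(ailt, in_term)) for ailt in ailts]
    # Exception: a term that is itself an AILT is a valid term (assumed to
    # take precedence over the lead check, e.g. "across from" vs. "across").
    if any(norm == ailt for ailt, norm in pairs):
        return False
    # Lead check: the normalized term starts with AILT + " ".
    return any(norm.startswith(ailt + " ") for ailt, norm in pairs)
```

Under these assumptions, `is_trapped("His problem")` and `is_trapped("W/O problems")` are true, while `is_trapped("NOR mice")` is false because the upper-case lead-word keeps its case and no longer matches "nor".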

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/03.LeadEndTerm/invalidLeadTerms.data.abs
    • Result:

      Lexicon  Filter                Sample No  Pass No  Trap No  Exp No  Pass-Rate
      2018     FT_LEAD_TERM_INV_ABS  955564     955508   56       442     99.9941%
      2017     FT_LEAD_TERM_INV_ABS  935276     935222   54       422     99.9942%
      2016     FT_LEAD_TERM_INV_ABS  915583     915531   52       431     99.9943%
      2015     FT_LEAD_TERM_INV_ABS  896213     896167   46       427     99.9949%
      2014     FT_LEAD_TERM_INV_ABS  875090     875044   46       427     99.9947%
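
The pass rates in the table are consistent with dividing the number of passed terms by the number of sampled terms. The short Python check below recomputes them under that assumption; the names `RESULTS` and `pass_rate` are hypothetical, and the figures are copied from the table.

```python
# (sample_no, pass_no, trap_no) per Lexicon release, copied from the table.
RESULTS = {
    2018: (955564, 955508, 56),
    2017: (935276, 935222, 54),
    2016: (915583, 915531, 52),
    2015: (896213, 896167, 46),
    2014: (875090, 875044, 46),
}


def pass_rate(sample_no: int, pass_no: int) -> float:
    """Pass rate in percent, assuming rate = passed / sampled."""
    return 100.0 * pass_no / sample_no


for year, (sample_no, pass_no, trap_no) in RESULTS.items():
    # Every sampled term either passes or is trapped.
    assert pass_no + trap_no == sample_no
    print(f"{year}: {pass_rate(sample_no, pass_no):.4f}%")
```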

Please note two types of valid words are filtered out by mistake:
  • Init case type
    His bundle: a valid word; a collection of heart muscle fibers named after the Swiss cardiologist Wilhelm His Jr., who discovered them in 1893.
  • Upper case type:
    US EPA: United States Environmental Protection Agency

However, such valid words are very few. Also, the two trapped words listed below are not multiwords and have been removed from Lexicon after 2015. In other words, "the" should belong to the absolute invalid lead-term list.

  • the Netherlands
  • the Staatliche Frauenklinik und Hebammenschule