Exclusive Filter: Lead-End-Term pattern

  • Description:
    If a term consists of two invalid-lead-end-term (ILET) candidates, that is, it starts with an ILET and the rest of the term is also an ILET, it is not a valid multiword.

  • Examples:
    • in a
    • to be
    • with a
    • may be

  • Input Term: core-term
  • Filter Algorithm:
    • Logic:

      Description                                       | FilterType           | Notes
      --------------------------------------------------+----------------------+-----------------------------
      Norm: replace punctuation with space, then trim   | FT_TBD               | Optional: In A. => In A
      inTerm is a single word (not a multiword)         | FT_TBD               |
      inTerm is an invalid-lead-end-term (ILET)         | FT_TBD               |
        candidate                                       |                      |
      Case norm of inTerm (rules below)                 | FT_TBD               | See the case examples below
      inTerm starts with ILET + " " and the ILET list   | FT_LEAD_END_TERM_PAT | Filtered invalid terms
        contains the rest of inTerm                     |                      |

      Case norm rules for inTerm:
      • Case 1.1: if ILET is not uppercase and inTerm is uppercase
      • Case 1.2: if ILET is not uppercase and inTerm is mixed case and lead-word is not uppercase
        => lowercase
      • Case 2.1: if ILET is uppercase
      • Case 2.2: if ILET is not uppercase and inTerm is lowercase
      • Case 2.3: if ILET is not uppercase and inTerm is mixed case and lead-word is uppercase
        => keep case

      Case examples:

      Case | Action    | ILET | inTerm      | inTerm converted
      -----+-----------+------+-------------+-----------------
      1.1  | Lowercase | in   | IN A        | in a
      1.2  | Lowercase | in   | In a        | in a
      2.1  | Keep case | W/O  | W/O PROBLEM | W/O PROBLEM
      2.2  | Keep case | in   | in a        | in a
      2.3  | Keep case | in   | US has      | US has

    • source code: FilterLeadEndTermPat.java
    • FilterType: FilterType.FT_LEAD_END_TERM_PAT
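
The case-normalization rules and the final pattern check above can be sketched in Java as follows. This is a minimal illustration, not the code in FilterLeadEndTermPat.java: the class name LeadEndTermPatSketch, the method names, and the use of a Set<String> for the ILET list are all assumptions made for the example. It also assumes that "uppercase" means the string has at least one cased letter and no lowercase letters, and it leaves out the optional punctuation normalization step.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; names and data structures are assumptions, not the real implementation.
final class LeadEndTermPatSketch {

    // Assumed definition: at least one cased letter and no lowercase letters (e.g. "W/O", "US").
    static boolean isUpperCase(String s) {
        return !s.equals(s.toLowerCase(Locale.ROOT)) && s.equals(s.toUpperCase(Locale.ROOT));
    }

    // Assumed definition: no uppercase letters.
    static boolean isLowerCase(String s) {
        return s.equals(s.toLowerCase(Locale.ROOT));
    }

    // Case normalization of inTerm relative to one ILET (Cases 1.1 to 2.3).
    static String normalizeCase(String ilet, String inTerm) {
        if (isUpperCase(ilet) || isLowerCase(inTerm)) {
            return inTerm;                               // Case 2.1 and Case 2.2: keep case
        }
        if (isUpperCase(inTerm)) {
            return inTerm.toLowerCase(Locale.ROOT);      // Case 1.1: lowercase
        }
        String leadWord = inTerm.split(" ", 2)[0];
        if (isUpperCase(leadWord)) {
            return inTerm;                               // Case 2.3: keep case
        }
        return inTerm.toLowerCase(Locale.ROOT);          // Case 1.2: lowercase
    }

    // True if inTerm matches the lead-end-term pattern:
    // it starts with an ILET plus a space, and the rest is also in the ILET set.
    static boolean isInvalid(String inTerm, Set<String> ilets) {
        for (String ilet : ilets) {
            String norm = normalizeCase(ilet, inTerm);
            String prefix = ilet + " ";
            if (norm.startsWith(prefix) && ilets.contains(norm.substring(prefix.length()))) {
                return true;
            }
        }
        return false;
    }
}
```

With a toy ILET set such as {"in", "a", "to", "be", "with", "may"}, isInvalid("In a", ilets) returns true, while isInvalid("US has", ilets) returns false. The lookup is case sensitive after normalization, so a term is trapped only when both parts match ILET entries exactly.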

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/03.LeadEndTerm/invalidLeadEndTermCandidates.data
    • Result:

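      Lexicon | Filter               | Sample No | Pass No | Trap No | Exp No | Pass-Rate
      --------+----------------------+-----------+---------+---------+--------+----------
      2018    | FT_LEAD_END_TERM_PAT | 955564    | 955557  | 7       | 0      | 99.9993%
      2017    | FT_LEAD_END_TERM_PAT | 935276    | 935269  | 7       | 0      | 99.9993%
      2016    | FT_LEAD_END_TERM_PAT | 915583    | 915576  | 7       | 0      | 99.9992%
      2015    | FT_LEAD_END_TERM_PAT | 896213    | 896206  | 7       | 0      | 99.9992%
      2014    | FT_LEAD_END_TERM_PAT | 875090    | 875083  | 7       | 0      | 99.9992%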
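
The pass rates in the table are consistent with Pass No divided by Sample No, rounded to four decimal places. That formula is inferred from the figures and is not stated in the source. A small sketch, with the hypothetical names PassRateSketch and passRate, reproduces the reported values:

```java
import java.util.Locale;

// Hypothetical helper; the formula passNo / sampleNo is inferred from the table, not documented.
final class PassRateSketch {

    // Formats the pass rate as a percentage with four decimal places.
    static String passRate(long sampleNo, long passNo) {
        return String.format(Locale.ROOT, "%.4f%%", 100.0 * passNo / sampleNo);
    }
}
```

For the 2018 row, passRate(955564, 955557) yields "99.9993%", and for the 2016 row, passRate(915583, 915576) yields "99.9992%".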

      Note that the filter is case sensitive, so some invalid terms are not filtered out, by design (they do not fall into the disallowed category):

      • a His
      • a No.
      • in AS,
      • to W
      • on US,
      • to ME
      • with He