Exclusive Filter: A Term is in Lexicon

  • Description:
    If a term is in the Lexicon, it is a valid (multi)word. However, such terms are not multiword candidates, because they are already in the Lexicon. This filter has several filter levels, described below. A match at any of these levels is considered a "true" match.

    Filter Type          Description                                                        Example
    FT_LEX_EM            exact match                                                        skin disease
    FT_LEX_LC            match after lowercase                                              Skin disease
    FT_LEX_LE_PUNC       match after stripping leading and ending punctuation               :skin disease
    FT_LEX_LC_LE_PUNC    match after lowercase and stripping leading and ending punctuation :SKIN DISEASE
    FT_LEX_ALL_PUNC      match after removing all punctuation                               :skin-disease,
    FT_LEX_LC_ALL_PUNC   match after lowercase and removing all punctuation                 :SKIN-DISEASE,

  • Filter Algorithm:
    • Logic:

      Description                                                       FilterType            Notes
      Init                                                              FT_TBD
      Check exact match                                                 FT_LEX_EM             filtered terms in Lexicon
      Check match after lowercase                                       FT_LEX_LC             filtered terms in Lexicon
      Check match after removing lead-end punctuation                   FT_LEX_LE_PUNC        filtered terms in Lexicon
      Check match after lowercase and removing lead-end punctuation     FT_LEX_LC_LE_PUNC     filtered terms in Lexicon
      Check match after removing all punctuation                        FT_LEX_ALL_PUNC       filtered terms in Lexicon
      Check match after lowercase and removing all punctuation          FT_LEX_LC_ALL_PUNC    filtered terms in Lexicon

    • source code: FilterLexicon.java
    • FilterType: FilterType.FT_LEX_EM
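
    The cascade of checks above can be sketched roughly as follows. This is a minimal illustration, not the actual FilterLexicon.java: the class name LexiconFilterSketch, the helper methods, and the use of a Set<String> as the Lexicon are all hypothetical. It also assumes that "punctuation" means the regex class \p{Punct}, that removing all punctuation replaces it with a space (so "skin-disease" can match "skin disease"), and that the lowercase levels compare the lowercased term against the Lexicon entries as stored.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the level-by-level Lexicon check; not the real FilterLexicon.java.
class LexiconFilterSketch {
    // Names follow the filter type table; FT_TBD means no level matched.
    enum FilterType {
        FT_TBD, FT_LEX_EM, FT_LEX_LC, FT_LEX_LE_PUNC,
        FT_LEX_LC_LE_PUNC, FT_LEX_ALL_PUNC, FT_LEX_LC_ALL_PUNC
    }

    // Assumption: punctuation is approximated by \p{Punct}.
    static String stripLeadEndPunc(String s) {
        return s.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "").trim();
    }

    // Assumption: each run of punctuation becomes one space, then whitespace is collapsed.
    static String removeAllPunc(String s) {
        return s.replaceAll("\\p{Punct}+", " ").trim().replaceAll("\\s+", " ");
    }

    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Returns the first level at which the term is found in the Lexicon.
    static FilterType classify(String term, Set<String> lexicon) {
        if (lexicon.contains(term)) return FilterType.FT_LEX_EM;
        if (lexicon.contains(lower(term))) return FilterType.FT_LEX_LC;
        String le = stripLeadEndPunc(term);
        if (lexicon.contains(le)) return FilterType.FT_LEX_LE_PUNC;
        if (lexicon.contains(lower(le))) return FilterType.FT_LEX_LC_LE_PUNC;
        String all = removeAllPunc(term);
        if (lexicon.contains(all)) return FilterType.FT_LEX_ALL_PUNC;
        if (lexicon.contains(lower(all))) return FilterType.FT_LEX_LC_ALL_PUNC;
        return FilterType.FT_TBD; // not in the Lexicon at any level
    }

    public static void main(String[] args) {
        Set<String> lexicon = Set.of("skin disease"); // toy Lexicon for illustration
        String[] terms = {"skin disease", "Skin disease", ":skin disease",
                          ":SKIN DISEASE", ":skin-disease,", ":SKIN-DISEASE,"};
        for (String t : terms) {
            System.out.println(t + " -> " + classify(t, lexicon));
        }
    }
}
```

    Under these assumptions, any result other than FT_TBD means the term is treated as already in the Lexicon and is excluded from the multiword candidates.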

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${IN_DATA}/inflVars.data
    • Result:

      Lexicon   Filter      Sample No   Pass No   Trap No   Exp No   Pass-Rate
      2018      FT_LEX_EM   955564      955564    0         0        100.0000%
      2017      FT_LEX_EM   935276      935276    0         0        100.0000%
      2016      FT_LEX_EM   915583      915583    0         0        100.0000%
      2015      FT_LEX_EM   896213      896213    0         0        100.0000%
      2014      FT_LEX_EM   875090      875090    0         0        100.0000%
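
    The Pass-Rate column appears to be Pass No divided by Sample No, shown as a percentage with four decimals. That formula is inferred from the table rather than stated by the source, and the class and method names below are hypothetical. A minimal sketch of the arithmetic:

```java
import java.util.Locale;

// Hypothetical helper; the formula is an assumption inferred from the result table.
class PassRateSketch {
    // Assumed formula: Pass-Rate = 100 * Pass No / Sample No, formatted to four decimals.
    static String passRate(long passNo, long sampleNo) {
        if (sampleNo <= 0) {
            throw new IllegalArgumentException("sampleNo must be positive");
        }
        return String.format(Locale.ROOT, "%.4f%%", 100.0 * passNo / sampleNo);
    }

    public static void main(String[] args) {
        // Counts taken from the 2018 row of the table.
        System.out.println(passRate(955564L, 955564L));
    }
}
```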