Exclusive Filter: A term contains a pattern of disallowed punctuation

  • Description:
    If a term contains a pattern of disallowed punctuation, it is an invalid MWE. Disallowed punctuation includes {}_!@#*\;"?~=|<>$`^. The number of valid terms in Lexicon.2014 that contain each of these characters is shown in the table below:

    Punctuation   Count in Lexicon.2014
    {             0
    }             0
    _             0
    !             1
    @             0
    #             0
    *             4
    \             0
    ;             0
    "             10
    ?             0
    ~             0
    =             0
    |             0
    <             0
    >             0
    $             0
    `             4
    ^             0

    Some of the punctuation characters above are allowed as leading or ending characters. They should be stripped first:

    • Allowed leading-char-punctuation:
      { < " space tab * #
    • Allowed ending-char-punctuation:
      } > " space tab ? ! ;

  • Examples:
    • (ps> 0.05)
    • group (n=6) received
    • US$

  • Input Term: original term
  • Filter Algorithm:
    • Logics:

      Description                                     FilterType          Notes
      Norm: strip allowed-leading-char-punctuation    FT_TBD
      Norm: strip allowed-ending-char-punctuation     FT_TBD
      Check if word contains disallowed punctuation   FT_PUNC_DISALLOW
      • filtered invalid terms - contain disallowed punctuation :

    • source code: FilterPuncDisallow.java
    • FilterType: FilterType.FT_PUNC_DISALLOW
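
The three steps in the logic table can be sketched as follows. This is a minimal illustration, not the code in FilterPuncDisallow.java: the class name `PuncDisallowSketch` and its methods are hypothetical. It also makes two assumptions: that "space" and "tab" in the allowed lists mean the space and tab characters, and that every consecutive allowed character is stripped, not just the first one.

```java
// Hypothetical sketch of the disallowed-punctuation filter described above.
// It is NOT the actual FilterPuncDisallow.java implementation.
class PuncDisallowSketch {
    // Disallowed punctuation, as listed in the description.
    static final String DISALLOWED = "{}_!@#*\\;\"?~=|<>$`^";
    // Allowed leading/ending characters (assumption: they include space and tab).
    static final String ALLOWED_LEADING = "{<\" \t*#";
    static final String ALLOWED_ENDING = "}>\" \t?!;";

    // Step 1: strip allowed leading characters (assumption: all consecutive ones).
    static String stripLeading(String term) {
        int start = 0;
        while (start < term.length()
                && ALLOWED_LEADING.indexOf(term.charAt(start)) >= 0) {
            start++;
        }
        return term.substring(start);
    }

    // Step 2: strip allowed ending characters (assumption: all consecutive ones).
    static String stripEnding(String term) {
        int end = term.length();
        while (end > 0 && ALLOWED_ENDING.indexOf(term.charAt(end - 1)) >= 0) {
            end--;
        }
        return term.substring(0, end);
    }

    // Step 3: check whether any disallowed punctuation remains.
    static boolean containsDisallowed(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (DISALLOWED.indexOf(term.charAt(i)) >= 0) {
                return true;
            }
        }
        return false;
    }

    // A term passes the filter if nothing disallowed is left after stripping.
    static boolean isValid(String term) {
        return !containsDisallowed(stripEnding(stripLeading(term)));
    }

    public static void main(String[] args) {
        // The three example terms from this page, plus one term that should pass.
        String[] terms = {"(ps> 0.05)", "group (n=6) received", "US$", "\"air bag\""};
        for (String term : terms) {
            System.out.println(term + " -> valid: " + isValid(term));
        }
    }
}
```

Under these assumptions, the three example terms are rejected because `>`, `=`, and `$` remain after stripping, while a term wrapped only in allowed leading and ending quotes passes.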

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
    • Result:

      Lexicon   Filter             Sample No   Pass No   Trap No   Exp No   Pass-Rate
      2018      FT_PUNC_DISALLOW   955564      955550    14        0        99.9985%
      2017      FT_PUNC_DISALLOW   935276      935263    13        0        99.9986%
      2016      FT_PUNC_DISALLOW   915583      915570    13        0        99.9986%
      2015      FT_PUNC_DISALLOW   896213      896194    19        0        99.9979%
      2014      FT_PUNC_DISALLOW   875090      875071    19        0        99.9978%
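
The figures in the result table are consistent with reading the pass rate as Pass No divided by Sample No, with Trap No as the difference between the two. That reading is inferred from the numbers, not stated on this page. The short check below uses a hypothetical `PassRateCheck` class to reproduce the 2018 and 2014 rows under that assumption.

```java
import java.util.Locale;

// Hypothetical helper that re-derives the Pass-Rate column from the table above.
class PassRateCheck {
    // Assumed formula: pass rate (%) = 100 * Pass No / Sample No.
    static double passRate(long sampleNo, long passNo) {
        return 100.0 * passNo / sampleNo;
    }

    // Formats a rate with four decimal places, matching the table's style.
    static String format(double rate) {
        return String.format(Locale.ROOT, "%.4f%%", rate);
    }

    public static void main(String[] args) {
        System.out.println(format(passRate(955564, 955550))); // 2018 row
        System.out.println(format(passRate(875090, 875071))); // 2014 row
    }
}
```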