Exclusive Filter: A Term is a Number

  • Description:
    If a term is a number, it is not a valid/new multiword (it is already known in the Lexicon). Two normalizations (lowercase and strip punctuation) are performed in this filter.

  • Examples:
    • five
    • fifth
    • five hundred and fifty five

  • Input Term: core-term.lc
  • Filter Algorithm:
    • Logic:

      Description                             FilterType      Notes
      Get words from inTerm                   FT_TBD
      Norm: lowercase and strip punctuation   FT_TBD
      Check if all words are numbers          FT_LEX_NUMBER   Filtered invalid terms: numbers after lowercase and strip punctuation

    • source code: FilterNumber.java
    • FilterType: FilterType.FT_LEX_NUMBER
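
    The three steps in the table above can be sketched roughly as follows. This is a minimal illustration, not the code in FilterNumber.java: the class name FilterNumberSketch, the hard-coded NUMBER_WORDS set (a stand-in for the number entries recorded in the Lexicon), replacing punctuation with a space rather than deleting it, and tolerating the connective "and" between number words are all assumptions made so that the three examples listed earlier are trapped.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a "term is a number" filter; not the actual FilterNumber.java.
class FilterNumberSketch {
    // Assumption: a tiny stand-in for the number words recorded in the Lexicon.
    private static final Set<String> NUMBER_WORDS = new HashSet<>(Arrays.asList(
        "five", "fifth", "fifty", "hundred"));

    // Assumption: "and" is allowed between number words ("five hundred and fifty five").
    private static final Set<String> CONNECTIVES = new HashSet<>(Arrays.asList("and"));

    // Normalization: lowercase, then strip punctuation (replaced by a space here).
    static String normalize(String inTerm) {
        return inTerm.toLowerCase(Locale.ROOT)
                     .replaceAll("\\p{Punct}", " ")
                     .trim();
    }

    // Returns true when the term is trapped as a number, i.e. it is not a valid multiword.
    static boolean isNumber(String inTerm) {
        String norm = normalize(inTerm);
        if (norm.isEmpty()) {
            return false;
        }
        boolean sawNumberWord = false;
        for (String word : norm.split("\\s+")) {
            if (NUMBER_WORDS.contains(word)) {
                sawNumberWord = true;
            } else if (!CONNECTIVES.contains(word)) {
                return false; // any other word means the term is not a number
            }
        }
        return sawNumberWord; // a bare connective alone is not a number
    }

    public static void main(String[] args) {
        String[] terms = {"five", "Fifth", "five hundred and fifty-five", "five patients"};
        for (String term : terms) {
            System.out.println(term + " -> " + (isNumber(term) ? "trapped" : "passed"));
        }
    }
}
```

    Under these assumptions the first three terms are trapped and "five patients" passes, because "patients" is neither a number word nor a connective.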

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${IN_DATA}/inData/NRVAR
    • Result:

      Lexicon   Filter          Sample No   Pass No   Trap No   Exp No   Pass-Rate
      2018      FT_LEX_NUMBER   955564      955521    43        0        99.9955%
      2017      FT_LEX_NUMBER   935276      935233    43        0        99.9954%
      2016      FT_LEX_NUMBER   915583      915540    43        0        99.9953%
      2015      FT_LEX_NUMBER   896213      896170    43        0        99.9952%
      2014      FT_LEX_NUMBER   875090      875049    41        0        99.9953%

      The 41 (2014) or 43 (2015-2018) terms trapped by this filter are numbers that are recorded in the Lexicon.
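
      The reported pass rates are consistent with dividing the number of passed terms by the sample size. The short sketch below reproduces that arithmetic for two of the rows; the class and method names (PassRateSketch, passRate, format) are hypothetical, and the formula is inferred from the figures rather than taken from the test code.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the pass rate from the reported counts.
class PassRateSketch {
    // Pass rate in percent: (sample - trapped) / sample * 100.
    static double passRate(long sampleNo, long trapNo) {
        long passNo = sampleNo - trapNo;
        return 100.0 * passNo / sampleNo;
    }

    // Formats the rate with four decimal places, matching the reported precision.
    static String format(double rate) {
        return String.format(Locale.ROOT, "%.4f%%", rate);
    }

    public static void main(String[] args) {
        System.out.println("2018: " + format(passRate(955564, 43))); // 2018: 99.9955%
        System.out.println("2014: " + format(passRate(875090, 41))); // 2014: 99.9953%
    }
}
```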