Exclusive Filter: A Term Leads with a Valid-Lead-Term and Matches the Pattern of No SpVar (VLTP)

  • Description:
    If a term leads with a valid-lead-term and no spelling variant of it co-exists in the n-gram set, it is not a valid multiword. Such terms are filtered out of the MEDLINE n-gram set. The spelling-variant patterns are hyphen (under floor|under-floor), non-space (under floor|underfloor), case (a stage resin|A stage resin), and combinations of the above (a stage resin|A-stage resin).
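
    The four spelling-variant patterns can be illustrated with a minimal sketch. It assumes that two distinct terms are variants when they become identical after hyphens and spaces are removed and case is folded; the class and method names (SpVarKey, normSpVar, isSpVar) are hypothetical and are not taken from the project source.

```java
import java.util.Locale;

class SpVarKey {
    // Hypothetical helper: collapse hyphen, space, and case differences into one key.
    static String normSpVar(String term) {
        return term.replaceAll("[-\\s]", "").toLowerCase(Locale.ROOT);
    }

    // Two distinct surface forms count as spelling variants when their keys match.
    static boolean isSpVar(String a, String b) {
        return !a.equals(b) && normSpVar(a).equals(normSpVar(b));
    }

    public static void main(String[] args) {
        System.out.println(isSpVar("under floor", "under-floor"));     // hyphen: true
        System.out.println(isSpVar("under floor", "underfloor"));      // non-space: true
        System.out.println(isSpVar("a stage resin", "A stage resin")); // case: true
        System.out.println(isSpVar("a stage resin", "A-stage resin")); // combination: true
        System.out.println(isSpVar("under floor", "under roof"));      // different terms: false
    }
}
```

    Note that this sketch always folds case, which is a simplification: per the filter algorithm, a term is lowercased only when the valid-lead-term is not upper case.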

  • Examples:
    • within type I
    • after a surgery
    • for a policy

    The valid-lead-terms are derived from the Lexicon. Some lead-terms from the invalid lead-end-term candidate list are valid-lead-terms and are checked against the spVar pattern, such as "W", "many", "out of", etc. N-grams that start with any of these valid-lead-terms and have no spelling variant co-existing in the n-gram set are most likely not valid multiwords. In 2014, the program found 63 valid-lead-terms. 11 of them were removed, so only 52 valid-lead-terms are used for the pattern of no spVar. Two wrong lexRecords were associated with "the", so "the" was moved to the absolute invalid lead-terms. The terms "ex", "inside", "last", "most", "only", "per", "round", "sersu", "v.", and "w" have valid MWEs in the Lexicon without spVar and were therefore removed as well. Please refer to the design documents of Lead-Term Types for details.

  • Input Term: core-term
  • Filter Algorithm:
    • Logics:

      Description: Get invalid terms
      FilterType: FT_TBD
      Notes:
      • LeadTermPatObj.java
        • HashSet<String> leadTermSet_
        • HashSet<String> natureTermSet_
        • HashSet<String> orgTermSet_
      • Collect all terms that start with a valid-lead-term
      • Save all terms with the same normSpVar (hyphenated or non-spaced)
        • lowercase the term if the valid-lead-term is not upper case
        • get the nature-term (strip lead-end punctuation, for example: "- in details," => "in details")
        • save to a hashMap(normSpVar, LeadTermPatObj) if the nature-term starts with a valid-lead-term
      • Convert hashMap(normSpVar, LeadTermPatObj) to the invalid term list
        • find each LeadTermPatObj that matches the valid-lead-term pattern
        • no spVar exists (checks LeadTermPatObj.natureTermSet_.size() <= 1)
        • not an invalid-lead-term candidate

      Description: Check if an invalid term
      FilterType: FT_LEAD_TERM_INV_PAT
      Notes: Use the invalid-term list from the above step
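
      The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in FilterLeadTermPat.java: only the three set names come from LeadTermPatObj.java; the class LeadTermPatSketch and all method names are hypothetical. It further assumes that the map is built with a plain string-prefix test, that the lead-term pattern requires the lead-term to be followed by a space, and that the candidate check looks up the normSpVar key.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LeadTermPatSketch {
    // Mirrors the three sets listed for LeadTermPatObj.java; everything else is assumed.
    static class LeadTermPatObj {
        final Set<String> leadTermSet = new HashSet<>();
        final Set<String> natureTermSet = new HashSet<>();
        final Set<String> orgTermSet = new HashSet<>();
    }

    // Strip leading/trailing punctuation and spaces: "- in details," -> "in details".
    static String natureTerm(String term) {
        return term.replaceAll("^[\\p{Punct}\\s]+|[\\p{Punct}\\s]+$", "");
    }

    // Key shared by spaced, hyphenated, and non-spaced forms.
    static String normSpVar(String term) {
        return term.replaceAll("[-\\s]", "");
    }

    static boolean isUpperCase(String s) {
        return s.equals(s.toUpperCase(Locale.ROOT)) && !s.equals(s.toLowerCase(Locale.ROOT));
    }

    // Step 1: group terms by normSpVar when the nature-term starts with a valid-lead-term.
    static Map<String, LeadTermPatObj> buildMap(Collection<String> terms, Set<String> validLeadTerms) {
        Map<String, LeadTermPatObj> map = new HashMap<>();
        for (String org : terms) {
            for (String lead : validLeadTerms) {
                String nature = natureTerm(org);
                // Lowercase the term only if the valid-lead-term is not upper case.
                if (!isUpperCase(lead)) nature = nature.toLowerCase(Locale.ROOT);
                if (!nature.startsWith(lead)) continue;
                LeadTermPatObj obj = map.computeIfAbsent(normSpVar(nature), k -> new LeadTermPatObj());
                obj.leadTermSet.add(lead);
                obj.natureTermSet.add(nature);
                obj.orgTermSet.add(org);
            }
        }
        return map;
    }

    // Assumed pattern: some nature-term is a lead-term followed by a space.
    static boolean matchesLeadTerm(LeadTermPatObj obj) {
        for (String lead : obj.leadTermSet)
            for (String nature : obj.natureTermSet)
                if (nature.startsWith(lead + " ")) return true;
        return false;
    }

    // Step 2: invalid = matches lead-term, no spVar, not an invalid-lead-term candidate.
    static Set<String> invalidTerms(Collection<String> terms, Set<String> validLeadTerms,
                                    Set<String> candidates) {
        Set<String> invalid = new TreeSet<>();
        for (Map.Entry<String, LeadTermPatObj> e : buildMap(terms, validLeadTerms).entrySet()) {
            LeadTermPatObj obj = e.getValue();
            boolean hasSpVar = obj.natureTermSet.size() > 1;
            boolean isCandidate = candidates.contains(e.getKey());
            if (matchesLeadTerm(obj) && !hasSpVar && !isCandidate) invalid.addAll(obj.orgTermSet);
        }
        return invalid;
    }

    public static void main(String[] args) {
        // Illustrative lead-terms and candidates only; the real lists come from the data files.
        Set<String> leads = new HashSet<>(Arrays.asList("in", "after"));
        Set<String> candidates = new HashSet<>(Arrays.asList("into"));
        System.out.println(invalidTerms(Arrays.asList(
            "in particular", "in-particular", "inparticular", "In conclusion",
            "in conclusion,", "in to", "internal", "after pressure"), leads, candidates));
    }
}
```

      With these illustrative inputs, the sketch flags "In conclusion", "in conclusion," and "after pressure". The "in particular" group is kept because three nature-terms share one key, "in to" is kept because its key is in the assumed candidate set, and "internal" is kept because "in" is not followed by a space.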

    • source code: FilterLeadTermPat.java
    • FilterType: FilterType.FT_LEAD_TERM_INV_PAT

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
      • ${OUT_DATA}/03.LeadEndTerm/validLeadTerms.data.pat
      • ${OUT_DATA}/03.LeadEndTerm/invalidLeadEndTermCandidates.data
    • Result:

      Lexicon  Filter                Sample No  Pass No  Trap No  Exp No  Pass-Rate
      2018     FT_LEAD_TERM_INV_PAT  955564     955476   88       0       99.9908%
      2017     FT_LEAD_TERM_INV_PAT  935276     935192   84       0       99.9910%
      2016     FT_LEAD_TERM_INV_PAT  915583     915503   80       0       99.9913%
      2015     FT_LEAD_TERM_INV_PAT  896213     896120   93       0       99.9896%
      2014     FT_LEAD_TERM_INV_PAT  875090     874991   99       0       99.9887%
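
      The figures in the table are consistent with a pass rate of passed terms divided by sampled terms, with the trapped count as the difference between the two. A small sketch of that arithmetic, assuming this reading of the columns (the class PassRate and its methods are hypothetical):

```java
import java.util.Locale;

class PassRate {
    // Trapped terms: sampled terms that did not pass the filter.
    static long trapNo(long sampleNo, long passNo) {
        return sampleNo - passNo;
    }

    // Pass rate as a percentage with four decimals, as shown in the table.
    static String passRate(long sampleNo, long passNo) {
        return String.format(Locale.ROOT, "%.4f%%", 100.0 * passNo / sampleNo);
    }

    public static void main(String[] args) {
        System.out.println(trapNo(955564, 955476));   // 88
        System.out.println(passRate(955564, 955476)); // 99.9908%
        System.out.println(passRate(875090, 874991)); // 99.9887%
    }
}
```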

  • Example Walk Through (invalid terms):

    Operations | Contents
    Inputs (21):
    • Test

    • in particular
    • in-particular
    • inparticular
    • In particular
    • IN PARTICULAR
    • -in particular
    • in particular),
    • - in particular,

    • in conclusion
    • In conclusion
    • IN CONCLUSION
    • in conclusion,
    • -in conclusion,

    • in to
    • internal
    • all in all
    • after pressure
    • on-board imager
    • out of kilter
    • one gene-one enzyme hypothesis

    1. Form HashMap
    => Please note:
    "Test" is not in the HashMap because it does not match any valid-lead-term
    Some terms have multiple lead-term matches, such as "on-board imager"

    2. Check:
    2.1 Match Lead-Term
    2.2 Has SpVar
    2.3 Is ILET

    3. Invalid Term:

    • Matches a lead-term (2.1: true)
    • Has no SpVar (2.2: false)
    • Is not an ILET (2.3: false)

    HashMap entries: key (normSpVar) with value (LeadTermPatObj), followed by the check results
    (2.1 Match Lead-Term | 2.2 Has SpVar | 2.3 Is ILET | Valid?)

    • internal
      • lead-terms: in
      • nature-terms: internal
      • org-terms: internal
      => 2.1: false | 2.2: n/a | 2.3: n/a | valid
    • allinall
      • lead-terms: all, a
      • nature-terms: all in all
      • org-terms: all in all
      => 2.1: true | 2.2: false | 2.3: false | invalid
    • innonspace
      • lead-terms: in
      • nature-terms: innonspace, in nonspace
      • org-terms: innonspace, in nonspace
      => 2.1: true | 2.2: true | 2.3: n/a | valid
    • INPARTICULAR
      • lead-terms: I
      • nature-terms: IN PARTICULAR
      • org-terms: IN PARTICULAR
      => 2.1: false | 2.2: n/a | 2.3: n/a | valid
    • inhyphen
      • lead-terms: in
      • nature-terms: in hyphen, in-hyphen
      • org-terms: in hyphen, in-hyphen
      => 2.1: true | 2.2: true | 2.3: n/a | valid
    • INCONCLUSION
      • lead-terms: I
      • nature-terms: IN CONCLUSION
      • org-terms: IN CONCLUSION
      => 2.1: false | 2.2: n/a | 2.3: n/a | valid
    • onegeneoneenzymehypothesis
      • lead-terms: one, on
      • nature-terms: one gene-one enzyme hypothesis
      • org-terms: one gene-one enzyme hypothesis
      => 2.1: true | 2.2: false | 2.3: false | invalid
    • afterpressure
      • lead-terms: a, after
      • nature-terms: after pressure
      • org-terms: after pressure
      => 2.1: true | 2.2: false | 2.3: false | invalid
    • into
      • lead-terms: in
      • nature-terms: in to
      • org-terms: in to
      => 2.1: true | 2.2: false | 2.3: true | valid
    • outofkilter
      • lead-terms: out of, out
      • nature-terms: out of kilter
      • org-terms: out of kilter
      => 2.1: true | 2.2: false | 2.3: false | invalid
    • inconclusion
      • lead-terms: in
      • nature-terms: in conclusion
      • org-terms: in conclusion, IN CONCLUSION, In conclusion
      => 2.1: true | 2.2: false | 2.3: false | invalid
    • Inparticular
      • lead-terms: I
      • nature-terms: In particular
      • org-terms: In particular
      => 2.1: false | 2.2: n/a | 2.3: n/a | valid
    • Inconclusion
      • lead-terms: I
      • nature-terms: In conclusion
      • org-terms: In conclusion
      => 2.1: false | 2.2: n/a | 2.3: n/a | valid
    • onboardimager
      • lead-terms: on-board, on
      • nature-terms: on-board imager
      • org-terms: on-board imager
      => 2.1: true | 2.2: false | 2.3: false | invalid
    • inparticular
      • lead-terms: in
      • nature-terms: in particular, in-particular, inparticular
      • org-terms: In particular, in particular, in-particular, inparticular, IN PARTICULAR
      => 2.1: true | 2.2: true | 2.3: n/a | valid
    • inother
      • lead-terms: in
      • nature-terms: in-other, inother
      • org-terms: in-other, inother
      => 2.1: false | 2.2: n/a | 2.3: n/a | valid

    Invalid-Terms (10):

    • in conclusion
    • In conclusion
    • IN CONCLUSION
    • in conclusion,
    • -in conclusion,

    • after pressure
    • all in all
    • on-board imager
    • one gene-one enzyme hypothesis
    • out of kilter