Exclusive Filter: A Term Contains a Parenthetic Acronym Pattern

  • Description:
    If a term contains a parenthetic acronym, it is not a valid multiword. These terms are filtered out from the MEDLINE n-gram set.

  • Examples:
    • the basement membrane zone (BMZ)
    • basement membrane zone (BMZ)
    • membrane zone (BMZ)
    • zone (BMZ)
    • zone (BMZ) of
    • membrane zone (BMZ) of
    • basement membrane zone (BMZ) of
    • zinc finger proteins (ZFPs)

    A term matches the parenthetic acronym pattern (ACR) only if it meets all of the following criteria:

    • The acronym must be enclosed in left and right parentheses: ( .... )
    • The acronym must consist only of uppercase letters (A-Z) and digits (0-9)
    • The acronym must have more than one character
    • The acronym can't be all digits
    • The last letter of the acronym must be the same as the initial letter of its previous word (e.g., the Z of BMZ and the z of zone)

    • The plural form (ACRs) is also included
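
    The criteria above can be sketched as a single regular expression plus one letter comparison. This is only an illustration, not the code in FilterParentheticAcronym.java: the class name ParentheticAcronymPattern, the method containsParentheticAcronym, and the regular expression are hypothetical. It also assumes that words are separated by single spaces and that the initial-letter comparison ignores case, as the "zone (BMZ)" example suggests.

```java
// Hypothetical sketch; names and regex are illustrative assumptions,
// not taken from FilterParentheticAcronym.java.
class ParentheticAcronymPattern {
    // Group 1: the word before the parenthesis.
    // Group 2: the acronym in singular form (an optional plural "s" is left outside the group).
    // The negative lookahead rejects content that is all digits.
    private static final java.util.regex.Pattern PAR_ACR =
        java.util.regex.Pattern.compile(
            "(\\S+) \\((?![0-9]+s?\\))([A-Z0-9]{2,})s?\\)");

    // Returns true if the term contains "(ACR)" or "(ACRs)" whose last
    // character matches the initial of the previous word.
    static boolean containsParentheticAcronym(String term) {
        java.util.regex.Matcher m = PAR_ACR.matcher(term);
        while (m.find()) {
            // Assumption: the comparison is case-insensitive.
            char initial = Character.toUpperCase(m.group(1).charAt(0));
            String acr = m.group(2);
            if (acr.charAt(acr.length() - 1) == initial) {
                return true;
            }
        }
        return false;
    }
}
```

    Under these assumptions, containsParentheticAcronym("zinc finger proteins (ZFPs)") returns true, while "copper (II) complex" and "yttrium (90)" return false.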

  • Filter Algorithm:
    • Logic:

      Description | FilterType | Notes
      Get wordList | FT_TBD |
      Go through wordList and find ACR and its previous word | FT_TBD |
      • check left and right parentheses
      • get ACR from the content between the parentheses
      • get the previous word of (ACR)
      Get ACR singular form from (ACRs) | FT_TBD |
      Check if no ACR found (null) | FT_PAR_ACR_NO_PAR |
      • basement membrane zone
      Check if ACR is not uppercase | FT_PAR_ACR_NOT_UPPERCASE |
      • interleukin 12 (p40)
      • hemolymph stimulatory factor (Manduca)
      • regulatory subunit 1 (alpha)
      Check if ACR has only one char | FT_PAR_ACR_SINGLE_CHAR |
      • DeltaPsi (M)
      • antipoly (A) antibody
      • recombinant apolipoprotein (a)
      Check if ACR is all digits | FT_PAR_ACR_ALL_DIGI |
      • yttrium (90)
      • β (2) microglobulin
      • anteiso-C (17:0)
      Check if ACR has no previous word | FT_PAR_ACR_NO_PREV_WORD |
      • (BMZ) of
      Check if last char of ACR is not the same as initial of previous word | FT_PAR_ACR_NOT_ACR |
      • copper (II) complex
      • T3 (RIA)
      • Plasmodium berghei (ANKA)
      Check if last char of ACR is the same as initial of previous word | FT_PAR_ACR |
      • membrane zone (BMZ) of
      • zone (BMZ) of

    • source code: FilterParentheticAcronym.java
    • FilterType: FilterType.FT_PAR_ACR
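
    The checks listed in the filter algorithm above can be sketched as a step-by-step classifier. This is a minimal illustration, not the actual FilterParentheticAcronym.java: the class ParentheticAcronymFilterSketch and the method classify are hypothetical, and only the FilterType names come from this page. It assumes whitespace tokenization, treats "not uppercase" as "contains anything other than A-Z or 0-9", and compares initials case-insensitively. When a term fails more than one check, such as "(a)" or "(2)", the sketch reports the first failing check in the listed order, which may differ from the label the real filter assigns.

```java
// Hypothetical sketch of the listed checks; not the actual source code.
class ParentheticAcronymFilterSketch {
    // Outcome names copied from the FilterType column on this page.
    enum FilterType {
        FT_PAR_ACR_NO_PAR, FT_PAR_ACR_NOT_UPPERCASE, FT_PAR_ACR_SINGLE_CHAR,
        FT_PAR_ACR_ALL_DIGI, FT_PAR_ACR_NO_PREV_WORD, FT_PAR_ACR_NOT_ACR,
        FT_PAR_ACR
    }

    static FilterType classify(String term) {
        // Get wordList (assumption: split on whitespace).
        String[] wordList = term.trim().split("\\s+");
        String acr = null;
        String prevWord = null;
        // Find the first word wrapped in parentheses and its previous word.
        for (int i = 0; i < wordList.length; i++) {
            String word = wordList[i];
            if (word.length() > 2 && word.startsWith("(") && word.endsWith(")")) {
                acr = word.substring(1, word.length() - 1);
                prevWord = (i > 0) ? wordList[i - 1] : null;
                break;
            }
        }
        if (acr == null) {
            return FilterType.FT_PAR_ACR_NO_PAR;
        }
        // Get the singular form from (ACRs) by dropping a trailing "s".
        if (acr.length() > 1 && acr.endsWith("s")) {
            acr = acr.substring(0, acr.length() - 1);
        }
        if (!acr.matches("[A-Z0-9]+")) {
            return FilterType.FT_PAR_ACR_NOT_UPPERCASE;
        }
        if (acr.length() == 1) {
            return FilterType.FT_PAR_ACR_SINGLE_CHAR;
        }
        if (acr.matches("[0-9]+")) {
            return FilterType.FT_PAR_ACR_ALL_DIGI;
        }
        if (prevWord == null) {
            return FilterType.FT_PAR_ACR_NO_PREV_WORD;
        }
        // Compare the last char of ACR with the initial of the previous word.
        char last = Character.toLowerCase(acr.charAt(acr.length() - 1));
        char initial = Character.toLowerCase(prevWord.charAt(0));
        return (last == initial)
            ? FilterType.FT_PAR_ACR
            : FilterType.FT_PAR_ACR_NOT_ACR;
    }
}
```

    Under these assumptions, classify("zone (BMZ) of") returns FT_PAR_ACR, so the term would be filtered out, while classify("copper (II) complex") returns FT_PAR_ACR_NOT_ACR.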

  • Accuracy Test on Lexicon:
    • InFile:
      • ${OUT_DATA}/03.LeadEndTerm/lexWords.data
    • Result:

      Lexicon | Filter | Sample No | Pass No | Trap No | Exp No | Pass-Rate
      2018 | FT_PAR_ACR | 955564 | 955563 | 1 | 258 | 99.9999%
      2017 | FT_PAR_ACR | 935276 | 935275 | 1 | 225 | 99.9999%
      2016 | FT_PAR_ACR | 915583 | 915583 | 0 | 204 | 100.0000%
      2015 | FT_PAR_ACR | 896213 | 896213 | 0 | 182 | 100.0000%
      2014 | FT_PAR_ACR | 875090 | 875090 | 0 | 171 | 100.0000%