Rule-Types

I. Introduction
The software assigns each input term (from the n-grams) to one of a set of predefined rule types. The format is: RT_CAT_PATTERN

  • RT: RuleType
  • CAT:
    • LEX: in Lexicon
    • CAN: candidate
    • INV: invalid multiwords
    • TBD: to be determined
  • PATTERN:
    • EM, LC, LE_PUNC, LC_LE_PUNC, ALL_PUNC, LC_ALL_PUNC, NUMBER, DIGIT
    • SPVAR
    • SINGLE_WORD
    • LEAD_WORD
    • END_ABB
    • END_WORD
    • LEAD_END_WORD
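
As a rough illustration of the RT_CAT_PATTERN naming format, the sketch below splits a rule-type name into its category and pattern parts. The class RuleTypeNameSketch and its parse method are hypothetical helpers written for this page; they are not taken from RuleType.java.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of RuleType.java): splits a rule-type name
// such as "RT_LEX_LC_LE_PUNC" into category "LEX" and pattern "LC_LE_PUNC".
class RuleTypeNameSketch {
    // The four categories listed above.
    static final List<String> CATEGORIES = Arrays.asList("LEX", "CAN", "INV", "TBD");

    // Returns {category, pattern}; the pattern is "" when the name has none (RT_TBD).
    static String[] parse(String name) {
        if (!name.startsWith("RT_")) {
            throw new IllegalArgumentException("Not a rule-type name: " + name);
        }
        String rest = name.substring(3);              // drop the "RT_" prefix
        int sep = rest.indexOf('_');                  // first underscore ends the category
        String cat = (sep < 0) ? rest : rest.substring(0, sep);
        String pattern = (sep < 0) ? "" : rest.substring(sep + 1);
        if (!CATEGORIES.contains(cat)) {
            throw new IllegalArgumentException("Unknown category: " + cat);
        }
        return new String[] {cat, pattern};
    }

    public static void main(String[] args) {
        String[] parts = parse("RT_LEX_LC_LE_PUNC");
        System.out.println(parts[0] + " / " + parts[1]);  // prints "LEX / LC_LE_PUNC"
    }
}
```

Only the first underscore after the prefix is treated as a separator, because patterns such as LC_LE_PUNC contain underscores themselves.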

II. RuleTypes - Examples

The following rule types are used to filter out invalid multiwords from the MEDLINE n-grams (in the same sequential order as in the program RuleType.java):

Type | Description | Examples (with element word "mellitus")
Default: valid candidate multiword
RT_TBD | Candidate multiwords
  • mellitus
  • diabetes mellitus
If the whole n-gram (input term) is in the Lexicon
RT_LEX_EM | Exact match
  • diabetes mellitus
  • insulin-dependent diabetes mellitus
RT_LEX_LC | Match (after lowercasing)
  • DIABETES MELLITUS
  • Insulin-dependent diabetes mellitus
RT_LEX_LE_PUNC | Match (after removing punctuation from the lead and/or end words)
  • diabetes mellitus,
  • (diabetes mellitus,
  • diabetes mellitus),
RT_LEX_LC_LE_PUNC | Match (after lowercasing and removing punctuation from the lead and/or end words)
  • [Diabetes mellitus
  • DIABETES MELLITUS]
  • [Diabetes mellitus]
RT_LEX_ALL_PUNC | Match (after removing all punctuation)
  • diabetes mellitus -
RT_LEX_LC_ALL_PUNC | Match (after lowercasing and removing all punctuation)
  • DIABETES MELLITUS -
RT_LEX_NUMBER | Number (after lowercasing and removing all punctuation)
  • fifty
  • , Fifty
RT_LEX_DIGIT | All digits (after lowercasing and removing all punctuation)
  • 50
  • , 50
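
The six Lexicon-matching patterns above (EM through LC_ALL_PUNC) can be read as a cascade of progressively looser normalizations. The sketch below is one possible reading, not the code in RuleType.java. It assumes that the first matching step wins, that "punctuation" means ASCII punctuation, and that a token made up only of punctuation is left for the ALL_PUNC steps; the last assumption is made so that the example "diabetes mellitus -" lands on RT_LEX_ALL_PUNC. The two-entry LEXICON set is a hypothetical stand-in for the real Lexicon, and the NUMBER and DIGIT types are left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the Lexicon-matching cascade (not RuleType.java itself).
class LexiconMatchSketch {
    // Stand-in for the Lexicon: an assumption for illustration only.
    static final Set<String> LEXICON = new HashSet<>(Arrays.asList(
        "diabetes mellitus", "insulin-dependent diabetes mellitus"));

    // Strips punctuation attached to the front of the lead word and the back of
    // the end word. A token that is nothing but punctuation is kept unchanged.
    static String stripLeadEnd(String term) {
        String[] tokens = term.trim().split("\\s+");
        int last = tokens.length - 1;
        String lead = tokens[0].replaceFirst("^\\p{Punct}+", "");
        if (!lead.isEmpty()) tokens[0] = lead;
        String end = tokens[last].replaceFirst("\\p{Punct}+$", "");
        if (!end.isEmpty()) tokens[last] = end;
        return String.join(" ", tokens);
    }

    // Removes every ASCII punctuation character, then tidies the whitespace.
    static String stripAll(String term) {
        return term.replaceAll("\\p{Punct}", "").trim().replaceAll("\\s+", " ");
    }

    // Tries each normalization in the documented order; the first hit wins (assumption).
    static String classify(String term) {
        String lc = term.toLowerCase(Locale.ROOT);
        if (LEXICON.contains(term)) return "RT_LEX_EM";
        if (LEXICON.contains(lc)) return "RT_LEX_LC";
        if (LEXICON.contains(stripLeadEnd(term))) return "RT_LEX_LE_PUNC";
        if (LEXICON.contains(stripLeadEnd(lc))) return "RT_LEX_LC_LE_PUNC";
        if (LEXICON.contains(stripAll(term))) return "RT_LEX_ALL_PUNC";
        if (LEXICON.contains(stripAll(lc))) return "RT_LEX_LC_ALL_PUNC";
        return "RT_TBD";  // default: still a candidate
    }

    public static void main(String[] args) {
        for (String t : new String[] {"diabetes mellitus", "[Diabetes mellitus]", "DIABETES MELLITUS -"}) {
            System.out.println(t + " -> " + classify(t));
        }
    }
}
```

Under these assumptions the sketch reproduces the type shown for each example term in the six rows above.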
If the n-gram (input term) is not a valid multiword
RT_INV_SINGLE_WORD | A single word (unigram)
  • mellitus
  • Mellitus
RT_INV_LEAD_WORD | Beginning with a nonLead word
  • Auxiliary (3) - be, do, have, are, don't, has, etc.
  • Complementizer (1) - that
  • Conjunction (67) - and, or, but, as if, as well as, and/or, etc.
  • Determiner (38) - a, all, the, some, each, which, etc.
  • Modal (8) - can, dare, may, must, ought, shall, will, need, might, etc.
  • Preposition (216) - about, across from, to, on, in, at, by, as far as, etc.
  • have diabetes mellitus
  • that diabetes mellitus
  • or diabetes mellitus
  • which diabetes mellitus
  • may diabetes mellitus
  • of diabetes mellitus
RT_INV_END_ABB | Ending word is an acronym in parentheses
  • mellitus (DM)
  • mellitus (DM),
RT_INV_END_WORD | Ending with a nonEnd word
  • Auxiliary (3) - be, do, have, are, don't, has, etc.
  • Complementizer (1) - that
  • Conjunction (67) - and, or, but, as if, as well as, and/or, etc.
  • Determiner (38) - a, all, the, some, each, which, etc.
  • Modal (8) - can, dare, may, must, ought, shall, will, need, might, etc.
  • Preposition (216) - about, across from, to, on, in, at, by, as far as, etc.
  • diabetes mellitus have
  • diabetes mellitus that
  • diabetes mellitus or
  • diabetes mellitus which
  • diabetes mellitus may
  • diabetes mellitus in
RT_INV_LEAD_END_WORD | Bigram composed of nonLead and nonEnd words
  • as if
  • across from
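
Several of the invalid-multiword checks above can be sketched as simple predicates over the whitespace-separated tokens of an n-gram. The code below is illustrative only. The FUNCTION_WORDS set is a small sample taken from the single-word examples listed above (the real lists are much longer and include multiword entries), the acronym test assumes an acronym is a run of uppercase letters in parentheses, optionally followed by punctuation, and all class and method names are hypothetical. The predicates are kept separate and make no claim about how RuleType.java combines or orders them.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical predicates for some RT_INV_* rule types (not RuleType.java itself).
class InvalidMultiwordSketch {
    // Small sample of nonLead/nonEnd words from the examples above (assumption:
    // the same sample is used for both lists, single-token entries only).
    static final Set<String> FUNCTION_WORDS = new HashSet<>(Arrays.asList(
        "be", "do", "have", "are", "has", "that", "and", "or", "but",
        "a", "all", "the", "some", "each", "which", "can", "may", "must",
        "about", "to", "on", "in", "at", "by", "of"));

    static String[] tokens(String ngram) {
        return ngram.trim().split("\\s+");
    }

    // RT_INV_SINGLE_WORD: the n-gram has exactly one token.
    static boolean isSingleWord(String ngram) {
        return tokens(ngram).length == 1;
    }

    // RT_INV_LEAD_WORD: the first token is a nonLead word.
    static boolean startsWithNonLeadWord(String ngram) {
        return FUNCTION_WORDS.contains(tokens(ngram)[0].toLowerCase(Locale.ROOT));
    }

    // RT_INV_END_WORD: the last token is a nonEnd word.
    static boolean endsWithNonEndWord(String ngram) {
        String[] t = tokens(ngram);
        return FUNCTION_WORDS.contains(t[t.length - 1].toLowerCase(Locale.ROOT));
    }

    // RT_INV_END_ABB: the last token looks like "(DM)" or "(DM),".
    static boolean endsWithParenthesizedAcronym(String ngram) {
        String[] t = tokens(ngram);
        return t[t.length - 1].matches("\\([A-Z]+\\)\\p{Punct}*");
    }

    public static void main(String[] args) {
        System.out.println(startsWithNonLeadWord("of diabetes mellitus"));
        System.out.println(endsWithNonEndWord("diabetes mellitus in"));
        System.out.println(endsWithParenthesizedAcronym("mellitus (DM),"));
    }
}
```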
If the n-gram has spelling variants
RT_CAN_SPVAR | A group of n-grams matching a spelling-variant pattern; considered multiword candidates
  • noninsulin dependent diabetes mellitus
  • non-insulin dependent diabetes mellitus
  • non insulin-dependent diabetes mellitus
  • non insulin dependent diabetes mellitus
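
The page does not define the spelling-variant pattern itself. As a rough illustration only, the sketch below assumes that variants differ solely in whether words are hyphenated, spaced, or closed up, as in the four examples above. It groups n-grams by a key with hyphens and whitespace removed. The class SpVarGroupSketch and its methods are hypothetical, and the real pattern may well be stricter.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical grouping of spelling variants (not the pattern used by RuleType.java).
class SpVarGroupSketch {
    // Assumed key: lowercase the n-gram and drop hyphens and whitespace.
    static String key(String ngram) {
        return ngram.toLowerCase(Locale.ROOT).replaceAll("[-\\s]+", "");
    }

    // Groups n-grams by key and keeps only groups with at least two members.
    static Map<String, List<String>> group(List<String> ngrams) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String ngram : ngrams) {
            groups.computeIfAbsent(key(ngram), k -> new ArrayList<>()).add(ngram);
        }
        groups.values().removeIf(members -> members.size() < 2);
        return groups;
    }
}
```

With this key, the four example n-grams above fall into a single group, while an n-gram that has no variant in the input is dropped. Removing every space is a deliberately coarse choice that could over-merge unrelated terms.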