Multiwords: Normalization

I. Why Normalization?

The same term can appear in many different forms (varying in genitive, punctuation, and case) in MEDLINE. For example, "diabetes mellitus" appears in the following n-gram terms from MEDLINE:

  • diabetes mellitus
  • diabetes mellitus,
  • diabetes mellitus]
  • diabetes mellitus:
  • diabetes mellitus.
  • [diabetes mellitus
  • diabetes mellitus)
  • (diabetes mellitus
  • (diabetes mellitus,
  • diabetes mellitus),
  • (diabetes mellitus;
  • diabetes mellitus?]
  • (diabetes mellitus)
  • diabetes mellitus -

  • Diabetes mellitus
  • Diabetes Mellitus

  • Diabetes mellitus,
  • Diabetes mellitus.
  • [Diabetes mellitus
  • [Diabetes Mellitus:
  • [Diabetes mellitus]
  • Diabetes Mellitus:
  • Diabetes Mellitus,

Normalization (abstracting away from genitive, punctuation, and case) is applied to n-gram terms so that these variants can be grouped for further review and analysis. In addition, the word count (WC) of a normalized n-gram term reflects the true frequency of usage of that term, because the counts of all its variants are combined.

II. Normalization

  • Normalization uses the following Lexical Tools flow components:
    • -f:g (remove genitive)
    • -f:o (replace punctuation with space)
    • -f:l (lowercase)
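
The three steps above can be approximated in a few lines of Python. This is only a minimal sketch for illustration, not the Lexical Tools implementation: the function names are hypothetical, the genitive rule handles only the simple "'s" and "s'" endings, punctuation is taken to be the ASCII set in `string.punctuation`, and the final whitespace cleanup is an added assumption so that variants compare equal.

```python
import re
import string

# Map every ASCII punctuation character to a space (assumed punctuation set).
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def remove_genitive(term):
    """Rough stand-in for -f:g: drop "'s" and the apostrophe in "s'"."""
    term = re.sub(r"(?<=\w)'s\b", "", term)
    return re.sub(r"(?<=s)'(?!\w)", "", term)


def replace_punctuation(term):
    """Rough stand-in for -f:o: replace punctuation with spaces."""
    return term.translate(_PUNCT_TABLE)


def normalize(term):
    """Apply the genitive, punctuation, and lowercase steps in order."""
    term = remove_genitive(term)
    term = replace_punctuation(term)
    term = term.lower()  # stand-in for -f:l
    # Assumption: collapse and trim whitespace left behind by punctuation.
    return " ".join(term.split())
```

With this sketch, variants such as "[Diabetes Mellitus:" and "(diabetes mellitus)," both reduce to "diabetes mellitus".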

III. Use of Normalization in Generating (Multi)words from N-grams

We used normalization as follows:

  • Use the word count (WC) of normalized terms in the prediction filter to generate high-frequency n-grams.
  • Candidate multiwords filtered from MEDLINE n-grams are grouped by normalized term. Both the normalized n-grams and the original n-grams are sent to linguists for review.
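
The grouping and counting described above might look like the following sketch. It is an illustration under stated assumptions rather than the actual pipeline: `normalize` is a simplified approximation of the -f:g, -f:o, and -f:l steps, and `group_by_normalized`, `high_frequency`, and the `min_wc` threshold are hypothetical names introduced here.

```python
import re
import string
from collections import defaultdict

_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(term):
    """Simplified approximation of the genitive, punctuation, and case steps."""
    term = re.sub(r"(?<=\w)'s\b", "", term)     # drop "'s"
    term = re.sub(r"(?<=s)'(?!\w)", "", term)   # drop the apostrophe in "s'"
    return " ".join(term.translate(_PUNCT_TABLE).lower().split())


def group_by_normalized(ngram_counts):
    """Group original n-grams by normalized term and sum their word counts."""
    groups = defaultdict(lambda: {"wc": 0, "originals": []})
    for ngram, wc in ngram_counts.items():
        entry = groups[normalize(ngram)]
        entry["wc"] += wc
        entry["originals"].append(ngram)
    return dict(groups)


def high_frequency(groups, min_wc):
    """Keep groups whose combined WC reaches a (hypothetical) threshold."""
    return {term: g for term, g in groups.items() if g["wc"] >= min_wc}
```

Each group keeps both the normalized term (the dictionary key) and the list of original n-grams, which mirrors what is sent for review. Because the word counts are summed per group, a term whose individual variants are each infrequent can still pass the threshold.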