Definition of word count

I. What is a word?

  • Words:
    A word has a part of speech, inflections, and a meaning.
  • Word boundary:
    Spaces (or tabs) are usually used as word boundaries in NLP.
  • Single words:
    A single word is a word that contains no space (or tab), such as "saw", "ice-cream", "clubfoot", and "club-foot".
  • Multiwords:
    A multiword is a word (it has a part of speech and a meaning) that contains a space, such as "ice cream", "club foot", and "drop-foot gait".
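
The single word / multiword distinction above can be sketched in a few lines of Python. This is only an illustration that assumes a term is already known to be a word; the function name `is_multiword` is hypothetical and is not part of any SPECIALIST tool.

```python
import re

def is_multiword(term: str) -> bool:
    """Return True if the term contains a space or tab (hypothetical helper)."""
    # Leading/trailing whitespace is ignored; only internal boundaries count.
    return re.search(r"[ \t]", term.strip()) is not None

# Example terms taken from the definitions above.
terms = ["saw", "ice-cream", "clubfoot", "club-foot",
         "ice cream", "club foot", "drop-foot gait"]

single_words = [t for t in terms if not is_multiword(t)]
multiwords = [t for t in terms if is_multiword(t)]

print(single_words)
print(multiwords)
```

Note that a hyphen is not treated as a boundary, so "ice-cream" and "club-foot" remain single words while "ice cream" and "club foot" are multiwords.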

II. Word count (how many words are in the SPECIALIST Lexicon?)

  • The word count is the number of distinct words.
  • By definition, a word has a part of speech, inflections, and a meaning.
    • saw|noun|E0054443 is a different word from saw|verb|E0054444 and saw|verb|E0055007.
    • Words that differ in category or inflection are counted as different words. However, a form with the inflection "base" can duplicate another inflection of the same word and should not be counted twice. This applies to the categories adj, adv, verb (also pres1p23p), aux, modal, and noun. The table below lists the duplicated cases for the inflection "base".
      Category       InflVar - Inflection      Unique word?   Notes
      compl (8)      base (1)                  true
      conj (16)      base (1)                  true
      det (32)       base (1)                  true
      prep (256)     base (1)                  true
      pron (512)     base (1)                  true
      adj (1)        base (1)                  false          positive (256) = base (1)
      adv (2)        base (1)                  false          positive (256) = base (1)
      verb (1024)    base (1)                  false          infinitive (1024) = base (1)
                     pres1p23p (262144)        false          infinitive (1024) = pres1p23p (262144)
      aux (4)        base (1)                  false          infinitive (1024) = base (1)
                     have - pres123p (2048)    false          have: infinitive (1024) = pres123p (2048)
      modal (64)     base (1)                  false          pres (2097152) = base (1)
      noun (128)     base (1)                  false          singular (512) = base (1), e.g. paper, fish, sheep
                                                              plural (8) = base (1), e.g. police, fish, sheep
  • For a corpus without category and inflection information, such as an n-gram set, only the spellings (forms) are taken into consideration.
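
The counting rules in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to produce the Lexicon statistics: the `Record` layout, the function names, the toy records, and the placeholder EUIs ("EUI-A" and so on) are all hypothetical. The sketch skips "base" records for the categories marked "false" in the table and skips pres1p23p for verbs; the special case for "have" (pres123p) is left out for brevity.

```python
from typing import Iterable, NamedTuple

class Record(NamedTuple):
    """One inflected form (hypothetical layout, not the Lexicon's format)."""
    spelling: str
    category: str
    inflection: str
    eui: str

# Categories whose "base" form duplicates another inflection (see the table).
BASE_IS_DUPLICATE = {"adj", "adv", "verb", "aux", "modal", "noun"}

def is_duplicate(rec: Record) -> bool:
    """Return True if the record repeats a form counted under another inflection."""
    if rec.inflection == "base" and rec.category in BASE_IS_DUPLICATE:
        return True
    # For verbs, pres1p23p duplicates the infinitive.
    if rec.category == "verb" and rec.inflection == "pres1p23p":
        return True
    return False

def count_words(records: Iterable[Record]) -> int:
    """Count distinct words when category and inflection are known."""
    return len({r for r in records if not is_duplicate(r)})

def count_spellings(records: Iterable[Record]) -> int:
    """Count distinct spellings only, as for an n-gram set."""
    return len({r.spelling for r in records})

# Toy records with placeholder EUIs; not actual Lexicon content.
records = [
    Record("saw", "noun", "base", "EUI-A"),
    Record("saw", "noun", "singular", "EUI-A"),
    Record("saw", "verb", "base", "EUI-B"),
    Record("saw", "verb", "infinitive", "EUI-B"),
    Record("saw", "verb", "pres1p23p", "EUI-B"),
    Record("saw", "verb", "past", "EUI-C"),
    Record("ice cream", "noun", "base", "EUI-D"),
    Record("ice cream", "noun", "singular", "EUI-D"),
    Record("the", "det", "base", "EUI-E"),
]

print(count_words(records))      # distinct words under the rules above
print(count_spellings(records))  # distinct spellings only
```

With these toy records, the det "base" record is kept because "base" is the unique form for that category, while the noun and verb "base" records are dropped as duplicates of singular and infinitive.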

III. Lexicon Stats