Definition of word count
I. What is a word?
A word has a part of speech, inflections, and a meaning.
- Word boundary:
Spaces (or tabs) are usually used as word boundaries in NLP.
- Single words:
A single word is a word that contains no space (or tab), such as "saw", "ice-cream", "clubfoot", and "club-foot".
- Multiwords:
A multiword is a word (it has a part of speech and a meaning) that includes a space, such as "ice cream", "club foot", and "drop-foot gait".
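The distinction above can be sketched in a few lines of Python. This is only an illustration, assuming that a space or tab inside the term is the sole criterion; the names BOUNDARY, is_multiword, and classify are hypothetical and not part of the SPECIALIST Lexicon tools.

```python
import re

# Assumption: a word boundary is a run of spaces or tabs, as described above.
BOUNDARY = re.compile(r"[ \t]+")


def is_multiword(term: str) -> bool:
    """Return True if the term contains a space or tab between its parts."""
    # Leading and trailing boundaries are ignored; only internal ones count.
    return BOUNDARY.search(term.strip(" \t")) is not None


def classify(term: str) -> str:
    """Label a term as a single word or a multiword."""
    return "multiword" if is_multiword(term) else "single word"


if __name__ == "__main__":
    examples = ["saw", "ice-cream", "clubfoot", "club-foot",
                "ice cream", "club foot", "drop-foot gait"]
    for term in examples:
        print(f"{term!r}: {classify(term)}")
```

Note that a hyphen does not act as a boundary here, so "club-foot" stays a single word while "club foot" is a multiword.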
II. Word count (How many words are in the SPECIALIST Lexicon?)
- The word count is the number of different words.
- By definition, a word has a part of speech, inflections, and a meaning.
- For example, saw|noun|E0054443 is a different word from saw|verb|E0054444 and saw|verb|E0055007.
- Words with different categories or inflections are considered different words. However, a word with the inflection "base" can duplicate another inflection of the same word and should not be counted twice. This applies to the categories adj, adv, verb (also pres1p23p), aux, modal, and noun. The table below lists the duplicated cases for the inflection "base".
Category    | InflVar - Inflection   | Unique word? | Notes
------------+------------------------+--------------+------------------------------------------------------
compl (8)   | base (1)               | true         |
conj (16)   | base (1)               | true         |
det (32)    | base (1)               | true         |
prep (256)  | base (1)               | true         |
pron (512)  | base (1)               | true         |
adj (1)     | base (1)               | false        | positive (256) = base (1)
adv (2)     | base (1)               | false        | positive (256) = base (1)
verb (1024) | base (1)               | false        | infinitive (1024) = base (1)
            | pres1p23p (262144)     | false        | infinitive (1024) = pres1p23p (262144)
aux (4)     | base (1)               | false        | infinitive (1024) = base (1)
            | have - pres123p (2048) | false        | have: infinitive (1024) = pres123p (2048)
modal (64)  | base (1)               | false        | pres (2097152) = base (1)
noun (128)  | base (1)               | false        | singular (512) = base (1), e.g. paper, fish, sheep
            |                        |              | plural (8) = base (1), e.g. police, fish, sheep
- For a corpus without category and inflection information, only the spellings (forms) are taken into consideration, as in an n-gram set.
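The counting rules in this section can be sketched as follows. This is a minimal illustration, not the Lexicon's actual counting code. It assumes each record is a (form, category, inflection, EUI) tuple, and that a duplicated inflection is skipped only when its equivalent inflection is present for the same form, category, and EUI. The names DUPLICATE_OF, count_unique_words, and count_unique_forms are hypothetical, and the inflections attached to the "saw" records are illustrative only.

```python
# Each record is (form, category, inflection, EUI).
Record = tuple[str, str, str, str]

# (category, inflection) -> inflections it duplicates, following the table above.
DUPLICATE_OF = {
    ("adj", "base"): ("positive",),
    ("adv", "base"): ("positive",),
    ("verb", "base"): ("infinitive",),
    ("verb", "pres1p23p"): ("infinitive",),
    ("aux", "base"): ("infinitive",),
    ("aux", "pres123p"): ("infinitive",),  # forms only coincide for "have"
    ("modal", "base"): ("pres",),
    ("noun", "base"): ("singular", "plural"),
}


def count_unique_words(records: list[Record]) -> int:
    """Count distinct words, skipping inflections that duplicate another one."""
    unique = set(records)
    count = 0
    for form, category, inflection, eui in unique:
        equivalents = DUPLICATE_OF.get((category, inflection), ())
        # Skip this record if an equivalent inflection of the same word exists.
        if any((form, category, eq, eui) in unique for eq in equivalents):
            continue
        count += 1
    return count


def count_unique_forms(records: list[Record]) -> int:
    """Fallback for corpora without categories or inflections: count spellings."""
    return len({form for form, _, _, _ in records})


# Illustrative records; the inflection values are assumptions for this sketch.
records = [
    ("saw", "noun", "base", "E0054443"),
    ("saw", "noun", "singular", "E0054443"),
    ("saw", "verb", "base", "E0054444"),
    ("saw", "verb", "infinitive", "E0054444"),
    ("saw", "verb", "pres1p23p", "E0054444"),
    ("saw", "verb", "past", "E0055007"),
]

print(count_unique_words(records))  # words by category, inflection, and EUI
print(count_unique_forms(records))  # spellings only, as in an n-gram set
```

Keying the duplicate check on the form as well as the category keeps the aux pres123p rule limited to cases like "have", where the infinitive and pres123p spellings coincide.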
III. Lexicon Stats