N-gram Utilities

I. Introduction

Several utility programs were developed for processing n-grams. They are used in many processes and are summarized on this page.

II. Detailed Process

  • Dir: ${MULTIWORDS}/bin/06.NGramUtil
  • Programs:
    Each entry lists: Step, Description, Inputs, Outputs, Notes
    Step 1: Grep terms (nGrams), then sort
      • Program: NGramUtil.GrepTermsSort
      • Input: nGram.${YEAR}
      • Output: nGram.${YEAR}.term.sort
      • Note: Must create a link to the input nGram.${YEAR}
    Step 2: Filter pipe (|) from nGrams
      • Program: NGramUtil.FilterPipe
      • Input: nGram.${YEAR}
      • Output: nGram.${YEAR}.noPipe
      • Output: nGram.${YEAR}.pipe
    Step 3: Group nGrams by core-term
      • Program: NGramUtil.GroupByCoreTerm
      • Input: nGram.${YEAR}.noPipe
      • Output: nGram.${YEAR}.noPipe.core
      • Output: nGram.${YEAR}.noPipe.core.detail
      • Note: Group by core-term, also update the WC
    Step 4: Group nGrams by norm-term
      • Program: NGramUtil.GroupByNormTerm
      • Input: nGram.${YEAR}.noPipe
      • Output: nGram.${YEAR}.noPipe.norm
      • Output: nGram.${YEAR}.noPipe.norm.detail
      • Note: Group by norm-term, also update the WC
    Convert from WC|core-term back to DC|WC|TERM
    Step 5: Sort nGrams by DC|WC|Term
      • Program: NGramUtil.SortNGramsByDcWc
      • Input: nGram.${YEAR}.noPipe
      • Output: nGram.${YEAR}.noPipe.sort.WcDcTerm
      • Note: Input is sorted by N, then DC|WC|Term
    Step 6: Convert (ungroup) core-term to nGrams
      • Program: NGramUtil.CoreTermToNGram
      • Input: nGram.${YEAR}.noPipe.core
      • Input: nGram.${YEAR}.noPipe.core.detail
      • Output: nGram.${YEAR}.noPipe.core.ungroup
      • Note: The result is sorted, same as the results from Step 5
      • Note: Input format (core-term): WC|core-term
      • Note: Output format (core-term): DC|WC|TERM
    Convert from WC|core-term.lc back to WC|core-term
    Step 7: Group nGrams by core-term.lc
      • Program: NGramUtil.GroupByCoreTerm
      • Input: nGram.${YEAR}.noPipe
      • Output: nGram.${YEAR}.noPipe.core.lc
      • Output: nGram.${YEAR}.noPipe.core.lc.detail
      • Note: Results are the same because the input is all lowercase
    Step 8: Convert core-term.lc to core-term nGrams
      • Program: NGramUtil.CoreTermLcToCoreTerm
      • Input: nGram.${YEAR}.noPipe.core.lc
      • Input: nGram.${YEAR}.noPipe.core.lc.detail
      • Output: nGram.${YEAR}.noPipe.core.lc.core
      • Output: nGram.${YEAR}.noPipe.core.lc.core.detail
      • Note: Results are the same because the input is all lowercase
    Group n-gram set by core-term.lc
    Step 10: Group nGram set by core-term.lc
      • Program: NGramUtil.CoreTermToNGram
      • Input: ${NGRAM_DIR}nGramSet.${YEAR}.30
      • Output: ${NGRAM_DIR}nGramSet.${YEAR}.30.core.lc
      • Output: ${NGRAM_DIR}nGramSet.${YEAR}.core.lc.detail
    Step 11: Group distilled nGram set by core-term.lc
      • Program: NGramUtil.CoreTermToNGram
      • Input: ${NGRAM_DIR}distilledNGram.${YEAR}
      • Output: ${NGRAM_DIR}distilledNGram.${YEAR}.core.lc
      • Output: ${NGRAM_DIR}distilledNGram.${YEAR}.core.lc.detail
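
The filter, group, and ungroup steps listed above can be illustrated with a minimal Python sketch. It is not the NGramUtil implementation. It assumes each n-gram line has the form DC|WC|TERM, that a line with more than two separators has a pipe inside the term, and that grouping sums the WC of all n-grams sharing a key. The function names and the core_term rule (strip leading and trailing punctuation) are hypothetical stand-ins, used only for illustration.

```python
import string
from collections import defaultdict


def split_pipe_terms(lines):
    """Separate DC|WC|TERM lines whose term contains a literal '|'."""
    no_pipe, pipe = [], []
    for line in lines:
        # Assumption: a well-formed line has exactly two separators,
        # so any extra '|' belongs to the term itself.
        (pipe if line.count("|") > 2 else no_pipe).append(line)
    return no_pipe, pipe


def core_term(term):
    """Hypothetical stand-in: strip leading/trailing punctuation and spaces."""
    return term.strip(string.punctuation + string.whitespace)


def group_by_key(lines, key_fn):
    """Group DC|WC|TERM lines by key_fn(TERM), summing WC per key.

    Returns (grouped, detail): grouped maps key -> total WC, and detail
    maps key -> the original lines so the grouping can be undone.
    """
    grouped = defaultdict(int)
    detail = defaultdict(list)
    for line in lines:
        _dc, wc, term = line.split("|", 2)
        key = key_fn(term)
        grouped[key] += int(wc)  # assumption: WC is summed per group
        detail[key].append(line)
    return dict(grouped), dict(detail)


def ungroup(detail):
    """Recover the original DC|WC|TERM lines from the detail mapping."""
    return [line for lines in detail.values() for line in lines]


if __name__ == "__main__":
    sample = ["2|5|heart attack", "1|3|(heart attack)", "1|1|a|b"]
    no_pipe, pipe = split_pipe_terms(sample)
    grouped, detail = group_by_key(no_pipe, core_term)
    for key, wc in grouped.items():
        print(f"{wc}|{key}")  # grouped lines in WC|core-term format
```

Passing a key function that also lowercases the term, for example lambda t: core_term(t).lower(), gives a grouping in the spirit of core-term.lc. Keeping the detail mapping next to the grouped totals mirrors the paired .core and .core.detail files, which is what makes the ungroup step possible.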