The MEDLINE N-gram Set 2014: by Split, Group, Filter, and Combine Algorithm

The MEDLINE n-gram set for 2014 (generated by the split, group, filter, and combine algorithm) is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into tokens (words, using the space as the word boundary). Finally, n-grams are generated, filtering out terms that have more than 50 characters or whose total word count is less than 30. The specifications for generating these n-grams are as follows:

  • MEDLINE: 2014 - TI and AB (from PmidTiAbS14nXXXX.txt: 1 ~ 746)
  • Method: Split, Group, Filter, and Combine Algorithm
  • Max. Character Size: 50
  • Min. word count: 30
  • Min. document count: 1

  • Total document count: 22,356,869
  • Total sentence count: 126,612,705
  • Total token count: 2,610,209,406
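
The tokenizing and filtering steps described above can be illustrated with a small in-memory sketch. This is not the split, group, filter, and combine algorithm itself (which is not detailed here), only a naive single-pass counterpart that applies the same thresholds. The sentence splitter, the n-gram range (MAX_N), the descending sort direction, and all function names are assumptions made for illustration.

```python
import re
from collections import Counter

MAX_CHARS = 50       # "Max. Character Size" from the specification
MIN_WORD_COUNT = 30  # "Min. word count" from the specification
MIN_DOC_COUNT = 1    # "Min. document count" from the specification
MAX_N = 5            # assumption: the n-gram range is not stated in the text


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams_in_sentence(sentence, max_n=MAX_N):
    """Yield every 1..max_n-gram, using whitespace as the word boundary."""
    tokens = sentence.split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_ngrams(records, max_n=MAX_N, max_chars=MAX_CHARS,
                 min_word_count=MIN_WORD_COUNT, min_doc_count=MIN_DOC_COUNT):
    """records: iterable of (title, abstract) pairs.

    Returns a list of (document count, word count, n-gram) tuples.
    """
    word_count = Counter()  # total occurrences of each n-gram
    doc_count = Counter()   # number of records containing each n-gram
    for title, abstract in records:
        seen = set()
        sentences = split_sentences(title) + split_sentences(abstract)
        for sentence in sentences:
            for gram in ngrams_in_sentence(sentence, max_n):
                if len(gram) > max_chars:
                    continue  # drop n-grams longer than the character limit
                word_count[gram] += 1
                seen.add(gram)
        doc_count.update(seen)  # each record counts once per n-gram
    rows = [(doc_count[g], wc, g) for g, wc in word_count.items()
            if wc >= min_word_count and doc_count[g] >= min_doc_count]
    # Assumption: counts are sorted in descending order, n-grams ascending.
    rows.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return rows
```

Keeping every count in memory is only practical for a small sample of records; at the scale of the totals listed above, a chunked approach would be needed.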

  • N-gram files
    • File format - 3 fields:
      Document count    Word count    N-gram
    • Sorted by document count, word count, then alphabetic order of n-grams.


    N-grams               File                     Zip Size  Actual Size  No. of n-grams
    N-gram Set            nGramSet.2014.30.tgz     170MB     437MB        17,023,819
    Distilled N-gram Set  distilledNGram.2014.tgz  51MB      164MB        6,351,392
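
As a reading aid for the three-field file format, the following is a minimal parsing sketch. The field delimiter is not stated here, so the "|" default is an assumption to check against the extracted file; NGramRow and parse_ngram_lines are hypothetical names.

```python
from typing import Iterable, Iterator, NamedTuple


class NGramRow(NamedTuple):
    doc_count: int   # number of documents containing the n-gram
    word_count: int  # total occurrences of the n-gram
    ngram: str


def parse_ngram_lines(lines: Iterable[str], sep: str = "|") -> Iterator[NGramRow]:
    """Parse 'document count, word count, n-gram' lines.

    Assumption: fields are separated by `sep`. The split is limited to two
    cuts so that an n-gram containing the separator stays intact.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc, word, gram = line.split(sep, 2)
        yield NGramRow(int(doc), int(word), gram)
```

After extracting one of the archives, an open text file handle can be passed directly as `lines`, so the file is streamed rather than loaded whole.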