ASCII LEXICON, 1st Version (09~10)

I. Introduction

The Specialist LEXICON is distributed annually in UTF-8 format with the UMLS. Some NLP projects that use the Specialist LEXICON still handle only ASCII characters. In response to requests from user groups, a pure ASCII version of the LEXICON has been distributed since 2009.

II. Algorithm

  • Convert LEXICON from UTF-8 to ASCII (7-bit):
    Use the Java API class, ToAsciiApi( ), from Lexical Tools (after 2009) to make the conversion.

  • Automatic/manual cleanup:
    After the conversion, some data in the lexical records need to be cleaned up. For example, the spelling variant résumé is converted to resume and should be removed, since it is then identical to the base form. In the 2009 LEXICON, we found the ASCII conversion cases that need cleanup listed in the following table. LexCheck.CheckContent.Check( ) is used to remove the duplicates.

    LEXICON content               Action                      Notes & Example
    ----------------------------  --------------------------  -------------------------------------------
    {base=filler                  N/A                         All bases are unique
    spelling_variant=filler       remove if it is duplicated  spelling_variant=résumé
    abbreviation_of=abbreviation  remove if it is duplicated  None
    acronym_of=acronyms           remove if it is duplicated  None
    nominalization_of=filler      remove if it is duplicated  None
    variants=irreg                remove if it is duplicated  irreg|saute|sautes|sauted|sauted|sauteing|
    compl=pphr(                   N/A                         Needs manual cleanup (none)
    trademark=filler(             N/A                         Needs manual cleanup (none)
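
The two steps above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code used to build the ASCII LEXICON. The class name AsciiLexiconSketch and its methods toAscii and cleanRecord are hypothetical. The conversion uses the JDK's java.text.Normalizer as a stand-in for ToAsciiApi( ), and the record is reduced to a flat list of "key=value" slots with the base line first. The sample record is artificial and simply combines the two examples from the table.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class AsciiLexiconSketch {

    // Stand-in for the Lexical Tools conversion (NOT ToAsciiApi itself):
    // decompose accented letters, then drop the combining marks.
    // Non-ASCII characters without a decomposition pass through unchanged.
    static String toAscii(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Hypothetical helper: 'slots' are the "key=value" lines of one record,
    // base line first (a simplification of the real record layout).
    static List<String> cleanRecord(List<String> slots) {
        Set<String> kept = new LinkedHashSet<>(); // keeps order, drops repeats
        String base = null;
        for (String slot : slots) {
            String ascii = toAscii(slot.trim());
            if (ascii.startsWith("{base=")) {
                base = ascii.substring("{base=".length());
            }
            // A spelling variant identical to the base is redundant.
            if (base != null && ascii.equals("spelling_variant=" + base)) {
                continue;
            }
            kept.add(ascii); // adding an already-present slot has no effect
        }
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        // \u00e9 is e-acute, written as an escape to keep this source ASCII.
        List<String> record = List.of(
            "{base=resume",
            "spelling_variant=r\u00e9sum\u00e9",
            "variants=irreg|saute|sautes|sauted|sauted|sauteing|",
            "variants=irreg|saut\u00e9|saut\u00e9s|saut\u00e9d|saut\u00e9d|saut\u00e9ing|");
        cleanRecord(record).forEach(System.out::println);
    }
}
```

Run on the sample record, the sketch keeps the base line and a single variants=irreg line. The spelling variant collapses into the base form, and the accented irregular variants collapse into the existing ASCII line. Slots such as compl=pphr( and trademark=filler(, which the table marks for manual cleanup, are not handled here.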