ASCII LEXICON, 2nd Version (since 2011)

I. Introduction

The Specialist LEXICON is distributed in UTF-8 format annually with UMLS. Some NLP projects use the Specialist LEXICON but still handle only ASCII characters. In response to requests from user groups, a pure ASCII version of the LEXICON has been distributed since 2009.

The MetaMap project is a pure ASCII project; it uses the pure ASCII LEXICON and results from the Lexical Tools. MetaMap migrated from the 'C-code' lexical tools to the Java Lexical Tools in 2010. A Java-Prolog interface to the Lexical Tools was implemented for this migration. Three major issues cause differences between the results of the 'C-code' and the Java Lexical Tools, one of which is ASCII conversion.

II. GenerateAsciiLexicon

This section focuses on the details of the ASCII conversion issue. The ASCII conversion is implemented in the program GenerateAsciiLexicon, which performs the following tasks:

  • Input:
    • ${Lexicon}/data/${year}/tables/LEXICON

  • Outputs:
    • ${Lexicon}/data/${year}/ascii/LEXICON.ascii
    • ${Lexicon}/data/${year}/ascii/detail/*.log
    • ${Lexicon}/data/${year}/ascii/detail/reports/*.rpt
    • ${Lexicon}/data/${year}/ascii/detail/details/*.txt

  • Algorithm:
    • Convert citation and spelling variants from UTF-8 to ASCII (7-bit):
      • Convert Lexical Records to Java Objects
      • Use Lexical Tools Java API class, ToAsciiApi( ), for ASCII conversion

      • Convert citation (base=citation) to ASCII citationAscii
        Conditions and actions:
        • if citation is ASCII
          • Do nothing
        • if citation is not ASCII and known by LEXICON
          • Set citation to citationAscii
          • Log: Eui|base|change|from-spVar|citation|citationAscii
        • if citation is not ASCII and not known by LEXICON
          • Remove the lexical record
          • Log: Eui|base|delete|not-Lex|citation|citationAscii
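
The citation rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: `to_ascii` is a rough stand-in for the Lexical Tools `ToAsciiApi( )` (it strips diacritics and expands a few ligatures, which is not the real mapping table), `known_terms` is a hypothetical set standing in for "known by LEXICON", and the EUI in the usage lines is made up.

```python
import unicodedata

# Illustrative ligature expansions only; the real ToAsciiApi mapping is far larger.
LIGATURES = str.maketrans({"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ß": "ss"})


def to_ascii(term):
    """Rough stand-in for ToAsciiApi: expand ligatures, then drop diacritics."""
    decomposed = unicodedata.normalize("NFKD", term.translate(LIGATURES))
    return decomposed.encode("ascii", "ignore").decode("ascii")


def convert_citation(eui, citation, known_terms):
    """Return (new_citation_or_None, log_line_or_None).

    A result of None for the citation means the whole record is removed.
    `known_terms` is a hypothetical lookup for "known by LEXICON".
    """
    if citation.isascii():
        return citation, None  # ASCII citation: do nothing, no log entry
    citation_ascii = to_ascii(citation)
    if citation_ascii in known_terms:
        log = f"{eui}|base|change|from-spVar|{citation}|{citation_ascii}"
        return citation_ascii, log
    log = f"{eui}|base|delete|not-Lex|{citation}|{citation_ascii}"
    return None, log


# Usage with a made-up EUI:
print(convert_citation("E0000001", "purée", {"puree"}))
print(convert_citation("E0000001", "purée", set()))
```

Returning the log line alongside the result keeps the decision and its audit trail in one place, which mirrors how each rule in the table pairs an action with a log format.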

      • Convert spelling variants (spelling_variant=spVar) to spVarAscii
        Conditions and actions:
        • if spVar is ASCII and not same as citationAscii
          • Add to spVarAsciis
        • if spVar is ASCII and same as citationAscii
          • Remove
          • Log: Eui|spVar|delete|ascii-dup-base|spVar|spVarAscii
        • if spVar is not ASCII and same as citationAscii
          • Remove
          • Log: Eui|spVar|delete|dup-base|spVar|spVarAscii
        • if spVar is not ASCII and duplicated from ASCII spVars
          • Remove
          • Log: Eui|spVar|delete|dup-spVar|spVar|spVarAscii
        • if citation is not ASCII and not known by LEXICON
          • Remove the lexical record
          • Log: Eui|spVar|delete|not-Lex|spVar|spVarAscii
      • Load hashtable<EUI, citationAscii>
      • Input: ${Lexicon}/data/${year}/tables/LEXICON
      • Output: ${Lexicon}/data/${year}/ascii/LEXICON.asciiBase
      • Log: ${Lexicon}/data/${year}/ascii/detail/LEXICON.asciiBaseLog
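
A minimal Python sketch of the spelling-variant rules in this step is given below. It rests on several assumptions: `to_ascii` is a simplified stand-in for `ToAsciiApi( )`, `is_known` is a hypothetical callback for "known by LEXICON", and the last rule of the table (worded in terms of the citation) is read here as applying to the ASCII form of the spVar. A non-ASCII spVar that matches none of the listed conditions is not covered by the table, so the sketch returns it separately instead of guessing an action.

```python
import unicodedata

# Illustrative ligature expansions only; not the real ToAsciiApi mapping.
LIGATURES = str.maketrans({"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ß": "ss"})


def to_ascii(term):
    """Rough stand-in for ToAsciiApi: expand ligatures, then drop diacritics."""
    decomposed = unicodedata.normalize("NFKD", term.translate(LIGATURES))
    return decomposed.encode("ascii", "ignore").decode("ascii")


def convert_spvars(eui, citation_ascii, spvars, is_known):
    """Return (kept_ascii_spvars, log_lines, uncovered_spvars)."""
    kept, logs, uncovered = [], [], []

    # Pass 1: ASCII spVars are kept unless they repeat the ASCII citation.
    for spvar in spvars:
        if not spvar.isascii():
            continue
        if spvar == citation_ascii:
            logs.append(f"{eui}|spVar|delete|ascii-dup-base|{spvar}|{spvar}")
        else:
            kept.append(spvar)

    # Pass 2: non-ASCII spVars are compared against the citation and kept spVars.
    for spvar in spvars:
        if spvar.isascii():
            continue
        spvar_ascii = to_ascii(spvar)
        if spvar_ascii == citation_ascii:
            tag = "dup-base"
        elif spvar_ascii in kept:
            tag = "dup-spVar"
        elif not is_known(spvar_ascii):
            tag = "not-Lex"  # assumed reading of the table's last rule
        else:
            uncovered.append(spvar)  # case not described in the table
            continue
        logs.append(f"{eui}|spVar|delete|{tag}|{spvar}|{spvar_ascii}")

    return kept, logs, uncovered
```

Two passes are used so that the duplicate check for non-ASCII variants does not depend on the order in which variants appear in the record.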

    • Convert the rest of LEXICON from UTF-8 to ASCII (7-bit):
      • Read in LEXICON.asciiBase and convert to ASCII line by line
        Conditions and actions:
        • if line is ASCII
          • Keep

      • Convert acronym (acronym_of=acronym) to ASCII acronymAscii
        Conditions and actions:
        • if line is not ASCII, starts with "acronym_of=", and has an EUI
          • Replace acronym with acronymAscii by EUI
          • Log: EUI|change|ascii-base|acronym_of=acronym|EUI|acronym_of=acronymAscii|EUI
        • if line is not ASCII, starts with "acronym_of=", and has no EUI
          • Remove
          • Log: EUI|delete|not-Lex|acronym_of=acronym|EUI|acronym_of=acronymAscii|EUI

      • Convert abbreviation (abbreviation_of=abbreviation) to ASCII abbreviationAscii
        Conditions and actions:
        • if line is not ASCII, starts with "abbreviation_of=", and has an EUI
          • Replace abbreviation with abbreviationAscii by EUI
          • Log: EUI|change|ascii-base|abbreviation_of=abbreviation|EUI|abbreviation_of=abbreviationAscii|EUI
        • if line is not ASCII, starts with "abbreviation_of=", and has no EUI
          • Remove
          • Log: EUI|delete|not-Lex|abbreviation_of=abbreviation|EUI|abbreviation_of=abbreviationAscii|EUI

      • Convert nominalization (nominalization_of=nominalization) to ASCII nominalizationAscii
        Conditions and actions:
        • if line is not ASCII, starts with "nominalization=", and has an EUI
          • Replace nominalization with nominalizationAscii by EUI
          • Log: EUI|change|ascii-base|nominalization=nominalization|Cat|EUI|nominalization=nominalizationAscii|Cat|EUI
        • if line is not ASCII, starts with "nominalization=", and has no EUI
          • Remove
          • Log: EUI|delete|not-Lex|nominalization=nominalization|Cat|EUI|nominalization=nominalizationAscii|Cat|EUI

      • Remove all lines (other fields) that contain non-ASCII characters
        Conditions and actions:
        • if line is not ASCII and does not start with "acronym_of=", "abbreviation_of=", or "nominalization="
          • Remove
          • Log: EUI|delete|not-Lex|line|lineAscii

        Other fields are not used in MetaMap, very few of their lines contain non-ASCII characters (6 lines), and the non-ASCII terms in them are very hard to tokenize. Simply removing the whole line is a good solution. These lines are logged. For the 2010 release, the following lines were removed:
        Fields, removed lines, and notes:
        • variants=irreg
          All spVars should have their own irreg, so a non-ASCII irreg (which is either a duplicate of an ASCII spVar or not known to the Lexicon) can be removed, except for the following:
          • E0028609|delete|non-Ascii| variants=irreg|formula|formulæ|| variants=irreg|formula|formulae|
          • E0344461|delete|non-Ascii| variants=irreg|purée|purées|puréed|puréed|puréeing|| variants=irreg|puree|purees|pureed|pureed|pureeing
          • E0547096|delete|non-Ascii| variants=irreg|tache méningéale|taches méningéales|| variants=irreg|tache meningeale|taches meningeales|
        • compl=pphr
          • E0027351|delete|non-Ascii| compl=pphr(of,np|Türck|)| compl=pphr(of,np|Turck|)
          • E0064187|delete|non-Ascii| compl=pphr(of,np|Labbé|)| compl=pphr(of,np|Labbe|
        • trademark=filler
          • E0522540|delete|non-Ascii| trademark=Bacillus Calmette-Guérin (BCG), substrain Connaught| trademark=Bacillus Calmette-Guerin (BCG), substrain Connaugh
      • Input: ${Lexicon}/data/${year}/ascii/LEXICON.asciiBase
      • Output: ${Lexicon}/data/${year}/ascii/LEXICON.asciiLine
      • Log: ${Lexicon}/data/${year}/ascii/detail/LEXICON.asciiLineLog
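
The line-by-line rules of this step can be sketched in Python as shown below. The sketch makes assumptions that the text does not confirm: `ascii_base_by_eui` is a plain dict playing the role of the hashtable<EUI, citationAscii>, "has EUI" is read as "the last |-separated field is an EUI found in that table", `to_ascii` is a simplified stand-in for `ToAsciiApi( )`, and the EUIs in the usage lines are made up. The delete tag follows the rule table ("not-Lex"); the 2010 examples listed above show "non-Ascii" instead.

```python
import unicodedata

# Illustrative ligature expansions only; not the real ToAsciiApi mapping.
LIGATURES = str.maketrans({"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE", "ß": "ss"})
LINKED_FIELDS = ("acronym_of=", "abbreviation_of=", "nominalization=")


def to_ascii(term):
    """Rough stand-in for ToAsciiApi: expand ligatures, then drop diacritics."""
    decomposed = unicodedata.normalize("NFKD", term.translate(LIGATURES))
    return decomposed.encode("ascii", "ignore").decode("ascii")


def convert_line(record_eui, line, ascii_base_by_eui):
    """Return (new_line_or_None, log_line_or_None) for one record line."""
    if line.isascii():
        return line, None  # ASCII lines are kept untouched
    body = line.strip()
    indent = line[: len(line) - len(line.lstrip())]
    for prefix in LINKED_FIELDS:
        if not body.startswith(prefix):
            continue
        parts = body[len(prefix):].split("|")
        target_eui = parts[-1]
        if len(parts) > 1 and target_eui in ascii_base_by_eui:
            # Swap the term for the ASCII base of the referenced record.
            new_parts = [ascii_base_by_eui[target_eui]] + parts[1:]
            new_body = prefix + "|".join(new_parts)
            log = f"{record_eui}|change|ascii-base|{body}|{new_body}"
            return indent + new_body, log
        break  # linked field without a usable EUI: removed below
    return None, f"{record_eui}|delete|not-Lex|{body}|{to_ascii(body)}"


# Usage with made-up EUIs:
table = {"E0000002": "Guillain-Barre syndrome"}
print(convert_line("E0000001", "\tacronym_of=Guillain-Barré syndrome|E0000002", table))
print(convert_line("E0000001", "\tvariants=irreg|formula|formulæ|", table))
```

Looking the ASCII term up by EUI, instead of converting the term in place, keeps the linked field consistent with whatever citationAscii was chosen for the referenced record in the previous step.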

    • Final contents clean up:
      After the above conversion, we apply LexCheck.CheckContent.Check( ) to perform the final clean-up. This step is an extra safeguard; it should find nothing to clean up.
      • Input: ${Lexicon}/data/${year}/ascii/LEXICON.asciiLine
      • Output: ${Lexicon}/data/${year}/ascii/LEXICON.ascii
      • Log: ${Lexicon}/data/${year}/ascii/detail/LEXICON.asciiLog
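
As an independent sanity check on the final file, a short script can confirm that every byte is 7-bit ASCII. The sketch below is not a replacement for LexCheck.CheckContent.Check( ), whose behavior is not described here; the function name `find_non_ascii_lines` and the file path in the usage comment are hypothetical.

```python
def find_non_ascii_lines(path):
    """Return the 1-based numbers of lines that contain any byte above 0x7F."""
    offenders = []
    with open(path, "rb") as handle:  # read raw bytes so no decoding can mask a problem
        for number, raw_line in enumerate(handle, start=1):
            if any(byte > 0x7F for byte in raw_line):
                offenders.append(number)
    return offenders


# Usage (hypothetical path):
# bad = find_non_ascii_lines("LEXICON.ascii")
# if bad:
#     print("non-ASCII content on lines:", bad)
```

An empty result means the file is pure 7-bit ASCII; any reported line number points at content that slipped through the earlier steps.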