LexBuild - Fix Illegal Non-ASCII Characters

Some characters in the range U+0080 to U+009F have different values in Unicode and in HTML character entities. In Unicode this range holds C1 control characters, while HTML browsers interpret these code points as printable Windows-1252 characters. They are listed as nonstandard entities in the conformance column of the HTML character entities table. As a result, these characters do not decode or display correctly in UTF-8, even though they are viewable in an HTML browser, and they need to be converted to the correct Unicode values. This feature replaces them with the correct Unicode characters, as described in the tables below.

In LEXICON 2008, this fix was applied as a post-processing step on LEXICON. Please note that the annotation fields in LEXICON are not included in the official release and thus are not fixed by this feature.

After 2008, LexBuild was upgraded to Unicode (UTF-8) in the web browser, and all data were converted to UTF-8 as part of this upgrade. These mappings are now applied automatically when users input data. The mapping program is used only as a double check.

I. Automatically fixed characters

Illegal name | Illegal char | Illegal value | Replacement name | Replacement char | Replacement value | ASCII mapping in 2008
------------ | ------------ | ------------- | ---------------- | ---------------- | ----------------- | ---------------------
 | [‚] | U+0082 | SINGLE LOW-9 QUOTATION MARK | [‚] | U+201A | [,]: U+002C, COMMA
Florin | [ƒ] | U+0083 | LATIN SMALL LETTER F WITH HOOK | [ƒ] | U+0192 |
Right double Quote | [„] | U+0084 | DOUBLE LOW-9 QUOTATION MARK | [„] | U+201E | ["]: U+0022, QUOTATION MARK
Ellipsis | […] | U+0085 | HORIZONTAL ELLIPSIS | […] | U+2026 | [.]: U+002E, period (add three periods: [...])
Dagger | [†] | U+0086 | DAGGER | [†] | U+2020 |
Double Dagger | [‡] | U+0087 | DOUBLE DAGGER | [‡] | U+2021 |
Circumflex | [ˆ] | U+0088 | MODIFIER LETTER CIRCUMFLEX ACCENT | [ˆ] | U+02C6 | [^]: U+005E, CIRCUMFLEX ACCENT
Permil | [‰] | U+0089 | PER MILLE SIGN | [‰] | U+2030 |
 | [Š] | U+008A | LATIN CAPITAL LETTER S WITH CARON | [Š] | U+0160 |
Less than sign | [‹] | U+008B | SINGLE LEFT-POINTING ANGLE QUOTATION MARK | [‹] | U+2039 | [<]: U+003C, LESS-THAN SIGN
Capital OE Ligature | [Œ] | U+008C | LATIN CAPITAL LIGATURE OE | [Œ] | U+0152 |
Left Single Quote | [‘] | U+0091 | LEFT SINGLE QUOTATION MARK | [‘] | U+2018 | [']: U+0027, APOSTROPHE
Right Single Quote | [’] | U+0092 | RIGHT SINGLE QUOTATION MARK | [’] | U+2019 | [']: U+0027, APOSTROPHE
Left Double Quote | [“] | U+0093 | LEFT DOUBLE QUOTATION MARK | [“] | U+201C | ["]: U+0022, QUOTATION MARK
Right Double Quote | [”] | U+0094 | RIGHT DOUBLE QUOTATION MARK | [”] | U+201D | ["]: U+0022, QUOTATION MARK
Bullet | [•] | U+0095 | BULLET | [•] | U+2022 |
Hyphen | [‐] | U+2010 | HYPHEN, GENERAL_PUNCTUATION | [-] | U+002D |
 | [‎] | U+200E | GENERAL_PUNCTUATION | | |
En Dash | [–] | U+0096 | EN DASH | [–] | U+2013 |
Em Dash | [—] | U+0097 | EM DASH | [—] | U+2014 |
Tilde | [˜] | U+0098 | SMALL TILDE | [˜] | U+02DC |
Trademark | [™] | U+0099 | TRADE MARK SIGN | [™] | U+2122 |
 | [š] | U+009A | LATIN SMALL LETTER S WITH CARON | [š] | U+0161 |
Greater than sign | [›] | U+009B | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK | [›] | U+203A | [>]: U+003E, GREATER-THAN SIGN
Small oe ligature | [œ] | U+009C | LATIN SMALL LIGATURE OE | [œ] | U+0153 |
Capital Y, umlaut | [Ÿ] | U+009F | LATIN CAPITAL LETTER Y WITH DIAERESIS | [Ÿ] | U+0178 |
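
The automatic mapping in this table can be illustrated with a small lookup. The following is only a minimal Python sketch, not the LexBuild implementation: the names ILLEGAL_TO_UNICODE and fix_illegal_non_ascii are hypothetical, and the U+200E row is left out because the table does not state a replacement for it.

```python
# Hypothetical sketch of the table above: illegal code point -> correct character.
# U+200E is omitted because the table gives no replacement value for it.
ILLEGAL_TO_UNICODE = {
    0x0082: "\u201A",  # SINGLE LOW-9 QUOTATION MARK
    0x0083: "\u0192",  # LATIN SMALL LETTER F WITH HOOK
    0x0084: "\u201E",  # DOUBLE LOW-9 QUOTATION MARK
    0x0085: "\u2026",  # HORIZONTAL ELLIPSIS
    0x0086: "\u2020",  # DAGGER
    0x0087: "\u2021",  # DOUBLE DAGGER
    0x0088: "\u02C6",  # MODIFIER LETTER CIRCUMFLEX ACCENT
    0x0089: "\u2030",  # PER MILLE SIGN
    0x008A: "\u0160",  # LATIN CAPITAL LETTER S WITH CARON
    0x008B: "\u2039",  # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    0x008C: "\u0152",  # LATIN CAPITAL LIGATURE OE
    0x0091: "\u2018",  # LEFT SINGLE QUOTATION MARK
    0x0092: "\u2019",  # RIGHT SINGLE QUOTATION MARK
    0x0093: "\u201C",  # LEFT DOUBLE QUOTATION MARK
    0x0094: "\u201D",  # RIGHT DOUBLE QUOTATION MARK
    0x0095: "\u2022",  # BULLET
    0x0096: "\u2013",  # EN DASH
    0x0097: "\u2014",  # EM DASH
    0x0098: "\u02DC",  # SMALL TILDE
    0x0099: "\u2122",  # TRADE MARK SIGN
    0x009A: "\u0161",  # LATIN SMALL LETTER S WITH CARON
    0x009B: "\u203A",  # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    0x009C: "\u0153",  # LATIN SMALL LIGATURE OE
    0x009F: "\u0178",  # LATIN CAPITAL LETTER Y WITH DIAERESIS
    0x2010: "-",       # HYPHEN -> U+002D, as listed in the table
}


def fix_illegal_non_ascii(text: str) -> str:
    """Replace every code point listed in the table; leave all others untouched."""
    return text.translate(ILLEGAL_TO_UNICODE)
```

For example, fix_illegal_non_ascii("\u0093term\u0094") would return the same word wrapped in U+201C and U+201D. Characters outside the table, including legitimate non-ASCII such as accented letters, pass through unchanged.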

II. Manually fixed characters

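Illegal name | Illegal char | Illegal value | Replacement name | Replacement char | Replacement value | Notes
------------ | ------------ | ------------- | ---------------- | ---------------- | ----------------- | -----
Nonbreaking Space | [ ] | U+00A0 | Space | [ ] | U+0020 | Trim it if at the end of the string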
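
Although this fix is listed as manual, the rule itself is simple enough to sketch. The Python snippet below is an illustration under assumptions, not part of LexBuild: fix_nbsp is a hypothetical helper, and it reads "trim it if at the end of string" as removing trailing no-break spaces rather than converting them.

```python
NBSP = "\u00A0"  # NO-BREAK SPACE


def fix_nbsp(text: str) -> str:
    """Drop trailing no-break spaces, then turn the remaining ones into U+0020."""
    # Assumption: only U+00A0 is trimmed at the end; ordinary trailing spaces are kept.
    return text.rstrip(NBSP).replace(NBSP, " ")
```

With this reading, "heart\u00A0attack\u00A0" becomes "heart attack".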

Program:

shell> $LEXBUILD_DIR/Tools/PostProcessing/NonAscii


------------------------------------
Input file ?
------------------------------------


------------------------------------
Which Program ?
1) Check Non-ASCII
2) Fix Illegal Non-ASCII
------------------------------------

Inputs:

$LEXBUILD_DIR/data/WebApp/Outputs/Lexicon/LEXICON

Outputs:

  • $LEXBUILD_DIR/data/WebApp/Outputs/PostProc/nonAscii.txt
  • $LEXBUILD_DIR/data/WebApp/Outputs/PostProc/nonAsciiStat.txt
  • $LEXBUILD_DIR/data/WebApp/Outputs/PostProc/nonAsciiFix.txt
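
The "Check Non-ASCII" option in the menu above presumably scans the input for characters outside the ASCII range and reports them. The actual format of nonAscii.txt and nonAsciiStat.txt is not documented here, so the following Python sketch only illustrates the general idea; check_non_ascii and its return values are hypothetical.

```python
from collections import Counter


def check_non_ascii(lines):
    """Return (hits, stats): lines containing non-ASCII text and per-character counts."""
    hits = []          # (line number, line) pairs, numbered from 1
    stats = Counter()  # non-ASCII character -> number of occurrences
    for line_no, line in enumerate(lines, start=1):
        found = [ch for ch in line if ord(ch) > 0x7F]
        if found:
            hits.append((line_no, line))
            stats.update(found)
    return hits, stats
```

A caller could write hits to one report and stats to another, for instance formatting each character as U+XXXX with f"U+{ord(ch):04X}", which would roughly correspond to a listing file and a statistics file.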