LRAGR2Lex
1. Introduction
LRAGR2Lex
LRAGR2Lex converts one of the SPECIALIST Lexicon's distribution files, the relational agreement table (LRAGR), into the SPECIALIST .lex files. It is useful when a new version of the SPECIALIST Lexicon is released.
2. File Dependencies
TagSet
dTagger is designed to be used with the SPECIALIST Lexicon, but it is also intended to be tag neutral. Tags are enumerated in a file (tagset.txt). The SPECIALIST Lexicon's tagset is in a file called T1TagSet.txt, which has been copied to tagset.txt.

File format
The Shape tags come from shapes that the Xerox PARC tagger identified, or from shapes that textTools identifies or is intended to identify.

Tag   Name                         Open Class  Shape  Example
noun  noun                         1
real  real number                  0           1
ap    right quote or double quote  0                  ,"

Where
Open Class
A "1" in the Open Class column indicates that the tag is an open class. This matters when guessing a class, because only open classes should be guessed. An open class is defined to be a lexical category whose membership is typically large and which can easily accept new members. In English, this includes the categories of noun, verb, and adjective. [A Dictionary of Grammatical Terms in Linguistics, R.L. Trask, c. 1993, pg. 195]
We will presume that we have a lexicon filled with all the closed class words and tokens.
Shape
A 1 in the Shape column marks a tag that will not appear on a term within an official lexicon. Instead, these tags are put on terms recognized by shape identifiers, such as numbers, number words, units of measure, and URLs.
Example
This field is ignored. It serves as a reminder of which characters the punctuation tags stand for, such as the character for ap.
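The Open Class and Shape columns can be read as two boolean flags per tag. The following Java sketch models the three example rows from the table above and selects the tags that are eligible for guessing. The names TagSetSketch, TagEntry, guessableTags and shapeTags are hypothetical and are not part of dTagger, and an empty cell is assumed to mean 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; this is not dTagger's actual TagSet class.
class TagSetSketch {

    // One row of the tagset table (the Example column is ignored, as in the text).
    static final class TagEntry {
        final String tag;
        final String name;
        final boolean openClass; // true when the Open Class column holds 1
        final boolean shape;     // true when the Shape column holds 1

        TagEntry(String tag, String name, boolean openClass, boolean shape) {
            this.tag = tag;
            this.name = name;
            this.openClass = openClass;
            this.shape = shape;
        }
    }

    // The three example rows; blank cells are assumed to mean 0.
    static List<TagEntry> exampleRows() {
        return Arrays.asList(
            new TagEntry("noun", "noun", true, false),
            new TagEntry("real", "real number", false, true),
            new TagEntry("ap", "right quote or double quote", false, false));
    }

    // Only open-class tags are candidates when guessing a class.
    static List<String> guessableTags(List<TagEntry> entries) {
        List<String> result = new ArrayList<>();
        for (TagEntry entry : entries) {
            if (entry.openClass) {
                result.add(entry.tag);
            }
        }
        return result;
    }

    // Tags assigned by shape identifiers rather than found in a lexicon.
    static List<String> shapeTags(List<TagEntry> entries) {
        List<String> result = new ArrayList<>();
        for (TagEntry entry : entries) {
            if (entry.shape) {
                result.add(entry.tag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TagEntry> rows = exampleRows();
        System.out.println("guessable: " + guessableTags(rows)); // guessable: [noun]
        System.out.println("shape: " + shapeTags(rows));         // shape: [real]
    }
}
```

With these rows, only noun is a guessing candidate, and only real is treated as a shape tag.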
This tagger relies heavily on the END tag, which should be present in every tagset associated with this tagger. The END tag is implicitly put before the beginning and after the end of an utterance (sentence).
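As a minimal sketch of the END tag convention, the following Java snippet wraps the tag sequence of one utterance with END at both ends. The class and method names are hypothetical; only the tag name END comes from the text above, and how dTagger applies the tag internally is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not taken from the dTagger source.
class EndTagSketch {
    static final String END = "END";

    // Returns a new list with END placed before the first and after the last tag.
    static List<String> padWithEnd(List<String> utteranceTags) {
        List<String> padded = new ArrayList<>(utteranceTags.size() + 2);
        padded.add(END);
        padded.addAll(utteranceTags);
        padded.add(END);
        return padded;
    }

    public static void main(String[] args) {
        System.out.println(padWithEnd(Arrays.asList("noun", "verb"))); // [END, noun, verb, END]
    }
}
```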

The Java code needs to know about numbers and punctuation. Either keep the num and punt tags in every tagset, as represented here, or alter the TagSet.getNumberTagId() and TagSet.getPunctuationTagId() methods to correspond to the tags used in tagset.txt.
Files
T1TagSet.txt
See related topics and documents
inflectionTagSet.txt
The inflectionTagSet.txt file lists the tags used to describe inflections within the SPECIALIST Lexicon. See http://lexsrv2.nlm.nih.gov/SPECIALIST/Projects/lexicon/2006/release/LEXICON/DOCS/techrpt.pdf for a definition of the principal inflections.
The contents of this file (when present) are carried along with the lexical lookup indexes when these indexes are created from the SPECIALIST.lex file.
Within dTagger, the contents of this file are used for one purpose only: to determine whether a verb can inflect to a past or present participle, so that verbs can be changed to adjectives. If this file is not present (because it does not fit within the POS tag set model), the tagger should still be able to work.

The format of this file is:

Inflection Tag                  POS tags they come from
uncount(thr_sing)               noun
count(thr_plur)                 noun
pres(fst_sing,fst_plur,second)  verb
pres(thr_sing)                  verb,aux
pres_part                       verb
past                            verb,aux,modal
past_part                       verb
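The participle check described above can be sketched in Java as follows. The map mirrors the rows of the table, with fst_plur as the assumed spelling of the third tag. The class and method names are hypothetical, and the rule shown (a tag qualifies if it is pres_part or past_part and lists verb as a source POS) is an assumption about how such a check might work, not dTagger's actual code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not taken from the dTagger source.
class InflectionTagSketch {

    // Inflection tag -> POS tags it can come from, following the table above.
    static Map<String, List<String>> inflectionTable() {
        Map<String, List<String>> table = new LinkedHashMap<>();
        table.put("uncount(thr_sing)", Arrays.asList("noun"));
        table.put("count(thr_plur)", Arrays.asList("noun"));
        table.put("pres(fst_sing,fst_plur,second)", Arrays.asList("verb"));
        table.put("pres(thr_sing)", Arrays.asList("verb", "aux"));
        table.put("pres_part", Arrays.asList("verb"));
        table.put("past", Arrays.asList("verb", "aux", "modal"));
        table.put("past_part", Arrays.asList("verb"));
        return table;
    }

    // Assumed rule: true for a participle tag that lists verb as a source POS.
    static boolean isVerbParticiple(String inflectionTag) {
        boolean participle = "pres_part".equals(inflectionTag)
                || "past_part".equals(inflectionTag);
        List<String> sources = inflectionTable().get(inflectionTag);
        return participle && sources != null && sources.contains("verb");
    }

    public static void main(String[] args) {
        System.out.println(isVerbParticiple("past_part")); // true
        System.out.println(isVerbParticiple("past"));      // false
    }
}
```

If the file is absent, a caller could skip this check entirely, which matches the statement that the tagger should still work without it.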
See related topics and documents
3. Usage
LRAGR2Lex [--KSYear=] [--lexiconName=]

     Converts an LRAGR table into the .lex files needed for lexical lookup within dTagger. This program assumes that the LRAGR table is in the data/200X/lexicon directory.
Where

Option          Description                                                                            Default Value
--KSYear=       The name of the year or custom dataset                                                 2005
--lexiconName=  Specify the name of the lexical lookup indexes created as part of the training task.  "SPECIALIST"