LRAGR2Lex
This tool converts one of the SPECIALIST Lexicon's distribution files, the relational agreement
table (LRAGR), into the SPECIALIST .lex files. It is useful when a new version of the
SPECIALIST Lexicon is released.
dTagger is designed to be used with the SPECIALIST Lexicon, but it is also intended to be tag neutral. Tags
are enumerated in a file (tagset.txt). The SPECIALIST Lexicon's tagset is in a file called T1TagSet.txt, which has
been copied to tagset.txt.

The Shape tags come from shapes that the Xerox PARC tagger identified, or from shapes that
textTools identifies or hopes to identify.

Tag | Name | Open Class | Shape | Example
noun | noun | 1 | |
real | real number | 0 | 1 |
ap | right quote or double quote | 0 | | ,"

Where

Open Class | A "1" in the Open Class column indicates that the tag is an open class. This is useful when
guessing a class: only open classes should be guessed. An open class is defined to be a lexical category
whose membership is typically large and which can easily accept new members. In English, this includes
the categories of noun, verb, and adjective [A Dictionary of Grammatical Terms in Linguistics, R.L. Trask, c.
1993, pg. 195]. We presume that the lexicon is filled with all the closed class words and tokens.

Shape | A "1" in the Shape column marks a tag that will not be seen identifying a term within an
official lexicon. Instead, these tags are put on terms recognized by shape identifiers, such as numbers,
number words, units of measure, URLs ...

Example | This field is ignored. It is there as a reminder of the characters for the punctuation tags, such as
which character the ap tag stands for.

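As a rough illustration of how the Open Class and Shape columns might be consumed, the following sketch parses tagset rows and keeps only the tags that are eligible for guessing. The pipe-delimited row layout, the class name TagSetSketch, and the methods parseRow and guessableTags are all assumptions made for this example; the actual layout of tagset.txt and the real TagSet class may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the pipe-delimited layout and every name in this class are
// assumptions, not the actual dTagger TagSet implementation.
class TagSetSketch {

    static final class TagEntry {
        final String tag;
        final String name;
        final boolean openClass;
        final boolean shape;

        TagEntry(String tag, String name, boolean openClass, boolean shape) {
            this.tag = tag;
            this.name = name;
            this.openClass = openClass;
            this.shape = shape;
        }
    }

    // Parses one row such as "real|real number|0|1|".
    // The Example column is ignored, as described above.
    static TagEntry parseRow(String row) {
        String[] cols = row.split("\\|", -1); // -1 keeps trailing empty columns
        String tag = cols[0].trim();
        String name = cols.length > 1 ? cols[1].trim() : "";
        boolean open = cols.length > 2 && cols[2].trim().equals("1");
        boolean shape = cols.length > 3 && cols[3].trim().equals("1");
        return new TagEntry(tag, name, open, shape);
    }

    // Returns the tags that may be proposed when guessing a class: open classes only.
    static List<String> guessableTags(List<String> rows) {
        List<String> result = new ArrayList<>();
        for (String row : rows) {
            TagEntry entry = parseRow(row);
            if (entry.openClass) {
                result.add(entry.tag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        rows.add("noun|noun|1||");
        rows.add("real|real number|0|1|");
        rows.add("ap|right quote or double quote|0||,\"");
        System.out.println(guessableTags(rows)); // prints [noun]
    }
}
```

With the three sample rows from the table, only noun is returned, since real and ap have 0 in the Open Class column.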
This tagger relies heavily on the END tag, which should be present in every tagset associated with
this tagger. The END tag is implicitly put before the beginning and after the end of an
utterance (sentence).
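The implicit END tags can be pictured as padding around the tag sequence of each utterance. This is a minimal sketch under stated assumptions: the class EndTagSketch and the method padWithEnd are hypothetical names, and the sample tags are illustrative. How dTagger applies the END tag internally is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: EndTagSketch and padWithEnd are hypothetical names,
// not part of the dTagger code base.
class EndTagSketch {

    static final String END = "END";

    // Wraps the tag sequence of one utterance with the implicit END tag on both sides.
    static List<String> padWithEnd(List<String> tags) {
        List<String> padded = new ArrayList<>(tags.size() + 2);
        padded.add(END);
        padded.addAll(tags);
        padded.add(END);
        return padded;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("det", "noun", "verb");
        System.out.println(padWithEnd(tags)); // prints [END, det, noun, verb, END]
    }
}
```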
The Java code needs to know about numbers and punctuation. The num and punt tags should
always remain in any tagset, as represented here. Otherwise, alter the TagSet.getNumberTagId() and
TagSet.getPunctuationTagId() methods to correspond to the tags used in tagset.txt.
The inflectionTagSet.txt file lists the tags used to describe inflections within the SPECIALIST
Lexicon. See
http://lexsrv2.nlm.nih.gov/SPECIALIST/Projects/lexicon/2006/release/LEXICON/DOCS/techrpt.pdf
for a definition of the principal inflections.
The contents of this file (when present) are carried along with the lexical lookup indexes when these
indexes are created from the SPECIALIST.lex file.
Within dTagger, the contents of this file are used only to determine whether a verb can inflect to
a past or present participle, for the purpose of changing verbs to adjs. If this file is not present
(because it does not fit within the POS tag set model), the tagger should still be able to work.
The format of this file is:

Inflection Tag | POS's they come from
uncount(thr_sing) | noun
count(thr_plur) | noun
pres(fst_sing,fst-plur,second) | verb
pres(thr_sing) | verb,aux
pres_part | verb
past | verb,aux,modal
past_part | verb

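The participle check mentioned above could look roughly like the sketch below. It assumes that each lexicon entry carries its part of speech and the set of inflection tags recorded for it. That representation, the class InflectionSketch, and the method hasParticiple are hypothetical; only the tag names pres_part and past_part come from the table.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: how dTagger stores inflections per entry is an assumption here.
class InflectionSketch {

    // The two participle tags from the table; both come from verbs.
    static final Set<String> PARTICIPLE_TAGS =
            new HashSet<>(Arrays.asList("pres_part", "past_part"));

    // True when the entry is a verb and at least one of its inflection tags is a
    // participle, which makes it a candidate for being re-tagged as an adj.
    static boolean hasParticiple(String pos, Set<String> inflectionTags) {
        if (!"verb".equals(pos)) {
            return false;
        }
        for (String tag : inflectionTags) {
            if (PARTICIPLE_TAGS.contains(tag)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> tags = new HashSet<>(Arrays.asList("past", "past_part"));
        System.out.println(hasParticiple("verb", tags));                    // prints true
        System.out.println(hasParticiple("noun", tags));                    // prints false
        System.out.println(hasParticiple("verb", Collections.<String>emptySet())); // prints false
    }
}
```

When inflectionTagSet.txt is absent, a caller in this sketch would pass an empty set, so the check returns false and the tagger proceeds without the verb-to-adj adjustment.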
LRAGR2Lex [--KSYear=] [--lexiconName=]
Converts an LRAGR table into the .lex files needed for lexical lookup within the dTagger. This
program assumes that the LRAGR table is in the data/200X/lexicon directory.
Where

Option | Description | Default Value
--KSYear= | The name of the year or custom dataset | 2005
--lexiconName= | Specify the name of the lexical lookup indexes created as part of the training task. | "SPECIALIST"

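To make the default values concrete, the following sketch shows one way the two options could be resolved from the command line. It is not the LRAGR2Lex source: the class OptionsSketch and the method parse are hypothetical, and only the option names and defaults are taken from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: this is not the actual LRAGR2Lex argument handling.
class OptionsSketch {

    // Starts from the documented defaults and overrides them with any
    // "--name=value" arguments whose name is a known option.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("KSYear", "2005");
        options.put("lexiconName", "SPECIALIST");
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("--") && eq > 2) {
                String key = arg.substring(2, eq);
                if (options.containsKey(key)) {
                    options.put(key, arg.substring(eq + 1));
                }
            }
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {"--KSYear=2006"}));
        // prints {KSYear=2006, lexiconName=SPECIALIST}
    }
}
```

Unknown options are ignored in this sketch; a real tool might instead report them as errors.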