Exclusive Filter: A Term Leads with Absolute Invalid-Lead-Terms (ILT)
If a term leads with an absolute invalid-lead-term (ILT), it is not a valid multiword. Such terms are filtered out of the MEDLINE n-gram set. For example:
- away from that
- as to support
- but a simple
The absolute invalid-lead-terms (ILT) are derived from the Lexicon. Some lead-terms from the invalid lead-end-term candidate list are absolute invalid lead-terms, such as "about", "across", "across from", etc. N-grams that start with any of these absolute invalid lead-terms are not valid multiwords. In 2014, 381 absolute invalid lead-terms were derived by a computer program. "the" was moved manually from the valid-lead-terms to the invalid-lead-terms because its original classification was an error in the Lexicon. The final file used for this model contains 382 absolute invalid lead-terms. Please refer to the design documents of Lead-Terms Types and Lead-End-Terms Model for details.
- Input Term: core-term
- Filter Algorithm:
- Norm: strip punctuation except for '/.- (FilterType: FT_TBD)
- (ABB) of: => abb of
- Norm case: go through all abs-inv-lead-terms (AILTs)
- Case-1.1: if the AILT is not upper case and inTerm is upper case
- Case-1.2: if the AILT is not upper case and inTerm is mixed case and the lead-word is not upper case
=> lowercase, use inTerm.lc
- Case-2.1: if the AILT is upper case
- Case-2.2: if the AILT is not upper case and inTerm is lower case
- Case-2.3: if the AILT is not upper case and inTerm is mixed case and the lead-word is upper case
=> use inTerm (no change in case)
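| Action | Case | AILT | inTerm | inTerm converted |
|---|---|---|---|---|
| LowerCase | 1.1 | his | HIS PROBLEM | his problem |
| LowerCase | 1.2 | his | His problem | his problem |
| Keep case | 2.1 | W/O | W/O problems | W/O problems |
| Keep case | 2.2 | his | his problem | his problem |
| Keep case | 2.3 | nor | NOR mice | NOR mice |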
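The case rules above can be sketched in Java as follows. This is a minimal illustration and not the code in FilterLeadTermAbs.java: the class name `AiltCaseNormSketch`, the method names, and the working definition of "upper case" (at least one cased letter and no lowercase letters) are assumptions made for the example.

```java
import java.util.Locale;

// Hypothetical sketch of the case-normalization rules (Cases 1.1 to 2.3).
class AiltCaseNormSketch {

    // Assumption: "upper case" = contains a cased letter and no lowercase letter.
    static boolean isUpper(String s) {
        return !s.equals(s.toLowerCase(Locale.ROOT))
                && s.equals(s.toUpperCase(Locale.ROOT));
    }

    // Assumption: "lower case" = no uppercase letter.
    static boolean isLower(String s) {
        return s.equals(s.toLowerCase(Locale.ROOT));
    }

    // Returns the form of inTerm to compare against the given AILT.
    static String normCase(String ailt, String inTerm) {
        if (isUpper(ailt)) {
            return inTerm;                              // Case 2.1: keep case
        }
        if (isLower(inTerm)) {
            return inTerm;                              // Case 2.2: keep case
        }
        if (isUpper(inTerm)) {
            return inTerm.toLowerCase(Locale.ROOT);     // Case 1.1: lowercase
        }
        // inTerm is mixed case: decide on its lead-word (text before the first space).
        String leadWord = inTerm.split(" ", 2)[0];
        if (isUpper(leadWord)) {
            return inTerm;                              // Case 2.3: keep case
        }
        return inTerm.toLowerCase(Locale.ROOT);         // Case 1.2: lowercase
    }

    public static void main(String[] args) {
        System.out.println(normCase("his", "HIS PROBLEM"));
        System.out.println(normCase("his", "His problem"));
        System.out.println(normCase("W/O", "W/O problems"));
        System.out.println(normCase("nor", "NOR mice"));
    }
}
```

Under these assumptions, `normCase("his", "His problem")` gives `his problem`, while `normCase("nor", "NOR mice")` leaves the term unchanged, so an all-caps lead-word such as "NOR" is not confused with the lowercase AILT "nor".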
- Check if inTerm is itself an abs-inv-lead-term (AILT)
- Exceptions: AILTs themselves are valid terms
- Check if inTerm leads with AILT + " "
- his problem => invalid
- source code: FilterLeadTermAbs.java
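The two checks in the filter algorithm can be sketched as below. This is an illustrative outline only and does not reproduce FilterLeadTermAbs.java: the class `AiltLeadFilterSketch`, the method `isValid`, and the six-entry `SAMPLE_AILTS` list (a tiny subset built from terms named in this document) are assumptions, and the input is assumed to have already been through punctuation and case normalization.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the absolute invalid-lead-term (AILT) check.
class AiltLeadFilterSketch {

    // Illustrative subset only; the real list holds 382 entries.
    static final List<String> SAMPLE_AILTS =
            Arrays.asList("about", "across", "across from", "his", "nor", "the");

    // Returns true if the normalized term passes the filter.
    static boolean isValid(String normTerm, List<String> ailts) {
        // Exception: a term that is itself an AILT is a valid term.
        if (ailts.contains(normTerm)) {
            return true;
        }
        // A term that leads with "AILT + space" is not a valid multiword.
        for (String ailt : ailts) {
            if (normTerm.startsWith(ailt + " ")) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("his problem", SAMPLE_AILTS));
        System.out.println(isValid("his", SAMPLE_AILTS));
        System.out.println(isValid("history of disease", SAMPLE_AILTS));
    }
}
```

Appending the space before calling `startsWith` is what keeps a term such as "history of disease" from being trapped by the AILT "his". Checking the exception first also lets "across from" pass even though it begins with "across ".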
- Accuracy Test on Lexicon:
| Lexicon | Filter | Sample No | Pass No | Trap No | Exp No | Pass-Rate |
|---|---|---|---|---|---|---|
| 2018 | FT_LEAD_TERM_INV_ABS | 955564 | 955508 | 56 | 442 | 99.9941% |
| 2017 | FT_LEAD_TERM_INV_ABS | 935276 | 935222 | 54 | 422 | 99.9942% |
| 2016 | FT_LEAD_TERM_INV_ABS | 915583 | 915531 | 52 | 431 | 99.9943% |
| 2015 | FT_LEAD_TERM_INV_ABS | 896213 | 896167 | 46 | 427 | 99.9949% |
| 2014 | FT_LEAD_TERM_INV_ABS | 875090 | 875044 | 46 | 427 | 99.9947% |
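The figures in the table are consistent with Pass No = Sample No - Trap No and Pass-Rate = Pass No / Sample No. That relationship is inferred from the numbers and is not stated in this document. A short sketch that recomputes the rate under that assumption (the class and method names are hypothetical):

```java
import java.util.Locale;

// Hypothetical helper that recomputes the pass rate from the table columns.
class PassRateSketch {

    // Assumes passNo = sampleNo - trapNo and rate = passNo / sampleNo.
    static String passRate(long sampleNo, long trapNo) {
        long passNo = sampleNo - trapNo;
        double rate = 100.0 * passNo / sampleNo;
        // Four decimal places, matching the table's precision.
        return String.format(Locale.ROOT, "%.4f%%", rate);
    }

    public static void main(String[] args) {
        System.out.println("2018: " + passRate(955564L, 56L));
        System.out.println("2014: " + passRate(875090L, 46L));
    }
}
```

Recomputing from Sample No and Trap No in this way is a quick consistency check when a new Lexicon year is added to the table.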
Please note two types of valid words are filtered out by mistake:
- Init case type:
His bundle: a valid word. It is a collection of heart muscle fibers named after the Swiss cardiologist Wilhelm His Jr., who discovered them in 1893.
- Upper case type:
US EPA: United States Environmental Protection Agency
However, these types of valid words are very few. Also, these two trapped words are not multiwords and were removed from the Lexicon after 2015. In other words, "the" should belong to the absolute invalid lead-term list.
- the Netherlands
- the Staatliche Frauenklinik und Hebammenschule