Exclusive Filter: A Term with min. Document Count and Word Count
If a term's document count (DC) or word count (WC) is less than the specified minimum DC or WC, it is not considered a valid (or commonly used) multiword. Such terms are filtered out of the MEDLINE n-gram set. For example, the following terms are invalid multiwords:
- protocol, recruitment
- embryos (61.9%)
The MEDLINE n-gram set is used to retrieve the DC and WC. The filter uses 30 as the minimum WC and 1 as the minimum DC. Many multiwords in the Lexicon are not in the n-gram set, for the following reasons:
- They are spelling variants: their spelling variants (spVars) exist in the n-gram set, but they themselves do not reach the minimum WC (30).
- The n-gram may appear in different forms, such as different case or punctuation, so they do not reach the minimum WC (30).
- The Lexicon records some multiwords that have few occurrences (low WC).
- Filter Algorithm:

| Description | FilterType | Notes |
|---|---|---|
| Get DC and WC from the n-gram set | FT_TBD | |
| If the term is not in the MEDLINE n-gram set | FT_WC_DC_NOT_FOUND | Exceptions: valid terms whose DC and WC are not found in the n-gram set |
| Check if (dc < minDc) or (wc < minWc) | FT_WC_DC_INV_LESS | Filtered: invalid terms with DC or WC less than the minimum |

- Source code: FilterDcWc.java
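The filter steps above can be sketched in Java as follows. This is a minimal illustration, not the contents of FilterDcWc.java: the class name `DcWcFilterSketch`, the `Counts` holder, the `classify` method, and the sample terms and counts are all hypothetical. It also assumes that a term passing both thresholds keeps the initial FT_TBD type, which the description above does not state explicitly.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the DC/WC filter; not the actual FilterDcWc.java.
class DcWcFilterSketch {

    // Filter types named in the algorithm description.
    enum FilterType { FT_TBD, FT_WC_DC_NOT_FOUND, FT_WC_DC_INV_LESS }

    // Hypothetical holder for a term's document count and word count.
    static final class Counts {
        final int dc;
        final int wc;

        Counts(int dc, int wc) {
            this.dc = dc;
            this.wc = wc;
        }
    }

    // Thresholds stated in the text: DC >= 1 and WC >= 30.
    static final int MIN_DC = 1;
    static final int MIN_WC = 30;

    // Look up DC and WC for a term and apply the two checks in order.
    static FilterType classify(String term, Map<String, Counts> nGramSet,
                               int minDc, int minWc) {
        Counts counts = nGramSet.get(term);
        if (counts == null) {
            // Term is absent from the n-gram set.
            return FilterType.FT_WC_DC_NOT_FOUND;
        }
        if (counts.dc < minDc || counts.wc < minWc) {
            // Term is present but below at least one minimum.
            return FilterType.FT_WC_DC_INV_LESS;
        }
        // Assumption: a passing term keeps the initial type.
        return FilterType.FT_TBD;
    }

    public static void main(String[] args) {
        // Invented counts for illustration only; not real MEDLINE data.
        Map<String, Counts> nGramSet = new HashMap<>();
        nGramSet.put("common phrase", new Counts(120, 450));
        nGramSet.put("rare phrase", new Counts(3, 7));

        String[] terms = { "common phrase", "rare phrase", "missing phrase" };
        for (String term : terms) {
            System.out.println(term + " -> "
                + classify(term, nGramSet, MIN_DC, MIN_WC));
        }
    }
}
```

The comparisons are strict, so a term with exactly the minimum WC (30) is kept, consistent with the "DC >= 1; WC >= 30" condition used in the accuracy test.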
- Accuracy Test on Lexicon (DC >= 1; WC >= 30):

| Lexicon | Filter | Sample No | Pass No | Trap No | Exp No | Pass-Rate |
|---|---|---|---|---|---|---|
| 2018 | FT_WC_DC_INV_LESS | 955564 | 955564 | 0 | 625175 | 100.0000% |
| 2016 | FT_WC_DC_INV_LESS | 915583 | 915583 | 0 | 618966 | 100.0000% |
| 2015 | FT_WC_DC_INV_LESS | 896213 | 896213 | 0 | 612316 | 100.0000% |
| 2014 | FT_WC_DC_INV_LESS | 875090 | 875090 | 0 | 603592 | 100.0000% |