Derivations Procedures - suffixD

Generate suffixD pairs in derivation table:

I. Directory: ${DERIVATION}/3.suffixD

II. Input Files (./data/${YEAR}/dataOrg/):
Run GetSuffixD and select option 0 (prepare directories and files):
shell> ${SUFFIX_D}/bin/GetSuffixD ${YEAR}
0

  • link LEXICON to LEXICON.${YEAR} (from ${LEXICON_DIR}/LEXICON.release)
  • link inflVars.data to inflVars.data.${YEAR} (from ${LEXICON_DIR})
  • link bases.data from prefixD/data/ (Complete step-1 in prefixD first)
  • link sdRules.data to sdRules.data.${YEAR} (from ${PREV_YEAR} and new rules)
  • link suffixD.tag.txt to suffixD.tag.txt.${YEAR} (copy from previous year)

  • create (touch) an empty suffixD.meta.data.conflict.tag.data for the initial phase

  • nomD must be completed first for the auto-tag program to work
  • prefixD Step-1 must be run first to produce bases.data
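
The linking described in this section can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name link_suffixd_inputs is hypothetical, the year-stamped files are assumed to already exist in dataOrg, and bases.data is left out because its exact source path under prefixD/data/ is not given here.

```shell
# Sketch only: link the year-stamped input files in dataOrg to the
# year-neutral names listed above. Assumes the *.${YEAR} files are
# already in place in that directory (hypothetical helper, not part
# of GetSuffixD itself).
link_suffixd_inputs() {
    data_org=$1   # e.g. ./data/${YEAR}/dataOrg
    year=$2
    (
        cd "$data_org" || exit 1
        for f in LEXICON inflVars.data sdRules.data suffixD.tag.txt; do
            if [ ! -e "$f.$year" ]; then
                echo "missing input: $f.$year" >&2
                exit 1
            fi
            # Replace any existing link so the function can be re-run.
            ln -sf "$f.$year" "$f"
        done
        # Empty conflict-tag file for the initial phase.
        touch suffixD.meta.data.conflict.tag.data
    )
}

# Usage: link_suffixd_inputs ./data/${YEAR}/dataOrg ${YEAR}
```

The function stops at the first missing input rather than creating a dangling link, so a failed setup is visible before Step 1 is run.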

III. Final file for allD (release)

  • ${TAR_DIR}/suffixD.yes.data.${YEAR}

IV. Summary of GetSuffixD

Step | Description and Program | Input | Output | Notes
Step 0
  • Description: Prepare directories and files
  • Input: See section II.
  • Output: See section II.
  • Notes:
    • 3.suffixD/data/${YEAR}/dataOrg
      • LEXICON
      • inflVars.data
      • bases.data
      • sdRules.data
      • suffixD.tag.txt
Step 1
  • Description: Retrieve std-raw suffixD pairs
  • Program: GetSuffixDRawFromBaseFile.java
  • Input:
    • ${SRC_DIR}:
      • bases.data
      • sdRules.data
  • Output:
    • suffixD.raw.data.fromBase.all
    • sdRules.rawNo.rpt
  • Notes:
    • prefixD Step-1 must be completed first to get bases.data
    • Rerun from this step if new SD-Rules are involved:
      • Add the new SD-Rules to ./dataOrg/sdRules.data.${YEAR}
      • Get the sd-pairs (TBD) for each new SD-Rule
      • Send the TBD pairs to a linguist to tag (yes|no)
      • Save the new tag results to ./dataOrg/newRuleTag/
      • Add the new tag results to ./dataOrg/suffixD.tag.txt.${YEAR}
Step 2
  • Description: Combine with nomD.S file (raw)
  • Program: CheckWithNomDFile.java
  • Input:
    • ${NOM_TAR_DIR}:
      • nomD.yes.S.data.${YEAR}
    • ${TAR_DIR}:
      • suffixD.raw.data.fromBase
  • Output:
    • suffixD.raw.data.fromNomD
    • suffixD.raw.data
  • Notes:
    • Must link suffixD.raw.data.fromBase to suffixD.raw.data.fromBase.all to run this step
Step 3
  • Description: Add tags to suffixD meta file
  • Program:
    • GetSuffixDMetaFile.java
    • DPairTagList.java
  • Input:
    • ${NOM_TAR_DIR}:
      • nomD.yes.S.data.${YEAR}
    • ${SRC_DIR}:
      • suffixD.tag.txt (suffixD.tag.txt.${YEAR}.uSort)
    • ${TAR_DIR}:
      • suffixD.raw.data
  • Output:
    • suffixD.meta.data
    • suffixD.meta.data.conflict
  • Notes:
    • Check the total conflict tag number (conflicts between nomD tags and the expert's tags)
      => Shown as "-- Total conflict tag no:" in log.3
    • Remove duplicate tags from ./dataOrg/suffixD.tag.txt
    • Ignore the long list of duplicated tags (between manual tags and nomD tags) in log.3
    • The file suffixD.meta.data.conflict lists suffixD tag conflicts caused by SpVars between two records
    • Ideally, all suffixD tags should be consistent among SpVars between records
    • In the 1st run (before tags are added for the annual update), no conflict should exist. That is, skip Step-9 and go to Step-4 for the 1st run.
    • suffixD.meta.data.conflict should be empty (except for 1 known exception)
    • If it is not empty, send it to linguists to tag (yes|no|both) on the EUI lines:
      • yes: all suffixD tags among SpVars between records are valid
      • no: all suffixD tags among SpVars between records are invalid
      • both: suffixD tags among SpVars between records include valid and invalid (exception)
    • There is a known exception (since 2014+):
      1|E0056852|E0234312|both
      # 20092|space|noun|E0056852|spacey|adj|E0234312|no
      # 38379|space|noun|E0056852|spacy|adj|E0234312|yes
    • Run the next step (9) to update the results in suffixD.tag.txt automatically, then re-run this step (3) until all exceptions are known
    • If all conflict exceptions are known (fixed), go to Step-4
Step 9
  • Description: Auto-fix suffixD.tag.txt
  • Program: FixConflictDPairTags.java
  • Input:
    • ${SRC_DIR}:
      • suffixD.tag.txt.${YEAR}
      • suffixD.meta.data.conflict.tag.data
  • Output:
    • ${SRC_DIR}:
      • suffixD.tag.txt.${YEAR}.fixDPair
  • Notes:
    • Make sure the linguists' tagging results are saved to ./dataOrg/suffixD.meta.data.conflict.tag.data
    • Manually examine ./dataOrg/suffixD.tag.txt.${YEAR}.fixDPair
    • If suffixD.tag.txt.${YEAR}.fixDPair passes the examination, move it to suffixD.tag.txt.${YEAR}, then re-run Step-3.
Step 4
  • Description: Split suffixD meta file (yes|no|tbd)
  • Program: SplitSuffixDMetaFile.java
  • Input:
    • ${TAR_DIR}:
      • suffixD.meta.data
  • Output:
    • suffixD.yes.data
    • suffixD.no.data
    • suffixD.tbd.data
    • suffixD.tbd.data.sort (sent to linguists)
    • suffixD.yesNo.data
  • Notes:
    • Make sure suffixD.tbd.data(.sort) is empty. If not, send it to linguists to tag:
      • Tag suffixD: (yes|no)
        • valid suffixD: yes
        • invalid suffixD: no
    • Add (update) these newly tagged sd-pairs to ./dataOrg/suffixD.tag.txt and rerun Steps 3~4
Step 5
  • Description: Verify dType on suffixD.yes.data
  • Program: DType.java
  • Input:
    • ${ALL_SRC_DIR}:
      • LRSPL
      • dTypeStr.data
    • ${TAR_DIR}:
      • suffixD.yes.data
  • Output:
    • suffixD.yes.data.type
    • suffixD.yes.data.type.Z
    • suffixD.yes.data.type.S
    • suffixD.yes.data.type.P
    • suffixD.yes.data.type.ZS
    • suffixD.yes.data.type.SS
    • suffixD.yes.data.type.PS
    • suffixD.yes.data.type.U
  • Notes:
    • Make sure the unknown dType (|U|) output from suffixD is empty
Step 6
  • Description: Add negation tag (N|O), sort uniquely
  • Program:
    • AddNegationTagToFile.java
    • DPairTagList.java
  • Input:
    • ${TAR_DIR}:
      • suffixD.yes.data
  • Output:
    • suffixD.yes.data.${YEAR}
    • suffixD.yes.data.${YEAR}.conflict
  • Notes:
    • The conflict file (suffixD.yes.data.${YEAR}.conflict) lists all inconsistent suffixD tags between SpVars in two records
      • Send conflicts to a linguist to tag (yes|no|both) on the EUI lines
      • In the past, there were no "both" cases in suffixD
      • Manually update the results in suffixD.tag.txt
      • Rerun Steps 3~6 until no unknown conflicts (both) exist.
Step 7
  • Description: Check affix on suffixD.yes.data.${YEAR}
  • Program: CheckDerivationByAffix6.java
  • Input:
    • ${ALL_SRC_DIR}:
      • LRSPL
    • ${SRC_DIR}:
      • suffixD.tagYes.txt
    • ${TAR_DIR}:
      • suffixD.yes.data.${YEAR}
  • Output:
    • suffixD.pattern3.rpt
  • Notes:
    • Make sure suffixD.pattern3.rpt is empty.
    • This report lists all potentially invalid dPairs, found by checking the 1st and last 3 characters on the affix.
    • If it is not empty, send it to linguists to tag (Yes|No):
      • invalid dPair (No): add to suffixD.tagNo.txt (not used!). This should not happen!
      • valid dPair (Yes): add to suffixD.tagYes.txt, then rerun Step 7
Step 8
  • Description: Steps 1 ~ 7
  • Input: See above
  • Output: See above
  • Notes: Not recommended!

Other options
Step 11
  • Description: Get stats for SD-rule
  • Program: GetSdRuleStatsFromTaggedSuffixD.java
  • Input:
    • ${SRC_DIR}:
      • sdRules.data
    • ${TAR_DIR}:
      • suffixD.meta.data
  • Output:
    • sdRules.stats.rpt
    • sdRules.stats.detail.rpt
  • Notes:
    • Used for analysis in finding the optimal Sd-Rules set
Step 12
  • Description: Get the HTML files
    ALL
  • Program: GetSdRuleListHtmlFile.java
  • Input:
    • ${SRC_DIR}:
      • sdRules.data
    • ${TAR_DIR}:
      • suffixD.meta.data
  • Output:
    • ${HTML_DIR}:
      • suffixDRules.html
      • SD-Examples
      • SD-Exceptions
  • Notes:
    • Copy to ${LEXICON_WEB} for the annual Sd-Rules updates

V. Process Details:

  • shell>cd ${DERIVATION}/suffixD/bin
  • shell>GetSuffixD ${YEAR}

    1: Retrieve std-raw suffixD pairs
    => generate:

    • ./data/sdRules.rawNo.rpt
    • ./data/suffixD.raw.data.fromBase.all

    2: Check/integrate with nomD.S file (raw)
    => ln -s ./suffixD.raw.data.fromBase.all suffixD.raw.data.fromBase
    => generate:

    • ./data/suffixD.raw.data.fromNomD
    • ./data/suffixD.raw.data (= suffixD.raw.data.fromBase + suffixD.raw.data.fromNomD, has comment line #)

    3: Add tags to suffixD meta file (meta)
    => generate ./data/suffixD.meta.data (comment lines # are removed from raw)

    • Make sure there is no duplicated tag in ./dataOrg/suffixD.tag.txt
    • Program automatically tags nomD.S as valid suffixD pairs
    • Duplicated dPairs are OK (from nomD)
    • Correct all conflict dPairs (from nomD)
      => verify with linguists

    3.1: Verify suffixD meta file (meta)
    => Check consistency on derivational tag between 2 records with SpVars
    => generate ./data/suffixD.meta.data.conflict

    • All conflict EUI pairs must be manually reviewed, and the tags then updated in ./dataOrg/suffixD.tag.txt
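
As a rough illustration of the kind of inconsistency reviewed here, the following sketch lists EUI pairs that carry more than one distinct tag. It is not the logic of GetSuffixDMetaFile.java, which checks SpVars between records. The function name find_tag_conflicts is hypothetical, and the field layout (8 pipe-separated fields, EUIs in fields 4 and 7, tag in field 8) is an assumption taken from the "space" example records, not a documented format.

```shell
# Sketch only: list EUI pairs that carry more than one distinct tag.
# Assumes 8 pipe-separated fields per record, with the two EUIs in
# fields 4 and 7 and the tag in field 8 (hypothetical layout).
find_tag_conflicts() {
    awk -F'|' '
        /^#/ || NF < 8 { next }            # skip comments and short lines
        {
            key = $4 "|" $7
            if (!((key, $8) in seen)) {    # count each distinct tag once
                seen[key, $8] = 1
                ntags[key]++
            }
        }
        END {
            for (key in ntags)
                if (ntags[key] > 1) print key
        }
    ' "$1" | sort
}

# Usage: find_tag_conflicts ./data/suffixD.meta.data
```

A pair such as E0056852|E0234312, tagged "no" for spacey and "yes" for spacy, would be reported by this sketch; whether it is a real error or a legitimate "both" case is still for the linguists to decide.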

    4: Split suffixD meta file (yes|no|tbd)
    => generate:

    • ./data/suffixD.yes.data
    • ./data/suffixD.no.data
    • ./data/suffixD.tbd.data (should be 0 if the annual update is completed)
      => if not, send it to a linguist to tag for this annual update, then add the updates to ./dataOrg/suffixD.tag.txt.${YEAR}

    • Duplicated SD pairs are normal because they are generated from parent-child candidate SD-rules.
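
The split in option 4 can be pictured with a short awk sketch. This is not SplitSuffixDMetaFile.java. The function name split_by_tag is hypothetical, and it assumes that the tag is the last pipe-separated field and that any value other than "yes" or "no" means the pair is still to be tagged.

```shell
# Sketch only: route each record to a yes/no/tbd file by its last field.
# Assumption: the tag is the last pipe-separated field, and any value
# other than "yes" or "no" counts as tbd.
split_by_tag() {
    meta=$1      # e.g. ./data/suffixD.meta.data
    prefix=$2    # e.g. ./data/suffixD
    # Create all three outputs up front so an unused one is left empty.
    : > "$prefix.yes.data"
    : > "$prefix.no.data"
    : > "$prefix.tbd.data"
    awk -F'|' -v prefix="$prefix" '
        /^#/ || NF == 0 { next }           # skip comments and blank lines
        {
            tag = $NF
            if (tag != "yes" && tag != "no") tag = "tbd"
            print > (prefix "." tag ".data")
        }
    ' "$meta"
}

# Usage: split_by_tag ./data/suffixD.meta.data ./data/suffixD
```

Creating the three files before the split means an empty suffixD.tbd.data can be tested for directly, which matches the "should be 0" check above.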

    5: Verify dType on suffixD.yes.data
    => generate:

    • ./data/suffixD.yes.data.type
      • ./data/suffixD.yes.data.type.Z (must be 0)
      • ./data/suffixD.yes.data.type.P (must be 0)
      • ./data/suffixD.yes.data.type.S (= suffixD.yes.data)
      • ./data/suffixD.yes.data.type.ZS (must be 0)
      • ./data/suffixD.yes.data.type.PS (must be 0)
      • ./data/suffixD.yes.data.type.SS (should be 0)
      • ./data/suffixD.yes.data.type.U (must be 0)
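
The "must be 0" checks above are easy to automate. The following is a minimal sketch, assuming that "0" means an empty file. The function name check_dtype_files is hypothetical. It treats a non-empty Z, P, ZS, PS, or U file as an error and a non-empty SS file only as a warning, following the "must" and "should" wording in the list.

```shell
# Sketch only: verify that the dType output files which must be empty
# are empty. Assumes "0" in the list above means a zero-length file.
check_dtype_files() {
    base=$1      # e.g. ./data/suffixD.yes.data.type
    status=0
    for t in Z P ZS PS U; do
        if [ -s "$base.$t" ]; then
            echo "ERROR: $base.$t is not empty" >&2
            status=1
        fi
    done
    # SS is only "should be 0", so report it without failing.
    if [ -s "$base.SS" ]; then
        echo "WARNING: $base.SS is not empty" >&2
    fi
    return $status
}

# Usage: check_dtype_files ./data/suffixD.yes.data.type || echo "fix dTypes first"
```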

    6: Add negation tag (N|O) and sort -u, for the annual suffixD
    => generate ./data/suffixD.yes.data.${YEAR}

    10
    => generate ./data/

    7: Get stats for sd-Rule from suffixD.tag.txt. Use this option to generate all suffixD pairs for a specified suffix (check the suffixD.rawNo.rpt)

  • send data/suffixD.tbd.data to linguists for tagging:
    • derivation: yes|no
  • re-run this process until all suffixD pairs are tagged (0 in suffixD.tbd.data)
  • The final suffixD file is ${DERIVATION}/suffixD/data/${YEAR}/data/suffixD.yes.data.${YEAR}

Please refer to the derivation design documents in Lexical Tools for details.