Derivations Procedures - orgD

Retrieve and verify dPairs from the original Lexical Tools DM.DB and add them to the derivation table. This should be done after nomD, prefixD, suffixD, and zeroD. All orgD that have EUIs (in the Lexicon), are valid dPairs, and are not already included in our system are added to the final derivation table. We don't expect many valid dPairs from orgD because only new LexRecords in orgD will be added. However, this procedure requires many manual updates (in Steps 2, 5, 6, 7, 8, and 81).

I. Directory:

  • ${DERIVATION}/0.orgD

II. Input Files (./data/${YEAR}/dataOrg/):
shell> ${ORG_D}/bin/GetOrgD ${YEAR}
0

  • Copy the following 5 files from ${PREV_YEAR}:
    • convers.fct
    • dm.fct
    • etc.fct
    • nomiz.fct
    • pd.fct

  • Copy ${SRC_DIR}/orgD.yes.data.${YEAR} from the previous year
    => ln -sf ${SRC_DIR}/orgD.yes.data.final to orgD.yes.data.${YEAR}
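The copy of the five fact files can be sketched as a small shell function. This is only an illustration under assumptions: prepare_data_org is a hypothetical helper (not part of GetOrgD), and both arguments are assumed to be dataOrg directories laid out as ./data/<year>/dataOrg.

```shell
# Sketch: copy the five original Lvg fact files from the previous year's
# dataOrg directory into this year's dataOrg directory.
# prepare_data_org is a hypothetical name used for illustration only.
prepare_data_org() {
    prev_dir=$1   # assumed: ./data/${PREV_YEAR}/dataOrg
    cur_dir=$2    # assumed: ./data/${YEAR}/dataOrg
    mkdir -p "$cur_dir" || return 1
    for f in convers.fct dm.fct etc.fct nomiz.fct pd.fct; do
        # Stop at the first missing input instead of copying a partial set.
        if [ ! -f "$prev_dir/$f" ]; then
            echo "missing input: $prev_dir/$f" >&2
            return 1
        fi
        cp -p "$prev_dir/$f" "$cur_dir/$f" || return 1
    done
}
```

Possible use: prepare_data_org ./data/${PREV_YEAR}/dataOrg ./data/${YEAR}/dataOrg. The link for orgD.yes.data.final is left out of the sketch on purpose and is still done as described above.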

III. Final files for allD (release)

  • ${TAR_DIR}/orgD.yes.${YEAR}

IV. Summary of GetOrgD

Step | Description and Program | Input | Output | Notes
0
  • Prepare directories and files
  • Input and output: see section II.
  • 0.orgD/data/${YEAR}/dataOrg
    • convers.fct
    • dm.fct
    • etc.fct
    • nomiz.fct
    • pd.fct

    • orgD.yes.data.final
1
  • Get all dPairs from Original Lvg Facts
  • ${SRC_DIR}:
    • dm.fct
    • etc.fct
    • convers.fct
    • nomiz.fct
    • pd.fct
  • orgD.raw.data
  • Must run step-0 to copy 5 original files first!
2
  • Reformat to pure dPair file: remove comments, uSort, remove empty lines
  • ${TAR_DIR}:
    • orgD.raw.data
  • orgD.yes.data
  • Manually remove the empty (1st) line
3
  • Add EUI (to orgD.yes.data)
  • AddEuiToOrgD.java
  • ${SRC_DIR}:
    • LEXICON.${YEAR}
    • orgD.yes.data.final
  • orgD.yes.data.final.allEui
  • orgD.yes.data.final.noEui
  • orgD.yes.data.final.yesEui
  • Copy and link orgD.yes.data.final from previous year (step-0)
  • Make sure the number of dPairs in allEui = yesEui + noEui
4
  • Add dType to valid orgD with EUI (in Lexicon)
  • DType.java
  • ${TAR_DIR}:
    • orgD.yes.data.final.yesEui

  • ${ALL_SRC_DIR}:
    • LRSPL
    • dTypeStr.data
  • orgD.yes.data.final.yesEui.type
  • orgD.yes.data.final.yesEui.type.Z
  • orgD.yes.data.final.yesEui.type.S
  • orgD.yes.data.final.yesEui.type.P
  • orgD.yes.data.final.yesEui.type.ZS
  • orgD.yes.data.final.yesEui.type.SS
  • orgD.yes.data.final.yesEui.type.PS
  • orgD.yes.data.final.yesEui.type.U
  • Go through steps 5 ~ 8 to take care of types Z, P, U, and S.
5
  • Add known tags to type Z - zeroD
  • AppendFieldSeparator.java
  • GetZeroDMetaFile.java
  • SplitZeroDMetaFile.java
  • ${TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.Z

  • ${ZERO_SRC_DIR}:
    • zeroD.tag.txt

  • ${NOM_TAR_DIR}:
    • nomD.yes.Z.data.${YEAR}
  • orgD.yes.data.final.yesEui.type.Z.raw
  • orgD.yes.data.final.yesEui.type.Z.meta

  • orgD.yes.data.final.yesEui.type.Z.yes.data
  • orgD.yes.data.final.yesEui.type.Z.no.data
  • orgD.yes.data.final.yesEui.type.Z.tbd.data

  • Manually update orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR}
    => See details in the next column
  • All dPairs in orgD.yes.data.final.yesEui.type.Z.tbd.data must be tagged, so this file should be empty. If it is not empty (apart from dPairs already known from the past), send them to linguists to tag.
    • add the tags to orgD.yes.data.final.yesEui.type.Z.tbd.data.tag
    • Manually retrieve valid (yes) of above file to orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR}
    • In 2015 release, orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR} is empty.
    • if the tbd file is empty (no updates), the yes.${YEAR} file is empty.
      => touch orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR}

    • These valid zeroD will be added to orgD.yes.${YEAR} in step 9
6
  • Add known tags to type P - prefixD
  • AppendFieldSeparator.java
  • GetPrefixMetaFile.java
  • SplitPrefixDMetaFile.java
  • ${TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.P

  • ${PREFIX_SRC_DIR}:
    • prefixD.tag.txt
    • prefixList.data
  • orgD.yes.data.final.yesEui.type.P.raw
  • orgD.yes.data.final.yesEui.type.P.meta

  • orgD.yes.data.final.yesEui.type.P.yes.data
  • orgD.yes.data.final.yesEui.type.P.no.data
  • orgD.yes.data.final.yesEui.type.P.tbd.data
  • orgD.yes.data.final.yesEui.type.P.tbt.data
  • prefixD.yesNo.data

  • Manually update orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR}
    => See details in the next column
  • All dPairs in orgD.yes.data.final.yesEui.type.P.tbd.data must be tagged. If it is not empty (apart from the 1 known exception from the past), send them to linguists to tag.
    • In 2015 release, there is 1 known tbd (invalid) prefixD from past
      motor neuron|noun|E0354096|neuron|noun|E0042456|no
      => motor is not a prefix, "motor neuron" is a compound.
    • add the linguist's tags to orgD.yes.data.final.yesEui.type.P.tbd.data.tag
    • Manually retrieve valid (yes) of above file to orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR}
    • In 2015 release, orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR} is empty.
    • If no yes is tagged,
      => touch orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR}
    • These valid prefixD will be added to orgD.yes.${YEAR} in step 9
7
  • Add known tags to type U - unknown type
  • AppendFieldSeparator.java
    => The dType of these orgD can't be identified by the program
  • ${TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.U

  • ${PREV_TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.U.raw
  • orgD.yes.data.final.yesEui.type.U.raw
  • orgD.yes.data.final.yesEui.type.U.raw.old
  • orgD.yes.data.final.yesEui.type.U.raw.new

  • Manually copy and update orgD.yes.data.final.yesEui.type.U.yes.${YEAR}
    => See details in the next column
  • orgD.yes.data.final.yesEui.type.U.raw.new should be empty (0)
  • If not, send it to linguists to tag (yes|no)
    => The differences (new) are the orgDs from new EUIs
  • Copy orgD.yes.data.final.yesEui.type.U.yes.${YEAR} from previous year
  • Manually add negation|dType|prefix for new valid orgDs to orgD.yes.data.final.yesEui.type.U.yes.${YEAR}
8
  • Add known tags to type S - suffixD
  • AppendFieldSeparator.java
  • GetSuffixDMetaFile.java
  • SplitSuffixDMetaFile.java
  • ${TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.S

  • ${SUFFIX_SRC_DIR}:
    • suffixD.tag.txt

  • ${NOM_TAR_DIR}:
    • nomD.yes.S.data.${YEAR}
  • orgD.yes.data.final.yesEui.type.S.raw
  • orgD.yes.data.final.yesEui.type.S.meta

  • orgD.yes.data.final.yesEui.type.S.yes.data
  • orgD.yes.data.final.yesEui.type.S.no.data
  • orgD.yes.data.final.yesEui.type.S.yesNo.data
    => Already known in suffixD (duplicates)
  • orgD.yes.data.final.yesEui.type.S.tbd.data
    => Need to be tagged for these suffixD from new Lexicon updates
  • All dPairs in orgD.yes.data.final.yesEui.type.S.tbd.data must be tagged.
  • If it is not empty (apart from dPairs already known from the past), send it to linguists to tag.
  • Follow steps 81-83 to complete suffix with TBD tags in orgD
    => old: most of them were tagged (known) in previous years
    => new: new orgD in the updated Lexicon, need to be sent to linguists to tag
81
  • Find new suffix TBD orgD and manually complete the tag file
  • Subset1Way.java
  • ${PREV_YEAR_ORG_TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.S.tbd.data

  • ${ORG_TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.S.tbd.data
  • orgD.yes.data.final.yesEui.type.S.tbd.data.old
  • orgD.yes.data.final.yesEui.type.S.tbd.data.new

  • Manually copy (from previous year) and update orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR}, see detail from the next column
  • The new TBD suffix orgD (orgD.yes.data.final.yesEui.type.S.tbd.data.new) should be empty (0)
    => these come from updates to the new Lexicon, SpVar, or nominalizations
    => even if it is not empty, it should be very small
  • If not empty, send to linguists to tag (yes|no)
  • Manually copy orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR} from previous year
  • Manually add tagging results (yes|no) suffixD to orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR}
82
  • Add tags (yes|no) to suffix TBD orgD file
  • Split tagged file (yes|no|tbd|yesNo)
  • GetSuffixDMetaFile.java
  • SplitSuffixDMetaFile.java
  • ${NOM_TAR_DIR}:
    • nomD.yes.S.data.${YEAR}
  • ${ORG_TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.S.tbd.data
    • orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR}
      => Copy from previous year
  • orgD.yes.data.final.yesEui.type.S.tag.data

  • orgD.yes.data.final.yesEui.type.S.tag.data.yes
  • orgD.yes.data.final.yesEui.type.S.tag.data.no
  • orgD.yes.data.final.yesEui.type.S.tag.data.yesNo
  • orgD.yes.data.final.yesEui.type.S.tag.data.tbd
  • Make sure there are no conflicting tags from nomD when adding tags (this might happen due to new nominalizations).
  • Make sure the number of tbd is 0 (orgD.yes.data.final.yesEui.type.S.tag.data.tbd)
  • If not, check orgD.yes.data.final.yesEui.type.S.tag.data.${YEAR} in Step 81 and rerun Steps 81 ~ 82
83
  • Finalize suffix OrgD: auto add negation (N|O), dType|prefix (S|None)
  • AddNegationTagToFile.java
  • GenerateSuffixDTable.java
  • ${ORG_TAR_DIR}:
    • orgD.yes.data.final.yesEui.type.S.tag.data.yes
  • orgD.yes.data.final.yesEui.type.S.tag.data.yes.negation
  • orgD.yes.data.final.yesEui.type.S.tag.data.yes.${YEAR}
  • orgD.yes.data.final.yesEui.type.S.tbd.data.yes.${YEAR} is valid suffixD from TBD orgD
  • This final file is used in Step 9
9
  • Combine Z, S, P, U (Steps 5-8) to orgD.yes.${YEAR}
  • ${TAR_DIR}: Must run 5-8 to get following files.
    • orgD.yes.data.final.yesEui.type.Z.tbd.data.yes.${YEAR}
    • orgD.yes.data.final.yesEui.type.P.tbd.data.yes.${YEAR}
    • orgD.yes.data.final.yesEui.type.S.tbd.data.yes.${YEAR}
    • orgD.yes.data.final.yesEui.type.U.yes.${YEAR}
  • orgD.yes.${YEAR}
  • Must run through steps 4 ~ 8 (81-83) first!
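
The combination in step 9 can be sketched with standard tools. This is a minimal illustration, not the GetOrgD implementation: combine_org_d is a hypothetical helper, and it assumes that plain concatenation followed by a unique sort is acceptable; the real step may order or filter the lines differently.

```shell
# Sketch of step 9: combine the four per-type "yes" files listed above
# into orgD.yes.${YEAR}. combine_org_d is a hypothetical name.
combine_org_d() {
    tar_dir=$1
    year=$2
    prefix="$tar_dir/orgD.yes.data.final.yesEui.type"
    # Reuse the positional parameters as the list of input files.
    set -- "$prefix.Z.tbd.data.yes.$year" \
           "$prefix.P.tbd.data.yes.$year" \
           "$prefix.S.tbd.data.yes.$year" \
           "$prefix.U.yes.$year"
    for f in "$@"; do
        # Every input must exist (it may be empty, e.g. created by touch).
        if [ ! -f "$f" ]; then
            echo "missing $f: run steps 5-8 first" >&2
            return 1
        fi
    done
    # Assumption: duplicates across types are not wanted in the result.
    LC_ALL=C sort -u "$@" > "$tar_dir/orgD.yes.$year"
}
```

Possible use: combine_org_d ${TAR_DIR} ${YEAR}. The function refuses to write orgD.yes.${YEAR} while any of the four inputs is missing, which mirrors the note that steps 5 ~ 8 must be run first.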

V. Process details:

  • shell>cd ${DERIVATION}/0.orgD/bin
  • shell>GetOrgD ${YEAR}

    1) Combine Original Lvg Fact dPairs from 5 files
    Combine above five files:
    => Generate ./data/${YEAR}/data/orgD.raw.data

    2) Reformat: remove comments, uSort, remove empty lines
    Reformat: remove comments, uSort, remove empty lines:
    => Generate ./data/${YEAR}/data/orgD.yes.data
    => Remove 1st (empty) line in ./data/${YEAR}/data/orgD.yes.data
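
The reformat in step 2 can be approximated with standard tools. This is a sketch under assumptions: reformat_org_d is a hypothetical helper, and comment lines are assumed to start with '#', which should be checked against the actual .fct files.

```shell
# Sketch of step 2: turn the raw file into a pure dPair file.
# Assumption: comment lines start with '#'; adjust the pattern if the
# fact files use a different comment marker.
reformat_org_d() {
    raw_file=$1
    # 1) drop comment lines
    # 2) drop empty and whitespace-only lines
    # 3) sort and remove duplicates (LC_ALL=C gives a stable byte order)
    grep -v '^#' "$raw_file" \
        | grep -v '^[[:space:]]*$' \
        | LC_ALL=C sort -u
}
```

Possible use: reformat_org_d orgD.raw.data > orgD.yes.data. Because empty lines are filtered before sorting, the output of this sketch has no leading empty line to remove by hand.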

    3) Add EUI to orgD.yes.data.final (use E0000000 for no EUI)
    Add EUI to dPairs in orgD.yes.data, use E0000000 if no EUI found
    => Prepare:

    • cd ./data/${YEAR}/dataOrg/
    • link LEXICON.${YEAR} to /nfsvol/lex/Lu/Development/Lexicon/data/${YEAR}/data/LEXICON.release
    • link orgD.yes.data.final to ../../${PREV_YEAR}/data/orgD.yes.data.${PREV_YEAR}

    => Generate:

    • ./data/${YEAR}/data/orgD.yes.data.final.all
    • ./data/${YEAR}/data/orgD.yes.data.final.noEui
      => Send to linguists so they can add new records for the terms marked E0000000
    • ./data/${YEAR}/data/orgD.yes.data.final.yesEui
      => go to step 4.
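
The yes/no split and the count check for step 3 can be sketched as follows. This is an illustration, not AddEuiToOrgD.java: the helper names are hypothetical, and the line layout base1|cat1|EUI1|base2|cat2|EUI2 (EUIs in fields 3 and 6) is an assumption based on the example dPair shown for prefixD. A dPair is treated as noEui if either side carries E0000000.

```shell
# Sketch of the step-3 split. Assumption: fields 3 and 6 hold the EUIs
# and E0000000 marks a term without an EUI.
split_by_eui() {
    all_file=$1; yes_file=$2; no_file=$3
    awk -F'|' -v yes="$yes_file" -v no="$no_file" '
        BEGIN { printf "" > yes; printf "" > no }   # create both files, even if empty
        $3 == "E0000000" || $6 == "E0000000" { print > no; next }
        { print > yes }
    ' "$all_file"
}

# Number of lines in a file, without the padding some wc versions add.
count_lines() { wc -l < "$1" | tr -d ' '; }

# Check that the number of all dPairs equals yes + no.
check_eui_counts() {
    all=$(count_lines "$1")
    yes=$(count_lines "$2")
    no=$(count_lines "$3")
    if [ "$all" -ne $((yes + no)) ]; then
        echo "count mismatch: all=$all yes=$yes no=$no" >&2
        return 1
    fi
    echo "all=$all yes=$yes no=$no"
}
```

Possible use: run check_eui_counts on the all, yesEui, and noEui files after the program has generated them; a non-zero exit status means the three files are out of sync.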

    4) Add dType of orgD.yes.data.final.yesEui
    Add dType (zeroD, suffixD, or prefixD) to the dPairs in orgD.yes.data.final.yesEui:
    Generate:

    • orgD.yes.data.final.yesEui.type.P
      => go to step 5
    • orgD.yes.data.final.yesEui.type.S
      => go to step 6
    • orgD.yes.data.final.yesEui.type.Z
      => go to step 7
    • orgD.yes.data.final.yesEui.type.PS
      => ignore because PDs by SpVars are excluded
    • orgD.yes.data.final.yesEui.type.SS
      => ignore because SDs by SpVars are excluded
    • orgD.yes.data.final.yesEui.type.ZS
      => ignore because ZDs by SpVars are excluded
    • orgD.yes.data.final.yesEui.type.U
      => Manually review and assign dTag and dType
      • If valid dPairs:
        => manually add to P, S, Z
        => add to ${ALL_D}/data/${YEAR}/dataOrg/dTypeStr.data
      • If invalid dPairs
        => add to ./dataOrg/orgD.tag.txt

    5) Add tag to prefixD: orgD.yes.data.final.yesEui.type.P
    Generates:

    • orgD.yes.data.final.yesEui.type.P.raw
    • orgD.yes.data.final.yesEui.type.P.meta
      • orgD.yes.data.final.yesEui.type.P.yes.data
      • orgD.yes.data.final.yesEui.type.P.no.data
      • orgD.yes.data.final.yesEui.type.P.tbd.data
        => send to linguists, then add all dPairs tagged "yes" to prefixD
      • orgD.yes.data.final.yesEui.type.P.tbt.data

    6) Add tag to suffixD: orgD.yes.data.final.yesEui.type.S
    Generates:

    • orgD.yes.data.final.yesEui.type.S.raw
    • orgD.yes.data.final.yesEui.type.S.meta
      • orgD.yes.data.final.yesEui.type.S.yesNo.data
      • orgD.yes.data.final.yesEui.type.S.yes.data
      • orgD.yes.data.final.yesEui.type.S.no.data
      • orgD.yes.data.final.yesEui.type.S.tbd.data
        => send to linguists, then add all dPairs tagged "yes" to suffixD
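
Before sending the suffixD tbd file to linguists, step 81 in the summary separates dPairs already seen last year (old) from genuinely new ones. The real procedure uses Subset1Way.java; the following is only a sketch of the same idea with sort and comm, and split_old_new is a hypothetical helper name. It assumes "old" means the line also appears in the previous year's tbd file.

```shell
# Sketch of the old/new split for the suffixD tbd file.
# Assumption: old = lines also present in last year's tbd file,
#             new = lines that are not.
split_old_new() {
    prev_tbd=$1   # previous year's ...type.S.tbd.data
    cur_tbd=$2    # this year's ...type.S.tbd.data
    tmp_prev=$(mktemp) || return 1
    tmp_cur=$(mktemp) || { rm -f "$tmp_prev"; return 1; }
    # comm requires both inputs to be sorted the same way.
    LC_ALL=C sort -u "$prev_tbd" > "$tmp_prev"
    LC_ALL=C sort -u "$cur_tbd" > "$tmp_cur"
    # -12 keeps lines common to both files; -13 keeps lines found only
    # in the second (current) file.
    LC_ALL=C comm -12 "$tmp_prev" "$tmp_cur" > "$cur_tbd.old"
    LC_ALL=C comm -13 "$tmp_prev" "$tmp_cur" > "$cur_tbd.new"
    rm -f "$tmp_prev" "$tmp_cur"
    if [ -s "$cur_tbd.new" ]; then
        echo "$(wc -l < "$cur_tbd.new" | tr -d ' ') new tbd dPair(s): send to linguists" >&2
    fi
}
```

Only the .new file needs to go to linguists; the .old entries should already have tags from previous years.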

    7) Add tag to zeroD: orgD.yes.data.final.yesEui.type.Z
    Generates:

    • orgD.yes.data.final.yesEui.type.Z.raw
    • orgD.yes.data.final.yesEui.type.Z.meta
      • orgD.yes.data.final.yesEui.type.Z.yes.data
      • orgD.yes.data.final.yesEui.type.Z.no.data
      • orgD.yes.data.final.yesEui.type.Z.tbd.data
        should be 0 because all zeroZ.raw are generated automatically.
        Some of these might be acronyms or abbreviations, which are not legal in zeroD.
        => send to linguists, then add all dPairs tagged "yes" to zeroD
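
The check that a tbd file is empty, and the touch of the matching yes.${YEAR} file described in the summary, can be sketched as one helper. ensure_yes_file is a hypothetical name, and the sketch does not know about known exceptions from past years (such as the one prefixD entry noted for 2015), so those still need manual handling.

```shell
# Sketch: verify that a tbd file is empty and make sure the matching
# yes.${YEAR} file exists so that step 9 can read it.
ensure_yes_file() {
    tbd_file=$1   # e.g. orgD.yes.data.final.yesEui.type.Z.tbd.data
    year=$2
    yes_file="$tbd_file.yes.$year"
    if [ -s "$tbd_file" ]; then
        # Non-empty: these dPairs still need a yes|no tag from linguists.
        echo "$tbd_file is not empty: send it to linguists to tag" >&2
        return 1
    fi
    # Empty tbd file: nothing to add, so an empty yes file is enough.
    # touch creates the file if missing and keeps existing content.
    touch "$yes_file"
}
```

Possible use: ensure_yes_file orgD.yes.data.final.yesEui.type.Z.tbd.data ${YEAR}. A non-zero exit status signals that tagging is still outstanding.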

Ideally, all orgD should be generated automatically by our new derivation generation processes by adding more prefixes (for prefixD) and more SD candidate rules (for suffixD). No new zeroD should be found because our system should cover all possible zeroD (please note that acronyms and abbreviations can't be zeroD). In the 2014 release, we manually verified and added orgD to the derivational table. Please see the 2014 reports on orgD for details.

Please refer to derivation design documents in Lexical Tools for details.