
  -----------------------------------------------------
  TRAINING Procedure for FreeLing 3.0 - NER/NEC modules
  -----------------------------------------------------

  This directory contains all necessary scripts and data to train
  FreeLing NER and NEC modules.

  The provided scripts expect specific file names and locations;
  respect them, or the scripts will not work.

  Note that some scripts are language dependent. You may need to adapt them
  to the particulars of your language.

  utilities/nerc/

         common/nec/ Used during training to emulate /usr/local/share/FreeLing/common/nec,
                     contains generic data files (e.g. cities.dat)

	 XX/         (where XX is a language code: "es","en"...)
	 	     Used during training to emulate /usr/local/share/FreeLing/XX/,
                     contains language specific data files (e.g. tagset.dat)

	 XX/nerc/ner (where XX is a language code: "es","en"...)
                     Used during training to emulate /usr/local/share/FreeLing/XX/nerc/ner,
                     contains language specific data files for
                     NER (basically feature extraction rules)

	 XX/nerc/nec (where XX is a language code: "es","en"...)
                     Used during training to emulate /usr/local/share/FreeLing/XX/nerc/nec,
                     contains language specific data files for
                     NEC (basically feature extraction rules)

	 XX/nerc/data (where XX is a language code: "es","en"...)
                     Used during training to emulate /usr/local/share/FreeLing/XX/nerc/data,
                     contains language specific data files for both NER and NEC
                     (gazetteers, trigger word lists, ...)

         corpus/     Some corpus format samples, and format-converting scripts.

             bin/    Format converting scripts

                prepare-corpus.sh  Assuming you have a corpus in "full" format (see below),
                                   converts it to the formats needed to train and test NER
                                   and NEC. It also extracts gazetteers from the training corpus.
                                   This script calls the following:

                extract-gaz.sh     Extract a gazetteer from the train corpus
                full2nec.awk       Convert to train/test corpus for NEC
                full2ner.awk       Convert to train/test corpus for NER
                nec2gold.awk       Format test as a gold standard for NEC evaluation
                ner2gold.awk       Format test as a gold standard for NER evaluation

                nec2full.awk       Usually, you don't have a corpus in "full" format (see below).
                                   This script converts a corpus in "nec" format (usually easier
                                   to get, see below for details) into "full" format, calling FreeLing
                                   to obtain all possible PoS tags for each word.
                                   **WARNING** This script may need some adjusting to take care
                                   of different tokenization criteria between your corpus and
                                   FreeLing output.

	 ner/	     Directory where NER experiments are performed

	      bin/         Scripts for feature extraction/train/test/evaluation

	      corpus/      train/test corpus (generated by "prepare-corpus.sh")

	      trained-XX/  (where XX is a language code) Directory where learned models are left.
	         *.lex      Feature lexicon files
	         adaboost/  AdaBoost models
                 svm/       SVM models

	      results-XX/  (where XX is a language code) Directory where test results and 
                           statistics are left.
	         adaboost/  Results for AdaBoost models
                 svm/       Results for SVM models


	 nec/	     Directory where NEC experiments are performed

	      bin/         Scripts for feature extraction/train/test/evaluation

	      corpus/      train/test corpus

	      trained-XX/  (where XX is a language code) Directory where learned models are left.
	         *.lex      Feature lexicon files
	         adaboost/  AdaBoost models
                 svm/       SVM models

	      results-XX/  (where XX is a language code) Directory where test results and 
                           statistics are left.
	         adaboost/  Results for AdaBoost models
                 svm/       Results for SVM models



      ner/bin and nec/bin directories contain several scripts to perform
      feature extraction, corpus encoding, model training, and testing.
      
      Usage details are described below.
      A brief summary of the main scripts and related programs follows:

      encode-corpus.sh  -> Encodes a corpus using a given file of feature
                           extraction rules (.rgf).
                           Calls the program lexicon.cc, which does the
                           actual extraction and encoding. It also
                           generates different filtered feature lexicons.

      ner-adaboost.sh
      nec-adaboost.sh   -> Perform experiments on NER/NEC using AdaBoost.
                           These scripts call train-adaboost.cc to learn
                           the models (NER also calls train-viterbi.cc),
                           then they call test-adaboost.cc to apply the
                           learned model to the test corpus, and finally
                           they compute statistics on the output.

      ner-svm.sh
      nec-svm.sh        -> Perform experiments on NER/NEC using SVM.
                           These scripts call svm-train.c (included in
                           libsvm) to learn the models (NER also calls
                           train-viterbi.cc), then they call test-svm.cc
                           to apply the learned model to the test corpus,
                           and finally they compute statistics on the output.


  ** Procedure to follow and intermediate steps and results **

     To train a model, follow these steps:

  --> 0) Set up your installation.
        * Install FreeLing-3.0
        * Compile the extraction/train/test programs (you may need to adjust
          some paths in the Makefiles)
            cd utilities/nerc/ner/bin; make
            cd utilities/nerc/nec/bin; make

  --> 1) Format your train/test corpus appropriately

       This is the first step, and only has to be performed once.
       Once you have your analyzed corpus and gold standards, you
       can keep them safe and never do this step again.

       1.a) Create analyzed train/test corpus  

        1.a.1: FULL CORPUS FORMAT

        The corpus you need must include all PoS tags and lemmas for each word, as well
        as the Named Entities marked and classified.
        If you don't have all this in your corpus, don't despair: see section 1.b below.

        The "full" format for train and test corpora is as follows:

         - One word per line, one empty line between sentences.
         - Each word line has the following fields:
	     * Form
             * BIO tag, with class: It may be B-class, I-class, or O, 
               where "class" may be any class that the system is expected
               to learn. Usual classes are PER, ORG, LOC, and MISC (thus, usually 
               possible NERC-tags are B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, 
               B-MISC, I-MISC, and O)
               If you want to train NER but not NEC, you can omit the class
               and use just B, I, O as tags.
             * Lemma
             * PoS tag (ideally FreeLing compatible, since those are the tags the classifier
               will face when tested, so it needs to learn from the same codes).
             * A separator field "#"
             * A list of lemma-tag-probability triples for all possible word readings.
               Probabilities are not used, but should be present and set to "-1".


	 Example of "full" corpus format:

            EU B-ORG eu NP00O00 # 
            rejects O reject VBZ # reject VBZ -1
            German B-MISC german NP00V00 # german JJ -1
            call O call VB # call JJ -1 call VB -1 call VBP -1
            to O to TO # to IN -1 to TO -1
            boycott O boycott VB # boycott NN -1 boycott VB -1 boycott VBP -1
            British B-MISC british NP00V00 # british JJ -1
            lamb O lamb NN # lamb NN -1 lamb VB -1 lamb VBP -1
            . O . Fp # . Fp -1
        
            The O the DT # the DT -1
            European B-ORG european_commission NP00O00 # european JJ -1
            Commission I-ORG european_commission NP00O00 # commission NN -1 commission VB -1 commission VBP -1
            said O say VBD # say VBD -1 say VBN -1
            on O on IN # on IN -1 on RB -1 on RP -1
            Thursday O [J:??/??/??:??.??:??] W # [J:??/??/??:??.??:??] W -1
            it O it PRP # it PRP -1
            disagreed O disagree VBD # disagree VBD -1 disagree VBN -1
            with O with IN # with IN -1 with RP -1
            German B-MISC german NP00V00 # german JJ -1
            advice O advice NN # advice NN -1
            to O to TO # to IN -1 to TO -1
            consumers O consumer NNS # consumer NNS -1
            to O to TO # to IN -1 to TO -1
            shun O shun VB # shun VB -1 shun VBP -1
            British B-MISC british NP00V00 # british JJ -1
            lamb O lamb NN # lamb NN -1 lamb VB -1 lamb VBP -1
            . O . Fp # . Fp -1


	  Given this corpus, the "prepare-corpus.sh" script will extract data sets 
          formatted both for NER and NEC.
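
          As an illustration of the "full" format, the following Python sketch
          splits a corpus into sentences and fields. It is not part of the
          FreeLing scripts: the names FullToken and parse_full are hypothetical,
          and it assumes fields are separated by whitespace as in the example above.

```python
from typing import List, NamedTuple, Tuple


class FullToken(NamedTuple):
    form: str
    bio: str
    lemma: str
    pos: str
    readings: List[Tuple[str, ...]]  # (lemma, tag, probability) triples


def parse_full(text: str) -> List[List[FullToken]]:
    """Split a "full" format corpus into sentences of FullToken."""
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) < 5 or fields[4] != "#":
            raise ValueError("expected '#' as fifth field: " + line)
        rest = fields[5:]
        if len(rest) % 3 != 0:
            raise ValueError("readings must come in triples: " + line)
        # Group the remaining fields three by three.
        readings = [tuple(rest[i:i + 3]) for i in range(0, len(rest), 3)]
        current.append(FullToken(fields[0], fields[1], fields[2], fields[3], readings))
    if current:
        sentences.append(current)
    return sentences
```

          A check like this can be useful before running "prepare-corpus.sh",
          since a malformed line is reported together with its content.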


        1.a.2: NER corpus format

          The "prepare-corpus.sh" script will call "full2ner.awk" to extract the data
          in the format expected by the NER train/test scripts.

          The NER format consists of:
	      
           - One word per line, one empty line between sentences.
           - Each word line has the following fields:
	      * BIO tag, without class. I.e. "B" for entity beginning words, 
                "I" for internal entity words, and "O" for words outside any entity.
              * form
              * A list of lemma-tag-probability triples for all possible word readings.
                Probabilities are not used, but should be present and set to "-1".
                The list of triples should be empty if the word has no possible
                reading other than Named Entity.

	 Example of "ner" corpus format:

          B EU
          O rejects reject VBZ -1
          B German german JJ -1
          O call call JJ -1 call VB -1 call VBP -1
          O to to IN -1 to TO -1
          O boycott boycott NN -1 boycott VB -1 boycott VBP -1
          B British british JJ -1
          O lamb lamb NN -1 lamb VB -1 lamb VBP -1
          O . . Fp -1

          O The the DT -1
          B European european JJ -1
          I Commission commission NN -1 commission VB -1 commission VBP -1
          O said say VBD -1 say VBN -1
          O on on IN -1 on RB -1 on RP -1
          O Thursday [J:??/??/??:??.??:??] W -1
          O it it PRP -1
          O disagreed disagree VBD -1 disagree VBN -1
          O with with IN -1 with RP -1
          B German german JJ -1
          O advice advice NN -1
          O to to IN -1 to TO -1
          O consumers consumer NNS -1
          O to to IN -1 to TO -1
          O shun shun VB -1 shun VBP -1
          B British british JJ -1
          O lamb lamb NN -1 lamb VB -1 lamb VBP -1
          O . . Fp -1
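
          The relation between the "full" and "ner" formats can be sketched in
          Python as follows. This is only an illustration of the mapping seen in
          the examples, not the actual "full2ner.awk" script; the function name
          full_to_ner is hypothetical.

```python
def full_to_ner(line: str) -> str:
    """Convert one "full" format line into a "ner" format line (sketch)."""
    fields = line.split()
    if not fields:
        return ""  # keep empty lines, they mark sentence boundaries
    form, bio = fields[0], fields[1]
    # Fields 3-5 are lemma, PoS tag and the "#" separator; readings follow.
    readings = fields[5:]
    tag = bio.split("-")[0]  # "B-ORG" -> "B", "O" -> "O"
    return " ".join([tag, form] + readings)
```

          For instance, full_to_ner("rejects O reject VBZ # reject VBZ -1")
          yields "O rejects reject VBZ -1".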


        1.a.3: NEC corpus format

          The "prepare-corpus.sh" script will call "full2nec.awk" to extract the data
          in the format expected by the NEC train/test scripts.

          The NEC format consists of:

           - One word per line, one empty line between sentences.
           - Each word line has the following fields:
	      * form
	      * lemma
              * PoS tag, encoding the NE class in positions 5-6 of "NP" labels.
                E.g.:  NP00SP0  -> Person
                       NP00G00  -> Geographical location
                       NP00O00  -> Organization
                       NP00V00  -> Others

         Example of "nec" corpus format:

             EU eu NP00O00
             rejects reject VBZ
             German german NP00V00
             call call VB
             to to TO
             boycott boycott VB
             British british NP00V00
             lamb lamb NN
	     . . Fp

             The the DT
             European_Commission european_commission NP00O00
             said say VBD
             on on IN
             Thursday [J:??/??/??:??.??:??] W
             it it PRP
             disagreed disagree VBD
             with with IN
             German german NP00V00
             advice advice NN
             to to TO
             consumers consumer NNS
             to to TO
             shun shun VB
             British british NP00V00
             lamb lamb NN
             . . Fp
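
          The following Python sketch illustrates how the "nec" format relates to
          the "full" format, including the joining of multiword entities with "_"
          seen in the example (European_Commission). It is not the actual
          "full2nec.awk" script; the function name full_to_nec is hypothetical.

```python
from typing import List


def full_to_nec(lines: List[str]) -> List[str]:
    """Convert "full" format lines into "nec" format lines (sketch)."""
    out = []
    in_entity = False  # True while the previous token belongs to an entity
    for line in lines:
        fields = line.split()
        if not fields:
            out.append("")  # sentence boundary
            in_entity = False
            continue
        form, bio, lemma, pos = fields[:4]
        if bio.startswith("I") and in_entity:
            # Continuation of an entity: glue the form to the previous one.
            prev = out[-1].split()
            prev[0] = prev[0] + "_" + form
            out[-1] = " ".join(prev)
        else:
            out.append(" ".join([form, lemma, pos]))
            in_entity = bio != "O"
    return out
```

          Note that the sketch relies on the "full" corpus already carrying the
          joined lemma (european_commission) on every word of the entity.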
      

        1.a.4: GOLD corpus format

          The "prepare-corpus.sh" script will also generate the corpora "XX.nec.gold" and "XX.ner.gold",
          used by evaluation scripts. These corpora contain only the word and the expected answer
          for NEC and NER respectively, with no extra fields.
	 

       1.b) HOW DO I BUILD ALL THESE CORPORA ?

          Ideally, you have an already annotated corpus which you need to reformat, and maybe map
          to the FreeLing tagset.

          If you have a corpus in "NEC" format (or in a similar format that you can convert to it),
          the script "nec2full.awk" will call FreeLing to obtain the missing information (all possible
          tags and lemmas for each word) and produce a corpus in "full" format.

          If you don't have an annotated corpus, you can run FreeLing on plain text using option
          "--outf train". This will produce something very similar to "full" format, in which you will
          need to hand-correct the annotations (right lemma/PoS, right class for named entities,
          i.e. label NPxxxxx). With that corpus in "full" format, you'll be able to run "prepare-corpus.sh".
         
       
  --> 2) Create/tune your gazetteers

       "prepare-corpus.sh" will also extract basic gazetteers from your train corpus.

       It will create two files for each NE class (PER,LOC,ORG,MISC).
       Files with "-c" suffix contain complete named entities.
       Files with "-p" suffix contain named entity parts.
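
       The idea of "complete" and "part" gazetteers can be sketched in Python
       as follows. This is not the actual "extract-gaz.sh" script. The name
       extract_gazetteers is hypothetical, and the sketch assumes that "parts"
       are the individual words of each entity and that multiword entities are
       written with spaces; the real files may use other conventions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def extract_gazetteers(
    tokens: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Collect complete entities and entity parts per class from
    (form, BIO tag) pairs such as ("EU", "B-ORG")."""
    complete, parts = defaultdict(set), defaultdict(set)
    entity, cls = [], None
    # A trailing ("", "O") sentinel flushes an entity that ends the input.
    for form, bio in list(tokens) + [("", "O")]:
        if bio.startswith("I-") and entity and bio[2:] == cls:
            entity.append(form)  # continue the open entity
            continue
        if entity:  # close the open entity
            complete[cls].add(" ".join(entity))
            parts[cls].update(entity)
            entity, cls = [], None
        if bio.startswith("B-"):
            entity, cls = [form], bio[2:]
        # "O" tags and stray "I-" tags open no entity and are skipped.
    return complete, parts
```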

       Those gazetteers should be kept in the language specific directory
         utilities/nerc/XX/nerc/data

       Four sets of gazetteers will be created, with suffixes ".train" or ".test"
       and "poor1" or "rich".
       The encoding script expects to find all sets in the data directory.

       The "rich.train" gazetteers omit some entities, to avoid overfitting.
       By default, the ".train" gazetteers contain enough entities to have
       the same coverage on the training set that the complete gazetteer has
       on the test set (so the classifier is trained in conditions similar to
       testing).
       The "rich.test" gazetteers contain all entities occurring in the train corpus.

       The "poor1.train" gazetteers select entities corresponding to 1% of occurrences
       in the training corpus. Note that 1% of entities may cover around 15%-20% of
       occurrences in a corpus, since some selected entities may occur many times.
       The "poor1.test" gazetteers contain a random selection of entities to provide
       a similar coverage on test.

       You can control the percentage of selected entities by changing the parameter
       given to "extract-gaz.sh" when called from "prepare-corpus.sh".

       You can modify or enlarge those gazetteers by any means, but
       note that using very large gazetteers does not help the learner:
       if all NEs are found in the gazetteer when training, the
       model will just learn to check the gazetteer. Then, when
       applied to new text where not all entities are found,
       performance will drop.
       So: train your model in conditions similar to those where it will be
       used, i.e. use gazetteers that provide useful information only part
       of the time.
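
       To check how much of a corpus a gazetteer covers, a measure like the
       following can be used. This is a minimal Python sketch, not part of the
       provided scripts; the function name coverage is hypothetical, and
       "coverage" is taken here to mean the fraction of entity occurrences
       found in the gazetteer.

```python
from collections import Counter
from typing import Iterable, Set


def coverage(occurrences: Iterable[str], gazetteer: Set[str]) -> float:
    """Fraction of entity occurrences that are found in the gazetteer."""
    counts = Counter(occurrences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for entity, n in counts.items() if entity in gazetteer)
    return covered / total
```

       With occurrences ["EU"]*6 + ["Madrid"]*3 + ["Acme"] and a gazetteer
       holding only "EU", one entity out of three covers 6 of the 10
       occurrences, which shows why a small gazetteer may cover a much
       larger share of the corpus.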

  --> 3) Define a set of feature extraction rules (.rgf)

       You can define your own rules, or use/adapt any of the existing sets:

        utilities/nerc/XX/nerc/ner/*.rgf
        utilities/nerc/XX/nerc/nec/*.rgf

       If the feature extraction rules refer to external files (such as
       gazetteers or word lists), use relative paths (to access, e.g.
       ../../common/cities.dat, ../data/gazPER-c.dat, etc.)

       If you create or modify feature sets, *never* overwrite the
       previous version. Save the modifications under a new name
       (e.g. ner-f88.rgf).  It is *very* likely that your modifications will
       produce worse results and you will want to go back to the previous
       rule set.
 
  --> 4) Encode your corpus as feature vectors.
 
       Extracting features is a costly procedure. You want to avoid 
       doing it at each experiment. You only have to repeat it
       when you are testing a new feature set.

       For NER, just do:

          cd utilities/nerc/ner
          bin/encode-corpus.sh XX YYY

          where XX is the language code ("es", "en"...)
          and YYY is the rgf file code (e.g. "f42"). YYY will
          be used to build the name ner-YYY.rgf, which is
          expected to be at utilities/nerc/XX/nerc/ner.

	  This script will create the files:

          a) Your encoded train/test corpus for NER:
            utilities/nerc/ner/corpus/XX.train.YYY.enc
            utilities/nerc/ner/corpus/XX.test.YYY.enc   

          b) Feature lexicons with different filter levels
            utilities/nerc/ner/trained-XX/ner-YYY-all.lex
            utilities/nerc/ner/trained-XX/ner-YYY-1abs.lex
            utilities/nerc/ner/trained-XX/ner-YYY-2abs.lex
            utilities/nerc/ner/trained-XX/ner-YYY-3abs.lex
	    etc..

       For NEC, just do:

          cd utilities/nerc/nec
          bin/encode-corpus.sh XX YYY ZZZ

          where XX is the language code ("es", "en"...)
          and YYY is the rgf file code (e.g. "f42"). YYY will
          be used to build the name nec-YYY.rgf, which is
          expected to be at utilities/nerc/XX/nerc/nec.
          ZZZ is the gazetteer to be used to encode the corpus
          into features. It can be "rich.train", "rich.test",
          "poor1.train" or "poor1.test", depending on whether
          you want to use the rich or poor gazetteer (extracted
          as described above).
          If it ends in "train", only the train corpus will be encoded.
          If it ends in "test", only the test corpus will be encoded.
          
	  This script will create the files:

          a) Your encoded train or test corpus for NEC (depending on ZZZ):
            utilities/nerc/nec/corpus/XX.YYY.ZZZ.enc

          b) If ZZZ contains "train", feature lexicons with 
           different filter levels will be created:
            utilities/nerc/nec/trained-XX/ner-YYY-ZZZ-all.lex
            utilities/nerc/nec/trained-XX/ner-YYY-ZZZ-1abs.lex	
            utilities/nerc/nec/trained-XX/ner-YYY-ZZZ-3abs.lex
            utilities/nerc/nec/trained-XX/ner-YYY-ZZZ-5abs.lex
            etc.
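
          The effect of the ZZZ parameter on NEC encoding can be summarized
          with a small Python sketch. The function name nec_encoding_plan is
          hypothetical, and the sketch only accepts the four gazetteer codes
          listed above; gazetteers extracted with other percentages may carry
          other names.

```python
def nec_encoding_plan(zzz: str) -> dict:
    """Describe what encoding with gazetteer code `zzz` is expected to do."""
    gazetteer, _, split = zzz.partition(".")
    if gazetteer not in ("rich", "poor1") or split not in ("train", "test"):
        raise ValueError("unexpected gazetteer code: " + zzz)
    return {
        "gazetteer": gazetteer,              # which gazetteer set is used
        "corpus": split,                     # which corpus gets encoded
        "creates_lexicons": split == "train",  # lexicons only for train
    }
```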


  --> 5) Train and test your models

       The last step is to launch the script that will train a model,
       run it on the test corpus, and provide you with performance statistics.

       You may want to run different models corresponding, for instance,
       to different rgf rule sets, or using different filters for the
       feature lexicon.
       You don't need to encode the corpus for each model you want to
       try. Use the corpora encoded above: they contain all features and
       are filtered by the given lexicon.
 
        Different instances of the scripts may be run in parallel, provided 
       they have at least one different parameter.

      5.1) To train an AdaBoost NER, launch:

          cd utilities/nerc/ner
          bin/ner-adaboost.sh XX YYY ZZZZ SSS TTT

         This will build the model for language XX (es, en, etc..)
         using feature rules YYY (f86, f87, ...)
         and lexicon filter ZZZZ (1abs, 2abs, ...)
         and SSS gazetteer for training (SSS may be "rich" or "poor1")
         and TTT gazetteer for testing (TTT may be "rich" or "poor1")

         Note that this script will require the existence of the files
         built in previous steps:

	   utilities/nerc/ner/corpus/XX.YYY.ZZZZ.train.enc   
	   utilities/nerc/ner/corpus/XX.YYY.ZZZZ.test.enc   
           utilities/nerc/ner/trained-XX/ner-YYY-ZZZZ-*.lex

         The script will produce:

	  a) Model files:
           utilities/nerc/ner/trained-XX/adaboost/ner-YYY-SSS-ZZZZ.abm
           utilities/nerc/ner/trained-XX/adaboost/ner-YYY-SSS-ZZZZ.dat

          b) test results and statistics
           utilities/nerc/ner/results-XX/adaboost/ner-YYY-SSS.TTT-ZZZZ.out
           utilities/nerc/ner/results-XX/adaboost/ner-YYY-SSS.TTT-ZZZZ.stats
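
         Since instances of the script may run in parallel when they differ
         in at least one parameter, the set of experiments can be generated as
         a parameter grid. The Python sketch below only builds the command
         lines, following the argument order shown above; the function name
         adaboost_commands is hypothetical and nothing is executed.

```python
from itertools import product
from typing import List, Sequence


def adaboost_commands(
    lang: str,
    rule_sets: Sequence[str],
    filters: Sequence[str],
    train_gaz: Sequence[str],
    test_gaz: Sequence[str],
    script: str = "bin/ner-adaboost.sh",
) -> List[str]:
    """Build one command line per combination of parameters."""
    return [
        " ".join([script, lang, rules, lex_filter, train, test])
        for rules, lex_filter, train, test in product(
            rule_sets, filters, train_gaz, test_gaz
        )
    ]
```

         For example, adaboost_commands("en", ["f86", "f87"], ["1abs"],
         ["rich", "poor1"], ["rich"]) produces four distinct command lines,
         the first being "bin/ner-adaboost.sh en f86 1abs rich rich".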


      5.2) To train an SVM NER, launch:

          cd utilities/nerc/ner
          bin/ner-svm.sh XX YYY ZZZZ

          You'll need to have libsvm installed (check the path in
          the ner-svm.sh script)
          http://www.csie.ntu.edu.tw/~cjlin/libsvm/

          The requisites and results will be the same as above,
          but left in the directories:

           utilities/nerc/ner/trained-XX/svm/
           utilities/nerc/ner/results-XX/svm/


      5.3) To train an AdaBoost NEC, launch:

          cd utilities/nerc/nec
          bin/nec-adaboost.sh XX YYY ZZZZ SSS TTT

         This will build the model for language XX (es, en, etc..)
         using feature rules YYY (f86, f87, ...)
         and lexicon filter ZZZZ (1abs, 2abs, ...)
         and SSS gazetteer for training (SSS may be "rich" or "poor1")
         and TTT gazetteer for testing (TTT may be "rich" or "poor1")

          The requisites and results will be the same as above,
          but left in the directories:

           utilities/nerc/nec/trained-XX/adaboost/
           utilities/nerc/nec/results-XX/adaboost/

      5.4) To train an SVM NEC, launch:

           cd utilities/nerc/nec
           bin/nec-svm.sh XX YYY ZZZZ

          You'll need to have libsvm installed (check the path in
          the nec-svm.sh script)
          http://www.csie.ntu.edu.tw/~cjlin/libsvm/

          The requisites and results will be the same as above,
          but left in the directories:

           utilities/nerc/nec/trained-XX/svm/
           utilities/nerc/nec/results-XX/svm/



  --> 6) Move your models to production stage

   Obtained model files may be moved into FreeLing data directories and used
   to instantiate 'bioner' NE recognizers or 'nec' classifiers.

   The script bin/pack.sh (both in ner and nec directories) will select and 
   adapt the files you need.

   It is called like this:
   
      bin/pack.sh XX YYY SSS ZZZZ NN

   This will pack the model built for language XX (es, en, etc..)
   using feature rules YYY (f86, f87, ...)
   and lexicon filter ZZZZ (1abs, 2abs, ...)
   and SSS gazetteer for training ("rich" or "poor1")
   and using NN weak rules.

   The script will create a tarball named as follows (depending on whether you are in the "nec" or "ner" directory):

     nec-ab-XX-YYY-SSS-ZZZZ-NN.tgz
     ner-ab-XX-YYY-SSS-ZZZZ-NN.tgz

   Containing:

     data/   A directory containing all test gazetteers and trigger word lists.
             These data will be the same for all models, so you can copy it only
             once if you are installing more than one model.

     nec-YYY.rgf           The feature extraction rule set
     nec-ab-SSS.dat        The main configuration file for the nec classifier
     nec-YYY-SSS-ZZZZ.abm  The AdaBoost model with NN weak rules
     nec-YYY-SSS-ZZZZ.lex  The feature lexicon

     (for NER, the same filenames hold, prefixed by "ner" instead of "nec")
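
     The naming scheme above can be written down as a small Python sketch,
     which may help when checking the content of a tarball. The function name
     packed_files is hypothetical; the patterns are the ones listed above.

```python
from typing import List, Tuple


def packed_files(
    task: str, lang: str, rules: str, gaz: str, lex_filter: str, weak_rules: int
) -> Tuple[str, List[str]]:
    """Return the expected tarball name and the model files it holds."""
    if task not in ("ner", "nec"):
        raise ValueError("task must be 'ner' or 'nec'")
    tarball = f"{task}-ab-{lang}-{rules}-{gaz}-{lex_filter}-{weak_rules}.tgz"
    members = [
        f"{task}-{rules}.rgf",                     # feature extraction rules
        f"{task}-ab-{gaz}.dat",                    # main configuration file
        f"{task}-{rules}-{gaz}-{lex_filter}.abm",  # AdaBoost model
        f"{task}-{rules}-{gaz}-{lex_filter}.lex",  # feature lexicon
    ]
    return tarball, members
```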

     You'll need to move these files to the FreeLing production directories as
     follows:

     The content of the "data" directory should be moved to XX/nerc/data in the FreeLing
     share installation directory (e.g. /usr/local/share/freeling).

     All the other files should be moved to XX/nerc/nec in the FreeLing
     share installation directory (e.g. /usr/local/share/freeling),
     or to XX/nerc/ner if the models are for NER.
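
     The destination of each packed file can be computed as in the Python
     sketch below. The function name install_destination is hypothetical, and
     the default share directory is just the example path given above; adjust
     it to your installation.

```python
from pathlib import PurePosixPath


def install_destination(
    member: str, task: str, lang: str,
    share_dir: str = "/usr/local/share/freeling",
) -> PurePosixPath:
    """Map a file from the tarball to its place in the FreeLing data tree."""
    base = PurePosixPath(share_dir) / lang / "nerc"
    parts = PurePosixPath(member).parts
    if parts and parts[0] == "data":
        return base.joinpath(*parts)  # data/... goes to XX/nerc/data/...
    return base / task / member       # the rest goes to XX/nerc/ner or nec
```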

     Finally, FreeLing configuration files should be changed to load the new models.
