better README, also added download links for trained models
shyamupa committed Feb 13, 2019
1 parent f60639a commit e98f668
Showing 12 changed files with 303 additions and 133 deletions.
74 changes: 53 additions & 21 deletions README.md
@@ -1,10 +1,38 @@
Code for the EMNLP paper, "[Bootstrapping Transliteration with Guided Discovery for Low-Resource Languages](http://shyamupa.com/papers/UKR18.pdf)".
### Using Trained Models for Generating Transliterations

[Model figure](https://github.com/shyamupa/hma-translit/blob/master/image.pdf)
Download and untar the relevant trained model.
Models trained on the NEWS 2015 datasets are currently available for [Bengali](http://bilbo.cs.illinois.edu/~upadhya3/bengali.tar.gz), [Kannada](http://bilbo.cs.illinois.edu/~upadhya3/kannada.tar.gz), and [Hindi](http://bilbo.cs.illinois.edu/~upadhya3/hindi.tar.gz).

Tested with Python 3 and PyTorch version `0.3.1.post2`.
Each tarball contains the vocab files and the PyTorch model.
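
As a rough illustration (not part of the repository), one of the models can be fetched and unpacked with the Python standard library; the local archive name and extraction directory below are assumptions:

```python
import tarfile
import urllib.request

# Hindi model from the links above; swap in bengali.tar.gz or kannada.tar.gz as needed.
url = "http://bilbo.cs.illinois.edu/~upadhya3/hindi.tar.gz"
archive, _ = urllib.request.urlretrieve(url, "hindi.tar.gz")

with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(".")  # unpacks the vocab files and the PyTorch model
```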

## Running the code
#### Interactive Mode
To run in interactive mode,

```bash
./load_and_test_model_interactive.sh hindi_data.vocab hindi.model
```

#### Get Predictions for Test Input
1. First prepare a test file (let's call it `hindi.test`) in which each line contains the space-separated characters of one input token (a helper sketch for producing and parsing these files appears after this list),

```
आ च र े क र
आ च व ल
```

2. Then run the trained model on it using the following command,
```bash
./load_and_test_model_on_files.sh hindi_data.vocab hindi.model hindi.test hindi.test.out
```
This will write the predictions to the output file (`hindi.test.out` above), with lines in the following format,

```
आ च र े क र a c h a r e k a r;a c h a b e k a r;a a c h a r e k a r -0.6695770507547368;-2.079195646460341;-2.465612842870943
```

where the 2nd column is the `;`-delimited output of the beam search (run here with a `beam_width` of 3) and the 3rd column contains the corresponding `;`-delimited scores for each candidate.
That is, the model score for `a c h a r e k a r` was `-0.6695770507547368`.
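
As a rough helper (not part of the repository), the test file can be built from a plain word list and the dump unpacked into (candidate, score) pairs; `hindi_words.txt` is a hypothetical input file, and the tab used to split the output columns is an assumption about the dump format:

```python
# Build hindi.test: one token per line in hindi_words.txt, characters separated by spaces.
with open("hindi_words.txt", encoding="utf-8") as words, \
        open("hindi.test", "w", encoding="utf-8") as test:
    for word in (line.strip() for line in words):
        if word:
            test.write(" ".join(word) + "\n")

# Unpack one line of hindi.test.out into the source plus (candidate, score) pairs.
def parse_output_line(line):
    source, cands, scores = line.rstrip("\n").split("\t")[:3]  # column delimiter assumed to be a tab
    return source, list(zip(cands.split(";"), map(float, scores.split(";"))))
```
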
### Training Your Own Model

1. First compile the C code for the aligner.
```bash
@@ -20,18 +48,28 @@ x1 x2 x3<tab>y1 y2 y3 y4 y5
where `x1 x2 x3` is the input word (each `xi` is a character) and `y1 y2 y3 y4 y5` is the desired output (transliteration). Example train and test files for Bengali are in the `data/` folder. There is an optional 3rd column marking whether the word is *native* or *foreign* (see the paper for these terms); this column can be ignored for most purposes. A short sketch for writing files in this format appears at the end of this section.


3. Create the vocab files and aligned data using `prepare_data.sh`,

```bash
./prepare_data.sh hindi_train.txt hindi_dev.txt 100 hindi_data.vocab hindi_data.aligned
```

This will create two vocab files `hindi_data.vocab.envoc` and `hindi_data.vocab.frvoc`, and a file `hindi_data.aligned` containing the (monotonically) aligned training data.


4. Run `train_model_on_files.sh` on the vocab and aligned files from step 3 and your dev file (say `hindi_dev.txt`) as follows,

```bash
./train_model_on_files.sh hindi_data.vocab hindi_data.aligned hindi_dev.txt 100 hindi.model
```

where 100 is the random seed and `hindi.model` is the output model.
Other parameters, such as the embedding size and hidden size (see `utils/arguments.py` for all options), can be specified by modifying the `train_model_on_files.sh` script appropriately.

5. Test the trained model as follows,

```bash
./load_and_test_model_on_files.sh hindi_data.vocab hindi.model hindi_test.txt output.txt
```

The output should report relevant metrics,
@@ -59,8 +97,12 @@ The output should report relevant metrics,

There is also an interactive mode where one can input test words directly,

```bash
./load_and_test_model_interactive.sh <vocabfile> <model>
```

You will see a prompt to enter surface forms in the source writing script (see below),
```
./load_and_test_model_interactive.sh hindi_data.vocab hindi.model
...
...
:INFO: => loading checkpoint hindi.model
@@ -70,13 +112,3 @@ enter surface:ओबामा
[(-0.4624647759074629, 'o b a m a')]
```
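
For step 2 above, here is a minimal sketch (not part of the repository) of writing training data in the `x1 x2 x3<tab>y1 y2 y3 y4 y5` format; the word pairs and file name are made up for illustration:

```python
# Write tab-separated training lines: space-separated source characters,
# a tab, then space-separated target characters (the optional native/foreign
# third column is omitted here).
pairs = [("ओबामा", "obama"), ("आचरेकर", "acharekar")]  # made-up example pairs

with open("hindi_train.txt", "w", encoding="utf-8") as f:
    for src, tgt in pairs:
        f.write(" ".join(src) + "\t" + " ".join(tgt) + "\n")
```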

### Citation

```
@InProceedings{UKR18,
author = {Upadhyay, Shyam and Kodner, Jordan and Roth, Dan},
title = {Bootstrapping Transliteration with Guided Discovery for Low-Resource Languages},
booktitle = {EMNLP},
year = {2018},
}
```
30 changes: 30 additions & 0 deletions create_data.sh
@@ -0,0 +1,30 @@
#!/usr/bin/env bash
ME=`basename $0` # for usage message

if [[ "$#" -ne 5 ]]; then # number of args
echo "USAGE: ${ME} <ftrain> <ftest> <seed> <vocabfile> <aligned_file>"
exit
fi
ftrain=$1
ftest=$2
seed=$3
vocabfile=$4
aligned_file=$5

time python -m seq2seq.prepare_data \
--ftrain ${ftrain} \
--ftest ${ftest} \
--vocabfile ${vocabfile} \
--aligned_file ${aligned_file} \
--seed ${seed}





if [[ $? == 0 ]] # success
then
: # do nothing
else # something went wrong
echo "SOME PROBLEM OCCURED"; # echo file with problems
fi
14 changes: 6 additions & 8 deletions load_and_test_model_interactive.sh
@@ -1,21 +1,19 @@
#!/usr/bin/env bash
ME=`basename $0` # for usage message

if [ "$#" -ne 3 ]; then # number of args
echo "USAGE: ${ME} <ftrain> <model> <seed>"
if [[ "$#" -ne 2 ]]; then # number of args
echo "USAGE: ${ME} <vocabfile> <model>"
echo
exit
fi
ftrain=$1
vocabfile=$1
model=$2
seed=$3
time python -m seq2seq.main \
--ftrain ${ftrain} \
time python -m seq2seq.predict \
--vocabfile ${vocabfile} \
--mono \
--beam_width 1 \
--restore ${model} \
--interactive \
--seed ${seed}
--interactive

if [[ $? == 0 ]] # success
then
20 changes: 9 additions & 11 deletions load_and_test_model_on_files.sh
@@ -1,24 +1,22 @@
#!/usr/bin/env bash
ME=`basename $0` # for usage message

if [ "$#" -ne 5 ]; then # number of args
echo "USAGE: <ftrain> <ftest> <model> <seed> <outfile>"
if [[ "$#" -ne 4 ]]; then # number of args
echo "USAGE: <vocabfile> <model> <ftest> <outfile>"
echo "$ME"
exit
fi
ftrain=$1
ftest=$2
model=$3
seed=$4
out=$5
time python -m seq2seq.main \
--ftrain ${ftrain} \
vocabfile=$1
model=$2
ftest=$3
outfile=$4
time python -m seq2seq.predict \
--vocabfile ${vocabfile} \
--ftest ${ftest} \
--mono \
--beam_width 1 \
--restore ${model} \
--seed ${seed} \
--dump ${out}
--dump ${outfile}



34 changes: 26 additions & 8 deletions readers/aligned_reader.py
@@ -1,20 +1,15 @@
from __future__ import division
from __future__ import print_function

import sys
import logging
import random

from seq2seq.lang import Lang
from seq2seq.constants import ALIGN_SYMBOL
from baseline import align_utils

import random
from collections import Counter
# from seq2seq.main import oracle_action
from seq2seq.constants import ALIGN_SYMBOL
from seq2seq.constants import STEP


# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import argparse


def safe_replace_spaces(s):
@@ -24,6 +19,29 @@ def safe_replace_spaces(s):
return s


def subsample_examples(examples, frac, single_token):
new_examples = []
for ex in examples:
fr, en, weight, is_eng = ex
frtokens, entokens = fr.split(" "), en.split(" ")
if len(frtokens) != len(entokens): continue
if single_token:
if len(frtokens) > 1 or len(entokens) > 1: continue
for frtok, entok in zip(frtokens, entokens):
new_examples.append((frtok, entok, weight, is_eng))
examples = new_examples
logging.info("new examples %d", len(examples))
# subsample if needed
random.shuffle(examples)
if frac < 1.0:
tmp = examples[0:int(frac * len(examples))]
examples = tmp
elif frac > 1.0:
tmp = examples[0:int(frac)]
examples = tmp
return examples


def read_examples(fpath, native_or_eng="both", remove_spaces=False, weight=1.0):
examples = []
bad = 0
79 changes: 10 additions & 69 deletions seq2seq/main.py
@@ -1,67 +1,27 @@
import random
import logging
import random
import sys

import numpy as np
import torch
import torch.nn as nn
import numpy as np

from utils.arguments import PARSER
from readers.aligned_reader import load_aligned_data, read_examples
from seq2seq.constants import STEP
from readers.aligned_reader import read_examples
from seq2seq.evaluators.reporter import AccReporter, get_decoded_words
from seq2seq.lang import Lang
from seq2seq.model_utils import load_checkpoint, model_builder, setup_optimizers
from seq2seq.prepare_data import langcodes, load_vocab_and_examples
from seq2seq.runner import run
from seq2seq.trainers.monotonic_train import MonotonicTrainer
from seq2seq.model_utils import load_checkpoint, model_builder, setup_optimizers
from utils.arguments import PARSER

# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.basicConfig(format=':%(levelname)s: %(message)s', level=logging.INFO)


def subsample_examples(examples, frac, single_token):
new_examples = []
for ex in examples:
fr, en, weight, is_eng = ex
frtokens, entokens = fr.split(" "), en.split(" ")
if len(frtokens) != len(entokens): continue
if single_token:
if len(frtokens) > 1 or len(entokens) > 1: continue
for frtok, entok in zip(frtokens, entokens):
new_examples.append((frtok, entok, weight, is_eng))
examples = new_examples
logging.info("new examples %d", len(examples))
# subsample if needed
random.shuffle(examples)
if frac < 1.0:
tmp = examples[0:int(frac * len(examples))]
examples = tmp
elif frac > 1.0:
tmp = examples[0:int(frac)]
examples = tmp
return examples


def index_vocab(examples, fr_lang, en_lang):
for ex in examples:
raw_x, raw_y, xs, ys, weight, is_eng = ex
fr_lang.index_words(xs)
en_lang.index_words(ys)
logging.info("train size %d", len(examples))


langcodes = {"hi": "hindi", "fa": "farsi", "ta": "tamil", "ba": "bengali", "ka": "kannada", "he": "hebrew",
"th": "thai"}

if __name__ == '__main__':
args = PARSER.parse_args()
args = vars(args)
logging.info(args)
batch_first = args["batch_first"]
device_id = args["device_id"]
seed = args["seed"]
native_or_eng = args["nat_or_eng"]
single_token = args["single_token"]

remove_spaces = True
np.random.seed(seed)
@@ -71,31 +31,15 @@ def index_vocab(examples, fr_lang, en_lang):

lang = langcodes[args["lang"]]

trainpath = "data/%s/%s_train_annotateEN" % (lang, lang) if args["ftrain"] is None else args["ftrain"]
testpath = "data/%s/%s_test_annotateEN" % (lang, lang) if args["ftest"] is None else args["ftest"]

examples = read_examples(fpath=trainpath,
native_or_eng=native_or_eng,
remove_spaces=remove_spaces)
testpath = args["ftest"]

examples = subsample_examples(examples=examples, frac=args["frac"], single_token=single_token)

fr_lang, en_lang = Lang(name="fr"), Lang(name="en")
examples = load_aligned_data(examples=examples,
mode="mcmc",
seed=seed)
index_vocab(examples, fr_lang, en_lang)
en_lang.index_word(STEP)
fr_lang.compute_maps()
en_lang.compute_maps()
# see_phrase_alignments(examples=examples)
fr_lang, en_lang, examples = load_vocab_and_examples(vocabfile=args["vocabfile"], aligned_file=args["aligned_file"])
logging.info(fr_lang.word2index)
logging.info(en_lang.word2index)

# ALWAYS READ ALL TEST EXAMPLES
test = read_examples(fpath=testpath)
train = read_examples(fpath=trainpath)

train = [ex for ex in train if ' ' not in ex[0] and ' ' not in ex[1]]
logging.info("input vocab: %d", fr_lang.n_words)
logging.info("output vocab: %d", en_lang.n_words)
logging.info("beam width: %d", args["beam_width"])
@@ -113,8 +57,6 @@ def index_vocab(examples, fr_lang, en_lang):
# Begin!
test_reporter = AccReporter(args=args,
dump_file=args["dump"])
train_reporter = AccReporter(args=args,
dump_file=args["dump"] + ".train.txt" if args["dump"] is not None else None)

if args["restore"]:
if "," in args["restore"]:
@@ -147,5 +89,4 @@ def index_vocab(examples, fr_lang, en_lang):
run(args=args,
examples=examples,
trainer=trainer, evaler=evaler, criterion=criterion,
train=train, test=test,
train_reporter=train_reporter, test_reporter=test_reporter)
test=test,test_reporter=test_reporter)
8 changes: 4 additions & 4 deletions seq2seq/model_utils.py
@@ -31,7 +31,7 @@ def setup_optimizers(args, encoder, decoder):

def model_builder(args, fr_lang, en_lang):
bidi = args["bidi"]
device_id = args["device_id"]
# device_id = args["device_id"]
batch_first = args["batch_first"]
vector_size = args["wdim"]
hidden_size = args["hdim"]
@@ -73,9 +73,9 @@ def model_builder(args, fr_lang, en_lang):
logging.info(encoder)
logging.info(decoder)
# Move models to GPU
if device_id is not None:
encoder.cuda(device_id)
decoder.cuda(device_id)
# if device_id is not None:
# encoder.cuda(device_id)
# decoder.cuda(device_id)
return encoder, decoder, evaler

