diff --git a/README.md b/README.md
index 2136ff0..d22598f 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,38 @@
-Code for the EMNLP paper, "[Bootstrapping Transliteration with Guided Discovery for Low-Resource Languages](http://shyamupa.com/papers/UKR18.pdf)".
+### Using Trained Models for Generating Transliterations

-[[https://github.com/shyamupa/hma-translit/blob/master/image.pdf|alt=model figure]]
+Download and untar the relevant trained model.
+Right now, models for [bengali](http://bilbo.cs.illinois.edu/~upadhya3/bengali.tar.gz), [kannada](http://bilbo.cs.illinois.edu/~upadhya3/kannada.tar.gz), and [hindi](http://bilbo.cs.illinois.edu/~upadhya3/hindi.tar.gz), trained on the NEWS2015 datasets, are available.

-Tested using pytorch version '0.3.1.post2' with python3.
+Each tarball contains the vocab files and the pytorch model.

-## Running the code
+#### Interactive Mode
+To run in interactive mode,
+
+```bash
+./load_and_test_model_interactive.sh hindi_data.vocab hindi.model
+```
+
+#### Get Predictions for Test Input
+1. First prepare a test file (let's call it `hindi.test`) such that each line contains a sequence of space-separated characters of each input token,
+
+```
+आ च र े क र
+आ च व ल
+```
+
+2. Then run the trained model on it using the following command,
+```bash
+./load_and_test_model_on_files.sh hindi_data.vocab hindi.model hindi.test hindi.test.out
+```
+This will generate output in the output file (`hindi.test.out`) as follows,
+
+```
+आ च र े क र	a c h a r e k a r;a c h a b e k a r;a a c h a r e k a r	-0.6695770507547368;-2.079195646460341;-2.465612842870943
+```
+
+where the 2nd column is the ';'-delimited output of the beam search (using a `beam_width` of 3) and the 3rd column contains the corresponding ';'-delimited scores for each item.
+That is, the model score for `a c h a r e k a r` was `-0.6695770507547368`.
+
+### Training Your Own Model

 1. First compile the C code for the aligner.
 ```bash
@@ -20,18 +48,28 @@ x1 x2 x3	y1 y2 y3 y4 y5
 where `x1x2x3` is the input word (`xi` is the character), and `y1y2y3y4y5` is the desired output (transliteration).
 Example train and test files for bengali are in data/ folder. There is a optional 3rd column marking whether the word is *native* or *foreign* (see the paper for these terms); this column can be ignored for most purposes.

-3. Run `train_model_on_files.sh` on your train (say `train.txt`) and dev file (say `dev.txt`) as follows,
+3. Create the vocab files and aligned data using `prepare_data.sh`,
+```bash
+./prepare_data.sh hindi_train.txt hindi_dev.txt 100 hindi_data.vocab hindi_data.aligned
 ```
-./train_model_on_files.sh train.txt dev.txt 100 translit.model
-```
-where 100 is the random seed and translit.model is the output model. Other parameters(see `utils/arguments.py` for options) can be specified by modifying the `train_model_on_files.sh` script appropriately.
+This will create two vocab files, `hindi_data.vocab.envoc` and `hindi_data.vocab.frvoc`, and a file `hindi_data.aligned` containing the (monotonically) aligned training data.
+
-4. Test the trained model as follows,
+4. Run `train_model_on_files.sh` with the vocab files, the aligned data, and your dev file (say `hindi_dev.txt`) as follows,
+```bash
+./train_model_on_files.sh hindi_data.vocab hindi_data.aligned hindi_dev.txt 100 hindi.model
 ```
-./load_and_test_model_on_files.sh train.txt test.txt translit.model 100 output.txt
+
+where 100 is the random seed and hindi.model is the output model.
+Other parameters like embedding size and hidden size (see `utils/arguments.py` for all options) can be specified by modifying the `train_model_on_files.sh` script appropriately.
+
+5. Test the trained model as follows,
+
+```bash
+./load_and_test_model_on_files.sh hindi_data.vocab hindi.model hindi_test.txt output.txt
 ```
 The output should report relevant metrics,
@@ -59,8 +97,12 @@ The output should report relevant metrics,
 There is also a interactive mode where one can input test words directly,

+```bash
+./load_and_test_model_interactive.sh hindi_data.vocab hindi.model
+```
+
+You will see a prompt to enter surface forms in the source writing script (see below).
 ```
-./load_and_test_model_interactive.sh train.txt translit.model 100
 ...
 ...
 :INFO: => loading checkpoint hindi.model
@@ -70,13 +112,3 @@
 enter surface:ओबामा
 [(-0.4624647759074629, 'o b a m a')]
 ```
-### Citation
-
-```
-@InProceedings{UKR18,
-  author = {Upadhyay, Shyam and Kodner, Jordan and Roth, Dan},
-  title = {Bootstrapping Transliteration with Guided Discovery for Low-Resource Languages},
-  booktitle = {EMNLP},
-  year = {2018},
-}
-```
diff --git a/create_data.sh b/create_data.sh
new file mode 100755
index 0000000..978c701
--- /dev/null
+++ b/create_data.sh
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+ME=`basename $0`  # for usage message
+
+if [[ "$#" -ne 5 ]]; then  # number of args
+    echo "USAGE: ${ME} <ftrain> <ftest> <seed> <vocabfile> <aligned_file>"
+    exit
+fi
+ftrain=$1
+ftest=$2
+seed=$3
+vocabfile=$4
+aligned_file=$5
+
+time python -m seq2seq.prepare_data \
+     --ftrain ${ftrain} \
+     --ftest ${ftest} \
+     --vocabfile ${vocabfile} \
+     --aligned_file ${aligned_file} \
+     --seed ${seed}
+
+
+
+
+
+if [[ $? == 0 ]]  # success
+then
+    :  # do nothing
+else  # something went wrong
+    echo "SOME PROBLEM OCCURRED";  # report failure
+fi
diff --git a/load_and_test_model_interactive.sh b/load_and_test_model_interactive.sh
index 99f5868..6ddd4a7 100755
--- a/load_and_test_model_interactive.sh
+++ b/load_and_test_model_interactive.sh
@@ -1,21 +1,19 @@
 #!/usr/bin/env bash
 ME=`basename $0`  # for usage message
-if [ "$#" -ne 3 ]; then  # number of args
-    echo "USAGE: ${ME} <ftrain> <model> <seed>"
+if [[ "$#" -ne 2 ]]; then  # number of args
+    echo "USAGE: ${ME} <vocabfile> <model>"
     echo
     exit
 fi
-ftrain=$1
+vocabfile=$1
 model=$2
-seed=$3
-time python -m seq2seq.main \
-     --ftrain ${ftrain} \
+time python -m seq2seq.predict \
+     --vocabfile ${vocabfile} \
     --mono \
     --beam_width 1 \
    --restore ${model} \
-    --interactive \
-    --seed ${seed}
+    --interactive

 if [[ $? == 0 ]]  # success
 then
diff --git a/load_and_test_model_on_files.sh b/load_and_test_model_on_files.sh
index 713e5ac..eedf37a 100755
--- a/load_and_test_model_on_files.sh
+++ b/load_and_test_model_on_files.sh
@@ -1,24 +1,22 @@
 #!/usr/bin/env bash
 ME=`basename $0`  # for usage message
-if [ "$#" -ne 5 ]; then  # number of args
-    echo "USAGE: <ftrain> <ftest> <model> <seed> <outfile>"
+if [[ "$#" -ne 4 ]]; then  # number of args
+    echo "USAGE: <vocabfile> <model> <ftest> <outfile>"
     echo "$ME"
     exit
 fi
-ftrain=$1
-ftest=$2
-model=$3
-seed=$4
-out=$5
-time python -m seq2seq.main \
-     --ftrain ${ftrain} \
+vocabfile=$1
+model=$2
+ftest=$3
+outfile=$4
+time python -m seq2seq.predict \
+     --vocabfile ${vocabfile} \
     --ftest ${ftest} \
     --mono \
     --beam_width 1 \
    --restore ${model} \
-    --seed ${seed} \
-    --dump ${out}
+    --dump ${outfile}
diff --git a/readers/aligned_reader.py b/readers/aligned_reader.py
index 46ab9ab..2ef9599 100644
--- a/readers/aligned_reader.py
+++ b/readers/aligned_reader.py
@@ -1,20 +1,15 @@
 from __future__ import division
 from __future__ import print_function
-import sys
 import logging
+import random

-from seq2seq.lang import Lang
-from seq2seq.constants import ALIGN_SYMBOL
 from baseline import align_utils
-
-import random
-from collections import Counter
-# from seq2seq.main import oracle_action
+from seq2seq.constants import ALIGN_SYMBOL
 from seq2seq.constants import STEP
+
 # logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
-import argparse


 def safe_replace_spaces(s):
@@ -24,6 +19,29 @@ def safe_replace_spaces(s):
     return s


+def subsample_examples(examples, frac, single_token):
+    new_examples = []
+    for ex in examples:
+        fr, en, weight, is_eng = ex
+        frtokens, entokens = fr.split(" "), en.split(" ")
+        if len(frtokens) != len(entokens): continue
+        if single_token:
+            if len(frtokens) > 1 or len(entokens) > 1: continue
+        for frtok, entok in zip(frtokens, entokens):
+            new_examples.append((frtok, entok, weight, is_eng))
+    examples = new_examples
+    logging.info("new examples %d", len(examples))
+    # subsample if needed
+    random.shuffle(examples)
+    if frac < 1.0:
+        tmp = examples[0:int(frac * len(examples))]
+        examples = tmp
+    elif frac > 1.0:
+        tmp = examples[0:int(frac)]
+        examples = tmp
+    return examples
+
+
 def read_examples(fpath, native_or_eng="both", remove_spaces=False, weight=1.0):
     examples = []
     bad = 0
diff --git a/seq2seq/main.py b/seq2seq/main.py
index 5732fae..92d1f70 100644
--- a/seq2seq/main.py
+++ b/seq2seq/main.py
@@ -1,67 +1,27 @@
-import random
 import logging
+import random
 import sys

+import numpy as np
 import torch
 import torch.nn as nn
-import numpy as np

-from utils.arguments import PARSER
-from readers.aligned_reader import load_aligned_data, read_examples
-from seq2seq.constants import STEP
+from readers.aligned_reader import read_examples
 from seq2seq.evaluators.reporter import AccReporter, get_decoded_words
-from seq2seq.lang import Lang
+from seq2seq.model_utils import load_checkpoint, model_builder, setup_optimizers
+from seq2seq.prepare_data import langcodes, load_vocab_and_examples
 from seq2seq.runner import run
 from seq2seq.trainers.monotonic_train import MonotonicTrainer
-from seq2seq.model_utils import load_checkpoint, model_builder, setup_optimizers
+from utils.arguments import PARSER

 # logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 logging.basicConfig(format=':%(levelname)s: %(message)s', level=logging.INFO)

-
-def subsample_examples(examples, frac, single_token):
-    new_examples = []
-    for ex in examples:
-        fr, en, weight, is_eng = ex
-        frtokens, entokens = fr.split(" "), en.split(" ")
-        if len(frtokens) != len(entokens): continue
-        if single_token:
-            if len(frtokens) > 1 or len(entokens) > 1: continue
-        for frtok, entok in zip(frtokens, entokens):
-            new_examples.append((frtok, entok, weight, is_eng))
-    examples = new_examples
-    logging.info("new examples %d", len(examples))
-    # subsample if needed
-    random.shuffle(examples)
-    if frac < 1.0:
-        tmp = examples[0:int(frac * len(examples))]
-        examples = tmp
-    elif frac > 1.0:
-        tmp = examples[0:int(frac)]
-        examples = tmp
-    return examples
-
-
-def index_vocab(examples, fr_lang, en_lang):
-    for ex in examples:
-        raw_x, raw_y, xs, ys, weight, is_eng = ex
-        fr_lang.index_words(xs)
-        en_lang.index_words(ys)
-    logging.info("train size %d", len(examples))
-
-
-langcodes = {"hi": "hindi", "fa": "farsi", "ta": "tamil", "ba": "bengali", "ka": "kannada", "he": "hebrew",
-             "th": "thai"}
-
 if __name__ == '__main__':
     args = PARSER.parse_args()
     args = vars(args)
     logging.info(args)
-    batch_first = args["batch_first"]
-    device_id = args["device_id"]
     seed = args["seed"]
-    native_or_eng = args["nat_or_eng"]
-    single_token = args["single_token"]

     remove_spaces = True
     np.random.seed(seed)
@@ -71,31 +31,15 @@ def index_vocab(examples, fr_lang, en_lang):
     lang = langcodes[args["lang"]]

-    trainpath = "data/%s/%s_train_annotateEN" % (lang, lang) if args["ftrain"] is None else args["ftrain"]
-    testpath = "data/%s/%s_test_annotateEN" % (lang, lang) if args["ftest"] is None else args["ftest"]
-
-    examples = read_examples(fpath=trainpath,
-                             native_or_eng=native_or_eng,
-                             remove_spaces=remove_spaces)
+    testpath = args["ftest"]

-    examples = subsample_examples(examples=examples, frac=args["frac"], single_token=single_token)
-
-    fr_lang, en_lang = Lang(name="fr"), Lang(name="en")
-    examples = load_aligned_data(examples=examples,
-                                 mode="mcmc",
-                                 seed=seed)
-    index_vocab(examples, fr_lang, en_lang)
-    en_lang.index_word(STEP)
-    fr_lang.compute_maps()
-    en_lang.compute_maps()
-    # see_phrase_alignments(examples=examples)
+    fr_lang, en_lang, examples = load_vocab_and_examples(vocabfile=args["vocabfile"], aligned_file=args["aligned_file"])

     logging.info(fr_lang.word2index)
     logging.info(en_lang.word2index)

+    # ALWAYS READ ALL TEST EXAMPLES
     test = read_examples(fpath=testpath)
-    train = read_examples(fpath=trainpath)
-    train = [ex for ex in train if ' ' not in ex[0] and ' ' not in ex[1]]
     logging.info("input vocab: %d", fr_lang.n_words)
     logging.info("output vocab: %d", en_lang.n_words)
     logging.info("beam width: %d", args["beam_width"])
@@ -113,8 +57,6 @@ def index_vocab(examples, fr_lang, en_lang):
     # Begin!
     test_reporter = AccReporter(args=args, dump_file=args["dump"])
-    train_reporter = AccReporter(args=args,
-                                 dump_file=args["dump"] + ".train.txt" if args["dump"] is not None else None)

     if args["restore"]:
         if "," in args["restore"]:
@@ -147,5 +89,4 @@ def index_vocab(examples, fr_lang, en_lang):

     run(args=args, examples=examples, trainer=trainer, evaler=evaler,
         criterion=criterion,
-        train=train, test=test,
-        train_reporter=train_reporter, test_reporter=test_reporter)
+        test=test, test_reporter=test_reporter)
diff --git a/seq2seq/model_utils.py b/seq2seq/model_utils.py
index afd454a..61ae2aa 100644
--- a/seq2seq/model_utils.py
+++ b/seq2seq/model_utils.py
@@ -31,7 +31,7 @@ def setup_optimizers(args, encoder, decoder):

 def model_builder(args, fr_lang, en_lang):
     bidi = args["bidi"]
-    device_id = args["device_id"]
+    # device_id = args["device_id"]
     batch_first = args["batch_first"]
     vector_size = args["wdim"]
     hidden_size = args["hdim"]
@@ -73,9 +73,9 @@ def model_builder(args, fr_lang, en_lang):
     logging.info(encoder)
     logging.info(decoder)
     # Move models to GPU
-    if device_id is not None:
-        encoder.cuda(device_id)
-        decoder.cuda(device_id)
+    # if device_id is not None:
+    #     encoder.cuda(device_id)
+    #     decoder.cuda(device_id)
     return encoder, decoder, evaler
diff --git a/seq2seq/predict.py b/seq2seq/predict.py
new file mode 100644
index 0000000..f0a724b
--- /dev/null
+++ b/seq2seq/predict.py
@@ -0,0 +1,62 @@
+import logging
+import sys
+
+from seq2seq.evaluators.reporter import get_decoded_words
+from seq2seq.model_utils import load_checkpoint, model_builder, setup_optimizers
+from seq2seq.prepare_data import load_vocab
+from utils.arguments import PARSER
+
+logging.basicConfig(format=':%(levelname)s: %(message)s', level=logging.INFO)
+
+if __name__ == '__main__':
+    args = PARSER.parse_args()
+    args = vars(args)
+    logging.info(args)
+
+    fr_lang, en_lang = load_vocab(vocabfile=args["vocabfile"])
+    logging.info(fr_lang.word2index)
+    logging.info(en_lang.word2index)
+
+    logging.info("input vocab: %d", fr_lang.n_words)
+    logging.info("output vocab: %d", en_lang.n_words)
+    logging.info("beam width: %d", args["beam_width"])
+
+    # Initialize models
+    encoder, decoder, evaler = model_builder(args, fr_lang=fr_lang, en_lang=en_lang)
+    enc_opt, dec_opt, enc_sch, dec_sch = setup_optimizers(args=args, encoder=encoder, decoder=decoder)
+
+    if args["restore"]:
+        load_checkpoint(encoder=encoder, decoder=decoder,
+                        enc_opt=enc_opt, dec_opt=dec_opt,
+                        ckpt_path=args["restore"])
+    if args["interactive"]:
+        try:
+            while True:
+                surface = input("enter surface:")
+                surface = " ".join(list(surface))
+                print(surface)
+                x, y, weight, is_eng = surface, None, 1.0, False
+                decoded_outputs = evaler.infer_on_example(sentence=x)
+                scores_and_words = get_decoded_words(decoded_outputs)
+                decoded_words = [w for s, w in scores_and_words]
+                scores = [s for s, w in scores_and_words]
+                print(scores_and_words)
+        except KeyboardInterrupt:
+            print('interrupted!')
+            sys.exit(0)
+    else:
+        testpath = args["ftest"]
+        with open(args["dump"], "w") as out:
+            for idx, line in enumerate(open(testpath)):
+                surface = line.strip()
+                x, y, weight, is_eng = surface, None, 1.0, False
+                if idx > 0 and idx % 200 == 0:
+                    logging.info("running infer on example %d", idx)
+                decoded_outputs = evaler.infer_on_example(sentence=x)
+                scores_and_words = get_decoded_words(decoded_outputs)
+                # decoded_words = [w for s, w in scores_and_words]
+                # scores = [s for s, w in scores_and_words]
+                beam_outputs = ";".join([word for score, word in scores_and_words])
+                beam_scores = ";".join([str(score) for score, word in scores_and_words])
+                buf = f"{x}\t{beam_outputs}\t{beam_scores}\n"
+                out.write(buf)
diff --git a/seq2seq/prepare_data.py b/seq2seq/prepare_data.py
new file mode 100644
index 0000000..19a22e8
--- /dev/null
+++ b/seq2seq/prepare_data.py
@@ -0,0 +1,87 @@
+import logging
+import pickle
+import random
+
+import numpy as np
+import torch
+
+from utils.arguments import PARSER
+from readers.aligned_reader import load_aligned_data, read_examples, subsample_examples
+from seq2seq.lang import Lang
+from seq2seq.constants import STEP
+
+
+def index_vocab(examples, fr_lang, en_lang):
+    for ex in examples:
+        raw_x, raw_y, xs, ys, weight, is_eng = ex
+        fr_lang.index_words(xs)
+        en_lang.index_words(ys)
+    logging.info("train size %d", len(examples))
+
+
+def load_vocab_and_examples(vocabfile, aligned_file):
+    with open(vocabfile + ".frvoc", 'rb') as f:
+        fr_lang = pickle.load(f)
+    with open(vocabfile + ".envoc", 'rb') as f:
+        en_lang = pickle.load(f)
+    with open(aligned_file, 'rb') as f:
+        examples = pickle.load(f)
+    return fr_lang, en_lang, examples
+
+
+def load_vocab(vocabfile):
+    with open(vocabfile + ".frvoc", 'rb') as f:
+        fr_lang = pickle.load(f)
+    with open(vocabfile + ".envoc", 'rb') as f:
+        en_lang = pickle.load(f)
+    return fr_lang, en_lang
+
+
+def save_vocab_and_examples(fr_lang, en_lang, examples, vocabfile, aligned_file):
+    with open(vocabfile + ".frvoc", 'wb') as f:
+        pickle.dump(fr_lang, file=f)
+    with open(vocabfile + ".envoc", 'wb') as f:
+        pickle.dump(en_lang, file=f)
+    with open(aligned_file, 'wb') as f:
+        pickle.dump(examples, file=f)
+
+
+langcodes = {"hi": "hindi", "fa": "farsi", "ta": "tamil", "ba": "bengali", "ka": "kannada", "he": "hebrew",
+             "th": "thai"}
+
+if __name__ == '__main__':
+    args = PARSER.parse_args()
+    args = vars(args)
+    logging.info(args)
+    # batch_first = args["batch_first"]
+    # device_id = args["device_id"]
+    seed = args["seed"]
+    native_or_eng = args["nat_or_eng"]
+    single_token = args["single_token"]
+
+    remove_spaces = True
+    np.random.seed(seed)
+    random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed(seed)
+
+    lang = langcodes[args["lang"]]
+
+    trainpath = "data/%s/%s_train_annotateEN" % (lang, lang) if args["ftrain"] is None else args["ftrain"]
+    testpath = "data/%s/%s_test_annotateEN" % (lang, lang) if args["ftest"] is None else args["ftest"]
+
+    examples = read_examples(fpath=trainpath,
+                             native_or_eng=native_or_eng,
+                             remove_spaces=remove_spaces)
+
+    examples = subsample_examples(examples=examples, frac=args["frac"], single_token=single_token)
+
+    fr_lang, en_lang = Lang(name="fr"), Lang(name="en")
+    examples = load_aligned_data(examples=examples,
+                                 mode="mcmc",
+                                 seed=seed)
+    index_vocab(examples, fr_lang, en_lang)
+    en_lang.index_word(STEP)
+    fr_lang.compute_maps()
+    en_lang.compute_maps()
+    save_vocab_and_examples(fr_lang, en_lang, examples, vocabfile=args["vocabfile"], aligned_file=args["aligned_file"])
diff --git a/seq2seq/runner.py b/seq2seq/runner.py
index ec664fc..cd591ca 100644
--- a/seq2seq/runner.py
+++ b/seq2seq/runner.py
@@ -8,7 +8,7 @@
 __author__ = 'Shyam'


-def run(args, examples, trainer, criterion, evaler, train, test, test_reporter, train_reporter):
+def run(args, examples, trainer, criterion, evaler, test, test_reporter, train=None, train_reporter=None):
     n_epochs = args["iters"]
     logging.info("training on %d examples for %d epochs", len(examples), n_epochs)
     random.shuffle(examples)
diff --git a/train_model_on_files.sh b/train_model_on_files.sh
index 98a2f8d..42d54ed 100755
--- a/train_model_on_files.sh
+++ b/train_model_on_files.sh
@@ -1,18 +1,20 @@
 #!/usr/bin/env bash
 ME=`basename $0`  # for usage message
-if [ "$#" -ne 4 ]; then  # number of args
-    echo "USAGE: ${ME} <ftrain> <ftest> <seed> <model>"
+if [[ "$#" -ne 5 ]]; then  # number of args
+    echo "USAGE: ${ME} <vocabfile> <aligned_file> <fdev> <seed> <model>"
     exit
 fi
-ftrain=$1
-ftest=$2
-seed=$3
-model=$4
+vocabfile=$1
+aligned_file=$2
+fdev=$3
+seed=$4
+model=$5
 time python -m seq2seq.main \
-     --ftrain ${ftrain} \
-     --ftest ${ftest} \
+     --vocabfile ${vocabfile} \
+     --aligned_file ${aligned_file} \
+     --ftest ${fdev} \
     --mono \
     --beam_width 1 \
     --save ${model} \
diff --git a/utils/arguments.py b/utils/arguments.py
index 088cb8c..d805df5 100644
--- a/utils/arguments.py
+++ b/utils/arguments.py
@@ -1,6 +1,6 @@
 import argparse

-PARSER = argparse.ArgumentParser(description='entity linker')
+PARSER = argparse.ArgumentParser(description='transliteration with monotonic attention')
 PARSER.add_argument('--iters', type=int, default=20, help='# train iters (default: 20)')
 PARSER.add_argument('--maxsteps', type=int, default=500000, help='# train iters (default: 5)')
 PARSER.add_argument('--batch_size', type=int, default=1, help='batch size (default: 1)')
@@ -32,8 +32,8 @@
 PARSER.add_argument('--ftest', type=str, help='test/val file')
 PARSER.add_argument('--frac', type=float, default=1.0, help='frac of train data')
 PARSER.add_argument('--dump', type=str, default=None, help='to dump test predictions')
-PARSER.add_argument('--device_id', type=int, default=None, help='gpu device')
-PARSER.add_argument('--ncands', type=int, default=20, help='ncands')
+# PARSER.add_argument('--device_id', type=int, default=None, help='gpu device')
+# PARSER.add_argument('--ncands', type=int, default=20, help='ncands')
 PARSER.add_argument('--no-bidi', dest='bidi', action='store_false', help='do not use bidirectional')
 PARSER.set_defaults(bidi=True)
 PARSER.add_argument('--no-batch-first', dest='batch_first', action='store_false', help='do not use batch first')
@@ -41,4 +41,6 @@
 PARSER.add_argument('--mono', dest='mono', action='store_true', help='use monotonic transliteration model')
 PARSER.add_argument('--interactive', action="store_true", dest="interactive")
 PARSER.add_argument('--outfile', action="store", dest="outfile")
+PARSER.add_argument('--vocabfile', action="store", dest="vocabfile")
+PARSER.add_argument('--aligned_file', action="store", dest="aligned_file")
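For reference, the block below is a minimal sketch of the end-to-end workflow implied by the README changes and the scripts in this diff. The `hindi_*` file names are illustrative placeholders, the aligner is assumed to have been compiled first (step 1 of the README), and note that the README text refers to the data-preparation step as `prepare_data.sh` while the script actually added here is `create_data.sh` (same argument order).

```bash
# Sketch of the two-stage pipeline: prepare data once, then train and predict.
# hindi_train.txt, hindi_dev.txt, hindi_test.txt are placeholder file names.

# 1. Align the training data and build the vocab pickles
#    (writes hindi_data.vocab.frvoc, hindi_data.vocab.envoc and hindi_data.aligned).
./create_data.sh hindi_train.txt hindi_dev.txt 100 hindi_data.vocab hindi_data.aligned

# 2. Train on the aligned data, evaluating on the dev file; 100 is the random seed.
./train_model_on_files.sh hindi_data.vocab hindi_data.aligned hindi_dev.txt 100 hindi.model

# 3. Batch predictions for a test file (one space-separated word per line).
./load_and_test_model_on_files.sh hindi_data.vocab hindi.model hindi_test.txt output.txt

# 4. Or query the trained model interactively.
./load_and_test_model_interactive.sh hindi_data.vocab hindi.model
```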