better README, also added download links for trained models
shyamupa committed Feb 13, 2019
1 parent f60639a commit e98f668
Showing 12 changed files with 303 additions and 133 deletions.
74 changes: 53 additions & 21 deletions README.md
@@ -1,10 +1,38 @@
Code for the EMNLP paper, "[Bootstrapping Transliteration with Guided Discovery for Low-Resource Languages](http://shyamupa.com/papers/UKR18.pdf)".
### Using Trained Models for Generating Transliterations

[Model figure](https://github.com/shyamupa/hma-translit/blob/master/image.pdf)
Download and untar the relevant trained model.
Models trained on the NEWS 2015 datasets are currently available for [Bengali](http://bilbo.cs.illinois.edu/~upadhya3/bengali.tar.gz), [Kannada](http://bilbo.cs.illinois.edu/~upadhya3/kannada.tar.gz), and [Hindi](http://bilbo.cs.illinois.edu/~upadhya3/hindi.tar.gz).

Tested with Python 3 and PyTorch version `0.3.1.post2`.
Each tarball contains the vocab files and the PyTorch model.
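
As a rough illustration (not part of the repository), one of the models can be fetched and unpacked with the Python standard library; the local archive name and extraction directory below are assumptions:

```python
import tarfile
import urllib.request

# Hindi model from the links above; swap in bengali.tar.gz or kannada.tar.gz as needed.
url = "http://bilbo.cs.illinois.edu/~upadhya3/hindi.tar.gz"
archive, _ = urllib.request.urlretrieve(url, "hindi.tar.gz")

with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(".")  # unpacks the vocab files and the PyTorch model
```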

## Running the code
#### Interactive Mode
To run in interactive mode,

```bash
./load_and_test_model_interactive.sh hindi_data.vocab hindi.model
```

#### Get Predictions for Test Input
1. First prepare a test file (let's call it `hindi.test`) in which each line contains the space-separated characters of one input token (a helper sketch for producing and parsing these files appears after this list),

```
आ च र े क र
आ च व ल
```

2. Then run the trained model on it using the following command,
```bash
./load_and_test_model_on_files.sh hindi_data.vocab hindi.model hindi.test hindi.test.out
```
This will write the predictions to the output file (`hindi.test.out` above), with lines in the following format,

```
आ च र े क र a c h a r e k a r;a c h a b e k a r;a a c h a r e k a r -0.6695770507547368;-2.079195646460341;-2.465612842870943
```

where the 2nd column is the `;`-delimited output of the beam search (run here with a `beam_width` of 3) and the 3rd column contains the corresponding `;`-delimited scores for each candidate.
That is, the model score for `a c h a r e k a r` was `-0.6695770507547368`.
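
As a rough helper (not part of the repository), the test file can be built from a plain word list and the dump unpacked into (candidate, score) pairs; `hindi_words.txt` is a hypothetical input file, and the tab used to split the output columns is an assumption about the dump format:

```python
# Build hindi.test: one token per line in hindi_words.txt, characters separated by spaces.
with open("hindi_words.txt", encoding="utf-8") as words, \
        open("hindi.test", "w", encoding="utf-8") as test:
    for word in (line.strip() for line in words):
        if word:
            test.write(" ".join(word) + "\n")

# Unpack one line of hindi.test.out into the source plus (candidate, score) pairs.
def parse_output_line(line):
    source, cands, scores = line.rstrip("\n").split("\t")[:3]  # column delimiter assumed to be a tab
    return source, list(zip(cands.split(";"), map(float, scores.split(";"))))
```
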
### Training Your Own Model

1. First compile the C code for the aligner.
```bash
@@ -20,18 +48,28 @@ x1 x2 x3<tab>y1 y2 y3 y4 y5
where `x1 x2 x3` is the input word (each `xi` is a character) and `y1 y2 y3 y4 y5` is the desired output (transliteration). Example train and test files for Bengali are in the `data/` folder. There is an optional 3rd column marking whether the word is *native* or *foreign* (see the paper for these terms); this column can be ignored for most purposes. A short sketch for writing files in this format appears at the end of this section.


3. Create the vocab files and aligned data using `prepare_data.sh`,

```bash
./prepare_data.sh hindi_train.txt hindi_dev.txt 100 hindi_data.vocab hindi_data.aligned
```

This will create two vocab files `hindi_data.vocab.envoc` and `hindi_data.vocab.frvoc`, and a file `hindi_data.aligned` containing the (monotonically) aligned training data.


4. Run `train_model_on_files.sh` on the vocab and aligned files from step 3 and your dev file (say `hindi_dev.txt`) as follows,

```bash
./train_model_on_files.sh hindi_data.vocab hindi_data.aligned hindi_dev.txt 100 hindi.model
```

where 100 is the random seed and `hindi.model` is the output model.
Other parameters, such as the embedding size and hidden size (see `utils/arguments.py` for all options), can be specified by modifying the `train_model_on_files.sh` script appropriately.

5. Test the trained model as follows,

```bash
./load_and_test_model_on_files.sh hindi_data.vocab hindi.model hindi_test.txt output.txt
```

The output should report relevant metrics,
@@ -59,8 +97,12 @@ The output should report relevant metrics,

There is also an interactive mode where one can input test words directly,

```bash
./load_and_test_model_interactive.sh <vocabfile> <model>
```

You will see a prompt to enter surface forms in the source writing script (see below),
```
./load_and_test_model_interactive.sh hindi_data.vocab hindi.model
...
...
:INFO: => loading checkpoint hindi.model
@@ -70,13 +112,3 @@ enter surface:ओबामा
[(-0.4624647759074629, 'o b a m a')]
```
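
For step 2 above, here is a minimal sketch (not part of the repository) of writing training data in the `x1 x2 x3<tab>y1 y2 y3 y4 y5` format; the word pairs and file name are made up for illustration:

```python
# Write tab-separated training lines: space-separated source characters,
# a tab, then space-separated target characters (the optional native/foreign
# third column is omitted here).
pairs = [("ओबामा", "obama"), ("आचरेकर", "acharekar")]  # made-up example pairs

with open("hindi_train.txt", "w", encoding="utf-8") as f:
    for src, tgt in pairs:
        f.write(" ".join(src) + "\t" + " ".join(tgt) + "\n")
```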

### Citation

```
@InProceedings{UKR18,
author = {Upadhyay, Shyam and Kodner, Jordan and Roth, Dan},
title = {Bootstrapping Transliteration with Guided Discovery for Low-Resource Languages},
booktitle = {EMNLP},
year = {2018},
}
```
30 changes: 30 additions & 0 deletions create_data.sh
@@ -0,0 +1,30 @@
#!/usr/bin/env bash
ME=`basename $0` # for usage message

if [[ "$#" -ne 5 ]]; then # number of args
echo "USAGE: ${ME} <ftrain> <ftest> <seed> <vocabfile> <aligned_file>"
exit
fi
ftrain=$1
ftest=$2
seed=$3
vocabfile=$4
aligned_file=$5

time python -m seq2seq.prepare_data \
--ftrain ${ftrain} \
--ftest ${ftest} \
--vocabfile ${vocabfile} \
--aligned_file ${aligned_file} \
--seed ${seed}





if [[ $? == 0 ]] # success
then
: # do nothing
else # something went wrong
echo "SOME PROBLEM OCCURED"; # echo file with problems
fi
14 changes: 6 additions & 8 deletions load_and_test_model_interactive.sh
@@ -1,21 +1,19 @@
#!/usr/bin/env bash
ME=`basename $0` # for usage message

if [ "$#" -ne 3 ]; then # number of args
echo "USAGE: ${ME} <ftrain> <model> <seed>"
if [[ "$#" -ne 2 ]]; then # number of args
echo "USAGE: ${ME} <vocabfile> <model>"
echo
exit
fi
ftrain=$1
vocabfile=$1
model=$2
seed=$3
time python -m seq2seq.main \
--ftrain ${ftrain} \
time python -m seq2seq.predict \
--vocabfile ${vocabfile} \
--mono \
--beam_width 1 \
--restore ${model} \
--interactive \
--seed ${seed}
--interactive

if [[ $? == 0 ]] # success
then
20 changes: 9 additions & 11 deletions load_and_test_model_on_files.sh
@@ -1,24 +1,22 @@
#!/usr/bin/env bash
ME=`basename $0` # for usage message

if [ "$#" -ne 5 ]; then # number of args
echo "USAGE: <ftrain> <ftest> <model> <seed> <outfile>"
if [[ "$#" -ne 4 ]]; then # number of args
echo "USAGE: <vocabfile> <model> <ftest> <outfile>"
echo "$ME"
exit
fi
ftrain=$1
ftest=$2
model=$3
seed=$4
out=$5
time python -m seq2seq.main \
--ftrain ${ftrain} \
vocabfile=$1
model=$2
ftest=$3
outfile=$4
time python -m seq2seq.predict \
--vocabfile ${vocabfile} \
--ftest ${ftest} \
--mono \
--beam_width 1 \
--restore ${model} \
--seed ${seed} \
--dump ${out}
--dump ${outfile}



34 changes: 26 additions & 8 deletions readers/aligned_reader.py
@@ -1,20 +1,15 @@
from __future__ import division
from __future__ import print_function

import sys
import logging
import random

from seq2seq.lang import Lang
from seq2seq.constants import ALIGN_SYMBOL
from baseline import align_utils

import random
from collections import Counter
# from seq2seq.main import oracle_action
from seq2seq.constants import ALIGN_SYMBOL
from seq2seq.constants import STEP


# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import argparse


def safe_replace_spaces(s):
@@ -24,6 +19,29 @@ def safe_replace_spaces(s):
return s


def subsample_examples(examples, frac, single_token):
new_examples = []
for ex in examples:
fr, en, weight, is_eng = ex
frtokens, entokens = fr.split(" "), en.split(" ")
if len(frtokens) != len(entokens): continue
if single_token:
if len(frtokens) > 1 or len(entokens) > 1: continue
for frtok, entok in zip(frtokens, entokens):
new_examples.append((frtok, entok, weight, is_eng))
examples = new_examples
logging.info("new examples %d", len(examples))
# subsample if needed
random.shuffle(examples)
if frac < 1.0:
tmp = examples[0:int(frac * len(examples))]
examples = tmp
elif frac > 1.0:
tmp = examples[0:int(frac)]
examples = tmp
return examples


def read_examples(fpath, native_or_eng="both", remove_spaces=False, weight=1.0):
examples = []
bad = 0
79 changes: 10 additions & 69 deletions seq2seq/main.py
@@ -1,67 +1,27 @@
import random
import logging
import random
import sys

import numpy as np
import torch
import torch.nn as nn
import numpy as np

from utils.arguments import PARSER
from readers.aligned_reader import load_aligned_data, read_examples
from seq2seq.constants import STEP
from readers.aligned_reader import read_examples
from seq2seq.evaluators.reporter import AccReporter, get_decoded_words
from seq2seq.lang import Lang
from seq2seq.model_utils import load_checkpoint, model_builder, setup_optimizers
from seq2seq.prepare_data import langcodes, load_vocab_and_examples
from seq2seq.runner import run
from seq2seq.trainers.monotonic_train import MonotonicTrainer
from seq2seq.model_utils import load_checkpoint, model_builder, setup_optimizers
from utils.arguments import PARSER

# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.basicConfig(format=':%(levelname)s: %(message)s', level=logging.INFO)


def subsample_examples(examples, frac, single_token):
new_examples = []
for ex in examples:
fr, en, weight, is_eng = ex
frtokens, entokens = fr.split(" "), en.split(" ")
if len(frtokens) != len(entokens): continue
if single_token:
if len(frtokens) > 1 or len(entokens) > 1: continue
for frtok, entok in zip(frtokens, entokens):
new_examples.append((frtok, entok, weight, is_eng))
examples = new_examples
logging.info("new examples %d", len(examples))
# subsample if needed
random.shuffle(examples)
if frac < 1.0:
tmp = examples[0:int(frac * len(examples))]
examples = tmp
elif frac > 1.0:
tmp = examples[0:int(frac)]
examples = tmp
return examples


def index_vocab(examples, fr_lang, en_lang):
for ex in examples:
raw_x, raw_y, xs, ys, weight, is_eng = ex
fr_lang.index_words(xs)
en_lang.index_words(ys)
logging.info("train size %d", len(examples))


langcodes = {"hi": "hindi", "fa": "farsi", "ta": "tamil", "ba": "bengali", "ka": "kannada", "he": "hebrew",
"th": "thai"}

if __name__ == '__main__':
args = PARSER.parse_args()
args = vars(args)
logging.info(args)
batch_first = args["batch_first"]
device_id = args["device_id"]
seed = args["seed"]
native_or_eng = args["nat_or_eng"]
single_token = args["single_token"]

remove_spaces = True
np.random.seed(seed)
@@ -71,31 +31,15 @@ def index_vocab(examples, fr_lang, en_lang):

lang = langcodes[args["lang"]]

trainpath = "data/%s/%s_train_annotateEN" % (lang, lang) if args["ftrain"] is None else args["ftrain"]
testpath = "data/%s/%s_test_annotateEN" % (lang, lang) if args["ftest"] is None else args["ftest"]

examples = read_examples(fpath=trainpath,
native_or_eng=native_or_eng,
remove_spaces=remove_spaces)
testpath = args["ftest"]

examples = subsample_examples(examples=examples, frac=args["frac"], single_token=single_token)

fr_lang, en_lang = Lang(name="fr"), Lang(name="en")
examples = load_aligned_data(examples=examples,
mode="mcmc",
seed=seed)
index_vocab(examples, fr_lang, en_lang)
en_lang.index_word(STEP)
fr_lang.compute_maps()
en_lang.compute_maps()
# see_phrase_alignments(examples=examples)
fr_lang, en_lang, examples = load_vocab_and_examples(vocabfile=args["vocabfile"], aligned_file=args["aligned_file"])
logging.info(fr_lang.word2index)
logging.info(en_lang.word2index)

# ALWAYS READ ALL TEST EXAMPLES
test = read_examples(fpath=testpath)
train = read_examples(fpath=trainpath)

train = [ex for ex in train if ' ' not in ex[0] and ' ' not in ex[1]]
logging.info("input vocab: %d", fr_lang.n_words)
logging.info("output vocab: %d", en_lang.n_words)
logging.info("beam width: %d", args["beam_width"])
@@ -113,8 +57,6 @@ def index_vocab(examples, fr_lang, en_lang):
# Begin!
test_reporter = AccReporter(args=args,
dump_file=args["dump"])
train_reporter = AccReporter(args=args,
dump_file=args["dump"] + ".train.txt" if args["dump"] is not None else None)

if args["restore"]:
if "," in args["restore"]:
@@ -147,5 +89,4 @@ def index_vocab(examples, fr_lang, en_lang):
run(args=args,
examples=examples,
trainer=trainer, evaler=evaler, criterion=criterion,
train=train, test=test,
train_reporter=train_reporter, test_reporter=test_reporter)
test=test,test_reporter=test_reporter)
8 changes: 4 additions & 4 deletions seq2seq/model_utils.py
@@ -31,7 +31,7 @@ def setup_optimizers(args, encoder, decoder):

def model_builder(args, fr_lang, en_lang):
bidi = args["bidi"]
device_id = args["device_id"]
# device_id = args["device_id"]
batch_first = args["batch_first"]
vector_size = args["wdim"]
hidden_size = args["hdim"]
@@ -73,9 +73,9 @@ def model_builder(args, fr_lang, en_lang):
logging.info(encoder)
logging.info(decoder)
# Move models to GPU
if device_id is not None:
encoder.cuda(device_id)
decoder.cuda(device_id)
# if device_id is not None:
# encoder.cuda(device_id)
# decoder.cuda(device_id)
return encoder, decoder, evaler

