Commit

code upload

shyamupa committed Jan 26, 2019
1 parent f0d2146 commit 897eb1e
Showing 32 changed files with 3,906 additions and 8 deletions.
16 changes: 16 additions & 0 deletions .gitignore
@@ -0,0 +1,16 @@
*.tab
*.dict
*.pred
*.model
*.tar
*.vocab
*.vocab.romanized
*.tar.gz
phone_index*
data/
.idea
*.txt
*.pyc
*.log
*.so
m2m/
71 changes: 63 additions & 8 deletions README.md
@@ -1,12 +1,67 @@
Code for the EMNLP paper, "Bootstrapping Transliteration with Guided Discovery for Low-Resource Languages".

Coming soon.
## Running the code

1. First, compile the C code for the aligner:
```bash
cd baseline/
make
```
@InProceedings{UKR18,
author = {Upadhyay, Shyam and Kodner, Jordan and Roth, Dan},
title = {Bootstrapping Transliteration with Guided Discovery for Low-Resource Languages},
booktitle = {EMNLP},
year = {2018},
}

2. Write your train, dev, and test data in the following format:

```
x1 x2 x3<tab>y1 y2 y3 y4 y5
```
where `x1x2x3` is the input word (each `xi` is a character) and `y1y2y3y4y5` is the desired output (transliteration); a sketch of how such a file can be parsed is given below. Example train and test files for Bengali are in the `data/` folder. There is an optional third column marking whether the word is *native* or *foreign* (see the paper for these terms); this column can be ignored for most purposes.
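
For reference, here is a minimal sketch of how such a tab-separated file could be parsed; the helper name `read_pairs` and the handling of the optional third column are illustrative assumptions, not code from this repository.

```python
# Illustrative only: parse the tab-separated format described above.
def read_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            src = fields[0].split()  # input characters, e.g. ['x1', 'x2', 'x3']
            tgt = fields[1].split()  # output characters, e.g. ['y1', 'y2', 'y3', 'y4', 'y5']
            tag = fields[2] if len(fields) > 2 else None  # optional native/foreign marker
            pairs.append((src, tgt, tag))
    return pairs
```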


3. Run `train_model_on_files.sh` on your train file (say `train.txt`) and dev file (`dev.txt`) as follows:

```
./train_model_on_files.sh train.txt dev.txt 100 translit.model
```

where `100` is the random seed and `translit.model` is the output model. Other parameters (see `utils/arguments.py` for options) can be specified by modifying the `train_model_on_files.sh` script appropriately.

4. Test the trained model as follows:

```
./load_and_test_model_on_files.sh train.txt test.txt translit.model 100 output.txt
```

The output should report the relevant metrics:

```
...
...
:INFO: --------------------TEST--------------------
:INFO: running infer on example 200
:INFO: running infer on example 400
:INFO: running infer on example 600
:INFO: running infer on example 800
:INFO: accuracy 367/997=0.37
:INFO: accuracy (nat) 308/661=0.47
:INFO: accuracy (eng) 59/336=0.18
:INFO: ********************total********************
:INFO: ACC: 0.371457 (367/988)
:INFO: Mean F-score: 0.910995
:INFO: Mean ED@1: 1.136640+-1.167
:INFO: Mean NED@1: 0.084884
:INFO: Median ED@1: 1.000000
...
...
```
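
For intuition about the numbers above, here is a small, self-contained sketch of how word accuracy and character edit distance (the quantities behind `ACC` and `ED@1`) can be computed; it is not the evaluation code used by `load_and_test_model_on_files.sh`.

```python
# Illustrative only: word accuracy and mean character edit distance over
# (prediction, reference) pairs, roughly what ACC and ED@1 above measure.
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def evaluate(pairs):
    acc = sum(p == r for p, r in pairs) / len(pairs)
    mean_ed = sum(edit_distance(p, r) for p, r in pairs) / len(pairs)
    return acc, mean_ed
```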

There is also an interactive mode where one can input test words directly:

```
./load_and_test_model_interactive.sh <ftrain> <model> <seed>
...
...
:INFO: => loading checkpoint hindi.model
:INFO: => loaded checkpoint!
enter surface:ओबामा
ओ ब ा म ा
[(-0.4624647759074629, 'o b a m a')]
```

10 changes: 10 additions & 0 deletions baseline/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
all: libperceptron.so libalign.so

libperceptron.so: perceptron.c
gcc -O3 -Wall -Wextra -shared -fPIC perceptron.c -o libperceptron.so

libalign.so: align.c
gcc -O3 -Wall -Wextra -shared -fPIC align.c -o libalign.so

clean:
/bin/rm libperceptron.so libalign.so *.pyc
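
The shared libraries built here are presumably loaded from the Python code (e.g. via `ctypes`); a minimal sketch of what that loading could look like is below. The function name `align` and its signature are hypothetical, not taken from `align.c`.

```python
# Illustrative only: how a shared library built by this Makefile might be
# loaded from Python with ctypes. The function name and signature below are
# hypothetical; the actual exports are defined in align.c / perceptron.c.
import ctypes

libalign = ctypes.CDLL("./libalign.so")
# e.g. (hypothetical):
# libalign.align.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
# libalign.align.restype = ctypes.c_int
# score = libalign.align(b"obama", b"obama")
```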
