Overview

Reader-Translator-Generator (RTG) is a Neural Machine Translation toolkit based on PyTorch.

Features

  • Reproducible experiments: a single conf.yml holds everything needed to reproduce an experiment: data paths, parameters, and hyperparameters (see the sketch after this list).

  • Pre-processing options: sentencepiece or nlcodec (or add your own)

    • word, char, and BPE vocabulary types

    • shared vocabulary, separate vocabulary

      • one-way, two-way, three-way tied embeddings

  • Transformer model from "Attention is all you need"

    • Automatically detects and parallelizes across multiple GPUs

    • Many varieties of Transformer: width-varying, skip transformer, etc., configurable from YAML files

    • RNN based Encoder-Decoder with Attention (no longer in active use, but available for experimentation)

  • Language Modeling: RNN, Transformer

  • And more …

    • Easy and interpretable code (for those who read code as much as papers)

    • Object Oriented Design (not too many levels of functions and function factories like Tensor2Tensor)

    • Experiments and reproducibility are the main focus. To control an experiment, you edit a YAML file inside the experiment directory.

    • Wherever possible, convention is preferred over configuration. Have a look at the experiment directory structure below.
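
To make the conf.yml-driven workflow concrete, below is a minimal, illustrative sketch of the kind of settings such a file contains. The key names are approximations loosely modeled on examples/transformer.base.yml and may differ across RTG versions; treat that example file as the authoritative reference.

# Illustrative sketch only; see examples/transformer.base.yml for the real schema
model_type: tfmnmt        # the Transformer NMT model
model_args:
  hid_size: 512           # model / embedding dimension
  ff_size: 2048           # feed-forward dimension
  enc_layers: 6
  dec_layers: 6
  n_heads: 8
  tied_emb: three-way     # one-way, two-way, or three-way tied embeddings
prep:
  pieces: bpe             # word, char, or bpe
  shared_vocab: true      # shared vs. separate source/target vocabulary
  max_types: 8000         # vocabulary size
trainer:
  steps: 100000           # total training steps
  batch_size: 4096        # tokens per batch
  check_point: 1000       # validate and checkpoint every N steps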

Google Colab Example

Use this Google Colab notebook to learn how to train your NMT model with RTG: https://colab.research.google.com/drive/198KbkUcCGXJXnWiM7IyEiO1Mq2hdVq8T?usp=sharing

Setup


pip install rtg

Development Setup:

Note
This mode of setup is required only if you are developing (i.e., modifying RTG code). If you plan to use RTG without modifying the source code, then pip install rtg should be all you need.

Add the root of this repo to PYTHONPATH, or install it in editable mode via pip install --editable .

There are two versions of the code:

Both rtg and rtg-in have the same code on their master branches. rtg has a stable code base and is meant to be used by anyone, so it is recommended for new users. rtg-in is internal to ISI NLP, with some unfinished or work-in-progress (possibly unpublished) ideas, with issues and pull requests by members of the USC ISI team, and is often less stable. We sync both code bases often (see sync-xt.sh at the root of the repo). If you would like to collaborate with us and/or get access to rtg-in, email TG or Jonathan May.

git clone https://github.com/isi-nlp/rtg.git
cd rtg                # go to the code

conda create -n rtg python=3.7   # creates a conda env named rtg
conda activate rtg       # activate it


pip install --editable .
# The requirements are in setup.py; you may customize it if you wish

export PYTHONPATH=$PWD # or add it to PYTHONPATH

Requirements

Note
The required libraries are installed automatically by pip, so manual installation is not needed; they are listed here for informational purposes only.
Note
To view or modify the library version numbers, see setup.py at the root of this project.

The following libraries are used:

Table 1. Summary of libraries

Library         Purpose
torch           deep learning library
tensorboard     logging and visualizing training and validation losses
sacrebleu       BLEU scorer
sacremoses      tokenization and detokenization
tqdm            progress bar
ruamel.yaml     configuration management
sentencepiece   (optional) vocabulary creation using word, char, BPE
nlcodec         (optional) similar to sentencepiece, but easily customizable; scales to big datasets using pyspark and offers efficient storage of encoded parallel data
flask, jinja    (optional) HTTP API and web interface for serving the models
pyspark         (optional) parallelized data preparation (using nlcodec) for massive datasets

Thanks to all the awesome developers of these tools.

Usage

Refer to the scripts/rtg-pipeline.sh bash script and the examples/transformer.base.yml file for specific examples.

The pipeline takes source (.src) and target (.tgt) files. The sources are in one language and the targets in another. At a minimum, supply a training source, training target, validation source, and validation target. It is best to use .tok files for training. (.tok means tokenized.)
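
For illustration, the snippet below sketches how these data files might be referenced in the prep section of conf.yml. The key names (train_src, train_tgt, valid_src, valid_tgt) follow examples/transformer.base.yml but are shown here as an approximation; check that file for the exact schema.

prep:
  train_src: data/train.src.tok   # tokenized training source
  train_tgt: data/train.tgt.tok   # tokenized training target
  valid_src: data/valid.src.tok   # tokenized validation source
  valid_tgt: data/valid.tgt.tok   # tokenized validation target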

Example of training and running a model:

# To disable GPUs, set CUDA_VISIBLE_DEVICES to an empty value:
# export CUDA_VISIBLE_DEVICES=

python -m rtg.pipeline experiments/sample-exp/

# or use the CLI tool installed by pip
rtg-pipe experiments/sample-exp/

# or use the shell script (edit it to your needs) to submit to Slurm/SGE
scripts/rtg-pipeline.sh -d experiments/sample-exp/ -c experiments/sample-exp/conf.yml

# Then to use the model to translate something:
# (VERY poor translation due to small training data)
echo "Chacun voit midi à sa porte." | python -m rtg.decode experiments/sample-exp/

The 001-tfm directory that hosts an experiment looks like this:

001-tfm
├── _PREPARED    <-- Flag file indicating experiment is prepared
├── _TRAINED     <-- Flag file indicating experiment is trained
├── conf.yml     <-- Where all the params and hyper params are! You should look into this
├── data
│   ├── samples.tsv.gz          <-- samples logged after each checkpoint during training
│   ├── sentpiece.shared.model  <-- as the name says, sentence piece model, shared
│   ├── sentpiece.shared.vocab  <-- as the name says
│   ├── train.db                <-- all the prepared training data in a sqlite db
│   └── valid.tsv.gz            <-- and the validation data
├── githead       <-- the git HEAD hash when this experiment was started
├── job.sh.bak    <-- job script used to submit this to grid. Just in case
├── models        <-- All checkpoints go inside this
│   ├── model_400_5.265583_4.977106.pkl
│   ├── model_800_4.478784_4.606745.pkl
│   ├── ...
│   └── scores.tsv <-- train and validation losses, in case you don't want to use tensorboard
├── rtg.log   <-- the python logs are redirected here
├── rtg.zip   <-- the source code used for this run; just `export PYTHONPATH=rtg.zip` to reuse this exact version
├── scripts -> /Users/tg/work/me/rtg/scripts  <-- link to some perl scripts for detok+BLEU
├── tensorboard    <-- Tensorboard stuff for visualizations
│   ├── events.out.tfevents.1552850552.hackb0x2
│   └── ....
└── test_step2000_beam4_ens5   <-- Tests after the end of training, BLEU scores
    ├── valid.ref -> /Users/tg/work/me/rtg/data/valid.ref
    ├── valid.src -> /Users/tg/work/me/rtg/data/valid.src
    ├── valid.out.tsv
    ├── valid.out.tsv.detok.tc.bleu
    └── valid.out.tsv.detok.lc.bleu

Credits / Thanks