RTG CLI

The CLI tools below give you finer, step-by-step control when you want to test only a part of the pipeline. For end use of the RTG toolkit, the workflow should be as simple as:

  1. Edit the conf.yml file

  2. Run the pipeline using python -m rtg.pipeline or rtg-pipe command

  3. Occasionally, to decode newer test files that were not listed in conf.yml, use python -m rtg.decode or rtg-decode, as shown in the example below
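
For example, a minimal end-to-end run might look like this (the experiment directory and file names are placeholders):

    $ # 1. create an experiment directory containing a conf.yml
    $ mkdir -p runs/001-tfm
    $ cp path/to/my-conf.yml runs/001-tfm/conf.yml
    $ # 2. prepare data, train, and evaluate the tests listed in conf.yml
    $ rtg-pipe runs/001-tfm
    $ # 3. decode a file that was not listed in conf.yml
    $ rtg-decode runs/001-tfm -if newtest.src.txt -of newtest.out.txt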

Summary:

The following command line tools are added when rtg is installed using pip.

Table 1. Summary of CLI tools

    Command          Purpose
    ---------------  ----------------------------------------------------------------
    rtg-pipe         Run rtg-prep, rtg-train and test case evaluation
    rtg-decode       Decode new source files using the values set in conf.yml
    rtg-export       Export an experiment
    rtg-fork         Fork an experiment with/without the same conf, code, data, vocabularies, etc.
    rtg-serve        Serve an RTG model over an HTTP API using a Flask server
    rtg-decode-pro   Decode new source files using values that you supply as CLI args
    rtg-prep         Prepare an experiment; you should be using rtg-pipe instead
    rtg-train        Train a model; you should be using rtg-pipe instead
    rtg-syscomb      System combination; don't bother about it for now
    rtg-launch       Launch distributed data-parallel training
    rtg-params       Show the parameters in a model

rtg-pipe: Pipeline

This is the CLI interface that you will most likely use.

$ python -m rtg.pipeline -h

usage: rtg.prep [-h] [-G] exp [conf]

prepare NMT experiment

positional arguments:
  exp             Working directory of experiment
  conf            Config File. By default <work_dir>/conf.yml is used

optional arguments:
  -h, --help      show this help message and exit
  -G, --gpu-only  Crash if no GPU is available
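
For example, to run the full pipeline on an experiment directory, crashing early if no GPU is available (the directory name is a placeholder):

    $ rtg-pipe -G runs/001-tfm
    $ # equivalently:
    $ python -m rtg.pipeline -G runs/001-tfm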

rtg-prep: Prepare an experiment

    $ python -m rtg.prep -h
    usage: rtg.prep [-h] work_dir [conf_file]

    prepare NMT experiment

    positional arguments:
      work_dir    Working directory
      conf_file   Config File. By default <work_dir>/conf.yml is used

    optional arguments:
      -h, --help  show this help message and exit
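
This runs only the data preparation step; rtg-pipe is the recommended entry point since it also trains and evaluates. An illustrative invocation (the directory name is a placeholder):

    $ rtg-prep runs/001-tfm
    $ # an explicit config path may also be given; by default <work_dir>/conf.yml is used
    $ rtg-prep runs/001-tfm path/to/conf.yml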

rtg-train : Train a Model

    $ python -m rtg.train -h
    usage: rtg.train [-h] [-rs SEED] [-st STEPS] [-cp CHECK_POINT]
                     [-km KEEP_MODELS] [-bs BATCH_SIZE] [-op {ADAM,SGD}]
                     [-oa OPTIM_ARGS] [-ft]
                     work_dir

    Train NMT model

    positional arguments:
      work_dir              Working directory

    optional arguments:
      -h, --help            show this help message and exit
      -rs SEED, --seed SEED
                            Seed for random number generator. Set it to zero to
                            not touch this part. (default: 0)
      -st STEPS, --steps STEPS
                            Total steps (default: 128000)
      -cp CHECK_POINT, --check-point CHECK_POINT
                            Store model after every --check-point steps (default:
                            1000)
      -km KEEP_MODELS, --keep-models KEEP_MODELS
                            Number of checkpoints to keep. (default: 10)
      -bs BATCH_SIZE, --batch-size BATCH_SIZE
                            Mini batch size of training and validation (default:
                            256)
      -op {ADAM,SGD}, --optim {ADAM,SGD}
                            Name of optimizer (default: ADAM)
      -oa OPTIM_ARGS, --optim-args OPTIM_ARGS
                            Comma separated key1=val1,key2=val2 args to optimizer.
                            Example: lr=0.01,warmup_steps=1000 The arguments
                            depends on the choice of --optim (default: lr=0.001)
      -ft, --fine-tune      Use fine tune corpus instead of train corpus.
                            (default: False)
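
For example, to train for 200,000 steps, checkpointing every 2,000 steps, with illustrative (not recommended) optimizer settings:

    $ rtg-train runs/001-tfm -st 200000 -cp 2000 -op ADAM -oa lr=0.0005,warmup_steps=4000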

rtg-decode: Decoder

usage: rtg.decode [-h] [-if [INPUT [INPUT ...]]] [-of [OUTPUT [OUTPUT ...]]]
                  [-sc] [-b BATCH_SIZE] [-msl MAX_SRC_LEN] [-nb]
                  exp_dir

Decode using NMT model

positional arguments:
  exp_dir               Experiment directory

optional arguments:
  -h, --help            show this help message and exit
  -if [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]
                        Input file path. default is STDIN (default:
                        [<_io.TextIOWrapper name='<stdin>' encoding='utf-8'>])
  -of [OUTPUT [OUTPUT ...]], --output [OUTPUT [OUTPUT ...]]
                        Output File path. default is STDOUT (default:
                        [<_io.TextIOWrapper name='<stdout>'
                        encoding='utf-8'>])
  -sc, --skip-check     Skip Checking whether the experiment dir is prepared
                        and trained (default: False)
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        batch size for 1 beam. effective_batch =
                        batch_size/beam_size (default: None)
  -msl MAX_SRC_LEN, --max-src-len MAX_SRC_LEN
                        max source len; longer seqs will be truncated
                        (default: None)
  -nb, --no-buffer      Processes one line per batch followed by flush output
                        (default: False)
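
For example, to decode a new test file with the values set in conf.yml (file names are placeholders):

    $ rtg-decode runs/001-tfm -if newtest.src.txt -of newtest.out.txt
    $ # input and output default to STDIN and STDOUT, so this is equivalent:
    $ rtg-decode runs/001-tfm < newtest.src.txt > newtest.out.txt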

rtg-decode-pro: Pro Decoder

Note: for simple use with defaults from conf.yml, use rtg-decode or python -m rtg.decode.

    $ python -m rtg.decode_pro -h
    usage: rtg.decode [-h] [-if INPUT] [-of OUTPUT] [-bs BEAM_SIZE] [-ml MAX_LEN]
                      [-nh NUM_HYP] [--prepared]
                      [-bp {E1D1,E2D2,E1D2E2D1,E2D2E1D2,E1D2,E2D1}] [-it] [-sc]
                      [-en ENSEMBLE] [-cb SYS_COMB]
                      work_dir [model_path [model_path ...]]

    Decode using NMT model

    positional arguments:
      work_dir              Working directory
      model_path            Path to model's checkpoint. If not specified, a best
                            model (based on the score on validation set) from the
                            experiment directory will be used. If multiple paths
                            are specified, then an ensembling is performed by
                            averaging the param weights (default: None)

    optional arguments:
      -h, --help            show this help message and exit
      -if INPUT, --input INPUT
                            Input file path. default is STDIN (default:
                            <_io.TextIOWrapper name='<stdin>' mode='r'
                            encoding='UTF-8'>)
      -of OUTPUT, --output OUTPUT
                            Output File path. default is STDOUT (default:
                            <_io.TextIOWrapper name='<stdout>' mode='w'
                            encoding='UTF-8'>)
      -bs BEAM_SIZE, --beam-size BEAM_SIZE
                            Beam size. beam_size=1 is greedy, In theory: higher
                            beam is better approximation but expensive. But in
                            practice, higher beam doesnt always increase.
                            (default: 5)
      -ml MAX_LEN, --max-len MAX_LEN
                            Maximum output sequence length (default: 100)
      -nh NUM_HYP, --num-hyp NUM_HYP
                            Number of hypothesis to output. This should be smaller
                            than beam_size (default: 1)
      --prepared            Each token is a valid integer which is an index to
                            embedding, so skip indexifying again (default: False)
      -bp {E1D1,E2D2,E1D2E2D1,E2D2E1D2,E1D2,E2D1}, --binmt-path {E1D1,E2D2,E1D2E2D1,E2D2E1D2,E1D2,E2D1}
                            Sub module path inside BiNMT. applicable only when
                            model is BiNMT (default: None)
      -it, --interactive    Open interactive shell with decoder (default: False)
      -sc, --skip-check     Skip Checking whether the experiment dir is prepared
                            and trained (default: False)
      -en ENSEMBLE, --ensemble ENSEMBLE
                            Ensemble best --ensemble models by averaging them
                            (default: 1)
      -cb SYS_COMB, --sys-comb SYS_COMB
                            System combine models at the softmax layer using the
                            weights specified in this file. When this argument is
                            supplied, model_path argument is ignored. (default:
                            None)
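
An illustrative invocation that overrides the beam size, emits two hypotheses per input, and averages the best three checkpoints (file names are placeholders):

    $ rtg-decode-pro runs/001-tfm -if newtest.src.txt -of newtest.out.txt -bs 8 -nh 2 -en 3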

rtg-fork: Fork an experiment

usage: rtg-fork [-h] [--conf | --no-conf] [--data | --no-data]
                [--vocab | --no-vocab] [--code | --no-code]
                EXP_DIR TO_DIR

fork an experiment.

positional arguments:
  EXP_DIR     From experiment. Should be valid experiment dir
  TO_DIR      To experiment. This will be created.

optional arguments:
  -h, --help  show this help message and exit
  --conf      Copy config: from/conf.yml → to/conf.yml (default: True)
  --no-conf   Negation of --conf (default: False)
  --data      Link data dir . This includes vocab. (default: True)
  --no-data   Negation of --data (default: False)
  --vocab     copy vocabularies. dont use it with --data (default: False)
  --no-vocab  Negation of --vocab (default: True)
  --code      copy source code. (default: True)
  --no-code   Negation of --code (default: False)
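
For example, with the defaults a fork copies conf and code and links the data directory of the source experiment; to copy only the vocabularies instead of linking data, switch the flags (paths are placeholders):

    $ rtg-fork runs/001-tfm runs/002-tfm-variant
    $ # copy vocabularies but do not link the data dir
    $ rtg-fork --no-data --vocab runs/001-tfm runs/003-tfm-newdata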

rtg-export: Export an experiment

    python -m rtg.export -h
    usage: export.py [-h] [-en ENSEMBLE] [-nm NAME] [--config | --no-config]
                     [--vocab | --no-vocab]
                     source target

    positional arguments:
      source                Path to experiment (source)
      target                Path to destination where the export should be

    optional arguments:
      -h, --help            show this help message and exit
      -en ENSEMBLE, --ensemble ENSEMBLE
                            Maximum number of checkpoints to average and export.
                            set 0 to disable (default: 5)
      -nm NAME, --name NAME
                            Name for the exported model (active when --ensemble >
                            0). Value should be a single word. This will be useful
                            if you are going to place multiple exports in a same
                            dir for system combination (default: None)
      --config              Copy config (default: True)
      --no-config           See --config (default: False)
      --vocab               Copy vocabulary files (such as sentence piece models)
                            (default: True)
      --no-vocab            See --vocab (default: False)
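
For example, to export an experiment while averaging its best 5 checkpoints into a single model (the destination path and name are placeholders):

    $ rtg-export runs/001-tfm exports/001-tfm -en 5 -nm avg5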

Other tools:

rtg-syscomb: System Combiner

    python -m rtg.syscomb -h
    usage: __main__.py [-h] [-b BATCH_SIZE] [-s STEPS]
                       experiment models [models ...]

    positional arguments:
      experiment            Path to experiment directory
      models                Path to models

    optional arguments:
      -h, --help            show this help message and exit
      -b BATCH_SIZE, --batch-size BATCH_SIZE
                            Batch size (default: 128)
      -s STEPS, --steps STEPS
                            Training steps (default: 2000)
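
An illustrative invocation combining two trained models (the experiment directory and the model checkpoint paths are placeholders; the checkpoint layout is experiment specific):

    $ python -m rtg.syscomb -b 64 -s 1000 runs/comb-exp <model1-checkpoint> <model2-checkpoint>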

Perplexity

Compute perplexity of a language model on a test set.

    $ python -m rtg.eval.perplexity -h
    usage: rtg.eval.perplexity [-h] [-t TEST] [-en ENSEMBLE]
                               work_dir [model_path [model_path ...]]

    positional arguments:
      work_dir              Working/Experiment directory
      model_path            Path to model's checkpoint. If not specified, a best
                            model (based on the score on validation set) from the
                            experiment directory will be used. If multiple paths
                            are specified, then an ensembling is performed by
                            averaging the param weights (default: None)

    optional arguments:
      -h, --help            show this help message and exit
      -t TEST, --test TEST  test file path. default is STDIN (default:
                            <_io.TextIOWrapper name='<stdin>' mode='r'
                            encoding='UTF-8'>)
      -en ENSEMBLE, --ensemble ENSEMBLE
                            Ensemble best --ensemble models by averaging them
                            (default: 1)
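
For example, to score a held-out file with the best checkpoint of an experiment (the file path is a placeholder):

    $ python -m rtg.eval.perplexity runs/001-tfm -t heldout.tgt.txt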

Line BLEU

Computes BLEU score per line.

    python -m rtg.eval.linebleu -h
    usage: linebleu.py [-h] [-c CANDS] [-r REFS] [-n N] [-nr] [-nc] [-o OUT] [-v]

    Computes BLEU score per record.

    optional arguments:
      -h, --help            show this help message and exit
      -c CANDS, --cands CANDS
                            Candidate (aka output from NLG system) file (default:
                            <_io.TextIOWrapper name='<stdin>' mode='r'
                            encoding='UTF-8'>)
      -r REFS, --refs REFS  Reference (aka human label) file (default:
                            <_io.TextIOWrapper name='<stdin>' mode='r'
                            encoding='UTF-8'>)
      -n N, --n N           maximum n as in ngram. (default: 4)
      -nr, --no-refs        Do not write references to --out (default: False)
      -nc, --no-cands       Do not write candidates to --out (default: False)
      -o OUT, --out OUT     Output file path to store the result. (default:
                            <_io.TextIOWrapper name='<stdout>' mode='w'
                            encoding='UTF-8'>)
      -v, --verbose         verbose mode (default: False)
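
For example, to compute per-line BLEU between system outputs and references (file names are placeholders):

    $ python -m rtg.eval.linebleu -c newtest.out.txt -r newtest.ref.txt -o newtest.linebleu.txt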

OOV

Compute the Out-of-Vocabulary (OOV) rate.

    $ python -m rtg.tool.oov -h
    usage: oov.py [-h] -tr TRAIN [-ts [TESTS [TESTS ...]]]

    optional arguments:
      -h, --help            show this help message and exit
      -tr TRAIN, --train TRAIN
                            Train file path (default: None)
      -ts [TESTS [TESTS ...]], --test [TESTS [TESTS ...]]
                            Test file paths (default: [<_io.TextIOWrapper
                            name='<stdin>' mode='r' encoding='UTF-8'>])
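
For example, to measure the OOV rate of test sets relative to a training corpus (file names are placeholders):

    $ python -m rtg.tool.oov -tr train.src.txt -ts dev.src.txt newtest.src.txt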

Class imbalance, Sequence lengths

Computes class imbalance on the training data and reports mean and median sequence lengths. These are the statistics reported in Gowda and May's Neural Machine Translation with Imbalanced Classes.

$ python -m rtg.eval.imbalance -h
usage: imbalance.py [-h] exp

positional arguments:
  exp         Path to experiment directory

optional arguments:
  -h, --help  show this help message and exit

Example:

$ python -m rtg.eval.imbalance runs/001-tfm
Experiment: runs/001-tfm shared_vocab:True
src types: 500 toks: 2,062,912 len_mean: 15.8686 len_median: 15.0 imbalance: 0.4409
tgt types: 500 toks: 1,711,685 len_mean: 13.1668 len_median: 12.0 imbalance: 0.4632
n_segs: 130,000