Switchable PyTorch backend #580

Closed
msperber wants to merge 223 commits from the torchpr branch

Conversation

msperber (Contributor) commented Dec 9, 2019

This addresses #420 and implements switchable DyNet / PyTorch backends for XNMT. The backends have different advantages, such as autobatching in DyNet versus multi-GPU support, mixed precision training, and CTC training in PyTorch, any of which can be critical in certain situations. Another motivation is that it can be easier to replicate prior work when using the same deep learning framework.

All technical details are described in the updated doc, so please take a look there. I did my best to keep the changes as unobtrusive as possible, which was relatively easy given the similar design principles of DyNet and PyTorch. Switchable backends imply a somewhat increased maintenance effort for some of the core modeling code, but this code is fairly stable now, so things should be fine in this respect. For advanced features, I don't think we need to aim for keeping both backends in parallel.

The status is as follows:

  • Most example configs are supported with both backends, with the exception of a few advanced features (17_minrisk, 18_lexiconbias, 21_char_segment) that are not implemented for the PyTorch backend.
  • Most unit tests run with both backends. Those that don't support the PyTorch backend are skipped automatically when that backend is active.
  • I did comprehensive checks of activations, gradients, and updates, as well as complete training curves, to confirm that both backends perform the same computations (modulo numerical stability).
  • The 3 recipes are tested and produce similar results with both backends.
  • Speed is roughly similar with both backends. The PyTorch backend needs less GPU memory and also introduces a new CUDNN-based LSTM, which has fewer features but is significantly faster.
  • DyNet-trained models can be loaded with the PyTorch backend and evaluated or fine-tuned from there. The opposite direction is currently not implemented, since reading serialized PyTorch models is less straightforward.

There is one minor breaking change: saved model files now use a dash instead of a period, e.g. “Linear.9c2beb79” -> “Linear-9c2beb79”, because PyTorch complains when model names contain a period. Old saved models need to be renamed manually before they can be loaded.
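
For anyone with existing checkpoints, a small script along the following lines could perform the rename. This is only an illustrative sketch: it assumes the saved components sit as files directly inside the model's data directory, which may not match your layout.

```python
# Illustrative helper for migrating old saved models to the new naming scheme
# ("Linear.9c2beb79" -> "Linear-9c2beb79"). The flat-directory layout assumed
# here is an assumption; adapt the path handling to your actual model folder.
import os
import re
import sys

def rename_saved_components(model_dir: str) -> None:
  # Matches names like "Linear.9c2beb79": component name, a period, then a hex id.
  pattern = re.compile(r"^(\w+)\.([0-9a-f]+)$")
  for name in os.listdir(model_dir):
    match = pattern.match(name)
    if match:
      new_name = f"{match.group(1)}-{match.group(2)}"
      os.rename(os.path.join(model_dir, name), os.path.join(model_dir, new_name))

if __name__ == "__main__":
  rename_saved_components(sys.argv[1])
```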

One potential question about the chosen design is why DyNet and PyTorch code are mixed in the same Python modules, as opposed to having clean separate modules for each. The main reason is to allow a clean implementation of default components. For example, DefaultTranslator is backend-independent and uses bare(embedders.SimpleWordEmbedder) as the default for its src_embedder init argument. embedders.SimpleWordEmbedder has two different implementations, embedders.SimpleWordEmbedderDynet and embedders.SimpleWordEmbedderTorch, and embedders.SimpleWordEmbedder points to the appropriate one given the active backend. Moving the two implementations into separate modules would require importing things from the base module, leading to circular imports (e.g., xnmt.embedders and xnmt.embedders_dynet would both import each other). Nevertheless, I made sure that running with either backend works even without the other backend installed in the Python environment.
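
To make this concrete, the dispatch pattern roughly looks like the minimal sketch below. It is schematic rather than XNMT's actual code; the current_backend() helper and the hard-coded backend string are assumptions used only for illustration.

```python
# Schematic sketch of per-backend dispatch within a single module. The helper
# names (current_backend, SimpleWordEmbedder*) mirror the description above
# but are illustrative, not XNMT's exact code.

def current_backend() -> str:
  # In XNMT this would be determined once from the chosen backend setting;
  # hard-coded here for illustration.
  return "dynet"

class SimpleWordEmbedderDynet:
  """DyNet implementation of the word embedder."""

class SimpleWordEmbedderTorch:
  """PyTorch implementation of the word embedder."""

# The shared name points at the implementation for the active backend, so a
# backend-independent component such as DefaultTranslator can refer to
# embedders.SimpleWordEmbedder without caring which backend is loaded.
SimpleWordEmbedder = (SimpleWordEmbedderDynet if current_backend() == "dynet"
                      else SimpleWordEmbedderTorch)
```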

There are a few extra changes and fixes that are not central to the PyTorch backend, but were very helpful for debugging and unit testing:

  • fixes for loss reports, which were incorrect with the “avg” loss_comb_method, and for tensorboard logging step counters, which were not working correctly
  • a new --settings=pretend mode that runs training / evaluation on a single input and then finishes (useful as a quick sanity check that everything runs smoothly before launching a long training; see the sketch after this list)
  • extended tensorboard support
  • more flexible parameter initialization, especially for components with multiple parameter matrices, and direct initialization from given numpy arrays
  • a few other minor details
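
As a rough illustration of the pretend mode mentioned above, a settings preset of this kind would simply cap how much data gets touched. The class and attribute names below are assumptions chosen for illustration, not the exact fields of XNMT's settings module.

```python
# Hedged sketch of a "pretend" settings preset: run training/evaluation on a
# single input and stop, as a quick sanity check before a long training run.
# Attribute names are illustrative assumptions, not XNMT's actual settings.

class SettingsPretend:
  OVERWRITE_LOG = True     # don't complain about existing log files
  MAX_NUM_TRAIN_SENTS = 1  # touch only one training example
  MAX_NUM_DEV_SENTS = 1    # and only one dev/eval example
  MAX_NUM_EPOCHS = 1       # finish after a single pass
```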

— Matthias

msperber added 27 commits May 10, 2019 14:55
fix typo

fix cudnn lstm

better cudnn lstm padding check

fix tb-reporting LR for dynet optimizers

fix longtensor device

cudnn lstm: move seq_lengths to device

fix to beam search stopping criteria (neulab#572)

torch.no_grad() for LossEvalTask

no_grad() for inference code

update doc string

unit tests for cudnn lstm (passing even though training behavior seems buggy)

comment for cudnn lstm

save memory by freeing training data

fix a unit test

initial resource code

fix type annot

implement ResourceFile syntax

resolve ResourceFile when loading saved models

made resource naming and _remove_data_dir() compatible

more convenient message for existing log files

support recent pyyaml

new 'pretend' settings

standard example: revert back no epochs

fix error when trying to subsample more sentences than are in the training set

fix previous fix

cudnn lstm: use total_length option

attempted cudnn lstm fix

removed unused code in cudnn lstm

fix missing train=True events in multi task training

attempt transposed plot fix

fix code indentation in unicode tokenizer

OOVStatisticsReporter: don't crash in case of empty hypo

SkipOutOfMemory for simple training regimen (pytorch only)

cleaned up manual tests; fix grad logging

fix missing desc string in WER/CER scores
@msperber msperber requested a review from neubig December 9, 2019 15:48
@msperber msperber closed this Dec 9, 2019
@msperber msperber deleted the torchpr branch December 9, 2019 16:16