NFNet training throughput #500
-
@aljoschaleonhardt They are slower in PyTorch, especially with the official config for the dm variants; XLA optimizes away some aspects of the model that eager-mode PyTorch doesn't handle well.

For my nfnet_f0s on an RTX 3090, trying to replicate their latencies, I get 306.75 samples/sec, 3.17 ms/sample, 101.347 ms/step. That is at batch size 32 (they report 32 per device for their latencies) and the F0 train-time resolution of 192x192. I can hit 566 img/s training at batch size 256 and 192x192 res. The dm_nfnet_f0 is slower at 245.59 samples/sec, 3.96 ms/sample, 126.800 ms/step. I ran those on the PyTorch NGC 20.12 container.

One thing I noticed is that the perf of these NFNet models, both the dm and my SiLU variants, varies significantly across recent PyTorch/NGC versions and hardware. Definitely do not enable channels-last (if you were doing so). See a recent comparison I did that includes the f0s and my l0c light variant of NFNet: https://gist.github.com/rwightman/bb59f9e245162cee0e38bd66bd8cd77f
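For reference, a minimal sketch of how numbers like these can be measured (an illustrative stand-in, not the actual script behind the gist above): time full forward/backward/optimizer steps with native AMP on synthetic data at batch size 32 and 192x192, the settings quoted above, and report samples/sec, ms/sample, and ms/step. Channels-last is deliberately left off.

```python
# Hypothetical throughput measurement sketch -- not timm's benchmark script.
import time
import torch
import timm


def train_throughput(model_name='nfnet_f0s', batch_size=32, img_size=192,
                     steps=50, warmup=10):
    model = timm.create_model(model_name, pretrained=False).cuda().train()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()

    # Synthetic data so only model + optimizer cost is measured.
    x = torch.randn(batch_size, 3, img_size, img_size, device='cuda')
    y = torch.randint(0, 1000, (batch_size,), device='cuda')

    for i in range(warmup + steps):
        if i == warmup:
            torch.cuda.synchronize()
            start = time.perf_counter()
        opt.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = torch.nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    step_ms = elapsed / steps * 1000
    print(f'{model_name}: {batch_size * steps / elapsed:.2f} samples/sec, '
          f'{step_ms / batch_size:.2f} ms/sample, {step_ms:.3f} ms/step')


if __name__ == '__main__':
    train_throughput()
```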
-
@rwightman,
-
First off, I'm super impressed by how quickly the NFNet-F* implementations and weights landed in `timm`. Absolutely fantastic work @rwightman :)

I've been tinkering with NFNet-F0 on some typical workloads but can't reproduce anything close to the latency values described in the paper (Brock et al., 2021). Running a really bare-bones setup in Lightning, I can push about 110 images per second through the `timm` SiLU version (`nfnet_f0s`) and a little under 100 through `dm_nfnet_f0`. The `efficientnet_b5` in `timm` (which should roughly match the ImageNet performance of NFNet-F0) gives me around 180 images/sec under the exact same conditions (V100, native PyTorch AMP, synthetic data).

According to the paper, EN-B5 should be ~8x slower (measured by time per training step) than F0. @rwightman In your benchmarks, are you able to reproduce their JAX numbers, at least approximately? Any idea where/if I'm on the completely wrong track here?
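For concreteness, here is a stripped-down sketch of what I mean by "the exact same conditions" (a simplified stand-in, not my actual Lightning code): one timed forward/backward step per model on synthetic data at each model's default train resolution under native AMP, with Lightning removed so only raw model throughput is compared. The batch size is an arbitrary assumption and may need lowering for `efficientnet_b5` at its larger resolution.

```python
# Simplified comparison harness (assumed setup, not the original Lightning one).
import time
import torch
import timm


def step_time_ms(model_name, batch_size=32, steps=30, warmup=10):
    model = timm.create_model(model_name, pretrained=False).cuda().train()
    # Use each model's own default train resolution from its timm config.
    img_size = model.default_cfg['input_size'][-1]
    x = torch.randn(batch_size, 3, img_size, img_size, device='cuda')
    y = torch.randint(0, 1000, (batch_size,), device='cuda')

    for i in range(warmup + steps):
        if i == warmup:
            torch.cuda.synchronize()
            start = time.perf_counter()
        with torch.cuda.amp.autocast():
            loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()  # optimizer step omitted for brevity
        model.zero_grad(set_to_none=True)

    torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps * 1000


for name in ('nfnet_f0s', 'dm_nfnet_f0', 'efficientnet_b5'):
    print(f'{name}: {step_time_ms(name):.1f} ms/step')
```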
Thanks in advance for everyone's opinion!