@jiasenwu it's not something I intend to do. It looks like the ONNX issue has been fixed, and the XLA lowering should land at some point. Aside from the manual patch, using the F.layer_norm codepath might work for XLA. I was planning to possibly make the F.layer_norm path the default after more testing and epsilon calibration, as it's a bit more efficient.
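For anyone comparing the two codepaths, here's a rough sketch of the idea. This is illustrative only, not timm's exact ScaledStdConv2d implementation; the class name, eps handling, and scaling are simplified, and the use_layernorm flag just mirrors the flag mentioned in this thread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2dSketch(nn.Conv2d):
    """Illustrative weight-standardized conv, NOT timm's exact implementation."""

    def __init__(self, *args, use_layernorm=True, eps=1e-6, **kwargs):
        super().__init__(*args, **kwargs)
        self.use_layernorm = use_layernorm
        self.eps = eps
        # learnable per-filter gain, as in Scaled Weight Standardization
        self.gain = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))

    def get_weight(self):
        if self.use_layernorm:
            # One fused op normalizing each filter to zero mean / unit variance;
            # it has an XLA lowering, so it stays on the TPU.
            weight = F.layer_norm(self.weight, self.weight.shape[1:], eps=self.eps)
        else:
            # torch.std_mean has no XLA lowering (pytorch/xla#2790), so on TPU
            # this branch falls back to the CPU.
            std, mean = torch.std_mean(self.weight, dim=(1, 2, 3), keepdim=True)
            weight = (self.weight - mean) / (std + self.eps)
        return self.gain * weight

    def forward(self, x):
        return F.conv2d(x, self.get_weight(), self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

One caveat when swapping paths: F.layer_norm uses a biased variance estimate while std_mean defaults to an unbiased one, so the two branches aren't numerically identical; presumably that is what the epsilon calibration mentioned above is about.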
That op isn't lowered; it's a known issue (pytorch/xla#2790) that I was hoping would be fixed. An alternative is to force the use_layernorm flag to True; that path works too. I don't intend to use separate std and mean calls.
On Sat, Jun 5, 2021, 8:46 AM Sarthak Yadav wrote:

> I might be mentioning this at the wrong place (please let me know if it's so), but leaving it here in case someone has the same issue. Calling std_mean on TPU v3 used the CPU for some reason (not sure why; I'm new to TPUs and torch-xla), causing a significant slowdown. Simply calling mean and std separately fixed the problem! [With torch==1.8.1, torch-xla==1.8.1]
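The workaround from that quoted message is a one-line change; here's a minimal sketch (the tensor shape is only for illustration):

```python
import torch

w = torch.randn(256, 128, 3, 3)  # e.g. a conv weight tensor

# Before: std_mean is not lowered by torch-xla, so on TPU the tensor is
# copied to the host, reduced on CPU, and copied back every step.
std, mean = torch.std_mean(w, dim=(1, 2, 3), keepdim=True)

# After: two separate calls; both reductions are lowered, so the whole
# graph stays on the TPU.
mean = w.mean(dim=(1, 2, 3), keepdim=True)
std = w.std(dim=(1, 2, 3), keepdim=True)
```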
I benchmarked NFNet_Fx on a TPU v3-8 (batch size 32), but it ran significantly slower than reported in its paper. After some profiling, I found that the std_mean op used in the std_conv module is likely the problem. @rwightman My question is: can it be replaced by two separate calls, std and mean? I don't know how much overhead that would add on GPU, but on TPU it is significantly faster (0.2 it/s → 8.5 it/s!) because all ops can then be lowered to the TPU.

I ran my code as PT_XLA_DEBUG=1 MODEL="nfnet_f0" python play.py; the following messages were then printed at each training step. Here 'not lowered' means the op cannot be translated to the TPU and has to be copied to and computed on the CPU.
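Besides PT_XLA_DEBUG, torch-xla's metrics report is another way to spot these fallbacks; a rough sketch, assuming timm and torch-xla are installed (the batch and input size here are just placeholders):

```python
import timm
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
model = timm.create_model('nfnet_f0').to(device)
x = torch.randn(32, 3, 192, 192, device=device)

model(x).sum().backward()
xm.mark_step()  # materialize the pending XLA graph

# Counters named 'aten::<op>' in the report are ops without an XLA lowering;
# each one was copied to the host, computed on CPU, and copied back.
print(met.metrics_report())
```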
Related issues & discussions: pytorch/xla#2790