@jiasenwu it's not something I intend to do. It looks like the ONNX issue has been fixed, and the XLA lowering should land at some point. Aside from the manual patch, using the F.layer_norm codepath might work for XLA. I was planning to possibly make the F.layer_norm path the default after more testing and epsilon calibration, as it's a bit more efficient.
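For anyone comparing the two codepaths, here's a rough sketch of the idea. This is illustrative only, not timm's exact ScaledStdConv2d implementation; the class name, eps handling, and scaling are simplified, and the use_layernorm flag just mirrors the flag mentioned in this thread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2dSketch(nn.Conv2d):
    """Illustrative weight-standardized conv, NOT timm's exact implementation."""

    def __init__(self, *args, use_layernorm=True, eps=1e-6, **kwargs):
        super().__init__(*args, **kwargs)
        self.use_layernorm = use_layernorm
        self.eps = eps
        # learnable per-filter gain, as in Scaled Weight Standardization
        self.gain = nn.Parameter(torch.ones(self.out_channels, 1, 1, 1))

    def get_weight(self):
        if self.use_layernorm:
            # One fused op normalizing each filter to zero mean / unit variance;
            # it has an XLA lowering, so it stays on the TPU.
            weight = F.layer_norm(self.weight, self.weight.shape[1:], eps=self.eps)
        else:
            # torch.std_mean has no XLA lowering (pytorch/xla#2790), so on TPU
            # this branch falls back to the CPU.
            std, mean = torch.std_mean(self.weight, dim=(1, 2, 3), keepdim=True)
            weight = (self.weight - mean) / (std + self.eps)
        return self.gain * weight

    def forward(self, x):
        return F.conv2d(x, self.get_weight(), self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

One caveat when swapping paths: F.layer_norm uses a biased variance estimate while std_mean defaults to an unbiased one, so the two branches aren't numerically identical; presumably that is what the epsilon calibration mentioned above is about.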
That op isn't lowered; it's a known issue (pytorch/xla#2790) that I was hoping would be fixed. An alternative is to force the use_layernorm flag to True; that path works too. I don't intend to use separate std and mean calls.
On Sat, Jun 5, 2021, 8:46 AM Sarthak Yadav wrote:

> I might be mentioning this at the wrong place (please let me know if it's so), but leaving it here in case someone has the same issue. Calling std_mean on TPU v3 used the CPU for some reason (not sure why; I'm new to TPUs and torch-xla), causing a significant slowdown. Simply calling mean and std separately fixed the problem! [With torch==1.8.1, torch-xla==1.8.1]
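The workaround from that quoted message is a one-line change; here's a minimal sketch (the tensor shape is only for illustration):

```python
import torch

w = torch.randn(256, 128, 3, 3)  # e.g. a conv weight tensor

# Before: std_mean is not lowered by torch-xla, so on TPU the tensor is
# copied to the host, reduced on CPU, and copied back every step.
std, mean = torch.std_mean(w, dim=(1, 2, 3), keepdim=True)

# After: two separate calls; both reductions are lowered, so the whole
# graph stays on the TPU.
mean = w.mean(dim=(1, 2, 3), keepdim=True)
std = w.std(dim=(1, 2, 3), keepdim=True)
```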
I benchmarked NFNet_Fx on a TPU v3-8 (batch size 32), but it ran significantly slower than reported in its paper. After some profiling, I found that the std_mean op used in the std_conv module is likely the problem. @rwightman My question is: can it be replaced by two separate calls, std and mean? I don't know how much overhead that would add on GPU, but on TPU it is significantly faster (0.2 it/s → 8.5 it/s!) because all ops can then be lowered to the TPU.

I ran my code as PT_XLA_DEBUG=1 MODEL="nfnet_f0" python play.py; the following messages were then printed at each training step. Here 'not lowered' means the op cannot be translated to the TPU and has to be copied to and computed on the CPU.
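Besides PT_XLA_DEBUG, torch-xla's metrics report is another way to spot these fallbacks; a rough sketch, assuming timm and torch-xla are installed (the batch and input size here are just placeholders):

```python
import timm
import torch
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met

device = xm.xla_device()
model = timm.create_model('nfnet_f0').to(device)
x = torch.randn(32, 3, 192, 192, device=device)

model(x).sum().backward()
xm.mark_step()  # materialize the pending XLA graph

# Counters named 'aten::<op>' in the report are ops without an XLA lowering;
# each one was copied to the host, computed on CPU, and copied back.
print(met.metrics_report())
```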
Related issues & discussions: pytorch/xla#2790