[SYCL][CUDA][libclc] Add approx. tanhf built-in #5265
This patch adds support for an approximate single-precision hyperbolic tangent built-in function, introduced in PTX 7.0 for devices with compute capability >= 7.5. If this built-in is available, it can be used by setting the `-fcuda-approx-tanhf` flag.
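As a rough illustration of the conditions stated above, the availability check can be sketched in plain C++. This is a minimal sketch, not the code in this patch: `CudaTarget` and `can_use_approx_tanhf` are hypothetical names, and the thresholds (PTX 7.0, compute capability 7.5, flag enabled) are taken from the description.

```cpp
// Hypothetical helper for illustration only; it is not part of this patch.
struct CudaTarget {
  int sm_version;   // e.g. 75 for sm_75 (Turing), 80 for sm_80 (Ampere)
  int ptx_version;  // e.g. 70 for PTX 7.0, 65 for PTX 6.5
};

// The approximate tanhf built-in is only selected when the user opted in
// via the flag AND the target satisfies both version requirements.
constexpr bool can_use_approx_tanhf(CudaTarget target, bool flag_enabled) {
  return flag_enabled && target.sm_version >= 75 && target.ptx_version >= 70;
}

// Compile-time sanity checks of the sketch itself.
static_assert(can_use_approx_tanhf({75, 70}, true), "sm_75 + PTX 7.0 + flag");
static_assert(!can_use_approx_tanhf({70, 70}, true), "sm_70 is too old");
static_assert(!can_use_approx_tanhf({80, 70}, false), "flag not set");
```

When the check fails, the regular `tanhf` implementation would be used instead.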
Basically, the implemented approach extends the Another possible solution is exposing this built-in via a
Hi @bader, what do you think about it?
@pgorlani, sorry for the delay.
I suggest adding a check for correct usages of the flag to the compiler.
This sounds like a useful feature to have, but considering the amount of work needed to enable it, I suggest we handle it separately. Let's create a feature request for
Tagging @andykaylor for awareness.
Thanks for your answer, @bader.
This is a very good suggestion, and I think we need to apply this flag to the normal built-in that falls into the fast-math category within The
This implies a fairly complex modification within the driver/CudaToolChain made ad hoc for this built-in. Currently, the compiler errors out if the installed CUDA toolkit does not support the PTX version of the specified architecture, not on specific instructions. In order to simplify things, I introduced a check in
Okay. Thank you for the clarification.
Co-authored-by: Alexey Bader <[email protected]>
LGTM.
Please update the PR description as well.
It says:
This patch adds the support for an approximate
hyperbolic tangent single-precision built-in
function introduced in PTX 7.0 for devices having
compute capabilities >= 7.5.
I have a few concerns about the approach. This is a very specific solution to a general problem; I don't think adding a flag just for tanh is a solution if other builtins are also affected by this need. My real concern is that this applies global module state that will pollute other modules. This is potentially serious because of the mandatory LTO: it will prevent you from compiling one kernel with approx on and another with approx off. I'm not sure what to suggest here, but the approach shouldn't prevent a proper future solution if merged.
I will try to look at an approach based on the
This patch adds a note to the Get Started Guide regarding the minimum CUDA toolkit version required to fully utilize Turing devices (sm_75). CUDA toolkit version 11.0 introduces PTX 7.0 and is the first version to support the Ampere architecture (sm_80). However, some instructions introduced by PTX 7.0 (e.g. approximate tanh (#5265) and ex2 for halves) can also be executed by Turing devices (sm_75) if CUDA 11.0 (or above) is installed. Compilation for Turing devices is also possible using CUDA 10.2 (the version currently reported as tested); however, if one of these PTX 7.0 instructions is used, it will generate an error.
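The toolkit requirement in this note can be expressed as a small check. The following is only a sketch under the assumptions stated in the note (CUDA 11.0 is the first toolkit with PTX 7.0, and the affected instructions need at least sm_75); `ToolkitVersion` and `ptx70_instr_available` are hypothetical names and do not exist in the compiler.

```cpp
// Hypothetical types and helpers for illustration only.
struct ToolkitVersion {
  int cuda_major;  // e.g. 11 for CUDA 11.0
  int cuda_minor;  // e.g. 0 for CUDA 11.0
};

// Per the note above, PTX 7.0 first ships with CUDA toolkit 11.0.
constexpr bool toolkit_has_ptx70(ToolkitVersion v) {
  return v.cuda_major > 11 || (v.cuda_major == 11 && v.cuda_minor >= 0);
}

// PTX 7.0 instructions such as approximate tanh need both a new enough
// toolkit and a device of at least sm_75 (Turing).
constexpr bool ptx70_instr_available(ToolkitVersion v, int sm_version) {
  return sm_version >= 75 && toolkit_has_ptx70(v);
}

// Turing with CUDA 11.0 works; Turing with CUDA 10.2 does not.
static_assert(ptx70_instr_available({11, 0}, 75), "sm_75 with CUDA 11.0");
static_assert(!ptx70_instr_available({10, 2}, 75), "sm_75 with CUDA 10.2");
```

With CUDA 10.2, code for sm_75 still compiles as long as none of these instructions is used.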
In #5747, we implemented an extension for defining native builtins outside the SYCL specification in order to achieve a more generic solution for this kind of problem. For this reason, I converted this PR to a draft.
This patch adds the support for an approximate
hyperbolic tangent single-precision built-in
function introduced in PTX 7.0 for devices having
compute capabilities >= 8.0.
If this built-in is available, it is possible to use
it by setting the
-fcuda-approx-tanhf
flag.