
[SYCL][CUDA][libclc] Add approx. tanhf built-in #5265


Closed
wants to merge 11 commits

Conversation

pgorlani
Contributor

@pgorlani pgorlani commented Jan 6, 2022

This patch adds support for an approximate hyperbolic tangent single-precision built-in function, introduced in PTX 7.0, for devices with compute capability >= 8.0.

If this built-in is available, it is possible to use it by setting the -fcuda-approx-tanhf flag.

This patch adds support for an approximate hyperbolic tangent single-precision built-in function, introduced in PTX 7.0, for devices with compute capability >= 7.5.

If this built-in is available, it is possible to use it by setting the `-fcuda-approx-tanhf` flag.
@pgorlani
Contributor Author

pgorlani commented Jan 6, 2022

Basically, the implemented approach extends the nvvm_reflect functionality, adding the __CUDA_APPROX_TANHF parameter in order to select the tanh built-in intrinsic. This parameter is set by the user compiling the code with the -Xclang -fcuda-approx-tanhf flag. This implies that the user is aware of the built-in's availability (i.e., device compute capability >= 7.5 and PTX version >= 7.0) and uses the compiler flag accordingly.
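
To make this concrete, here is a minimal sketch (not the actual patch code; `my_tanhf`, its fallback call, and the exact reflect declaration are illustrative) of how a libclc-style implementation could branch on the new reflect parameter:

```cpp
// Sketch only: the NVVMReflect pass folds __nvvm_reflect calls to
// compile-time constants, so the untaken branch is removed before
// PTX generation.
extern "C" int __nvvm_reflect(const char *);

float my_tanhf(float x) {
  if (__nvvm_reflect("__CUDA_APPROX_TANHF")) {
    float r;
    // tanh.approx.f32 needs compute capability >= 7.5 and PTX >= 7.0.
    __asm__("tanh.approx.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
  }
  return __builtin_tanhf(x); // accurate software expansion
}
```

The user then opts in explicitly, e.g. with -Xclang -fcuda-approx-tanhf, accepting responsibility for targeting a capable device.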

Another possible solution is exposing this built-in via a sycl::native:: function, without the need for a flag.
This approach presents a couple of issues. First, sycl::native::tanh is not present in the standard (this is a minor issue, since it could be added with an extension, I guess). Second, nvvm_reflect doesn't have a parameter for querying the PTX version (it has just one for querying the architecture). Automatically introducing this built-in without knowing the PTX version (and without the user's awareness) could lead to compilation failures (when c.c. >= 7.5 and PTX v. < 7.0) that are hard to understand, e.g.:

```
ptxas /tmp/tester-sm_75-143011.s, line 53; error   : Not a name of any known instruction: 'tanh'
ptxas /tmp/tester-sm_75-143011.s, line 97; error   : Not a name of any known instruction: 'tanh'
ptxas /tmp/tester-sm_75-143011.s, line 132; error   : Not a name of any known instruction: 'tanh'
ptxas /tmp/tester-sm_75-143011.s, line 170; error   : Not a name of any known instruction: 'tanh'
ptxas fatal   : Ptx assembly aborted due to errors
llvm-foreach:
clang-14: error: ptxas command failed with exit code 255 (use -v to see invocation)
clang version 14.0.0 (git@github.com:pgorlani/llvm.git 91bb219cf39832dee4b9ddaae36418ea6b9cf4a9)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/paologorlani/llvm/build/bin
clang-14: note: diagnostic msg: Error generating preprocessed source(s).
```

Hi @bader, what do you think about it?

@bader
Contributor

bader commented Jan 18, 2022

@pgorlani, sorry for the delay.

This implies that the user is aware of the built-in's availability (i.e., device compute capability >= 7.5 and PTX version >= 7.0) and uses the compiler flag accordingly.

I suggest adding a check to the compiler for correct usage of the flag.
Does it make sense to add support for -fsycl-approx-transcendentals (similar to -fcuda-approx-transcendentals), enabling all supported approximate functions?

Another possible solution is exposing this built-in via a sycl::native:: function, without the need for a flag.

This sounds like a useful feature to have, but considering the amount of work needed to enable it, I suggest we handle it separately. Let's create a feature request for sycl::native::tanh and see if there is any interest in it. What do you think?

@bader bader requested a review from Naghasan January 18, 2022 14:07
@bader
Contributor

bader commented Jan 18, 2022

Tagging @andykaylor for awareness.

@pgorlani
Contributor Author

Thanks for your answer, @bader.

Does it make sense to add support for -fsycl-approx-transcendentals (similar to -fcuda-approx-transcendentals), enabling all supported approximate functions?

This is a very good suggestion, and I think we should also apply this flag to the normal built-ins that fall into the fast-math category within libdevice (sin, cos, div, ...). Users are aware of their behavior through the --use_fast_math flag in nvcc. These functions introduce relatively small errors (just a few ULPs) compared to the tanhf built-in.

The tanhf built-in introduced in this patch is one of a kind. First, it introduces a large approximation error, up to 134 ULPs. Second, this built-in is not present in libdevice, and there are no flags in nvcc that are able to introduce it into the code.
For these reasons, I speculate that this built-in is only meant to be used as an inline PTX instruction for computing the activation function in neural networks. Therefore, at least for now, it should be kept separate from the other fast-math functions.
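
To illustrate the intended use case, here is a minimal (illustrative, not from this patch) SYCL kernel applying tanh as an activation function; with this patch, compiling it with -Xclang -fcuda-approx-tanhf for a capable target would lower the tanh call to the approximate instruction:

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: apply tanh as an activation function over a
// device-accessible array (e.g., USM memory).
void activate(sycl::queue &q, float *data, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] = sycl::tanh(data[i]);
  }).wait();
}
```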

I suggest adding a check to the compiler for correct usage of the flag.

This would require a fairly complex ad hoc modification of the driver/CudaToolChain for this built-in. Currently, the compiler errors out when the installed CUDA toolkit does not support the PTX version of the specified architecture, not on specific instructions.

In order to simplify things, I introduced a check in libclc on the target architecture. If -fcuda-approx-tanhf is set and the architecture is >= sm_80, the built-in is used; if < sm_80, the plain (non-built-in) version is used. This avoids any kind of error.
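
Continuing the earlier sketch, that guard would look roughly like this (again illustrative; __CUDA_ARCH is the existing nvvm_reflect architecture query, which returns e.g. 800 for sm_80):

```cpp
extern "C" int __nvvm_reflect(const char *);

float my_tanhf(float x) {
  // Emit the approximate instruction only when the flag is set AND
  // the target architecture guarantees toolchain support; otherwise
  // use the plain (non-built-in) expansion.
  if (__nvvm_reflect("__CUDA_APPROX_TANHF") &&
      __nvvm_reflect("__CUDA_ARCH") >= 800) {
    float r;
    __asm__("tanh.approx.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
  }
  return __builtin_tanhf(x); // plain (non-built-in) version
}
```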

@bader
Contributor

bader commented Jan 26, 2022

Okay. Thank you for the clarification.

@pgorlani pgorlani marked this pull request as ready for review January 31, 2022 09:18
@pgorlani pgorlani requested review from bader and a team as code owners January 31, 2022 09:18
bader
bader previously approved these changes Jan 31, 2022
Contributor

@bader bader left a comment


LGTM.
Please, update the PR description as well.
It says:

This patch adds support for an approximate hyperbolic tangent single-precision built-in function, introduced in PTX 7.0, for devices with compute capability >= 7.5.

@Naghasan
Contributor

Naghasan commented Jan 31, 2022

I have a few concerns about the approach.

This is a very specific solution to a general problem; I don't think adding a flag just for tanh is the solution if other built-ins are also affected by this need.

The real concern I have is that this applies a global module state that will pollute other modules. This is potentially serious due to the mandatory LTO: it prevents you from compiling one kernel with approx on and another with approx off. I'm not sure what to suggest here, but the approach shouldn't prevent a future proper solution if merged.

@Naghasan
Contributor

I tried to look at an approach based on the target attribute, but it is not fit for purpose as it is (very dynamic).

pvchupin pushed a commit that referenced this pull request Feb 24, 2022
This patch adds a note to the Get Started Guide regarding the minimum CUDA toolkit version required to fully utilize Turing devices (sm_75).

CUDA toolkit version 11.0 introduces PTX 7.0. This version is the first to support the Ampere architecture (sm_80); however, some instructions introduced by PTX 7.0 (e.g., the approximate tanh (#5265) and ex2 for halfs) can also be executed by Turing devices (sm_75) if CUDA 11.0 or above is installed.

Compilation for Turing devices is also possible using CUDA 10.2 (the version currently reported as tested); however, if one of these PTX 7.0 instructions is used, it will generate an error.
@pgorlani pgorlani marked this pull request as draft March 7, 2022 10:56
@pgorlani
Contributor Author

pgorlani commented Mar 7, 2022

In #5747, we implemented an extension for defining native built-ins outside the SYCL specification, in order to achieve a more generic solution for this kind of problem.

For this reason, I converted this PR to a draft.
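
For reference, user code under the extension route might look roughly like this (a sketch; the namespace follows intel/llvm's experimental extension conventions, and the exact spelling used by #5747 may differ):

```cpp
#include <sycl/sycl.hpp>

// Sketch: an explicitly requested native tanh, chosen per call site
// instead of via a global compiler flag.
void activate(sycl::queue &q, float *data, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] = sycl::ext::oneapi::experimental::tanh(data[i]);
  }).wait();
}
```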

@github-actions github-actions bot added the Stale label Sep 4, 2022
@github-actions github-actions bot closed this Oct 5, 2022