
[SYCL][CUDA][libclc] Add approx. tanhf built-in #5265


Closed
wants to merge 11 commits

Conversation

pgorlani
Contributor

@pgorlani pgorlani commented Jan 6, 2022

This patch adds support for an approximate hyperbolic tangent single-precision built-in function, introduced in PTX 7.0, for devices with compute capability >= 8.0.

If this built-in is available, it is possible to use it by setting the -fcuda-approx-tanhf flag.

This patch adds support for an approximate hyperbolic tangent single-precision built-in function, introduced in PTX 7.0, for devices with compute capability >= 7.5.

If this built-in is available, it is possible to use it by setting the `-fcuda-approx-tanhf` flag.
@pgorlani
Contributor Author

pgorlani commented Jan 6, 2022

Basically, the implemented approach extends the nvvm_reflect functionality, adding the __CUDA_APPROX_TANHF parameter in order to select the tanh built-in intrinsic. This parameter is set by the user compiling the code with the -Xclang -fcuda-approx-tanhf flag. This implies that the user is aware of the built-in's availability (i.e., device compute capability >= 7.5 and PTX version >= 7.0) and uses the compiler flag accordingly.
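
To make this concrete, here is a minimal sketch (not the actual patch code; `my_tanhf`, its fallback call, and the exact reflect declaration are illustrative) of how a libclc-style implementation could branch on the new reflect parameter:

```cpp
// Sketch only: the NVVMReflect pass folds __nvvm_reflect calls to
// compile-time constants, so the untaken branch is removed before
// PTX generation.
extern "C" int __nvvm_reflect(const char *);

float my_tanhf(float x) {
  if (__nvvm_reflect("__CUDA_APPROX_TANHF")) {
    float r;
    // tanh.approx.f32 needs compute capability >= 7.5 and PTX >= 7.0.
    __asm__("tanh.approx.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
  }
  return __builtin_tanhf(x); // accurate software expansion
}
```

The user then opts in explicitly, e.g. with -Xclang -fcuda-approx-tanhf, accepting responsibility for targeting a capable device.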

Another possible solution is exposing this built-in via a sycl::native:: function, without the need for a flag.
This approach presents a couple of issues. First, sycl::native::tanh is not present in the standard (this is a minor issue, since it could be added with an extension, I guess). Second, nvvm_reflect doesn't have a parameter for querying the PTX version (it has just one for querying the architecture). Automatically introducing this built-in without knowing the PTX version (and without the user's awareness) could lead to compilation failures (when c.c. >= 7.5 and PTX v. < 7.0) that are hard to understand, e.g.:

```
ptxas /tmp/tester-sm_75-143011.s, line 53; error   : Not a name of any known instruction: 'tanh'
ptxas /tmp/tester-sm_75-143011.s, line 97; error   : Not a name of any known instruction: 'tanh'
ptxas /tmp/tester-sm_75-143011.s, line 132; error   : Not a name of any known instruction: 'tanh'
ptxas /tmp/tester-sm_75-143011.s, line 170; error   : Not a name of any known instruction: 'tanh'
ptxas fatal   : Ptx assembly aborted due to errors
llvm-foreach:
clang-14: error: ptxas command failed with exit code 255 (use -v to see invocation)
clang version 14.0.0 (git@github.com:pgorlani/llvm.git 91bb219cf39832dee4b9ddaae36418ea6b9cf4a9)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/paologorlani/llvm/build/bin
clang-14: note: diagnostic msg: Error generating preprocessed source(s).
```

Hi @bader, what do you think about it?

@bader
Contributor

bader commented Jan 18, 2022

@pgorlani, sorry for the delay.

This implies that the user is aware of the built-in's availability (i.e., device compute capability >= 7.5 and PTX version >= 7.0) and uses the compiler flag accordingly.

I suggest adding a check to the compiler for correct usage of the flag.
Does it make sense to add support for -fsycl-approx-transcendentals (similar to -fcuda-approx-transcendentals), enabling all supported approximate functions?

Another possible solution is exposing this built-in via a sycl::native:: function, without the need for a flag.

This sounds like a useful feature to have, but considering the amount of work needed to enable it, I suggest we handle it separately. Let's create a feature request for sycl::native::tanh and see if there is any interest in it. What do you think?

@bader bader requested a review from Naghasan January 18, 2022 14:07
@bader
Contributor

bader commented Jan 18, 2022

Tagging @andykaylor for awareness.

@pgorlani
Contributor Author

Thanks for your answer, @bader.

Does it make sense to add support for -fsycl-approx-transcendentals (similar to -fcuda-approx-transcendentals), enabling all supported approximate functions?

This is a very good suggestion, and I think we should also apply this flag to the normal built-ins that fall into the fast-math category within libdevice (sin, cos, div, ...). Users are aware of their behavior through the --use_fast_math flag in nvcc. These functions introduce relatively small errors (just a few ULPs) compared to the tanhf built-in.

The tanhf built-in introduced in this patch is one of a kind. First, it introduces a large approximation error, up to 134 ULPs. Second, this built-in is not present in libdevice, and there are no flags in nvcc that are able to introduce it into the code.
For these reasons, I speculate that this built-in is only meant to be used as an inline PTX instruction for computing the activation function in neural networks. Therefore, at least for now, it should be kept separate from the other fast-math functions.
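
To illustrate the intended use case, here is a minimal (illustrative, not from this patch) SYCL kernel applying tanh as an activation function; with this patch, compiling it with -Xclang -fcuda-approx-tanhf for a capable target would lower the tanh call to the approximate instruction:

```cpp
#include <sycl/sycl.hpp>

// Illustrative only: apply tanh as an activation function over a
// device-accessible array (e.g., USM memory).
void activate(sycl::queue &q, float *data, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] = sycl::tanh(data[i]);
  }).wait();
}
```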

I suggest adding a check to the compiler for correct usage of the flag.

This would require a fairly complex ad hoc modification of the driver/CudaToolChain for this built-in. Currently, the compiler errors out when the installed CUDA toolkit does not support the PTX version of the specified architecture, not on specific instructions.

In order to simplify things, I introduced a check in libclc on the target architecture. If -fcuda-approx-tanhf is set and the architecture is >= sm_80, the built-in is used; if < sm_80, the plain (non-built-in) version is used. This avoids any kind of error.
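
Continuing the earlier sketch, that guard would look roughly like this (again illustrative; __CUDA_ARCH is the existing nvvm_reflect architecture query, which returns e.g. 800 for sm_80):

```cpp
extern "C" int __nvvm_reflect(const char *);

float my_tanhf(float x) {
  // Emit the approximate instruction only when the flag is set AND
  // the target architecture guarantees toolchain support; otherwise
  // use the plain (non-built-in) expansion.
  if (__nvvm_reflect("__CUDA_APPROX_TANHF") &&
      __nvvm_reflect("__CUDA_ARCH") >= 800) {
    float r;
    __asm__("tanh.approx.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
  }
  return __builtin_tanhf(x); // plain (non-built-in) version
}
```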

@bader
Contributor

bader commented Jan 26, 2022

Okay. Thank you for the clarification.

@pgorlani pgorlani marked this pull request as ready for review January 31, 2022 09:18
@pgorlani pgorlani requested review from bader and a team as code owners January 31, 2022 09:18
bader
bader previously approved these changes Jan 31, 2022
Contributor

@bader bader left a comment


LGTM.
Please, update the PR description as well.
It says:

This patch adds support for an approximate hyperbolic tangent single-precision built-in function, introduced in PTX 7.0, for devices with compute capability >= 7.5.

@Naghasan
Contributor

Naghasan commented Jan 31, 2022

I have a few concerns about the approach.

This is a very specific solution to a general problem; I don't think adding a flag just for tanh is the solution if other built-ins are also affected by this need.

The real concern I have is that this applies a global module state that will pollute other modules. This is potentially serious due to the mandatory LTO: it prevents you from compiling one kernel with approx on and another with approx off. I'm not sure what to suggest here, but the approach shouldn't prevent a future proper solution if merged.

@Naghasan
Contributor

I tried to look at an approach based on the target attribute, but it is not fit for purpose as it is (very dynamic).

pvchupin pushed a commit that referenced this pull request Feb 24, 2022
This patch adds a note to the Get Started Guide regarding the minimum CUDA toolkit version required to fully utilize Turing devices (sm_75).

CUDA toolkit version 11.0 introduces PTX 7.0. This version is the first to support the Ampere architecture (sm_80); however, some instructions introduced by PTX 7.0 (e.g., the approximate tanh (#5265) and ex2 for halfs) can also be executed by Turing devices (sm_75) if CUDA 11.0 or above is installed.

Compilation for Turing devices is also possible using CUDA 10.2 (the version currently reported as tested); however, if one of these PTX 7.0 instructions is used, it will generate an error.
@pgorlani pgorlani marked this pull request as draft March 7, 2022 10:56
@pgorlani
Contributor Author

pgorlani commented Mar 7, 2022

In #5747, we implemented an extension for defining native built-ins outside the SYCL specification, in order to achieve a more generic solution for this kind of problem.

For this reason, I converted this PR to a draft.
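
For reference, user code under the extension route might look roughly like this (a sketch; the namespace follows intel/llvm's experimental extension conventions, and the exact spelling used by #5747 may differ):

```cpp
#include <sycl/sycl.hpp>

// Sketch: an explicitly requested native tanh, chosen per call site
// instead of via a global compiler flag.
void activate(sycl::queue &q, float *data, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    data[i] = sycl::ext::oneapi::experimental::tanh(data[i]);
  }).wait();
}
```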

@github-actions github-actions bot added the Stale label Sep 4, 2022
@github-actions github-actions bot closed this Oct 5, 2022