# # Introduction
# *Adding new GPU intrinsics*
# In this tutorial we will expose some GPU intrinsics to allow directed rounding in the fused multiply-add
# (fma) floating-point operation.
# We start by identifying the intrinsic we want to expose; to do so, we read the PTX (Parallel Thread Execution)
# documentation at [PTX - Floating Point Instructions](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions).
# Table 32 there summarizes the floating-point instructions, and we can construct the intrinsic string from it.
# The FMA instruction for Float32 is listed as `{mad,fma}.rnd.f32`, where `rnd` can take the values `.rnd = { .rn, .rz, .rm, .rp }`:
# `rn` rounds to nearest (even), `rz` toward zero, `rm` toward minus infinity, and `rp` toward plus infinity.
# When building the intrinsic name for the call, we need to replace the type suffix `.f64` with `.d` and `.f32` with `.f`.
# Therefore, to call the round-toward-plus-infinity `fma` for `.f64`, we need the intrinsic `llvm.nvvm.fma.rp.d`.
# Note that this is only possible if LLVM supports the intrinsic; the intrinsics LLVM exposes
# can be found by searching the [LLVM repository](https://github.com/llvm/llvm-project). In other cases you would need `@asmcall` and inline PTX assembly.
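# To make the naming rule concrete, here is a tiny helper (ours, purely illustrative; not part of
# any API) that assembles the intrinsic string from a rounding suffix and an element type:
intrinsic_name(rnd, T) = "llvm.nvvm.fma.$(rnd)." * (T === Float64 ? "d" : "f")
intrinsic_name("rp", Float64)    # yields "llvm.nvvm.fma.rp.d"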
using CUDA
fma_rp(x::Float64, y::Float64, z::Float64) = ccall("llvm.nvvm.fma.rp.d", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)
fma_rp(x::Float32, y::Float32, z::Float32) = ccall("llvm.nvvm.fma.rp.f", llvmcall, Cfloat, (Cfloat, Cfloat, Cfloat), x, y, z)
fma(x::T, y::T, z::T, ::RoundingMode{:Up}) where {T <: Union{Float32, Float64}} = fma_rp(x, y, z)
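# The test kernel below exercises all four rounding modes, so we wrap the remaining ones the same
# way; a minimal Float64-only sketch, assuming the analogous `llvm.nvvm.fma.{rn,rz,rm}.d`
# intrinsics (the Float32 variants follow the same pattern with `.f` and `Cfloat`):
fma_rn(x::Float64, y::Float64, z::Float64) = ccall("llvm.nvvm.fma.rn.d", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)
fma_rz(x::Float64, y::Float64, z::Float64) = ccall("llvm.nvvm.fma.rz.d", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)
fma_rm(x::Float64, y::Float64, z::Float64) = ccall("llvm.nvvm.fma.rm.d", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)
fma(x::Float64, y::Float64, z::Float64, ::RoundingMode{:Nearest}) = fma_rn(x, y, z)
fma(x::Float64, y::Float64, z::Float64, ::RoundingMode{:ToZero}) = fma_rz(x, y, z)
fma(x::Float64, y::Float64, z::Float64, ::RoundingMode{:Down}) = fma_rm(x, y, z)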
# We inspect the generated PTX code:
CUDA.code_ptx(fma_rp, Tuple{Float64,Float64,Float64})
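# The output should contain the directed-rounding instruction itself, i.e. a line resembling
# `fma.rp.f64 %fd4, %fd1, %fd2, %fd3;` (register numbers here are illustrative and will vary).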
# The PTX code indeed contains the `fma.rp.f64` instruction; we now add this function
# to `src/device/intrinsics/math.jl`.
# We can now test the directed rounding modes in a kernel. Each thread adds a tiny
# offset `z` to the product `x * y` and stores the result under all four modes:
function test_fma!(out, x, y)
    I = threadIdx().x
    z = 2.0^(-(I + 53))           # offset below the precision of the result
    out[I]      = fma(x, y, z, RoundNearest)
    out[I + 4]  = fma(x, y, z, RoundToZero)
    out[I + 8]  = fma(x, y, z, RoundUp)
    out[I + 12] = fma(x, y, z, RoundDown)
    return
end
# The first four entries of the output are rounded to nearest, entries 5 to 8 are rounded toward zero,
# entries 9 to 12 toward plus infinity, and entries 13 to 16 toward minus infinity.
out_d = CuArray(zeros(16))
@cuda threads = 4 test_fma!(out_d, 1.0, 1.0)
out_h = Array(out_d)
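# A quick host-side sanity check (a sketch of the expected pattern): the exact result
# 1 + 2^-(I+53) lies strictly between 1.0 and nextfloat(1.0), so only the round-up entries
# are bumped to the next representable value, while every other mode returns exactly 1.0.
out_h[9:12] == fill(nextfloat(1.0), 4)
all(out_h[[1:8; 13:16]] .== 1.0)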
# We repeat the experiment with x = -1.0; flipping the sign separates the modes that coincided
# above, since rounding toward zero and toward minus infinity now differ.
out_d = CuArray(zeros(16))
@cuda threads = 4 test_fma!(out_d, -1.0, 1.0)
out_h = Array(out_d)
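# Expected pattern for the negative inputs: the exact result -(1 - 2^-(I+53)) lies between
# -1.0 and nextfloat(-1.0), so rounding toward zero (entries 5:8) and toward plus infinity
# (entries 9:12) both land on nextfloat(-1.0), while round-to-nearest and round-down keep -1.0.
out_h[5:12] == fill(nextfloat(-1.0), 8)
out_h[[1:4; 13:16]] == fill(-1.0, 8)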