This lab is intended to be executed inside an nvidia-docker container using the hopobcn/gtc2018-fwi:us image.
In this lab you will make modifications to a real-world oil & gas mini-application called FWI.
You will compile and execute the newly enhanced code in each step.
Along the way, solutions are provided so you can check your work or take a peek if you get lost.
After logging into the machine you should see a folder called FWI
containing each step and its solution as a different branch:
```
~# git branch
* gtc2018-step1
  gtc2018-step1-sol
  gtc2018-step3
  gtc2018-step3-sol
  gtc2018-step4
  gtc2018-step4-sol
  gtc2018-step5
  gtc2018-step5-sol
  ompss-openacc
```
Use `git diff <branch-name>` to compare branches, and `git stash && git checkout <branch-name>` to discard non-committed changes and switch to another step (branch).
The lab machine is a dual-socket Intel(R) Xeon(R) E5-2630 v3 (Haswell) @ 2.40 GHz with 8 cores per socket (16 cores total, HT off) and an NVIDIA Tesla K80.
To save time, this step has been moved to Appendix 0.
To compare against our GPU executions, we will use the serial & OpenMP executions as baselines:
- Sequential execution with 1 core:

  ```
  Number of frequencies 1
  Number of shots 1
  Number of gradient iterations 1
  Number of test iterations 1
  Output directory path: results
  FWI Program finished in 1059.629833 seconds
  ```
- OpenMP execution with 16 threads (1 per core):

  ```
  Number of frequencies 1
  Number of shots 1
  Number of gradient iterations 1
  Number of test iterations 1
  Output directory path: results
  FWI Program finished in 120.587904 seconds
  ```
In this initial step, you have to express the parallelism available in the expensive loops of the application. We assume you already have some OpenACC knowledge.
- `git checkout gtc2018-step1`
- Add `#pragma acc kernels` pragmas to the `compute_component_vcell_TL:166` and `compute_component_scell_TR:624` functions (in the `src/fwi_propagator.c` file)
- Compile `fwi`:

  ```
  mkdir build && cd build
  cmake -DCMAKE_C_COMPILER=pgcc -DUSE_OPENACC=ON ..
  ```
- Inspect the compiler output. Try to identify whether the kernels were successfully generated or whether you see messages like `Loop carried dependence due to exposed use of [array-name] prevents parallelization`.
- Fix the previous issue by adding `#pragma acc loop independent` to each loop nest (see the sketch after this list)
- Compile and run `fwi`:

  ```
  $ make irun
  ```
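To give a feel for the end result of this step, here is a minimal sketch of the pattern under placeholder names (the loop body, array names, and `IDX` helper are illustrative only, not FWI's actual vcell/scell code):

```c
/* Sketch only: `#pragma acc kernels` plus `loop independent` on a triple
 * loop nest, similar in shape to the vcell/scell functions. All names
 * (IDX, v, s, rho, dt, dimz/dimx/dimy) are placeholders. */
static inline int IDX(int z, int x, int y, int dimx, int dimy)
{
    return (z * dimx + x) * dimy + y;
}

void compute_sketch(float *v, const float *s, const float *rho,
                    float dt, int dimz, int dimx, int dimy)
{
    #pragma acc kernels
    #pragma acc loop independent
    for (int z = 0; z < dimz; z++)
    {
        #pragma acc loop independent
        for (int x = 0; x < dimx; x++)
        {
            #pragma acc loop independent
            for (int y = 0; y < dimy; y++)
            {
                const int idx = IDX(z, x, y, dimx, dimy);
                /* each iteration writes a distinct element, which is why the
                 * loops can safely be declared independent */
                v[idx] += dt * rho[idx] * s[idx];
            }
        }
    }
}
```

Without the `loop independent` clauses, the compiler cannot prove that iterations write disjoint elements through the pointers, which is what produces the loop-carried dependence message mentioned above.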
Step 2 will profile the application to find possible weaknesses and optimization opportunities.
We could use the NVIDIA Visual Profiler for a graphical assessment, or `pgprof`/`nvprof` for a command-line view.
For simplicity, in this lab we are going to use `nvprof`.
- `git checkout gtc2018-step2`
- Profile `fwi` with `nvprof --dependency-analysis bin/fwi fwi_schedule.txt` or with NVVP
- Observe:
  - Which kernels take the most time in the critical path
  - Whether memory traffic due to CUDA Unified Memory is an issue
Unified Memory (UM) can be very inefficient on older GPU generations, and an experienced programmer with detailed knowledge of the application's algorithm can often outperform it.
As we saw in Step 2, FWI does not particularly suffer from excessive UM traffic.
But as an exercise, in this step we will migrate from UM to explicit data movement using OpenACC directives.
- `git checkout gtc2018-step3`
- Annotate, using OpenACC directives, which arrays will be used on the GPU. In the `alloc_memory_shot` function (`src/fwi_kernel.c`), after all allocations, complete the OpenACC section:

  ```c
  coeff_t cc = *c;
  #pragma acc enter data create(cc)
  #pragma acc enter data create(cc.c11[:cellsInVolume])
  #pragma acc enter data create(cc.c12[:cellsInVolume])
  #pragma acc enter data create(/* COMPLETEME */)
  #pragma acc enter data create(/* COMPLETEME */)
  ...
  // continue with c13,c14,c15,c16, c22,c23,c24,c25,c26, c33,c34,c35,c36, c44,c45,c46, c55,c56, c66
  ```
- In the `free_memory_shot` function, before all deallocations (`free`), we should first deallocate the GPU memory with:

  ```c
  #pragma acc wait
  #pragma acc exit data delete(c->c11)
  #pragma acc exit data delete(c->c12)
  #pragma acc exit data delete(/* COMPLETEME */)
  #pragma acc exit data delete(/* COMPLETEME */)
  ...
  #pragma acc exit data delete(c)
  ...
  ```
- In the `compute_component_vcell_TL:166` and `compute_component_scell_TR:625` functions, specify that all the GPU arrays are already on the GPU with the `present` clause (see the sketch after this list)
- Compile and run `fwi`:

  ```
  $ make irun
  ```
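For context, here is a minimal, self-contained sketch of the create / present / delete lifecycle this step asks for. The names (`rho`, `cellsInVolume`) are placeholders; this is not FWI code:

```c
/* Sketch only, not FWI code: explicit data management with OpenACC.
 * 1) create device copies, 2) assert presence in compute regions,
 * 3) delete device copies before freeing host memory. */
#include <stdlib.h>

int main(void)
{
    const int cellsInVolume = 1000;
    float *rho = malloc(cellsInVolume * sizeof(float));

    /* 1) allocate the device copy (no host<->device transfer) */
    #pragma acc enter data create(rho[:cellsInVolume])

    /* 2) compute regions assert the data is already resident with `present` */
    #pragma acc kernels present(rho[:cellsInVolume])
    #pragma acc loop independent
    for (int i = 0; i < cellsInVolume; i++)
        rho[i] = 1.0f;

    /* 3) release the device copy before freeing the host allocation */
    #pragma acc exit data delete(rho[:cellsInVolume])
    free(rho);
    return 0;
}
```

With `present`, the compiler generates no implicit copies for the region and aborts at runtime if the data is not already on the device, which makes missing `enter data` annotations easy to spot.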
OpenACC follows the same semantics as CUDA regarding streams. By default, kernels and memory copies are executed in the default stream, which imposes serialization between kernels and memory transfers.
Since we know that the scell functions don't have dependences between each other (and the same holds for the vcell functions), we can launch all the scell/vcell kernels on different streams.
OpenACC provides the `async` clause, which takes a nonnegative scalar integer as a parameter.
In `include/fwi/fwi_propagator.h::78` we have already prepared an enum that we are going to use as stream identifiers:

```c
typedef enum {TR, TL, BR, BL} phase_t;
```
- `git checkout gtc2018-step4`
- Add the `async` clause to all OpenACC kernels (`src/fwi_propagator.c`). All scell & vcell functions provide a parameter named `phase` which can be passed to the OpenACC clause: `async(phase)`
- In `velocity_propagator` and `stress_propagator`, pass as the last parameter the stream identifier from the `phase_t` enum (`TR`, `TL`, `BR` or `BL`); see the sketch after this list
- Compile and run `fwi`:

  ```
  $ make irun
  ```
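As a rough illustration of how `async(phase)` maps independent kernels onto different queues, here is a sketch with placeholder function and array names (not FWI's actual propagator code):

```c
/* Sketch only: kernels launched on the queue selected by `phase`
 * (TR, TL, BR or BL) can overlap with kernels on other queues.
 * The arrays are assumed to already be on the device (enter data create). */
typedef enum {TR, TL, BR, BL} phase_t;   /* as declared in include/fwi/fwi_propagator.h */

void update_sketch(float *v, const float *s, int n, phase_t phase)
{
    #pragma acc kernels present(v[:n], s[:n]) async(phase)
    #pragma acc loop independent
    for (int i = 0; i < n; i++)
        v[i] += s[i];
}

void caller_sketch(float *v_tr, float *v_tl, const float *s, int n)
{
    /* independent updates are enqueued on different queues and may overlap */
    update_sketch(v_tr, s, n, TR);
    update_sketch(v_tl, s, n, TL);

    /* block until all asynchronous queues have finished */
    #pragma acc wait
}
```

The `#pragma acc wait` restores synchronization before any dependent work consumes the results.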
FWI is a great candidate to take advantage of shuffle intrinsics and shared memory.
OpenACC provides the `cache` directive to exploit shared memory, but it lacks a way to exploit CUDA intra-warp intrinsics.
In this step we provide a set of highly optimized kernels in the `src/fwi_propagator.cu` file.
Your task consists of adding the necessary glue code to obtain:
- The GPU device pointers managed by the OpenACC runtime
- The CUDA stream allocated by OpenACC
- `git checkout gtc2018-step5`
- Add `#pragma acc host_data use_device` directives to forward the device pointers allocated by OpenACC to our CUDA kernels (a sketch of this glue code follows the list)
- Pass the current stream to the CUDA kernel (with `acc_get_cuda_stream`)
- Compile and run `fwi`:

  ```
  $ make irun
  ```
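To illustrate the shape of that glue code, here is a hedged sketch: `launch_vcell_cuda` and the array names are hypothetical stand-ins, not the real kernels in `src/fwi_propagator.cu`.

```c
/* Sketch only: forwards OpenACC-managed device pointers and the CUDA
 * stream backing an OpenACC queue to a hand-written CUDA kernel launcher.
 * `launch_vcell_cuda` is a hypothetical wrapper implemented in a .cu file. */
#include <openacc.h>

extern void launch_vcell_cuda(float *v, const float *s, int n, void *cuda_stream);

void compute_with_cuda_sketch(float *v, const float *s, int n, int phase)
{
    /* obtain the CUDA stream that backs the OpenACC queue `phase` */
    void *stream = acc_get_cuda_stream(phase);

    /* inside this region, `v` and `s` resolve to their device addresses */
    #pragma acc host_data use_device(v, s)
    {
        launch_vcell_cuda(v, s, n, stream);
    }
}
```

Passing the stream obtained from `acc_get_cuda_stream` keeps the hand-written CUDA kernels on the same queues as the surrounding OpenACC work, so the asynchronous scheme from Step 4 is preserved.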