Description
I did some of my own hunting, but I’m sure it’s incomplete. It would be very useful to add this to your OpenACC/OpenMP docs @prathi-wind, even if incomplete.
Cray+OpenACC
(I think this is supposed to work with OMP as well)
Links: https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openacc.7.html#environment-variables
CRAY_ACC_DEBUG: 0 (off), 1, 2, 3 (very noisy). Dumps a time-stamped log line ("ACC: …") for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU."
Set via export CRAY_ACC_DEBUG=3 before running.
I also found this Cray one, which we haven’t used before but seems potentially useful:
CRAY_ACC_FORCE_EARLY_INIT=1
Forces full GPU initialisation at program start, so start-up hangs show up immediately.
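One way to make a start-up hang obvious is to pair the variable with a wall-clock limit. This is only a sketch: it assumes GNU coreutils `timeout` is available, `run_with_early_init` is a hypothetical helper name, and the limit covers the whole run rather than just initialisation, so it is best pointed at a short test case.

```shell
# Hypothetical helper: run a command with early GPU initialisation forced,
# and give up after a wall-clock limit given in seconds.
run_with_early_init() {
    local limit="$1"
    shift
    local rc=0
    # The variable is set for this one command only.
    CRAY_ACC_FORCE_EARLY_INIT=1 timeout "$limit" "$@" || rc=$?
    # GNU timeout exits with 124 when it had to kill the command.
    if [ "$rc" -eq 124 ]; then
        echo "no exit after ${limit}s: possible hang" >&2
    fi
    return "$rc"
}
```

Usage would look like `run_with_early_init 60 ./my_app`, where `./my_app` stands in for your own binary.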
If there is a problem with data movement, apparently, this helps:
export CRAY_ACC_DEBUG=3
export CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1
* Sprinkle acc_present_dump() or omp_get_mapped_ptr() around hotspots.
CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1 makes acc_present_dump() show names and source lines, which is priceless for "present but not really" bugs.
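Since level 3 is very noisy, it can help to keep the whole log in a file and search it afterwards. A minimal sketch: `run_present_debug` is a hypothetical helper name, and it captures both stdout and stderr because this note does not say which stream the runtime writes to.

```shell
# Hypothetical helper: run a command with the two Cray debug variables set
# and collect everything it prints into a log file.
run_present_debug() {
    local log="$1"
    shift
    local rc=0
    # Both variables apply to this one command only.
    CRAY_ACC_DEBUG=3 CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1 "$@" >"$log" 2>&1 || rc=$?
    echo "captured $(wc -l <"$log" | tr -d ' ') lines in $log"
    # Hand back the command's own exit code.
    return "$rc"
}
```

For example, `run_present_debug acc.log ./my_app` followed by `grep my_array acc.log`, where `./my_app` and `my_array` are placeholders for your binary and the variable you are chasing.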
Vendor/compiler agnostic OpenMP:
export OMP_TARGET_OFFLOAD=MANDATORY   # or DISABLED, or DEFAULT
Quick way to turn off offload (DISABLED) or make it abort if a GPU isn't found (MANDATORY). Great first test: does the problem disappear when you drop back to the CPU?
Cray+OMP: https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openmp.7.html#environment-variables
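The "does it disappear on the CPU?" test can be scripted as a back-to-back comparison. This is a sketch under assumptions: `compare_offload` is a hypothetical helper name, and comparing the output byte for byte only makes sense if the program prints deterministic results (no timings, no run-to-run floating-point noise).

```shell
# Hypothetical helper: run the same command with offload disabled and with
# offload mandatory, then compare exit codes and output.
compare_offload() {
    local cpu_out gpu_out
    local cpu_rc=0
    local gpu_rc=0
    local verdict=0
    cpu_out=$(mktemp) || return 2
    gpu_out=$(mktemp) || return 2
    OMP_TARGET_OFFLOAD=DISABLED "$@" >"$cpu_out" 2>&1 || cpu_rc=$?
    OMP_TARGET_OFFLOAD=MANDATORY "$@" >"$gpu_out" 2>&1 || gpu_rc=$?
    echo "DISABLED  exit code: $cpu_rc"
    echo "MANDATORY exit code: $gpu_rc"
    if [ "$cpu_rc" -ne "$gpu_rc" ] || ! cmp -s "$cpu_out" "$gpu_out"; then
        echo "runs differ: the offload path is a likely suspect"
        verdict=1
    else
        echo "runs match"
    fi
    rm -f "$cpu_out" "$gpu_out"
    return "$verdict"
}
```

Call it as `compare_offload ./my_app input.dat`, with your own binary and arguments in place of those placeholders. A return value of 1 means the two runs disagreed.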
NVIDIA compilers:
ACC only I think (links: https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#environment-variables)
NVCOMPILER_ACC_NOTIFY
Sends a one-liner to stderr every time something interesting happens.
Bit-mask:
* 1 = kernel launch
* 2 = data copies
* 4 = region entry/exit
* 8 = wait/sync
* 16 = malloc/free
1 (kernels only) is the usual first step. 3 (kernels + copies) is great for "why is it so slow?"
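Because the value is a bit-mask, the numbers simply OR together. A small sketch that builds the mask from names; `acc_notify_mask` and the event-class names are made up for illustration, and only the numeric values come from the list above.

```shell
# Hypothetical helper: turn event-class names into an NVCOMPILER_ACC_NOTIFY mask.
acc_notify_mask() {
    local mask=0
    local name
    for name in "$@"; do
        case "$name" in
            kernels) mask=$((mask | 1)) ;;
            copies)  mask=$((mask | 2)) ;;
            regions) mask=$((mask | 4)) ;;
            waits)   mask=$((mask | 8)) ;;
            allocs)  mask=$((mask | 16)) ;;
            *) echo "unknown event class: $name" >&2; return 2 ;;
        esac
    done
    echo "$mask"
}

acc_notify_mask kernels copies   # prints 3
```

It could then be used as `NVCOMPILER_ACC_NOTIFY=$(acc_notify_mask kernels copies) ./my_app`, where `./my_app` is a placeholder for your binary.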
NVCOMPILER_ACC_TIME
Lightweight profiler; prints a tidy end-of-run table with per-region and per-kernel times and bytes moved.
Set to any non-zero value (most folks just use 1). Don't run CUDA profilers at the same time.
NVCOMPILER_ACC_DEBUG=1
Spews everything the runtime sees: host/device addresses, mapping events, present-table look-ups, etc. Great for "partially present" or "pointer went missing" errors.
NVIDIA+OMP: (links: https://openmp.llvm.org/design/Runtimes.html)
I think these might work with Cray + OMP, but I’m not sure, since these variables belong to the underlying LLVM OpenMP offload runtime.
You can also apparently profile OMP at runtime without invoking a proper profiler, similar to NVCOMPILER_ACC_TIME. This might work with Cray as well.
export LIBOMPTARGET_PROFILE=run.json
# then inspect the output json file via Chrome.
which has this detail:
Emits a Chrome-trace (JSON) timeline you can open in chrome://tracing or Speedscope; a great lightweight profiler when Nsight is overkill. Granularity is set in µs via LIBOMPTARGET_PROFILE_GRANULARITY (default 500).
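A sketch of a wrapper that sets both profiling variables and checks that a trace actually appeared, since a runtime built without profiling support (or a non-LLVM runtime) would silently write nothing. `profile_offload` is a hypothetical helper name, and the granularity of 100 µs is an arbitrary choice for illustration.

```shell
# Hypothetical helper: run a command with libomptarget profiling enabled
# and report whether a trace file was produced.
profile_offload() {
    local trace="$1"
    shift
    local rc=0
    LIBOMPTARGET_PROFILE="$trace" LIBOMPTARGET_PROFILE_GRANULARITY=100 "$@" || rc=$?
    if [ -s "$trace" ]; then
        echo "trace written to $trace"
    else
        echo "no trace at $trace: the runtime may not support LIBOMPTARGET_PROFILE" >&2
        # Report failure even if the command itself succeeded.
        [ "$rc" -ne 0 ] || rc=1
    fi
    return "$rc"
}
```

For example, `profile_offload run.json ./my_app`, with `./my_app` standing in for your binary.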
There is also
LIBOMPTARGET_INFO
Bit-mask, e.g. 1 (= print kernel args), 0x10 (= plugin info), -1 (= everything).
Human-readable log of data-mapping inserts/updates, kernel launches, copies, waits. Perfect first stop for "why is nothing copied?"
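The hex flags can be combined with shell arithmetic rather than added up by hand. A minimal sketch using only the two flags named above (the runtime defines more bits than these); the variable names and `run_with_info` are hypothetical.

```shell
# Flag values from the note above; names are made up for readability.
INFO_KERNEL_ARGS=$((0x01))
INFO_PLUGIN=$((0x10))

# OR the flags together into one mask.
info_mask=$((INFO_KERNEL_ARGS | INFO_PLUGIN))
printf 'LIBOMPTARGET_INFO=%d (0x%x)\n' "$info_mask" "$info_mask"
# prints: LIBOMPTARGET_INFO=17 (0x11)

# Hypothetical helper: run a command with that mask set for it only.
run_with_info() {
    LIBOMPTARGET_INFO="$info_mask" "$@"
}
```

Usage would be `run_with_info ./my_app`, where `./my_app` is a placeholder for your binary.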
LIBOMPTARGET_DEBUG=1
Developer-level trace (host-side). Much noisier than INFO; only works if the runtime was built with -DOMPTARGET_DEBUG.