amrex::Gpu::htod_memcpy and making ExampleCodes/GPU/CNS portable to MPI+OpenMP #2929
-
Hi,
and then whenever the device memory versions were used, I made a patch like
so the device versions are not used. That seems to work. Is there a tidier way to make this happen automatically? Is there a different AMReX htod_memcpy that still exists and does the same thing when GPU support is not used, so that changes like the second one are not needed? I also made a change to
which seems to result in threads getting work. I wonder if that's the best one can do, however?
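Roughly, the kind of guard I mean looks like the sketch below (simplified, not the exact patch; `h_parm` and `d_parm` stand in for the actual host/device problem-parameter pointers):

```cpp
// Illustrative only: guard the device copy so CPU-only (MPI+OpenMP) builds
// never touch device memory. h_parm and d_parm are placeholder pointers.
#ifdef AMREX_USE_GPU
    amrex::Gpu::htod_memcpy(d_parm, h_parm, sizeof(*h_parm)); // GPU build: copy to device as before
    auto const* parm = d_parm;   // kernels read the device copy
#else
    auto const* parm = h_parm;   // CPU-only build: just use the host copy directly
#endif
```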
-
Hi, I've written a code very similar to the CNS code for astrophysics here: https://github.com/BenWibking/quokka. It works on both GPU and CPU with very good performance on both. There is also Castro, which (usually) uses a predictor-corrector integrator rather than RK2 for timestepping: https://github.com/AMReX-Astro/Castro/issues

I tried experimenting with adding OpenMP, but it always ran slower than flat MPI. The Athena++ developers found the same thing for their code as well (see https://github.com/PrincetonUniversity/athena/wiki/Using-MPI-and-OpenMP#note-on-performance).

If you are set on adding OpenMP threading, you will need to modify the MFIter loops to add tiling as described in the AMReX documentation: https://amrex-codes.github.io/amrex/docs_html/Basics.html#mfiter-with-tiling
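For reference, the tiled MFIter pattern that section describes looks roughly like this (a sketch only; `mf` is assumed to be an existing MultiFab and the loop body is placeholder work):

```cpp
#include <AMReX_MultiFab.H>

// Tiling is enabled for CPU builds; for GPU builds TilingIfNotGPU() returns
// false and the OpenMP parallel region is disabled via Gpu::notInLaunchRegion().
#ifdef AMREX_USE_OMP
#pragma omp parallel if (amrex::Gpu::notInLaunchRegion())
#endif
for (amrex::MFIter mfi(mf, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    const amrex::Box& bx = mfi.tilebox();   // the tile, not the whole valid box
    auto const& a = mf.array(mfi);
    amrex::ParallelFor(bx,
    [=] AMREX_GPU_DEVICE (int i, int j, int k)
    {
        a(i,j,k) += 1.0;                    // placeholder work on this tile
    });
}
```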
-
You could do `amrex::Gpu::copy(amrex::Gpu::hostToDevice, &CNS::h_prob_parm, &CNS::h_prob_parm+1, &CNS::d_prob_parm)`. That will call `std::memcpy` for CPU builds.

For the MFIter loops, if you want OpenMP, I agree with @BenWibking that you should try tiling. Note that you do have to make some changes in your CPU code for tiling. You can also change where the temporary Fabs are defined to make memory allocation more efficient by reusing the memory. So it will be something like the sketch below.
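(Rough shape only, for the OpenMP/CPU case; `S_new`, `qtmp`, `num_grow`, and `ncomp` are placeholder names rather than the exact ones in CNS.)

```cpp
#ifdef AMREX_USE_OMP
#pragma omp parallel if (amrex::Gpu::notInLaunchRegion())
#endif
{
    // One temporary FAB per OpenMP thread, declared outside the MFIter loop
    // so its memory is reused from tile to tile instead of being reallocated.
    amrex::FArrayBox qtmp;
    for (amrex::MFIter mfi(S_new, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
        const amrex::Box& bx = mfi.tilebox();
        qtmp.resize(amrex::grow(bx, num_grow), ncomp); // reuses its allocation when large enough
        qtmp.setVal<amrex::RunOn::Host>(0.0);          // placeholder: fill/use qtmp for this tile
        // ... compute on qtmp and update S_new[mfi] over bx ...
    }
}
```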