amrex::Gpu::htod_memcpy and making ExampleCodes/GPU/CNS portable to MPI+OpenMP #2929
-
Hi,
and then whenever the device memory versions were used, I made a patch like
so the device versions are not used. That seems to work. Is there a tidier way to make this happen automatically? Is there a different AMReX htod_memcpy that still exists and does the same thing when GPU support is not used, so that changes like the second one are not needed? I also made a change to
which seems to result in threads getting work. I wonder if that's the best one can do, however?
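Roughly, the kind of guard I mean looks like the sketch below (simplified, not the exact patch; `h_parm` and `d_parm` stand in for the actual host/device problem-parameter pointers):

```cpp
// Illustrative only: guard the device copy so CPU-only (MPI+OpenMP) builds
// never touch device memory. h_parm and d_parm are placeholder pointers.
#ifdef AMREX_USE_GPU
    amrex::Gpu::htod_memcpy(d_parm, h_parm, sizeof(*h_parm)); // GPU build: copy to device as before
    auto const* parm = d_parm;   // kernels read the device copy
#else
    auto const* parm = h_parm;   // CPU-only build: just use the host copy directly
#endif
```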
-
Hi, I've written a code very similar to the CNS code for astrophysics here: https://github.com/BenWibking/quokka. It works on both GPU and CPU with very good performance on both. There is also Castro, which (usually) uses a predictor-corrector integrator rather than RK2 for timestepping: https://github.com/AMReX-Astro/Castro/issues

I tried experimenting with adding OpenMP, but it always ran slower than flat MPI. The Athena++ developers found the same thing for their code as well (see https://github.com/PrincetonUniversity/athena/wiki/Using-MPI-and-OpenMP#note-on-performance).

If you are set on adding OpenMP threading, you will need to modify the MFIter loops to add tiling as described in the AMReX documentation: https://amrex-codes.github.io/amrex/docs_html/Basics.html#mfiter-with-tiling
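For reference, the tiled MFIter pattern that section describes looks roughly like this (a sketch only; `mf` is assumed to be an existing MultiFab and the loop body is placeholder work):

```cpp
#include <AMReX_MultiFab.H>

// Tiling is enabled for CPU builds; for GPU builds TilingIfNotGPU() returns
// false and the OpenMP parallel region is disabled via Gpu::notInLaunchRegion().
#ifdef AMREX_USE_OMP
#pragma omp parallel if (amrex::Gpu::notInLaunchRegion())
#endif
for (amrex::MFIter mfi(mf, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    const amrex::Box& bx = mfi.tilebox();   // the tile, not the whole valid box
    auto const& a = mf.array(mfi);
    amrex::ParallelFor(bx,
    [=] AMREX_GPU_DEVICE (int i, int j, int k)
    {
        a(i,j,k) += 1.0;                    // placeholder work on this tile
    });
}
```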
-
You could do `amrex::Gpu::copy(amrex::Gpu::hostToDevice, &CNS::h_prob_parm, &CNS::h_prob_parm+1, &CNS::d_prob_parm)`. That will call `std::memcpy` for CPU builds.

For the MFIter loops, if you want OpenMP, I agree with @BenWibking that you should try tiling. Note that you do have to make some changes in your CPU code for tiling. You can also change where the temporary Fabs are defined to make memory allocation more efficient by reusing the memory. So it will be something like the sketch below.
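(Rough shape only, for the OpenMP/CPU case; `S_new`, `qtmp`, `num_grow`, and `ncomp` are placeholder names rather than the exact ones in CNS.)

```cpp
#ifdef AMREX_USE_OMP
#pragma omp parallel if (amrex::Gpu::notInLaunchRegion())
#endif
{
    // One temporary FAB per OpenMP thread, declared outside the MFIter loop
    // so its memory is reused from tile to tile instead of being reallocated.
    amrex::FArrayBox qtmp;
    for (amrex::MFIter mfi(S_new, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
        const amrex::Box& bx = mfi.tilebox();
        qtmp.resize(amrex::grow(bx, num_grow), ncomp); // reuses its allocation when large enough
        qtmp.setVal<amrex::RunOn::Host>(0.0);          // placeholder: fill/use qtmp for this tile
        // ... compute on qtmp and update S_new[mfi] over bx ...
    }
}
```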