
Conversation


@ax3l ax3l commented Jul 23, 2025

Add first-class support for zero-copy data exchange with ROCm and SYCL GPUs via DLPack interfaces.

Specs:

Note: we might want to implement a slightly older DLPack version if we do not want to bump up NumPy/CuPy/PyTorch/... to very recent versions. Do we have access to the 2025 Intel Python tools release on Aurora?
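For context, DLPack version negotiation happens through the `max_version` keyword of `__dlpack__`, as specified by the Python Array API standard: newer consumers pass `max_version=(major, minor)` and the producer may fall back to the legacy, unversioned capsule for older consumers. A minimal sketch with a hypothetical producer class (illustrative only, not pyAMReX code):

```python
class FakeProducer:
    """Hypothetical producer; pyAMReX's real classes differ."""

    def __dlpack__(self, *, stream=None, max_version=None, dl_device=None, copy=None):
        # DLPack >= 1.0 consumers pass max_version=(major, minor); legacy
        # consumers pass nothing and expect the unversioned "dltensor" capsule.
        if max_version is None or max_version[0] < 1:
            return "dltensor"            # stand-in for the legacy capsule
        return "dltensor_versioned"      # stand-in for the versioned capsule

a = FakeProducer()
assert a.__dlpack__() == "dltensor"                         # legacy consumer
assert a.__dlpack__(max_version=(1, 0)) == "dltensor_versioned"
```

Targeting an older spec version would mean taking the `max_version is None` branch unconditionally, which is what keeps older NumPy/CuPy/PyTorch releases working.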

Closes #9

Action Items

  • start by vibing while preparing dinner, then manually:
  • review and finish Array4
  • PODVector
  • Vector
  • ArrayOfStructs
  • BaseFab
  • SmallMatrix
  • SYCL: Implement .to_dpnp / .to_dpctl helper functions
  • Update .to_xp functions to use .to_dpnp or .to_dpctl for SYCL GPUs
  • Test on CUDA GPU
  • Test on ROCm GPU
  • Test on SYCL GPU (help wanted)
  • Search docs for needed updates.
  • Fix DLPack stubs: Fix PyCapsule for DLPack sizmailov/pybind11-stubgen#258, or bind manually in pyAMReX


roelof-groenewald commented Jul 25, 2025

I performed some testing of the new functionality on Perlmutter. After the latest commit, the following appears to work as intended:

def test_mfab_cuda_cupy(mfab_device):
    import cupy as cp

    # AMReX -> cupy
    for mfi in mfab_device:   
        marr_cupy_from_dlpack = cp.from_dlpack(mfab_device.array(mfi))
        marr_cupy_from_dlpack[0, 1, 3, 2] = 5

    for mfi in mfab_device:   
        marr_cupy_from_dlpack = cp.from_dlpack(mfab_device.array(mfi))
        print(marr_cupy_from_dlpack[0, 1, 3, 2])

It executes without failure and prints the modified value 5. Inspection of the DLDevice showed that the device was correctly identified as kDLCUDA, and the returned device id was 3, which is consistent with Perlmutter's standard rank-to-GPU mapping when running with a single MPI rank.
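For reference, the device inspection described above goes through the standard `__dlpack_device__` protocol, which returns a `(device_type, device_id)` tuple. A minimal sketch with a mock class (hypothetical, standing in for an `Array4` on CUDA device 3):

```python
# Minimal sketch of the __dlpack_device__ protocol used for the inspection
# above. DLDeviceType values from the DLPack spec: kDLCPU = 1, kDLCUDA = 2.
kDLCUDA = 2

class FakeCudaArray:
    """Hypothetical stand-in for an array living on CUDA device 3."""

    def __dlpack_device__(self):
        return (kDLCUDA, 3)  # (device_type, device_id)

dev_type, dev_id = FakeCudaArray().__dlpack_device__()
assert dev_type == kDLCUDA
assert dev_id == 3
```

Consumers such as `cp.from_dlpack` call `__dlpack_device__` first to decide which allocator and stream semantics apply before requesting the capsule itself.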


ax3l commented Jul 25, 2025

Awesome, then we are nearly there.

Try the dpnp logic for SYCL next?


roelof-groenewald commented Jul 26, 2025

I tested the DLPack functionality on Aurora (SYCL) and it now also produces the expected result. I also modified Array4_to_xp to take the GPU backend into account. We can now successfully access a MultiFab's Array4 from a SYCL device with

for mfi in mfab_device:
    mfab_device.array(mfi).to_dpnp()
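A hypothetical sketch of what backend-aware dispatch in a `to_xp`-style helper could look like (the function and parameter names here are illustrative, not pyAMReX's exact code):

```python
# Hypothetical sketch of backend dispatch for a to_xp-style helper:
# pick the array library that can wrap the device pointer zero-copy.
def to_xp(array4, backend):
    if backend in ("CUDA", "HIP"):
        import cupy as cp
        return cp.from_dlpack(array4)    # zero-copy on CUDA/ROCm devices
    if backend == "SYCL":
        import dpnp
        return dpnp.from_dlpack(array4)  # zero-copy on SYCL devices
    import numpy as np
    return np.from_dlpack(array4)        # CPU fallback
```

The imports are deferred so that, e.g., a CPU-only build never needs cupy or dpnp installed.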

@roelof-groenewald

I compiled WarpX on Aurora using this pyamrex branch. With it I was able to successfully run a multi-GPU simulation that uses fields.py to read MultiFab values 🎉 🚀

Comment on lines 235 to 240
/* TODO: Handle keyword arguments
[[maybe_unused]] py::handle stream,
[[maybe_unused]] std::tuple<int, int> max_version,
[[maybe_unused]] std::tuple<DLDeviceType, int32_t> dl_device,
[[maybe_unused]] bool copy
*/

Just want to flag this since copy=True doesn't yet work in the .to_dpnp() function.
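For context, the Array API standard defines `copy=True` as "always copy" and `copy=None` (the default) as producer's choice; until the `copy` keyword is handled, consumers get the default zero-copy view. A small NumPy-on-CPU sketch of that zero-copy behavior (illustrative only):

```python
import numpy as np

# With the copy keyword unhandled, from_dlpack returns a view that
# shares memory with the source array rather than an independent copy.
x = np.arange(3.0)
y = np.from_dlpack(x)   # zero-copy: y aliases x's buffer
x[0] = 42.0
assert y[0] == 42.0     # the write through x is visible in y: no copy was made
```

An honored `copy=True` would instead hand the consumer an independent buffer, so the assertion above would fail.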


ax3l commented Jul 28, 2025

We need to rebase against development after #455 was merged. I already added the DLDeviceType bindings now and the other PR adds capsule type hints.

Signed-off-by: Axel Huebl <[email protected]>
@ax3l ax3l mentioned this pull request Aug 1, 2025
Labels

  • backend: cuda (Specific to CUDA execution on GPUs)
  • backend: hip (Specific to ROCm execution on GPUs)
  • backend: sycl (Specific to DPC++/SYCL execution on CPUs/GPUs)
Development

Successfully merging this pull request may close these issues:

  • Discussion on mapping between amrex, numpy.ndarray, and torch.tensor data types