
[DNM] ENH: CuPy multi-device support #293

Draft
wants to merge 9 commits into base: main

Conversation

crusaderky
Contributor

@crusaderky commented Mar 31, 2025

@crusaderky
Contributor Author

@tylerjereddy as you have a dual-GPU box, could you help test this?

@crusaderky marked this pull request as ready for review March 31, 2025 11:29
@Copilot Copilot bot review requested due to automatic review settings March 31, 2025 11:29
@crusaderky
Contributor Author

CI failure is unrelated

@ev-br
Member

ev-br commented Mar 31, 2025

cc @lucyleeow who has recently been working on multi-device support for scikit-learn

Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds multi-device support across several array libraries (CuPy, Dask, NumPy, and PyTorch) by updating device type annotations from str to Device and introducing device validation through helper functions. Key changes include:

  • Updating documentation and type annotations for device parameters and return types.
  • Introducing and using _helpers._validate_device for device validation in array creation functions.
  • Modifying CuPy’s array creation functions and context management for device support.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

File: Description
array_api_compat/torch/_info.py: Update the device type in default_device and add notes on the current device (with a minor spelling issue in the note).
array_api_compat/numpy/_info.py: Update device type annotations for default_device and devices.
array_api_compat/numpy/_aliases.py: Replace manual device checking with _helpers._validate_device.
array_api_compat/dask/array/_info.py: Update device type annotations for default_device and devices.
array_api_compat/dask/array/_aliases.py: Use _helpers._validate_device in various array functions.
array_api_compat/cupy/_info.py: Update device type annotations and add contextual documentation for default_device.
array_api_compat/cupy/_aliases.py: Adjust device handling in asarray, changing the default for the copy parameter.
array_api_compat/common/_helpers.py: Remove unused _check_device and add _device_ctx for context management with devices.
array_api_compat/common/_aliases.py: Replace _check_device calls with _device_ctx in array creation functions.
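For context, a minimal sketch of what a _device_ctx helper along these lines could look like (the parameter names follow the diff hunks in this PR; the actual implementation in common/_helpers.py may differ):

import contextlib

def _device_ctx(bare_xp, device, like=None):
    """Sketch only: return a context manager under which array creation
    functions allocate on the requested device (assumed semantics)."""
    if device is None and like is not None:
        # *_like functions default to the device of the template array.
        device = like.device
    if device is None:
        # Nothing requested: keep the library's current default device.
        return contextlib.nullcontext()
    if hasattr(bare_xp, "cuda"):
        # CuPy: activate the CUDA device for the duration of the block.
        return bare_xp.cuda.Device(getattr(device, "id", device))
    # Single-device libraries (e.g. NumPy's "cpu") need no context switch.
    return contextlib.nullcontext()

Array creation wrappers would then run their allocation inside this context, as in the empty_like hunk reviewed further down.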

@crusaderky changed the title from "CuPy multi-device support" to "ENH: CuPy multi-device support" on Mar 31, 2025
@crusaderky
Contributor Author

It's not clear to me what happens in cupy non-creation functions.

e.g.

with cupy.cuda.Device(1):
    x = cp.asarray(1)
with cupy.cuda.Device(0):
    y = x + 1
assert y.device == x.device

Does the device propagate from the input, like the Array API dictates, or does the interpreter-level default / context prevail? In the latter case this PR is insufficient.

@betatim
Member

betatim commented Mar 31, 2025

In some multi-GPU setups, the following happens:

In [7]: with cp.cuda.Device(0):
   ...:     y = x + 1
   ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 2
      1 with cp.cuda.Device(0):
----> 2     y = x + 1

File cupy/_core/core.pyx:1322, in cupy._core.core._ndarray_base.__add__()

File cupy/_core/core.pyx:1746, in cupy._core.core._ndarray_base.__array_ufunc__()

File cupy/_core/_kernel.pyx:1285, in cupy._core._kernel.ufunc.__call__()

File cupy/_core/_kernel.pyx:159, in cupy._core._kernel._preprocess_args()

File cupy/_core/_kernel.pyx:130, in cupy._core._kernel._preprocess_arg()

File cupy/_core/_kernel.pyx:120, in cupy._core._kernel._check_peer_access()

ValueError: The device where the array resides (1) is different from the current device (0). Peer access is unavailable between these devices.

@crusaderky
Contributor Author

ValueError: The device where the array resides (1) is different from the current device (0). Peer access is unavailable between these devices.

Damn. That's a showstopper. We can patch functions, but we can't patch array methods.
Are there config flags that can change the behaviour to propagate from input to output?

@betatim
Member

betatim commented Mar 31, 2025

I'm not sure what should happen here. We are trying to combine arrays on two different devices.

From the docs it doesn't sound like you can use the context manager to move an existing array to a different device. In the example above, x is on one device, and then when y = x + 1 is executed, the implicitly created array from 1 is on another device; in general, you can't combine the two to create y (see https://docs.cupy.dev/en/stable/user_guide/basic.html#current-device for "sometimes this might work but we recommend that you don't do this").

This doesn't seem that unreasonable, and I don't think the standard says anything about how arrays on different devices should/shouldn't be combined?

@@ -54,8 +54,8 @@ def empty_like(
     device: Optional[Device] = None,
     **kwargs,
 ) -> Array:
-    _check_device(xp, device)
-    return xp.empty_like(x, dtype=dtype, **kwargs)
+    with _device_ctx(xp, device, like=x):
Member

For my education: why can the user pass a device= argument to a *_like function? Naively I'd have expected that the _like implies that the device matches that of x. But then you can also pass a dtype= which overrides the dtype of x, so by symmetry allowing a device= makes sense?

Contributor Author

Matching the device of the input is what the user wants 99% of the time.
Using empty_like etc. on a different device can make some sense when preparing an output vessel that is filled from different sources (see the sketch below). TBH, though, the main difference between empty and empty_like, besides convenience, is that the latter can easily duplicate a lazy (unknown) shape, which frequently prevents masked updates, but that's a separate problem.
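As a purely illustrative sketch of that "output vessel" pattern (using the NumPy wrapper and its single "cpu" device only so the snippet runs anywhere; with CuPy the device could be a different GPU):

import array_api_compat.numpy as xp

x = xp.asarray([1.0, 2.0, 3.0, 4.0])
dev = xp.__array_namespace_info__().default_device()

# empty_like copies x's shape and dtype, while device= can deliberately
# place the output elsewhere before it is filled from several sources.
out = xp.empty_like(x, device=dev)
out[:2] = xp.asarray([10.0, 20.0], device=dev)
out[2:] = xp.asarray([30.0, 40.0], device=dev)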

@crusaderky
Contributor Author

I'm not sure what should happen here. We are trying to combine arrays on two different devices.
This doesn't seem that unreasonable, and I don't think the standard says anything about how arrays on different devices should/shouldn't be combined?

No, we're trying to propagate the input to the output, which the standard does say is what should happen:

Preserve device assignment as much as possible (e.g. output arrays from a function are expected to be on the same device as input arrays to the function).

In

with cupy.cuda.Device(1):
    x = cp.asarray(1)
with cupy.cuda.Device(0):
    y = x + 1
assert y.device == x.device

I expect a.__add__(b) to ignore the global and context device and just use a.device, which must match b.device if b is an Array. There is no expectation to have binops with mismatched input devices work.

@@ -326,7 +326,7 @@ def devices(self):

         Returns
         -------
-        devices : list of str
+        devices : list[Device]
Member

In numpy it is a string

Contributor Author

Strictly speaking, in numpy it is a Literal["cpu"].
This is a fairly philosophical point that I doubt warrants much discussion.
My personal, very debatable opinion is that I'd rather stick to the abstract Device object than to str.

The returned variable type in the two snippets below should be the same to static type checkers:

import array_api_compat.numpy as xp
d = xp.__array_namespace_info__().devices()

vs.

import numpy as np
from array_api_compat import array_namespace
xp = array_namespace(np.asarray(0))  # returns an abstract array_api_types.Namespace
d = xp.__array_namespace_info__().devices()

(it doesn't today because array-api-types doesn't exist yet, but it will eventually).

Member

Numpy returns a string, not a Device, so why change this comment?

Contributor Author

Because the user shouldn't think of it as a string, but as an opaque device object.

@lucascolley
Member

(what is up with the automatic copilot review 👀)

@betatim
Member

betatim commented Mar 31, 2025

Your expectation and mine don't agree, which is what makes me think that it isn't clear what should happen.

I read "Raise an exception if an operation involves arrays on different devices" combined with "If a library has multiple ways of controlling device placement, the most explicit method should have the highest priority." as the user asking for something that isn't possible because they asked for Device(0) with the context manager but want to operate on something that is on a different device.

@crusaderky
Contributor Author

Your expectation and mine don't agree, which is what makes me think that it isn't clear what should happen.

They're not my expectations, they're the standard's recommendations:
https://data-apis.org/array-api/latest/design_topics/device_support.html#semantics

"If a library has multiple ways of controlling device placement, the most explicit method should have the highest priority."

You're quoting point 5, but you skipped over point 2:

"Preserve device assignment as much as possible (e.g. output arrays from a function are expected to be on the same device as input arrays to the function)."

Also, the context manager is just an example for the sake of making a reproducer. A more realistic example is:

with cupy.cuda.Device(1):
    x = ...
# ... 1000 lines and 3 modules later...
y = scipy.special.logsumexp(x)  # x is on device 1, but the current device is 0

Here there is no "most explicit method". What should be more obvious to the user, that they're currently on the default device 0, or that the array is on device 1?
Are you saying that the desirable behaviour in CuPy is to crash, and that it should ignore the Standard's recommendation to propagate from the input?

@betatim
Member

betatim commented Mar 31, 2025

My expectation is that y is on the same device as x because of "avoid transfers where possible"

@crusaderky
Contributor Author

My expectation is that y is on the same device as x because of "avoid transfers where possible"

So you agree that there should be propagation from input to output.
How is y = x + 1 a few comments above different from it?

@lucascolley
Member

lucascolley commented Mar 31, 2025

The problem is that the CuPy docs say:

All CuPy operations (except for multi-GPU features and device-to-device copy) are performed on the currently active device.

I admit that in this case

with cupy.cuda.Device(1):
    x = cp.asarray(1)
with cupy.cuda.Device(0):
    y = x + 1
assert y.device == x.device

I think the intuitive behaviour would be for y to retain x's device—any problems regarding an 'implicit device' of the Python scalar 1 seem to be implementation detail shortcomings.

The bigger problem is whether such behaviour, which is intuitive to me (and seems to align with the standard), would break the by-design model of CuPy's multi-device support. For example, if you are supposed to treat Python scalars not as scalars but as arrays already on the default device, before they are passed to any function. Again, that seems unintuitive to me, but if that is the model CuPy is committed to, I think it may be worth updating the standard.

@lucascolley
Member

It may be worth opening a CuPy issue to ask the devs about this specific example, to find out how much it is by (crucial) design or how much it is accidental.

@crusaderky
Contributor Author

if you are supposed to treat Python scalars not as scalars but as arrays already on the default device, before they are passed to any function.

That feels really weird, considering that there is already explicit ad-hoc code for NEP50-style type promotion and it would feel logical to assign their device in the same way:

>>> cupy.asarray(1.0, dtype=cupy.float32) + cupy.asarray(1.0)
array(2.)  # float64
>>> cupy.asarray(1.0, dtype=cupy.float32) + 1.0
array(2., dtype=float32)

Contributor

@tylerjereddy left a comment

@tylerjereddy as you have a dual-GPU box, could you help test this?

I'm cross-testing this branch against scipy/scipy#22756, with the state of that SciPy branch at hash 3025a6a05 before some of the more recent (i.e., xfail) adjustments over there, to see if this feature branch of array-api-compat can rescue the test failure when pulled in to scipy/_lib/array_api_compat (using hash 9fae9434 of this branch inside of the other SciPy branch).

At the moment, SCIPY_DEVICE=cuda python dev.py test -t scipy/special/tests/test_logsumexp.py::TestLogSumExp::test_device -b cupy continues to fail in the same way as originally described over there. I'll try to provide slightly more information below on the dual-GPU machine, and maybe you can ask me for what you want me to try next.

With this branch pulled over there, I see that the first setting of the non-default device works just fine:

--- a/scipy/special/tests/test_logsumexp.py
+++ b/scipy/special/tests/test_logsumexp.py
@@ -296,6 +296,7 @@ class TestLogSumExp:
     def test_device(self, x, xp, nondefault_device):
         """Test input device propagation to output."""
         x = xp.asarray(x, device=nondefault_device)
+        print("x device:", xp_device(x))

That prints x device: <CUDA Device 1> with this branch pulled in, but this also seems to be true when I check out the submodule version SciPy main currently uses.

I dissected further in SciPy on the dual-GPU machine with the debug prints below the fold.

--- a/scipy/_lib/_array_api.py
+++ b/scipy/_lib/_array_api.py
@@ -515,9 +515,11 @@ def xp_ravel(x: Array, /, *, xp: ModuleType | None = None) -> Array:
 
 # utility to broadcast arrays and promote to common dtype
 def xp_broadcast_promote(*args, ensure_writeable=False, force_floating=False, xp=None):
+    print("xp_broadcast_promote xp_device(args[0]) checkpoint 1:", xp_device(args[0]))
     xp = array_namespace(*args) if xp is None else xp
 
     args = [(_asarray(arg, subok=True) if arg is not None else arg) for arg in args]
+    print("xp_broadcast_promote xp_device(args[0]) checkpoint 2:", xp_device(args[0]))
     args_not_none = [arg for arg in args if arg is not None]
 
     # determine minimum dtype
diff --git a/scipy/_lib/array_api_compat b/scipy/_lib/array_api_compat
index 1b0de51538..9fae94343f 160000
--- a/scipy/_lib/array_api_compat
+++ b/scipy/_lib/array_api_compat
@@ -1 +1 @@
-Subproject commit 1b0de51538deb7c21d0c268f36764a8589e40012
+Subproject commit 9fae94343fcb3e88ff88e7c0494267fb0a0dfb13
diff --git a/scipy/special/_logsumexp.py b/scipy/special/_logsumexp.py
index c1664bf935..2e2d041d05 100644
--- a/scipy/special/_logsumexp.py
+++ b/scipy/special/_logsumexp.py
@@ -105,9 +105,12 @@ def logsumexp(a, axis=None, b=None, keepdims=False, return_sign=False):
 
     """
     xp = array_namespace(a, b)
+    print("xp_device(a) checkpoint -1:", xp_device(a))
     a, b = xp_broadcast_promote(a, b, ensure_writeable=True, force_floating=True, xp=xp)
+    print("xp_device(a) checkpoint 0:", xp_device(a))
     a = xpx.atleast_nd(a, ndim=1, xp=xp)
     b = xpx.atleast_nd(b, ndim=1, xp=xp) if b is not None else b
+    print("xp_device(a) checkpoint 1:", xp_device(a))
     axis = tuple(range(a.ndim)) if axis is None else axis

A sample failure then drops the non-default device very early on in logsumexp(), and in particular apparently because of an _asarray(arg, subok=True) call inside of xp_broadcast_promote():

scipy/special/tests/test_logsumexp.py:300: in test_device
    assert xp_device(logsumexp(x)) == nondefault_device
E   assert <CUDA Device 0> == <CUDA Device 1>
E    +  where <CUDA Device 0> = xp_device(array(-inf))
E    +    where array(-inf) = logsumexp(array([], dtype=float64))
        nondefault_device = <CUDA Device 1>
        self       = <scipy.special.tests.test_logsumexp.TestLogSumExp object at 0x7482e19845d0>
        x          = array([], dtype=float64)
        xp         = <module 'scipy._lib.array_api_compat.cupy' from '/home/treddy/github_projects/scipy/build-install/lib/python3.11/site-packages/scipy/_lib/array_api_compat/cupy/__init__.py'>
------------------------------------------------------------------------------------------------------------------------ Captured stdout call ------------------------------------------------------------------------------------------------------------------------
x device: <CUDA Device 1>
xp_device(a) checkpoint -1: <CUDA Device 1>
xp_broadcast_promote xp_device(args[0]) checkpoint 1: <CUDA Device 1>
xp_broadcast_promote xp_device(args[0]) checkpoint 2: <CUDA Device 0>
xp_device(a) checkpoint 0: <CUDA Device 0>
xp_device(a) checkpoint 1: <CUDA Device 0>

@crusaderky
Contributor Author

@tylerjereddy Thank you, there was indeed a bug in asarray. I fixed it now.

@betatim
Member

betatim commented Apr 1, 2025

So you agree that there should be propagation from input to output.

Yes, except when you add some explicit request to not do that. For example by using xp.empty_like(x, device=foo).

How is y = x + 1 a few comments above different from it?

Because in the y = x + 1 example we explicitly requested to use Device(0) and x is not on that device. I read the use of the context manager as the user saying "I have thought about this and I want this to happen on Device(0)". The context manager is syntactic sugar for adding device=Device(0) to all function calls in a block.

I agree it is a bit murky what "this" is. Is "this" the addition of x and 1 (the add instructions are executed on device 0, while accessing memory from device 1)? Is "this" the placement of the resulting array y (the result of the add instructions executing somewhere is moved to the memory of device 0 if needed)? Is "this" the creation of the implicit array for 1 (we are now trying to add an array on two different devices)? But, no matter what "this" is, it seems like the user is asking for something that isn't possible, because one of the inputs is on Device(1) and we should not implicitly move arrays. Ignoring the context manager also seems wrong.

@lucascolley
Member

lucascolley commented Apr 1, 2025

@kmaehashi @leofang @asi1024 hello! We have a question about how to interpret CuPy's device context manager. The following seems clear:

with cp.cuda.Device(1):
    x = cp.asarray(1)  # x should be on 'current device' Device(1)
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0), as explicitly requested

But what about the following example:

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0)
    z = y + 1

How 'strong' is the context manager supposed to be here? It seems like there are at least a couple options:

  1. Everything should be forced to be on Device(1) unless explicitly requested otherwise in the function call. Thus z should be on Device(1), and we should throw an exception if this can't happen.
  2. The context manager is expected to apply to array creation, but needn't stop z = y + 1 from propagating y's device to z. So z can be on Device(0).

#293 (comment) suggests that in practice, the current CuPy implementation leans towards option (1), failing when peer access cannot be established. If so, is this a deliberate/crucial design aspect, or more accidental?

If (1) is by design, how far does it go? For example, should

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0)
    y = y

also raise an exception if y cannot be accessed on Device(1)?


I suppose there is also the more extreme option 3 where

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))

should itself throw an exception, with the context manager overriding the device argument, but that would directly contradict https://data-apis.org/array-api/draft/design_topics/device_support.html#semantics.

@crusaderky
Contributor Author

So you agree that there should be propagation from input to output.

Yes, except when you add some explicit request to not do that. For example by using xp.empty_like(x, device=foo).

This much is clear.

How is y = x + 1 a few comments above different from it?

Because in the y = x + 1 example we explicitly requested to use Device(0) and x is not on that device. I read the use of the context manager as the user saying "I have thought about this and I want this to happen on Device(0)". The context manager is syntactic sugar for adding device=Device(0) to all function calls in a block.

I agree it is a bit murky what "this" is. Is "this" the addition of x and 1 (the add instructions are executed on device 0, while accessing memory from device 1)? Is "this" the placement of the resulting array y (the result of the add instructions executing somewhere is moved to the memory of device 0 if needed)? Is "this" the creation of the implicit array for 1 (we are now trying to add an array on two different devices)? But, no matter what "this" is, it seems like the user is asking for something that isn't possible, because one of the inputs is on Device(1) and we should not implicitly move arrays. Ignoring the context manager also seems wrong.

So are you saying that the context should trump the device of the input array to a function (x + 1 is a function), but not the global default device?
That, in other words, the hierarchy in your opinion should be

  1. device= parameter of the function
  2. context manager
  3. device of input array(s) to the function
  4. global device

?

@tylerjereddy
Contributor

Just to continue providing the multi-device scenario feedback, we get farther in the control flow in the SciPy example now until encountering the dreaded ValueError: The device where the array resides (1) is different from the current device (0). Peer access is unavailable between these devices.

That is perhaps slightly more surprising since I don't see a context manager directly in the source, but I think confusion is a common theme already anyway, and the context manager may just be abstracted away in the shims here (or is the global default device still overriding even without the context?) so I don't "see it."

Dissection at scipy/scipy#22756 (comment), but probably more something to discuss over here for now.

@leofang

leofang commented Apr 1, 2025

CuPy does not support the device= kwarg in the array constructors today, so some work has to happen first. But if we were to support it now, I'd be supportive of honoring the kwarg (and ignoring the global setting or local context manager, to the maximum extent possible).

@crusaderky
Contributor Author

Dissection at scipy/scipy#22756 (comment), but probably more something to discuss over here for now.

xp.exp(x) is disregarding the device of x and instead it is using the global default of 0. No context managers are involved here.

array-api-compat could wrap every single function so that cupy respects the input argument's device, but cannot do the same for array methods, e.g. __add__, so we're dead in the water unless cupy itself changes things.
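For reference, a rough sketch of the kind of per-function wrapping meant here (hypothetical helper, not code from this PR), which also shows why dunder methods such as __add__ are out of reach:

import functools

import cupy as cp

def _respect_input_device(func):
    # Hypothetical: run func inside the CUDA device context of its first
    # CuPy array argument, so the output lands on that array's device.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for arg in args:
            if isinstance(arg, cp.ndarray):
                with arg.device:  # cupy.ndarray.device is a cupy.cuda.Device
                    return func(*args, **kwargs)
        return func(*args, **kwargs)
    return wrapper

exp = _respect_input_device(cp.exp)
# exp(x) would now follow x.device, but x + 1 still dispatches to
# cp.ndarray.__add__, which array-api-compat cannot wrap this way.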

@lucascolley
Member

@leofang do you have an opinion on the desired behaviour here?

with cp.cuda.Device(1):
    y = cp.asarray(1, device=cp.cuda.Device(0))  # y should be on Device(0)
    z = y + 1  # device of z?

@crusaderky marked this pull request as draft April 3, 2025 11:30
@crusaderky changed the title from "ENH: CuPy multi-device support" to "[DNM] ENH: CuPy multi-device support" on Apr 3, 2025