Device.set_current() is slow #739

Open
@leofang

Follow-up of #658.

With cuda-python 12.6.0 (an important case, because its cudart is a Cython-based re-implementation):

In [4]: %timeit cudart.cudaSetDevice(0)
355 ns ± 1.69 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

For comparison, this is CuPy:

In [6]: %timeit cp.cuda.runtime.setDevice(0)
154 ns ± 1.67 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

With cuda-bindings 12.9.0 (the re-implementation is replaced by the statically linked cudart):

In [4]: %timeit runtime.cudaSetDevice(0)
167 ns ± 0.367 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

And this is cuda-core on the main branch + cuda-bindings 12.9.0:

In [5]: %timeit dev.set_current()
1.84 μs ± 16 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
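
To put the numbers above side by side outside of IPython, here is a minimal sketch of a timing harness built on the standard-library `timeit` module. The helper names `per_call_ns` and `slowdown` are hypothetical (they are not part of cuda-core, cuda-bindings, or CuPy), and the demo uses a no-op stand-in so it runs without a GPU. The timings it prints will not match `%timeit` exactly, since `%timeit` picks its own loop count.

```python
import statistics
import timeit


def per_call_ns(fn, number=100_000, repeat=7):
    """Return (mean, stdev) of the per-call time of fn() in nanoseconds.

    Hypothetical helper for illustration; mirrors the 7 runs used by %timeit.
    """
    timer = timeit.Timer(fn)
    # Each entry is the total wall time in seconds for `number` calls.
    totals = timer.repeat(repeat=repeat, number=number)
    per_call = [t / number * 1e9 for t in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


def slowdown(slow_ns, fast_ns):
    """Ratio between two per-call timings given in the same unit."""
    return slow_ns / fast_ns


if __name__ == "__main__":
    # Stand-in callable so the sketch runs without a GPU; on a CUDA
    # machine one would pass the bound method to measure instead.
    mean, std = per_call_ns(lambda: None)
    print(f"no-op: {mean:.1f} ns ± {std:.1f} ns per call")

    # Figures quoted in this issue: 1.84 us for dev.set_current()
    # versus 167 ns for runtime.cudaSetDevice(0).
    print(f"set_current() vs cudaSetDevice: {slowdown(1840, 167):.1f}x")
```

With the figures quoted in this issue, `dev.set_current()` comes out roughly 11x slower than the bare `cudaSetDevice(0)` call from cuda-bindings 12.9.0. On a machine with a GPU, passing `dev.set_current` (or a lambda wrapping `runtime.cudaSetDevice(0)`) to `per_call_ns` should give comparable measurements.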

I think we should find a way to at least revive and reuse the old re-implementation.

Labels: P0 (High priority - Must do!), cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), triage (Needs the team's attention)
