Follow-up of #658.
With cuda-python 12.6.0 (this is an important case because it is a Cython-based cudart re-implementation):
In [4]: %timeit cudart.cudaSetDevice(0)
355 ns ± 1.69 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
For comparison, this is CuPy:
In [6]: %timeit cp.cuda.runtime.setDevice(0)
154 ns ± 1.67 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
With cuda-bindings 12.9.0 (the re-implementation is replaced by the statically linked cudart):
In [4]: %timeit runtime.cudaSetDevice(0)
167 ns ± 0.367 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
And this is cuda-core on the main branch + cuda-bindings 12.9.0:
In [5]: %timeit dev.set_current()
1.84 μs ± 16 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
I think we should find a way to at least revive and reuse the old re-implementation.
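For anyone who wants to reproduce these numbers outside IPython, here is a rough sketch using the standard-library `timeit` module. The helper names `per_call_ns` and `collect_candidates` are made up for illustration, and the import paths (`cuda.bindings.runtime`, `cuda.core.experimental.Device`, `cupy`) are my assumptions about how `runtime`, `dev`, and `cp` were obtained in the sessions above; adjust them to the installed versions. Libraries that are missing or fail to initialize are skipped.

```python
import statistics
import timeit


def per_call_ns(stmt, namespace, number=100_000, repeat=7):
    """Time `stmt` and return (mean, stdev) of the per-call cost in nanoseconds."""
    # Each entry is the total wall time of `number` executions of `stmt`.
    totals = timeit.repeat(stmt, globals=namespace, number=number, repeat=repeat)
    per_call = [total / number * 1e9 for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


def collect_candidates():
    """Return (label, stmt, namespace) for every library that can be set up.

    All import paths below are assumptions, not taken from the issue text.
    """
    candidates = []
    try:
        from cuda.bindings import runtime  # assumed path for cuda-bindings
        candidates.append(
            ("cuda-bindings", "runtime.cudaSetDevice(0)", {"runtime": runtime})
        )
    except Exception:
        pass  # not installed or failed to load: skip
    try:
        import cupy as cp
        candidates.append(("CuPy", "cp.cuda.runtime.setDevice(0)", {"cp": cp}))
    except Exception:
        pass
    try:
        from cuda.core.experimental import Device  # assumed path for cuda-core
        dev = Device(0)
        candidates.append(("cuda-core", "dev.set_current()", {"dev": dev}))
    except Exception:
        pass
    return candidates


if __name__ == "__main__":
    for label, stmt, namespace in collect_candidates():
        mean_ns, std_ns = per_call_ns(stmt, namespace)
        print(f"{label}: {mean_ns:.0f} ns ± {std_ns:.1f} ns per loop")
```

The statements are passed as strings rather than wrapped in lambdas, since an extra Python-level call would add overhead comparable to the differences being measured here.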