Benchmarks: cuda.core#2005
Conversation
|
Do you have a side-by-side bindings-vs-core delta table that you could post here? Quick "Low" findings from Cursor GPT-5.4 Extra High Fast
|
|
Here is a table: On this case the |
|
Updated the table with my last numbers, sorry about that. The question here is if are ok with cuda.core being faster because of object construction and i think because of like caching some of the device creation. In general I would say yes, but I understand that one could argue the comparison is not fair. |
Description
This is for matching benchmarks we have been doing for
cuda.bindingstocuda.core.I guess its up for discussion if we need these and what we want to compare them against.
Right now its basically trying to measure extra latency of the
cuda.corelayer by comparing the to cuda.bindings ones and matching benchmark IDs to that suite 1:1.The main question I think is regarding the "caching" that we get from cuda.core on
Device.Deviceinstances are singletons so after a first callDevice(0)doesnt hit the driver. And probably other similar cases.I guess we could also introduce some sort of cleanups or process spawns but that would come with other latencies.