Skip to content

Conversation

@MarcelKoch
Copy link
Member

This PR adds the ability to annotate pointers used in the unified kernels with the __restrict keyword. To add the keyword, the object has to be wrapped by as_restrict, when passed to the kernel. (This does not apply to the solver kernels. I would need more work to fix those, which could be done at a later point if necessary.)
Currently this has been added only to the (add|sub)_scaled dense kernel. I briefly also added it to the (add|sub)_scaled_diag, but in a few cases the performance dropped significantly.

I ran some benchmarks on the coma-cluster using this input file:
in.json

These are the results, already translated into speedup/slowdown:
blas.json
The cuda and intel-cpu machine was gpu-nvidia-h100, the amd-cpu machine was rocinante.

Here are also only the cases where a slowdown occured.
blas.slowdown.json

As mentioned before, I removed the largest slowdown again. The other slowdowns are ~5% for cuda for the smallest sizes. I think it is reasonable to still continue, since the openmp speedup is quite significant, and small sizes are less relevant for cuda than they are for cpus.

@MarcelKoch MarcelKoch requested a review from a team August 13, 2025 11:01
@MarcelKoch MarcelKoch self-assigned this Aug 13, 2025
@ginkgo-bot ginkgo-bot added mod:core This is related to the core module. mod:cuda This is related to the CUDA module. mod:openmp This is related to the OpenMP module. type:matrix-format This is related to the Matrix formats mod:hip This is related to the HIP module. mod:dpcpp This is related to the DPC++ module. labels Aug 13, 2025
Comment on lines +244 to +249
* @tparam T the underlying type being mapped. Any references or const
* qualifiers have to be resolved before passing the type.
* The distinction between const/mutable objects is done by
* overloading the map_to_device function.
* @tparam PtrWrapper the pointer type. By default, it's just `T*`, but it may
* be set to restricted_ptr.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed a bit how to_device_type_impl<T> is implemented, since it was easier for me to reason about it after removing all cv/ref.
If wanted I can also revert this change.

@yhmtsai
Copy link
Member

yhmtsai commented Aug 13, 2025

should we just use it to omp_restrict now if cuda almost have slowdown by adding that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mod:core This is related to the core module. mod:cuda This is related to the CUDA module. mod:dpcpp This is related to the DPC++ module. mod:hip This is related to the HIP module. mod:openmp This is related to the OpenMP module. type:matrix-format This is related to the Matrix formats

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants