Add `__restrict` to some unified kernels. #1910

MarcelKoch · 2025-08-13T11:01:16Z

This PR adds the ability to annotate pointers used in the unified kernels with the `__restrict` keyword. To add the keyword, the object has to be wrapped in `as_restrict` when it is passed to the kernel. (This does not apply to the solver kernels. Supporting those would need more work, which could be done at a later point if necessary.)
Currently this has been added only to the `(add|sub)_scaled` dense kernels. I also briefly added it to `(add|sub)_scaled_diag`, but in a few cases the performance dropped significantly.
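
To illustrate the opt-in idea, here is a minimal standalone sketch, not the code from this PR. The `restricted_ptr` struct, the `as_restrict` helper and the `add_scaled` signature below are simplified, hypothetical stand-ins; the actual kernels go through the unified kernel launch machinery. The sketch only shows how a wrapper type can let a kernel promise the compiler that its pointers do not alias.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tag type: marks a pointer the caller promises does not alias
// any other pointer passed to the same kernel.
template <typename T>
struct restricted_ptr {
    T* ptr;
};

// Hypothetical helper playing the role of as_restrict: opt in per argument.
template <typename T>
restricted_ptr<T> as_restrict(T* ptr)
{
    return {ptr};
}

// Simplified y += alpha * x. Because both arguments arrive wrapped, the
// kernel may use __restrict (a compiler extension supported by GCC, Clang,
// MSVC and nvcc) on its local pointers.
template <typename T>
void add_scaled(std::size_t n, T alpha, restricted_ptr<const T> x,
                restricted_ptr<T> y)
{
    // The no-alias promise is the caller's responsibility; this only
    // catches the most obvious violation in debug builds.
    assert(n == 0 || static_cast<const void*>(x.ptr) !=
                         static_cast<const void*>(y.ptr));
    const T* __restrict xr = x.ptr;
    T* __restrict yr = y.ptr;
    for (std::size_t i = 0; i < n; ++i) {
        yr[i] += alpha * xr[i];
    }
}

// Small usage example: returns y after y += 2 * x.
std::vector<double> run_add_scaled_example()
{
    const std::vector<double> x{1.0, 2.0, 3.0};
    std::vector<double> y{10.0, 20.0, 30.0};
    add_scaled(x.size(), 2.0, as_restrict(x.data()), as_restrict(y.data()));
    return y;
}
```

Passing overlapping ranges to such a kernel is undefined behavior, which is one reason for making the annotation opt-in per call site rather than applying it to every pointer.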

I ran some benchmarks on the coma-cluster using this input file:
in.json

These are the results, already translated into speedup/slowdown:
blas.json
The CUDA and Intel CPU benchmarks ran on gpu-nvidia-h100; the AMD CPU benchmarks ran on rocinante.

Here are only the cases where a slowdown occurred:
blas.slowdown.json

As mentioned above, I reverted the change that caused the largest slowdown. The remaining slowdowns are ~5% on CUDA for the smallest sizes. I think it is still reasonable to continue, since the OpenMP speedup is quite significant and small sizes are less relevant for CUDA than they are for CPUs.

MarcelKoch · 2025-08-13T11:02:53Z

common/unified/base/kernel_launch.hpp

+ * @tparam T  the underlying type being mapped. Any references or const
+ *            qualifiers have to be resolved before passing the type.
+ *            The distinction between const/mutable objects is done by
+ *            overloading the map_to_device function.
+ * @tparam PtrWrapper  the pointer type. By default, it's just `T*`, but it may
+ *                     be set to restricted_ptr.


I've slightly changed how `to_device_type_impl<T>` is implemented, since it was easier for me to reason about it after removing all cv/ref qualifiers.
If wanted, I can also revert this change.
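
For readers unfamiliar with this pattern, the following is a rough sketch of the approach described in the doc comment, not the actual contents of `kernel_launch.hpp`. The names `array_sketch`, `plain_ptr` and the free function `map_to_device` are hypothetical, and the real `to_device_type_impl` may be structured differently. The idea shown: strip cv/ref qualifiers once, specialize only on the clean type, and let overloads of `map_to_device` distinguish const from mutable objects.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a host-side object that owns device data.
template <typename V>
struct array_sketch {
    V* data;
    std::size_t size;
};

// Default pointer "wrapper": just T*.
template <typename T>
using plain_ptr = T*;

// Opt-in wrapper marking the pointer as non-aliasing (sketch only).
template <typename T>
struct restricted_ptr {
    T* ptr;
};

// Primary template: types without a special mapping are passed by value.
// T is expected to be free of references and cv qualifiers.
template <typename T, template <typename> class PtrWrapper = plain_ptr>
struct to_device_type_impl {
    static T map_to_device(const T& value) { return value; }
};

// Specialization on the clean type only; const vs. mutable is handled by
// overloading map_to_device instead of by extra specializations.
template <typename V, template <typename> class PtrWrapper>
struct to_device_type_impl<array_sketch<V>, PtrWrapper> {
    static PtrWrapper<V> map_to_device(array_sketch<V>& a)
    {
        return {a.data};
    }
    static PtrWrapper<const V> map_to_device(const array_sketch<V>& a)
    {
        return {a.data};
    }
};

// Hypothetical front end: resolves cv/ref before selecting the mapping.
template <template <typename> class PtrWrapper = plain_ptr, typename T>
auto map_to_device(T&& value)
{
    using clean = std::remove_cv_t<std::remove_reference_t<T>>;
    return to_device_type_impl<clean, PtrWrapper>::map_to_device(value);
}
```

With this layout, `map_to_device(a)` yields a `double*` for a mutable `array_sketch<double>` and a `const double*` for a const one, while `map_to_device<restricted_ptr>(a)` yields the wrapped variants.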

yhmtsai · 2025-08-13T11:35:09Z

Should we limit this to an `omp_restrict` for now, given that CUDA mostly shows slowdowns from adding it?

MarcelKoch requested a review from a team August 13, 2025 11:01

MarcelKoch self-assigned this Aug 13, 2025

MarcelKoch added 7 commits August 13, 2025 13:39

[unified] use restrict for matrix_accessor

48bf140

[unified] add restrict to all pointers

77d0397

[unified] add restrict to arrays

71eccbb

[unified] add restrict only when requested

9ee2c34

[unified] add restrict to some dense kernels

5a4a5fd

[unified] refactor type mapping

87c3acb

[unified] revert partially adding restrict

356be23

In benchmarks the (add|sub)_diag operations could experience significant slowdowns (up to 30% for single-threaded n=20k on intel), while the benefit in other cases was not as significant.

MarcelKoch force-pushed the restrict branch from e73280f to 356be23 August 13, 2025 11:40

MarcelKoch force-pushed the benchmark-add_sub_diag branch from 8205148 to ac150d4 August 13, 2025 11:40
