Integrate decoupled lookahead warpspeed scan #6811
base: main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.

/ok to test 621720f
/ok to test 96a492f
```cuda
// For 64-bit types, we still use __shfl_sync
[[nodiscard]] _CCCL_DEVICE_API inline int makeWarpUniform(int x)
{
  NV_IF_ELSE_TARGET(NV_PROVIDES_SM_90, (return __reduce_min_sync(~0, x);), (return x;));
```
@ahendriksen should this fall back to __shfl_sync for non-SM90?
Yes, that would work.
I believe we should actually use WarpReduce here, because it has an optimization for this case.
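For reference, a minimal sketch of the __shfl_sync fallback discussed above, assuming the value only needs to be made warp-uniform (lane 0's value is broadcast on pre-SM90 targets); a WarpReduce-based variant would additionally need warp temp storage and is not shown here.

```cuda
// Sketch only: pre-SM90 fallback broadcasts lane 0's value via __shfl_sync,
// SM90+ keeps the __reduce_min_sync path from the PR.
[[nodiscard]] _CCCL_DEVICE_API inline int makeWarpUniform(int x)
{
  NV_IF_ELSE_TARGET(NV_PROVIDES_SM_90,
                    (return __reduce_min_sync(~0, x);),
                    (return __shfl_sync(~0u, x, 0);));
}
```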
/ok to test
```cuda
    .set_name("base")
    .set_type_axes_names({"T{ct}", "OffsetT{ct}"})
    .add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4));
  //.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
```
Critical: We need to make sure we can handle partial tiles
I have the changes working locally. Will upstream soon.
cub/cub/device/dispatch/kernels/warpspeed/resource/SmemStage.cuh (outdated; resolved)
```cuda
    : SquadDesc(squadStatic)
    , mSpecialRegisters(specialRegisters)
{
  mIsWarpLeader = ::cuda::ptx::elect_sync(~0);
```
We should make this available on earlier architectures.
Yes, we can do this using mIsWarpLeader = (threadIdx.x % 32) == 0;
or sr.laneIdx == 0
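A minimal sketch of the fallback suggested above, assuming laneIdx carries the value from SpecialRegisters; electWarpLeader is a hypothetical helper name, the PR assigns the result directly to mIsWarpLeader.

```cuda
// Sketch only: elect_sync on SM90+, otherwise treat lane 0 as the warp leader.
[[nodiscard]] _CCCL_DEVICE_API inline bool electWarpLeader(uint32_t laneIdx)
{
  NV_IF_ELSE_TARGET(NV_PROVIDES_SM_90,
                    (return ::cuda::ptx::elect_sync(~0u);),
                    (return laneIdx == 0;));
}
```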
```cuda
squadDispatch(SpecialRegisters sr, const SquadDesc (&squads)[numSquads], F f, int warpIdxStart = 0)
{
  static_assert(numSquads > 0);
  if (numSquads == 1)
```
Can this be
```diff
- if (numSquads == 1)
+ if constexpr (numSquads == 1)
```
Yes it can. Not sure if there is any benefit, but it is possible.
```cuda
}
if (sr.warpIdx < warpIdxStartMid)
{
  if constexpr (0 < mid)
```
I believe it would be clearer to compare against 0:
```diff
- if constexpr (0 < mid)
+ if constexpr (mid != 0)
```
```cuda
template <int numLookbackTiles,
          int tile_size,
```
Style: Use CamelCase
```cuda
  NV_IF_ELSE_TARGET(
    NV_IS_HOST,
    ({
      int curr_device{};
      if (const auto error = CubDebug(cudaGetDevice(&curr_device)))
      {
        return error;
      }

      int max_smem_size_optin{};
      if (const auto error = CubDebug(
            cudaDeviceGetAttribute(&max_smem_size_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, curr_device)))
      {
        return error;
      }

      int reserved_smem_size{};
      if (const auto error = CubDebug(
            cudaDeviceGetAttribute(&reserved_smem_size, cudaDevAttrReservedSharedMemoryPerBlock, curr_device)))
      {
        return error;
      }
      max_dynamic_smem_size = max_smem_size_optin - reserved_smem_size;
    }),
    ({
      cudaFuncAttributes func_attrs{};
      if (const auto error = CubDebug(cudaFuncGetAttributes(&func_attrs, func)))
      {
        return error;
      }
      max_dynamic_smem_size = func_attrs.maxDynamicSharedSizeBytes;
    }))
  return cudaSuccess;
}
```
Nitpick: I believe we should move this into a utility function
@davebayer did this in #6818
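For reference, a minimal sketch of what such a utility could look like, simply wrapping the quoted host/device query; max_dynamic_smem_for, KernelT, and kernel are hypothetical names, and the version from #6818 may differ.

```cuda
// Sketch only: wraps the quoted host/device query for the maximum
// dynamic shared memory size in one helper.
template <typename KernelT>
CUB_RUNTIME_FUNCTION cudaError_t max_dynamic_smem_for(KernelT kernel, int& max_dynamic_smem_size)
{
  NV_IF_ELSE_TARGET(
    NV_IS_HOST,
    ({
      // Host path: opt-in per-block limit minus the driver-reserved share.
      int curr_device{};
      if (const auto error = CubDebug(cudaGetDevice(&curr_device)))
      {
        return error;
      }
      int max_smem_size_optin{};
      if (const auto error = CubDebug(
            cudaDeviceGetAttribute(&max_smem_size_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, curr_device)))
      {
        return error;
      }
      int reserved_smem_size{};
      if (const auto error = CubDebug(
            cudaDeviceGetAttribute(&reserved_smem_size, cudaDevAttrReservedSharedMemoryPerBlock, curr_device)))
      {
        return error;
      }
      max_dynamic_smem_size = max_smem_size_optin - reserved_smem_size;
    }),
    ({
      // Device path: query the kernel's attributes directly.
      cudaFuncAttributes func_attrs{};
      if (const auto error = CubDebug(cudaFuncGetAttributes(&func_attrs, kernel)))
      {
        return error;
      }
      max_dynamic_smem_size = func_attrs.maxDynamicSharedSizeBytes;
    }))
  return cudaSuccess;
}
```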
```cuda
auto* d_in_unwrapped  = THRUST_NS_QUALIFIER::unwrap_contiguous_iterator(d_in);
auto* d_out_unwrapped = THRUST_NS_QUALIFIER::unwrap_contiguous_iterator(d_out);
```
Note to self, no change requested: we should really move this to to_address.
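A sketch of that follow-up, assuming d_in and d_out remain contiguous iterators as in the quoted lines; cuda::std::to_address would take the place of the Thrust unwrap helper.

```cuda
// Sketch only: obtain the raw pointers via cuda::std::to_address instead of
// THRUST_NS_QUALIFIER::unwrap_contiguous_iterator.
auto* d_in_unwrapped  = ::cuda::std::to_address(d_in);
auto* d_out_unwrapped = ::cuda::std::to_address(d_out);
```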
```cuda
REQUIRE(all_results_correct == true);

// Copy over the results and expected results to host and compare
#if false
```
Question: Should this be enabled?
It's just a debug print utility in case of failing tests. I'm leaning towards dropping this.
/ok to test
4 similar comments
```cuda
int warpIsPrivSum = 0;
NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsPrivSum = __reduce_or_sync(~0, laneIsPrivSum);))
```
@ahendriksen this is unused, did we accidentally drop something?
Some code is/was left behind to support decoupled lookback, which has cumSum states in tmp_states in addition to just privSum. See the commented-out lines starting with `// We are not storing CUM_SUM states, because it makes updating idxTileCur` below.
Since we are fairly confident that we will only need the privSum states, we can drop warpIsCumSum, and I think we can also drop warpIsPrivSum (as we are using warpIsEmpty below, which gives all the necessary information).
```cuda
int warpIsEmpty = 0;
NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsEmpty = __reduce_or_sync(~0, laneIsEmpty);))
int warpIsCumSum = 0;
NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsCumSum = __reduce_or_sync(~0, laneIsCumSum);))
```
Important: This is technically UB, because the bitwise reduce functions take an unsigned input
Suggested change:
```diff
- int warpIsEmpty = 0;
- NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsEmpty = __reduce_or_sync(~0, laneIsEmpty);))
- int warpIsCumSum = 0;
- NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsCumSum = __reduce_or_sync(~0, laneIsCumSum);))
+ unsigned warpIsEmpty = 0;
+ NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsEmpty = __reduce_or_sync(~0, laneIsEmpty);))
+ unsigned warpIsCumSum = 0;
+ NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsCumSum = __reduce_or_sync(~0, laneIsCumSum);))
```
```cuda
_CCCL_GLOBAL_CONSTANT SquadDesc squadReduce{/*squadIdx=*/0, /*numWarps=*/4};
_CCCL_GLOBAL_CONSTANT SquadDesc squadScanStore{/*squadIdx=*/1, /*numWarps=*/4};
_CCCL_GLOBAL_CONSTANT SquadDesc squadLoad{/*squadIdx=*/2, /*numWarps=*/1};
_CCCL_GLOBAL_CONSTANT SquadDesc squadSched{/*squadIdx=*/3, /*numWarps=*/1};
_CCCL_GLOBAL_CONSTANT SquadDesc squadLookback{/*squadIdx=*/4, /*numWarps=*/1};

_CCCL_GLOBAL_CONSTANT SquadDesc scanSquads[] = {
  squadReduce,
  squadScanStore,
  squadLoad,
  squadSched,
  squadLookback,
};
```
I believe we should have a make_squads(int...) that returns effectively scanSquads. We should then be able to name the individual array members via references.
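A minimal sketch of that idea, assuming SquadDesc can be aggregate-initialized from {squadIdx, numWarps} as in the quoted snippet; make_squads is a hypothetical helper and the reference bindings stand in for the PR's _CCCL_GLOBAL_CONSTANT variables.

```cuda
#include <cuda/std/array>

// Sketch only: build the squad array from warp counts, then name its members
// via constexpr references instead of keeping a second set of globals.
template <typename... WarpCounts>
constexpr auto make_squads(WarpCounts... numWarps)
{
  int squadIdx = 0;
  // Braced-init-list elements are evaluated left to right, so squadIdx++
  // hands out consecutive squad indices.
  return ::cuda::std::array<SquadDesc, sizeof...(WarpCounts)>{SquadDesc{squadIdx++, numWarps}...};
}

inline constexpr auto scanSquads = make_squads(4, 4, 1, 1, 1);
inline constexpr const SquadDesc& squadReduce    = scanSquads[0];
inline constexpr const SquadDesc& squadScanStore = scanSquads[1];
inline constexpr const SquadDesc& squadLoad      = scanSquads[2];
inline constexpr const SquadDesc& squadSched     = scanSquads[3];
inline constexpr const SquadDesc& squadLookback  = scanSquads[4];
```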
```cuda
  const uint32_t laneIdx;
};

[[nodiscard]] _CCCL_DEVICE_API inline SpecialRegisters getSpecialRegisters()
```
Sorry, I actually meant cudax, so that we can have something that can evolve
Force-pushed eecd5da to 7c44978
/ok to test 7c44978
Force-pushed b9d90f5 to 1e31dc0
Force-pushed c668411 to a51eaed
/ok to test 2ec0602
😬 CI Workflow Results: 🟥 Finished in 2h 46m. Pass: 54%/267 | Total: 5d 07h | Max: 2h 21m | Hits: 73%/210590. See results here.
Force-pushed f45e745 to 08f0046
* MSVC does not like designated initializer
* Only instantiate kernel for SM100 for now until we decide whether a non-work-stealing implementation is worth it
* Disable warpspeed in test
* Make alignment test work
* Fix use_warpspeed in test policy
* Apply suggestions from code review
* Fix formatting
* Drop strange line
* Fix nodiscard issue
* Try to work around clang-cuda issue with __reduce_or_sync only being available with SM80
* Fix NV_IF_TARGET mishap
Force-pushed fd32480 to 345f260
Check single stage SMEM consumption at compile-time (see merge request CCCL/cccl-mirror!57)
Use the input tile SMEM for staging the output (see merge request CCCL/cccl-mirror!58)
This was a typo by Allard.
Avoid reading garbage in first tile (see merge request CCCL/cccl-mirror!61)
WIP
cub.bench.scan.exclusive.sum.base on B200:
Fixes: #6644