Integrate decoupled lookahead warpspeed scan #6811
base: main
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.

/ok to test 621720f
/ok to test 96a492f
```cuda
// For 64-bit types, we still use __shfl_sync
[[nodiscard]] _CCCL_DEVICE_API inline int makeWarpUniform(int x)
{
  NV_IF_ELSE_TARGET(NV_PROVIDES_SM_90, (return __reduce_min_sync(~0, x);), (return x;));
```
@ahendriksen should this fall back to __shfl_sync for non-SM90?
Yes, that would work.
I believe we should actually use WarpReduce here, because it has an optimization for this case.
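For reference, a minimal sketch of the __shfl_sync fallback discussed above, assuming the value only needs to be made warp-uniform (lane 0's value is broadcast on pre-SM90 targets); a WarpReduce-based variant would additionally need warp temp storage and is not shown here.

```cuda
// Sketch only: pre-SM90 fallback broadcasts lane 0's value via __shfl_sync,
// SM90+ keeps the __reduce_min_sync path from the PR.
[[nodiscard]] _CCCL_DEVICE_API inline int makeWarpUniform(int x)
{
  NV_IF_ELSE_TARGET(NV_PROVIDES_SM_90,
                    (return __reduce_min_sync(~0, x);),
                    (return __shfl_sync(~0u, x, 0);));
}
```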
/ok to test
```cuda
    .set_name("base")
    .set_type_axes_names({"T{ct}", "OffsetT{ct}"})
    .add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4));
  //.add_int64_power_of_two_axis("Elements{io}", nvbench::range(16, 28, 4))
```
Critical: We need to make sure we can handle partial tiles
I have the changes working locally. Will upstream soon.
cub/cub/device/dispatch/kernels/warpspeed/resource/SmemStage.cuh (outdated; resolved)
```cuda
    : SquadDesc(squadStatic)
    , mSpecialRegisters(specialRegisters)
{
  mIsWarpLeader = ::cuda::ptx::elect_sync(~0);
```
We should make this available on earlier architectures.
Yes, we can do this using mIsWarpLeader = (threadIdx.x % 32) == 0;
or sr.laneIdx == 0
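A minimal sketch of the fallback suggested above, assuming laneIdx carries the value from SpecialRegisters; electWarpLeader is a hypothetical helper name, the PR assigns the result directly to mIsWarpLeader.

```cuda
// Sketch only: elect_sync on SM90+, otherwise treat lane 0 as the warp leader.
[[nodiscard]] _CCCL_DEVICE_API inline bool electWarpLeader(uint32_t laneIdx)
{
  NV_IF_ELSE_TARGET(NV_PROVIDES_SM_90,
                    (return ::cuda::ptx::elect_sync(~0u);),
                    (return laneIdx == 0;));
}
```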
```cuda
squadDispatch(SpecialRegisters sr, const SquadDesc (&squads)[numSquads], F f, int warpIdxStart = 0)
{
  static_assert(numSquads > 0);
  if (numSquads == 1)
```
Can this be
```diff
- if (numSquads == 1)
+ if constexpr (numSquads == 1)
```
Yes it can. Not sure if there is any benefit, but it is possible.
```cuda
}
if (sr.warpIdx < warpIdxStartMid)
{
  if constexpr (0 < mid)
```
I believe it would be clearer to compare against 0:
```diff
- if constexpr (0 < mid)
+ if constexpr (mid != 0)
```
```cuda
template <int numLookbackTiles,
          int tile_size,
```
Style: Use CamelCase
```cuda
  NV_IF_ELSE_TARGET(
    NV_IS_HOST,
    ({
      int curr_device{};
      if (const auto error = CubDebug(cudaGetDevice(&curr_device)))
      {
        return error;
      }

      int max_smem_size_optin{};
      if (const auto error = CubDebug(
            cudaDeviceGetAttribute(&max_smem_size_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, curr_device)))
      {
        return error;
      }

      int reserved_smem_size{};
      if (const auto error = CubDebug(
            cudaDeviceGetAttribute(&reserved_smem_size, cudaDevAttrReservedSharedMemoryPerBlock, curr_device)))
      {
        return error;
      }
      max_dynamic_smem_size = max_smem_size_optin - reserved_smem_size;
    }),
    ({
      cudaFuncAttributes func_attrs{};
      if (const auto error = CubDebug(cudaFuncGetAttributes(&func_attrs, func)))
      {
        return error;
      }
      max_dynamic_smem_size = func_attrs.maxDynamicSharedSizeBytes;
    }))
  return cudaSuccess;
}
```
Nitpick: I believe we should move this into a utility function
@davebayer did this in #6818
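For reference, a minimal sketch of what such a utility could look like, simply wrapping the quoted host/device query; max_dynamic_smem_for, KernelT, and kernel are hypothetical names, and the version from #6818 may differ.

```cuda
// Sketch only: wraps the quoted host/device query for the maximum
// dynamic shared memory size in one helper.
template <typename KernelT>
CUB_RUNTIME_FUNCTION cudaError_t max_dynamic_smem_for(KernelT kernel, int& max_dynamic_smem_size)
{
  NV_IF_ELSE_TARGET(
    NV_IS_HOST,
    ({
      // Host path: opt-in per-block limit minus the driver-reserved share.
      int curr_device{};
      if (const auto error = CubDebug(cudaGetDevice(&curr_device)))
      {
        return error;
      }
      int max_smem_size_optin{};
      if (const auto error = CubDebug(
            cudaDeviceGetAttribute(&max_smem_size_optin, cudaDevAttrMaxSharedMemoryPerBlockOptin, curr_device)))
      {
        return error;
      }
      int reserved_smem_size{};
      if (const auto error = CubDebug(
            cudaDeviceGetAttribute(&reserved_smem_size, cudaDevAttrReservedSharedMemoryPerBlock, curr_device)))
      {
        return error;
      }
      max_dynamic_smem_size = max_smem_size_optin - reserved_smem_size;
    }),
    ({
      // Device path: query the kernel's attributes directly.
      cudaFuncAttributes func_attrs{};
      if (const auto error = CubDebug(cudaFuncGetAttributes(&func_attrs, kernel)))
      {
        return error;
      }
      max_dynamic_smem_size = func_attrs.maxDynamicSharedSizeBytes;
    }))
  return cudaSuccess;
}
```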
```cuda
auto* d_in_unwrapped  = THRUST_NS_QUALIFIER::unwrap_contiguous_iterator(d_in);
auto* d_out_unwrapped = THRUST_NS_QUALIFIER::unwrap_contiguous_iterator(d_out);
```
Note to self, no change requested: we should really move this to to_address.
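A sketch of that follow-up, assuming d_in and d_out remain contiguous iterators as in the quoted lines; cuda::std::to_address would take the place of the Thrust unwrap helper.

```cuda
// Sketch only: obtain the raw pointers via cuda::std::to_address instead of
// THRUST_NS_QUALIFIER::unwrap_contiguous_iterator.
auto* d_in_unwrapped  = ::cuda::std::to_address(d_in);
auto* d_out_unwrapped = ::cuda::std::to_address(d_out);
```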
```cuda
REQUIRE(all_results_correct == true);

// Copy over the results and expected results to host and compare
#if false
```
Question: Should this be enabled?
It's just a debug print utility in case of failing tests. I'm leaning towards dropping this.
/ok to test
4 similar comments
```cuda
int warpIsPrivSum = 0;
NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsPrivSum = __reduce_or_sync(~0, laneIsPrivSum);))
```
@ahendriksen this is unused, did we accidentally drop something?
Some code is/was left behind to support decoupled lookback, which has cumSum states in tmp_states in addition to just privSum. See the commented-out lines starting with `// We are not storing CUM_SUM states, because it makes updating idxTileCur` below.
Since we are fairly confident that we will only need the privSum states, we can drop warpIsCumSum, and I think we can also drop warpIsPrivSum (as we are using warpIsEmpty below, which gives all the necessary information).
```cuda
int warpIsEmpty = 0;
NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsEmpty = __reduce_or_sync(~0, laneIsEmpty);))
int warpIsCumSum = 0;
NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsCumSum = __reduce_or_sync(~0, laneIsCumSum);))
```
Important: This is technically UB, because the bitwise reduce functions take an unsigned input
Suggested change:
```diff
- int warpIsEmpty = 0;
- NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsEmpty = __reduce_or_sync(~0, laneIsEmpty);))
- int warpIsCumSum = 0;
- NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsCumSum = __reduce_or_sync(~0, laneIsCumSum);))
+ unsigned warpIsEmpty = 0;
+ NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsEmpty = __reduce_or_sync(~0, laneIsEmpty);))
+ unsigned warpIsCumSum = 0;
+ NV_IF_TARGET(NV_PROVIDES_SM_80, (warpIsCumSum = __reduce_or_sync(~0, laneIsCumSum);))
```
```cuda
_CCCL_GLOBAL_CONSTANT SquadDesc squadReduce{/*squadIdx=*/0, /*numWarps=*/4};
_CCCL_GLOBAL_CONSTANT SquadDesc squadScanStore{/*squadIdx=*/1, /*numWarps=*/4};
_CCCL_GLOBAL_CONSTANT SquadDesc squadLoad{/*squadIdx=*/2, /*numWarps=*/1};
_CCCL_GLOBAL_CONSTANT SquadDesc squadSched{/*squadIdx=*/3, /*numWarps=*/1};
_CCCL_GLOBAL_CONSTANT SquadDesc squadLookback{/*squadIdx=*/4, /*numWarps=*/1};

_CCCL_GLOBAL_CONSTANT SquadDesc scanSquads[] = {
  squadReduce,
  squadScanStore,
  squadLoad,
  squadSched,
  squadLookback,
};
```
I believe we should have a make_squads(int...) that returns effectively scanSquads. We should then be able to name the individual array members via references.
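A minimal sketch of that idea, assuming SquadDesc can be aggregate-initialized from {squadIdx, numWarps} as in the quoted snippet; make_squads is a hypothetical helper and the reference bindings stand in for the PR's _CCCL_GLOBAL_CONSTANT variables.

```cuda
#include <cuda/std/array>

// Sketch only: build the squad array from warp counts, then name its members
// via constexpr references instead of keeping a second set of globals.
template <typename... WarpCounts>
constexpr auto make_squads(WarpCounts... numWarps)
{
  int squadIdx = 0;
  // Braced-init-list elements are evaluated left to right, so squadIdx++
  // hands out consecutive squad indices.
  return ::cuda::std::array<SquadDesc, sizeof...(WarpCounts)>{SquadDesc{squadIdx++, numWarps}...};
}

inline constexpr auto scanSquads = make_squads(4, 4, 1, 1, 1);
inline constexpr const SquadDesc& squadReduce    = scanSquads[0];
inline constexpr const SquadDesc& squadScanStore = scanSquads[1];
inline constexpr const SquadDesc& squadLoad      = scanSquads[2];
inline constexpr const SquadDesc& squadSched     = scanSquads[3];
inline constexpr const SquadDesc& squadLookback  = scanSquads[4];
```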
```cuda
  const uint32_t laneIdx;
};

[[nodiscard]] _CCCL_DEVICE_API inline SpecialRegisters getSpecialRegisters()
```
Sorry, I actually meant cudax, so that we can have something that can evolve
Force-pushed eecd5da to 7c44978
/ok to test 7c44978
Force-pushed b9d90f5 to 1e31dc0
Force-pushed c668411 to a51eaed
/ok to test 2ec0602
😬 CI Workflow Results: 🟥 Finished in 2h 46m. Pass: 54%/267 | Total: 5d 07h | Max: 2h 21m | Hits: 73%/210590. See results here.
Force-pushed f45e745 to 08f0046
* MSVC does not like designated initializer
* Only instantiate kernel for SM100 for now until we decide whether a non-work-stealing implementation is worth it
* Disable warpspeed in test
* Make alignment test work
* Fix use_warpspeed in test policy
* Apply suggestions from code review
* Fix formatting
* Drop strange line
* Fix nodiscard issue
* Try to work around clang-cuda issue with __reduce_or_sync only being available with SM80
* Fix NV_IF_TARGET mishap
Force-pushed fd32480 to 345f260
Check single stage SMEM consumption at compile-time (see merge request CCCL/cccl-mirror!57)
Use the input tile SMEM for staging the output (see merge request CCCL/cccl-mirror!58)
This was a typo by Allard.
Avoid reading garbage in first tile (see merge request CCCL/cccl-mirror!61)
WIP
cub.bench.scan.exclusive.sum.base on B200:
Fixes: #6644