@@ -35,9 +35,9 @@ Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's d
| ✔️| ✔️| ✔️| ✔️|
| Copy Async| Tile MMA (More Threads)| Tile Warp (More Values)| Multi Stages (2/3/4)|
| ✔️| ✔️| ✔️| ✔️|
- | Reg Double Buffers| Block Swizzle| Warp Swizzle| SMEM Swizzle (CuTe)|
+ | Reg Double Buffers| Block Swizzle| Warp Swizzle| SMEM Swizzle (CuTe/MMA)|
| ✔️| ✔️| ✔️| ✔️|
- | Collective Store (Warp Shfl)| Row Major (NN)| Col Major (TN)| SGEMM FP32/TF32|
+ | Collective Store (Shfl)| Row Major (NN)| Col Major (TN)| SGEMM FP32/TF32|
| ✔️| ✔️| ✔️| ✔️|
@@ -48,7 +48,7 @@ I have also implemented **FlashAttention-2** using pure MMA PTX instructions, wh
| Tensor Cores| Loop over Seqlen/Headdim| Tile Block (Br, Bc)| MMA (m16n8k16)|
| :---:| :---:| :---:| :---:|
| ✔️| ✔️| ✔️| ✔️|
- | Pack LDST (128 bits)| SMEM Padding| Copy Async| Tile MMA (More Threads)|
+ | Pack LDST (128 bits)| SMEM **Swizzle**/Padding| Copy Async| Tile MMA (More Threads)|
| ✔️| ✔️| ✔️| ✔️|
| Tile Warp (More Values)| Multi Stages (1/2)| Collective Store (Shfl)| **Split KV/Q**|
| ✔️| ✔️| ✔️| ✔️|
@@ -160,7 +160,6 @@ The kernels listed here will guide you through a step-by-step progression, rangi
| 📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |
| :---| :---| :---| :---| :---|
- | ✔️ [nsys/ncu(timeline/ptx/sass)](./kernels/nvidia-nsight/) | /| /| [link](./kernels/nvidia-nsight/) | ⭐️|
| ✔️ [elementwise_f32](./kernels/elementwise/elementwise.cu) | f32| /| [link](./kernels/elementwise/) | ⭐️|
| ✔️ [elementwise_f32x4](./kernels/elementwise/elementwise.cu) | f32| /| [link](./kernels/elementwise/) | ⭐️|
| ✔️ [elementwise_f16](./kernels/elementwise/elementwise.cu) | f16| /| [link](./kernels/elementwise/) | ⭐️|
@@ -205,27 +204,27 @@ The kernels listed here will guide you through a step-by-step progression, rangi
| ✔️ [mat_trans_f32_diagonal2d](./kernels/mat-transpose/mat_transpose.cu) | f32| /| [link](./kernels/mat-transpose/) | ⭐️⭐️|
| ✔️ [mat_trans_f32x4_col2row{2d}](./kernels/mat-transpose/mat_transpose.cu) | f32| /| [link](./kernels/mat-transpose/) | ⭐️⭐️|
| ✔️ [mat_trans_f32x4_row2col{2d}](./kernels/mat-transpose/mat_transpose.cu) | f32| /| [link](./kernels/mat-transpose/) | ⭐️⭐️|
- | ✔️ [warp_reduce_[all]](./kernels/reduce/block_all_reduce.cu) | all| all| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_f32_f32](./kernels/reduce/block_all_reduce.cu) | f32| f32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_f32x4_f32](./kernels/reduce/block_all_reduce.cu) | f32| f32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_f16_f16](./kernels/reduce/block_all_reduce.cu) | f16| f16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_f16_f32](./kernels/reduce/block_all_reduce.cu) | f16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_f16x2_f16](./kernels/reduce/block_all_reduce.cu) | f16| f16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_f16x2_f32](./kernels/reduce/block_all_reduce.cu) | f16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_f16x8_pack_f16](./kernels/reduce/block_all_reduce.cu) | f16| f16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_f16x8_pack_f32](./kernels/reduce/block_all_reduce.cu) | f16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_bf16_bf16](./kernels/reduce/block_all_reduce.cu) | bf16| bf16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_bf16_f32](./kernels/reduce/block_all_reduce.cu) | bf16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_bf16x2_bf16](./kernels/reduce/block_all_reduce.cu) | bf16| bf16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_bf16x2_f32](./kernels/reduce/block_all_reduce.cu) | bf16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_bf16x8_pack_bf16](./kernels/reduce/block_all_reduce.cu) | bf16| bf16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_bf16x8_pack_f32](./kernels/reduce/block_all_reduce.cu) | bf16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_fp8_e4m3_f16](./kernels/reduce/block_all_reduce.cu) | fp8_e4m3| f16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_fp8_e5m2_f16](./kernels/reduce/block_all_reduce.cu) | fp8_e5m2| f16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_fp8_e4m3x16_pack_f16](./kernels/reduce/block_all_reduce.cu) | fp8_e4m3| f16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_fp8_e5m2x16_pack_f16](./kernels/reduce/block_all_reduce.cu) | fp8_e5m2| f16| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_i8_i32](./kernels/reduce/block_all_reduce.cu) | i8| i32| [link](./kernels/reduce/) | ⭐️⭐️|
- | ✔️ [reduce_i8x16_pack_i32](./kernels/reduce/block_all_reduce.cu) | i8| i32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [warp_reduce_{all}](./kernels/reduce/block_all_reduce.cu) | all| all| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_f32_f32](./kernels/reduce/block_all_reduce.cu) | f32| f32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_f32x4_f32](./kernels/reduce/block_all_reduce.cu) | f32| f32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_f16_f16](./kernels/reduce/block_all_reduce.cu) | f16| f16| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_f16_f32](./kernels/reduce/block_all_reduce.cu) | f16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_f16x2_f16](./kernels/reduce/block_all_reduce.cu) | f16| f16| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_f16x2_f32](./kernels/reduce/block_all_reduce.cu) | f16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_f16x8_pack_f16](./kernels/reduce/block_all_reduce.cu) | f16| f16| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_f16x8_pack_f32](./kernels/reduce/block_all_reduce.cu) | f16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_bf16_bf16](./kernels/reduce/block_all_reduce.cu) | bf16| bf16| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_bf16_f32](./kernels/reduce/block_all_reduce.cu) | bf16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_bf16x2_bf16](./kernels/reduce/block_all_reduce.cu) | bf16| bf16| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_bf16x2_f32](./kernels/reduce/block_all_reduce.cu) | bf16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_bf16x8_pack_bf16](./kernels/reduce/block_all_reduce.cu) | bf16| bf16| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_bf16x8_pack_f32](./kernels/reduce/block_all_reduce.cu) | bf16| f32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_fp8_e4m3_f16](./kernels/reduce/block_all_reduce.cu) | fp8_e4m3| f16| [link](./kernels/reduce/) | ⭐️⭐️⭐️|
+ | ✔️ [block_all_reduce_fp8_e5m2_f16](./kernels/reduce/block_all_reduce.cu) | fp8_e5m2| f16| [link](./kernels/reduce/) | ⭐️⭐️⭐️|
+ | ✔️ [block_all_reduce_fp8_e4m3x16_pack_f16](./kernels/reduce/block_all_reduce.cu) | fp8_e4m3| f16| [link](./kernels/reduce/) | ⭐️⭐️⭐️|
+ | ✔️ [block_all_reduce_fp8_e5m2x16_pack_f16](./kernels/reduce/block_all_reduce.cu) | fp8_e5m2| f16| [link](./kernels/reduce/) | ⭐️⭐️⭐️|
+ | ✔️ [block_all_reduce_i8_i32](./kernels/reduce/block_all_reduce.cu) | i8| i32| [link](./kernels/reduce/) | ⭐️⭐️|
+ | ✔️ [block_all_reduce_i8x16_pack_i32](./kernels/reduce/block_all_reduce.cu) | i8| i32| [link](./kernels/reduce/) | ⭐️⭐️|
| ✔️ [dot_product_f32](./kernels/dot-product/dot_product.cu) | f32| f32| [link](./kernels/dot-product/) | ⭐️⭐️|
| ✔️ [dot_product_f32x4](./kernels/dot-product/dot_product.cu) | f32| f32| [link](./kernels/dot-product/) | ⭐️⭐️|
| ✔️ [dot_product_f16_f32](./kernels/dot-product/dot_product.cu) | f16| f32| [link](./kernels/dot-product/) | ⭐️⭐️|
@@ -262,7 +261,8 @@ The kernels listed here will guide you through a step-by-step progression, rangi
| ✔️ [rms_norm_f16x8_pack_f32](./kernels/rms-norm/rms_norm.cu) | f16| f32| [link](./kernels/rms-norm/) | ⭐️⭐️|
| ✔️ [rms_norm_f16_f32](./kernels/rms-norm/rms_norm.cu) | f16| f32| [link](./kernels/rms-norm/) | ⭐️⭐️|
| ✔️ [nms_f32](./kernels/nms/nms.cu) | f32| /| [link](./kernels/nms) | ⭐️⭐️|
- | ✔️ [notes v1(deprecated)](./kernels/notes-v1.cu) | f32| f32| /| ⭐️|
+ | ✔️ [notes v1(deprecated)](./kernels/notes-v1.cu) | f32| f32| /| ⭐️⭐️|
+ | ✔️ [How to profile with nsys/ncu(timeline/ptx/sass)](./kernels/nvidia-nsight/) | /| /| [link](./kernels/nvidia-nsight/) | ⭐️⭐️|
### 📚 Hard ⭐️⭐️⭐️ ([©️back👆🏻](#cuda-kernel))
@@ -284,7 +284,7 @@ The kernels listed here will guide you through a step-by-step progression, rangi
| ✔️ [sgemm_t_8x8_sliced_k16...dbuf](./kernels/sgemm/sgemm_async.cu) | f32| f32| [link](./kernels/sgemm/) | ⭐️⭐️⭐️|
| ✔️ [sgemm_t_8x8_sliced_k16...async](./kernels/sgemm/sgemm_async.cu) | f32| f32| [link](./kernels/sgemm/) | ⭐️⭐️⭐️|
| ✔️ [sgemm_wmma_m16n16k8...stages*](./kernels/sgemm/sgemm_wmma_tf32_stage.cu) | tf32| f32| [link](./kernels/sgemm/) | ⭐️⭐️⭐️|
- | ✔️ [sgemm_wmma_m16n16k8...swizzle*](./kernels/sgemm/sgemm_wmma_tf32_stage.cu) | tf32| f32| [link](./kernels/sgemm/) | ⭐️⭐️⭐️|
+ | ✔️ [sgemm_wmma_m16n16k8...swizzle{+block}*](./kernels/sgemm/sgemm_wmma_tf32_stage.cu) | tf32| f32| [link](./kernels/sgemm/) | ⭐️⭐️⭐️|
| ✔️ [hgemm_naive_f16](./kernels/hgemm/naive/hgemm.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️|
| ✔️ [hgemm_sliced_k_f16](./kernels/hgemm/naive/hgemm.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
| ✔️ [hgemm_t_8x8_sliced_k_f16x4](./kernels/hgemm/hgemm.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
@@ -299,12 +299,13 @@ The kernels listed here will guide you through a step-by-step progression, rangi
| ✔️ [hgemm_wmma_m16n16k16...dbuf*](./kernels/hgemm/wmma/hgemm_wmma.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m32n8k16....dbuf*](./kernels/hgemm/wmma/hgemm_wmma.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
| ✔️ [hgemm_wmma_m16n16k16...stages*](./kernels/hgemm/wmma/hgemm_wmma_stage.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
- | ✔️ [hgemm_wmma_m16n16k16...swizzle*](./kernels/hgemm/wmma/hgemm_wmma_stage.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
+ | ✔️ [hgemm_wmma_m16n16k16...swizzle{+block}*](./kernels/hgemm/wmma/hgemm_wmma_stage.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...naive*](./kernels/hgemm/mma/hgemm_mma.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...mma2x4*](./kernels/hgemm/mma/hgemm_mma.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
| ✔️ [hgemm_mma_m16n8k16...stages*](./kernels/hgemm/mma/hgemm_mma_stage.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
- | ✔️ [hgemm_mma_m16n8k16...swizzle*](./kernels/hgemm/mma/hgemm_mma_stage.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
- | ✔️ [hgemm_mma_stages{swizzle}...cute*](./kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
+ | ✔️ [hgemm_mma_m16n8k16...swizzle{+block}*](./kernels/hgemm/mma/hgemm_mma_stage.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
+ | ✔️ [hgemm_mma_m16n8k16...swizzle{+smem}*](./kernels/hgemm/mma/hgemm_mma_stage_swizzle.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
+ | ✔️ [hgemm_mma_stages_swizzle{+smem}...cute*](./kernels/hgemm/cutlass/hgemm_mma_stage_tn_cute.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️⭐️|
| ✔️ [hgemm_mma_cublas*](./kernels/hgemm/cublas/hgemm_cublas.cu) | f16| f16| [link](./kernels/hgemm/) | ⭐️⭐️|
### 📚 Hard+ ⭐️⭐️⭐️⭐️ & Hard++ ⭐️⭐️⭐️⭐️⭐️ ([©️back👆🏻](#cuda-kernel))
@@ -318,11 +319,14 @@ The kernels listed here will guide you through a step-by-step progression, rangi
| ✔️ [flash_attn_mma_stages...shared_kv*](./kernels/flash-attn/mma/flash_attn_mma_share_kv.cu) | f16| f16| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️⭐️|
| ✔️ [flash_attn_mma_stages...shared_qkv*](./kernels/flash-attn/mma/flash_attn_mma_share_qkv.cu) | f16| f16| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️⭐️|
| ✔️ [flash_attn_mma_stages...tiling_qk*](./kernels/flash-attn/mma/flash_attn_mma_tiling_qk.cu) | f16| f16| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️⭐️|
+ | ✔️ [flash_attn_mma...tiling_qk_swizzle{+smem}*](./kernels/flash-attn/mma/flash_attn_mma_tiling_qk_swizzle.cu) | f16| f16| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️⭐️|
| ? [flash_attn_mma_stages_split_kv{f32}*](./kernels/flash-attn/mma/flash_attn_mma_split_kv_acc_f32.cu) | f16| f32| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️|
| ? [flash_attn_mma_stages_split_q{f32}*](./kernels/flash-attn/mma/flash_attn_mma_split_q_acc_f32.cu) | f16| f32| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️|
| ? [flash_attn_mma_stages...shared_kv{f32}*](./kernels/flash-attn/mma/flash_attn_mma_share_kv_acc_f32.cu) | f16| f32| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️⭐️|
| ? [flash_attn_mma_stages...shared_qkv{f32}*](./kernels/flash-attn/mma/flash_attn_mma_share_qkv_acc_f32.cu) | f16| f32| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️⭐️|
| ? [flash_attn_mma_stages...tiling_qk{f32}*](./kernels/flash-attn/mma/flash_attn_mma_tiling_qk_acc_f32.cu) | f16| f32| [link](./kernels/flash-attn) | ⭐️⭐️⭐️⭐️⭐️|
+ | ✔️ [How to implement MMA smem swizzle*](./kernels/swizzle/mma_simple_swizzle.cu) | f16| f16| [link](./kernels/swizzle) | ⭐️⭐️⭐️⭐️|
+
## 📖 Blog Contents
<div id="my-blogs-part-1"></div>