@@ -262,7 +262,7 @@ The kernels listed here will guide you through a step-by-step progression, rangi
| ✔️ [rms_norm_f16_f32](./kernels/rms-norm/rms_norm.cu)|f16|f32|[link](./kernels/rms-norm/)|⭐️⭐️|
| ✔️ [nms_f32](./kernels/nms/nms.cu)|f32|/|[link](./kernels/nms)|⭐️⭐️|
| ✔️ [notes v1(deprecated)](./kernels/notes-v1.cu)|f32|f32|/|⭐️⭐️|
- | ✔️ [How to profile with nsys/ncu(timeline/ptx/sass)](./kernels/nvidia-nsight/)|/|/|[link](./kernels/nvidia-nsight/)|⭐️⭐️|
+ | ✔️ [How to use nsys/ncu(timeline/ptx/sass)](./kernels/nvidia-nsight/)|/|/|[link](./kernels/nvidia-nsight/)|⭐️⭐️|
### 📚 Hard ⭐⭐⭐️ ([©️back👆🏻](#cuda-kernel))
@@ -327,12 +327,12 @@ The kernels listed here will guide you through a step-by-step progression, rangi
| ✔️ [flash_attn_mma...shared_qkv_swizzle{qkv}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_share_qkv_swizzle_qkv.cu)|f16|f16|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
| ✔️ [flash_attn_mma...tiling_qk_swizzle{q}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qk_swizzle_q.cu)|f16|f16|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
| ✔️ [flash_attn_mma...tiling_qk_swizzle{qk}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qk_swizzle_qk.cu)|f16|f16|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
- | ✔️ [flash_attn_mma...tiling_qk_swizzle{qkv}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qk_swizzle_qkv.cu)|f16|f16|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️⭐️|
+ | ✔️ [flash_attn_mma...tiling_qk_swizzle{qkv}*](./kernels/flash-attn/mma/swizzle/flash_attn_mma_tiling_qk_swizzle_qkv.cu)|f16|f16|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
| ? [flash_attn_mma_stages_split_q{f32}*](./kernels/flash-attn/mma/basic/flash_attn_mma_split_q_acc_f32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
- | ? [flash_attn_mma_stages...shared_kv{f32}*](./kernels/flash-attn/mma/basic/flash_attn_mma_share_kv_acc_f32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️⭐️|
- | ? [flash_attn_mma_stages...shared_qkv{f32}*](./kernels/flash-attn/mma/basic/flash_attn_mma_share_qkv_acc_f32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️⭐️|
- | ? [flash_attn_mma_stages...tiling_qk{f32}*](./kernels/flash-attn/mma/basic/flash_attn_mma_tiling_qk_acc_f32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️⭐️|
- | ✔️ [How to implement MMA smem swizzle*](./kernels/swizzle/mma_simple_swizzle.cu)|f16|f16|[link](./kernels/swizzle)|⭐️⭐️⭐️⭐️|
+ | ? [flash_attn_mma_stages...shared_kv{f32}*](./kernels/flash-attn/mma/basic/flash_attn_mma_share_kv_acc_f32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
+ | ? [flash_attn_mma_stages...shared_qkv{f32}*](./kernels/flash-attn/mma/basic/flash_attn_mma_share_qkv_acc_f32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
+ | ? [flash_attn_mma_stages...tiling_qk{f32}*](./kernels/flash-attn/mma/basic/flash_attn_mma_tiling_qk_acc_f32.cu)|f16|f32|[link](./kernels/flash-attn)|⭐️⭐️⭐️⭐️|
+ | ✔️ [How to implement MMA smem swizzle*](./kernels/swizzle/mma_simple_swizzle.cu)|f16|f16|[link](./kernels/swizzle)|⭐️⭐️⭐️|
## 📖 Blog Contents