Releases · xlite-dev/LeetCUDA
v2.4.1 Pack LayerNorm
What's Changed
- [Nsight] Add nsys/ncu usage, ptx/sass by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/44
- [DotProd][FP16] support f16x8_pack kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/45
- [LayerNorm][FP16] Add pack support for f16x8 LD/ST by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/46 (packed LD/ST sketch below)
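The f16x8 packing in #45/#46 is about moving 8 half values with a single 128-bit memory transaction instead of 8 scalar accesses. A minimal sketch of the idea on an elementwise add, assuming an `LDST128BITS`-style reinterpret macro like the one named in #41 and sm_53+ for half2 arithmetic (kernel name and macro definition here are illustrative, not the repo's exact code):

```cuda
#include <cuda_fp16.h>

// Treat 8 consecutive halves as one float4 so the compiler emits a single
// 128-bit load/store (assumed to mirror the LDST128BITS macro from #41).
#define LDST128BITS(value) (reinterpret_cast<float4*>(&(value))[0])

// Elementwise add as an example: each thread moves and processes 8 halves.
__global__ void elementwise_add_f16x8_pack(half* a, half* b, half* c, int N) {
  int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (idx + 7 < N) {
    float4 ra = LDST128BITS(a[idx]);   // one 128-bit load instead of 8 scalar loads
    float4 rb = LDST128BITS(b[idx]);
    float4 rc;
    half2* ha = reinterpret_cast<half2*>(&ra);
    half2* hb = reinterpret_cast<half2*>(&rb);
    half2* hc = reinterpret_cast<half2*>(&rc);
    #pragma unroll
    for (int i = 0; i < 4; ++i) hc[i] = __hadd2(ha[i], hb[i]);  // packed half2 math
    LDST128BITS(c[idx]) = rc;          // one 128-bit store
  }
}
```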
 
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4...v2.4.1
v2.4 Pack Reduce LDST
What's Changed
- [Reduce][Kernel] Pack f16/bf16x8 & fp8/i8x16 LD/ST by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/43 (packed reduce sketch below)
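The packed-reduce change in #43 applies the same 128-bit trick to reduction kernels: each thread pulls 8 halves (or 16 fp8/int8 bytes) per load before reducing. A rough sketch for the f16x8 case, accumulating in fp32; the macro, kernel name, and the shared-memory/atomic finish are illustrative assumptions, not the repo's exact structure:

```cuda
#include <cuda_fp16.h>

#define LDST128BITS(value) (reinterpret_cast<float4*>(&(value))[0])

// Butterfly warp reduction: after 5 steps every lane holds the warp-wide sum.
__device__ __forceinline__ float warp_reduce_sum(float v) {
  #pragma unroll
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_xor_sync(0xffffffff, v, offset);
  return v;
}

// Each thread loads 8 halves with one 128-bit access, accumulates in fp32,
// then the block combines warp sums through shared memory and atomically
// adds its result into *out.
__global__ void block_sum_f16x8_pack(half* x, float* out, int N) {
  int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
  float sum = 0.0f;
  if (idx + 7 < N) {
    float4 r = LDST128BITS(x[idx]);                // 8 halves per load
    half2* h = reinterpret_cast<half2*>(&r);
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
      float2 f = __half22float2(h[i]);             // widen to fp32 before summing
      sum += f.x + f.y;
    }
  }
  __shared__ float warp_sums[32];
  int lane = threadIdx.x % 32;
  int warp = threadIdx.x / 32;
  sum = warp_reduce_sum(sum);
  if (lane == 0) warp_sums[warp] = sum;
  __syncthreads();
  if (warp == 0) {
    int num_warps = (blockDim.x + 31) / 32;
    sum = (lane < num_warps) ? warp_sums[lane] : 0.0f;
    sum = warp_reduce_sum(sum);
    if (lane == 0) atomicAdd(out, sum);
  }
}
```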
 
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.3.1...v2.4
v2.3.1 f16x8 Pack Elementwise
What's Changed
- [FA2][Half] Add FA2 f16_mma_m16n8k16 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/35 (m16n8k16 MMA sketch below)
- [Refactor][7/N] CUDA Learn Notes refactor Part-7 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/36
- Clamped input range in Sigmoid kernel to prevent overflow by @Phoenix8215 in https://github.com/DefTruth/CUDA-Learn-Notes/pull/37
- [Sigmoid][F16] Add f16x8_pack kernel, boost ~1.5x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/39 (clamped sigmoid sketch below)
- [Elementwise][Half] support f16x8_pack kernel, boost 1.1x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/40
- [FlashAttention] replace FLOAT4 with LDST128BITS macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/41
- [RELU][FP16] Add f16x8_pack kernel, boost 2.1x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/42
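The FA2 kernel in #35 is built around the m16n8k16 tensor-core MMA with f16 accumulators. A minimal PTX wrapper for that instruction (SM80+); only the mnemonic and register counts are fixed by the PTX ISA, the wrapper name and fragment plumbing here are illustrative:

```cuda
#include <cstdint>

// One m16n8k16 tensor-core MMA, f16 in / f16 accumulate.
// Per thread: A fragment = 4 x b32 regs (8 halves), B = 2 x b32 (4 halves),
// C and D = 2 x b32 (4 halves each).
__device__ __forceinline__ void hmma_m16n8k16_f16(
    uint32_t& d0, uint32_t& d1,
    uint32_t a0, uint32_t a1, uint32_t a2, uint32_t a3,
    uint32_t b0, uint32_t b1,
    uint32_t c0, uint32_t c1) {
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
      : "=r"(d0), "=r"(d1)
      : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
        "r"(b0), "r"(b1), "r"(c0), "r"(c1));
}
```

The rest of such a kernel is the plumbing around this primitive: staging Q/K/V tiles in shared memory, loading per-thread fragments (typically via ldmatrix), and running the online-softmax rescaling loop.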
 
New Contributors
- @Phoenix8215 made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/37
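#37 and #39 combine two ideas: clamp the sigmoid input so the f16 exponential cannot overflow, and move 8 halves per 128-bit access. A sketch putting both together; the clamp bounds, macro names, and kernel name are illustrative assumptions and the repo's exact constants may differ:

```cuda
#include <cuda_fp16.h>

#define LDST128BITS(value) (reinterpret_cast<float4*>(&(value))[0])

// Illustrative clamp bounds: hexp() overflows half (max ~65504) once |x| goes
// much past 11, so the input is clamped before the f16 exponential.
#define MIN_EXP_F16 -11.0f
#define MAX_EXP_F16  11.0f

__global__ void sigmoid_f16x8_pack(half* x, half* y, int N) {
  int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (idx + 7 < N) {
    float4 r = LDST128BITS(x[idx]);                // 8 halves, one 128-bit load
    half* h  = reinterpret_cast<half*>(&r);
    half one = __float2half(1.0f);
    #pragma unroll
    for (int i = 0; i < 8; ++i) {
      float v = fminf(fmaxf(__half2float(h[i]), MIN_EXP_F16), MAX_EXP_F16);
      half e  = hexp(__hneg(__float2half(v)));     // exp(-x) stays finite in f16
      h[i]    = __hdiv(one, __hadd(one, e));       // 1 / (1 + exp(-x))
    }
    LDST128BITS(y[idx]) = r;                       // 8 halves, one 128-bit store
  }
}
```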
 
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.3...v2.3.1
v2.3 Refactor 6/N
What's Changed
- [Refactor][6/N] CUDA Learn Notes refactor Part-6 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/17
- [Refactor][5/N] CUDA Learn Notes refactor Part-6 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/18
- [LayerNorm][Half] support fp16x8 packed LayerNorm by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/19
- [Reduce][Half] add HALF2 & BFLOAT2 macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/21
- [RMSNorm][Half] support fp16x8 packed RMSNorm by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/22 (RMSNorm sketch below)
- [Bugfix][Kernel] fixed block-count calculation errors in some kernels by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/23
- [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/24
- [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/25
- [RELU][Half] support fp16x8 RELU kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/26
- [RMSNorm] support f16x8_f32 RMSNorm by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/28
- [RMSNorm][Kernel] Add FLOAT2/HALF2_VARIANCE macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/29
- [LayerNorm][Kernel] Add HALF2 SUM/SUB/VAR macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/30
- [HGEMM] Add sliced_k & t_8x8_sliced_k_f16x4 kernels by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/31
- [HGEMV][Half] support hgemv k32/k128/f16 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/32 (HGEMV sketch below)
- [FlashAttention] Refactor flash_attn_1_fwd_f32 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/33
- Bump up to v2.3 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/34
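The packed LayerNorm/RMSNorm work (#19, #22, #28, #29) follows the same pattern: 128-bit loads of 8 halves, statistics accumulated in fp32 (or via HALF2-style macros), then a packed scale on the way out. A warp-per-row RMSNorm sketch under the simplifying assumption K = 256 and a scalar gain g; names and layout are illustrative, not the repo's exact kernels:

```cuda
#include <cuda_fp16.h>

#define LDST128BITS(value) (reinterpret_cast<float4*>(&(value))[0])

// One warp per row; assumes K == 256 so 32 lanes x 8 halves cover the row.
// y[row] = x[row] * rsqrt(mean(x[row]^2) + eps) * g
__global__ void rms_norm_f16x8_pack(half* x, half* y, float g, float eps, int K) {
  int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
  int lane = threadIdx.x % 32;
  int idx  = row * K + lane * 8;

  float4 r = LDST128BITS(x[idx]);                  // 8 halves per lane, one load
  half2* h = reinterpret_cast<half2*>(&r);

  float sq_sum = 0.0f;
  #pragma unroll
  for (int i = 0; i < 4; ++i) {
    float2 f = __half22float2(h[i]);               // accumulate x^2 in fp32
    sq_sum += f.x * f.x + f.y * f.y;
  }
  #pragma unroll
  for (int offset = 16; offset > 0; offset >>= 1)  // warp-wide sum of squares
    sq_sum += __shfl_xor_sync(0xffffffff, sq_sum, offset);

  float scale = rsqrtf(sq_sum / (float)K + eps) * g;
  half2 s2 = __float2half2_rn(scale);
  #pragma unroll
  for (int i = 0; i < 4; ++i) h[i] = __hmul2(h[i], s2);  // packed scale
  LDST128BITS(y[idx]) = r;                         // 8 halves per lane, one store
}
```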
 
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.2...v2.3
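For the HGEMV kernels in #32, the k32 flavor can be pictured as one warp per output row, with lanes striding the row 32 columns at a time and a shuffle reduction at the end. A sketch under that assumption; the kernel name and signature are illustrative, and the k128/f16 variants additionally vectorize the loads:

```cuda
#include <cuda_fp16.h>

// y = A * x with A row-major MxK in half. One warp per output row; each lane
// strides the row 32 columns at a time (the "k32" flavor assumes K % 32 == 0).
__global__ void hgemv_k32_f16(half* A, half* x, half* y, int M, int K) {
  int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
  int lane = threadIdx.x % 32;
  if (row >= M) return;

  float acc = 0.0f;
  for (int k = lane; k < K; k += 32)               // lanes read adjacent columns
    acc += __half2float(A[row * K + k]) * __half2float(x[k]);

  #pragma unroll
  for (int offset = 16; offset > 0; offset >>= 1)  // reduce the partial dot product
    acc += __shfl_xor_sync(0xffffffff, acc, offset);

  if (lane == 0) y[row] = __float2half(acc);
}
```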
v2.2 Refactor 5/N
What's Changed
- [Refactor][5/N] CUDA Learn Notes refactor Part-5 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/15
- Bump up to v2.2 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/16
 
Full Changelog: DefTruth/CUDA-Learn-Notes@2.1...v2.2
v2.1 Refactor 4/N Part-4
What's Changed
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/10
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/11
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/12
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/13
- [Refactor][4/N] CUDA Learn Notes refactor Part-4 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/14
 
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.0...2.1
v2.0 Refactor 4/N
Full Changelog: DefTruth/CUDA-Learn-Notes@v0.8...v2.0
v0.8
What's Changed
- Bump up to v0.8 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/9
 
Full Changelog: DefTruth/CUDA-Learn-Notes@v0.7...v0.8
CUDA Learn Notes v0.7
What's Changed
- Bump up to v0.7 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/8
 
New Contributors
- @DefTruth made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/8
 
Full Changelog: DefTruth/CUDA-Learn-Notes@v0.6...v0.7
CUDA Learn Notes v0.5
Full Changelog: DefTruth/CUDA-Learn-Notes@v0.3...v0.5