support deepseek-v3 #9878
Closed
Conversation
update 0113
support head_dim=192,256 for append_attn c16
attention run
refine code
add softmax_scale
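For the softmax_scale commit: a minimal NumPy sketch of attention with an explicit scale instead of the implicit 1/sqrt(head_dim). Shapes and the example scale value are illustrative assumptions, not the append_attn kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, softmax_scale=None):
    # q, k: [seq, qk_head_dim]; v: [seq, v_head_dim] (the two head dims may differ)
    if softmax_scale is None:
        softmax_scale = 1.0 / np.sqrt(q.shape[-1])   # default scaling
    scores = (q @ k.T) * softmax_scale               # scaled attention logits
    return softmax(scores) @ v

q = np.random.randn(8, 192).astype("float32")
k = np.random.randn(8, 192).astype("float32")
v = np.random.randn(8, 128).astype("float32")
out = attention(q, k, v, softmax_scale=0.0838)       # explicit, externally supplied scale
print(out.shape)  # (8, 128)
```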
support weight_only_int8
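Weight-only int8 here means the weights are stored as int8 with a per-output-channel scale while activations stay in fp16/bf16; a rough sketch under that assumption (function names are illustrative, not the fused kernel):

```python
import numpy as np

def quantize_weight_only_int8(w):
    # w: [in_features, out_features] float weights
    scale = np.abs(w).max(axis=0) / 127.0                     # per-output-channel scale
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float32)

def wint8_matmul(x, w_int8, scale):
    # dequantize on the fly; a fused kernel would do this inside the GEMM
    return x @ (w_int8.astype(np.float32) * scale)

w = np.random.randn(256, 512).astype(np.float32)
x = np.random.randn(4, 256).astype(np.float32)
w_q, s = quantize_weight_only_int8(w)
print("max abs error:", np.abs(wint8_matmul(x, w_q, s) - x @ w).max())
```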
refine code
support tp
delete test_append_attn
add split fused_moe from ziyuan
add deepseek-v3 class
fix rope for deepseek-v3
fix wint8 precision and refine code
fix wint4, big diff
add e_score_correction_bias
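e_score_correction_bias follows DeepSeek-V3's auxiliary-loss-free routing: a per-expert bias is added to the router scores for expert selection only, while the gating weights come from the unbiased scores. A simplified NumPy sketch (no grouped or node-limited routing):

```python
import numpy as np

def route(scores, e_score_correction_bias, top_k=8):
    # scores: [num_tokens, num_experts] sigmoid router outputs
    biased = scores + e_score_correction_bias                       # bias used for selection only
    topk_idx = np.argsort(-biased, axis=-1)[:, :top_k]              # experts picked by biased score
    topk_weight = np.take_along_axis(scores, topk_idx, axis=-1)     # weights from unbiased score
    topk_weight = topk_weight / topk_weight.sum(axis=-1, keepdims=True)  # renormalize
    return topk_idx, topk_weight

scores = 1.0 / (1.0 + np.exp(-np.random.randn(4, 64)))              # sigmoid scores
bias = np.zeros(64, dtype=np.float32)                               # e_score_correction_bias
idx, w = route(scores, bias, top_k=8)
print(idx.shape, w.shape)  # (4, 8) (4, 8)
```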
fix head_dim
fix v3 verify
[AutoParallel] open tensor_fusion for benchmark (#9749)
fix loraga merge (#9765)
fix loraga merge
change sign
Fix ernie ci auto trainer error (#9758)
[AutoParallel]: fix ernie auto_trainer error
Update run_pretrain_auto.py
Update README.md (#9766)
[BugFix] Fix matryoshka norm loss (#9774)
[Distributed] support fuse optimizer (#9519) (#9777)
Update register_sequence_parallel_allreduce_hooks (#9782)
fix sequence parallel
update register_sequence_parallel_allreduce_hooks
update fuse_sequence_parallel_allreduce
Fix ce error (#9783)
[AutoParallel]: fix ci error
fix (#9779)
[MoE] fix expert parallel (#9760)
fix dpo pp criterion (#9786)
[Infer] Add pir_model path for server infer. (#9790)
fix d2s
fix v3 verify
support qk_head_dim != v_head_dim
support fp8 batch gemm on cutlass3.x
upgrade cutlass version for block_wise fp8 gemm
change cutlass commit to ckl117 group_wise branch
support fp8 block gemm (private cutlass commit); TODO: update fp8 dual gemm api on cutlass3.x
support auto tune fp8 block gemm code
update cutlass to v3.7.0, todo: support block gemm based on v3.7.0
support block gemm on cutlass v3.7.0 commit
code check
check dynamic_quant
add block builder dir
rename group_quant
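group_quant / dynamic_quant refer to one scale per group of elements (e.g. 128) rather than one per tensor; a minimal sketch targeting an FP8 E4M3 range, with the group size and names as assumptions:

```python
import numpy as np

FP8_E4M3_MAX = 448.0

def group_quant(x, group_size=128):
    # x: [rows, cols], cols divisible by group_size
    rows, cols = x.shape
    g = x.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / FP8_E4M3_MAX   # one scale per group
    q = np.clip(g / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)            # values now fit the fp8 range
    return q.reshape(rows, cols), scale.squeeze(-1)                # (kept in float here, not real fp8)

x = np.random.randn(4, 512).astype(np.float32)
q, scales = group_quant(x)
print(q.shape, scales.shape)  # (4, 512) (4, 4)
```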
fix wint8 v_head_dim
fix rope
fix qwen2
mla use position_ids only
remove control flow
remove gpu concat
fix norm weight dtype
remove all_reduce in fused_moe
partial support for fp8
check group_quant and fake fp8
check
support block gemm
[LLM] support flash device on static model (#9619) (#9787)
[LLM] support flash device on static model
[LLM] adapt pdc sdk
[LLM Benchmark]update scripts (#9722)
add no_proxy & del paddlenlp_ops
update timeout for dpo
fix sequence_parallel
add timeout
add Total_Tokens_per_second_per_gpu
fix Tokens_per_second_per_gpu
update Total_Tokens_per_second_per_gpu
mergekit gpu 1226 (#9702)
mergekit gpu 1226
merge model gpu
merge gpu
add lora model
change valueerror
add lora
gpu test
[LLM] merge code from fastdeploy (#9791)
[LLM] update llm server dockerfiles
merge code from fastdeploy
[Inference] Support eagle for llama (#9812)
[CI] Fix ci of small models (#9633)
[Trainer] Wrap model when lora is ON and only do evaluation. (#9803)
[README] Update README.md for documentation (#9785)
Update README.md
Update README.md
Update README_en.md
fix static run
wint8 and fake-fp8; TODO: support the case where data types do not match
support fp8, but ffn1 and moe in wint8
support ffn1 fp8 block gemm
done ffn1 fp8 block gemm
block gemm done
block gemm support batch
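The block gemm commits describe a GEMM whose weights carry one scale per block (e.g. 128x128), dequantized inside the kernel. A reference NumPy loop of that contraction; the block size and layout are assumptions, and the real kernel is a CUTLASS 3.x fp8 GEMM:

```python
import numpy as np

BLK = 128  # block size along both the k and n dimensions

def block_scaled_gemm(x, w_q, w_scale):
    # x: [m, k] activations; w_q: [k, n] quantized weights;
    # w_scale: [k // BLK, n // BLK], one scale per (BLK x BLK) weight block
    m, k = x.shape
    n = w_q.shape[1]
    out = np.zeros((m, n), dtype=np.float32)
    for kb in range(0, k, BLK):
        for nb in range(0, n, BLK):
            scale = w_scale[kb // BLK, nb // BLK]
            w_blk = w_q[kb:kb + BLK, nb:nb + BLK].astype(np.float32) * scale  # dequant one block
            out[:, nb:nb + BLK] += x[:, kb:kb + BLK] @ w_blk
    return out

x = np.random.randn(4, 256).astype(np.float32)
w_q = np.random.randint(-127, 128, size=(256, 256)).astype(np.int8)
w_scale = np.random.rand(2, 2).astype(np.float32)
print(block_scaled_gemm(x, w_q, w_scale).shape)  # (4, 256)
```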
refine rope code
compute position_ids use custom op
fix split_param (#9817)
[LLM] Update model convert and fix TP for deepseekv3 (#9797)
fix model convert and tp in MoEMLP
fix tp_action filter
update convert according to num_nextn_predict_layers
add deepseek-R1
fuse rope
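For the rope commits (refine rope code, compute position_ids use custom op, fuse rope): rotary position embedding rotates query/key feature pairs by angles derived from position_ids. A plain NumPy reference, not the fused custom op:

```python
import numpy as np

def rope(x, position_ids, base=10000.0):
    # x: [seq, head_dim] with head_dim even; position_ids: [seq]
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)           # per-pair rotation frequencies
    angles = position_ids[:, None] * inv_freq[None, :]     # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)

q = np.random.randn(16, 64).astype(np.float32)
position_ids = np.arange(16)                               # what the custom op would compute
print(rope(q, position_ids).shape)  # (16, 64)
```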
fix macro
fix mixtral
set_state_dict block_wise weight
support fp8 per-tensor network; scale Tensor for tensor gemm not yet supported
deepseek-v3 fp8 tensor gemm network, but precision is wrong
add triton fp8 fused_moe kernel
fix moe triton kernel
add moe triton kernel
fix
fix fp8 block gemm precision
moe triton fp8 network
support moe triton with correct precision, but shared ffn1/ffn2 are incorrect
fp8 block network; shared ffn1/ffn2 in v2-lite not yet checked
delete wint8 in fake
delete some useless code and verify the per-tensor net on qkv, out_linear, ffn1, ffn2, but triton moe does not match the api
fp8 block quant when load model, and code check
fix tokenizer and qwen
[AutoParallel] add sharding tensor_fusion save load switch (#9810)
support tensor_fusion save load
apply suggestions from code review
Fix handling of abnormal exits in multi-node benchmark tasks (#9651)
Fix handling of abnormal exits in multi-node benchmark tasks
fix bug
update
Fix LLAMA arg parsing bug in pp (#9806)
[Readme] Update mixtral.md (#9829)
[XPU] Support empty_cache on XPUs (#9789)
[XPU] Support empty_cache on XPUs
warn if current device doesn't support
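A hedged sketch of the guard described above; paddle.device.cuda.empty_cache is an existing Paddle API, while the XPU branch and the helper name are assumptions for illustration:

```python
import warnings
import paddle

def empty_device_cache():
    """Release cached allocator memory if the current device supports it."""
    if paddle.is_compiled_with_cuda():
        paddle.device.cuda.empty_cache()
    elif paddle.is_compiled_with_xpu() and hasattr(getattr(paddle.device, "xpu", None), "empty_cache"):
        paddle.device.xpu.empty_cache()    # assumed XPU counterpart
    else:
        warnings.warn("empty_cache is not supported on the current device")
```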
[Inference] Fix multibatch inference (#9831)
fix batch infra
fix deepseekv2 infra
Fix position_ids for infra (#9841)
fix moe diff due to e_score_correction_bias
fix fast tokenizer
[LLM] Add pipeline and flashmask for Qwen2Moe and Deepseek (#9827)
add modeling_pp
add modeling_pp for qwen2moe
add flashmask and pp for Qwen2MoE and Deepseek
remove
fix fast_tokenizer save
update for topk_weight of noaux_tc
fix for flashmask
add use_expert_parallel for pretrain
fix tokenizer test
[Mergekit]update & add LoRA merge (#9811)
add
fix bug
fix
add lora merge
[Unified Checkpoint] Fix expert parallel (#9821)
fix expert parallel
fix split_param for expert parallel
add filter_sync_parameters
fix import
[Inference] Flask server compatible with OpenAI api. (#9828)
flask server compatible with OpenAI api.
fix max_length to max_tokens.
fix with think model.
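For the OpenAI-compatible Flask server: a minimal sketch of the idea, using the standard /v1/chat/completions route and the max_tokens field (the handler body is a placeholder, not PaddleNLP's server code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/v1/chat/completions", methods=["POST"])
def chat_completions():
    payload = request.get_json()
    messages = payload["messages"]
    max_tokens = payload.get("max_tokens", 128)    # OpenAI-style name, not max_length
    # ... run the model on `messages`, generating at most max_tokens tokens ...
    reply = "hello"                                # placeholder generation
    return jsonify({
        "object": "chat.completion",
        "choices": [{"index": 0,
                     "message": {"role": "assistant", "content": reply},
                     "finish_reason": "stop"}],
    })

if __name__ == "__main__":
    app.run(port=8080)
```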
[LLM] fix checkpoint save for non flash mode (#9830)
support mla for speculate
[DSK] support deepseek-v3/r1 (mha/fp16/bf16/wint8/wint4) (#9769)
support deepseek-v3
support head_dim=192,256 for append_attn c16
update 0113
attention run
refine code
add softmax_scale
support weight_only_int8
refine code
support tp
delete test_append_attn
add split fused_moe from ziyuan
fix rope for deepseek-v3
add deepseek-v3 class
fix wint8 precision and refine code
fix wint4, big diff
add e_score_correction_bias
fix head_dim
fix v3 verify
fix d2s
fix v3 verify
support qk_head_dim != v_head_dim
fix wint8 v_head_dim
fix rope
fix qwen2
mla use position_ids only
remove control flow
remove gpu concat
fix norm weight dtype
remove all_reduce in fused_moe
fix static run
refine rope code
compute position_ids use custom op
fuse rope
fix macro
fix mixtral
support mla for speculate
fix tokenizer and qwen
fix moe diff due to e_score_correction_bias
fix fast tokenizer
fix import
Co-authored-by: lizhenyun01 [email protected]
Co-authored-by: lizhenyun [email protected]
Solve the type annotation compatibility problem across Python versions (#9853)
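Such annotation-compatibility issues typically come from syntax like `str | None` being evaluated eagerly on older interpreters; a generic sketch of the usual fixes (not necessarily the exact change in #9853):

```python
# Without postponed evaluation, `str | None` in annotations fails to import on Python < 3.10.
from __future__ import annotations  # makes annotations lazy strings

def load_tokenizer(path: str, revision: str | None = None) -> "Tokenizer":
    ...

# Alternative that works on older interpreters without the __future__ import:
from typing import Optional

def load_tokenizer_compat(path: str, revision: Optional[str] = None):
    ...
```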
mix fp8 and wint8
save extra special tokens (#9837)
[Bugfix] Fix dsk rope diff (#9859)
fix dsk diff
fix
update
merge develop to check fp8 moe-wint8
fix deepseek v3 fp8 precision
fix deepseek weight quant
[Optimization] Support lower memory cards. (#9804)
support lower memory cards.
add doc for v100 16G such devices.
remove debug info.
add pre-divided factor to overcome overflow problem for fp16 attention.
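The pre-divided factor keeps fp16 attention logits in range by folding the scale into q before the GEMM instead of applying it afterwards; a small NumPy sketch of why the two orders are equivalent (this is the general trick, assumed here to match the commit's intent):

```python
import numpy as np

head_dim = 128
softmax_scale = 1.0 / np.sqrt(head_dim)
q = np.random.randn(8, head_dim).astype(np.float32)
k = np.random.randn(8, head_dim).astype(np.float32)

# Post-scaled: the intermediate q @ k.T is larger by 1/softmax_scale and can
# overflow when the GEMM accumulates in fp16.
scores_post = (q @ k.T) * softmax_scale

# Pre-divided: apply the factor to q before the GEMM, so the intermediate
# already lives in the final, smaller range. Mathematically identical.
scores_pre = (q * softmax_scale) @ k.T

print(np.allclose(scores_post, scores_pre))  # True
```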
Support XPU for auto-parallel LLaMa (#9796)
Support XPU for auto-parallel LLaMa
Update
Fix CI errors
[XPU] Add xpu fused op for deepseek (#9854)
[Inference] Update deepseek (#9864)
fix
fix infra
[PreTrain] Support deepseek mfu for pretraining and fix tflops for pretrain pipe model (#9855)
get flops with pp model.
Support hardware tflops for deepseek.
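A sketch of the MFU bookkeeping such a change involves, using the common 6 * parameters * tokens approximation for dense training FLOPs; the constant, peak number, and function name are assumptions, and the actual PaddleNLP formula for MoE/pipeline models is more detailed:

```python
def training_mfu(num_params, tokens_per_second, num_gpus, peak_tflops_per_gpu):
    # ~6 FLOPs per parameter per token for a dense forward + backward pass
    achieved_tflops = 6.0 * num_params * tokens_per_second / num_gpus / 1e12
    return achieved_tflops / peak_tflops_per_gpu

# Example: 7B dense model, 200k tokens/s on 64 GPUs, 312 peak TFLOPS (A100 bf16)
print(f"MFU: {training_mfu(7e9, 2.0e5, 64, 312.0):.2%}")
```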
[Inference]Support mtp with deepseek-v3 (#9856)
support mtp with deepseek_v3 both in static and dygraph mode
fix speculate tokenizer in unittest
delete useless code
check code
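For the MTP (multi-token prediction) speculative decoding support: a toy sketch of greedy verification, where draft tokens from the MTP head are accepted while the target model's argmax agrees (real sampling-based verification and the extra bonus token when everything matches are omitted):

```python
def greedy_verify(draft_tokens, target_tokens):
    """Accept draft tokens (from the MTP head) while the target model agrees.

    draft_tokens:  tokens proposed by the draft / MTP head
    target_tokens: argmax tokens of the target model at the same positions
    Returns the tokens actually emitted this step.
    """
    emitted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft == target:
            emitted.append(draft)      # target agrees: keep the speculated token
        else:
            emitted.append(target)     # first disagreement: emit target's token and stop
            break
    return emitted

print(greedy_verify([5, 9, 2, 7], [5, 9, 4, 7]))  # [5, 9, 4]
```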