feat: add health_generate route to openai serving #3856
base: main
Conversation
@kaiyux wdyt?
For the added entrypoint function, checking whether the runtime/executor/engine is alive is acceptable. However, similar logic could also live in the client code instead of on the server side: set a timeout for a simple request, then check whether the server returns any response and what its status is. Do we have any reason to implement this logic on the server side? @dsingal0
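The client-side alternative described above could be a small probe that sends a tiny generation request with a timeout and treats a timely 200 as healthy. A minimal sketch, assuming a standard OpenAI-style `/v1/completions` endpoint; the path, payload shape, and `check_generate_alive` helper are illustrative assumptions, not the actual trtllm-serve API:

```python
import json
import urllib.request


def check_generate_alive(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the server completes a minimal generation in time.

    A hung runtime shows up as a timeout (or error), which we map to False.
    """
    payload = json.dumps({
        "model": "any",        # placeholder model name
        "prompt": "ping",
        "max_tokens": 1,       # keep the probe as cheap as possible
    }).encode()
    req = urllib.request.Request(
        base_url + "/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        # Timeout, connection refused, or non-2xx status all count as unhealthy.
        return False
```

A Kubernetes liveness probe could then be an `exec` probe that runs this check and exits non-zero on failure.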
FYI: vLLM monitors executor health with a daemon thread, which seems more structured and reliable to me, but we would need to implement similar checking at the executor level. @kaiyux
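The daemon-thread approach mentioned here could be sketched roughly as follows. The `is_alive` callable is a hypothetical stand-in for a real executor-level liveness check, not an existing TensorRT-LLM or vLLM API; the server's `/health` handler would then report `monitor.healthy`:

```python
import threading
import time


class HealthMonitor:
    """Background thread that periodically polls an executor liveness check."""

    def __init__(self, is_alive, interval_s: float = 10.0):
        self._is_alive = is_alive          # callable returning bool
        self._interval_s = interval_s
        self.healthy = True                # last observed health state
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _loop(self) -> None:
        while True:
            try:
                self.healthy = bool(self._is_alive())
            except Exception:
                # A crashed or unresponsive executor counts as unhealthy.
                self.healthy = False
            time.sleep(self._interval_s)
```

Being a daemon thread, it never blocks interpreter shutdown, and the health state is updated even when no requests are in flight.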
So you mean K8s uses a server-side health check for error detection and auto-scaling? If so, I think it is reasonable to add such a workaround, and we can work on a more structured solution afterwards. @kaiyux for opinions.
@kaiyux @penli9 , as we have no such check yet, I think a
@dsingal0 Can you also help fix the DCO check? https://github.com/NVIDIA/TensorRT-LLM/pull/3856/checks?check_run_id=41920082141
You can follow the guidance here to sign off: https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#signing-your-work. Please let us know if you have any questions, thanks!
Signed-off-by: Dhruv Singal <[email protected]>
Signed-off-by: Dhruv Singal <[email protected]>
@kaiyux done
@dsingal0 Thanks for fixing the DCO check. Mind squashing the commits? Not sure why so many unrelated commits were introduced; the diff looks good though.
Signed-off-by: Dhruv Singal <[email protected]>
@kaiyux I think rebasing to add the sign-off pulled those commits into the PR; can we squash on merge instead? I added the test and removed the mtp.py change.
Alternatively, I could open a new PR.
/bot run
PR_Github #4694 [ run ] triggered by Bot
Signed-off-by: Dhruv <[email protected]>
/bot run
PR_Github #4705 [ run ] triggered by Bot
PR_Github #4694 [ run ] completed with state
PR_Github #4705 [ run ] completed with state
Adds a /health_generate route so that Kubernetes can detect when the runtime is hung and restart the container/pod, even if the server's /health endpoint still returns a 200.
This is helpful when a long-context request from a user has caused the LLM runtime to hang or die while the FastAPI server is still running.