feat: add health_generate route to openai serving #3856


Open · wants to merge 182 commits into base: main

Conversation

dsingal0

Adds /health_generate so that Kubernetes can detect when the runtime is hung and restart the container/pod, even if the server's /health returns a 200.
This is helpful when a long-context request from a user has caused the LLM runtime to hang or die while the FastAPI server is still running.
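A minimal sketch of the core idea behind such an endpoint (the `generate_one_token` callable below is a hypothetical stand-in for the real engine call, not this PR's implementation): submit a tiny generation to a worker and treat a timeout or an engine-side exception as unhealthy, so the probe can fail even while the HTTP server itself still answers.

```python
# Sketch only: a health check that fails when the runtime hangs,
# even though the web server process is alive.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout


def runtime_is_healthy(generate_one_token, timeout_s: float = 5.0) -> bool:
    """Return True only if the runtime produces a token within timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(generate_one_token)
    try:
        future.result(timeout=timeout_s)
        return True
    except FuturesTimeout:
        # Runtime is hung: the server is up, but generation never returns.
        return False
    except Exception:
        # An engine-side error also counts as unhealthy.
        return False
    finally:
        pool.shutdown(wait=False)
```

A /health_generate handler would then map `True` to HTTP 200 and `False` to a 5xx status so the kubelet restarts the pod.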

dsingal0 (Author)

@kaiyux wdyt?

@juney-nvidia juney-nvidia changed the title [feat] add health_generate route to openai serving feat: add health_generate route to openai serving Apr 26, 2025
@dsingal0 dsingal0 requested a review from a team as a code owner April 29, 2025 18:43
@kaiyux kaiyux requested review from Superjomn, kaiyux and LinPoly and removed request for a team May 1, 2025 15:35
kaiyux (Member) commented May 1, 2025

> @kaiyux wdyt?

@dsingal0 Thanks for the suggestion. Some members of the team are on public holiday and will return next week; we will keep you posted.

At first glance, I think we should be careful when introducing a new API that is not within the OpenAI API scope; I'll take a closer look at that.

LinPoly (Collaborator) left a comment


For the added entrypoint function, it is acceptable to check whether the runtime/executor/engine is alive. But similar logic could also be added on the client side instead of the server side: set a timeout for a simple request, then check whether the server returns any response and what the response status is. Do we have any reason to implement this logic on the server side? @dsingal0
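The client-side alternative described above can be sketched as follows (the URL and request payload here are illustrative placeholders, not the real trtllm-serve API): send a small request with a timeout and treat "no response in time" or a bad status as unhealthy.

```python
# Sketch of a client-side health probe: a simple request with a timeout.
# The endpoint path and payload are hypothetical.
import json
import urllib.error
import urllib.request


def probe_server(url: str, timeout_s: float = 10.0) -> bool:
    """POST a tiny request; unhealthy on timeout, error, or non-2xx status."""
    payload = json.dumps({"prompt": "hi", "max_tokens": 1}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and HTTP errors.
        return False
```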

LinPoly (Collaborator) commented May 6, 2025

FYI: vLLM monitors executor health with a daemon thread, which seems more structured and reliable to me, but we would need to implement a similar check at the executor level. @kaiyux
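The daemon-thread approach mentioned above could look roughly like this (a sketch, not vLLM's or TensorRT-LLM's actual code; `executor_alive` is a hypothetical callable standing in for a real engine-level check): a background thread polls the executor and records the last healthy timestamp, which a /health handler can then consult cheaply.

```python
# Sketch of a daemon-thread health monitor at the executor level.
import threading
import time


class HealthMonitor:
    def __init__(self, executor_alive, interval_s: float = 1.0):
        self._alive_check = executor_alive
        self._interval = interval_s
        self._last_healthy = None  # monotonic timestamp of last good check
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()

    def _loop(self):
        while not self._stop.is_set():
            try:
                if self._alive_check():
                    self._last_healthy = time.monotonic()
            except Exception:
                pass  # a failing check simply leaves the timestamp stale
            self._stop.wait(self._interval)

    def healthy(self, max_staleness_s: float = 5.0) -> bool:
        """True if the executor passed a check within max_staleness_s."""
        if self._last_healthy is None:
            return False
        return time.monotonic() - self._last_healthy < max_staleness_s
```

A /health handler backed by `healthy()` stays cheap per request, since the expensive check runs in the background at a fixed interval.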

dsingal0 (Author) commented May 6, 2025

@LinPoly

Re: implementing the check server-side vs. client-side.
For deployment in Kubernetes, we need liveness and readiness probes to detect when a pod is healthy and ready to receive traffic during autoscaling, and when it is unhealthy and needs to be restarted. There is no client code to add this logic to in that case. Currently trtllm-serve exposes /health, but that only checks whether the server is up, not the health of the runtime. It would be great to have this at the executor/runtime level if possible.
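As a concrete illustration of how Kubernetes would consume such an endpoint, the probe configuration might look like the following (port numbers, paths, and timings are illustrative, not values from this PR):

```yaml
# Illustrative probe config: liveness uses the deep runtime check,
# readiness uses the cheap server-up check.
livenessProbe:
  httpGet:
    path: /health_generate
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 60   # generous: a real generation round-trip
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  timeoutSeconds: 5
```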

LinPoly (Collaborator) commented May 7, 2025

> @LinPoly
>
> re implementing the check server side vs client side
> for deployment in kubernetes we need a health and liveness probe to detect when a pod is healthy and ready to be sent traffic during autoscaling and when its unhealthy and needs to be restarted. There is no client code to add this logic to in that case. Currently trtllm-serve exposes /health but that only checks if the server is up, not the health of the runtime. Would be great to have it at the executor/runtime level if possible

So you mean K8s uses a server-side health check for error detection and auto-scaling? If so, I think it is reasonable to add such a workaround, and we can work on a more structured solution afterwards. @kaiyux for opinions.

Superjomn (Collaborator) commented May 9, 2025

@kaiyux @penli9 Since we have no such check yet, I think a /health_generate is reasonable; we can refine the implementation in further iterations. BTW, we are going to introduce a heartbeat mechanism at the LLM-API level, which could facilitate the implementation of /health: it will check and update the status of the system periodically in a cheap way. But /health_generate can always return a real-time status whenever it is invoked.

kaiyux (Member) commented May 9, 2025

@dsingal0 Can you also help fix the DCO check? https://github.com/NVIDIA/TensorRT-LLM/pull/3856/checks?check_run_id=41920082141

Summary
Commit sha: 2cd150e, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.
Commit sha: a12154a, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.
Commit sha: 06f3d21, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.

You can follow the guidance here to sign off: https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#signing-your-work.

Please let us know if you have any questions, thanks!

dsingal0 and others added 13 commits May 8, 2025 23:31

* add MNNVL memory mapping support, more MPI environment for trtllm-llmapi-launch, MoE communication and prepare kernels, MNNVL AlltoAll support for DeepSeekV3, output dump for the throughput benchmark, and dynamic kernel launch grid (Signed-off-by: Dongxu Yang)
* refactor: fix head size 72 attention error for the TRTLLM attention backend in the PyTorch workflow; remove the head-size pre-check in AttentionOp, add head size 72 support in unfused attention kernels (QKVPreprocessing), waive head_dim=72 cases on post-SM100, and add a Scenario __repr__ for pytest substring matching (Signed-off-by: qixiang-99)
* remove results.xml when no cases ran, plus related test-stage config fixes (Signed-off-by: qqiao)

pcastonguay and others added 12 commits May 8, 2025 23:31

* feat: allow overriding CLI args with a YAML file in trtllm-serve (NVIDIA#4164, Signed-off-by: Patrice Castonguay)
* fix attention DP bugs on Qwen3 (NVIDIA#4141, Signed-off-by: bhsueh)
* add piecewise CUDA graph support (Signed-off-by: Yi Zhang)
* fix apply_per_channel_scale for extremely large input sequence lengths (NVIDIA#4089, Signed-off-by: Jiang Shao)
* [fix] fix trtllm-bench for Llama 4 (Signed-off-by: Mike Iovine)
* fix import break caused by rebase (NVIDIA#4169, Signed-off-by: Yukun He)
* skip pre-Ada case (NVIDIA#4095, Signed-off-by: Ivy Zhang)
* move Mistral/Mixtral cases into the accuracy test suite and update thresholds (NVIDIA#3440, Signed-off-by: Ivy Zhang)
* add fp8 KV cache tests to DSV3-Lite integration tests and update waive lists (bug 5239087, Signed-off-by: Bo Li)
dsingal0 (Author) commented May 9, 2025

@kaiyux Done.

kaiyux (Member) commented May 9, 2025

@dsingal0 Thanks for fixing the DCO check. Would you mind squashing the commits? Not sure why so many unrelated commits were introduced; the diff looks good, though.

dsingal0 (Author) commented May 9, 2025

@kaiyux I think rebasing to add the sign-off pulled those commits into the PR; can we squash on merge instead? I added the test and removed the mtp.py change.

dsingal0 (Author) commented May 9, 2025

Alternatively, I could open a new PR.

kaiyux (Member) commented May 9, 2025

> @kaiyux I think rebasing to add signoff added those commits to the PR, can we squash on merge instead? I added the test and removed the mtp.py change.

Thanks a lot! I think we should be fine merging this one, since we will squash the commits when merging. @chzblych do you think differently?

kaiyux (Member) commented May 9, 2025

/bot run

tensorrt-cicd (Collaborator)

PR_Github #4694 [ run ] triggered by Bot

kaiyux (Member) commented May 9, 2025

/bot run

tensorrt-cicd (Collaborator)

PR_Github #4705 [ run ] triggered by Bot

tensorrt-cicd (Collaborator)

PR_Github #4694 [ run ] completed with state ABORTED

tensorrt-cicd (Collaborator)

PR_Github #4705 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3393 completed with status: 'FAILURE'
