feat: add health_generate route to openai serving #3856


Open · wants to merge 182 commits into base: main

Conversation

dsingal0

Adds /health_generate so that Kubernetes can detect when the runtime is hung and restart the container/pod, even if the server's /health returns a 200.
This is helpful when a long-context request from a user has caused the LLM runtime to hang or die while the FastAPI server is still running.
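A minimal sketch of the core idea behind such an endpoint (the `generate_one_token` callable below is a hypothetical stand-in for the real engine call, not this PR's implementation): submit a tiny generation to a worker and treat a timeout or an engine-side exception as unhealthy, so the probe can fail even while the HTTP server itself still answers.

```python
# Sketch only: a health check that fails when the runtime hangs,
# even though the web server process is alive.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout


def runtime_is_healthy(generate_one_token, timeout_s: float = 5.0) -> bool:
    """Return True only if the runtime produces a token within timeout_s."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(generate_one_token)
    try:
        future.result(timeout=timeout_s)
        return True
    except FuturesTimeout:
        # Runtime is hung: the server is up, but generation never returns.
        return False
    except Exception:
        # An engine-side error also counts as unhealthy.
        return False
    finally:
        pool.shutdown(wait=False)
```

A /health_generate handler would then map `True` to HTTP 200 and `False` to a 5xx status so the kubelet restarts the pod.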

dsingal0 (Author)

@kaiyux wdyt?

@juney-nvidia juney-nvidia changed the title [feat] add health_generate route to openai serving feat: add health_generate route to openai serving Apr 26, 2025
@dsingal0 dsingal0 requested a review from a team as a code owner April 29, 2025 18:43
@kaiyux kaiyux requested review from Superjomn, kaiyux and LinPoly and removed request for a team May 1, 2025 15:35
kaiyux (Member) commented May 1, 2025

> @kaiyux wdyt?

@dsingal0 Thanks for the suggestion. Some members of the team are on public holiday and will return next week; we will keep you posted.

At first glance, I think we should be careful when introducing a new API that is not within the OpenAI API scope; I'll take a closer look at that.

LinPoly (Collaborator) left a comment


For the added entrypoint function, it is acceptable to check whether the runtime/executor/engine is alive. But similar logic could also be added on the client side instead of the server side: set a timeout for a simple request, then check whether the server returns any response and what the response status is. Do we have any reason to implement this logic on the server side? @dsingal0
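The client-side alternative described above can be sketched as follows (the URL and request payload here are illustrative placeholders, not the real trtllm-serve API): send a small request with a timeout and treat "no response in time" or a bad status as unhealthy.

```python
# Sketch of a client-side health probe: a simple request with a timeout.
# The endpoint path and payload are hypothetical.
import json
import urllib.error
import urllib.request


def probe_server(url: str, timeout_s: float = 10.0) -> bool:
    """POST a tiny request; unhealthy on timeout, error, or non-2xx status."""
    payload = json.dumps({"prompt": "hi", "max_tokens": 1}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and HTTP errors.
        return False
```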

LinPoly (Collaborator) commented May 6, 2025

FYI: vLLM monitors executor health with a daemon thread, which seems more structured and reliable to me, but we would need to implement a similar check at the executor level. @kaiyux
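The daemon-thread approach mentioned above could look roughly like this (a sketch, not vLLM's or TensorRT-LLM's actual code; `executor_alive` is a hypothetical callable standing in for a real engine-level check): a background thread polls the executor and records the last healthy timestamp, which a /health handler can then consult cheaply.

```python
# Sketch of a daemon-thread health monitor at the executor level.
import threading
import time


class HealthMonitor:
    def __init__(self, executor_alive, interval_s: float = 1.0):
        self._alive_check = executor_alive
        self._interval = interval_s
        self._last_healthy = None  # monotonic timestamp of last good check
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()

    def _loop(self):
        while not self._stop.is_set():
            try:
                if self._alive_check():
                    self._last_healthy = time.monotonic()
            except Exception:
                pass  # a failing check simply leaves the timestamp stale
            self._stop.wait(self._interval)

    def healthy(self, max_staleness_s: float = 5.0) -> bool:
        """True if the executor passed a check within max_staleness_s."""
        if self._last_healthy is None:
            return False
        return time.monotonic() - self._last_healthy < max_staleness_s
```

A /health handler backed by `healthy()` stays cheap per request, since the expensive check runs in the background at a fixed interval.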

dsingal0 (Author) commented May 6, 2025

@LinPoly

Re: implementing the check server-side vs. client-side.
For deployment in Kubernetes, we need liveness and readiness probes to detect when a pod is healthy and ready to receive traffic during autoscaling, and when it is unhealthy and needs to be restarted. There is no client code to add this logic to in that case. Currently trtllm-serve exposes /health, but that only checks whether the server is up, not the health of the runtime. It would be great to have this at the executor/runtime level if possible.
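As a concrete illustration of how Kubernetes would consume such an endpoint, the probe configuration might look like the following (port numbers, paths, and timings are illustrative, not values from this PR):

```yaml
# Illustrative probe config: liveness uses the deep runtime check,
# readiness uses the cheap server-up check.
livenessProbe:
  httpGet:
    path: /health_generate
    port: 8000
  periodSeconds: 30
  timeoutSeconds: 60   # generous: a real generation round-trip
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  timeoutSeconds: 5
```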

LinPoly (Collaborator) commented May 7, 2025

> @LinPoly
>
> re implementing the check server side vs client side
> for deployment in kubernetes we need a health and liveness probe to detect when a pod is healthy and ready to be sent traffic during autoscaling and when its unhealthy and needs to be restarted. There is no client code to add this logic to in that case. Currently trtllm-serve exposes /health but that only checks if the server is up, not the health of the runtime. Would be great to have it at the executor/runtime level if possible

So you mean K8s uses a server-side health check for error detection and auto-scaling? If so, I think it is reasonable to add such a workaround, and we can work on a more structured solution afterwards. @kaiyux for opinions.

Superjomn (Collaborator) commented May 9, 2025

@kaiyux @penli9 Since we have no such check yet, I think a /health_generate is reasonable; we can refine the implementation in further iterations. BTW, we are going to introduce a heartbeat mechanism at the LLM-API level, which could facilitate the implementation of /health: it will check and update the status of the system periodically in a cheap way. But /health_generate can always return a real-time status whenever it is invoked.

kaiyux (Member) commented May 9, 2025

@dsingal0 Can you also help fix the DCO check? https://github.com/NVIDIA/TensorRT-LLM/pull/3856/checks?check_run_id=41920082141

Summary
Commit sha: 2cd150e, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.
Commit sha: a12154a, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.
Commit sha: 06f3d21, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.

You can follow the guidance here to sign off: https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#signing-your-work.

Please let us know if you have any questions, thanks!

dsingal0 and others added 13 commits May 8, 2025 23:31

* add MNNVL memory mapping support, more MPI environment for trtllm-llmapi-launch, MoE communication and prepare kernels, MNNVL AlltoAll support for DeepSeekV3, output dump for the throughput benchmark, and dynamic kernel launch grid (Signed-off-by: Dongxu Yang)
* refactor: fix head size 72 attention error for the TRTLLM attention backend in the PyTorch workflow; remove the head-size pre-check in AttentionOp, add head size 72 support in unfused attention kernels (QKVPreprocessing), waive head_dim=72 cases on post-SM100, and add a Scenario __repr__ for pytest substring matching (Signed-off-by: qixiang-99)
* remove results.xml when no cases ran, plus related test-stage config fixes (Signed-off-by: qqiao)

pcastonguay and others added 12 commits May 8, 2025 23:31

* feat: allow overriding CLI args with a YAML file in trtllm-serve (NVIDIA#4164, Signed-off-by: Patrice Castonguay)
* fix attention DP bugs on Qwen3 (NVIDIA#4141, Signed-off-by: bhsueh)
* add piecewise CUDA graph support (Signed-off-by: Yi Zhang)
* fix apply_per_channel_scale for extremely large input sequence lengths (NVIDIA#4089, Signed-off-by: Jiang Shao)
* [fix] fix trtllm-bench for Llama 4 (Signed-off-by: Mike Iovine)
* fix import break caused by rebase (NVIDIA#4169, Signed-off-by: Yukun He)
* skip pre-Ada case (NVIDIA#4095, Signed-off-by: Ivy Zhang)
* move Mistral/Mixtral cases into the accuracy test suite and update thresholds (NVIDIA#3440, Signed-off-by: Ivy Zhang)
* add fp8 KV cache tests to DSV3-Lite integration tests and update waive lists (bug 5239087, Signed-off-by: Bo Li)
dsingal0 (Author) commented May 9, 2025

@kaiyux Done.

kaiyux (Member) commented May 9, 2025

@dsingal0 Thanks for fixing the DCO check. Would you mind squashing the commits? Not sure why so many unrelated commits were introduced; the diff looks good, though.

dsingal0 (Author) commented May 9, 2025

@kaiyux I think rebasing to add the sign-off pulled those commits into the PR; can we squash on merge instead? I added the test and removed the mtp.py change.

dsingal0 (Author) commented May 9, 2025

Alternatively, I could open a new PR.

kaiyux (Member) commented May 9, 2025

> @kaiyux I think rebasing to add signoff added those commits to the PR, can we squash on merge instead? I added the test and removed the mtp.py change.

Thanks a lot! I think we should be fine merging this one, since we will squash the commits when merging. @chzblych do you think differently?

kaiyux (Member) commented May 9, 2025

/bot run

tensorrt-cicd (Collaborator)

PR_Github #4694 [ run ] triggered by Bot

kaiyux (Member) commented May 9, 2025

/bot run

tensorrt-cicd (Collaborator)

PR_Github #4705 [ run ] triggered by Bot

tensorrt-cicd (Collaborator)

PR_Github #4694 [ run ] completed with state ABORTED

tensorrt-cicd (Collaborator)

PR_Github #4705 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3393 completed with status: 'FAILURE'
