feat: add health_generate route to openai serving #3856

Open · wants to merge 182 commits into base: main
Changes from all commits (182 commits)
0221e09
add health_generate
dsingal0 Apr 25, 2025
e06130e
feat: Add MNNVL MoE A2A support (#3504)
dongxuy04 Apr 25, 2025
d413d3b
[infra] Waive L0 tests (#3853)
yiqingy0 Apr 25, 2025
9b0873e
[chore] Add Llama 4 Maverick to quickstart README (#3848)
mikeiovine Apr 25, 2025
a209283
fix: [AutoDeploy] update hf loading for e_score_correction_bias (#3847)
sugunav14 Apr 25, 2025
f4b5f0b
feat: Add head size 72 support for QKV Preprocessing kernel (#3743)
qixiang-99 Apr 25, 2025
efd64fd
chore: update pytorch only change file list (#3873)
QiJune Apr 25, 2025
c2fdaad
Test: Split C++ unit tests for CI granularity (#3868)
DomBrown Apr 25, 2025
2320cf6
TRTLLM-4875 feat: Add version switcher to doc (#3846)
kaiyux Apr 25, 2025
fe3a278
feat: llama4 input processor (#3383)
milesial Apr 25, 2025
22e675a
fix: Detect pmix and raise error when mpirun is not used. (#3858)
yuxianq Apr 26, 2025
2afb813
fix bug of deepseek gropu_size setting (#3860)
byshiue Apr 27, 2025
0e29db1
Infra: Remove empty junit xml (#3794)
EmmaQiaoCh Apr 27, 2025
7bdf3ba
fix: Update num_of_ctx_tokens in iteration stats (#3785)
HuiGao-NV Apr 27, 2025
74bc5df
cacheTransceiver buffer manager (#3798)
chuangz0 Apr 27, 2025
ac3d488
fix: add warmup flag into py_executor to prevent enable profiler duri…
byshiue Apr 27, 2025
db4169e
fix: trtllm-bench build trt engine on slurm (#3825)
Superjomn Apr 27, 2025
53bc187
infra: install Triton in the base image (#3759)
Tabrizian Apr 27, 2025
0c66015
fix bug of create cuda stream as default parameter which will be init…
byshiue Apr 28, 2025
ebac837
Test: waive intermittent test hang (#3894)
chzblych Apr 28, 2025
7d9d70e
infra: add scaffolding paths to pytorch only files (#3835)
dc3671 Apr 28, 2025
645f092
update waives & tests (#3887)
xinhe-nv Apr 28, 2025
897aed4
test: [CI] Add failed cases into waives.txt (#3867)
xinhe-nv Apr 28, 2025
962d188
Fix the link of doc (#3903)
litaotju Apr 28, 2025
4330e55
[TRTLLM-4638 ][feat] add best of n support with reward model in scaff…
dc3671 Apr 28, 2025
e788c6c
Add docs about DeepSeek-R1 long context support. (#3910)
qiaoxj07 Apr 28, 2025
905acff
fix(requirements): fix neither 'setup.py' nor 'pyproject.toml' found …
dc3671 Apr 28, 2025
ab6f2d7
[chore] Make llama4 MoE use maybe_execute_in_parallel (#3779)
mikeiovine Apr 28, 2025
5fc08c8
Fixing minor typo in allreduce kernel selection (#3912)
hyukn Apr 28, 2025
c29bb8b
test: add deepseek v3 & r1 cases (#3528)
VALLIS-NERIA Apr 28, 2025
f4c0627
[fix] Fix flashinfer + speculation issues (#3686)
mikeiovine Apr 28, 2025
5c9beb8
waive test_attention_no_cache (#3921)
hchings Apr 28, 2025
b6653d5
fix: Fix FMHA-based MLA in the generation phase and add MLA unit test…
jinyangyuan-nvidia Apr 29, 2025
c0a94cc
chore: remove DummyKvCacheManager. (#3896)
yuxianq Apr 29, 2025
5921e00
refactor(test): remove random context sequence lengths and set seed f…
qixiang-99 Apr 29, 2025
431949f
feat: fix erros on scaffolding README (#3899)
WeiHaocheng Apr 29, 2025
83ddd55
Fix fp8 kvcache (#3877)
hlu1 Apr 29, 2025
eddcbc1
feat: add CGA reduction fmha kernels on Blackwell. (#3763)
PerkzZheng Apr 29, 2025
ef15ca8
increase H100 CI nodes for PyTorch only pipelines (#3927)
QiJune Apr 29, 2025
a2fbb89
[TRTLLM-4883][fix]: Update output speed calculation. (#3923)
FrankD412 Apr 29, 2025
8297039
add num_scheduled_requests into print_log (#3914)
byshiue Apr 29, 2025
968044d
fix: revert https://github.com/NVIDIA/TensorRT-LLM/pull/3858 (#3928)
yuxianq Apr 29, 2025
16b81fd
change log level of some text from info to debug (#3930)
byshiue Apr 29, 2025
76447e2
optimize cudaMemGetInfo for TllmGenFmhaRunner (#3907)
zhhuang-nv Apr 29, 2025
cc47137
chore: bump version to 0.19.0 (#3598) (#3841)
DomBrown Apr 29, 2025
0358623
feat: parallel q_b_proj and concat (#3917)
hello-11 Apr 29, 2025
8efb398
refactor: (part1) Add contraints doc for fusedMoe module. (#3882)
HuiGao-NV Apr 29, 2025
f5d1842
fix: get head_dim from model’s config. (#3916)
yuxianq Apr 29, 2025
7ce64a3
TRTLLM-4624 feat: Add nvfp4 gemm and moe support for SM120 (#3770)
VALLIS-NERIA Apr 29, 2025
80f0d01
Support NemotronH FP8 Quantization
tomeras91 Apr 29, 2025
2bd4a8e
fix: change the seq_lens sync copy to an async one (#3786)
lfr-0531 Apr 29, 2025
a265597
skip blackwell tests for sm120 (#3815)
pamelap-nvidia Apr 29, 2025
0c85a57
ci: skip pipeline parallelism test of pytorch flow (#3947)
QiJune Apr 29, 2025
7dcf4f7
sync internal cutlass kernel changes (#3968)
pamelap-nvidia Apr 30, 2025
935f044
chore: update multi-gpu trigger file list (#3971)
QiJune Apr 30, 2025
4201b59
update waive list (#3890)
xinhe-nv Apr 30, 2025
650d5f3
chore: Remove duplicated get_sm_version. (#3935)
yuxianq Apr 30, 2025
1fea894
chore: bump version to 0.20.0rc2 (#3949)
ZhanruiSunCh Apr 30, 2025
dfd00d9
perf: Optimise MOE prologue to use fused setup function (#3790)
djns99 Apr 30, 2025
635693f
remove release branch codeowners (#3954)
tburt-nv Apr 30, 2025
c5c2bef
fix: [https://nvbugspro.nvidia.com/bug/5243482] If FlashMLA is used, …
bobboli Apr 30, 2025
72fe31d
unwaive disagg tests (#3925)
chuangz0 Apr 30, 2025
2eccd40
infra: open source XQA kernels (#3762)
ming-wei Apr 30, 2025
3fc98fd
feat: Mistral-Large-2 support in the Pytorch workflow
hypdeb Apr 30, 2025
8865d29
chore: update internal_cutlass_kernels. (#3973)
nv-guomingz Apr 30, 2025
036851a
[fix] Pad requests to maximum draft length in spec decode (#3957)
mikeiovine Apr 30, 2025
520da31
infra: add conan (#3744)
tburt-nv Apr 30, 2025
3bf0d1e
waive test_tinyllama_guided_decoding (#3997)
hchings Apr 30, 2025
e2bad33
[TRTLLM-4460] test: Use Llama 3.2 1B for Llama C++ tests (#3206)
DomBrown Apr 30, 2025
5d7a012
Clean up allreduce op in Deepseek V3 model. (#3829)
hyukn Apr 30, 2025
8de3955
[feat]: Allow for a settable end-of-sequence/padding token in max thr…
FrankD412 May 1, 2025
ae05cd6
feat: Add multimodal embedding field in LlmRequest (#3855)
katec846 May 1, 2025
5d08412
Llama4 processor fixes (#3994)
milesial May 1, 2025
9c38f93
Add attention workspace memory check (#3970)
hlu1 May 1, 2025
c6ae15b
feat: add relaxed acceptance for DS (#3865)
yweng0828 May 1, 2025
9beabde
fix:https://nvbugs/5246733 (#3989)
nv-guomingz May 1, 2025
62bf060
model: support Qwen3 (#4010)
byshiue May 1, 2025
b1f027b
test: [CI] Add failed cases into waives.txt (#3943)
xinhe-nv May 1, 2025
a785931
feat: Support Top-K logprobs and prompt_logprobs in LLMAPI (#3388)
hchings May 1, 2025
45dc58e
[AutoDeploy] Make all ranks agree on kv-cache size (#4007)
suyoggupta May 1, 2025
3a83a24
feat: LogitsProcessor in PyTorch backend (#3145)
hchings May 1, 2025
9b32797
Fallback to NCCL for various patterns when input size is large. (#4009)
hyukn May 1, 2025
b050731
replace raw_request that was None with mock_request that is a dict to…
dsingal0 May 7, 2025
1d9fffd
feat: [AutoDeploy] unfusing attention for native support (#3668)
lucaslie May 2, 2025
34f1b19
feat: Add group_rms_norm kernel to normalize multiple inputs in a sin…
SimengLiu-nv May 2, 2025
dcc73e3
add ci and doc for qwen3 (#4022)
byshiue May 2, 2025
9340534
Fix Deepseek MTP with moe_backend=TRTLLM (#4001)
hlu1 May 2, 2025
1811c96
fix: Move all casters to customCasters. (#3945)
dcampora May 2, 2025
aeaec95
fix: [nvbug/5252057] Fix kv cache reuse on PyTorch multimodal (#4025)
yechank-nvidia May 2, 2025
51f52e6
[https://nvbugs/5248923] fix: Correctly sizes seqslotmanager consider…
dcampora May 2, 2025
d605b38
[infra] Improve llama4 parallelism test coverage (#3821)
mikeiovine May 2, 2025
7a65da1
feat: add Pytorch support of Vision Encoder for multimodal models (#3…
qixiang-99 May 2, 2025
a9f8d43
build: keep using system python for dev install (#4014)
tburt-nv May 3, 2025
d901d2c
refactor: Move ModelSpec to core library (#3980)
Funatiq May 3, 2025
d225845
infra: Remove the WAR for test items incompletely (#3313)
EmmaQiaoCh May 4, 2025
3773fec
refactor: Introduce MpiTag enumeration and update MPI function signat…
Funatiq May 4, 2025
ce52de2
chore: refactor llmapi e2e tests (#3803)
Superjomn May 4, 2025
377b488
update CI allowlist (#3969)
tburt-nv May 5, 2025
6cc2abb
feat: support to trace executor loop. (#3983)
yuxianq May 5, 2025
5c6d5c3
fix: [nvbug/5241627] Fix AllReduce kernel hang issue when both tp and…
hyukn May 5, 2025
33310af
Waive L0 tests (#4051)
yiqingy0 May 5, 2025
36a815a
fix: apply rope twice in Qwen3. (#4040)
yuxianq May 5, 2025
71cf0b7
fix: instantiate decoder early in pytorch (#4029)
dcampora May 5, 2025
a278dd8
feat: run mmlu and summarize without engine_dir. (#4056)
yuxianq May 5, 2025
c461231
[Test]: Waive unsupported tests (#4059)
chzblych May 5, 2025
cc96066
fix: request termination in pipeline parallelism (#3892)
Funatiq May 5, 2025
25e27b1
[Test]: Clean up stale waives (#4062)
chzblych May 5, 2025
993b755
test: Add disaggregated serving accuracy tests (#4036)
Tabrizian May 5, 2025
9838acf
[fix] Skip debugCheckSemaphores in stream capture mode (#4032)
mikeiovine May 5, 2025
48aea8c
test: Test OOB access issue in penaltyKernel for endId=-1 (#4035)
brb-nv May 5, 2025
76c76a7
feat: add deepseek-r1 reasoning parser to trtllm-serve (#3354)
pansicheng May 6, 2025
91a3150
Fix: fix bug of qwen3 moe (#4058)
byshiue May 6, 2025
50be72c
doc: update qwen3 document (#4073)
byshiue May 6, 2025
869f76b
[AutoDeploy][perf] Further optimize flashinfer backend in AutoDeploy …
suyoggupta May 6, 2025
1aa2cb6
[fix] Loosen the thresholds of test_attention_mla (#4074)
jinyangyuan-nvidia May 6, 2025
31fe795
feat: support add internal cutlass kernels as subproject (#3658)
tongyuantongyu May 6, 2025
4db432a
fix: skip add new slot if request has slot 0 (#3991)
HuiGao-NV May 6, 2025
be149ce
fix: Fix NVLink version decoding. (#3996)
yuxianq May 6, 2025
59bf5c7
[https://nvbugs/5247414] fix: draft/target probs shape (#4055)
Funatiq May 6, 2025
69e0a1a
infra: [TRTLLM-4475][TRTLLM-4565] Add pipeline hierarchy and basic in…
ZhanruiSunCh May 6, 2025
a28c667
fix: trtllm-serve hang in stress test and ds v3 stress parameter upda…
dominicshanshan May 6, 2025
8c1ff24
[TRTLLM-3429] feat: Overlap scheduling in C++ runtime (#3625)
Funatiq May 6, 2025
6a42831
fix: Properly get decoding mode according to same logic as cpp. (#4026)
dcampora May 6, 2025
646b943
cleanup logprob params (#4039)
hchings May 6, 2025
9055c2a
fix: Pass local dir to processor creation (#4018)
milesial May 6, 2025
dc520f6
test(perf): Add Llama-3.1-Nemotron-8B-v1 to perf tests (#3822)
venkywonka May 7, 2025
2a0b1e2
bench: TRTLLM-4936 Port benchmark_serving.py (#4011)
kaiyux May 7, 2025
f9eed9d
fix cache buffer (#3942)
chuangz0 May 7, 2025
443b198
[TRTLLM-3925, https://nvbugs/5245262] [fix] Normalize LLM.generate AP…
syuoni May 7, 2025
7deb53b
[Qwen3] chore: fix bug of fused_moe on tp > 1 (#4093)
byshiue May 7, 2025
f5296de
Adding option to specify a set of token ids for multimodal tokens (#4…
rakib-hasan May 7, 2025
1c16c7f
chore: Cleanup deprecated APIs from LLM-API (part 1/2) (#3732)
Superjomn May 7, 2025
134cf64
[Infra] - Update code ownership rules (#4109)
chzblych May 7, 2025
013f8de
tests: skip writing prepare_dataset output to logs, and add llama_v3.…
ruodil May 7, 2025
99ea43c
fix: Align default setting & remove unnecessary check for chat and co…
LinPoly May 7, 2025
3ecae77
infra: [TRTLLM-4051] Support only run some backend type test (#3578)
ZhanruiSunCh May 7, 2025
3cb3b1a
chore:update .gitignore for doc building task. (#3993)
nv-guomingz May 7, 2025
ada0842
enh: Update docker Makefile to use only the visible GPUs of machine (…
venkywonka May 7, 2025
8ca2a44
feat: Reduce branch overhead in groupRMSNorm kernels (#4067)
SimengLiu-nv May 7, 2025
f25f888
[Deepseek] Refactor Deepseek Decoder layer (#4016)
hlu1 May 7, 2025
85c29cc
[feat/] enable attention DP in Llama4 maverick model - part 1 (#4065)
zihaok May 7, 2025
9ac40e6
test: add INTEGRATION_TEST env var to speed up integration test (#3618)
crazydemo May 8, 2025
4b8fdc4
[Infra] - Update code ownership rules for public APIs (#4122)
chzblych May 8, 2025
79bd4a8
chore: remove data stage in serve example on slurm (#4138)
Superjomn May 8, 2025
65e120f
test: Waive test_llm cases (#4136)
syuoni May 8, 2025
b5f3817
test: Waive disagg accuracy test (#4124)
syuoni May 8, 2025
4580a02
infra: WAR for Argument list too long of globalVars[CACHED_CHANGED_FI…
ZhanruiSunCh May 8, 2025
c1585df
feat: Add Slurm support and enable RTX Pro 6000 testing pipeline in C…
yuanjingx87 May 8, 2025
7888478
[Infra] Waive L0 flaky test (#4148)
yiqingy0 May 8, 2025
dcdf7af
doc: TRTLLM-4797 Update perf-analysis.md (#4100)
kaiyux May 8, 2025
40b0627
Fix TP8 for NVFP4 kv dupilcation. (#4143)
Tracin May 8, 2025
d2d25af
test: [CI] remove closed bugs (#4046)
xinhe-nv May 8, 2025
8de0811
[TRTQA-2861][test]: add nemotron and llama4 cases into qa test (#4053)
crazydemo May 8, 2025
1961282
chore: enhance the cmake experience by ignoring the additional semico…
nv-guomingz May 8, 2025
44e0287
[TRTLLM-4480][doc] Documentation for new accuracy test suite and trtl…
syuoni May 8, 2025
5ad095d
feat: adopt new logprob definition in PyTorch flow (#4057)
tongyuantongyu May 8, 2025
f580e6d
infra: Add NIXL into the Dockerfile (#3981)
Shixiaowei02 May 8, 2025
549d305
feat: support multi lora adapters and TP (#3885)
shaharmor98 May 8, 2025
2095073
small fix to not check if disconnected on the raw_request if it is None
dsingal0 May 9, 2025
3a9280f
feat: Fallback to NCCL for various patterns when input size is large.…
hyukn May 8, 2025
51a75a3
Cherry-pick trtllm-gen from feat/llama4 to main (#4086)
chenfeiz0326 May 8, 2025
2a75126
[fix] [AutoDeploy] flashinfer usage on H100 (#4162)
lucaslie May 8, 2025
977de4b
Fix incorrect conversion. (#4112)
FrankD412 May 8, 2025
05bf3cd
[fix] Fix llama4 + eagle3 (#3998)
mikeiovine May 8, 2025
0a120c2
Support RingAttention in the BertAttention plugin and the DiT model (…
ChunhuanLin May 9, 2025
b8dfe0e
fix: alltoall padding for chunked MoE (#4157)
dongxuy04 May 9, 2025
be41c13
[feat] Allow overriding cli args with yaml file in trtllm-serve (#4164)
pcastonguay May 9, 2025
c8e4dd8
[TRTLLM-5147][Qwen3] fix: fix bug of attention dp on qwen3_moe model …
byshiue May 9, 2025
0f029a2
chore: Clean up the legacy DeepseekAllreudceFusionOp. (#4081)
hyukn May 9, 2025
929b23f
test: add qwen3 and disaggregated serving accuracy tests to qa test l…
StanleySun639 May 9, 2025
4659238
[TRTLLM-3105][feat] Add Piecewise CUDA Graph Support (#3804)
yizhang-nv May 9, 2025
46f39d1
fix: change pp broadcast pattern for LPs (#4130)
hchings May 9, 2025
78bc10b
[#4085][fix] Fix `apply_per_channel_scale` for extremely large input …
StudyingShao May 9, 2025
98828fe
[nvbug/5262268][fix] Fix trtllm-bench for llama 4 (#4104)
mikeiovine May 9, 2025
71c1893
chore: Fix pipeline break caused by previous PR (#4081) rebase + pipe…
hyukn May 9, 2025
2afc5c0
[https://nvbugspro.nvidia.com/bug/5260676]test: skip fp8 quantization…
crazydemo May 9, 2025
3236545
test: move mistral / mixtral test cases in QA test list into the new …
crazydemo May 9, 2025
d78a50a
test: Add fp8kv to DS-v3-lite integration tests. (#3950)
bobboli May 9, 2025
3dc1aca
Merge branch 'main' into main
dsingal0 May 9, 2025
e9300fb
feat: Add health genearte, health_generate test and fix mpt.py
dsingal0 May 9, 2025
c436961
Merge branch 'main' into main
kaiyux May 9, 2025
6da7ae8
fix issues caught by pre-commit checks
dsingal0 May 9, 2025
41 changes: 39 additions & 2 deletions tensorrt_llm/serve/openai_server.py
@@ -80,6 +80,8 @@ async def validation_exception_handler(_, exc):
self.register_routes()

async def await_disconnected(self, raw_request: Request, promise):
if raw_request is None:
return
while not await raw_request.is_disconnected():
await asyncio.sleep(1)
if not promise.finished:
@@ -116,6 +118,7 @@ def register_routes(self):
self.app.add_api_route("/v1/chat/completions",
self.openai_chat,
methods=["POST"])
self.app.add_api_route("/health_generate", self.health_generate, methods=["GET"])

async def health(self) -> Response:
return Response(status_code=200)
@@ -144,6 +147,40 @@ async def get_kv_cache_events(self) -> JSONResponse:
pass
return JSONResponse(content=events)

async def health_generate(self) -> Response:
"""Health check that performs a minimal generation."""
try:
# Create a minimal chat request
health_request = ChatCompletionRequest(
messages=[{"role": "user", "content": "hi"}], # Minimal prompt (often > 1 token after tokenization)
model=self.model,
max_completion_tokens=1, # Request only 1 token out
stream=False,
temperature=0.0 # Deterministic output
)

mock_request = None

# Call the chat completion logic
response = await self.openai_chat(health_request, mock_request)

# Check if the response indicates success (status code 200)
if response.status_code == 200:
return Response(status_code=200, content="Generation health check OK")
else:
logger.error(f"Health generate check failed with status code: {response.status_code}")
try:
# Attempt to get body for more details if possible
body = response.body if hasattr(response, 'body') else await response.body()
logger.error(f"Health generate check response body: {body}")
except Exception:
pass # Ignore errors trying to get body details
return Response(status_code=500, content="Generation health check failed")

except Exception as e:
logger.error(f"Health generate check encountered exception: {e}", exc_info=True)
return Response(status_code=500, content=f"Generation health check failed: {str(e)}")

async def openai_chat(self, request: ChatCompletionRequest, raw_request: Request) -> Response:

def get_role() -> str:
@@ -161,7 +198,7 @@ async def chat_stream_generator(
pp_results = res.outputs[0]._postprocess_result if self.postproc_worker_enabled else post_processor(res, args)
for pp_res in pp_results:
yield pp_res
yield f"data: [DONE]\n\n"
yield "data: [DONE]\n\n"
nvtx_mark("generation ends")

async def create_chat_response(
@@ -281,7 +318,7 @@ async def create_completion_generator(
pp_result = request_output.outputs[0]._postprocess_result
for pp_res in pp_result:
yield pp_res
yield f"data: [DONE]\n\n"
yield "data: [DONE]\n\n"

async def create_completion_response(
generator: AsyncIterator[Tuple[RequestOutput, Optional[PostprocParams]]]) -> CompletionResponse:
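
For reference, a minimal client-side probe of the new route might look like the sketch below. It assumes a locally running trtllm-serve instance on localhost:8000 and uses the requests library; the base URL and the helper name check_generation_health are illustrative assumptions, not part of this change.

# Sketch: poll the new /health_generate route on a running trtllm-serve
# instance. The base URL and the `requests` dependency are assumptions
# made for illustration only.
import requests

BASE_URL = "http://localhost:8000"  # hypothetical server address


def check_generation_health(base_url: str = BASE_URL, timeout: float = 30.0) -> bool:
    """Return True if the server completed a minimal one-token generation."""
    try:
        resp = requests.get(f"{base_url}/health_generate", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    print("healthy" if check_generation_health() else "unhealthy")

Unlike the plain /health route, which returns 200 unconditionally, a 200 from /health_generate means the server actually produced a completion token, so it is better suited as a readiness signal for load balancers or orchestrators.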
5 changes: 5 additions & 0 deletions tests/unittest/llmapi/apps/_test_llm_server.py
@@ -35,6 +35,11 @@ def test_health(client):
assert response.status_code == 200


def test_health_generate(client):
response = client.get("/health_generate")
assert response.status_code == 200


def test_generate(client):
response = client.post("/generate", json={"prompt": "A B C"})
assert response.status_code == 200
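
The client fixture used above is defined elsewhere in _test_llm_server.py and is not part of this diff. A typical shape, sketched here purely as an assumption, wraps the FastAPI app in a TestClient so the new route can be exercised without starting a real server.

# Hypothetical sketch of a `client` fixture; the real fixture lives
# elsewhere in _test_llm_server.py and may differ in detail.
import pytest
from fastapi.testclient import TestClient


@pytest.fixture(scope="module")
def client(app):  # `app` stands in for however the test module builds the FastAPI app
    with TestClient(app) as test_client:
        yield test_client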