[Executor] Avoid OOM when start the service while Enable Chunked Prefill + CudaGraph #2914

littledgg · 2025-07-18T08:40:58Z

When chunked prefill is enabled, max-model-len can be much larger than max-num-batched-tokens. Capturing a CUDA Graph sized for max-model-len then consumes a lot of GPU memory, which may cause an OOM when the service starts.
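To illustrate the idea, here is a minimal sketch of how the token count for the capture run could be bounded. It is not the code from this PR: the function name `capture_token_budget` and its parameters are hypothetical, and it assumes that with chunked prefill a single forward pass never processes more than max-num-batched-tokens tokens.

```python
def capture_token_budget(
    max_model_len: int,
    max_num_batched_tokens: int,
    enable_chunked_prefill: bool,
) -> int:
    """Return how many tokens a warm-up / graph-capture run should use.

    Hypothetical helper for illustration only; the names are assumptions
    and do not come from the PR.
    """
    if max_model_len <= 0 or max_num_batched_tokens <= 0:
        raise ValueError("both limits must be positive")

    if enable_chunked_prefill:
        # Assumption: with chunked prefill, one step handles at most
        # max_num_batched_tokens tokens, so a larger capture only wastes memory.
        return min(max_model_len, max_num_batched_tokens)

    # Without chunked prefill, a whole prompt may be processed in one step.
    return max_model_len


if __name__ == "__main__":
    # Example values chosen for illustration, not taken from the PR.
    print(capture_token_budget(131072, 2048, enable_chunked_prefill=True))
    print(capture_token_budget(131072, 2048, enable_chunked_prefill=False))
```

Under that assumption, the capture cost scales with the per-step token limit rather than with the full context length.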

paddle-bot · 2025-07-18T08:41:02Z

Thanks for your contribution!

gongshaotian

LGTM

littledgg · 2025-07-21T07:17:47Z

Due to some mistakes in this PR, the new PR is #2936.

paddle-bot bot added the contributor External developers label Jul 18, 2025

gongshaotian approved these changes Jul 18, 2025

littledgg closed this Jul 21, 2025

littledgg force-pushed the oom_chunkedprefill_cudagraph branch from 093aaab to 67990e0 Compare July 21, 2025 06:57

littledgg mentioned this pull request Jul 21, 2025

[Executor] Avoid OOM when start the service while Enable Chunked Prefill + CudaGraph #2936

Merged
