-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
Lazily initialize null_block only when it is needed #26939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Elaine Zhao <[email protected]>
Signed-off-by: Elaine Zhao <[email protected]>
Signed-off-by: Elaine Zhao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
…ee blocks exist Signed-off-by: Elaine Zhao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Signed-off-by: Elaine Zhao <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
…block as null block Signed-off-by: Elaine Zhao <[email protected]>
Signed-off-by: Elaine Zhao <[email protected]>
|
I think lazy initialization is a bit magic. Can you take |
Thanks @heheda12345. I agree that
In addition, I think it should also account for:
However, those validations seem orthogonal to this PR. The goal of this PR is purely to introduce lazy initialization of null_block to avoid unnecessary early allocations. I would like to:
Let me know your thoughts. |
Purpose
After PR #14097 we now always subtract 1 usable block from the block pool and designate it as the
null_block, even when sliding window attention is not used. This creates an edge case when users specify the minimum required number of blocks for their model & config via--num-gpu-blocks-overridebut then vllm comes and deducts 1 from that (if user specifies 1 block it's effectively 0).This PR proposes to lazily-initialize the null_block only when it is called (mainly in
remove_skipped_blocks). As a result, for kv cache managers like FullAttentionManager, thenull_blocknever materializes and all blocks are available for allocation.Test Plan
Added unit tests.
Also tested the change E2E locally on Neuron device with the following server command and request:
Test Result
Without the change, the vllm scheduler is unable to schedule the request since it only sees 3 blocks available. It will repeatedly send an empty SchedulerOutput to the model runner and the server hangs indefinitely.
After the change, the request goes through successfully.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.