[Model][VLM] Add Qwen2.5-Omni model support (thinker only) #15130
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Sorry I don't have time to review in detail tonight, but from a quick glance, can you add this model to the following pages?
OK, I will add them tomorrow.
@fyabc Qwen/Qwen2.5-Omni-7B?
Sorry for the delay - going to take a look at this PR tonight!
Thank you for the contribution! I have left some comments!
Hi @ywang96 @DarkLight1337, I updated some other examples here; please check the code.
Signed-off-by: Roger Wang <[email protected]>
Looks like this PR doesn't work with huggingface/transformers#36752 yet.
I will take a look at it.
Signed-off-by: fyabc <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Many thanks for making this contribution to vLLM!
I did a few fixes and code changes and confirmed that the examples for this model now work on both V1 and V0 (with use_audio_in_video supported by V0 only), so the only blocker is waiting for huggingface/transformers#36752 to be merged!
Signed-off-by: Roger Wang <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: fyabc <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
```python
# Fuse the separate q/k/v projections into a single column-parallel linear layer.
self.qkv = MergedColumnParallelLinear(
    input_size=embed_dim,
    output_sizes=[projection_size] * 3,
    bias=True,
    quant_config=quant_config,
    prefix=f"{prefix}.qkv",
)
```
After some investigation, it was discovered that this change actually introduced a regression for Qwen2.5-VL inference, so I'm blocking this until we resolve the issue.
I found that it works well when the tensor parallel size (tp) is 1, but the results are not quite right when tp > 1. I am currently investigating further.
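To illustrate the kind of failure mode being discussed, here is a toy sketch (not vLLM internals; all names are hypothetical) of why a merged QKV weight must be sharded per projection under tensor parallelism. If a loader splits the concatenated [q; k; v] weight as one flat block, each rank ends up with rows from the wrong projections, and the bug only surfaces when tp > 1:

```python
# Toy sketch (not vLLM internals): sharding a merged QKV weight across
# tensor-parallel ranks. All function names here are hypothetical.
import numpy as np

def shard_per_projection(weight, output_sizes, tp_size, rank):
    """Correct: split each projection's rows across ranks separately."""
    shards, offset = [], 0
    for size in output_sizes:
        chunk = weight[offset:offset + size]      # rows of one projection
        per_rank = size // tp_size
        shards.append(chunk[rank * per_rank:(rank + 1) * per_rank])
        offset += size
    return np.concatenate(shards)

def shard_flat(weight, tp_size, rank):
    """Buggy: split the concatenated [q; k; v] weight as one flat block."""
    per_rank = weight.shape[0] // tp_size
    return weight[rank * per_rank:(rank + 1) * per_rank]

embed_dim = projection_size = 4
tp_size = 2
qkv_weight = np.arange(3 * projection_size * embed_dim, dtype=np.float32)
qkv_weight = qkv_weight.reshape(3 * projection_size, embed_dim)

for rank in range(tp_size):
    correct = shard_per_projection(qkv_weight, [projection_size] * 3, tp_size, rank)
    flat = shard_flat(qkv_weight, tp_size, rank)
    # With tp_size == 1 both layouts coincide; with tp_size > 1 they diverge.
    print(f"rank {rank}: layouts match = {np.array_equal(correct, flat)}")
```

This only shows why correctness can depend on tp; the actual cause of the Qwen2.5-VL regression may be different.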
This PR adds support for the Qwen2.5-Omni model (thinker only).
Requirements
This PR requires the corresponding transformers PR (huggingface/transformers#36752).
Note: you need to install transformers from source from that branch.
Example Usage
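A minimal sketch of what offline inference for the thinker might look like with vLLM's LLM API. The checkpoint name Qwen/Qwen2.5-Omni-7B is taken from the discussion above; the prompt template is borrowed from the Qwen2-VL convention and is an assumption here, as is passing the image via multi_modal_data. See the PR's shipped examples for the authoritative version (including the use_audio_in_video option mentioned above):

```python
# Hedged sketch of offline inference with the thinker. The prompt string is
# illustrative only; the real template comes from the model's chat template.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="Qwen/Qwen2.5-Omni-7B")

# Assumed Qwen2-VL-style prompt with an image placeholder.
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)

image = Image.open("example.jpg")  # hypothetical local image file

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```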
Notes
The whole Qwen2.5-Omni model includes three parts:
- `thinker`: multimodal inputs -> text responses & hidden states
- `talker`: text responses & hidden states from the thinker -> speech codes
- `code2wav` (streaming codec decoder): speech codes -> speech

This PR only implements the `thinker` part for now; it accepts multimodal inputs (images / videos / audios) and generates text responses, similar to other common VLMs. We have also developed an end-to-end implementation (to be released soon), but due to its significant impact on the vLLM framework architecture, we will not create the related pull request for now.
FIX #15563