[WIP] Add image-text-to-text pipeline #1347
Draft
Closes #1295
I wanted to give implementing this pipeline a try. It's my first time contributing to this repo, so some guidance would be greatly appreciated.
I've had success with the current implementation, but it's very limited in scope, as a quick look at the code will make clear.
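For context, this is roughly how I've been exercising the draft locally. The call signature is just what this draft currently does (nothing is final), and the model repo is only the one I happened to test with:

```js
import { pipeline } from '@huggingface/transformers';

// Task name from this PR; the model repo is just an example I tested with.
const generator = await pipeline('image-text-to-text', 'onnx-community/Qwen2-VL-2B-Instruct');

// Draft signature modeled on the image-to-text pipeline: an image plus a prompt.
const output = await generator('https://example.com/cats.jpg', 'Describe this image.');
console.log(output);
```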
Below I'll lay out what I found getting to this point.
I started by looking at the implementation of the `image-to-text` pipeline. This gave me a general skeleton to work from. I was aiming for Qwen2VL, so that's the model I started testing with. The first problem I ran into was that the `MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES` mapping had no `qwen2_vl` entry, which is the way the model identifies itself. There was, however, a `qwen2-vl`. I don't know if this is a typo, but I could trace it to this very helpful PR. Out of caution I added a new entry to the mapping, but if it's confirmed that the original entry was a typo, I can include the fix in this PR. The folder with the model is correctly named `qwen2_vl`.
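For clarity, the change amounts to one extra entry in the mapping. This is only a sketch; the tuple shape is copied from the neighboring entries in `src/models.js`:

```js
// In src/models.js (sketch) — entry shape follows the existing entries.
const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
    // ...existing entries...
    ['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]], // pre-existing (typo?)
    ['qwen2_vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]], // added: matches the model_type the config reports
]);
```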
After this I could get some output, but it was fairly nonsensical, as if the image was not being taken into account. I later realized this was due to my lack of knowledge of the library and some wrong assumptions. Thinking the problem was with the model, I decided to switch and test with Moondream2, which is also listed in the image-text-to-text mapping.
This led me to a new error: it turns out the processor for this model does not take text, so I had to do the image processing and the text tokenization separately.
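Inside the pipeline that currently looks something like the sketch below. The feature merging and the `padding`/`truncation` options are my assumptions, not settled API:

```js
// Split path for processors that only handle images (e.g. Moondream2).
const image_inputs = await this.processor(images);
const text_inputs = this.tokenizer(texts, { padding: true, truncation: true });

// Merge the two feature objects into a single set of model inputs.
const inputs = { ...image_inputs, ...text_inputs };
```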
Q1: What is the recommendation for normalizing the processor interface in this case?
The new model was also giving me nonsensical replies, but this time it was obvious that it was because I wasn't adding the image placeholders to the text. I tried using `this.processor.apply_chat_template`, but it obviously didn't work because, as mentioned above, the processor provided for Moondream2 only does image processing. Then I tried `this.tokenizer.apply_chat_template(...)`, but for Moondream2 the chosen AutoTokenizer didn't have a chat_template, which led me to go back to Qwen2VL.

Q2: What do I do about these models that don't have chat_templates?
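One direction I could imagine is probing for a template and falling back to the raw content otherwise. Whether `chat_template` is the right property to check is part of the question; this is only a sketch:

```js
let prompt;
if (this.tokenizer.chat_template) {
    // The tokenizer ships a chat template: let it place the special tokens.
    prompt = this.tokenizer.apply_chat_template(messages, {
        tokenize: false,
        add_generation_prompt: true,
    });
} else {
    // No template (e.g. Moondream2's AutoTokenizer): naive concatenation.
    prompt = messages.map((m) => m.content).join('\n');
}
```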
After coming back to Qwen2VL, I put the image placeholder as part of the conversation, and using `apply_chat_template` on the processor worked. After taking a quick look at the qwen2_vl processor code, I could see that in this case the image tag was being correctly parsed. Finally I got a correct response. The drawback is that I had to put the qwen2_vl-specific placeholder in the content.

Q3: What is the recommendation for normalizing the content type declarations?
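For reference, this is roughly the flow that finally worked. The placeholder string is Qwen2-VL specific (quoted from its vocabulary as best I can tell), which is exactly the drawback described above:

```js
// The image placeholder is hard-coded for Qwen2-VL — the problem behind Q3.
const conversation = [
    { role: 'user', content: '<|vision_start|><|image_pad|><|vision_end|>Describe this image.' },
];
const prompt = this.processor.apply_chat_template(conversation, {
    add_generation_prompt: true,
});
const inputs = await this.processor(prompt, images);
const output = await this.model.generate({ ...inputs, max_new_tokens: 128 });
```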
Questions
The write-up above introduced the questions that I'm going to elaborate on in this section. Most of them concern the library's architecture, and given this is my first attempt at contributing to this repo, I don't want to impose any decisions and would appreciate some direction.
Q1
Given that the processor interfaces differ, I'd like to know how the maintainers suggest I address this. The main option I see is changing the interface so that a call like `self.processor(images=images, text=text, ...)` is possible. Obviously, backwards compatibility with the current single-tensor call when no object is passed would need to be maintained.
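Since JS has no keyword arguments, I imagine the normalized call taking an options object, roughly like this (names are mine; nothing here is settled):

```js
// Proposed combined call: both modalities in one options object.
const inputs = await this.processor({ images, text });

// Legacy call: a single image/tensor, unchanged for backwards compatibility.
const legacy_inputs = await this.processor(image);
```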
Q2

Honest question; some guidance here would be appreciated. I don't quite understand why `apply_chat_template` is not working for Moondream2, and some help here would be very nice. I don't know how I can specify where the images belong within the conversation without this step.
Q3
Here the problem for me lies with how the conversation argument is defined. In this library it is simply an array of `Message`, and a message only has two attributes: `role` and `content`. This is very different from the Python library, where the content is itself an object with a `type` and other attributes depending on the type; in the case of type `text`, there's a `text` attribute. This is a useful approach because it allows me to declare that a conversation has an "image" somewhere and leave the specific token to the tokenizer. (A small sketch of this shape is at the bottom of this description.)

I hope my questions make sense and don't just reveal my very superficial knowledge of the library. Any and all guidance will be appreciated. Looking forward to the feedback.
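For reference, here is the content shape I have in mind, mirroring the Python library's conversation format. This structure does not exist in this repo yet; it's only an illustration:

```js
// Hypothetical structured Message content, mirroring Python transformers.
const conversation = [
    {
        role: 'user',
        content: [
            { type: 'image' },                              // resolved to the model-specific placeholder
            { type: 'text', text: 'Describe this image.' }, // plain text segment
        ],
    },
];
```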
. This is a useful approach that allows me to define a conversation as having an "image" somewhere and then the specific token for that will be determined by the tokenizer.I hope my questions make sense and don't just purely reveal my very superficial knowledge of the library. Any and all guidance will be appreciated. Looking forward for the feedback.