[WIP] Add image-text-to-text pipeline #1347
Draft
Closes #1295
I wanted to give implementing this pipeline a try. It's my first time contributing to this repo, so some guidance would be greatly appreciated.
I've had success with the current implementation, but it's very limited in scope, as a quick look at the code will make clear.
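For context, this is roughly how I've been exercising the draft locally. The call signature is just what this draft currently does (nothing is final), and the model repo is only the one I happened to test with:

```js
import { pipeline } from '@huggingface/transformers';

// Task name from this PR; the model repo is just an example I tested with.
const generator = await pipeline('image-text-to-text', 'onnx-community/Qwen2-VL-2B-Instruct');

// Draft signature modeled on the image-to-text pipeline: an image plus a prompt.
const output = await generator('https://example.com/cats.jpg', 'Describe this image.');
console.log(output);
```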
Below I'll lay out what I found getting to this point.
I started by looking at the implementation of the `image-to-text` pipeline. This gave me a general skeleton to work from. I was aiming for Qwen2VL, so that's the model I started testing with. The first problem I ran into was that the `MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES` mapping had no `qwen2_vl` entry, which is the way the model identifies itself. There was, however, a `qwen2-vl`. I don't know if this is a typo, but I could trace it to this very helpful PR. Out of caution I added a new entry to the mapping, but if it's confirmed that the original entry was a typo, I can include the fix in this PR. The folder with the model is correctly named `qwen2_vl`.
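For clarity, the change amounts to one extra entry in the mapping. This is only a sketch; the tuple shape is copied from the neighboring entries in `src/models.js`:

```js
// In src/models.js (sketch) — entry shape follows the existing entries.
const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
    // ...existing entries...
    ['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]], // pre-existing (typo?)
    ['qwen2_vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]], // added: matches the model_type the config reports
]);
```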
After this I could get some output, but it was fairly nonsensical, as if the image was not being taken into account. I later realized this was due to my lack of knowledge of the library and some wrong assumptions. Thinking the problem was with the model, I decided to switch and test with Moondream2, which is also listed in the image-text-to-text mapping.
This led me to a new error: it turns out the processor for this model does not take text, so I had to do the image processing and the text tokenization separately.
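Inside the pipeline that currently looks something like the sketch below. The feature merging and the `padding`/`truncation` options are my assumptions, not settled API:

```js
// Split path for processors that only handle images (e.g. Moondream2).
const image_inputs = await this.processor(images);
const text_inputs = this.tokenizer(texts, { padding: true, truncation: true });

// Merge the two feature objects into a single set of model inputs.
const inputs = { ...image_inputs, ...text_inputs };
```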
Q1: What is the recommendation for normalizing the processor interface in this case?
The new model was also giving me nonsensical replies, but this time it was obvious that it was because I wasn't adding the image placeholders to the text. I tried using `this.processor.apply_chat_template`, but it obviously didn't work because, as mentioned above, the processor provided for Moondream2 only does image processing. Then I tried `this.tokenizer.apply_chat_template(...)`, but for Moondream2 the chosen AutoTokenizer didn't have a chat_template, which led me to go back to Qwen2VL.

Q2: What do I do about these models that don't have chat_templates?
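One direction I could imagine is probing for a template and falling back to the raw content otherwise. Whether `chat_template` is the right property to check is part of the question; this is only a sketch:

```js
let prompt;
if (this.tokenizer.chat_template) {
    // The tokenizer ships a chat template: let it place the special tokens.
    prompt = this.tokenizer.apply_chat_template(messages, {
        tokenize: false,
        add_generation_prompt: true,
    });
} else {
    // No template (e.g. Moondream2's AutoTokenizer): naive concatenation.
    prompt = messages.map((m) => m.content).join('\n');
}
```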
After coming back to Qwen2VL, I put the image placeholder as part of the conversation, and using `apply_chat_template` on the processor worked. After taking a quick look at the qwen2_vl processor code, I could see that in this case the image tag was being correctly parsed. Finally I got a correct response. The drawback is that I had to put the qwen2_vl-specific placeholder in the content.

Q3: What is the recommendation for normalizing the content type declarations?
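For reference, this is roughly the flow that finally worked. The placeholder string is Qwen2-VL specific (quoted from its vocabulary as best I can tell), which is exactly the drawback described above:

```js
// The image placeholder is hard-coded for Qwen2-VL — the problem behind Q3.
const conversation = [
    { role: 'user', content: '<|vision_start|><|image_pad|><|vision_end|>Describe this image.' },
];
const prompt = this.processor.apply_chat_template(conversation, {
    add_generation_prompt: true,
});
const inputs = await this.processor(prompt, images);
const output = await this.model.generate({ ...inputs, max_new_tokens: 128 });
```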
Questions
The write-up above introduced the questions that I'm going to elaborate on in this section. Most of them concern the library's architecture, and given this is my first attempt at contributing to this repo, I don't want to impose any decisions and would appreciate some direction.
Q1
Given that the processor interfaces differ, I'd like to know how the maintainers suggest I address this. The main option I see is changing the interface so that a call like `self.processor(images=images, text=text, ...)` is possible. Obviously, backwards compatibility with the current single-tensor call when no object is passed would need to be maintained.
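Since JS has no keyword arguments, I imagine the normalized call taking an options object, roughly like this (names are mine; nothing here is settled):

```js
// Proposed combined call: both modalities in one options object.
const inputs = await this.processor({ images, text });

// Legacy call: a single image/tensor, unchanged for backwards compatibility.
const legacy_inputs = await this.processor(image);
```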
Q2

Honest question; some guidance here would be appreciated. I don't quite understand why `apply_chat_template` is not working for Moondream2, and some help here would be very nice. I don't know how I can specify where the images belong within the conversation without this step.
Q3
Here the problem for me lies with how the conversation argument is defined. In this library it is simply an array of `Message`, and a message only has two attributes: `role` and `content`. This is very different from the Python library, where the content is itself an object with a `type` and other attributes depending on the type; in the case of type `text`, there's a `text` attribute. This is a useful approach because it allows me to declare that a conversation has an "image" somewhere and leave the specific token to the tokenizer. (A small sketch of this shape is at the bottom of this description.)

I hope my questions make sense and don't just reveal my very superficial knowledge of the library. Any and all guidance will be appreciated. Looking forward to the feedback.
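For reference, here is the content shape I have in mind, mirroring the Python library's conversation format. This structure does not exist in this repo yet; it's only an illustration:

```js
// Hypothetical structured Message content, mirroring Python transformers.
const conversation = [
    {
        role: 'user',
        content: [
            { type: 'image' },                              // resolved to the model-specific placeholder
            { type: 'text', text: 'Describe this image.' }, // plain text segment
        ],
    },
];
```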
. This is a useful approach that allows me to define a conversation as having an "image" somewhere and then the specific token for that will be determined by the tokenizer.I hope my questions make sense and don't just purely reveal my very superficial knowledge of the library. Any and all guidance will be appreciated. Looking forward for the feedback.