Simplify Idefics2, Idefics3, SmolVLM images handling #37291

Open
wants to merge 8 commits into
base: main

Conversation

yonigozlan
Member

@yonigozlan yonigozlan commented Apr 4, 2025

Simplify the handling of images in both processing and modeling.

The images/patches are now flattened before being processed and passed to the models. This simplifies the image processing (no padding is needed along the number-of-images/patches dimension) as well as the modeling code (padding images/patches containing only 0/False no longer need to be removed).
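To make the layout change concrete, here is a minimal sketch using NumPy and made-up shapes. The helper names `pad_patches` and `flatten_patches` are hypothetical and are not the functions touched by this PR; the sketch only contrasts a padded `(batch, max_patches, C, H, W)` layout with a flattened `(total_patches, C, H, W)` one.

```python
import numpy as np


def pad_patches(per_sample_patches):
    """Padded layout: (batch, max_patches, C, H, W) plus a boolean validity mask."""
    max_n = max(p.shape[0] for p in per_sample_patches)
    c, h, w = per_sample_patches[0].shape[1:]
    batch = np.zeros((len(per_sample_patches), max_n, c, h, w), dtype=np.float32)
    mask = np.zeros((len(per_sample_patches), max_n), dtype=bool)
    for i, p in enumerate(per_sample_patches):
        batch[i, : p.shape[0]] = p   # real patches first, zeros after
        mask[i, : p.shape[0]] = True
    return batch, mask


def flatten_patches(per_sample_patches):
    """Flattened layout: (total_patches, C, H, W), with no padding entries."""
    return np.concatenate(per_sample_patches, axis=0)


# Two samples: the first was split into 3 patches, the second into 1.
samples = [
    np.ones((3, 3, 4, 4), dtype=np.float32),
    np.ones((1, 3, 4, 4), dtype=np.float32),
]
padded, mask = pad_patches(samples)
flat = flatten_patches(samples)

# With padding, the model side has to drop the all-zero entries again.
unpadded = padded[mask]
```

In this toy setup, `unpadded` and `flat` hold the same patches in the same order, which is the equivalence the flattened path relies on.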

I tested thoroughly for each model with multiple images, batched images, etc. and found no differences.

Cc @andimarafioti @orrzohar

@github-actions github-actions bot marked this pull request as draft April 4, 2025 17:19

github-actions bot commented Apr 4, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yonigozlan yonigozlan marked this pull request as ready for review April 4, 2025 18:09
@github-actions github-actions bot requested review from ArthurZucker and qubvel April 4, 2025 18:09
@yonigozlan yonigozlan requested a review from zucchini-nlp April 7, 2025 23:54
@yonigozlan
Member Author

yonigozlan commented Apr 7, 2025

@zucchini-nlp Hello! Pinging you here since SmolVLM also handles video inputs. What do you think about returning flattened pixel_values by default when processing videos, instead of grouping them by frame or video instance? Also, since most VLM image (and maybe video?) processors that do some kind of patching/splitting already flatten the patches during preprocessing, we might want to update the base processing tests to account for that, or at least make them parameterized.

Member

@zucchini-nlp zucchini-nlp left a comment


Cool, thanks for cleaning it! I have a few concerns though.

  1. After this PR, Idefics models will output pixel values whose first dim is not necessarily the batch size whenever image splitting happens. We had problems in the past with Gemma3 (Gemma3 can't be fine-tuned on multi-image examples trl#3121 (comment)) and Qwen2-VL (Qwen2-VL: Multi-GPU training #33666) for the same reason. TL;DR: training loaders/frameworks iterate over data assuming the first dim is batch, and fail when it is not.
    I realize this is not a common case, but we might be breaking training for some users with this, so I'm a bit hesitant to return flat images. LMK what you think.
  2. Do the model logits stay the same if we test with several batches and several images per batch? Let's run slow tests before merging :)
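For illustration, a rough sketch of the first concern with NumPy and toy shapes. `naive_micro_batches` stands in for any loader that slices every tensor along dim 0, and `patches_per_sample` is hypothetical bookkeeping (not something this PR is known to return) showing one way per-sample grouping could be recovered.

```python
import numpy as np

# Toy batch: 2 samples, but 4 flattened patches (3 for sample 0, 1 for sample 1).
input_ids = np.zeros((2, 16), dtype=np.int64)             # (batch, seq_len)
pixel_values = np.zeros((4, 3, 8, 8), dtype=np.float32)   # (total_patches, C, H, W)
patches_per_sample = [3, 1]  # hypothetical per-sample patch counts


def naive_micro_batches(batch, size):
    """Slice every array along dim 0, as code assuming dim 0 == batch would."""
    n = len(batch["input_ids"])
    return [
        {key: value[i : i + size] for key, value in batch.items()}
        for i in range(0, n, size)
    ]


chunks = naive_micro_batches(
    {"input_ids": input_ids, "pixel_values": pixel_values}, size=1
)
# chunks[0] pairs sample 0 with a single patch although it owns three,
# and the last two patches are never visited at all.


def split_by_sample(flat, counts):
    """Regroup flattened patches per sample using the known counts."""
    return np.split(flat, np.cumsum(counts)[:-1], axis=0)


grouped = split_by_sample(pixel_values, patches_per_sample)
```

The slicing does not raise here; it silently misaligns text and image inputs, which is arguably worse than a hard failure.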

elif inputs_embeds is not None:
batch_size, seq_length, _ = inputs_embeds.shape
else:
if input_ids is None and inputs_embeds is None:
Member


This also needs to cover the case where both are not None:

if (input_ids is None) ^ (inputs_embeds is not None):
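A small sketch of how that XOR condition behaves across the four combinations. The wrapper `check_exactly_one` and the error message are illustrative assumptions, not code from this PR.

```python
def check_exactly_one(input_ids=None, inputs_embeds=None):
    """Raise unless exactly one of the two inputs is provided."""
    # True when both are None (True ^ False) or both are set (False ^ True).
    if (input_ids is None) ^ (inputs_embeds is not None):
        raise ValueError("You must specify exactly one of input_ids or inputs_embeds")


def is_rejected(**kwargs):
    """Helper for the demo: report whether the check raises."""
    try:
        check_exactly_one(**kwargs)
    except ValueError:
        return True
    return False


results = {
    "neither": is_rejected(),
    "ids_only": is_rejected(input_ids=[1, 2, 3]),
    "embeds_only": is_rejected(inputs_embeds=[[0.1, 0.2]]),
    "both": is_rejected(input_ids=[1, 2, 3], inputs_embeds=[[0.1, 0.2]]),
}
```

By contrast, `input_ids is None and inputs_embeds is None` only rejects the "neither" case and lets "both" through.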

elif inputs_embeds is not None:
batch_size, seq_length, _ = inputs_embeds.shape
else:
if input_ids is None and inputs_embeds is None:
Member


same here

3 participants