
Commit 9aab965

molbap and ariG23498 authored
Add vision contribution guide (#41456)
* vision contrib guide
* Update CONTRIBUTING.md
* Update CONTRIBUTING.md
* Update CONTRIBUTING.md
* Update CONTRIBUTING.md
* Update CONTRIBUTING.md
* Update CONTRIBUTING.md
* update tiny things

Co-authored-by: Aritra Roy Gosthipaty <[email protected]>
1 parent 1a034ce commit 9aab965

File tree

1 file changed: +119 -1 lines changed

CONTRIBUTING.md

Lines changed: 119 additions & 1 deletion
@@ -112,7 +112,125 @@ New models are constantly released and if you want to implement a new model, ple

If you are willing to contribute the model yourself, let us know so we can help you add it to 🤗 Transformers!

-We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/add_new_model).
+We have a technical guide for [how to add a model to 🤗 Transformers](https://huggingface.co/docs/transformers/modular_transformers).

### Vision-Language Model Contribution Checklist

If you're contributing a **vision-language model** (or any multimodal model that processes images/videos), please follow this checklist. Maintainers will use it to review your PR, and completing these steps will significantly increase the likelihood of your PR being merged quickly.

**Required checklist for all vision-language model contributions:**

**1. Implement a modular file**

All new models should use the modular architecture pattern. Create a `modular_<model_name>.py` file using the modular model converter:

- Use the CLI command [`transformers add-new-model-like`](https://github.com/huggingface/transformers/blob/main/src/transformers/cli/add_new_model_like.py) to generate a modular skeleton and get started
- Keep as much code as possible in the modular file: the modeling code must live there, and ideally the configuration should too (see the sketch below)
- Reuse existing patterns from similar models as much as possible

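As a rough illustration, a modular file mostly subclasses components of existing models and overrides only what differs. This is a sketch only, assuming a hypothetical `MyModel` that closely follows Llama/LLaVA components (all names below are placeholders):

```python
# modular_mymodel.py -- illustrative sketch; inherit from whichever existing
# model is closest to yours and override only what actually differs.
from transformers.models.llama.modeling_llama import LlamaMLP
from transformers.models.llava.modeling_llava import LlavaForConditionalGeneration


class MyModelMLP(LlamaMLP):
    # Identical to LlamaMLP; the modular converter copies the full implementation
    # into modeling_mymodel.py under the new class name.
    pass


class MyModelForConditionalGeneration(LlavaForConditionalGeneration):
    # Override only the methods that differ from the parent; everything else is
    # inherited and expanded by utils/modular_model_converter.py.
    pass
```
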
To verify your modular file is correct, run:

```bash
python utils/modular_model_converter.py <model_name>
```

This will generate the separate files (`modeling_*.py`, `configuration_*.py`, etc.) from your modular file. The CI will enforce that these generated files match your modular file.

**2. Add a fast image processor (for image models)**

If your model processes images, implement a fast image processor that uses `torch` and `torchvision` instead of PIL/NumPy for better inference performance:

- See the detailed guide in [#36978](https://github.com/huggingface/transformers/issues/36978)
- Fast processors inherit from `BaseImageProcessorFast` (see the sketch after this list)
- Examples: `LlavaOnevisionImageProcessorFast`, `Idefics2ImageProcessorFast`

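In many cases a fast image processor can be expressed as class-level defaults on top of `BaseImageProcessorFast`, which provides the shared torch/torchvision preprocessing pipeline. A minimal sketch; the attribute values below are placeholders and must match the original model's preprocessing:

```python
# image_processing_mymodel_fast.py -- illustrative sketch with placeholder values.
from transformers.image_processing_utils_fast import BaseImageProcessorFast
from transformers.image_utils import PILImageResampling


class MyModelImageProcessorFast(BaseImageProcessorFast):
    # Class-level defaults consumed by the shared torch/torchvision pipeline
    # (resize -> rescale -> normalize) implemented in the base class.
    resample = PILImageResampling.BICUBIC
    size = {"height": 384, "width": 384}
    image_mean = [0.5, 0.5, 0.5]
    image_std = [0.5, 0.5, 0.5]
    do_resize = True
    do_rescale = True
    do_normalize = True
```
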
**3. Create a weight conversion script**

Add a `convert_<model_name>_to_hf.py` script that converts the original model weights to the Hugging Face format:

- The script should handle checkpoint loading, key mapping, and saving in HF format (see the sketch after this list)
- Include usage examples and documentation in the script
- Examples: [`convert_llava_onevision_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_onevision/convert_llava_onevision_weights_to_hf.py), [`convert_idefics2_weights_to_hf.py`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics2/convert_idefics2_weights_to_hf.py)

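The key-mapping step usually boils down to renaming parameter prefixes. A minimal sketch, assuming hypothetical prefixes in the original checkpoint; a real script must mirror the original naming scheme, verify outputs, and finish with `save_pretrained`:

```python
# convert_mymodel_to_hf.py -- illustrative sketch; all key names are hypothetical.
import torch

# Map original parameter-name prefixes to their Transformers equivalents.
ORIGINAL_TO_HF_PREFIXES = {
    "visual.encoder.": "model.vision_tower.",
    "mm_projector.": "model.multi_modal_projector.",
    "llm.": "model.language_model.",
}


def convert_state_dict(original_state_dict: dict) -> dict:
    """Rename keys from the original checkpoint to the HF naming scheme."""
    converted = {}
    for key, value in original_state_dict.items():
        for old_prefix, new_prefix in ORIGINAL_TO_HF_PREFIXES.items():
            if key.startswith(old_prefix):
                key = new_prefix + key[len(old_prefix):]
                break
        converted[key] = value
    return converted


if __name__ == "__main__":
    state_dict = torch.load("original_checkpoint.pth", map_location="cpu")
    hf_state_dict = convert_state_dict(state_dict)
    # ...then load it into the HF model, check outputs against the original,
    # and call model.save_pretrained(...) / processor.save_pretrained(...).
```
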
**4. Add integration tests with exact output matching**

At minimum, add an `IntegrationTest` class that tests end-to-end generation (processing and modeling) with **exact** output matching:

- For generative models: test that the generated text matches the expected output exactly
- For non-generative models: test that the output logits match the expected values (see the second example below)
- Tests should use real checkpoints (load in 4-bit or half precision if the checkpoint is too big to fit in our CI runners) and real inputs
- Example pattern:

```python
import unittest

from transformers import AutoProcessor
from transformers.testing_utils import slow


class MyModelIntegrationTest(unittest.TestCase):
    @slow
    def test_model_integration(self):
        # "org/model-name" stands in for the real checkpoint on the Hub
        model = MyModelForConditionalGeneration.from_pretrained("org/model-name")
        processor = AutoProcessor.from_pretrained("org/model-name")

        # `image` and `prompt` are real inputs prepared for the test
        inputs = processor(images=image, text=prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=20)

        EXPECTED_TEXT = "exact expected output"
        self.assertEqual(processor.decode(output[0]), EXPECTED_TEXT)
```

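For non-generative models, the same idea applies to logits rather than text. A minimal sketch; the model class, checkpoint name, and expected values are placeholders:

```python
import unittest

import torch

from transformers import AutoProcessor
from transformers.testing_utils import slow


class MyModelLogitsIntegrationTest(unittest.TestCase):
    @slow
    def test_model_logits(self):
        model = MyModelForImageClassification.from_pretrained("org/model-name")
        processor = AutoProcessor.from_pretrained("org/model-name")

        # `image` is a real input prepared for the test
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits

        # Compare a slice of the logits against values recorded from the original model
        EXPECTED_SLICE = torch.tensor([-0.1234, 0.5678, 1.2345])
        torch.testing.assert_close(logits[0, :3], EXPECTED_SLICE, rtol=1e-4, atol=1e-4)
```
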
See `tests/models/llava_onevision/test_modeling_llava_onevision.py` for complete examples.

**5. Update documentation**

Add or update model documentation:

- Create `docs/source/en/model_doc/<model_name>.md` with usage examples (if the CLI hasn't already created it)
- Include the model description, a paper link, and basic usage with `Pipeline` and `AutoModel` (see the sketch after this list)
- Add the model to the appropriate TOC files

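As an illustration of the kind of "basic usage" snippet the doc page should contain, here is a sketch assuming the model registers with the `image-text-to-text` pipeline; the checkpoint name is a placeholder, and the task and call pattern should be adapted to your model:

```python
# Hypothetical usage snippet of the kind the model doc should show.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="org/model-name")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=30))
```
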
**6. Look for reusable patterns**

The library has 400+ models with many established patterns:

- Search for similar models (e.g., other vision-language models)
- Reuse attention mechanisms, layer implementations, and processing patterns
- Check models like LLaVA, Idefics2, and Fuyu for vision-language patterns
- Use the provided decorators (`auto_docstring`, `can_return_tuple`, `check_model_inputs`, and `_can_record_outputs`) where relevant (see the sketch after this list)
- Don't reinvent the wheel

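As a rough sketch of where these helpers typically appear; the class and arguments below are placeholders, so check a recently added model such as LLaVA for the exact, current pattern:

```python
# Illustrative sketch of decorator usage on a forward pass; names are placeholders.
from transformers.utils import auto_docstring, can_return_tuple


class MyModelForConditionalGeneration(MyModelPreTrainedModel):
    @can_return_tuple
    @auto_docstring
    def forward(self, input_ids=None, pixel_values=None, attention_mask=None, **kwargs):
        # `auto_docstring` fills in the standard argument documentation;
        # `can_return_tuple` handles the return_dict / tuple output convention.
        ...
```
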
**7. Run quality checks and read the output**

Before submitting your PR, install quality dependencies and run the full check suite:

```bash
pip install -e ".[quality]"
make fixup
```

**Important**: Take time to read the output of `make fixup`. It will:

- Lint and format your code automatically
- Run consistency checks (imports, docstrings, etc.)
- Show any remaining issues that need manual fixes

All checks must pass before your PR can be merged.

**If this checklist is complete, your PR has a very high likelihood of being merged!** Following these steps makes the maintainers' work much easier and will reduce the number of review iterations, getting your important work out there faster.

#### Copy-pastable checklist for maintainers

Here's a condensed version maintainers can copy into PRs:

```markdown
## Multimodal Model Addition Checklist

Please ensure your PR completes all of the following items. See the [full checklist](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#vision-language-model-contribution-checklist) for details.

- [ ] **Modular file**: `modular_<model_name>.py` implemented and verified with `python utils/modular_model_converter.py <model_name>`
- [ ] **Fast image processor**: Implemented using `BaseImageProcessorFast` (see [#36978](https://github.com/huggingface/transformers/issues/36978))
- [ ] **Conversion script**: `convert_<model_name>_to_hf.py` added with usage examples
- [ ] **Integration tests**: End-to-end tests with exact output matching (text or logits)
- [ ] **Documentation**: Model docs added/updated in `docs/source/en/model_doc/`
- [ ] **Pattern reuse**: Verified against similar models (LLaVA, Idefics2, etc.)
- [ ] **Quality checks**: `make fixup` passes with no errors
```

## Do you want to add documentation?