
Commit a4fe135

Merge pull request #239 from mistralai/add_patch_merger

Add support for Mistral Small 3.1

2 parents de6f646 + af0d803

File tree: 6 files changed, +304 -57 lines

README.md

Lines changed: 92 additions & 6 deletions
@@ -15,6 +15,7 @@ Blog Mathstral 7B: [https://mistral.ai/news/mathstral/](https://mistral.ai/news/mathstral/)
 Blog Nemo: [https://mistral.ai/news/mistral-nemo/](https://mistral.ai/news/mistral-nemo/) \
 Blog Mistral Large 2: [https://mistral.ai/news/mistral-large-2407/](https://mistral.ai/news/mistral-large-2407/) \
 Blog Pixtral 12B: [https://mistral.ai/news/pixtral-12b/](https://mistral.ai/news/pixtral-12b/)
+Blog Mistral Small 3.1: [https://mistral.ai/news/mistral-small-3-1/](https://mistral.ai/news/mistral-small-3-1/)
 
 Discord: [https://discord.com/invite/mistralai](https://discord.com/invite/mistralai)\
 Documentation: [https://docs.mistral.ai/](https://docs.mistral.ai/)\
@@ -39,6 +40,8 @@ cd $HOME/mistral-inference && poetry install .
 
 ## Model download
 
+### Direct links
+
 | Name | Download | md5sum |
 |-------------|-------|-------|
 | 7B Instruct | https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-Instruct-v0.3.tar | `80b71fcb6416085bcb4efad86dfb4d52` |
@@ -54,16 +57,27 @@ cd $HOME/mistral-inference && poetry install .
 | Nemo Instruct | https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-instruct-2407.tar | `296fbdf911cb88e6f0be74cd04827fe7` |
 | Mistral Large 2 | https://models.mistralcdn.com/mistral-large-2407/mistral-large-instruct-2407.tar | `fc602155f9e39151fba81fcaab2fa7c4` |
 
-Note: 
+Note:
 - **Important**:
   - `mixtral-8x22B-Instruct-v0.3.tar` is exactly the same as [Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1), only stored in `.safetensors` format
   - `mixtral-8x22B-v0.3.tar` is the same as [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1), but has an extended vocabulary of 32768 tokens.
   - `codestral-22B-v0.1.tar` has a custom non-commercial license, called [Mistral AI Non-Production (MNPL) License](https://mistral.ai/licenses/MNPL-0.1.md)
   - `mistral-large-instruct-2407.tar` has a custom non-commercial license, called [Mistral AI Research (MRL) License](https://mistral.ai/licenses/MRL-0.1.md)
-- All of the listed models above support function calling. For example, Mistral 7B Base/Instruct v3 is a minor update to Mistral 7B Base/Instruct v2, with the addition of function calling capabilities. 
-- The "coming soon" models will include function calling as well. 
+- All of the listed models above support function calling. For example, Mistral 7B Base/Instruct v3 is a minor update to Mistral 7B Base/Instruct v2, with the addition of function calling capabilities.
+- The "coming soon" models will include function calling as well.
 - You can download the previous versions of our models from our [docs](https://docs.mistral.ai/getting-started/open_weight_models/#downloading).
 
+### From Hugging Face Hub
+
+| Name | ID | URL |
+|-------------|-------|-------|
+| Pixtral Large Instruct | mistralai/Pixtral-Large-Instruct-2411 | https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411 |
+| Pixtral 12B Base | mistralai/Pixtral-12B-Base-2409 | https://huggingface.co/mistralai/Pixtral-12B-Base-2409 |
+| Pixtral 12B | mistralai/Pixtral-12B-2409 | https://huggingface.co/mistralai/Pixtral-12B-2409 |
+| Mistral Small 3.1 24B Base | mistralai/Mistral-Small-3.1-24B-Base-2503 | https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Base-2503 |
+| Mistral Small 3.1 24B Instruct | mistralai/Mistral-Small-3.1-24B-Instruct-2503 | https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503 |
+
+
 ### Usage
 
 **News!!!**: Mistral Large 2 is out. Read more about its capabilities [here](https://mistral.ai/news/mistral-large-2407/).
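
For the direct links above, a downloaded archive can be checked against the md5sum column. A minimal sketch (not part of the diff), using the 7B Instruct row; any other row works the same way:

```python
# Sketch: download one of the direct-link archives and verify its checksum
# against the md5sum column of the table above.
import hashlib
import urllib.request

url = "https://models.mistralcdn.com/mistral-7b-v0-3/mistral-7B-Instruct-v0.3.tar"
expected_md5 = "80b71fcb6416085bcb4efad86dfb4d52"

filename, _ = urllib.request.urlretrieve(url, "mistral-7B-Instruct-v0.3.tar")

md5 = hashlib.md5()
with open(filename, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        md5.update(chunk)
assert md5.hexdigest() == expected_md5, "checksum mismatch - re-download the archive"
```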
@@ -83,7 +97,7 @@ mkdir -p $12B_DIR
 tar -xf mistral-nemo-instruct-2407.tar -C $12B_DIR
 ```
 
-or 
+or
 
 ```sh
 export M8x7B_DIR=$MISTRAL_MODEL/8x7b_instruct
@@ -92,6 +106,27 @@ mkdir -p $M8x7B_DIR
 tar -xf Mixtral-8x7B-v0.1-Instruct.tar -C $M8x7B_DIR
 ```
 
+For weights hosted on the Hugging Face Hub, here is an example of how to download [Mistral Small 3.1 24B Instruct](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503):
+
+```python
+from pathlib import Path
+from huggingface_hub import snapshot_download
+
+mistral_models_path = Path.home().joinpath("mistral_models")
+
+model_path = mistral_models_path / "mistral-small-3.1-instruct"
+model_path.mkdir(parents=True, exist_ok=True)
+
+repo_id = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
+
+snapshot_download(
+    repo_id=repo_id,
+    allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"],
+    local_dir=model_path,
+)
+```
+
 ## Usage
 
 The following sections give an overview of how to run the model from the Command-line interface (CLI) or directly within Python.
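
The three `allow_patterns` entries above are exactly what the rest of this page relies on: `params.json` holds the model arguments, `consolidated.safetensors` the weights (loaded by `Transformer.from_folder`, see the transformer.py hunks below), and `tekken.json` the tokenizer consumed by `MistralTokenizer.from_file`.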
@@ -170,7 +205,7 @@ To use [Codestral-Mamba](https://mistral.ai/news/codestral-mamba/) as a coding a
 Make sure `$7B_CODESTRAL_MAMBA` is set to a valid path to the downloaded codestral-mamba folder, e.g. `$HOME/mistral_models/mamba-codestral-7B-v0.1`.
 
 You then need to additionally install the following packages:
-
+
 ```
 pip install packaging mamba-ssm causal-conv1d transformers
 ```
@@ -194,6 +229,19 @@ If you prompt it with *"Albert likes to surf every week. Each surfing session la
 
 You can then continue chatting afterwards, *e.g.* with *"How much would he spend in a year?"*.
 
+- **Chat with Mistral Small 3.1 24B Instruct**
+
+To use [Mistral Small 3.1 24B Instruct](https://mistral.ai/news/mistral-small-3-1/) as an assistant, you can run the following command using `mistral-chat`.
+Make sure `$MISTRAL_SMALL_3_1_INSTRUCT` is set to a valid path to the downloaded Mistral Small folder, e.g. `$HOME/mistral_models/mistral-small-3.1-instruct`.
+
+```sh
+mistral-chat $MISTRAL_SMALL_3_1_INSTRUCT --instruct --max_tokens 256
+```
+
+If you prompt it with *"The above image presents an image of which park ? Please give the hints to identify the park."* together with the image URL *https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png*, the model should answer that it is Yosemite National Park and give hints to identify it.
+
+You can then continue chatting afterwards, *e.g.* with *"What is the name of the lake in the image?"*. The model should respond that it is not a lake but a river.
+
 ### Python
 
 - *Instruction Following*:
@@ -222,6 +270,44 @@ result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])
 print(result)
 ```
 
+- *Multimodal Instruction Following*:
+
+```python
+from pathlib import Path
+
+from mistral_common.protocol.instruct.messages import ImageURLChunk, TextChunk
+from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
+from mistral_inference.generate import generate
+from mistral_inference.transformer import Transformer
+
+model_path = Path.home().joinpath("mistral_models") / "mistral-small-3.1-instruct"  # change to your extracted model path
+
+tokenizer = MistralTokenizer.from_file(model_path / "tekken.json")
+model = Transformer.from_folder(model_path)
+
+url = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
+prompt = "The above image presents an image of which park ? Please give the hints to identify the park."
+
+user_content = [ImageURLChunk(image_url=url), TextChunk(text=prompt)]
+
+tokens, images = tokenizer.instruct_tokenizer.encode_user_content(user_content, False)
+
+out_tokens, _ = generate(
+    [tokens],
+    model,
+    images=[images],
+    max_tokens=256,
+    temperature=0.15,
+    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
+)
+result = tokenizer.decode(out_tokens[0])
+
+print("Prompt:", prompt)
+print("Completion:", result)
+```
+
 - *Function Calling*:
 
 ```py
@@ -298,7 +384,7 @@ print(middle)
 
 ### One-file-ref
 
-If you want a self-contained implementation, look at `one_file_ref.py`, or run it with 
+If you want a self-contained implementation, look at `one_file_ref.py`, or run it with
 
 ```
 python -m one_file_ref $M7B_DIR

src/mistral_inference/args.py

Lines changed: 6 additions & 0 deletions
@@ -6,6 +6,8 @@
 from mistral_inference.lora import LoraArgs
 from mistral_inference.moe import MoeArgs
 
+PATCH_MERGE = "patch_merge"
+
 
 @dataclass
 class VisionEncoderArgs:
@@ -18,6 +20,10 @@ class VisionEncoderArgs:
     num_attention_heads: int
     rope_theta: float = 1e4  # for rope-2D
     image_token_id: int = 10
+    adapter_bias: bool = True
+    spatial_merge_size: int = 1
+    add_pre_mm_projector_layer_norm: bool = False
+    mm_projector_id: str = ""
 
 
 @dataclass
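
All four new fields default to the previous behavior (bias on, merge size 1, no pre-projector norm, empty projector id), so configs predating this commit load unchanged. A self-contained sketch, not the repo's code, of how the flags gate the optional modules wired up in `Transformer.__init__` below:

```python
# Hypothetical illustration: only the four fields added by this commit are
# modeled here; the real VisionEncoderArgs has more (required) fields.
from dataclasses import dataclass

PATCH_MERGE = "patch_merge"  # same constant as in args.py


@dataclass
class VisionArgsSketch:
    adapter_bias: bool = True                      # bias of the VisionLanguageAdapter
    spatial_merge_size: int = 1                    # side length of the merged patch block
    add_pre_mm_projector_layer_norm: bool = False  # RMSNorm before the projector
    mm_projector_id: str = ""                      # "patch_merge" enables PatchMerger


args = VisionArgsSketch(spatial_merge_size=2, mm_projector_id=PATCH_MERGE)
use_patch_merger = args.mm_projector_id == PATCH_MERGE
assert use_patch_merger and args.spatial_merge_size == 2
```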

src/mistral_inference/main.py

Lines changed: 1 addition & 0 deletions
@@ -161,6 +161,7 @@ def interactive(
             length_tensor = torch.tensor([len(tokens)], dtype=torch.int)
         else:
             length_tensor = torch.tensor([0], dtype=torch.int)
+            images = []
 
         if is_torchrun():
             dist.broadcast(length_tensor, src=0)
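
A one-line fix: the `else` branch previously left `images` undefined, presumably causing an `UnboundLocalError` on the ranks that read no prompt once the multimodal path consumes the variable later in `interactive()`.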

src/mistral_inference/transformer.py

Lines changed: 56 additions & 21 deletions
@@ -9,13 +9,13 @@
 import torch
 from torch import nn
 
-from mistral_inference.args import TransformerArgs
+from mistral_inference.args import PATCH_MERGE, TransformerArgs
 from mistral_inference.cache import BufferCache, CacheInputMetadata
 from mistral_inference.lora import LoRALoaderMixin
 from mistral_inference.model import ModelBase
 from mistral_inference.rope import precompute_freqs_cis
 from mistral_inference.transformer_layers import RMSNorm, TransformerBlock
-from mistral_inference.vision_encoder import VisionLanguageAdapter, VisionTransformer
+from mistral_inference.vision_encoder import PatchMerger, VisionLanguageAdapter, VisionTransformer
 
 
 @dataclass
@@ -58,9 +58,22 @@ def __init__(
 
         self.vision_encoder: Optional[VisionTransformer] = None
         self.vision_language_adapter: Optional[VisionLanguageAdapter] = None
+
         if args.vision_encoder is not None:
             self.vision_encoder = VisionTransformer(args.vision_encoder)
-            self.vision_language_adapter = VisionLanguageAdapter(args.vision_encoder.hidden_size, args.dim)
+            self.vision_language_adapter = VisionLanguageAdapter(
+                args.vision_encoder.hidden_size, args.dim, args.vision_encoder.adapter_bias
+            )
+
+            if args.vision_encoder.add_pre_mm_projector_layer_norm:
+                self.pre_mm_projector_norm = RMSNorm(args.vision_encoder.hidden_size, eps=1e-5)
+
+            if args.vision_encoder.mm_projector_id == PATCH_MERGE:
+                self.patch_merger = PatchMerger(
+                    vision_encoder_dim=args.vision_encoder.hidden_size,
+                    spatial_merge_size=args.vision_encoder.spatial_merge_size,
+                )
+
         if pipeline_rank == num_pipeline_ranks - 1:
             self.norm = RMSNorm(args.dim, eps=args.norm_eps)
             self.output = nn.Linear(args.dim, args.vocab_size, bias=False)
@@ -106,7 +119,7 @@ def freqs_cis(self) -> torch.Tensor:
             self._precomputed_freqs_cis = self._precomputed_freqs_cis.to(device=self.device)
         return self._precomputed_freqs_cis
 
-    def embed_vision_language_features(self, input_ids: torch.Tensor, images: List[torch.tensor]) -> torch.Tensor:  # type: ignore[valid-type]
+    def embed_vision_language_features(self, input_ids: torch.Tensor, images: List[torch.Tensor]) -> torch.Tensor:
         assert self.tok_embeddings is not None
         assert self.vision_encoder is not None
         assert self.vision_language_adapter is not None
@@ -115,16 +128,28 @@ def embed_vision_language_features(self, input_ids: torch.Tensor, images: List[t
         text_locations = input_ids != self.args.vision_encoder.image_token_id
         image_locations = input_ids == self.args.vision_encoder.image_token_id
         text_features = self.tok_embeddings(input_ids[text_locations])
-        image_features = self.vision_language_adapter(self.vision_encoder(images))
 
-        seq_len = input_ids.shape[0]
+        image_features = self.vision_encoder(images)
+
+        if self.args.vision_encoder.add_pre_mm_projector_layer_norm:
+            image_features = self.pre_mm_projector_norm(image_features)
+
+        if self.args.vision_encoder.mm_projector_id == PATCH_MERGE:
+            patch_size = self.args.vision_encoder.patch_size
+            img_patch_dims = [(img.shape[1] // patch_size, img.shape[2] // patch_size) for img in images]
+            image_features = self.patch_merger(image_features, image_sizes=img_patch_dims)
+
+        image_features = self.vision_language_adapter(image_features)
+
         N_txt, D_txt = text_features.shape
         N_img, D_img = image_features.shape
 
+        seq_len = input_ids.shape[0]
+
         assert D_txt == D_img, f"Text features dim {D_txt} should be equal to image features dim {D_img}"
-        assert (
-            seq_len == N_txt + N_img
-        ), f"seq_len {seq_len} should be equal to N_txt + N_img {(N_txt, N_img, image_locations.sum().item())}"
+        assert seq_len == N_txt + N_img, (
+            f"seq_len {seq_len} should be equal to N_txt + N_img {(N_txt, N_img, image_locations.sum().item())}"
+        )
 
         combined_features = torch.empty(
             (seq_len, D_txt),
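
The `PatchMerger` itself lives in `vision_encoder.py` and is not part of this diff. As a rough, self-contained sketch of the technique (a common formulation, not necessarily the repo's exact implementation): each image's patch grid is split into `spatial_merge_size` x `spatial_merge_size` neighborhoods, the embeddings in a neighborhood are concatenated along the channel dimension, and a linear layer maps the result back to the vision width, cutting the image token count by a factor of `spatial_merge_size` squared. This matches the call signature used above, `self.patch_merger(image_features, image_sizes=img_patch_dims)`:

```python
# A standalone sketch of spatial patch merging; assumes each grid dim is
# divisible by the merge size.
import torch
from torch import nn


class PatchMergerSketch(nn.Module):
    """Merge s x s neighborhoods of patch embeddings into single tokens."""

    def __init__(self, vision_encoder_dim: int, spatial_merge_size: int) -> None:
        super().__init__()
        self.s = spatial_merge_size
        # the s*s concatenated patch vectors are projected back to one vector
        self.merging_layer = nn.Linear(
            vision_encoder_dim * self.s**2, vision_encoder_dim, bias=False
        )

    def forward(self, x: torch.Tensor, image_sizes: list[tuple[int, int]]) -> torch.Tensor:
        # x: patch embeddings of all images flattened together, shape (sum(h*w), D);
        # image_sizes: per-image patch-grid dims (h, w)
        out, offset = [], 0
        for h, w in image_sizes:
            grid = x[offset : offset + h * w].view(h, w, -1)
            offset += h * w
            # regroup into (h/s) * (w/s) neighborhoods of s*s patches each
            grid = grid.view(h // self.s, self.s, w // self.s, self.s, -1)
            grid = grid.permute(0, 2, 1, 3, 4).reshape((h // self.s) * (w // self.s), -1)
            out.append(grid)
        return self.merging_layer(torch.cat(out, dim=0))


# e.g. a 4x6 patch grid with D=8 collapses to 6 merged tokens when s=2
merger = PatchMergerSketch(vision_encoder_dim=8, spatial_merge_size=2)
tokens = merger(torch.randn(24, 8), image_sizes=[(4, 6)])
assert tokens.shape == (6, 8)
```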
@@ -147,9 +172,9 @@ def forward_partial(
         If doing pipeline parallelism, this will return the activations of the last layer of this stage.
         For the last stage, this will return the normalized final embeddings.
         """
-        assert (
-            len(seqlens) <= self.args.max_batch_size
-        ), f"Max batch size is {self.args.max_batch_size}, got batch size of {len(seqlens)}"
+        assert len(seqlens) <= self.args.max_batch_size, (
+            f"Max batch size is {self.args.max_batch_size}, got batch size of {len(seqlens)}"
+        )
         (num_toks,) = input_ids.shape
         assert sum(seqlens) == num_toks, (sum(seqlens), num_toks)
 
@@ -251,9 +276,19 @@ def load_state_dict(self, state_dict: Mapping[str, Any], strict: bool = True, as
                     self.pipeline_rank,
                 )
                 skipped.add(k)
-            elif k.startswith("vision_encoder") or k.startswith("vision_language_adapter"):
-                assert not self.pipeline_rank
-                state_to_load[k] = v
+            elif any(
+                k.startswith(key)
+                for key in ["vision_encoder", "vision_language_adapter", "patch_merger", "pre_mm_projector_norm"]
+            ):
+                if self.pipeline_rank == 0:
+                    state_to_load[k] = v
+                else:
+                    logging.debug(
+                        "Skipping parameter %s at pipeline rank %d",
+                        k,
+                        self.pipeline_rank,
+                    )
+                    skipped.add(k)
             else:
                 raise ValueError(f"Unexpected key {k}")
         assert set(state_dict.keys()) == skipped.union(set(state_to_load.keys()))
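
The practical effect: under pipeline parallelism the whole vision stack, now including the optional `patch_merger` and `pre_mm_projector_norm` weights, is loaded on pipeline rank 0 only. Other ranks record those keys as skipped instead of failing the old `assert not self.pipeline_rank`, which also keeps the bookkeeping assert on the last line satisfied.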
@@ -286,12 +321,12 @@ def from_folder(
         pt_model_file = Path(folder) / "consolidated.00.pth"
         safetensors_model_file = Path(folder) / "consolidated.safetensors"
 
-        assert (
-            pt_model_file.exists() or safetensors_model_file.exists()
-        ), f"Make sure either {pt_model_file} or {safetensors_model_file} exists"
-        assert not (
-            pt_model_file.exists() and safetensors_model_file.exists()
-        ), f"Both {pt_model_file} and {safetensors_model_file} cannot exist"
+        assert pt_model_file.exists() or safetensors_model_file.exists(), (
+            f"Make sure either {pt_model_file} or {safetensors_model_file} exists"
+        )
+        assert not (pt_model_file.exists() and safetensors_model_file.exists()), (
+            f"Both {pt_model_file} and {safetensors_model_file} cannot exist"
+        )
 
         if pt_model_file.exists():
             loaded = torch.load(str(pt_model_file), mmap=True)
