Replies: 5 comments 1 reply
| Thank you for looking into this, very much appreciated! @ngxson should be able to give you some insights into  | 
| 
 I think our implementation of Qwen 2.5 VL has been subtly broken in some way (#13694). Needs a detailed investigation of where numerical results start to diverge. | 
| I've done a sanity check using the official  
llama.cpp qwen25vl
First 10 ViT values: -1.080078 -1.655273 1.696289 -0.018097 1.216431 0.605103 -3.717308 0.730347 -4.380524 2.823242
hf / pytorch qwen25vl
First 10 ViT values: -0.9140625 -1.625 1.3828125 -0.32226562 1.0625 0.49609375 -3.5625 0.40039062 -4.375 2.984375
The results from ViT show very similar values to what we see in jina-embeddings-v4, and very similar discrepancies between llama.cpp and pytorch implementation 🤔 not sure if it helps in any way but at least I can confirm it's not limited to jina-embeddings-v4 ... | 
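To put a number on how far apart the two dumps above are, a minimal sketch like the following can help. It uses only the ten values quoted in this comment, so the result is indicative at best; the helper names are made up for illustration and are not part of llama.cpp or Transformers.

```python
import math

# First 10 ViT values quoted in the comment above.
llama_cpp = [-1.080078, -1.655273, 1.696289, -0.018097, 1.216431,
             0.605103, -3.717308, 0.730347, -4.380524, 2.823242]
pytorch = [-0.9140625, -1.625, 1.3828125, -0.32226562, 1.0625,
           0.49609375, -3.5625, 0.40039062, -4.375, 2.984375]


def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Return (largest absolute difference, index where it occurs)."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    worst = max(range(len(diffs)), key=diffs.__getitem__)
    return diffs[worst], worst


cos = cosine_similarity(llama_cpp, pytorch)
diff, idx = max_abs_diff(llama_cpp, pytorch)
print(f"cosine similarity: {cos:.4f}")
print(f"max abs diff: {diff:.4f} at index {idx}")
```

Running the same two metrics on the full encoder output, rather than a ten-value prefix, would give a fairer picture of the mismatch.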
| We continued debugging the vision tower implementation layer by layer, and we noticed that the inconsistencies become drastic at the very first self-attention layer. We then debugged RoPE for Q and K and again noticed inconsistencies.
Example attention outputs from llama.cpp:
Example attention outputs from PyTorch:
Example K before and after RoPE, llama.cpp:
Same for PyTorch:
At this point we are not sure how to proceed: 
 Thoughts? I think the K values (and the V values, though not attached) match relatively well BEFORE RoPE, but we can see big differences after. This happens quite early, and the error grows considerably by the time we reach the last layer of the vision tower. | 
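The layer-by-layer comparison described above can be automated once both runtimes dump the same intermediate tensors. The following is only a sketch under assumptions: the tensor names (`patch_embd`, `k_pre_rope`, `k_post_rope`) and the toy values are hypothetical, and real dumps would have to be flattened in the same element order on both sides.

```python
import math


def relative_l2_error(test, ref):
    """||test - ref|| / ||ref|| for two equally sized flat lists of floats."""
    if len(test) != len(ref):
        raise ValueError("tensors must have the same number of elements")
    num = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    den = math.sqrt(sum(r * r for r in ref))
    if den > 0:
        return num / den
    return float("inf") if num > 0 else 0.0


def first_divergence(test_dump, ref_dump, order, tol=1e-2):
    """Return (name, error) for the first tensor in `order` whose relative
    error exceeds `tol`, or None if every tensor stays within tolerance."""
    for name in order:
        err = relative_l2_error(test_dump[name], ref_dump[name])
        if err > tol:
            return name, err
    return None


# Toy data with hypothetical tensor names, listed in execution order.
order = ["patch_embd", "k_pre_rope", "k_post_rope"]
ref = {
    "patch_embd": [1.0, 2.0, 3.0],
    "k_pre_rope": [0.5, -0.5, 1.0],
    "k_post_rope": [0.2, 0.9, -1.0],
}
test = {
    "patch_embd": [1.0, 2.0, 3.0],
    "k_pre_rope": [0.5, -0.5, 1.001],  # small numerical noise, within tol
    "k_post_rope": [0.9, 0.2, -1.0],   # clearly different
}

result = first_divergence(test, ref, order)
print(result)
```

The tolerance is a judgment call: half-precision runs will usually need a looser threshold than float32 ones.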
| After further testing, we found that RoPE was not necessarily the problem. We noticed several differences in patch creation, patch projection, patch gathering, and the cross attention in the LLM (the attention mask). Tests that work around these issues have yielded embeddings comparable to what our HF model produces. I'll clean up my code and ping here in case these fixes can be adapted to the main repo (potentially including the multimodal embeddings). | 
Hey folks!
I'm working on getting multimodal embeddings working with jina-embeddings-v4 (based on Qwen 2.5 VL) through llama.cpp server.
I've hit an issue with mtmd inconsistencies and was hoping someone might have insights on this, or suggestions on how to proceed.
What I'm trying to do
I'm implementing token-level embeddings (no pooling) for a retrieval system using jina-embeddings-v4.
This model is based on Qwen 2.5 VL but was further trained for embedding tasks (supports both text and image).
To get it working in llama.cpp, we merged the jina-embeddings-v4 weights back into the Qwen 2.5 VL architecture.
The setup uses llama.cpp server, processing prompts like <|im_start|>user\n<__image__>Describe the image.<|im_end|>\n.
On the llama.cpp side, we're not applying any pooling or normalization - just extracting the raw token embeddings.
The issue
Here's what's got me scratching my head: everything seems to work perfectly until the vision encoder kicks in.
What's working:
Where it diverges:
Right after mtmd_encode_chunk() runs, the vision encoder outputs start differing significantly from what I get with the Python/HuggingFace implementation.
llama.cpp vision encoder output:
python reference:
They're in the same ballpark but consistently different across all values.
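One way to probe whether "consistently different" means a systematic scale or offset, rather than scattered error, is to fit a straight line between the two outputs. This is a sketch, not something taken from the original debugging session: the function name is made up, and the example data is synthetic rather than real encoder output.

```python
def fit_scale_offset(test, ref):
    """Least-squares fit ref ≈ scale * test + offset.

    Returns (scale, offset, rms_residual). A small residual with a scale far
    from 1 or an offset far from 0 points to a systematic discrepancy.
    """
    n = len(test)
    if n != len(ref) or n < 2:
        raise ValueError("need two equally sized sequences with at least 2 values")
    mean_t = sum(test) / n
    mean_r = sum(ref) / n
    var_t = sum((t - mean_t) ** 2 for t in test)
    if var_t == 0:
        raise ValueError("test values are constant; the slope is undefined")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in zip(test, ref))
    scale = cov / var_t
    offset = mean_r - scale * mean_t
    rms = (sum((scale * t + offset - r) ** 2 for t, r in zip(test, ref)) / n) ** 0.5
    return scale, offset, rms


# Synthetic example: the "reference" is exactly 2 * test + 1.
test_vals = [0.0, 1.0, 2.0, 3.0]
ref_vals = [2.0 * t + 1.0 for t in test_vals]
scale, offset, rms = fit_scale_offset(test_vals, ref_vals)
print(f"scale={scale:.3f} offset={offset:.3f} rms residual={rms:.3e}")
```

If the residual stays large regardless of scale and offset, the difference is unlikely to be a simple normalization mismatch and is more likely structural.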
Debugging steps taken
I added some debug output to mtmd_helper_eval_chunk_single():
The processing flow:
process_chunk()
The fact that text embeddings match perfectly makes me think this isn't a fundamental model loading or quantization issue.
And since the image embeddings respond to different text contexts, the attention mechanism seems to be working.
What I'm wondering:
I'm happy to run more tests or provide additional debugging info if that would help figure this out.
Really appreciate any insights you might have!
My implementation changes
I've modified the update_slots() function to capture embeddings during multimodal processing.
Here are the key changes I made:
Image processing section (around the LLAMA_TOKEN_NULL check):
Pre-image text embedding capture (in the batch processing loop):
I also modified send_embedding() to assemble the complete multimodal embedding sequence by combining pre-image text embeddings, image embeddings, and post-image text embeddings.
Environment details:
Let me know if there's any other info that would be useful!
Our code can be found here if anyone wants to take a look: https://github.com/jina-ai/llama.cpp
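For readers who want to reproduce the assembly step described for send_embedding() on the reference side, the idea can be sketched in a few lines of Python. This is an illustration only, not the C++ code from the fork: the function name and the toy shapes are assumptions, and each embedding is modeled as a list of per-token rows.

```python
def assemble_embeddings(pre_text, image, post_text):
    """Concatenate per-token embedding rows in prompt order:
    pre-image text tokens, then image tokens, then post-image text tokens.

    Each argument is a list of rows; every row must have the same width.
    """
    rows = [*pre_text, *image, *post_text]
    if not rows:
        return []
    dim = len(rows[0])
    for i, row in enumerate(rows):
        if len(row) != dim:
            raise ValueError(f"row {i} has width {len(row)}, expected {dim}")
    # Copy rows so the result does not alias the inputs.
    return [list(row) for row in rows]


# Toy example: 2 text tokens, 3 image tokens, 1 text token, width 4.
pre = [[0.1] * 4, [0.2] * 4]
img = [[1.0] * 4, [1.1] * 4, [1.2] * 4]
post = [[0.3] * 4]

full = assemble_embeddings(pre, img, post)
print(len(full), "tokens of width", len(full[0]))
```

Keeping the three segments in prompt order matters for token-level retrieval, since each row is later matched against a specific position in the sequence.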