Hi, I am interested in this paper and have some questions. (1) How did you obtain a score such as "the hallucination tokens are overly dependent (0.27) on previous language tokens, and later tokens are all hallucinations" in Fig. 1(a)? (2) The training goal is that the model's output should stay the same before and after adding noise to the image tokens. Since the text tokens are identical in both situations, why does the model learn to increase its attention to the image tokens rather than to the text tokens?
Also, will the code be released soon?
Thanks!
For the first question, the conclusion is more empirical than a theoretical proof. We inspected many erroneous long captions, averaged the attention maps over many layers to visualize the dependencies, and found that the anchor token exists in most of the cases.
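A minimal sketch of that kind of measurement (not the paper's released code; the tensor shapes and names here are assumptions based on a standard HuggingFace-style `output_attentions=True` output):

```python
import torch

def language_dependency(attentions, image_token_mask, query_pos):
    """attentions: list of per-layer tensors of shape [num_heads, seq_len, seq_len].
    image_token_mask: bool tensor [seq_len], True where the position is an image token.
    query_pos: index of the generated token whose dependencies we inspect."""
    # Average over layers and heads -> [seq_len, seq_len]
    avg_attn = torch.stack([a.mean(dim=0) for a in attentions]).mean(dim=0)
    row = avg_attn[query_pos]                 # attention paid by this token to every position
    lang_mask = ~image_token_mask
    lang_mask[query_pos:] = False             # only *previous* language tokens
    lang_score = row[lang_mask].sum().item()  # e.g. a value like 0.27 in Fig. 1(a)
    img_score = row[image_token_mask].sum().item()
    return lang_score, img_score
```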
For the second question, the intuition is to make the model suffer. If the image were changing in front of us and we wanted to give a consistent answer, what would a human do? We would look at the image more closely and put more attention on it.
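A rough sketch of that consistency idea (hypothetical names and a hypothetical `model(image_embeds=..., input_ids=...)` signature, not the actual training code): the same text tokens are fed twice, once with clean and once with noised image embeddings, and the two output distributions are pushed to match, which only the image-dependent pathway can satisfy.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, image_embeds, text_ids, noise_std=0.1):
    # Clean forward pass acts as the reference (no gradient through it).
    with torch.no_grad():
        clean_logits = model(image_embeds=image_embeds, input_ids=text_ids).logits
    # Perturb only the image tokens; the text tokens stay identical.
    noisy_embeds = image_embeds + noise_std * torch.randn_like(image_embeds)
    noisy_logits = model(image_embeds=noisy_embeds, input_ids=text_ids).logits
    # Matching the two distributions pushes the model to rely on the image
    # content itself rather than the unchanged text context.
    return F.kl_div(
        F.log_softmax(noisy_logits, dim=-1),
        F.softmax(clean_logits, dim=-1),
        reduction="batchmean",
    )
```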
For the code release, I'm sorry for the delay. We submitted to NIPS but received many reviews asking for more recent evaluation benchmarks and for testing on stronger models.
Since I didn't get much feedback after putting our paper on arXiv, I plan to take it slow and add Qwen2VL and InternVL2.5 as SOTA models. But since they didn't release their data, surpassing their results is not easy. I would say 3-6 months; maybe we will submit to another NIPS 😭