Hi, I am interested in this paper and have some questions. (1) How did you obtain a score such as "the hallucination tokens are overly dependent (0.27) on previous language tokens, and later tokens are all hallucinations" in Fig. 1(a)? (2) The training goal is that the model's output should stay the same before and after adding noise to the image tokens. Since the text tokens are identical in both situations, why does the model learn to increase its attention to the image tokens rather than to the text tokens?
Also, will the code be released soon?
Thanks!
For the first question, the conclusion is more empirical than a theoretical proof. We inspected many erroneous long captions, averaged the attention maps over many layers to visualize the dependencies, and found that the anchor token exists in most of the cases.
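A minimal sketch of that kind of measurement (not the paper's released code; the tensor shapes and names here are assumptions based on a standard HuggingFace-style `output_attentions=True` output):

```python
import torch

def language_dependency(attentions, image_token_mask, query_pos):
    """attentions: list of per-layer tensors of shape [num_heads, seq_len, seq_len].
    image_token_mask: bool tensor [seq_len], True where the position is an image token.
    query_pos: index of the generated token whose dependencies we inspect."""
    # Average over layers and heads -> [seq_len, seq_len]
    avg_attn = torch.stack([a.mean(dim=0) for a in attentions]).mean(dim=0)
    row = avg_attn[query_pos]                 # attention paid by this token to every position
    lang_mask = ~image_token_mask
    lang_mask[query_pos:] = False             # only *previous* language tokens
    lang_score = row[lang_mask].sum().item()  # e.g. a value like 0.27 in Fig. 1(a)
    img_score = row[image_token_mask].sum().item()
    return lang_score, img_score
```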
For the second question, the intuition is to make the model suffer. If the image were changing in front of us and we wanted to give a consistent answer, what would a human do? We would look at the image more closely and put more attention on it.
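A rough sketch of that consistency idea (hypothetical names and a hypothetical `model(image_embeds=..., input_ids=...)` signature, not the actual training code): the same text tokens are fed twice, once with clean and once with noised image embeddings, and the two output distributions are pushed to match, which only the image-dependent pathway can satisfy.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, image_embeds, text_ids, noise_std=0.1):
    # Clean forward pass acts as the reference (no gradient through it).
    with torch.no_grad():
        clean_logits = model(image_embeds=image_embeds, input_ids=text_ids).logits
    # Perturb only the image tokens; the text tokens stay identical.
    noisy_embeds = image_embeds + noise_std * torch.randn_like(image_embeds)
    noisy_logits = model(image_embeds=noisy_embeds, input_ids=text_ids).logits
    # Matching the two distributions pushes the model to rely on the image
    # content itself rather than the unchanged text context.
    return F.kl_div(
        F.log_softmax(noisy_logits, dim=-1),
        F.softmax(clean_logits, dim=-1),
        reduction="batchmean",
    )
```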
For the code release, I'm sorry for the delay. We submitted to NIPS but received many reviews asking for more recent evaluation benchmarks and for testing on stronger models.
Since I didn't get much feedback after putting our paper on arXiv, I plan to take it slow and add Qwen2VL and InternVL2.5 as SOTA models. But since they didn't release their data, surpassing their results is not easy. I would say 3-6 months; maybe we will submit to another NIPS 😭