
Questions about the paper #1

Open
sev777 opened this issue Dec 10, 2024 · 1 comment

Comments

@sev777

sev777 commented Dec 10, 2024

Hi, I am interested in this paper and have some questions. (1) How did you obtain scores like "the hallucination tokens are overly dependent (0.27) on previous language tokens, and later tokens are all hallucinations" in Fig. 1(a)? (2) The training goal is that the model's output should stay the same before and after adding noise to the image tokens. Since the text tokens are identical in both situations, why would the model increase its attention to the image tokens rather than to the text tokens?
Also, will the code be released soon?

Thanks!

@KaiWU5
Owner

KaiWU5 commented Dec 10, 2024

For the first question, the conclusion is empirical rather than a theoretical proof. We examined many erroneous long captions, averaged the attention maps across many layers to visualize token dependencies, and found that an anchor token exists in most cases.
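The averaging described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual analysis code; it assumes attention maps in the format returned by a Hugging Face model called with `output_attentions=True`, and a hypothetical boolean mask marking which sequence positions are image tokens.

```python
import torch

def image_attention_share(attentions, image_token_mask):
    """Average attention over layers and heads, then compute, for each
    query position, the share of attention mass placed on image tokens.

    attentions: list of [batch, heads, seq, seq] tensors, one per layer.
    image_token_mask: bool tensor [seq], True at image-token positions.
    """
    # Stack layers -> [layers, batch, heads, seq, seq], then average
    # over layers and heads -> [batch, seq, seq].
    avg = torch.stack(attentions).mean(dim=(0, 2))
    # Each attention row sums to 1, so summing over the image columns
    # gives the fraction of attention each token pays to the image.
    return avg[..., image_token_mask].sum(dim=-1)  # [batch, seq]
```

A token whose share here is low (e.g. 0.27 on image tokens, the rest on previous language tokens) would be flagged as overly text-dependent in the sense discussed in the paper.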
For the second question, the intuition is to make the model work harder. Consider an image that keeps changing in front of us: if we wanted to produce a consistent answer, what would a human do? We would look at the image more closely and pay more attention to it.
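One common way to realize this intuition is a consistency objective between the clean and noise-perturbed branches. The sketch below is an assumption about the setup, not the paper's exact loss; `model_logits_fn` is a hypothetical stand-in for whatever maps image embeddings and text ids to next-token logits.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model_logits_fn, image_embeds, text_ids, noise_std=0.1):
    """KL consistency between outputs for clean and noised image embeddings.

    model_logits_fn(image_embeds, text_ids) -> [batch, seq, vocab] logits.
    The text tokens are identical in both branches; only the image
    embeddings differ, so minimizing this loss pushes the model to rely
    on (and attend to) the image signal robustly.
    """
    clean = model_logits_fn(image_embeds, text_ids)
    noised_embeds = image_embeds + noise_std * torch.randn_like(image_embeds)
    noised = model_logits_fn(noised_embeds, text_ids)
    # Match the noised-branch output distribution to the clean branch.
    return F.kl_div(F.log_softmax(noised, dim=-1),
                    F.softmax(clean, dim=-1),
                    reduction="batchmean")
```

Because the text inputs are unchanged across the two branches, the only lever the model has for keeping its output stable is how it grounds that output in the image tokens, which is why the attention shifts there rather than to the text.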

As for the code release, I'm sorry for the delay. We submitted to NeurIPS but received many reviews asking us to add more recent evaluation benchmarks and to test on stronger models.
Since I didn't get much feedback after putting our paper on arXiv, I plan to take it slow and add Qwen2-VL and InternVL2.5 as SOTA models. But since they didn't release their data, surpassing their results is not easy. I'd say 3-6 months; maybe we'll submit to another NeurIPS 😭
