wrong results on SuperCLEVER test #100

Open
llltttwww opened this issue Feb 15, 2025 · 2 comments


@llltttwww

I followed the instructions in README.md and ran the provided script to evaluate on the SuperCLEVR dataset. But even with the base Qwen2-VL-2B-Instruct, the accuracy is 42.5%, not 48%. I have no idea why this happens.

Below is part of the log file:
{
"accuracy": 42.5,
"results": [
{
"question": {
"image_path": "./images/superCLEVR_new_025000.png",
"question": "How many different items are there in the image?",
"ground_truth": 4
},
"ground_truth": 4,
"model_output": "\n There are four distinct objects in the image: a cyan fighter, a red minivan, a cyan mountain bike, and a green mountain bike.\n \n\n 4\n ",
"extracted_answer": 4
},
{
"question": {
"image_path": "./images/superCLEVR_new_025001.png",
"question": "How many different items are there in the image?",
"ground_truth": 9
},
"ground_truth": 9,
"model_output": "\nA. 10\n\n\n10\n",
"extracted_answer": 10
},
{
"question": {
"image_path": "./images/superCLEVR_new_025002.png",
"question": "How many different items are there in the image?",
"ground_truth": 10
},
"ground_truth": 10,
"model_output": "\nA. 10\n\n\n10\n",
"extracted_answer": 10
},
{
"question": {
"image_path": "./images/superCLEVR_new_025003.png",
"question": "How many different items are there in the image?",
"ground_truth": 4
},
"ground_truth": 4,
"model_output": "\nThe image contains four distinct objects: a purple dirt bike, a blue mountain bike, a brown bicycle, and a red tandem bicycle. There are no other objects in the image.\n\n\n\n0\n",
"extracted_answer": 0
},
{
"question": {
"image_path": "./images/superCLEVR_new_025004.png",
"question": "How many different items are there in the image?",
"ground_truth": 3
},
"ground_truth": 3,
"model_output": "\nTo determine the number of different items in the image, let's analyze each object:\n\n1. There is a yellow car.\n2. There is a blue car.\n3. There is a gray motorcycle.\n\nSo, there are three different items in the image.\n\n\n3",
"extracted_answer": 3
},
...
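For anyone comparing runs, here is a minimal sketch that recomputes the accuracy from a log with this structure and lists the entries where `extracted_answer` differs from `ground_truth`. The helper names `load_results` and `summarize` are hypothetical, and the sample data is just the five entries from the excerpt above, not the full log.

```python
import json


def load_results(path):
    """Load the "results" list from a log file shaped like the excerpt (path is hypothetical)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["results"]


def summarize(results):
    """Return (accuracy in percent, list of entries whose extracted answer is wrong)."""
    wrong = [r for r in results if r["extracted_answer"] != r["ground_truth"]]
    accuracy = 100.0 * (len(results) - len(wrong)) / len(results)
    return accuracy, wrong


# The five entries shown in the excerpt, reduced to the fields used here.
sample = [
    {"question": {"image_path": "./images/superCLEVR_new_025000.png"}, "ground_truth": 4, "extracted_answer": 4},
    {"question": {"image_path": "./images/superCLEVR_new_025001.png"}, "ground_truth": 9, "extracted_answer": 10},
    {"question": {"image_path": "./images/superCLEVR_new_025002.png"}, "ground_truth": 10, "extracted_answer": 10},
    {"question": {"image_path": "./images/superCLEVR_new_025003.png"}, "ground_truth": 4, "extracted_answer": 0},
    {"question": {"image_path": "./images/superCLEVR_new_025004.png"}, "ground_truth": 3, "extracted_answer": 3},
]

accuracy, wrong = summarize(sample)
print(f"accuracy on the excerpt: {accuracy:.1f}%")
for r in wrong:
    print(r["question"]["image_path"], "expected", r["ground_truth"], "got", r["extracted_answer"])
```

On the excerpt, the entry for `superCLEVR_new_025003.png` is worth a look: the model's text says four objects, but the extracted answer is 0, so some errors may come from the final answer rather than the reasoning.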

@zhiwenhou1227


I get the same result, 42.5%. What accuracy do you get after GRPO training?

@liyd

liyd commented Feb 19, 2025

Maybe the reason is sampling during generation. I can't get 48% either.
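As a rough way to judge how much run-to-run noise to expect, here is a minimal sketch of a normal-approximation 95% interval around a measured accuracy. The test-set size `n=200` is an assumption for illustration (the thread does not state it), and the normal approximation is just one simple choice, not something taken from the repository.

```python
import math


def accuracy_interval(acc_percent, n, z=1.96):
    """Normal-approximation confidence interval (in percent) for an accuracy measured on n samples."""
    p = acc_percent / 100.0
    se = math.sqrt(p * (1.0 - p) / n)  # standard error of a binomial proportion
    return 100.0 * (p - z * se), 100.0 * (p + z * se)


# n=200 is an assumed test-set size, not a value confirmed in this thread.
low, high = accuracy_interval(42.5, n=200)
print(f"95% interval around 42.5%: {low:.1f}% to {high:.1f}%")
```

If the interval under the real test-set size covers 48%, the gap could plausibly be sampling noise. Rerunning with greedy decoding or a fixed seed would be a more direct check.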
