Wrong results on SuperCLEVR test #100

llltttwww · 2025-02-15T08:23:18Z

I followed the instructions in README.md and ran its script to evaluate on the SuperCLEVR dataset. But even with the base Qwen2-VL-2B-Instruct, the accuracy is 42.5% rather than 48%. I have no idea why this happens.

Below is part of the log file:
{
  "accuracy": 42.5,
  "results": [
    {
      "question": {
        "image_path": "./images/superCLEVR_new_025000.png",
        "question": "How many different items are there in the image?",
        "ground_truth": 4
      },
      "ground_truth": 4,
      "model_output": "\n There are four distinct objects in the image: a cyan fighter, a red minivan, a cyan mountain bike, and a green mountain bike.\n \n\n 4\n ",
      "extracted_answer": 4
    },
    {
      "question": {
        "image_path": "./images/superCLEVR_new_025001.png",
        "question": "How many different items are there in the image?",
        "ground_truth": 9
      },
      "ground_truth": 9,
      "model_output": "\nA. 10\n\n\n10\n",
      "extracted_answer": 10
    },
    {
      "question": {
        "image_path": "./images/superCLEVR_new_025002.png",
        "question": "How many different items are there in the image?",
        "ground_truth": 10
      },
      "ground_truth": 10,
      "model_output": "\nA. 10\n\n\n10\n",
      "extracted_answer": 10
    },
    {
      "question": {
        "image_path": "./images/superCLEVR_new_025003.png",
        "question": "How many different items are there in the image?",
        "ground_truth": 4
      },
      "ground_truth": 4,
      "model_output": "\nThe image contains four distinct objects: a purple dirt bike, a blue mountain bike, a brown bicycle, and a red tandem bicycle. There are no other objects in the image.\n\n\n\n0\n",
      "extracted_answer": 0
    },
    {
      "question": {
        "image_path": "./images/superCLEVR_new_025004.png",
        "question": "How many different items are there in the image?",
        "ground_truth": 3
      },
      "ground_truth": 3,
      "model_output": "\nTo determine the number of different items in the image, let's analyze each object:\n\n1. There is a yellow car.\n2. There is a blue car.\n3. There is a gray motorcycle.\n\nSo, there are three different items in the image.\n\n\n3",
      "extracted_answer": 3
    },
    ...
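
One way to check whether the gap comes from scoring or answer extraction rather than from the model is to re-score the log offline. The following is a minimal sketch, assuming the log keeps the structure shown above; the helper names `recompute_accuracy` and `mismatches` are hypothetical and not part of the repository, and the records are the five entries from the excerpt with the `question` text and `model_output` fields trimmed.

```python
# Records copied from the log excerpt above, trimmed to the fields used here.
log_excerpt = [
    {"question": {"image_path": "./images/superCLEVR_new_025000.png"},
     "ground_truth": 4, "extracted_answer": 4},
    {"question": {"image_path": "./images/superCLEVR_new_025001.png"},
     "ground_truth": 9, "extracted_answer": 10},
    {"question": {"image_path": "./images/superCLEVR_new_025002.png"},
     "ground_truth": 10, "extracted_answer": 10},
    {"question": {"image_path": "./images/superCLEVR_new_025003.png"},
     "ground_truth": 4, "extracted_answer": 0},
    {"question": {"image_path": "./images/superCLEVR_new_025004.png"},
     "ground_truth": 3, "extracted_answer": 3},
]


def recompute_accuracy(results):
    """Return accuracy in percent: share of records whose extracted answer equals the ground truth."""
    if not results:
        return 0.0
    correct = sum(1 for r in results if r["extracted_answer"] == r["ground_truth"])
    return 100.0 * correct / len(results)


def mismatches(results):
    """Return the records whose extracted answer differs from the ground truth."""
    return [r for r in results if r["extracted_answer"] != r["ground_truth"]]


accuracy = recompute_accuracy(log_excerpt)
wrong = mismatches(log_excerpt)

print(f"accuracy on excerpt: {accuracy:.1f}%")
for r in wrong:
    print(r["question"]["image_path"], "expected", r["ground_truth"], "got", r["extracted_answer"])
```

On these five records the sketch reports 60.0% and flags images 025001 and 025003. The second one is worth a closer look: the reasoning in `model_output` lists four objects, but the final answer the model emits is 0, so the miss comes from the model's own final answer rather than from the extraction step.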

zhiwenhou1227 · 2025-02-18T12:05:30Z

I get the same result, 42.5%. What accuracy do you get after GRPO training?

liyd · 2025-02-19T09:57:08Z

Maybe the reason is that sampling is used during generation. I can't get 48% either.
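
To get a rough sense of how much run-to-run noise sampling could introduce, one can compute the binomial standard error of an accuracy measured on a finite test set. This is only a sketch under stated assumptions: each test item is treated as an independent pass/fail trial, the function name `accuracy_std_error` is hypothetical, and the test-set size `n_samples = 200` is a placeholder that should be replaced with the actual number of evaluated items.

```python
import math


def accuracy_std_error(accuracy_pct, n_samples):
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_samples)


# Hypothetical test-set size; replace with the real number of evaluated items.
n_samples = 200

for acc in (42.5, 48.0):
    se = accuracy_std_error(acc, n_samples)
    print(f"accuracy {acc:.1f}% on {n_samples} items: standard error about {se:.1f} points")
```

If the standard error is a few points, a difference between 42.5% and 48% is not large compared with sampling noise alone. Evaluating with greedy decoding, or averaging over several seeds, would make the comparison more reliable.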
