🖼️ MULTI-Benchmark: Multimodal Understanding Leaderboard with Text and Images

🌐 Website | 📃 Paper | 🤗 Dataset | 🏆 Leaderboard | 📮 Submit

🔥 News

[2025.10.16] We have released the ground truth answers for all questions in MULTI, because several models have surpassed the human expert baseline. You can now run the evaluation and compute the final scores locally.
[2025.9.28] MULTI is now available online at https://doi.org/10.1007/s11432-024-4602-x.
[2025.6.22] MULTI has been accepted by Science China Information Sciences (Special Topic on Large Multimodal Models).
[2025.1.7] We have updated our leaderboard with the latest results.
[2025.1.2] We have updated MULTI to v1.3.1.
[2024.3.4] We have released the evaluation page (no longer maintained).
[2024.2.19] We have released the HuggingFace Page.
[2024.2.6] We have published our paper on arXiv.
[2023.12.7] We have released the code of our benchmark evaluation.
[2023.12.5] We have released the GitHub Page.

📖 Overview

The rapid development of multimodal large language models (MLLMs) raises the question of how they compare to human performance. While existing datasets often feature synthetic or overly simplistic tasks, some models have already surpassed human expert baselines. In this paper, we present MULTI, a Chinese multimodal dataset derived from authentic examination questions. Comprising over 18,000 carefully selected and refined questions, MULTI evaluates models using real-world examination standards, encompassing image-text comprehension, complex reasoning, and knowledge recall. We also introduce MULTI-Elite, a hard subset of 500 selected questions, and MULTI-Extend, which provides more than 4,500 pieces of external knowledge context for testing in-context learning capabilities. MULTI not only serves as a robust evaluation platform but also paves the way for the development of expert-level AI.

⏬ Download

You can simply download data using the following command:

cd eval
python download_data.py

Alternatively, download the zip file from the Hugging Face repository and unzip it.

The structure of ./data should be something like:

./data
├── images                                       # folder containing images
├── problem_v1.3.1_20241210.json                 # MULTI (with answers)
├── problem_v1.3.1_20241210_release.json         # MULTI
├── knowledge_v1.2.2_20240212_release.json       # MULTI-Extend
├── hard_list_v1.3.0_20241203.json               # MULTI-Elite
├── captions_v1.3.1_20241210_blip.csv            # image captions generated by BLIP-6.7B
├── captions_v1.3.1_20241210_points.csv          # image captions generated by POINTS-1-5
├── ocr_v1.3.1_20241210_easyocr.csv              # OCR data generated by EasyOCR
└── ocr_v1.3.1_20241210_points.csv               # OCR data generated by POINTS-1-5
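
After downloading, it can help to confirm that nothing is missing before starting a long evaluation run. The following is a minimal sketch, not part of the official code: the helper name find_missing_entries is hypothetical, and the file names are simply copied from the listing above, so adjust them if you use a different data version.

```python
from pathlib import Path

# Names copied from the directory listing above; change them if your
# download uses a different version of the data.
EXPECTED_ENTRIES = [
    "images",
    "problem_v1.3.1_20241210.json",
    "problem_v1.3.1_20241210_release.json",
    "knowledge_v1.2.2_20240212_release.json",
    "hard_list_v1.3.0_20241203.json",
    "captions_v1.3.1_20241210_blip.csv",
    "captions_v1.3.1_20241210_points.csv",
    "ocr_v1.3.1_20241210_easyocr.csv",
    "ocr_v1.3.1_20241210_points.csv",
]


def find_missing_entries(data_dir):
    """Return the expected files/folders that do not exist under data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = find_missing_entries("./data")
    if missing:
        print("Missing from ./data:", ", ".join(missing))
    else:
        print("All expected files are present.")
```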

📝 How to Evaluate

We provide a unified evaluation framework in eval. Each file in eval/models contains an evaluator specific to one M/LLM and implements a generate_answer method that takes a question as input and returns the model's answer.

cd eval
python eval.py -h # to list all supported arguments
python eval.py -l # to list all supported models

Environment Preparation Before Usage

Each evaluator requires its own environment, and a single universal environment may not work for all evaluators. Follow the official setup guide for each model: if the model runs correctly on its own, it should also work in our framework.

You only need to install a few additional packages to run the evaluation code:

pip install tiktoken tqdm rouge_chinese jieba matplotlib

If you only want to generate data for a specific setting (using the --debug argument), the line above is all you need.

Running Evaluation

For a quick start, see these examples:

Test the GPT-4o model on the whole of MULTI with multimodal input, using MULTI-Extend as external knowledge:

python eval.py \
  --problem_file ../data/problem_v1.3.1_20241210_release.json \
  --knowledge_file ../data/knowledge_v1.2.2_20240212_release.json \
  --questions_type 0,1,2,3 \
  --image_type 0,1,2 \
  --input_type 2 \
  --model gpt-4o \
  --model_version gpt-4o-latest \
  --api_key sk-************************************************

Test the Qwen-VL model on MULTI-Elite with image caption input, skipping all questions that contain no images, evaluating only multiple-choice questions, and setting the CUDA device automatically:

python eval.py \
  --problem_file ../data/problem_v1.3.1_20241210_release.json \
  --subset ../data/hard_list_v1.3.0_20241203.json \
  --caption_file ../data/captions_v1.3.1_20241210_points.csv \
  --questions_type 0,1 \
  --image_type 1,2 \
  --input_type 1 \
  --model qwen-vl \
  --model_dir ../models/Qwen-VL-Chat

The evaluation script creates a folder named results under the root directory and saves the results in ../results/{EXPERIMENT_NAME}. During the evaluation, the script saves checkpoints in ../results/{EXPERIMENT_NAME}/checkpoints; you can delete them once the evaluation is done. If the evaluation is interrupted, you can resume from the last checkpoint:

python eval.py \
  --checkpoint_dir ../results/{EXPERIMENT_NAME}

Most of the arguments are saved in ../results/{EXPERIMENT_NAME}/args.json, so you can continue the evaluation without specifying all the arguments again. Please note that --api_key is not saved in args.json for security reasons, so you need to specify it again.

python eval.py \
  --checkpoint_dir ../results/{EXPERIMENT_NAME} \
  --api_key sk-************************************************
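
The idea of saving arguments without the secret, then supplying the secret again on resume, can be illustrated with a small sketch. This is only an illustration under stated assumptions and not the code used in eval.py: the names SECRET_KEYS, serializable_args, and merge_resumed_args are hypothetical.

```python
# Hypothetical set of argument names that must never be written to disk.
SECRET_KEYS = {"api_key"}


def serializable_args(args: dict) -> dict:
    """Return a copy of args that is safe to save, with secrets removed."""
    return {key: value for key, value in args.items() if key not in SECRET_KEYS}


def merge_resumed_args(saved: dict, cli_overrides: dict) -> dict:
    """Combine saved arguments with values given again on the command line.

    Command-line values that are None (not provided) do not override
    the saved ones.
    """
    merged = dict(saved)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged
```

With this pattern, a resumed run only needs the checkpoint folder plus any secrets, which mirrors the command shown above.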

For more details on the arguments, run python eval.py -h and refer to args.py and eval.py.

You can directly use the standard answers we provide to score the answer sheets:

python metrics.py \
  --label_file ../data/problem_v1.3.1_20241210.json \
  --detail \
  --answer_position end \
  --prediction_file ../results/{EXPERIMENT_NAME}/prediction.json

You will see the final scoring data in ../results/{EXPERIMENT_NAME}.

Add Support for Your Models

It's recommended to read the code of the other given evaluators in eval/models before your implementation.

Create a class YourModelEvaluator and implement generate_answer(self, question: dict) so that it matches the design supported in eval.py and eval.sh. Following this design should make the implementation considerably easier.
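
As a rough starting point, an evaluator could look like the sketch below. Only the class name and the generate_answer(self, question: dict) signature come from the text above; the constructor argument call_model, the "prompt" key, and the string return type are assumptions made for illustration, so check the existing evaluators in eval/models for the real question fields and return format.

```python
from typing import Callable


class YourModelEvaluator:
    """Minimal evaluator sketch that wraps an arbitrary text-in/text-out model."""

    def __init__(self, call_model: Callable[[str], str]):
        # call_model is a hypothetical function that sends a prompt to your
        # model and returns its raw text output.
        self.call_model = call_model

    def generate_answer(self, question: dict) -> str:
        # "prompt" is a hypothetical key; the actual question dict used by
        # eval.py may be structured differently.
        prompt = str(question.get("prompt", ""))
        raw_output = self.call_model(prompt)
        # Trim surrounding whitespace so that answers compare cleanly.
        return raw_output.strip()
```

Passing the model call in as a function keeps the sketch testable without loading any weights: for example, YourModelEvaluator(lambda prompt: "A") always answers "A".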

Do not forget to add a reference to your new evaluator in args.py so that it is convenient to use.

You can run model_tester.py in the eval folder to check the correctness of your implementation. Various problems, including implementation errors, small bugs in the code, and even wrong environment settings, may cause the evaluation to fail. The examples provided in the file cover most kinds of cases present in our benchmark. Feel free to change the code in it to debug your code😊

python model_tester.py <args> # args are similar to the default settings above

Create Captions and OCR Data for Images

Generate captions or OCR data for the images, and save them in a CSV file with the format below:

../data/images/czls/502_1.png,a cartoon drawing of a man standing in front of a large block
../data/images/czls/525_1.png,a chinese newspaper with the headline, china's new year
...
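
Note that the second example line contains a comma inside the caption itself. The sketch below shows one way to read such lines, assuming that the first comma separates the image path from the text and that the file has no header row. Both points are assumptions drawn from the example above, and the function names are hypothetical; the official scripts may parse the file differently.

```python
def parse_caption_line(line):
    """Split 'image_path,text' on the first comma only.

    Splitting only once keeps any commas inside the caption or OCR text.
    """
    path, _, text = line.rstrip("\n").partition(",")
    return path, text.strip()


def load_captions(lines):
    """Build a {image_path: text} mapping from an iterable of CSV lines."""
    captions = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        path, text = parse_caption_line(line)
        captions[path] = text
    return captions
```

To load a real file, pass an open file handle, for example load_captions(open(csv_path, encoding="utf-8")), where csv_path points to one of the caption or OCR files.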

We provide two example scripts to generate captions (image_caption.py) and OCR data (image_ocr.py) for images.

📮 How to Submit

You can now run the evaluation directly on your own machine (see How to Evaluate).

You need to first prepare a UTF-8 encoded JSON file with the following format:

{
    "czsx_0_0": {
        "question_id": "czsx_0_0",
        "question_image_number": 1,
        "image_list": [...],            # optional
        "input_message": ...,           # optional
        "prediction": "C"
    },
    ...
}
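
If you produce predictions with your own pipeline, a file in this shape can be written as in the sketch below. The field names come from the format above, while the helper names build_prediction_entry and save_predictions are hypothetical. The optional fields image_list and input_message are left out here.

```python
import json


def build_prediction_entry(question_id, prediction, question_image_number):
    """Create one entry with the required fields of the format above."""
    return {
        "question_id": question_id,
        "question_image_number": question_image_number,
        "prediction": prediction,
    }


def save_predictions(entries, path):
    """Write entries keyed by question_id to a UTF-8 encoded JSON file."""
    keyed = {entry["question_id"]: entry for entry in entries}
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable instead of
        # escaping it as \uXXXX sequences.
        json.dump(keyed, f, ensure_ascii=False, indent=4)
    return keyed
```

For example, save_predictions([build_prediction_entry("czsx_0_0", "C", 1)], "prediction.json") writes a file containing the single entry shown above.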

If you evaluate the model with our official code, you can simply compress the prediction file prediction.json and the configuration file args.json from the experiment results folder ./results/{EXPERIMENT_NAME} into a .zip archive.
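
The packaging step can also be scripted. The sketch below uses Python's standard zipfile module. The function name zip_submission is hypothetical, and placing both files at the top level of the archive is an assumption, since the expected layout is not specified here.

```python
import zipfile
from pathlib import Path


def zip_submission(results_dir, archive_path):
    """Pack prediction.json and args.json from results_dir into a .zip file."""
    results_dir = Path(results_dir)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in ("prediction.json", "args.json"):
            # arcname stores the file at the archive's top level
            # (an assumption about the expected layout).
            zf.write(results_dir / name, arcname=name)
    return archive_path
```

If either file is missing, zf.write raises FileNotFoundError, which gives an early signal that the results folder is incomplete.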

Then, you can submit your result to our evaluation page.

You are also welcome to open a pull request and contribute your code to our evaluation code. We will be very grateful for your contribution!

[Notice] Thank you for your interest in the MULTI dataset! If you want to add your model to our leaderboard, please fill in this questionnaire. Your information will be kept strictly confidential, so please feel free to fill it out. 🤗

📑 Citation

If you find our work useful, please cite us!

@article{zhu2025multi,
    title={{MULTI}: Multimodal Understanding Leaderboard with Text and Images}, 
    author={Zichen Zhu and Yang Xu and Lu Chen and Jingkai Yang and Yichuan Ma and Yiming Sun and Hailin Wen and Jiaqi Liu and Jinyu Cai and Yingzi Ma and Situo Zhang and Zihan Zhao and Liangtai Sun and Kai Yu},
    journal = "SCIENCE CHINA Information Sciences",
    year = "2025",
    volume = "68",
    number = "10",
    pages = "200107.1--200107.26",
    doi = "10.1007/s11432-024-4602-x"
}

📧 Contact Us

If you have any questions, please feel free to contact us via email [email protected]
