GAGE (General AI Gauge Engine) is a unified, extensible evaluation framework designed for large language models, multimodal models, audio models, and diffusion models. Compared with other evaluation engines, GAGE focuses on scalability, flexibility, and real-world agent evaluation.
Key Advantages
- 🧩 Extensible Design – Modular config, dataset, and adapter system for rapid integration of new tasks and models.
- ⚙️ Multi-Engine Inference – Supports vLLM, SGLang, and HF backends with distributed multi-GPU execution.
- 🧠 Agent Sandbox Evaluation – Built-in sandbox (FinQuery, GUI, Search) for realistic agent performance testing.
- 🤖 LLM-as-a-Judge – Native support for external APIs (GPT-4o, DeepSeek, Gemini, Claude) for automatic judgment.
- 📊 Unified Scoring Framework – Customizable metrics (accuracy, F1, reasoning consistency) across all tasks.
GAGE currently provides 14 NLP test sets, 2 finance test sets, 9 audio test sets, 5 GUI test sets, and 18 multimodal test sets; see the config folder for more details.
TODO:
- Upload formatted testsets to HuggingFace
llm-eval
├── README.md
├── __init__.py
├── benchmark_code # All scoring (evaluation) functions are located in this folder
├── config # Custom sample configuration files can be found here
├── docs # Automatically generated API documentation (built with Sphinx)
├── inference # Contains all inference engine–related code
├── post_eval.py # Script for launching evaluation after inference is completed
├── requirements.txt
├── run.py # Used together with run_pipeline.py
├── run_pipeline.py # The main entry point for running the entire evaluation pipeline
├── scripts # Example shell scripts (e.g., run.sh)
├── tools # Utility functions and wrappers (e.g., HTTP requests)
├── statistic.py # Script for aggregating and uploading final evaluation statistics
├── testsets # All non-business test sets are located here
└── utils # Common utility functions
# Where to save your prediction results
save_dir: /mnt/workspace/inference
# Subsample (optional)
# subsample: 0.001
# seed: 1235
# Infra params (optional)
# backend: hf
# temperature: 0.6
# preprocess: preprocess_no_think
# max_length: 32768
# max_new_tokens: 4096
# load_type: last
# engine_args: "--kv-cache-dtype fp8_e5m2 --quantization awq_marlin"
# tensor_parallel: 2
# judge_tensor_parallel: 4
# judge_max_length: 65536
# judge_max_new_tokens: 4096
tasks:
  Stock_Price_Prediction:
    compare_func:
      path: benchmark_code/BizFinBench/eval_stock_prediction.py
    data_path: Stock_Price_Prediction.jsonl
    type: text
  MATH (LLM as judge):
    type: text
    data_path: math__1-0-2.jsonl
    # judge:  # Alternative: use an API judge model
    #   preprocess: utils.judge.data_preprocess
    #   method: gpt-4o  # Supports gpt-4o, deepseek, gemini, claude
    judge:
      # Data preprocessing
      preprocess: benchmark_code.BizFinBench.eval_financial_description.data_preprocess
      # Local judge model
      judge_model_path: /mnt/judge-model/Qwen-72B-Instruct/V5
      judge_tensor_parallel: 4
    compare_func:
      path: utils/eval_math500.py

The configuration parameters are described below:

| Name | Type | Default | Description |
|---|---|---|---|
| subsample | float / int | None | Test set downsampling. If set to a float between 0 and 1, samples proportionally. If set to an int > 1, samples by count. |
| seed | int | None | Random seed for test set sampling, used to ensure reproducible samples. |
| backend | str | vllm | Model inference backend. Supported options: vllm / sglang / hf. |
| preprocess | str | preprocess | Preprocessing script for test samples. |
| prompt_type | str | chat_template | Prompt template type; defaults to the model’s built-in chat_template. |
| chat_template_kwargs | str | None | Keyword arguments passed to tokenizer.apply_chat_template, e.g., "enable_thinking=False". |
| temperature | float | 0 | Sampling temperature during generation; defaults to greedy decoding. |
| max_length | int | None | Maximum model sequence length (input + output); defaults to model configuration. |
| max_new_tokens | int | 32768 | Maximum number of tokens the model can generate in the output. |
| load_type | str | last | When the model directory contains multiple checkpoints, automatically load the last one (last) or the best-performing one on the validation set (best, requires training with the Swift framework). |
| engine_args | str | None | Arguments passed directly to the inference engine (vllm / sglang). |
| tensor_parallel | int | 1 | Tensor parallelism degree (number of GPUs) used for model evaluation. |
| judge_tensor_parallel | int | 1 | Tensor parallelism degree (number of GPUs) used for the judge model. |
| judge_max_length | int | None | Maximum sequence length (input + output) for the judge model; defaults to model configuration. |
| judge_max_new_tokens | int | 2048 | Maximum number of tokens the judge model can generate in the output. |
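For reference, the sketch below combines several of these parameters into a single configuration file. It is illustrative only: the task name `Multiple_Choice_QA` and the data path are placeholders, and the values shown are not shipped defaults.

```yaml
# Where to save prediction results
save_dir: /mnt/workspace/inference

# Reproducibly sample 10% of each test set
subsample: 0.1
seed: 1235

# Inference settings
backend: vllm
temperature: 0
max_new_tokens: 4096
tensor_parallel: 2

tasks:
  Multiple_Choice_QA:              # illustrative task name
    type: text
    data_path: multi_choice.jsonl  # illustrative data path
    compare_func:
      path: benchmark_code/Multiple_Choice_QA/eval_multi_choice.py
```

Launch the full pipeline with a config file and the model to evaluate: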
python run_pipeline.py \
    --config unit_test.yaml \
    --model_path /mnt/data/llm/models/chat/Qwen3-0.6B

When evaluating hybrid reasoning models such as Qwen3, you can disable the thinking mode by adding
chat_template_kwargs: enable_thinking=False
to the configuration file.
When evaluating gpt-oss series models, you can control the reasoning depth by adding
chat_template_kwargs: reasoning_effort="high"
to the configuration file.
Supported options include low, medium, and high (default: medium).
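The sketch below shows one way the setting could be placed in a configuration file, assuming `chat_template_kwargs` is set at the top level alongside the other inference options; the task entry is copied from the earlier example and is illustrative.

```yaml
save_dir: /mnt/workspace/inference

# Qwen3-style hybrid reasoning models: disable the thinking mode
chat_template_kwargs: enable_thinking=False
# For gpt-oss series models, control the reasoning depth instead:
# chat_template_kwargs: reasoning_effort="high"

tasks:
  MATH (LLM as judge):
    type: text
    data_path: math__1-0-2.jsonl
    compare_func:
      path: utils/eval_math500.py
```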
To use an external API model, export the following environment variables before launching the pipeline:

export EXTERNAL_API=gpt-4o  # Supports gpt-4o, deepseek, gemini, claude, and custom additions.
# If using official APIs such as OpenAI's, set EXTERNAL_API to 'chatgpt'.
export API_KEY=<your_api_key>  # Required when using official APIs such as OpenAI.
export MODEL_NAME=gpt-4o
python run_pipeline.py \
    --config unit_test.yaml

You can specify an external API as the judge model directly in the YAML file:
test:
  type: text
  data_path: /sft/data/TESTSET/TESTSET__OpenSource-Math__1-0-2.jsonl
  judge:
    preprocess: utils.judge.data_preprocess
    method: gpt-4o  # Specify which external model acts as the judge.
    # Supports gpt-4o, deepseek, gemini, claude, and custom additions.
  compare_func:
    path: utils/eval_math500.py

If model inference has completed but the evaluation phase fails, you can manually run the following command to obtain the evaluation results:
python post_eval.py --eval_func benchmark_code/Multiple_Choice_QA/eval_multi_choice.py --input_path xxx.jsonl --output_path xxx.log