OpenJudge is a unified framework designed to drive LLM and Agent application excellence through Holistic Evaluation and Quality Rewards.
Evaluation and reward signals are the cornerstones of application excellence. Holistic evaluation enables the systematic analysis of shortcomings to drive rapid iteration, while high-quality rewards provide the essential foundation for advanced optimization and fine-tuning.
OpenJudge unifies evaluation metrics and reward signals into a single, standardized Grader interface, offering pre-built graders, flexible customization, and seamless framework integration.
Access 50+ production-ready graders featuring a comprehensive taxonomy, rigorously validated for reliable performance.
| Category | Focus |
|---|---|
| Text & General | Semantic quality, functional correctness, structural compliance |
| Agent | Agent lifecycle, tool calling, memory, plan feasibility, trajectory quality |
| Multimodal | Image-text coherence, visual generation quality, image helpfulness |
- **Multi-Scenario Coverage**: Extensive support for diverse domains including Agent, text, code, math, and multimodal tasks. → Explore Supported Scenarios
- **Holistic Agent Evaluation**: Beyond final outcomes, we assess the entire lifecycle, including trajectories, Memory, Reflection, and Tool Use. → Agent Lifecycle Evaluation
- **Quality Assurance**: Every grader comes with benchmark datasets and pytest integration for validation (a minimal test sketch follows this list). → View Benchmark Datasets
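As an illustration of what such a validation test can look like, here is a small, hypothetical pytest sketch. It relies only on the API shown in the Quickstart below (`OpenAIChatModel`, `RelevanceGrader`, `aevaluate`); the 1-5 score scale and the thresholds are assumptions, and the tests that ship with each grader may be organized differently:

```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader


def _grade(query: str, response: str):
    """Helper: run the async grader from a synchronous pytest test."""
    model = OpenAIChatModel(model="qwen3-32b")
    grader = RelevanceGrader(model=model)
    return asyncio.run(grader.aevaluate(query=query, response=response))


def test_relevant_response_scores_high():
    result = _grade(
        "What is machine learning?",
        "Machine learning is a subset of AI that enables computers to learn from data.",
    )
    # Threshold assumes the 1-5 scale shown in the Quickstart.
    assert result.score >= 4, result.reason


def test_irrelevant_response_scores_low():
    result = _grade(
        "What is machine learning?",
        "The weather in Paris is usually mild in spring.",
    )
    assert result.score <= 2, result.reason
```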
Choose the build method that fits your requirements:
- **Customization**: Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or prompt templates to quickly define your own grader (a minimal sketch follows this list). → Custom Grader Development Guide
- **Zero-shot Rubrics Generation**: Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries; the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping when you want to get started immediately. → Zero-shot Rubrics Generation Guide
- **Data-driven Rubrics Generation**: Ambiguous requirements, but have a few examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. → Data-driven Rubrics Generation Guide
- **Training Judge Models**: Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. → Train Judge Models
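For the customization path, the actual base class and result type are defined in the Custom Grader Development Guide. The sketch below is illustrative only: the names `KeywordCoverageGrader` and `SimpleResult` are placeholders, and the class simply mirrors the `aevaluate(...) -> result.score / result.reason` shape shown in the Quickstart.

```python
from dataclasses import dataclass


@dataclass
class SimpleResult:
    """Placeholder result type; the real grader result class is defined by OpenJudge."""
    score: float
    reason: str


class KeywordCoverageGrader:
    """Rule-based example grader: scores a response by required-keyword coverage."""

    def __init__(self, keywords: list[str]):
        self.keywords = [k.lower() for k in keywords]

    async def aevaluate(self, query: str, response: str) -> SimpleResult:
        text = response.lower()
        hits = [k for k in self.keywords if k in text]
        coverage = len(hits) / len(self.keywords) if self.keywords else 0.0
        return SimpleResult(
            score=round(1 + 4 * coverage),  # map coverage onto an assumed 1-5 scale
            reason=f"Matched {len(hits)}/{len(self.keywords)} keywords: {hits}",
        )
```

Once it exposes the same `aevaluate` interface, such a grader can be awaited exactly like the built-in ones, e.g. `await KeywordCoverageGrader(["data", "learn"]).aevaluate(query=q, response=r)`.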
Using mainstream observability platforms like LangSmith or Langfuse? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like verl. → See Integrations for details
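As a rough idea of how such an integration works (the official adapters in the integration guides may differ), an OpenJudge grader can be wrapped as a plain callable that returns the kind of score dictionary custom evaluators on these platforms typically expect. Everything below is a hypothetical sketch, not the shipped integration:

```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader

# Hypothetical adapter; the shipped LangSmith / Langfuse integrations are
# documented in the respective guides and may expose a different interface.
_grader = RelevanceGrader(model=OpenAIChatModel(model="qwen3-32b"))


def relevance_evaluator(query: str, response: str) -> dict:
    """Synchronous wrapper returning a {'key', 'score', 'comment'} style dict."""
    result = asyncio.run(_grader.aevaluate(query=query, response=response))
    return {"key": "relevance", "score": result.score, "comment": result.reason}
```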
- 2025-12-26: Released OpenJudge v0.2.0 on PyPI. Major update! This release expands the core capabilities by adding robust support for diverse evaluation scenarios on top of reward construction. By unifying reward and evaluation signals, OpenJudge v0.2.0 provides a more holistic approach to optimizing application performance. → See the Migration Guide below.
- 2025-10-20: Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling. We released a new paper on learning generalizable reward criteria for robust reward modeling.
- 2025-10-17: Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning. We introduced techniques to align judge feedback and improve RL stability.
- 2025-07-09: Released OpenJudge v0.1.0 on PyPI.
```bash
pip install py-openjudge
```

More installation methods can be found in the Quickstart Guide.
```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader


async def main():
    # 1. Create model client
    model = OpenAIChatModel(model="qwen3-32b")

    # 2. Initialize grader
    grader = RelevanceGrader(model=model)

    # 3. Prepare data
    data = {
        "query": "What is machine learning?",
        "response": "Machine learning is a subset of AI that enables computers to learn from data.",
    }

    # 4. Evaluate
    result = await grader.aevaluate(**data)
    print(f"Score: {result.score}")   # Score: 5
    print(f"Reason: {result.reason}")


if __name__ == "__main__":
    asyncio.run(main())
```

The complete quickstart can be found in the Quickstart Guide.
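Because `aevaluate` is asynchronous, several samples can also be scored concurrently with standard asyncio tooling. A small sketch (the sample data below is made up for illustration):

```python
import asyncio

from openjudge.models import OpenAIChatModel
from openjudge.graders.common.relevance import RelevanceGrader

# Made-up examples for illustration only.
SAMPLES = [
    {"query": "What is machine learning?",
     "response": "Machine learning lets computers learn patterns from data."},
    {"query": "What is machine learning?",
     "response": "Paris is the capital of France."},
]


async def main():
    grader = RelevanceGrader(model=OpenAIChatModel(model="qwen3-32b"))
    # Score all samples concurrently.
    results = await asyncio.gather(*(grader.aevaluate(**s) for s in SAMPLES))
    for sample, result in zip(SAMPLES, results):
        print(f"{result.score}\t{sample['response'][:40]}\t{result.reason[:60]}")


if __name__ == "__main__":
    asyncio.run(main())
```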
Seamlessly connect OpenJudge with mainstream observability and training platforms:
| Category | Platform | Status | Documentation |
|---|---|---|---|
| Observability | LangSmith | Available | LangSmith Integration Guide |
| Observability | Langfuse | Available | Langfuse Integration Guide |
| Observability | Other frameworks | Planned | |
| Training | verl | In Progress | |
| Training | Trinity-RFT | Planned | |
Have a framework you'd like us to prioritize? Open an Issue!
We love your input! We want to make contributing to OpenJudge as easy and transparent as possible.
- **Adding New Graders**: Have domain-specific evaluation logic? Share it with the community!
- **Reporting Bugs**: Found a glitch? Help us fix it by opening an issue.
- **Improving Docs**: Clearer explanations or better examples are always welcome.
- **Proposing Features**: Have ideas for new integrations? Let's discuss!
See the full Contributing Guidelines for coding standards and the PR process.
Join our DingTalk group to connect with the community:
OpenJudge was previously distributed as the legacy package `rm-gallery` (v0.1.x). Starting from v0.2.0, it is published as `py-openjudge` and the Python import namespace is `openjudge`.
OpenJudge v0.2.0 is NOT backward compatible with v0.1.x. If you are currently using v0.1.x, choose one of the following paths:
- Stay on v0.1.x (legacy): keep using the old package (`pip install rm-gallery`). We preserved the source code of v0.1.7 (the latest v0.1.x release) in the `v0.1.7-legacy` branch.
- Migrate to v0.2.0 (recommended): follow the Installation section above, then walk through the Quickstart (or the full Quickstart Guide) to update your imports and usage.
If you run into migration issues, please open an issue with your minimal repro and current version.
If you use OpenJudge in your research, please cite:
```bibtex
@software{openjudge,
  title  = {OpenJudge: A Unified Framework for Holistic Evaluation and Quality Rewards},
  author = {{The OpenJudge Team}},
  url    = {https://github.com/modelscope/OpenJudge},
  month  = {07},
  year   = {2025}
}
```

Made with ❤️ by the OpenJudge Team
