Add mcp universe benchmark #36
…st infrastructure
Reverts the https:// to file:// URL changes that were introduced as a CI/CD workaround. These tests should use realistic https URLs. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…hmark # Conflicts: # pyproject.toml # uv.lock
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Restore tests/utils/logger.py to original StructuredEventLogger signatures, adding new methods for MCP-Universe support
- Move HumanReadableLogger to tests/benchmarks/mcp_universe/reporting.py (matching AppWorld's pattern)
- Fix README install instructions to use UV_GIT_LFS=1 uv pip install
- Remove redundant bfcl and mcpuniverse-eval optional dep groups

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Fix fragile HumanReadableLogger init (use proper constructor)
- Simplify complex ternary expression for expected value extraction
- Remove unused error classification variables (always 0)
- Move late json import to top of file
- Consolidate apply_patch() using dict iteration
- Extract magic numbers as constants (MAX_ITERATIONS, MAX_TOKENS, GITHUB_API_*)
- Extract _find_repo_with_fewest_issues helper to eliminate duplication
- Use specific exceptions instead of broad Exception catching
- Replace assert with explicit ValueError validation
- Remove no-op placeholder methods and their calls

Net reduction: -64 lines

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
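Two of the cleanups in the commit above (explicit ValueError validation in place of assert, and narrow exception handling in place of a broad `except Exception`) can be illustrated with a minimal sketch. The function names `extract_expected_value` and `parse_agent_answer` and the `"expected"` key are hypothetical and are not taken from the PR's code.

```python
import json
from typing import Any, Optional


def extract_expected_value(op_args: dict[str, Any]) -> Any:
    """Return the expected value from an evaluator's arguments.

    Hypothetical helper: the key name "expected" is an assumption.
    An explicit ValueError is used because assert statements are
    skipped entirely when Python runs with the -O flag.
    """
    if "expected" not in op_args:
        raise ValueError(f"missing 'expected' key; got keys {sorted(op_args)}")
    return op_args["expected"]


def parse_agent_answer(raw: str) -> Optional[dict[str, Any]]:
    """Parse an agent's final answer as a JSON object, or return None.

    Only json.JSONDecodeError is caught, so unrelated bugs still surface
    instead of being swallowed by a broad `except Exception`.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Raising ValueError keeps the validation active in optimized runs, and the narrow `except` clause makes it clear which failure mode is expected.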
- Add proper type annotations to evaluator_patch.py functions
- Use cast() for json.load() returns in loader.py
- Remove manual secrets loading (FastAgent handles automatically)
- Remove unnecessary mypy/ruff overrides for empty bfcl data dir
- Rename msg -> user_msg to avoid type shadowing

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
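As a rough illustration of the `cast()` change mentioned in the commit above: `json.load()` is typed as returning `Any`, so a loader that promises a concrete return type can use `typing.cast` to state the expected shape to the type checker. The function name `load_task` is hypothetical and does not come from loader.py.

```python
import json
from pathlib import Path
from typing import Any, cast


def load_task(path: Path) -> dict[str, Any]:
    """Load one task definition from a JSON file (hypothetical helper).

    cast() only informs the type checker; it performs no runtime
    validation, so a file containing a JSON list would still be
    returned as a list.
    """
    with path.open(encoding="utf-8") as f:
        return cast(dict[str, Any], json.load(f))
```

Because `cast()` is a no-op at runtime, it suits trusted benchmark data; untrusted input would call for an explicit `isinstance` check instead.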
- Extract helper functions to reduce complexity (C901, PLR0912, PLR0915)
- Add LoggingContext and EvaluationCheck dataclasses to reduce params (PLR0913)
- Use list comprehensions instead of append loops (PERF401)
- Auto-apply evaluator patches on module import for cleaner imports
- Remove all MCP-Universe ruff ignores from pyproject.toml

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
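The dataclass and list-comprehension refactors named in the commit above might look roughly like the sketch below. Only the class name `EvaluationCheck` comes from the commit message; its fields and the `summarize` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class EvaluationCheck:
    """Bundle of values that would otherwise be separate parameters.

    All field names here are hypothetical.
    """

    func: str
    op: str
    expected: Any
    passed: bool
    reason: str = ""


def summarize(checks: list[EvaluationCheck]) -> str:
    """Summarize results; takes one list instead of many positional args."""
    # A list comprehension replaces an append loop (the PERF401 pattern).
    failed = [c for c in checks if not c.passed]
    return f"{len(checks) - len(failed)}/{len(checks)} checks passed"
```

Grouping related values in one object keeps function signatures under the argument limit that PLR0913 enforces.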
- Extract _validate_test() to handle validation + human logging
- Extract _log_evaluation_results() for cleaner separation
- Remove asyncio.sleep(0), simplify parametrize and assert
- Main test function now cleanly separates run vs validate modes

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
- Execution phase now only writes to structured JSONL log
- Human-readable log generated during validation by replaying structured events
- Added HumanReadableLogger.from_structured_log() classmethod for replay
- Removed LoggingContext dataclass and simplified _process_message_logs
- Removed unused functions: _find_tool_name, _get_final_assistant_message, _determine_completion_status

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
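The replay design described in the commit above (write only structured JSONL during execution, then derive the human-readable log from it during validation) can be sketched as follows. The class and classmethod names come from the commit message; the event schema (`type`, `role`, `content`, `name`, `arguments`), the `log_event` method, and the output format are assumptions, not the PR's actual implementation.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class HumanReadableLogger:
    """Collects human-readable lines; the event schema is hypothetical."""

    lines: list[str] = field(default_factory=list)

    def log_event(self, event: dict[str, Any]) -> None:
        """Render one structured event as a single readable line."""
        kind = event.get("type", "unknown")
        if kind == "message":
            role = str(event.get("role", "?")).upper()
            self.lines.append(f"{role}: {event.get('content', '')}")
        elif kind == "tool_call":
            self.lines.append(f"TOOL {event.get('name')}({event.get('arguments')})")
        else:
            # Unknown event types are kept verbatim so nothing is lost.
            self.lines.append(f"[{kind}] {json.dumps(event, sort_keys=True)}")

    @classmethod
    def from_structured_log(cls, path: Path) -> "HumanReadableLogger":
        """Rebuild the readable log by replaying a JSONL file line by line."""
        logger = cls()
        with path.open(encoding="utf-8") as f:
            for raw in f:
                if raw.strip():  # skip blank lines
                    logger.log_event(json.loads(raw))
        return logger
```

One benefit of this arrangement is that the JSONL file stays the single source of truth: the readable report can be regenerated at any time without rerunning the agent.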
Make max_iterations a parameter with default value instead of duplicating the constant in two files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
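A minimal sketch of the pattern in the commit above: the limit is defined once and exposed as a keyword parameter with a default, so callers in other files no longer need their own copy of the constant. The constant's value, the name `DEFAULT_MAX_ITERATIONS`, and the function `run_agent_loop` are hypothetical.

```python
from collections.abc import Callable

# Hypothetical value; the PR's actual limit is not shown in this page.
DEFAULT_MAX_ITERATIONS = 10


def run_agent_loop(
    step: Callable[[int], bool],
    max_iterations: int = DEFAULT_MAX_ITERATIONS,
) -> int:
    """Call step(i) until it returns True or the limit is reached.

    Returns the number of iterations that were executed.
    """
    for i in range(max_iterations):
        if step(i):
            return i + 1
    return max_iterations
```

A caller that needs a different budget passes `max_iterations=...` explicitly; every other caller inherits the single default.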
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Pull request overview
This PR adds a comprehensive benchmark integration for MCP-Universe's repository management tasks. The implementation includes test infrastructure, evaluation logic with patches for compatibility issues, human-readable logging capabilities, and configuration for running 28 GitHub-based tasks using FastAgent with the GitHub MCP server v0.15.0.
Key changes:
- Added MCP-Universe benchmark test suite with pytest infrastructure
- Implemented evaluator patches to fix false negatives in the original MCP-Universe evaluation functions
- Created human-readable logging system for test results and manual annotation
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/utils/fastagent_helpers.py | Added JSON serialization mode parameter to model_dump call |
| tests/conftest.py | Added output_dir fixture for test result directories |
| tests/benchmarks/mcp_universe/test_mcp_universe.py | Main test implementation for running and validating MCP-Universe tasks |
| tests/benchmarks/mcp_universe/reporting.py | Human-readable logging infrastructure for benchmark results |
| tests/benchmarks/mcp_universe/mcp_server_config.json | Docker-based GitHub MCP server configuration |
| tests/benchmarks/mcp_universe/instruction.txt | Agent instruction template for task execution |
| tests/benchmarks/mcp_universe/fastagent.config.yaml | FastAgent configuration with pinned GitHub MCP server version |
| tests/benchmarks/mcp_universe/evaluator_patch.py | Patches for MCP-Universe evaluator compatibility with GitHub MCP Server v0.15.0 |
| tests/benchmarks/mcp_universe/evaluator.py | Evaluation orchestration for repository management tasks |
| tests/benchmarks/mcp_universe/__init__.py | Package initialization |
| tests/benchmarks/mcp_universe/README.md | Comprehensive documentation for setup and usage |
| tests/benchmarks/appworld/mcp_server.py | Removed type ignore comments from decorators |
| pyproject.toml | Added mcpuniverse dependency and related overrides |
@vinamra57