Release OpenCompass v0.2.1 · open-compass/opencompass

We're thrilled to announce OpenCompass v0.2.1, loaded with new datasets, features, and vital fixes. This release is a testament to our ongoing commitment to enhancing user experience and broadening research capabilities.

🌟 Highlights:

Add Agent and Code datasets: Diverse new datasets like GPQA, mastermath2024v1, and more, significantly expanding the scope of OpenCompass.
Support Different JudgeLLMs in Subjective Evaluation: More choice when selecting judge LLMs.
Support Needle in a Haystack: Needle in a Haystack testing for long-text evaluation.
Add vLLM Evaluation: vLLM inference and evaluation are now supported.

Here's what's new:

🚀 New Features:

📦 Dataset Expansion:
- Added rwkv-5-3b model (#666)
- Integration of diverse datasets including GPQA, Creationbench, and more.
- Support for new datasets like mastermath2024v1, mbpp_plus, and sanitized_mbpp (#744, #770, #745)
🛠 Functional Enhancements:
- Subjective evaluation improvements (#692, #724)
- Updated python action, slurm, and docker docs (#694, #718)
- Turbomind API support and Qwen API integration (#693, #735)
📖 Documentation Updates:
- Updated contamination, alignmentbench, and other docs for better clarity (#698, #707)
- Fixed dead links and typos in various documents (#455, #773, #774)

🐛 Bug Fixes:

Addressed various issues including those in alignmentbench (#704, #764), configs (#706, #750), and postprocess scripts (#771).
Fixed bugs concerning subjective evaluation (#719, #730, #746) and EOS string detection.
Quick fixes for improved performance and reliability.

🎉 Welcome New Contributors:

A warm welcome to our first-time contributors:
- @BBuf, @DseidLi, @Skyfall-xzz, @RunningLeon, @zehuichen123, @AllentDan, @Connor-Shen, @Francis-llgg, @hzhwcmhf, @ChrisLiu6, @yanyc428, @tpoisonooo, @jiangjin1999

🔗 Full Changelog

add rwkv-5-3b model by @BBuf in #666
[Feature] Add double order of subjective evaluation and removing duplicated response among two models by @bittersweet1999 in #692
[Feat] update python action and slurm by @yingfhu in #694
[Doc] Update contamination docs by @Leymore in #698
alignmentbench infer and judge by @bittersweet1999 in #697
[Fix] Update alignmentbench by @tonysy in #704
removed redundant code in GSM8KDataset.load method. by @DseidLi in #700
[Fix] fix a bug on configs/eval_mixtral_8x7b.py by @jingmingzhuo in #706
[Doc] Update Doc for Alignbench by @tonysy in #707
[Fix] minor fix openai by @yingfhu in #711
Add Judgellms by @bittersweet1999 in #710
[Feat] Update math/agent by @yingfhu in #716
[Docs] update docker docs by @yingfhu in #718
[Fix] Quick fix for max_out_len in subjective evaluation by @bittersweet1999 in #719
[Feature] Support the use of humaneval_plus. by @jingmingzhuo in #720
[Feature] Add reasonbench dataset by @Skyfall-xzz in #577
[Feature] Add abbr for judgemodel in subjective evaluation by @bittersweet1999 in #724
Update configs for evaluating chat models like qwen, baichuan, llama2 using turbomind backend by @RunningLeon in #721
[News] add news for T-Eval by @zehuichen123 in #727
Add NeedleInAHaystack Test Support by @DseidLi in #714
[Fix] Fixed abbr erro of subjective alignbench and size partition by @bittersweet1999 in #730
add turbomind restful api support by @AllentDan in #693
[Fix] Update merge script for non-split settting by @tonysy in #733
[Sync] Sync with internal codes by @Leymore in #734
[Feature] Add InfiniteBench by @philipwangOvO in #739
Update LightllmApi and Fix mmlu bug by @helloyongyang in #738
[Feature] Add other judgelm prompts for Alignbench by @bittersweet1999 in #731
[Feat] support sanitized mbpp dataset by @yingfhu in #745
[Fix] SubSizePartition fix by @bittersweet1999 in #746
add chinese version of humaneval, mbpp by @Connor-Shen in #743
[Fix] fix erro in configs by @bittersweet1999 in #750
[Feature] Add Creationbench Dataset by @bittersweet1999 in #753
[Feat] update code config by @yingfhu in #749
update plot function in tools_needleinahaystack.py by @DseidLi in #747
[Feature] Add new dataset mastermath2024v1 by @Francis-llgg in #744
[Feature] Add GPQA Dataset by @Francis-llgg in #729
change NeedleInAHaystackDataset to dynamic loading by @DseidLi in #754
[Feature] Add support of Qwen API by @hzhwcmhf in #735
[Feature] Support LLaMA2-Accessory by @ChrisLiu6 in #732
[Fix] Fix small bug in alignbench by @bittersweet1999 in #764
[Feature] Add multi_round dataset evaluation by @bittersweet1999 in #766
[Feature] add subject ir dataset by @bittersweet1999 in #755
[Update] Update introduction of CompassBench-2024-Q1 by @tonysy in #769
[Fix] quick fix for postprocess by @bittersweet1999 in #771
Support Mbpp_plus dataset by @Connor-Shen in #770
[Fix] fix typos in drop prompt by @yanyc428 in #773
typo(installation.md): fix unzip commands by @tpoisonooo in #774
Contamination analysis for MMLU, Hellaswag, and ARC_c by @liyucheng09 in #699
[Docs] Update contamination docs by @Leymore in #775
[Feature] _batch_generate function, add the MultiTokenEOSCriteria by @jiangjin1999 in #772
[Sync] Sync with internal codes 2023.01.08 by @Leymore in #777

For a full list of updates, visit our Full Changelog.

Thank you to every contributor, old and new. Your dedication is shaping OpenCompass into a more robust and versatile tool. 🙌 🎉

Remember to star 🌟 our GitHub repository if OpenCompass aids your research and development! Your support and feedback are crucial for our continuous improvement.
