OpenCompass v0.2.1
We're thrilled to announce OpenCompass v0.2.1, loaded with new datasets, features, and vital fixes. This release is a testament to our ongoing commitment to enhancing user experience and broadening research capabilities.
🌟 Highlights:
- Add Agent and Code datasets: Diverse new datasets like GPQA, mastermath2024v1, and more, significantly expanding the scope of OpenCompass.
- Support Different JudgeLLM Subjective Evaluation: Providing more choices when selecting JudgeLLMs.
- Support Needle in a Haystack: Adds the Needle in a Haystack test for long-text evaluation.
- Add vLLM Evaluation: We support vLLM inference and evaluation.
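For readers new to the Needle in a Haystack test mentioned above, the idea can be sketched in a few lines: a short fact (the "needle") is placed at a chosen depth inside long filler text, and the model is then asked to recall it. The snippet below is only an illustrative sketch, not the OpenCompass implementation; the function names `insert_needle` and `needle_recalled`, the filler text, and the keyword-matching check are all assumptions made for this example.

```python
def insert_needle(haystack_sentences, needle, depth_percent):
    """Place `needle` roughly `depth_percent` of the way through the filler text.

    Hypothetical helper for illustration; not part of the OpenCompass API.
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    # Index of the filler sentence that the needle is inserted before.
    position = round(len(haystack_sentences) * depth_percent / 100)
    sentences = haystack_sentences[:position] + [needle] + haystack_sentences[position:]
    return " ".join(sentences)


def needle_recalled(model_answer, expected_keyword):
    """Simplified check: does the answer mention the expected keyword (case-insensitive)?"""
    return expected_keyword.lower() in model_answer.lower()


# Build a small long-context prompt with the needle placed at 50% depth.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret passphrase is 'blue-lantern'."
context = insert_needle(filler, needle, depth_percent=50)
prompt = context + "\nQuestion: What is the secret passphrase?"
```

In practice such a test would sweep over many context lengths and depths and pass each `prompt` to the model under evaluation; that part is omitted here because it depends on the inference backend.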
Here's what's new:
🚀 New Features:
📦 Dataset Expansion:
🛠 Functional Enhancements:
📖 Documentation Updates:
🐛 Bug Fixes:
- Addressed various issues including those in alignmentbench, configs, and postprocess scripts.
- Fixed bugs concerning subjective evaluation and EOS string detection.
- Quick fixes for improved performance and reliability.
🎉 Welcome New Contributors:
- A warm welcome to our first-time contributors:
🔗 Full Changelog
- add rwkv-5-3b model by @BBuf in #666
- [Feature] Add double order of subjective evaluation and removing duplicated response among two models by @bittersweet1999 in #692
- [Feat] update python action and slurm by @yingfhu in #694
- [Doc] Update contamination docs by @Leymore in #698
- alignmentbench infer and judge by @bittersweet1999 in #697
- [Fix] Update alignmentbench by @tonysy in #704
- removed redundant code in GSM8KDataset.load method. by @DseidLi in #700
- [Fix] fix a bug on configs/eval_mixtral_8x7b.py by @jingmingzhuo in #706
- [Doc] Update Doc for Alignbench by @tonysy in #707
- [Fix] minor fix openai by @yingfhu in #711
- Add Judgellms by @bittersweet1999 in #710
- [Feat] Update math/agent by @yingfhu in #716
- [Docs] update docker docs by @yingfhu in #718
- [Fix] Quick fix for max_out_len in subjective evaluation by @bittersweet1999 in #719
- [Feature] Support the use of humaneval_plus. by @jingmingzhuo in #720
- [Feature] Add reasonbench dataset by @Skyfall-xzz in #577
- [Feature] Add abbr for judgemodel in subjective evaluation by @bittersweet1999 in #724
- Update configs for evaluating chat models like qwen, baichuan, llama2 using turbomind backend by @RunningLeon in #721
- [News] add news for T-Eval by @zehuichen123 in #727
- Add NeedleInAHaystack Test Support by @DseidLi in #714
- [Fix] Fixed abbr erro of subjective alignbench and size partition by @bittersweet1999 in #730
- add turbomind restful api support by @AllentDan in #693
- [Fix] Update merge script for non-split settting by @tonysy in #733
- [Sync] Sync with internal codes by @Leymore in #734
- [Feature] Add InfiniteBench by @philipwangOvO in #739
- Update LightllmApi and Fix mmlu bug by @helloyongyang in #738
- [Feature] Add other judgelm prompts for Alignbench by @bittersweet1999 in #731
- [Feat] support sanitized mbpp dataset by @yingfhu in #745
- [Fix] SubSizePartition fix by @bittersweet1999 in #746
- add chinese version of humaneval, mbpp by @Connor-Shen in #743
- [Fix] fix erro in configs by @bittersweet1999 in #750
- [Feature] Add Creationbench Dataset by @bittersweet1999 in #753
- [Feat] update code config by @yingfhu in #749
- update plot function in tools_needleinahaystack.py by @DseidLi in #747
- [Feature] Add new dataset mastermath2024v1 by @Francis-llgg in #744
- [Feature] Add GPQA Dataset by @Francis-llgg in #729
- change NeedleInAHaystackDataset to dynamic loading by @DseidLi in #754
- [Feature] Add support of Qwen API by @hzhwcmhf in #735
- [Feature] Support LLaMA2-Accessory by @ChrisLiu6 in #732
- [Fix] Fix small bug in alignbench by @bittersweet1999 in #764
- [Feature] Add multi_round dataset evaluation by @bittersweet1999 in #766
- [Feature] add subject ir dataset by @bittersweet1999 in #755
- [Update] Update introduction of CompassBench-2024-Q1 by @tonysy in #769
- [Fix] quick fix for postprocess by @bittersweet1999 in #771
- Support Mbpp_plus dataset by @Connor-Shen in #770
- [Fix] fix typos in drop prompt by @yanyc428 in #773
- typo(installation.md): fix unzip commands by @tpoisonooo in #774
- Contamination analysis for MMLU, Hellaswag, and ARC_c by @liyucheng09 in #699
- [Docs] Update contamination docs by @Leymore in #775
- [Feature] _batch_generate function, add the MultiTokenEOSCriteria by @jiangjin1999 in #772
- [Sync] Sync with internal codes 2023.01.08 by @Leymore in #777
For a full list of updates, visit our Full Changelog.
Thank you to every contributor, old and new. Your dedication is shaping OpenCompass into a more robust and versatile tool. 🙌 🎉
Remember to star 🌟 our GitHub repository if OpenCompass aids your research and development! Your support and feedback are crucial for our continuous improvement.