
Commit 72be960

MarkovChain-why, kennymckormick, and mzr1996 authored
Add hipho physics dataset (#1318)
* Initial commit for HiPhO

* update

* update

* refactor: simplify HiPhO dataset logging system

- Remove custom LogBuffer class and thread-safe logging
- Replace safe_print with standard print statements
- Remove threading and datetime imports
- Simplify build_prompt function by removing verbose debug output
- Update dataset URL from haiyuanwan/HiPhO to HY-Wan/HiPhO
- Reduce code from 899 to 803 lines (10.7% reduction)
- Maintain all core functionality: evaluation logic, prompt building, hipho_verifier integration

* refactor: remove parallel evaluation framework from HiPhO dataset

- Remove complex parallel evaluation using track_progress_rich
- Simplify to sequential evaluation for better stability and debugging
- Remove multiprocessing and parallel task management dependencies
- Rename functions to remove '_with_buffer' suffix and log_buffer parameters
- Remove nproc parameter handling and temporary file management
- Reduce code from 803 to 774 lines (additional 3.6% reduction)
- Maintain all core evaluation logic: fine/coarse-grained scoring, hipho_verifier integration
- Sequential evaluation is sufficient for physics olympiad problem counts

* refactor: major simplification of HiPhO dataset implementation

Major improvements:
- Remove 6 unnecessary try-except blocks that were hiding errors
- Standardize judge model initialization to follow VLMEvalKit conventions
- Move all prompt templates to utils/prompt_inference.py for better organization
- Remove redundant count statistics (fine_grained_count, coarse_grained_count, total_count)
- Remove unused fallback functions (_simple_answer_matching, _extract_prediction_for_display)
- Fix multi-image base64 processing bug
- Correct dataset name display in summary output
- Remove verbose debugging output and unnecessary comments

Code reduction: 899 → 604 lines (32.8% reduction)
Eliminated potential bugs and improved maintainability while preserving all core functionality

* Improve HiPhO dataset: translate comments to English and enhance configuration

- Translate all Chinese comments to English in hipho.py, hipho_verifier.py, and prompt_inference.py
- Simplify comments while maintaining technical accuracy
- Replace hardcoded verifier model configuration with environment variables
- Use VLMEvalKit standard environment variable approach for better flexibility
- Add support for HIPHO_VERIFIER_* environment variables for model configuration
- Improve code maintainability and international accessibility

* Add new dependencies for HiPhO dataset functionality

- Add datasets: for HuggingFace dataset loading
- Add scikit-learn: for machine learning utilities
- Add pylatexenc==2.10: for LaTeX text processing
- Add math-verify: for mathematical answer verification

These dependencies are required for the HiPhO physics olympiad dataset evaluation and verification functionality.

* Add hipho_prompt_inference.py utility file

* Update import statement for prompt inference module

---------

Co-authored-by: Haodong Duan <[email protected]>
Co-authored-by: Ma Zerun <[email protected]>
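The commit message names a `HIPHO_VERIFIER_*` family of environment variables but does not list the individual names. The snippet below is only a sketch of how prefix-based configuration of this kind could be gathered; the helper `collect_verifier_config` and the variable suffix used in the usage note are hypothetical and are not taken from the commit.

```python
import os
from typing import Dict, Mapping, Optional

# Prefix taken from the commit message; the suffixes are not documented there.
PREFIX = "HIPHO_VERIFIER_"


def collect_verifier_config(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return {lowercased suffix: value} for every variable that starts with PREFIX.

    Hypothetical helper: the real hipho_verifier.py may read its settings differently.
    """
    env = os.environ if environ is None else environ
    return {
        key[len(PREFIX):].lower(): value
        for key, value in env.items()
        # Skip unrelated variables and a bare prefix with no suffix.
        if key.startswith(PREFIX) and len(key) > len(PREFIX)
    }
```

Calling `collect_verifier_config({"HIPHO_VERIFIER_EXAMPLE": "x"})` returns `{"example": "x"}`; `EXAMPLE` is a made-up suffix used purely for illustration. Passing a mapping explicitly keeps the helper easy to test without touching the real process environment.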
1 parent 8dc65e0 commit 72be960

5 files changed, +2078 -1 lines changed


requirements.txt

Lines changed: 4 additions & 0 deletions
@@ -1,4 +1,5 @@
 accelerate
+datasets
 dotenv
 einops
 # for gemini api
@@ -8,6 +9,7 @@ huggingface_hub
 imageio
 ipdb
 json_repair
+math-verify
 matplotlib
 nltk
 numpy
@@ -19,10 +21,12 @@ pandas
 pillow
 portalocker
 protobuf
+pylatexenc==2.10
 python-dotenv
 qwen_vl_utils
 requests
 rich
+scikit-learn
 sentencepiece
 setuptools
 sty
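After pulling this change, an existing environment may still lack the four added requirements. The following is a minimal sketch of a pre-flight check, not part of the commit: the helper `missing_packages` is hypothetical, and the import names in the mapping (`math_verify`, `sklearn`, and so on) are the names these packages are commonly imported under, stated here as an assumption.

```python
from importlib.util import find_spec
from typing import Dict, List

# Requirement name -> top-level import name (assumed; not stated in the commit).
NEW_DEPENDENCIES: Dict[str, str] = {
    "datasets": "datasets",
    "math-verify": "math_verify",
    "pylatexenc": "pylatexenc",
    "scikit-learn": "sklearn",
}


def missing_packages(deps: Dict[str, str] = NEW_DEPENDENCIES) -> List[str]:
    """Return the requirement names whose import module cannot be located."""
    return sorted(req for req, module in deps.items() if find_spec(module) is None)


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All new HiPhO dependencies are importable.")
```

The check only confirms that each module can be found; it does not verify versions, so the `pylatexenc==2.10` pin still has to be enforced by installing from `requirements.txt`.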

vlmeval/dataset/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -93,6 +93,7 @@
 from .matbench import MATBench
 
 from .reasonmap_plus import ReasonMap_Plus
+from .hipho import HiPhODataset
 from .gsm8k_v import GSM8KVDataset
 
 
@@ -223,7 +224,7 @@ def evaluate(self, eval_file, **judge_kwargs):
     OmniEarthMCQBench, VisFactor, OSTDataset, OCRBench_v2, TreeBench, CVQA, M4Bench,
     AyaVisionBench, TopViewRS, VLMBias, MMHELIX, MedqbenchMCQDataset, MathCanvas,
     MedqbenchPairedDescriptionDataset, MedqbenchCaptionDataset, ChartMuseum, ChartQAPro, ReasonMap_Plus,
-    olmOCRBench, OceanOCRBench, MATBench, VLRMBench, RefCOCODataset, SimpleVQA
+    olmOCRBench, OceanOCRBench, MATBench, VLRMBench, RefCOCODataset, SimpleVQA, HiPhODataset
 ]
 
 VIDEO_DATASET = [
