CommitSuite is a comprehensive benchmark designed for both commit classification and commit message generation tasks. By integrating CCS (Conventional Commits Specification) and high-quality commit messages, we construct a large-scale, clean dataset covering multiple languages and repositories. We propose a new binary evaluation metric system that enables automated assessment of commit messages without relying on human-written references. Experiments demonstrate that LLMs outperform existing tools in both tasks, especially in message generation.
Our dataset comprises 63,581 high-quality commits from 243 open-source repositories, all strictly following the CCS format and covering seven popular programming languages.
(counts only include commits modifying files in a single language)
| Language | Commit Count |
|---|---|
| Go | 14807 |
| JavaScript | 10903 |
| TypeScript | 18835 |
| Python | 12740 |
| C/C++ | 5612 |
| Java | 2009 |
- `url`
- `hash`
- `msg`
- `message_type`
- `description`
- `why`
- `what`
- `author`
- `email`
- `date`
- `modified_files` (array)
  - `old_path`
  - `new_path`
  - `filename`
  - `change_type`
  - `diff`
  - `added_lines`
  - `deleted_lines`
- `issues` (array)
  - `number`
  - `title`
  - `body`
  - `url`
- `prs` (array)
- `comments` (array)
- `repo_name`
- `strict_ccs`
- `ast_changes` (array)
  - `file`
  - `changes` (array)
    - `type`
    - `name`
    - `structure_type`
    - `location`
    - `details` (object)

Field descriptions:

- `url`: URL of the commit
- `hash`: Hash identifier of the commit
- `msg`: Full commit message (raw text)
- `message_type`: Commit type specified by the developer (e.g., `fix`, `feat`)
- `description`: Descriptive text following the commit type (use this field to hide type information from tools/LLMs)
- `why`: Indicates presence of "why" rationale in the commit message (`0` = absent, `1` = present)
- `what`: Indicates presence of "what" change information in the commit message (`0` = absent, `1` = present)
- `author`: Developer who authored the commit
- `email`: Author's email address
- `date`: Commit date (YYYY-MM-DD format)
- `modified_files`: Metadata for modified files, including diffs
- `issues`: Issues referenced in the commit message (via `#xxx` notation)
- `prs`: Pull requests referenced in the commit message (via `#xxx` notation)
- `comments`: Code review comments on the commit
- `repo_name`: Source repository name (format: `owner/repo`)
- `strict_ccs`: CCS compliance level (`0` = near-perfect adherence, `1` = majority adherence)
- `ast_changes`: AST changes parsed by tree-sitter
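The schema above can be illustrated with a minimal entry. All values below are invented for demonstration only; real entries come from the CommitSuite dataset files:

```python
# A hypothetical sample entry matching the dataset schema (values invented).
commit = {
    "url": "https://github.com/owner/repo/commit/abc123",
    "hash": "abc123",
    "msg": "fix: handle nil pointer in parser (#42)",
    "message_type": "fix",
    "description": "handle nil pointer in parser (#42)",
    "why": 1,
    "what": 1,
    "author": "Jane Doe",
    "email": "jane@example.com",
    "date": "2024-01-15",
    "modified_files": [
        {"old_path": "parser.go", "new_path": "parser.go",
         "filename": "parser.go", "change_type": "MODIFY",
         "diff": "@@ -10,3 +10,5 @@ ...", "added_lines": 3, "deleted_lines": 1}
    ],
    "issues": [{"number": 42, "title": "Crash on empty input",
                "body": "...", "url": "https://github.com/owner/repo/issues/42"}],
    "prs": [],
    "comments": [],
    "repo_name": "owner/repo",
    "strict_ccs": 0,
    "ast_changes": [],
}

# To hide the CCS type from a tool under test, feed it `description`
# rather than the full `msg` (which starts with the type prefix).
tool_input = commit["description"]
print(tool_input)
```

Note how `description` omits the `fix:` prefix present in `msg`, which is what makes it suitable as input for classification experiments.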
1. Test your target tool on our Ten-category-eval Dataset.
2. Save results in `Ten-category_result.json` with these required fields:
   - `"message_type"` (original commit type)
   - `"Ten-category_result"` (tool-predicted type)
3. Run the evaluation script: `python Ten-category_eval.py`
4. Generated reports will appear in `./Ten-category_reports`:
   - Classification metrics: `classification_metrics.csv`
   - Confusion matrix visualization: `confusion_matrix.png`
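As a sketch, the expected shape of `Ten-category_result.json` might be produced like this (the predictions below are hypothetical placeholders for your tool's output):

```python
import json
import os
import tempfile

# Hypothetical classification results: `message_type` comes from the
# Ten-category-eval dataset, `Ten-category_result` from your tool.
results = [
    {"message_type": "fix",  "Ten-category_result": "fix"},
    {"message_type": "feat", "Ten-category_result": "refactor"},
]

# The evaluation script expects both fields on every entry.
# Written to a temp directory here for illustration.
out_path = os.path.join(tempfile.mkdtemp(), "Ten-category_result.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
print(out_path)
```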
1. Test your target tool on our CMG-eval Dataset.
2. Save results in `CMG_result.json` containing:
   - All original dataset fields
   - An additional `"CMG_result"` field (tool-generated message)
3. Evaluate traditional metrics: `python CMG_eval_traditional_metrics.py`
   - Output: `./CMG_reports/CMG_eval_traditional_metrics.csv`
   - Metrics: BLEU, ROUGE-L, METEOR
4. Evaluate binary metrics: `python CMG_eval_binary_metrics.py`
   (Before running `CMG_eval_binary_metrics.py`, please read the Configuration Note first.)
   - Outputs in `./CMG_reports`:
     - `CMG_eval_binary_metrics.json` (augmented with binary scores per entry)
     - `CMG_eval_binary_metrics.csv` (aggregated binary metrics)
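A minimal sketch of building `CMG_result.json` entries: each entry keeps all original dataset fields and gains a `"CMG_result"` field. The `generate_message` function below is a hypothetical stand-in for your commit-message-generation tool, and the sample entry's values are invented:

```python
import json

# A hypothetical dataset entry (trimmed to a few fields for brevity;
# real entries carry the full schema described above).
dataset = [
    {"hash": "abc123",
     "msg": "fix: handle nil pointer in parser",
     "description": "handle nil pointer in parser"},
]

def generate_message(entry):
    # Placeholder: replace with a call to your CMG tool or LLM.
    return "fix: guard against nil input in parser"

# Keep every original field and attach the generated message.
results = [{**entry, "CMG_result": generate_message(entry)} for entry in dataset]
print(json.dumps(results, indent=2))
```

Serializing `results` to `CMG_result.json` then gives both evaluation scripts the fields they expect.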
Configuration Note: The binary evaluation script uses:
- API: https://api.deepseek.com
- Model: `deepseek-v3` (modify the prompts/script logic if changing the API or model)