This repository contains the replication package for the research paper "Algorithm-Based Pipeline for Reliable and Intent-Preserving Code Translation with LLMs", submitted to the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026).
This study evaluates five state-of-the-art large language models:
- DeepSeek-V3 (`deepseek_chat`) - large Mixture-of-Experts model with state-of-the-art open-source results on code and math
- DeepSeek-R1 (`deepseek_r1`) - reasoning-oriented model trained with supervised fine-tuning and multi-stage RL
- GPT-4o (`gpt_4o`) - OpenAI's proprietary model with strong zero-shot software-engineering performance
- Llama-3.3-70B-Versatile (`llama_4_maverick`) - Meta's versatile general-purpose open-source model
- Qwen2.5-72B-Instruct (`qwen_2.5_72b_instruct`) - high-capacity dense model with strong code and math skills and long-context support
All models were accessed via the OpenRouter API for consistent evaluation.
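For reference, a minimal sketch of such a call using OpenRouter's OpenAI-compatible endpoint is shown below; the model slug, prompt, and decoding settings are illustrative, not the exact ones used in the study:

```python
# Illustrative only: queries one model through OpenRouter's
# OpenAI-compatible API. Requires `pip install openai`.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # replace with your own key
)

response = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # example slug; see OpenRouter's model list
    messages=[{
        "role": "user",
        "content": "Translate the following Java program to Python:\n\n<java code here>",
    }],
)
print(response.choices[0].message.content)
```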
The research evaluates bidirectional translation:
- Java → Python (`java_to_python`)
- Python → Java (`python_to_java`)
See `QUICKSTART.md` for detailed setup and usage instructions.
The `dataset/` directory serves as the input for the translation experiments. It contains two benchmark datasets with both source code and corresponding test cases:
```
dataset/
├── avatar/
│   ├── Java/
│   │   ├── Code/           # Java source code files
│   │   └── TestCases/      # Input/output test cases for Java code
│   └── Python/
│       ├── Code/           # Python source code files
│       └── TestCases/      # Input/output test cases for Python code
└── codenet/
    ├── Java/
    │   ├── Code/           # Java source code files
    │   └── TestCases/      # Input/output test cases for Java code
    └── Python/
        ├── Code/           # Python source code files
        └── TestCases/      # Input/output test cases for Python code
```
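As a quick illustration, the sketch below walks this layout to pair each source file with its test-case directory. The directory names come from the tree above; how individual test cases map to programs depends on the benchmark's naming scheme, which is not reproduced here:

```python
from pathlib import Path

DATASET_ROOT = Path("dataset")

def iter_programs(benchmark: str, language: str):
    """Yield (source_file, testcases_dir) pairs for one benchmark split.

    benchmark: 'avatar' or 'codenet'; language: 'Java' or 'Python'.
    """
    code_dir = DATASET_ROOT / benchmark / language / "Code"
    tests_dir = DATASET_ROOT / benchmark / language / "TestCases"
    pattern = "*.java" if language == "Java" else "*.py"
    for src in sorted(code_dir.glob(pattern)):
        # How test cases map to programs (shared folder vs. per-program
        # files) depends on the benchmark's naming scheme.
        yield src, tests_dir

for src, tests in iter_programs("avatar", "Java"):
    print(src.name, "->", tests)
```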
The `output/` directory contains the translated code generated by each LLM under both translation approaches:
```
output/
├── deepseek_v3/
│   ├── avatar/
│   │   ├── algo_based_translation/
│   │   │   ├── java/
│   │   │   │   ├── algorithm/   # Extracted algorithms from Java code (Phase 1)
│   │   │   │   └── python/      # Java→Python translations (algorithm-based, Phase 2)
│   │   │   └── python/
│   │   │       ├── algorithm/   # Extracted algorithms from Python code (Phase 1)
│   │   │       └── java/        # Python→Java translations (algorithm-based, Phase 2)
│   │   └── direct_translation/
│   │       ├── java/
│   │       │   └── python/      # Java→Python translations (direct)
│   │       └── python/
│   │           └── java/        # Python→Java translations (direct)
│   └── codenet/
│       ├── algo_based_translation/
│       │   ├── java/
│       │   │   ├── algorithm/   # Extracted algorithms from Java code (Phase 1)
│       │   │   └── python/      # Java→Python translations (algorithm-based, Phase 2)
│       │   └── python/
│       │       ├── algorithm/   # Extracted algorithms from Python code (Phase 1)
│       │       └── java/        # Python→Java translations (algorithm-based, Phase 2)
│       └── direct_translation/
│           ├── java/
│           │   └── python/      # Java→Python translations (direct)
│           └── python/
│               └── java/        # Python→Java translations (direct)
├── deepseek_r1/
├── gpt_4o/
├── llama_4_maverick/
└── qwen_2.5_72b_instruct/
```
Note: All model directories follow the same hierarchical structure as shown above for `deepseek_v3`.
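For orientation, a small helper like the following (hypothetical, not part of the repository) resolves the relevant directories for a given model, benchmark, and approach under this layout:

```python
from pathlib import Path

OUTPUT_ROOT = Path("output")

def translation_dir(model, benchmark, approach, source_lang, target_lang):
    """Directory holding translated programs, e.g.
    output/deepseek_v3/avatar/algo_based_translation/java/python"""
    return OUTPUT_ROOT / model / benchmark / approach / source_lang / target_lang

def algorithm_dir(model, benchmark, source_lang):
    """Phase-1 algorithm descriptions extracted from the source language."""
    return (OUTPUT_ROOT / model / benchmark / "algo_based_translation"
            / source_lang / "algorithm")

print(translation_dir("deepseek_v3", "avatar", "algo_based_translation",
                      "java", "python"))
print(algorithm_dir("deepseek_v3", "avatar", "java"))
```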
The `reports/` directory contains detailed evaluation results organized by model and dataset. These reports are directly linked to specific research questions:
```
reports/
├── rq2_error_taxonomies.txt           # RQ2 - Error taxonomy derived from compile-time and runtime errors
├── rq3_error_taxonomy_comparison.txt  # RQ3 - Quantitative comparison of error frequencies based on the RQ2 taxonomy
├── deepseek_chat/
│   ├── avatar/
│   │   ├── {source}_to_{target}_for_direct_translation.txt                          # RQ1
│   │   ├── {source}_to_{target}_for_algo_based_translation.txt                      # RQ1
│   │   ├── {source}_to_{target}_compile_error_report_for_direct_translation.csv     # RQ2
│   │   ├── {source}_to_{target}_compile_error_report_for_algo_based_translation.csv # RQ2
│   │   ├── {source}_to_{target}_runtime_error_report_for_direct_translation.csv     # RQ2
│   │   ├── {source}_to_{target}_runtime_error_report_for_algo_based_translation.csv # RQ2
│   │   ├── {source}_to_{target}_test_fail_report_for_direct_translation.csv         # RQ2
│   │   ├── {source}_to_{target}_test_fail_report_for_algo_based_translation.csv     # RQ2
│   │   └── {source}_to_{target}_infinite_loop_report_for_algo_based_translation.csv # RQ2
│   └── codenet/
│       └── [same structure as avatar]
├── deepseek_r1/             [same structure as deepseek_chat]
├── gpt_4o/                  [same structure as deepseek_chat]
├── llama_4_maverick/        [same structure as deepseek_chat]
└── qwen_2.5_72b_instruct/   [same structure as deepseek_chat]
```
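The report file names follow a fixed pattern, so a path can be assembled mechanically. The helper below (hypothetical, for illustration) mirrors the naming scheme above:

```python
from pathlib import Path

REPORTS_ROOT = Path("reports")

def report_path(model, benchmark, source, target, approach, kind=None):
    """Resolve one report file.

    approach: 'direct_translation' or 'algo_based_translation'.
    kind: None for the RQ1 accuracy report (.txt), or one of
    'compile_error_report', 'runtime_error_report', 'test_fail_report',
    'infinite_loop_report' for the RQ2 CSVs.
    """
    if kind is None:
        name = f"{source}_to_{target}_for_{approach}.txt"
    else:
        name = f"{source}_to_{target}_{kind}_for_{approach}.csv"
    return REPORTS_ROOT / model / benchmark / name

# reports/deepseek_chat/avatar/python_to_java_compile_error_report_for_direct_translation.csv
print(report_path("deepseek_chat", "avatar", "python", "java",
                  "direct_translation", "compile_error_report"))
```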
Files:

- `{source}_to_{target}_for_direct_translation.txt`
- `{source}_to_{target}_for_algo_based_translation.txt`

RQ1 evaluates the overall effectiveness and accuracy of each LLM across both translation directions.
Examples:
- `reports/deepseek_r1/avatar/python_to_java_for_direct_translation.txt` - Overall accuracy of DeepSeek-R1 on the Avatar dataset for Python→Java direct translation.
- `reports/gpt_4o/codenet/java_to_python_for_algo_based_translation.txt` - Overall accuracy of GPT-4o on the CodeNet dataset for Java→Python algorithm-based translation.
Files:
- `rq2_error_taxonomies.txt` - Comprehensive error taxonomy categorizing compile-time and runtime errors for both Python→Java and Java→Python translations
- `{source}_to_{target}_compile_error_report_for_*.csv` - Compilation error reports
- `{source}_to_{target}_runtime_error_report_for_*.csv` - Runtime error reports
RQ2 analyzes all compile-time and runtime errors across both translation directions to create a structured error taxonomy. The taxonomy categorizes error subtypes specific to each language direction.
Examples:
- `reports/rq2_error_taxonomies.txt` - Complete error taxonomy with language-specific error mappings
- `reports/deepseek_chat/avatar/python_to_java_compile_error_report_for_direct_translation.csv` - Compilation errors in DeepSeek-V3's Python→Java direct translations on the Avatar dataset
- `reports/qwen_2.5_72b_instruct/codenet/java_to_python_runtime_error_report_for_algo_based_translation.csv` - Runtime errors in Qwen2.5's Java→Python algorithm-based translations on the CodeNet dataset
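Since the CSV schemas are not documented here, a safe first step is simply to load a report and inspect its columns, e.g.:

```python
import pandas as pd

# Load one RQ2 error report and inspect its structure before analysis.
df = pd.read_csv(
    "reports/deepseek_chat/avatar/"
    "python_to_java_compile_error_report_for_direct_translation.csv"
)
print(df.shape)
print(list(df.columns))
```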
Files:
- `rq3_error_taxonomy_comparison.txt` - Quantitative comparison of error frequencies across taxonomy categories, showing direct vs. algorithm-based approach distributions
- All error report files (compile, runtime, test fail, infinite loop) comparing direct vs. algorithm-based translation approaches
RQ3 investigates how the algorithm-based approach reduces specific error patterns compared to direct translation. The analysis uses the error taxonomy from RQ2 to quantify error reduction rates across the different categories.
Examples:
- `reports/rq3_error_taxonomy_comparison.txt` - Complete quantitative breakdown of error distributions across all taxonomy categories
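A rough way to reproduce such a comparison from the raw reports is to count entries in matching direct vs. algorithm-based CSVs. This sketch assumes each CSV row records one error instance, which should be verified against the actual files:

```python
import pandas as pd
from pathlib import Path

# Compare error volume between the two approaches for one
# model/dataset/direction. Assumption: one CSV row == one error instance.
base = Path("reports/deepseek_chat/avatar")
for approach in ("direct_translation", "algo_based_translation"):
    csv = base / f"python_to_java_compile_error_report_for_{approach}.csv"
    print(f"{approach}: {len(pd.read_csv(csv))} compile errors")
```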