Comprehensive evaluation of 5 state-of-the-art LLMs (DeepSeek V3, DeepSeek R1, GPT-4o, Llama 4 Maverick, Qwen2.5) on bidirectional Java↔Python translation across Avatar and CodeNet benchmarks, comparing direct vs. algorithm-based approaches with detailed error taxonomy analysis.

sdipto7/llm-code-translation

Algorithm-Based Pipeline for Reliable and Intent-Preserving Code Translation with LLMs

This repository contains the replication package for the research paper "Algorithm-Based Pipeline for Reliable and Intent-Preserving Code Translation with LLMs", submitted to the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026).

LLM Models Evaluated

This study evaluates five state-of-the-art large language models:

  1. DeepSeek-V3 (deepseek_chat) - large Mixture-of-Experts model with state-of-the-art open-source results on code and math
  2. DeepSeek-R1 (deepseek_r1) - reasoning-oriented model trained with supervised fine-tuning and multi-stage RL
  3. GPT-4o (gpt_4o) - OpenAI's proprietary model with strong zero-shot software-engineering performance
  4. Llama-4-Maverick (llama_4_maverick) - Meta's versatile, general-purpose open-weight model
  5. Qwen2.5-72B-Instruct (qwen_2.5_72b_instruct) - high-capacity dense model with strong code and math skills and long-context support

All models were accessed via the OpenRouter API for consistent evaluation.
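
OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so one request helper can serve all five models. A minimal sketch of such a helper (the prompt wording and the OPENROUTER_API_KEY environment variable are illustrative assumptions, not the exact setup used in the study):

```python
import json
import os
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(model: str, source_code: str, source_lang: str, target_lang: str) -> dict:
    """Build an OpenAI-style chat payload asking `model` for a translation."""
    prompt = (
        f"Translate the following {source_lang} program to {target_lang}. "
        f"Return only the translated code.\n\n{source_code}"
    )
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def translate(model: str, source_code: str, source_lang: str, target_lang: str) -> str:
    """POST the translation request to OpenRouter and return the model's reply."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(model, source_code, source_lang, target_lang)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because every model goes through the same endpoint, switching models is just a change of the `model` string.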

Translation Directions

The research evaluates bidirectional translation:

  • Java → Python (java_to_python)
  • Python → Java (python_to_java)

Quick Start

See QUICKSTART.md for detailed setup and usage instructions.

Replication Package Structure

1. Dataset (Input)

The dataset/ directory serves as the input for the translation experiments. It contains two benchmark datasets with both source code and corresponding test cases:

dataset/
├── avatar/
│   ├── Java/
│   │   ├── Code/          # Java source code files
│   │   └── TestCases/     # Input/output test cases for Java code
│   └── Python/
│       ├── Code/          # Python source code files
│       └── TestCases/     # Input/output test cases for Python code
└── codenet/
    ├── Java/
    │   ├── Code/          # Java source code files
    │   └── TestCases/     # Input/output test cases for Java code
    └── Python/
        ├── Code/          # Python source code files
        └── TestCases/     # Input/output test cases for Python code
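
Given this layout, pairing each source file with its test cases is a short directory walk. A hedged sketch (the assumption that test-case files share the problem's filename stem is illustrative, not a documented property of the benchmarks):

```python
from pathlib import Path

def iter_problems(dataset_root: str, language: str):
    """Yield (code_file, testcase_files) pairs for one benchmark/language.

    Assumes each file under Code/ has matching test cases under TestCases/
    named with the same problem stem.
    """
    code_dir = Path(dataset_root) / language / "Code"
    test_dir = Path(dataset_root) / language / "TestCases"
    for code_file in sorted(code_dir.iterdir()):
        matches = sorted(test_dir.glob(code_file.stem + "*"))
        if matches:
            yield code_file, matches
```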

2. Output (Translation Results)

The output/ directory contains the translated code generated by different LLM models using both translation approaches:

output/
├── deepseek_v3/
│   ├── avatar/
│   │   ├── algo_based_translation/
│   │   │   ├── java/
│   │   │   │   ├── algorithm/       # Extracted algorithms from Java code (Phase 1)
│   │   │   │   └── python/          # Java→Python translations (algorithm-based, Phase 2)
│   │   │   └── python/
│   │   │       ├── algorithm/       # Extracted algorithms from Python code (Phase 1)
│   │   │       └── java/            # Python→Java translations (algorithm-based, Phase 2)
│   │   └── direct_translation/  
│   │       ├── java/
│   │       │   └── python/          # Java→Python translations (direct)
│   │       └── python/
│   │           └── java/            # Python→Java translations (direct)
│   └── codenet/
│       ├── algo_based_translation/
│       │   ├── java/
│       │   │   ├── algorithm/       # Extracted algorithms from Java code (Phase 1)
│       │   │   └── python/          # Java→Python translations (algorithm-based, Phase 2)
│       │   └── python/
│       │       ├── algorithm/       # Extracted algorithms from Python code (Phase 1)
│       │       └── java/            # Python→Java translations (algorithm-based, Phase 2)
│       └── direct_translation/
│           ├── java/
│           │   └── python/
│           └── python/
│               └── java/
├── deepseek_r1/
├── gpt_4o/
├── llama_4_maverick/
└── qwen_2.5_72b_instruct/

Note: All model directories follow the same hierarchical structure as shown above for deepseek_v3.
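
The directory comments above reflect the two phases of the algorithm-based approach: Phase 1 extracts a language-agnostic algorithm description from the source program, and Phase 2 generates target-language code from that description. A minimal sketch of the control flow (the prompt texts and the `llm` callable are illustrative assumptions, not the paper's exact prompts):

```python
def algo_based_translate(source_code: str, source_lang: str, target_lang: str, llm) -> tuple:
    """Two-phase translation: source code -> algorithm -> target code.

    `llm` is any callable mapping a prompt string to a completion string.
    Returns (algorithm, translation) so both artifacts can be saved,
    mirroring the algorithm/ and target-language output directories.
    """
    # Phase 1: extract an intent-preserving, language-agnostic algorithm.
    algorithm = llm(
        f"Describe, step by step, the algorithm implemented by this "
        f"{source_lang} program:\n\n{source_code}"
    )
    # Phase 2: regenerate the program in the target language from the algorithm.
    translation = llm(
        f"Write a {target_lang} program that implements this algorithm:\n\n{algorithm}"
    )
    return algorithm, translation
```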

3. Reports (Evaluation Results)

The reports/ directory contains detailed evaluation results organized by model and dataset. These reports are directly linked to specific research questions:

reports/
├── rq2_error_taxonomies.txt            # RQ2 - Error taxonomy derived from compile-time and runtime errors
├── rq3_error_taxonomy_comparison.txt   # RQ3 - Quantitative comparison of error frequencies across the taxonomy categories derived in RQ2
├── deepseek_chat/
│   ├── avatar/
│   │   ├── {source}_to_{target}_for_direct_translation.txt                           # RQ1
│   │   ├── {source}_to_{target}_for_algo_based_translation.txt                       # RQ1
│   │   ├── {source}_to_{target}_compile_error_report_for_direct_translation.csv      # RQ2
│   │   ├── {source}_to_{target}_compile_error_report_for_algo_based_translation.csv  # RQ2
│   │   ├── {source}_to_{target}_runtime_error_report_for_direct_translation.csv      # RQ2
│   │   ├── {source}_to_{target}_runtime_error_report_for_algo_based_translation.csv  # RQ2
│   │   ├── {source}_to_{target}_test_fail_report_for_direct_translation.csv          # RQ2
│   │   ├── {source}_to_{target}_test_fail_report_for_algo_based_translation.csv      # RQ2
│   │   └── {source}_to_{target}_infinite_loop_report_for_algo_based_translation.csv  # RQ2
│   └── codenet/
│       └── [same structure as avatar]
├── deepseek_r1/       [same structure as deepseek_chat]
├── gpt_4o/            [same structure as deepseek_chat]
├── llama_4_maverick/  [same structure as deepseek_chat]
└── qwen_2.5_72b_instruct/  [same structure as deepseek_chat]

RQ1: Effective Performance of Algorithm-based Approach

Files: {source}_to_{target}_for_direct_translation.txt and {source}_to_{target}_for_algo_based_translation.txt

This question evaluates the overall effectiveness and accuracy of each LLM in both translation directions, under both the direct and the algorithm-based approach.

Examples:

  • reports/deepseek_r1/avatar/python_to_java_for_direct_translation.txt - Overall accuracy of DeepSeek R1 on Avatar dataset for Python→Java direct translation.
  • reports/gpt_4o/codenet/java_to_python_for_algo_based_translation.txt - Overall accuracy of GPT-4o on CodeNet dataset for Java→Python algorithm-based translation.
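
Overall accuracy here means the fraction of translated programs whose outputs match the expected outputs on all test cases. A hedged sketch of the matching and aggregation step (the whitespace normalization is an assumption about how outputs are compared):

```python
def outputs_match(actual: str, expected: str) -> bool:
    """Compare program output against an expected test-case output,
    ignoring trailing whitespace and surrounding blank lines."""
    norm = lambda s: [line.rstrip() for line in s.strip().splitlines()]
    return norm(actual) == norm(expected)

def accuracy(results) -> float:
    """results: iterable of (actual, expected) pairs, one per problem."""
    results = list(results)
    passed = sum(outputs_match(a, e) for a, e in results)
    return passed / len(results) if results else 0.0
```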

RQ2: Error Taxonomy and Frequency Distribution Observed in Each Combination

Files:

  • rq2_error_taxonomies.txt - Comprehensive error taxonomy categorizing compile-time and runtime errors for both Python→Java and Java→Python translations
  • {source}_to_{target}_compile_error_report_for_*.csv - Compilation error reports
  • {source}_to_{target}_runtime_error_report_for_*.csv - Runtime error reports

This question analyzes all compile-time and runtime errors across both translation directions to create a structured error taxonomy. The taxonomy categorizes error subtypes specific to each language direction.

Examples:

  • reports/rq2_error_taxonomies.txt - Complete error taxonomy with language-specific error mappings
  • reports/deepseek_chat/avatar/python_to_java_compile_error_report_for_direct_translation.csv - Compilation errors in DeepSeek V3's Python→Java direct translations on Avatar dataset
  • reports/qwen_2.5_72b_instruct/codenet/java_to_python_runtime_error_report_for_algo_based_translation.csv - Runtime errors in Qwen 2.5's Java→Python algorithm-based translations on CodeNet dataset
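
The per-model CSV reports can be aggregated into the taxonomy's frequency distribution with a simple counter. A sketch under the assumption that each report has one row per error with a category column (the column name `error_type` is hypothetical):

```python
import csv
from collections import Counter

def error_frequencies(report_paths, column="error_type") -> Counter:
    """Tally error categories across one or more error-report CSVs."""
    counts = Counter()
    for path in report_paths:
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                counts[row[column]] += 1
    return counts
```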

RQ3: Scenario and Pattern-Specific Error Reduction by Algorithm-based Approach

Files:

  • rq3_error_taxonomy_comparison.txt - Quantitative comparison of error frequencies across taxonomy categories, showing direct vs. algorithm-based approach distributions
  • All error report files (compile, runtime, test fail, infinite loop) comparing direct vs. algorithm-based translation approaches

This question investigates how the algorithm-based approach reduces specific error patterns compared to direct translation. The analysis uses the error taxonomy from RQ2 to quantify error reduction rates across different categories.

Examples:

  • reports/rq3_error_taxonomy_comparison.txt - Complete quantitative breakdown of error distributions across all taxonomy categories
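
Per-category reduction rates follow directly from the two frequency distributions: reduction = (direct - algo) / direct. A minimal sketch:

```python
def reduction_rates(direct: dict, algo: dict) -> dict:
    """Per-category error reduction of the algorithm-based approach,
    expressed as a fraction of the direct-translation error count."""
    rates = {}
    for category, d in direct.items():
        a = algo.get(category, 0)
        rates[category] = (d - a) / d if d else 0.0
    return rates
```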
