Comprehensive evaluation of 5 state-of-the-art LLMs (DeepSeek V3, DeepSeek R1, GPT-4o, Llama 4 Maverick, Qwen2.5) on bidirectional Java↔Python translation across Avatar and CodeNet benchmarks, comparing direct vs. algorithm-based approaches with detailed error taxonomy analysis.

sdipto7/llm-code-translation

Algorithm-Based Pipeline for Reliable and Intent-Preserving Code Translation with LLMs

This repository contains the replication package for the research paper "Algorithm-Based Pipeline for Reliable and Intent-Preserving Code Translation with LLMs", submitted to the 34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026).

LLM Models Evaluated

This study evaluates five state-of-the-art large language models:

  1. DeepSeek-V3 (deepseek_chat) - large Mixture-of-Experts model with state-of-the-art open-source results on code and math
  2. DeepSeek-R1 (deepseek_r1) - reasoning-oriented model trained with supervised fine-tuning and multi-stage RL
  3. GPT-4o (gpt_4o) - OpenAI's proprietary model with strong zero-shot software-engineering performance
  4. Llama-4-Maverick (llama_4_maverick) - Meta's versatile, general-purpose open-weight model
  5. Qwen2.5-72B-Instruct (qwen_2.5_72b_instruct) - high-capacity dense model with strong code and math skills and long-context support

All models were accessed via the OpenRouter API for consistent evaluation.
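
OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so one request helper can serve all five models. A minimal sketch of such a helper (the prompt wording and the OPENROUTER_API_KEY environment variable are illustrative assumptions, not the exact setup used in the study):

```python
import json
import os
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(model: str, source_code: str, source_lang: str, target_lang: str) -> dict:
    """Build an OpenAI-style chat payload asking `model` for a translation."""
    prompt = (
        f"Translate the following {source_lang} program to {target_lang}. "
        f"Return only the translated code.\n\n{source_code}"
    )
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def translate(model: str, source_code: str, source_lang: str, target_lang: str) -> str:
    """POST the translation request to OpenRouter and return the model's reply."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(model, source_code, source_lang, target_lang)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because every model goes through the same endpoint, switching models is just a change of the `model` string.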

Translation Directions

The research evaluates bidirectional translation:

  • Java → Python (java_to_python)
  • Python → Java (python_to_java)

Quick Start

See QUICKSTART.md for detailed setup and usage instructions.

Replication Package Structure

1. Dataset (Input)

The dataset/ directory serves as the input for the translation experiments. It contains two benchmark datasets with both source code and corresponding test cases:

dataset/
├── avatar/
│   ├── Java/
│   │   ├── Code/          # Java source code files
│   │   └── TestCases/     # Input/output test cases for Java code
│   └── Python/
│       ├── Code/          # Python source code files
│       └── TestCases/     # Input/output test cases for Python code
└── codenet/
    ├── Java/
    │   ├── Code/          # Java source code files
    │   └── TestCases/     # Input/output test cases for Java code
    └── Python/
        ├── Code/          # Python source code files
        └── TestCases/     # Input/output test cases for Python code
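
Given this layout, pairing each source file with its test cases is a short directory walk. A hedged sketch (the assumption that test-case files share the problem's filename stem is illustrative, not a documented property of the benchmarks):

```python
from pathlib import Path

def iter_problems(dataset_root: str, language: str):
    """Yield (code_file, testcase_files) pairs for one benchmark/language.

    Assumes each file under Code/ has matching test cases under TestCases/
    named with the same problem stem.
    """
    code_dir = Path(dataset_root) / language / "Code"
    test_dir = Path(dataset_root) / language / "TestCases"
    for code_file in sorted(code_dir.iterdir()):
        matches = sorted(test_dir.glob(code_file.stem + "*"))
        if matches:
            yield code_file, matches
```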

2. Output (Translation Results)

The output/ directory contains the translated code generated by different LLM models using both translation approaches:

output/
├── deepseek_v3/
│   ├── avatar/
│   │   ├── algo_based_translation/
│   │   │   ├── java/
│   │   │   │   ├── algorithm/       # Extracted algorithms from Java code (Phase 1)
│   │   │   │   └── python/          # Java→Python translations (algorithm-based, Phase 2)
│   │   │   └── python/
│   │   │       ├── algorithm/       # Extracted algorithms from Python code (Phase 1)
│   │   │       └── java/            # Python→Java translations (algorithm-based, Phase 2)
│   │   └── direct_translation/  
│   │       ├── java/
│   │       │   └── python/          # Java→Python translations (direct)
│   │       └── python/
│   │           └── java/            # Python→Java translations (direct)
│   └── codenet/
│       ├── algo_based_translation/
│       │   ├── java/
│       │   │   ├── algorithm/       # Extracted algorithms from Java code (Phase 1)
│       │   │   └── python/          # Java→Python translations (algorithm-based, Phase 2)
│       │   └── python/
│       │       ├── algorithm/       # Extracted algorithms from Python code (Phase 1)
│       │       └── java/            # Python→Java translations (algorithm-based, Phase 2)
│       └── direct_translation/
│           ├── java/
│           │   └── python/
│           └── python/
│               └── java/
├── deepseek_r1/
├── gpt_4o/
├── llama_4_maverick/
└── qwen_2.5_72b_instruct/

Note: All model directories follow the same hierarchical structure as shown above for deepseek_v3.
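
The directory comments above reflect the two phases of the algorithm-based approach: Phase 1 extracts a language-agnostic algorithm description from the source program, and Phase 2 generates target-language code from that description. A minimal sketch of the control flow (the prompt texts and the `llm` callable are illustrative assumptions, not the paper's exact prompts):

```python
def algo_based_translate(source_code: str, source_lang: str, target_lang: str, llm) -> tuple:
    """Two-phase translation: source code -> algorithm -> target code.

    `llm` is any callable mapping a prompt string to a completion string.
    Returns (algorithm, translation) so both artifacts can be saved,
    mirroring the algorithm/ and target-language output directories.
    """
    # Phase 1: extract an intent-preserving, language-agnostic algorithm.
    algorithm = llm(
        f"Describe, step by step, the algorithm implemented by this "
        f"{source_lang} program:\n\n{source_code}"
    )
    # Phase 2: regenerate the program in the target language from the algorithm.
    translation = llm(
        f"Write a {target_lang} program that implements this algorithm:\n\n{algorithm}"
    )
    return algorithm, translation
```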

3. Reports (Evaluation Results)

The reports/ directory contains detailed evaluation results organized by model and dataset. These reports are directly linked to specific research questions:

reports/
├── rq2_error_taxonomies.txt            # RQ2 - Error taxonomy derived from compile-time and runtime errors
├── rq3_error_taxonomy_comparison.txt   # RQ3 - Quantitative comparison of error frequencies across the taxonomy categories derived in RQ2
├── deepseek_chat/
│   ├── avatar/
│   │   ├── {source}_to_{target}_for_direct_translation.txt                           # RQ1
│   │   ├── {source}_to_{target}_for_algo_based_translation.txt                       # RQ1
│   │   ├── {source}_to_{target}_compile_error_report_for_direct_translation.csv      # RQ2
│   │   ├── {source}_to_{target}_compile_error_report_for_algo_based_translation.csv  # RQ2
│   │   ├── {source}_to_{target}_runtime_error_report_for_direct_translation.csv      # RQ2
│   │   ├── {source}_to_{target}_runtime_error_report_for_algo_based_translation.csv  # RQ2
│   │   ├── {source}_to_{target}_test_fail_report_for_direct_translation.csv          # RQ2
│   │   ├── {source}_to_{target}_test_fail_report_for_algo_based_translation.csv      # RQ2
│   │   └── {source}_to_{target}_infinite_loop_report_for_algo_based_translation.csv  # RQ2
│   └── codenet/
│       └── [same structure as avatar]
├── deepseek_r1/       [same structure as deepseek_chat]
├── gpt_4o/            [same structure as deepseek_chat]
├── llama_4_maverick/  [same structure as deepseek_chat]
└── qwen_2.5_72b_instruct/  [same structure as deepseek_chat]

RQ1: Effective Performance of Algorithm-based Approach

Files: {source}_to_{target}_for_direct_translation.txt and {source}_to_{target}_for_algo_based_translation.txt

This question evaluates the overall effectiveness and accuracy of each LLM in both translation directions, under both the direct and the algorithm-based approach.

Examples:

  • reports/deepseek_r1/avatar/python_to_java_for_direct_translation.txt - Overall accuracy of DeepSeek R1 on Avatar dataset for Python→Java direct translation.
  • reports/gpt_4o/codenet/java_to_python_for_algo_based_translation.txt - Overall accuracy of GPT-4o on CodeNet dataset for Java→Python algorithm-based translation.
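
Overall accuracy here means the fraction of translated programs whose outputs match the expected outputs on all test cases. A hedged sketch of the matching and aggregation step (the whitespace normalization is an assumption about how outputs are compared):

```python
def outputs_match(actual: str, expected: str) -> bool:
    """Compare program output against an expected test-case output,
    ignoring trailing whitespace and surrounding blank lines."""
    norm = lambda s: [line.rstrip() for line in s.strip().splitlines()]
    return norm(actual) == norm(expected)

def accuracy(results) -> float:
    """results: iterable of (actual, expected) pairs, one per problem."""
    results = list(results)
    passed = sum(outputs_match(a, e) for a, e in results)
    return passed / len(results) if results else 0.0
```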

RQ2: Error Taxonomy and Frequency Distribution Observed in Each Combination

Files:

  • rq2_error_taxonomies.txt - Comprehensive error taxonomy categorizing compile-time and runtime errors for both Python→Java and Java→Python translations
  • {source}_to_{target}_compile_error_report_for_*.csv - Compilation error reports
  • {source}_to_{target}_runtime_error_report_for_*.csv - Runtime error reports

This question analyzes all compile-time and runtime errors across both translation directions to create a structured error taxonomy. The taxonomy categorizes error subtypes specific to each language direction.

Examples:

  • reports/rq2_error_taxonomies.txt - Complete error taxonomy with language-specific error mappings
  • reports/deepseek_chat/avatar/python_to_java_compile_error_report_for_direct_translation.csv - Compilation errors in DeepSeek V3's Python→Java direct translations on Avatar dataset
  • reports/qwen_2.5_72b_instruct/codenet/java_to_python_runtime_error_report_for_algo_based_translation.csv - Runtime errors in Qwen 2.5's Java→Python algorithm-based translations on CodeNet dataset
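
The per-model CSV reports can be aggregated into the taxonomy's frequency distribution with a simple counter. A sketch under the assumption that each report has one row per error with a category column (the column name `error_type` is hypothetical):

```python
import csv
from collections import Counter

def error_frequencies(report_paths, column="error_type") -> Counter:
    """Tally error categories across one or more error-report CSVs."""
    counts = Counter()
    for path in report_paths:
        with open(path, newline="") as fh:
            for row in csv.DictReader(fh):
                counts[row[column]] += 1
    return counts
```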

RQ3: Scenario and Pattern-Specific Error Reduction by Algorithm-based Approach

Files:

  • rq3_error_taxonomy_comparison.txt - Quantitative comparison of error frequencies across taxonomy categories, showing direct vs. algorithm-based approach distributions
  • All error report files (compile, runtime, test fail, infinite loop) comparing direct vs. algorithm-based translation approaches

This question investigates how the algorithm-based approach reduces specific error patterns compared to direct translation. The analysis uses the error taxonomy from RQ2 to quantify error reduction rates across different categories.

Examples:

  • reports/rq3_error_taxonomy_comparison.txt - Complete quantitative breakdown of error distributions across all taxonomy categories
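
Per-category reduction rates follow directly from the two frequency distributions: reduction = (direct - algo) / direct. A minimal sketch:

```python
def reduction_rates(direct: dict, algo: dict) -> dict:
    """Per-category error reduction of the algorithm-based approach,
    expressed as a fraction of the direct-translation error count."""
    rates = {}
    for category, d in direct.items():
        a = algo.get(category, 0)
        rates[category] = (d - a) / d if d else 0.0
    return rates
```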
