
SALT🧂: A SQL-based Approach for LLM-Free Entity-Aware-Translation

Note

SALT is our submission to the SemEval-2025 Task 2 on Entity-Aware Machine Translation at ACL 2025. It achieved top performance among systems that do not use gold entity information.

🧠 System Design

SALT employs a two-stage approach:

  1. Entity Retrieval: normalized string matching with SQL for efficient candidate generation, scoring entities by length and relevance (see the retrieval sketch below)
  2. Knowledge Integration: the input is augmented with entity-translation pairs, and logit biasing is applied during beam search (see the biasing sketch in the configuration section)

This yields the following input and output:

```text
Input:     "What year did Roald Dahl release the novel The Witches?"
Augmented: "What year did Roald Dahl release the novel The Witches? <meta> The Witches <translates_to> Hexen hexen"
Output:    "In welchem Jahr veröffentlichte Roald Dahl den Roman Hexen hexen?"
```
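
To make the retrieval stage concrete, here is a minimal sketch of normalized string matching against a SQLite label table. The schema, normalization, and scoring below are illustrative assumptions, not the repository's actual queries:

```python
import re
import sqlite3

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for matching."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def retrieve_candidates(conn: sqlite3.Connection, source: str, top_k: int = 1):
    """Return (label, translation) pairs whose normalized label occurs in the
    source sentence, preferring longer (more specific) matches.

    Assumes a table labels(label_norm TEXT, label TEXT, translation TEXT);
    a real system would use an index rather than this full scan.
    """
    padded = f" {normalize(source)} "  # pad so matches respect word boundaries
    rows = conn.execute(
        "SELECT label, translation, LENGTH(label_norm) AS score "
        "FROM labels "
        "WHERE INSTR(?, ' ' || label_norm || ' ') > 0 "
        "ORDER BY score DESC LIMIT ?",
        (padded, top_k),
    ).fetchall()
    return [(label, translation) for label, translation, _ in rows]
```

On the example above, the normalized label "the witches" would match the source sentence, so ("The Witches", "Hexen hexen") is returned and appended to the input as the <meta> annotation.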

📚 Prerequisite: Wikidata Database

Before running the code, you'll need to build the Wikidata database:

  1. Download the latest Wikidata dump (~130GB)
  2. Process it using the provided script in /data/wikidata/

Important

See /data/wikidata/README.md for detailed instructions. The final database is ~10GB and not included in this repository.
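
For orientation, processing a Wikidata JSON dump generally means streaming the compressed file line by line (one entity per line inside a large JSON array) and keeping only the labels you need. The following is a generic sketch under that assumption, not the script shipped in /data/wikidata/:

```python
import bz2
import json
import sqlite3

conn = sqlite3.connect("wikidata.db")
conn.execute("CREATE TABLE IF NOT EXISTS labels (qid TEXT, lang TEXT, label TEXT)")

# latest-all.json.bz2 is a JSON array with one entity per line; each line
# except the first and last is terminated by a comma.
with bz2.open("latest-all.json.bz2", "rt", encoding="utf-8") as dump:
    for line in dump:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue
        entity = json.loads(line)
        for lang in ("en", "de"):  # restrict to the language pair of interest
            label = entity.get("labels", {}).get(lang, {}).get("value")
            if label:
                conn.execute(
                    "INSERT INTO labels VALUES (?, ?, ?)",
                    (entity["id"], lang, label),
                )
conn.commit()
```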

🔧 Running the Code

SALT uses uv as the package manager and Hydra for configuration management:

```bash
# Set up the environment (requires uv)
uv sync

# Run training
uv run src/translation_train.py

# Override specific parameters
uv run src/translation_train.py model.use_pointer_generator=true model.sequence_bias_weight=3.0
```

⚙️ Key Configuration Options

| Parameter | Description | Default |
| --- | --- | --- |
| `model.use_pointer_generator` | Enable the pointer-generator mechanism | `false` |
| `model.use_sequence_bias` | Apply logit biasing for entity constraints | `true` |
| `model.sequence_bias_weight` | Weight for logit biasing | `5.0` |
| `dataset.retrieval_top_k` | Number of entity candidates to retrieve | `1` |
| `training.max_epochs` | Maximum training epochs | `10` |
| `training.batch_size` | Batch size for training | `16` |

Note

See /src/conf/translation_config.yaml for all configuration options.
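
To make the biasing stage concrete, here is a minimal sketch of applying an entity constraint during beam search via the `sequence_bias` argument of Hugging Face transformers' `generate`. The checkpoint and wiring are assumptions, not the repository's code (in particular, `<meta>` and `<translates_to>` would normally be registered as special tokens):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder checkpoint; the actual system is fine-tuned on the task data.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

source = "What year did Roald Dahl release the novel The Witches?"
entity, translation = "The Witches", "Hexen hexen"

# Stage 2a: augment the input with the retrieved entity-translation pair.
augmented = f"{source} <meta> {entity} <translates_to> {translation}"

# Stage 2b: bias the tokens of the target-side translation during beam search.
bias_ids = tuple(tokenizer(translation, add_special_tokens=False).input_ids)
outputs = model.generate(
    **tokenizer(augmented, return_tensors="pt"),
    num_beams=4,
    sequence_bias={bias_ids: 5.0},  # corresponds to model.sequence_bias_weight
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```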

📊 Results

SALT achieves an M-ETA score of 71.7% and a COMET score of 92.5 (Harmonic Mean: 80.8), outperforming all other approaches not using gold-standard entity IDs during inference.

Our ablation studies show several key insights:

  • Simple SQL-based retrieval performs competitively compared to complex neural methods
  • Strategic model refinement outperforms increased model complexity
  • Providing only the top-1 candidate entity yields better results than multiple candidates

📝 Citation

If you use this code or our approach in your research, please cite our paper:

```bibtex
@inproceedings{volker-etal-2025-salt,
    title = "{SALT} at {S}em{E}val-2025 Task 2: A {SQL}-based Approach for {LLM}-Free Entity-Aware-Translation",
    author = {V{\"o}lker, Tom  and
      Pfister, Jan  and
      Hotho, Andreas},
    editor = "Rosenthal, Sara  and
      Ros{\'a}, Aiala  and
      Ghosh, Debanjan  and
      Zampieri, Marcos",
    booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.semeval-1.117/",
    pages = "852--864",
    ISBN = "979-8-89176-273-2",
    abstract = "Entity-aware machine translation faces significant challenges when translating culturally-adapted named entities that require knowledge beyond the source text. We present SALT (SQL-based Approach for LLM-Free Entity-Aware-Translation), a parameter-efficient system for the SemEval-2025 Task 2. Our approach combines SQL-based entity retrieval with constrained neural translation via logit biasing and explicit entity annotations. Despite its simplicity, it achieves state-of-the-art performance (First Place) among approaches not using gold-standard data, while requiring far less computation than LLM-based methods. Our ablation studies show simple SQL-based retrieval rivals complex neural models, and strategic model refinement outperforms increased model complexity. SALT offers an alternative to resource-intensive LLM-based approaches, achieving comparable results with only a fraction of the parameters."
}
```
