> [!NOTE]
> SALT is our submission to SemEval-2025 Task 2 on Entity-Aware Machine Translation, achieving top performance among systems that do not use gold entity information.
SALT employs a two-stage approach:
- Entity Retrieval: Normalized string matching with SQL for efficient candidate generation, scoring entities by length and relevance (see the sketch below)
- Knowledge Integration: Augmenting the input with entity-translation pairs and biasing logits during beam search (sketched after the example below)
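A minimal sketch of the retrieval stage: normalized substring matching against a SQLite entity table, preferring longer (more specific) matches. The table name `entities`, its columns, and the scoring heuristic are assumptions for this sketch, not SALT's exact schema:

```python
import sqlite3
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace so surface variants match."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return " ".join(text.lower().split())

def retrieve_candidates(query: str, db_path: str, top_k: int = 1) -> list[tuple[str, str]]:
    """Return up to top_k (label, translation) pairs whose normalized label
    occurs in the normalized query, longest match first."""
    q = normalize(query)
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT label, translation FROM entities "
        "WHERE instr(?, label_normalized) > 0 "    # substring match on normalized labels
        "ORDER BY length(label_normalized) DESC "  # longer matches rank higher
        "LIMIT ?",
        (q, top_k),
    ).fetchall()
    con.close()
    return rows
```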
Together, these stages produce the following input and output:

```
Input:     What year did Roald Dahl release the novel The Witches?
Augmented: What year did Roald Dahl release the novel The Witches? <meta> The Witches <translates_to> Hexen hexen
Output:    In welchem Jahr veröffentlichte Roald Dahl den Roman Hexen hexen?
```
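The integration stage can be sketched with Hugging Face Transformers, whose `generate()` accepts a `sequence_bias` dict mapping token-id tuples to an additive logit bias applied during beam search. The checkpoint and the `<meta>`/`<translates_to>` markers below are placeholders, not SALT's actual fine-tuned model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder en->de checkpoint; SALT fine-tunes its own translation model.
tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

source = "What year did Roald Dahl release the novel The Witches?"
entity, translation = "The Witches", "Hexen hexen"

# Stage 2a: augment the source with the retrieved entity-translation pair.
augmented = f"{source} <meta> {entity} <translates_to> {translation}"

# Stage 2b: bias the target entity's token ids during beam search.
bias_ids = tuple(tok(translation, add_special_tokens=False).input_ids)
inputs = tok(augmented, return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=4,
    sequence_bias={bias_ids: 5.0},  # weight corresponds to model.sequence_bias_weight
)
print(tok.decode(out[0], skip_special_tokens=True))
```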
Before running the code, you'll need to build the Wikidata database:

- Download the latest Wikidata dump (~130GB)
- Process it using the provided script in `/data/wikidata/`
> [!IMPORTANT]
> See `/data/wikidata/README.md` for detailed instructions. The final database is ~10GB and is not included in this repository.
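Purely for orientation, the build step produces roughly this kind of index: one row per entity with a source label, a normalized key for matching, and the target-language label. Paths, JSON field names, and the en→de pair are assumptions for this sketch; follow `/data/wikidata/README.md` for the actual procedure:

```python
import gzip
import json
import sqlite3

# Hypothetical paths and schema; the real build script lives in /data/wikidata/.
con = sqlite3.connect("wikidata_entities.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS entities "
    "(qid TEXT, label TEXT, label_normalized TEXT, translation TEXT)"
)

with gzip.open("latest-all.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip(",\n")  # the dump is one large JSON array, one item per line
        if not line or line in ("[", "]"):
            continue
        item = json.loads(line)
        labels = item.get("labels", {})
        en, de = labels.get("en"), labels.get("de")
        if en and de:
            con.execute(
                "INSERT INTO entities VALUES (?, ?, ?, ?)",
                (item["id"], en["value"], en["value"].lower(), de["value"]),
            )
con.commit()
```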
SALT uses uv as the package manager and Hydra for configuration management:
```bash
# Setup environment (requires uv)
uv sync

# Run training
uv run src/translation_train.py

# Override specific parameters
uv run src/translation_train.py model.use_pointer_generator=true model.sequence_bias_weight=3.0
```

| Parameter | Description | Default |
|---|---|---|
| `model.use_pointer_generator` | Enable pointer generator mechanism | `false` |
| `model.use_sequence_bias` | Apply logit biasing for entity constraints | `true` |
| `model.sequence_bias_weight` | Weight for logit biasing | `5.0` |
| `dataset.retrieval_top_k` | Number of entity candidates to retrieve | `1` |
| `training.max_epochs` | Maximum training epochs | `10` |
| `training.batch_size` | Batch size for training | `16` |
> [!NOTE]
> See `/src/conf/translation_config.yaml` for all configuration options.
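For context, a Hydra entry point typically looks like the sketch below: CLI overrides such as `model.sequence_bias_weight=3.0` are merged into the YAML config before training starts. The config path and field names are assumed from the layout above:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="translation_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg reflects translation_config.yaml plus any CLI overrides,
    # e.g. `uv run src/translation_train.py model.sequence_bias_weight=3.0`.
    print(cfg.model.sequence_bias_weight)

if __name__ == "__main__":
    main()
```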
SALT achieves an M-ETA score of 71.7% and a COMET score of 92.5 (Harmonic Mean: 80.8), outperforming all other approaches not using gold-standard entity IDs during inference.
Our ablation studies show several key insights:
- Simple SQL-based retrieval is competitive with complex neural retrieval methods
- Strategic model refinement outperforms increased model complexity
- Providing only the top-1 candidate entity yields better results than multiple candidates
If you use this code or our approach in your research, please cite our paper:
```bibtex
@inproceedings{volker-etal-2025-salt,
    title = "{SALT} at {S}em{E}val-2025 Task 2: A {SQL}-based Approach for {LLM}-Free Entity-Aware-Translation",
    author = {V{\"o}lker, Tom and
      Pfister, Jan and
      Hotho, Andreas},
    editor = "Rosenthal, Sara and
      Ros{\'a}, Aiala and
      Ghosh, Debanjan and
      Zampieri, Marcos",
    booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.semeval-1.117/",
    pages = "852--864",
    ISBN = "979-8-89176-273-2",
    abstract = "Entity-aware machine translation faces significant challenges when translating culturally-adapted named entities that require knowledge beyond the source text. We present SALT (SQL-based Approach for LLM-Free Entity-Aware-Translation), a parameter-efficient system for the SemEval-2025 Task 2. Our approach combines SQL-based entity retrieval with constrained neural translation via logit biasing and explicit entity annotations. Despite its simplicity, it achieves state-of-the-art performance (First Place) among approaches not using gold-standard data, while requiring far less computation than LLM-based methods. Our ablation studies show simple SQL-based retrieval rivals complex neural models, and strategic model refinement outperforms increased model complexity. SALT offers an alternative to resource-intensive LLM-based approaches, achieving comparable results with only a fraction of the parameters."
}
```