> [!NOTE]
> SALT is our submission to SemEval-2025 Task 2 on Entity-Aware Machine Translation, achieving top performance among systems that do not use gold entity information.
SALT employs a two-stage approach:
- Entity Retrieval: Normalized string matching with SQL for efficient candidate generation, scoring entities by length and relevance (see the sketch below)
- Knowledge Integration: Augmenting the input with entity-translation pairs and biasing logits during beam search (sketched after the example below)
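A minimal sketch of the retrieval stage: normalized substring matching against a SQLite entity table, preferring longer (more specific) matches. The table name `entities`, its columns, and the scoring heuristic are assumptions for this sketch, not SALT's exact schema:

```python
import sqlite3
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace so surface variants match."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    return " ".join(text.lower().split())

def retrieve_candidates(query: str, db_path: str, top_k: int = 1) -> list[tuple[str, str]]:
    """Return up to top_k (label, translation) pairs whose normalized label
    occurs in the normalized query, longest match first."""
    q = normalize(query)
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT label, translation FROM entities "
        "WHERE instr(?, label_normalized) > 0 "    # substring match on normalized labels
        "ORDER BY length(label_normalized) DESC "  # longer matches rank higher
        "LIMIT ?",
        (q, top_k),
    ).fetchall()
    con.close()
    return rows
```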
Together, these stages produce the following input and output:

```
Input:     What year did Roald Dahl release the novel The Witches?
Augmented: What year did Roald Dahl release the novel The Witches? <meta> The Witches <translates_to> Hexen hexen
Output:    In welchem Jahr veröffentlichte Roald Dahl den Roman Hexen hexen?
```
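The integration stage can be sketched with Hugging Face Transformers, whose `generate()` accepts a `sequence_bias` dict mapping token-id tuples to an additive logit bias applied during beam search. The checkpoint and the `<meta>`/`<translates_to>` markers below are placeholders, not SALT's actual fine-tuned model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder en->de checkpoint; SALT fine-tunes its own translation model.
tok = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

source = "What year did Roald Dahl release the novel The Witches?"
entity, translation = "The Witches", "Hexen hexen"

# Stage 2a: augment the source with the retrieved entity-translation pair.
augmented = f"{source} <meta> {entity} <translates_to> {translation}"

# Stage 2b: bias the target entity's token ids during beam search.
bias_ids = tuple(tok(translation, add_special_tokens=False).input_ids)
inputs = tok(augmented, return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=4,
    sequence_bias={bias_ids: 5.0},  # weight corresponds to model.sequence_bias_weight
)
print(tok.decode(out[0], skip_special_tokens=True))
```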
Before running the code, you'll need to build the Wikidata database:

- Download the latest Wikidata dump (~130GB)
- Process it using the provided script in `/data/wikidata/`
> [!IMPORTANT]
> See `/data/wikidata/README.md` for detailed instructions. The final database is ~10GB and is not included in this repository.
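Purely for orientation, the build step produces roughly this kind of index: one row per entity with a source label, a normalized key for matching, and the target-language label. Paths, JSON field names, and the en→de pair are assumptions for this sketch; follow `/data/wikidata/README.md` for the actual procedure:

```python
import gzip
import json
import sqlite3

# Hypothetical paths and schema; the real build script lives in /data/wikidata/.
con = sqlite3.connect("wikidata_entities.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS entities "
    "(qid TEXT, label TEXT, label_normalized TEXT, translation TEXT)"
)

with gzip.open("latest-all.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip(",\n")  # the dump is one large JSON array, one item per line
        if not line or line in ("[", "]"):
            continue
        item = json.loads(line)
        labels = item.get("labels", {})
        en, de = labels.get("en"), labels.get("de")
        if en and de:
            con.execute(
                "INSERT INTO entities VALUES (?, ?, ?, ?)",
                (item["id"], en["value"], en["value"].lower(), de["value"]),
            )
con.commit()
```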
SALT uses uv as the package manager and Hydra for configuration management:
```bash
# Setup environment (requires uv)
uv sync

# Run training
uv run src/translation_train.py

# Override specific parameters
uv run src/translation_train.py model.use_pointer_generator=true model.sequence_bias_weight=3.0
```

| Parameter | Description | Default |
|---|---|---|
| `model.use_pointer_generator` | Enable pointer generator mechanism | `false` |
| `model.use_sequence_bias` | Apply logit biasing for entity constraints | `true` |
| `model.sequence_bias_weight` | Weight for logit biasing | `5.0` |
| `dataset.retrieval_top_k` | Number of entity candidates to retrieve | `1` |
| `training.max_epochs` | Maximum training epochs | `10` |
| `training.batch_size` | Batch size for training | `16` |
> [!NOTE]
> See `/src/conf/translation_config.yaml` for all configuration options.
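For context, a Hydra entry point typically looks like the sketch below: CLI overrides such as `model.sequence_bias_weight=3.0` are merged into the YAML config before training starts. The config path and field names are assumed from the layout above:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="translation_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg reflects translation_config.yaml plus any CLI overrides,
    # e.g. `uv run src/translation_train.py model.sequence_bias_weight=3.0`.
    print(cfg.model.sequence_bias_weight)

if __name__ == "__main__":
    main()
```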
SALT achieves an M-ETA score of 71.7% and a COMET score of 92.5 (Harmonic Mean: 80.8), outperforming all other approaches not using gold-standard entity IDs during inference.
Our ablation studies show several key insights:
- Simple SQL-based retrieval is competitive with complex neural retrieval methods
- Strategic model refinement outperforms increased model complexity
- Providing only the top-1 candidate entity yields better results than multiple candidates
If you use this code or our approach in your research, please cite our paper:
```bibtex
@inproceedings{volker-etal-2025-salt,
    title = "{SALT} at {S}em{E}val-2025 Task 2: A {SQL}-based Approach for {LLM}-Free Entity-Aware-Translation",
    author = {V{\"o}lker, Tom and
      Pfister, Jan and
      Hotho, Andreas},
    editor = "Rosenthal, Sara and
      Ros{\'a}, Aiala and
      Ghosh, Debanjan and
      Zampieri, Marcos",
    booktitle = "Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.semeval-1.117/",
    pages = "852--864",
    ISBN = "979-8-89176-273-2",
    abstract = "Entity-aware machine translation faces significant challenges when translating culturally-adapted named entities that require knowledge beyond the source text. We present SALT (SQL-based Approach for LLM-Free Entity-Aware-Translation), a parameter-efficient system for the SemEval-2025 Task 2. Our approach combines SQL-based entity retrieval with constrained neural translation via logit biasing and explicit entity annotations. Despite its simplicity, it achieves state-of-the-art performance (First Place) among approaches not using gold-standard data, while requiring far less computation than LLM-based methods. Our ablation studies show simple SQL-based retrieval rivals complex neural models, and strategic model refinement outperforms increased model complexity. SALT offers an alternative to resource-intensive LLM-based approaches, achieving comparable results with only a fraction of the parameters."
}
```