Feature/moderation hallucination eval multilingual translation #1265

Open
wants to merge 17 commits into
base: develop

Conversation

SnowMasaya
Description

feat: Add multilingual translation support for the evaluation pipeline (moderation and hallucination)

📋 Summary

This PR introduces comprehensive multilingual translation capabilities to the NeMo-Guardrails evaluation system, enabling users to evaluate AI models across different languages and cultures. The implementation includes a flexible translation provider system, caching mechanisms, and seamless integration with existing evaluation workflows.

🚀 Key Features

🌍 Multilingual Translation System

  • Flexible Translation Providers: Support for both local (HuggingFace) and remote (DeepL, NVIDIA Riva) translation services
  • Translation Caching: Intelligent caching system to avoid redundant translations and improve performance
  • Configurable Backends: Easy configuration for different translation services via YAML configs
  • Progress Tracking: Real-time progress bars for translation operations

Dependencies Added

  • deepl (^1.22.0) - DeepL translation service integration
  • nvidia-riva-client (^2.21.0) - NVIDIA Riva translation service
  • torch (^2.7.1) - PyTorch for local translation models
  • transformers (^4.53.0) - HuggingFace transformers for local models
  • sentencepiece (^0.2.0) - Tokenization support

Architecture

Translation Provider System

nemoguardrails/evaluate/langproviders/
├── base.py # Base provider interface
├── local.py # HuggingFace-based local translator
├── remote.py # Remote service providers (DeepL, Riva)
├── configs/ # Translation service configurations
└── README.md # Provider documentation
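
The layout above implies a shared interface that the local and remote translators implement. A minimal sketch of what such a base class might look like follows. Only `_translate` and `target_lang` appear in the code quoted in this PR's review; the class names `BaseTranslator` and `UppercaseTranslator` and the `translate_batch` method are hypothetical and are not taken from base.py.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseTranslator(ABC):
    """Hypothetical base interface; not the actual contents of base.py."""

    def __init__(self, target_lang: str) -> None:
        self.target_lang = target_lang

    @abstractmethod
    def _translate(self, text: str) -> str:
        """Translate a single string into self.target_lang."""

    def translate_batch(self, texts: List[str]) -> List[str]:
        # Hypothetical convenience method: translate each text in order.
        return [self._translate(text) for text in texts]


class UppercaseTranslator(BaseTranslator):
    """Stand-in provider used only to exercise the interface."""

    def _translate(self, text: str) -> str:
        return text.upper()
```

With an interface of this shape, the evaluation code would depend only on `_translate` and `target_lang`, so a DeepL, Riva, or HuggingFace provider could be swapped through configuration alone.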

Core Components

  • utils_translate.py: Core translation utilities and caching
  • Enhanced utils.py: Integration with dataset loading
  • Updated evaluation modules: Multilingual support in hallucination and moderation evaluation
  • CLI enhancements: Translation configuration support

🔄 Usage Examples

Basic Translation Configuration

# translation.yaml
provider: "deepl"
api_key: "${DEEPL_API_KEY}"
target_language: "ja"
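
The PR description does not show how the `${DEEPL_API_KEY}` placeholder is resolved. One possible approach, sketched here under that assumption, is to expand environment variables in string values after the YAML has been parsed. The function name `expand_env_placeholders` is hypothetical, and the config is written as a plain dict to keep the example dependency-free.

```python
import os
from typing import Any, Dict


def expand_env_placeholders(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of config with ${VAR} placeholders in string values expanded."""
    return {
        key: os.path.expandvars(value) if isinstance(value, str) else value
        for key, value in config.items()
    }


# Same keys as translation.yaml above, written as a dict for illustration.
raw_config = {
    "provider": "deepl",
    "api_key": "${DEEPL_API_KEY}",
    "target_language": "ja",
}
```

`os.path.expandvars` leaves a placeholder unchanged when the variable is not set, so a loader built this way would probably want to check for a leftover `${` and fail early rather than send the literal string to the translation service.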

CLI Usage

# Evaluate with translation
nemoguardrails evaluate hallucination \
  --dataset data/hallucination/sample.txt \
  --translation-config configs/translation.yaml

# Evaluate moderation with Japanese translation
nemoguardrails evaluate moderation \
  --dataset data/moderation/harmful.txt \
  --translation-config configs/japanese_translation.yaml

🧪 Testing

The implementation includes comprehensive test coverage:

  • Provider Tests: Unit tests for all translation providers
  • Integration Tests: End-to-end translation workflow testing
  • Cache Tests: Translation caching mechanism validation
  • CLI Tests: Command-line interface testing with translation support

Run tests with:

pytest tests/eval/translate/ -v

Configuration

Translation Service Configuration

# DeepL Configuration
provider: "deepl"
api_key: "${DEEPL_API_KEY}"
target_language: "ja"

# HuggingFace Local Configuration
provider: "huggingface"
model_name: "Helsinki-NLP/opus-mt-en-ja"
target_language: "ja"
device: "cpu"

# NVIDIA Riva Configuration
provider: "riva"
url: "https://riva-server:8000"
target_language: "ja"

Breaking Changes

None. This is a purely additive feature that maintains full backward compatibility.

📝 Documentation

  • Added comprehensive README for translation providers
  • Updated evaluation documentation with multilingual examples
  • Added configuration examples for all supported translation services

🎯 Impact

This enhancement significantly expands NeMo-Guardrails' evaluation capabilities, making it a truly global tool for AI safety and compliance evaluation across different languages and cultures.

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.

@Pouyanpi Pouyanpi requested review from Pouyanpi, Copilot and trebedea July 7, 2025 10:25
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds multilingual translation support to the NeMo-Guardrails evaluation pipeline, introducing translation providers, caching, and integration into moderation and hallucination workflows.

  • Core translation utilities and caching mechanism added (utils_translate.py)
  • Integration of translation into dataset loading and evaluation modules (utils.py, evaluate_moderation.py, evaluate_hallucination.py)
  • New translation provider implementations (DeepL, Riva, local HF) and extensive test coverage

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

File | Description
tests/eval/translate/ | Added unit and integration tests for translation
nemoguardrails/evaluate/utils_translate.py | Core translation loading, caching, and dataset I/O
nemoguardrails/evaluate/utils.py | Extended dataset loading with translation support
nemoguardrails/evaluate/langproviders/ | Implemented DeepL, Riva, and local HF translators
nemoguardrails/evaluate/evaluate_moderation.py | Added translation initialization and loading
nemoguardrails/evaluate/evaluate_hallucination.py | Added translation initialization and loading
nemoguardrails/evaluate/cli/evaluate.py | Exposed translation flags in CLI
pyproject.toml | Added translation-related dependencies
Comments suppressed due to low confidence (1)

pyproject.toml:103

  • [nitpick] The pyproject-toml dependency and translation libraries are now always installed; consider moving them into optional extras to avoid pulling heavy packages for users not using translation.
pyproject-toml = "^0.1.0"

# Generate cache file name based on service name
safe_service_name = service_name.replace("/", "_").replace("\\", "_").replace(":", "_")
self.cache_file = self.cache_dir / f"translations_{safe_service_name}.json"
print("cache_file: ", self.cache_file)

Copilot AI Jul 7, 2025

Replace the debugging print with a logging call or remove it to avoid unwanted console output in production.

Suggested change
- print("cache_file: ", self.cache_file)
+ logging.debug(f"cache_file: {self.cache_file}")
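
As a self-contained illustration of the cache file naming quoted above combined with the suggested logging call, the sketch below wraps both in a standalone function. The function name `cache_file_for` and the module-level `logger` are hypothetical; in the PR this logic lives inside a class and uses `self.cache_dir`.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def cache_file_for(cache_dir: Path, service_name: str) -> Path:
    """Build the per-service cache path, replacing characters unsafe in file names."""
    safe_service_name = (
        service_name.replace("/", "_").replace("\\", "_").replace(":", "_")
    )
    cache_file = cache_dir / f"translations_{safe_service_name}.json"
    # %s-style arguments defer string formatting until DEBUG is enabled.
    logger.debug("cache_file: %s", cache_file)
    return cache_file
```

Passing the path as a logging argument instead of an f-string is a small design choice: the message is only formatted when a handler actually emits it.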

Comment on lines 90 to 92
def get_translation_cache(service_name: str = "default") -> TranslationCache:
    """Get or create translation cache instance for the specified service."""
    _translation_caches = {}

Copilot AI Jul 7, 2025

The _translation_caches dictionary is created inside the function, so caching never persists across calls. Move _translation_caches to module scope to reuse cache instances.

Suggested change
- def get_translation_cache(service_name: str = "default") -> TranslationCache:
-     """Get or create translation cache instance for the specified service."""
-     _translation_caches = {}
+ # Global dictionary to store translation cache instances
+ _translation_caches = {}
+ def get_translation_cache(service_name: str = "default") -> TranslationCache:
+     """Get or create translation cache instance for the specified service."""


langprovider_config = {
    "langproviders": {language_service["model_type"]: language_service}
}
logging.debug(f"langauge provision service: {language_service['language']}")

Copilot AI Jul 7, 2025

There's a typo in the debug message: 'langauge' should be 'language'.

Suggested change
- logging.debug(f"langauge provision service: {language_service['language']}")
+ logging.debug(f"language provision service: {language_service['language']}")

"""Generate cache key from text and target language."""
# Create a hash of the text and target language
content = f"{text}:{target_lang}"
return content

Copilot AI Jul 7, 2025

[nitpick] Docstring suggests a hash is created but implementation concatenates text and language directly. Consider actually hashing long texts or updating documentation to match behavior.

Suggested change
- return content
+ return hashlib.sha256(content.encode('utf-8')).hexdigest()

Collaborator

Agreed on this one.

Author

OK, I'll change it.
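
For reference, a standalone version of the hashed key discussed in this thread could look like the sketch below. The function name `make_cache_key` is hypothetical; the key format (`text`, a colon, then `target_lang`) and the SHA-256 call are taken from the quoted lines and the suggested change.

```python
import hashlib


def make_cache_key(text: str, target_lang: str) -> str:
    """Generate a fixed-length cache key from text and target language."""
    content = f"{text}:{target_lang}"
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
```

One trade-off worth keeping in mind: hashed keys keep the cache file compact for long texts, but the keys are no longer human-readable, and entries written under the old unhashed keys would not be found after the change.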

github-actions bot commented Jul 7, 2025

Documentation preview

https://nvidia.github.io/NeMo-Guardrails/review/pr-1265

Collaborator

@trebedea trebedea left a comment

Provided several comments; most are just nice to have.
@SnowMasaya try to fix the ones you feel are most important, e.g. the duplicated code and the documentation-related ones.

@Pouyanpi can you check if you have any feedback related to tests and using poetry?

A local translation provider using Hugging Face models.

**Supported Models:**
- **M2M100**: Multilingual translation model (supports 100 languages)
Collaborator

Suggested change
- - **M2M100**: Multilingual translation model (supports 100 languages)
+ - **M2M100**: Multilingual Many-to-Many translation models (supports 100 languages)

### Remote Providers

#### DeeplTranslator
High-quality translation service using the DeepL API.
Collaborator

Suggested change
- High-quality translation service using the DeepL API.
+ High-quality translation service using the DeepL API. Requires DeepL API key for using it.


**Features:**
- High-quality translations
- Supports 29 languages
Collaborator

Suggested change
- - Supports 29 languages
+ - Supports 29 languages (check official website for exact number)

- Commercial use available

#### RivaTranslator
Translation service using NVIDIA Riva.
Collaborator

Suggested change
- Translation service using NVIDIA Riva.
+ Translation service using NVIDIA Riva. Requires an API key for using it.


## Configuration Parameters

### Common Parameters
Collaborator

Can we highlight the required parameters?

Author

OK

"""Generate cache key from text and target language."""
# Create a hash of the text and target language
content = f"{text}:{target_lang}"
return content
Collaborator

Agreed on this one.

if isinstance(item, dict):
    # For JSON format, translate specific fields
    translated_item = item.copy()
    for field in ["answer", "question", "evidence"]:
Collaborator

Let's mention in the documentation that when translating JSON files, only these fields are processed.

Author

OK
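
To illustrate the behavior the reviewer asks to document, the sketch below translates only the three listed fields of a JSON item and leaves every other key untouched. The loop body is an assumption, because the quoted excerpt stops at the `for` line; the function name `translate_item` and the `TRANSLATED_FIELDS` constant are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable

# Field names taken from the quoted loop in the PR.
TRANSLATED_FIELDS = ("answer", "question", "evidence")


def translate_item(
    item: Dict[str, Any],
    translate: Callable[[str], str],
    fields: Iterable[str] = TRANSLATED_FIELDS,
) -> Dict[str, Any]:
    """Return a shallow copy of item with the listed string fields translated."""
    translated_item = item.copy()
    for field in fields:
        value = translated_item.get(field)
        # Skip missing, empty, or non-string values.
        if isinstance(value, str) and value:
            translated_item[field] = translate(value)
    return translated_item
```

For example, calling `translate_item({"question": "hi", "id": 7}, str.upper)` changes `question` but passes `id` through unchanged, which is the point the documentation note would need to make.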

Comment on lines 168 to 175
cached_translation = cache.get(original_text, translator.target_lang)
if cached_translation:
    translated_dataset.append(cached_translation)
else:
    # Translate and cache
    translated_text = translator._translate(original_text)
    translated_dataset.append(translated_text)
    cache.set(original_text, translator.target_lang, translated_text)
Collaborator

These lines are copy-pasted from above. Shouldn't we wrap this in a helper method in the translator cache?

Author

OK
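
A later commit in this PR mentions extracting a `_check_cache_and_translate()` helper. Its signature is not shown, so the following is only a sketch of how such a helper might be shaped. `InMemoryCache` is a stand-in that mirrors the `get`/`set` calls in the quoted lines, and the parameter list of `check_cache_and_translate` is an assumption.

```python
from typing import Callable, Dict, Optional, Tuple


class InMemoryCache:
    """Stand-in cache with the get/set shape used in the quoted lines."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def get(self, text: str, target_lang: str) -> Optional[str]:
        return self._store.get((text, target_lang))

    def set(self, text: str, target_lang: str, translated: str) -> None:
        self._store[(text, target_lang)] = translated


def check_cache_and_translate(
    cache: InMemoryCache,
    text: str,
    target_lang: str,
    translate: Callable[[str], str],
) -> str:
    """Return the cached translation, or translate, store, and return it."""
    cached = cache.get(text, target_lang)
    # Compare against None so a cached empty string still counts as a hit.
    if cached is not None:
        return cached
    translated = translate(text)
    cache.set(text, target_lang, translated)
    return translated
```

The sketch differs from the quoted lines in one detail: the quoted code tests `if cached_translation:`, which treats a cached empty string as a miss and would translate it again on every call.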

self.dataset = load_dataset(
self.dataset_path, translation_config=self.translation_config
)[: self.num_samples]
else:
Collaborator

We should print a warning if translation is enabled but the translator is None.

Author

OK
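
One way the requested warning could be implemented is sketched below with the standard `warnings` module. The function name `warn_if_translator_missing`, its parameters, and the message text are all hypothetical; the PR does not show how the check was eventually written.

```python
import warnings
from typing import Optional


def warn_if_translator_missing(
    translation_enabled: bool, translator: Optional[object]
) -> bool:
    """Warn when translation was requested but no translator was loaded.

    Returns True if the dataset will be translated, False otherwise.
    """
    if translation_enabled and translator is None:
        warnings.warn(
            "Translation is enabled but no translator was loaded; "
            "the dataset will be evaluated untranslated.",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return translation_enabled
```

Returning a boolean lets the caller choose between the translated and untranslated `load_dataset` paths in one place. Using `logging.warning` instead would be equally reasonable if the evaluation modules already log through `logging`.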

try:
    from nemoguardrails.evaluate.utils_translate import _load_langprovider

    self.translator = _load_langprovider(self.translation_config)
Collaborator

This is done again in load_dataset. Can't we do it only once, there?

Author

OK. I'll try it.

…r translation code

- Add YAML configurable endpoints to RivaTranslator (remote.py):
  * Support uri parameters from YAML config
  * Local mode: only uri can be overridden, others use defaults

- Refactor translation utilities (utils_translate.py):
  * Extract _check_cache_and_translate() helper function
  * Eliminate duplicate cache checking and translation logic
  * Simplify load_dataset() function while preserving functionality
  * Reduce code duplication across different file formats

- Update translation provider tests (base.py, local.py):
  * Fix test configurations to use list format for langproviders
  * Remove assertions on non-existent attributes
  * Update error handling for new validation logic
  * Ensure compatibility with configurable endpoint feature
- Fix test configurations to use list format for langproviders
- Remove obsolete assertions on non-existent attributes
- Add configurable endpoint tests to test_remote_translators.py
- Update cache tests to work with new translation logic
- Consolidate RivaTranslator tests in single file
- Add YAML examples for RivaTranslator endpoint configuration
- Document local mode parameter behavior
- Update existing examples for consistency

Helps users configure RivaTranslator endpoints via YAML.
- README: remove hf_args
- pyproject.toml: update dependency for translation
4 participants