@Pactortester

Problem Description

When using WhisperLiveKit with NLLW for voice translation, the --lan parameter is passed to both Whisper (speech recognition) and NLLW (translation). #2

Issue: Whisper only accepts zh as the Chinese language code, while NLLW previously only accepted zh-CN, so no single --lan value worked for both.

Error Reproduction

# Using zh-CN: Whisper throws error
python -m whisperlivekit.basic_server --lan zh-CN --target-language eng_Latn
# Error: Unsupported language: zh-cn

# Using zh: NLLW throws error
python -m whisperlivekit.basic_server --lan zh --target-language eng_Latn
# Error: Unknown language identifier: zh

Root Cause

System    Supported Chinese code
Whisper   zh
NLLW      zh-CN

The two systems have incompatible language codes.

Solution

Extend the language_code field to support list type, allowing a single language entry to map to multiple language codes:

# Before: single language_code
{"name": "Chinese (Simplified)", "nllb": "zho_Hans", "language_code": "zh-CN"}

# After: support list, compatible with both zh and zh-CN
{"name": "Chinese (Simplified)", "nllb": "zho_Hans", "language_code": ["zh-CN", "zh"]}

Modified Files

  • nllw/languages.py

Changes

1. Data Structure Change

-   {"name": "Chinese (Simplified)", "nllb": "zho_Hans", "language_code": "zh-CN"},
+   {"name": "Chinese (Simplified)", "nllb": "zho_Hans", "language_code": ["zh-CN", "zh"]},

2. New Helper Functions

def _get_language_codes(lang):
    """Convert language_code to list, compatible with both string and list formats"""
    codes = lang["language_code"]
    return codes if isinstance(codes, list) else [codes]


def _match_language_code(lang, identifier):
    """Check if identifier matches any language_code (case-insensitive)"""
    codes = _get_language_codes(lang)
    identifier_lower = identifier.lower()
    # An exact match is also a case-insensitive match, so one comparison is enough
    return any(code.lower() == identifier_lower for code in codes)

3. Modified Dictionary Building Logic

# Support multiple language_codes mapping to the same nllb code
LANGUAGE_CODE_TO_NLLB = {}
LANGUAGE_CODE_TO_NAME = {}
for lang in LANGUAGES:
    for code in _get_language_codes(lang):
        LANGUAGE_CODE_TO_NLLB[code] = lang["nllb"]
        LANGUAGE_CODE_TO_NAME[code] = lang["name"]
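
One thing this loop does not guard against is two entries claiming the same code, in which case the later entry silently overwrites the earlier one. A minimal sketch of a stricter builder follows; build_code_maps and the sample entries are hypothetical and not part of this PR.

```python
def build_code_maps(languages):
    """Build code -> nllb and code -> name maps, rejecting codes claimed twice."""
    code_to_nllb, code_to_name = {}, {}
    for lang in languages:
        codes = lang["language_code"]
        codes = codes if isinstance(codes, list) else [codes]
        for code in codes:
            if code in code_to_nllb:
                raise ValueError(
                    f"language code {code!r} is already used by {code_to_name[code]!r}"
                )
            code_to_nllb[code] = lang["nllb"]
            code_to_name[code] = lang["name"]
    return code_to_nllb, code_to_name


# Sample data for illustration only.
sample = [
    {"name": "Chinese (Simplified)", "nllb": "zho_Hans", "language_code": ["zh-CN", "zh"]},
    {"name": "Chinese (Traditional)", "nllb": "zho_Hant", "language_code": "zh-TW"},
]
code_to_nllb, code_to_name = build_code_maps(sample)
print(code_to_nllb["zh"], code_to_nllb["zh-TW"])  # zho_Hans zho_Hant

# A later entry that also claims "zh" is rejected instead of overwriting the first.
conflicting = sample + [
    {"name": "Chinese (Traditional, alt)", "nllb": "zho_Hant", "language_code": "zh"}
]
try:
    build_code_maps(conflicting)
except ValueError as exc:
    print("rejected:", exc)
```

Whether a hard error is appropriate depends on how the table is maintained; a logged warning would be a softer alternative.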

4. Modified Lookup Functions

The get_language_info() and list_all_language_code_codes() functions have been updated to accept a language_code that is either a string or a list.
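
The updated function bodies are not shown in this description. As a rough sketch of what a list-aware lookup could look like, assuming an identifier may be either an NLLB code or a language code: lookup_language, list_all_codes, and SAMPLE_LANGUAGES below are hypothetical names, not the functions in nllw/languages.py.

```python
# Illustrative sample only; mixes the list form and the legacy string form.
SAMPLE_LANGUAGES = [
    {"name": "Chinese (Simplified)", "nllb": "zho_Hans", "language_code": ["zh-CN", "zh"]},
    {"name": "French", "nllb": "fra_Latn", "language_code": "fr"},
]


def _codes(lang):
    """Normalize language_code to a list."""
    codes = lang["language_code"]
    return codes if isinstance(codes, list) else [codes]


def lookup_language(identifier):
    """Return the first entry whose nllb code or any language_code matches, else None."""
    wanted = identifier.lower()
    for lang in SAMPLE_LANGUAGES:
        if lang["nllb"].lower() == wanted:
            return lang
        if any(code.lower() == wanted for code in _codes(lang)):
            return lang
    return None


def list_all_codes():
    """Flatten every entry's codes into one list, preserving table order."""
    return [code for lang in SAMPLE_LANGUAGES for code in _codes(lang)]


print(lookup_language("zh")["name"])        # Chinese (Simplified)
print(lookup_language("ZHO_hans")["name"])  # Chinese (Simplified)
print(lookup_language("xx"))                # None
print(list_all_codes())                     # ['zh-CN', 'zh', 'fr']
```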

Test Verification

python -c "
from nllw.languages import convert_to_nllb_code, LANGUAGE_CODE_TO_NLLB

print('zh ->', convert_to_nllb_code('zh'))       # zho_Hans
print('zh-CN ->', convert_to_nllb_code('zh-CN')) # zho_Hans
print('ZH ->', convert_to_nllb_code('ZH'))       # zho_Hans (case-insensitive)
"
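
The 'ZH' case relies on case-insensitive matching, which a plain lookup in a dictionary keyed by 'zh-CN' and 'zh' would not provide on its own. One way to get that behavior is a lowercased index built once up front. This is a sketch under assumptions: to_nllb and the sample mapping are hypothetical stand-ins, not the actual convert_to_nllb_code implementation.

```python
# Sample mapping for illustration; the real one is built from the full table.
CODE_TO_NLLB = {"zh-CN": "zho_Hans", "zh": "zho_Hans", "fr": "fra_Latn"}

# Lowercased copy of the keys so lookups ignore the caller's casing.
_LOWER_INDEX = {code.lower(): nllb for code, nllb in CODE_TO_NLLB.items()}


def to_nllb(identifier):
    """Map a language code to its NLLB code, ignoring case."""
    try:
        return _LOWER_INDEX[identifier.lower()]
    except KeyError:
        # Message format mirrors the error quoted under "Error Reproduction".
        raise ValueError(f"Unknown language identifier: {identifier}") from None


print("zh ->", to_nllb("zh"))        # zho_Hans
print("zh-CN ->", to_nllb("zh-CN"))  # zho_Hans
print("ZH ->", to_nllb("ZH"))        # zho_Hans
```

The index costs one extra dictionary but keeps each lookup constant-time, compared with scanning every entry and lowercasing its codes per call.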

After the fix, the following command works correctly:

python -m whisperlivekit.basic_server --lan zh --target-language eng_Latn

Impact

  • Backward compatible: zh-CN still works, no changes needed for existing code
  • New support: zh is now recognized
  • Extensible: Other languages can support multiple codes by changing language_code to a list
  • No impact: Entries with a single string language_code need no modification

Design Advantages

Compared to simply adding duplicate entries:

# Option A (not recommended): add duplicate entries
{"name": "Chinese (Simplified)", "nllb": "zho_Hans", "language_code": "zh-CN"},
{"name": "Chinese", "nllb": "zho_Hans", "language_code": "zh"},  # duplicate

This solution is more elegant:

# Option B (this PR): support list
{"name": "Chinese (Simplified)", "nllb": "zho_Hans", "language_code": ["zh-CN", "zh"]},

  • No duplicate data
  • Clearly expresses "multiple codes for the same language" semantics
  • Easy to extend for other languages

Related Projects

…patibility

When using WhisperLiveKit with NLLW, the --lan parameter is passed to both
Whisper and NLLW. Whisper only accepts "zh" for Chinese while NLLW previously
only accepted "zh-CN", causing incompatibility.

Changes:
- Allow language_code field to be a list (e.g., ["zh-CN", "zh"])
- Add helper functions _get_language_codes() and _match_language_code()
- Update dictionary building to map all codes to the same nllb code
- Update lookup functions to support list type language_code