Change word-level audio ASR metrics to the token level #3342

Open · wants to merge 4 commits into main
Conversation

@ImKeTT (Collaborator) commented Feb 14, 2025

Using tiktoken to tokenize texts in different languages for ASR-related metrics.

I'm not sure if there's a better way to create the tokenizer for this; could you take a look? Thanks!

Here are two per_instance_stats.json files for Chinese and Hebrew in FLEURS.

per_instance_stats_hebrew.json
per_instance_stats_chinese.json

@ImKeTT ImKeTT requested review from teetone and yifanmai February 14, 2025 00:12
@yifanmai (Collaborator) left a comment


This isn't the right way to do tokenization because:

  1. You're creating temporary files that don't get cleaned up afterwards
  2. The files aren't used across runs, so there's no need to keep them in storage
  3. You're calling a private method tokenizer._tokenize_do_it, which goes against Python conventions

You have a couple of options here:

  1. If you only need tiktoken, just import tiktoken directly (preferred)
  2. If you want to use HELM tokenizers, create one using TiktokenTokenizer(BlackHoleCacheConfig()) (or AutoTokenizer({}, BlackHoleCacheConfig()) if you want other things besides tiktoken), and then use the public tokenize() method

Also, if you're using a tokenizer for metrics, the properties of the tokenizer may affect your measurements, especially for non-English languages, so you may want to select your tokenizer carefully.
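
For reference, option 1 might look something like this (a minimal sketch; the helper name is illustrative and not part of HELM or this PR):

import tiktoken

# Load the BPE encoding once and reuse it; no temporary files, no private methods.
encoding = tiktoken.get_encoding("cl100k_base")

def tokenize_for_metric(text: str) -> list[int]:
    # Returns BPE token IDs for the metric computation (hypothetical helper).
    return encoding.encode(text)

tokenize_for_metric("hello world")  # a short list of integer token IDs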

@ImKeTT (Collaborator, Author) commented Feb 18, 2025

Thanks @yifanmai, I just fixed the tokenizer. For token-level audio metrics, we just want to propose a new paradigm that avoids using different word tokenizers for different languages in audio metrics.

@ImKeTT ImKeTT requested a review from yifanmai February 18, 2025 00:27
@yifanmai (Collaborator) commented

For token-level audio metrics, we just want to propose a new paradigm that avoids using different word tokenizers for different languages in audio metrics.

The problem is that tiktoken (and byte-pair encoding in general) falls back to individual bytes, and Unicode characters in non-Latin alphabets may be represented by multiple bytes. Since characters from the same language are likely to start with the same bytes in Unicode space, you get lots of spurious overlap.

Consider the following example:

from tiktoken import get_encoding

tokenizer = get_encoding("cl100k_base")
tokenizer.encode("狐")  # [163, 233, 238]
tokenizer.encode("狸")  # [163, 233, 116]

By your old definition of get_mer_score / get_wer_score / get_chinese_mer_score / get_chinese_wer_score, the score was 1.0.

By the new definition, the score is 0.33 because the bytes [163, 233] overlap, which doesn't seem right.
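
To make the arithmetic concrete, here is a quick standalone check (the edit-distance helper is only illustrative; it is not the PR's actual scoring code):

def levenshtein(ref, hyp):
    # Classic dynamic-programming edit distance over token sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

ref, hyp = [163, 233, 238], [163, 233, 116]  # cl100k_base byte-level tokens for 狐 vs 狸
levenshtein(ref, hyp) / len(ref)             # 0.33 -- only the last byte differs
levenshtein(["狐"], ["狸"]) / 1               # 1.0  -- character-level comparison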

I'd suggest looking at nltk or spacy and seeing if they cover the languages that you need.
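
As a rough sketch of that direction (assuming spaCy is installed; the blank pipelines below use character-level or rule-based segmentation and are not part of this PR):

import spacy

# Blank pipelines provide language-appropriate tokenization without trained models.
zh_nlp = spacy.blank("zh")  # Chinese defaults to character segmentation (jieba/pkuseg optional)
he_nlp = spacy.blank("he")  # Hebrew uses whitespace/punctuation-based word segmentation

[t.text for t in zh_nlp("今天天气很好")]  # per-character tokens
[t.text for t in he_nlp("שלום עולם")]  # word tokens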

@ImKeTT (Collaborator, Author) commented Feb 18, 2025

Thanks! Let me check again.
