Change word-level audio ASR metrics to the token level #3342

Open · wants to merge 4 commits into main
Conversation

@ImKeTT (Collaborator) commented Feb 14, 2025

Using tiktoken to tokenize texts in different languages for ASR-related metrics.

I'm not sure if there's a better way to create the tokenizer for this; could you take a look? Thanks!

Here are two per_instance_stats.json files for Chinese and Hebrew in FLEURS.

per_instance_stats_hebrew.json
per_instance_stats_chinese.json

@ImKeTT ImKeTT requested review from teetone and yifanmai February 14, 2025 00:12
@yifanmai (Collaborator) left a comment


This isn't the right way to do tokenization because:

  1. You're creating temporary files that don't get cleaned up afterwards
  2. The files aren't used across runs, so there's no need to keep them in storage
  3. You're calling a private method tokenizer._tokenize_do_it, which goes against Python conventions

You have a couple of options here:

  1. If you only need tiktoken, just import tiktoken directly (preferred)
  2. If you want to use HELM tokenizers, create one using TiktokenTokenizer(BlackHoleCacheConfig()) (or AutoTokenizer({}, BlackHoleCacheConfig()) if you want other things besides tiktoken), and then use the public tokenize() method

Also, if you're using a tokenizer for metrics, the properties of the tokenizer may affect your measurements, especially for non-English languages, so you may want to select your tokenizer carefully.
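
For reference, option 1 might look something like this (a minimal sketch; the helper name is illustrative and not part of HELM or this PR):

import tiktoken

# Load the BPE encoding once and reuse it; no temporary files, no private methods.
encoding = tiktoken.get_encoding("cl100k_base")

def tokenize_for_metric(text: str) -> list[int]:
    # Returns BPE token IDs for the metric computation (hypothetical helper).
    return encoding.encode(text)

tokenize_for_metric("hello world")  # a short list of integer token IDs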

@ImKeTT (Collaborator, Author) commented Feb 18, 2025

Thanks @yifanmai, I just fixed the tokenizer. For token-level audio metrics, we just want to propose a new paradigm that avoids using different word tokenizers for different languages in audio metrics.

@ImKeTT ImKeTT requested a review from yifanmai February 18, 2025 00:27
@yifanmai (Collaborator) commented

For token-level audio metrics, we just want to propose a new paradigm that avoids using different word tokenizers for different languages in audio metrics.

The problem is that tiktoken (and byte-pair encoding in general) falls back to individual bytes, and Unicode characters in non-Latin alphabets may be represented by multiple bytes. Since characters from the same language are likely to start with the same bytes in Unicode space, you get lots of spurious overlap.

Consider the following example:

from tiktoken import get_encoding

tokenizer = get_encoding("cl100k_base")
tokenizer.encode("狐")  # [163, 233, 238]
tokenizer.encode("狸")  # [163, 233, 116]

By your old definition of get_mer_score / get_wer_score / get_chinese_mer_score / get_chinese_wer_score, the score was 1.0.

By the new definition, the score is 0.33 because the bytes [163, 233] overlap, which doesn't seem right.
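
To make the arithmetic concrete, here is a quick standalone check (the edit-distance helper is only illustrative; it is not the PR's actual scoring code):

def levenshtein(ref, hyp):
    # Classic dynamic-programming edit distance over token sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1][-1]

ref, hyp = [163, 233, 238], [163, 233, 116]  # cl100k_base byte-level tokens for 狐 vs 狸
levenshtein(ref, hyp) / len(ref)             # 0.33 -- only the last byte differs
levenshtein(["狐"], ["狸"]) / 1               # 1.0  -- character-level comparison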

I'd suggest looking at nltk or spacy and seeing if they cover the languages that you need.
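
As a rough sketch of that direction (assuming spaCy is installed; the blank pipelines below use character-level or rule-based segmentation and are not part of this PR):

import spacy

# Blank pipelines provide language-appropriate tokenization without trained models.
zh_nlp = spacy.blank("zh")  # Chinese defaults to character segmentation (jieba/pkuseg optional)
he_nlp = spacy.blank("he")  # Hebrew uses whitespace/punctuation-based word segmentation

[t.text for t in zh_nlp("今天天气很好")]  # per-character tokens
[t.text for t in he_nlp("שלום עולם")]  # word tokens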

@ImKeTT (Collaborator, Author) commented Feb 18, 2025

Thanks! Let me check again.
