Change word-level audio ASR metrics to the token level #3342
base: main
Conversation
This isn't the right way to do tokenization because:
- You're creating temporary files that don't get cleaned up afterwards
- The files aren't used across runs, so there's no need to keep them in storage
- You're calling a private method `tokenizer._tokenize_do_it`, which goes against Python conventions

You have a couple of options here (a sketch of the preferred one follows below):
- If you only need tiktoken, just `import tiktoken` directly (preferred)
- If you want to use HELM tokenizers, create one using `TiktokenTokenizer(BlackHoleCacheConfig())` (or `AutoTokenizer({}, BlackHoleCacheConfig())` if you want other things besides tiktoken), and then use the public `tokenize()` method
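A minimal sketch of the preferred option, using tiktoken directly with no temporary files and no private methods (the encoding name here is just an illustrative choice):

```python
import tiktoken

# Look up a BPE encoding by name; "cl100k_base" is an illustrative choice.
encoding = tiktoken.get_encoding("cl100k_base")

# encode() returns a list of integer token ids.
token_ids = encoding.encode("hello world")  # e.g. [15339, 1917]
```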
Also, if you're using a tokenizer for metrics, the properties of the tokenizer may affect your measurements, especially for non-English languages, so you may want to select your tokenizer carefully.
Thanks @yifanmai, I just fixed the tokenizer. For token-level audio metrics, we want to propose a new paradigm that avoids having to pick a different word tokenizer for each language in audio metrics.
The problem is that tiktoken (and byte-pair encoding in general) falls back to individual bytes, and Unicode characters in non-Latin alphabets may be represented by multiple bytes. Since characters from the same language are likely to start with the same bytes in Unicode space, you get lots of spurious overlap. Consider the following example:

```python
from tiktoken import get_encoding

tokenizer = get_encoding("cl100k_base")
tokenizer.encode("狐")  # [163, 233, 238]
tokenizer.encode("狸")  # [163, 233, 116]
```

By your old word-level definition, these two unrelated characters share nothing. By the new token-level definition, two of the three byte-level tokens match, so the score reports substantial overlap that is purely an artifact of the encoding.

I'd suggest looking at nltk or spacy and seeing if they cover the languages that you need.
Thanks! Let me check again.
Using tiktoken to tokenize texts in different languages for ASR-related metrics.
I'm not sure whether there's a better way to create the tokenizer for this; could you take a look? Thanks!
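As a hypothetical sketch of the idea (not this PR's actual code; the function name, encoding choice, and normalization are all illustrative), a token-level error rate could replace the word-level one like this:

```python
import tiktoken

def token_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over token ids, normalized by reference length."""
    enc = tiktoken.get_encoding("cl100k_base")  # illustrative encoding choice
    ref = enc.encode(reference)
    hyp = enc.encode(hypothesis)
    # Single-row dynamic-programming edit distance over the two id sequences.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (free if ids match)
            )
    return dp[-1] / max(len(ref), 1)

print(token_error_rate("hello world", "hello world"))  # 0.0
```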
Here are two `per_instance_stats.json` files for Chinese and Hebrew in FLEURS:
- per_instance_stats_hebrew.json
- per_instance_stats_chinese.json