
[audio utils] fix fft_bin_width computation #1274


Merged: xenova merged 1 commit into main from fix-triangularise-mel-space on Apr 9, 2025.


xenova (Collaborator) commented on Apr 6, 2025:

Copied from huggingface/transformers#36603:

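When computing triangular mel filter matrices with triangularize_in_mel_space=True, the FFT bin width was computed incorrectly. For example, if num_frequency_bins is 257, the real-valued FFT has been computed on 512 points (257 = n_fft // 2 + 1). These 257 bins run from 0 Hz up to and including the Nyquist frequency (sampling_rate / 2), which gives 256 intervals between them. So fft_bin_width should be (sampling_rate / 2) / number_of_bins, with number_of_bins = num_frequency_bins - 1.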
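The corrected spacing can be checked numerically. This is a minimal sketch, not the library's code. The helper name `fft_bin_width` and the example values (sampling_rate = 16000, n_fft = 512) are assumptions chosen only for illustration.

```python
def fft_bin_width(sampling_rate: float, num_frequency_bins: int) -> float:
    """Spacing in Hz between adjacent bins of a real-valued FFT.

    Hypothetical helper. Assumes num_frequency_bins = n_fft // 2 + 1, so the
    bins cover 0 Hz .. sampling_rate / 2 inclusive, i.e. num_frequency_bins - 1
    intervals.
    """
    return (sampling_rate / 2) / (num_frequency_bins - 1)


sampling_rate = 16000                 # assumed example value
n_fft = 512                           # assumed example value
num_frequency_bins = n_fft // 2 + 1   # 257

width = fft_bin_width(sampling_rate, num_frequency_bins)
print(width)  # 31.25, the same as sampling_rate / n_fft

# Center frequency of every bin; the last one lands on the Nyquist frequency.
bin_freqs = [i * width for i in range(num_frequency_bins)]
print(bin_freqs[0], bin_freqs[-1])  # 0.0 8000.0
```

The first bin sits at 0 Hz and the last at sampling_rate / 2. Dividing by num_frequency_bins - 1 places the last bin exactly on the Nyquist frequency. Dividing by num_frequency_bins would leave it slightly short.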

This bug was very likely introduced without being caught by the tests for the following reason. In the example above, mel_filter_bank was called with num_frequency_bins = 256 when using triangularize_in_mel_space=True, and the result was then padded with 0s to recover the expected shape (257). This is bad practice and misleading for users, because it does not respect the method's API!
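The sketch below shows how the 256-bin call combined with zero-padding could hide the bug. It assumes the pre-fix formula divided by num_frequency_bins instead of num_frequency_bins - 1. Both helper functions are hypothetical names used only for this illustration, and the sampling rate and FFT size are example values.

```python
def fft_bin_width_prefix(sampling_rate: float, num_frequency_bins: int) -> float:
    # Assumed pre-fix formula (hypothetical helper): divides by the bin count.
    return (sampling_rate / 2) / num_frequency_bins


def fft_bin_width_fixed(sampling_rate: float, num_frequency_bins: int) -> float:
    # Fixed formula (hypothetical helper): num_frequency_bins bins span
    # num_frequency_bins - 1 intervals.
    return (sampling_rate / 2) / (num_frequency_bins - 1)


sampling_rate = 16000        # assumed example value
n_fft = 512                  # assumed example value
true_bins = n_fft // 2 + 1   # 257

# With the true bin count, the pre-fix formula gives 8000 / 257 Hz, slightly
# less than the correct 31.25 Hz.
print(fft_bin_width_prefix(sampling_rate, true_bins))

# Passing 256 instead happens to reproduce the correct spacing. The filter
# matrix then only needed zero-padding back to 257 rows, so tests still passed.
print(fft_bin_width_prefix(sampling_rate, true_bins - 1))  # 31.25
print(fft_bin_width_fixed(sampling_rate, true_bins))       # 31.25
```

Under this assumption, the workaround compensated for the off-by-one at the call site rather than in the function. Fixing the divisor lets callers pass the true num_frequency_bins directly.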

I also updated the expected outputs with those produced by torchaudio's Kaldi implementation, which confirmed that our implementation was incorrect.

Thanks @eustlb for the original fix!


xenova merged commit 10c09fb into main on Apr 9, 2025 (4 checks passed).
xenova deleted the fix-triangularise-mel-space branch on Apr 9, 2025 at 19:10.