Fix ACE-Step audio sample saving (bf16 dtype + waveform shape) #910
Open
SanDiegoDude wants to merge 1 commit into
When saving generated audio samples, torchaudio/ffmpeg cannot encode bfloat16/float16 waveforms (raises "No format found for dtype ..."); cast the waveform to float32 first.
Also, the ACE-Step pipeline already returns a [channels, time] tensor, so indexing image[0] dropped the channel dimension and produced a 1D tensor, causing ffmpeg to fail with "channel_layout=0x0". Normalize the waveform to a 2D [channels, time] tensor before saving.
With the default bf16 training dtype these two issues prevented any ACE-Step sample from being written.
Co-authored-by: Cursor <cursoragent@cursor.com>
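The two steps in the commit message (cast to float32, then normalize to a 2D [channels, time] tensor) can be sketched roughly as follows. This is an illustration, not the code from this PR: the helper name `prepare_waveform_for_save` is hypothetical, and taking the first batch item for 3D input is an assumption about what "normalize" means here.

```python
import torch


def prepare_waveform_for_save(waveform: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return a float32 [channels, time] tensor on CPU."""
    # bfloat16/float16 cannot be encoded, so cast to float32 first.
    wav = waveform.detach().to("cpu", dtype=torch.float32)

    if wav.dim() == 1:
        # [time] -> [1, time]: treat a bare 1D signal as mono.
        wav = wav.unsqueeze(0)
    elif wav.dim() == 3:
        # [batch, channels, time] -> [channels, time].
        # Assumption: only the first batch item is saved.
        wav = wav[0]
    elif wav.dim() != 2:
        raise ValueError(f"expected a 1D, 2D or 3D waveform, got {wav.dim()}D")
    return wav


# A stereo bf16 waveform already shaped [channels, time].
stereo = torch.zeros(2, 1000, dtype=torch.bfloat16)

# Indexing [0] on a [channels, time] tensor drops the channel dimension,
# which is the shape bug described above.
collapsed = stereo[0]
assert collapsed.dim() == 1

# The helper keeps both channels and produces an encodable dtype.
ready = prepare_waveform_for_save(stereo)
assert ready.dtype == torch.float32
assert tuple(ready.shape) == (2, 1000)
```

The returned tensor is in the layout `torchaudio.save` expects for its waveform argument, so it can be passed there together with the sample rate.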
Summary
Saving generated audio samples for ACE-Step (1.5 and 1.5 XL) currently fails before any sample is written, due to two issues in `GenerateImageConfig.save_image` in `toolkit/config_modules.py`:

1. With `train.dtype: bf16`, the generated waveform is `bfloat16`, which torchaudio/ffmpeg cannot encode (it raises "No format found for dtype ...").
2. The ACE-Step pipeline already returns a `[channels, time]` tensor (it squeezes the batch dim internally), but `save_image` indexes `image[0]` again, collapsing it to 1D. ffmpeg then fails with "channel_layout=0x0".

The fix casts the waveform to `float32` and normalizes it to a 2D `[channels, time]` tensor (handling 1D/2D/3D inputs) before calling `torchaudio.save`.

Test plan
- … `low_vram: true`, `train.dtype: bf16`, and sampling enabled; baseline samples now write as valid 180s MP3s and training proceeds.
- … `output_ext` path).

Made with Cursor