Fix ACE-Step audio sample saving (bf16 dtype + waveform shape) #910

Open
SanDiegoDude wants to merge 1 commit into ostris:main from SanDiegoDude:fix/acestep-audio-sample-save

@SanDiegoDude

Summary

Saving generated audio samples for ACE-Step (1.5 and 1.5 XL) currently fails before any sample is written, due to two issues in GenerateImageConfig.save_image in toolkit/config_modules.py:

  1. bf16 dtype — with the default train.dtype: bf16, the generated waveform is bfloat16, which torchaudio/ffmpeg cannot encode:
    ValueError: No format found for dtype torch.bfloat16; dtype must be one of
    [torch.uint8, torch.int16, torch.int32, torch.int64, torch.float32, torch.float64].
    
  2. Waveform shape — the ACE-Step pipeline already returns a [channels, time] tensor (it squeezes the batch dim internally), but save_image indexes image[0] again, collapsing it to 1D. ffmpeg then fails with:
    RuntimeError: Failed to create input filter:
    "time_base=1/48000:sample_rate=48000:sample_fmt=flt:channel_layout=0x0" (Invalid argument)
    

The fix casts the waveform to float32 and normalizes it to a 2D [channels, time] tensor (handling 1D/2D/3D inputs) before calling torchaudio.save.
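
As a rough illustration of the normalization described above (not the actual diff in this PR), the helper below casts a waveform to float32 and coerces it to a 2D [channels, time] tensor. The name `prepare_waveform` is hypothetical, and treating a 3D input as [batch, channels, time] and keeping only the first batch item is an assumption; the PR text only says that 1D/2D/3D inputs are handled.

```python
import torch


def prepare_waveform(waveform: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: return a float32 CPU tensor shaped [channels, time]."""
    # bfloat16/float16 cannot be encoded by torchaudio/ffmpeg, so cast first.
    waveform = waveform.detach().to("cpu", dtype=torch.float32)

    if waveform.dim() == 1:
        # [time] -> [1, time] (treat as mono)
        waveform = waveform.unsqueeze(0)
    elif waveform.dim() == 3:
        # Assumed layout [batch, channels, time]: keep the first batch item.
        waveform = waveform[0]

    if waveform.dim() != 2:
        raise ValueError(
            f"expected a 1D, 2D, or 3D waveform, got shape {tuple(waveform.shape)}"
        )
    return waveform


if __name__ == "__main__":
    # Stand-in for a pipeline output that is already [channels, time] in bf16.
    pipeline_output = torch.zeros(2, 48000, dtype=torch.bfloat16)

    # Indexing [0] again drops the channel dimension and leaves a 1D tensor.
    print(pipeline_output[0].dim())

    fixed = prepare_waveform(pipeline_output)
    print(fixed.dtype, tuple(fixed.shape))
    # The result could then be passed to torchaudio.save(path, fixed, sample_rate).
```

Keeping the cast and the reshape in one small function means the save path no longer depends on whether the pipeline has already squeezed the batch dimension.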

Test plan

  • Train ACE-Step 1.5 XL with low_vram: true, train.dtype: bf16, and sampling enabled; baseline samples now write as valid 180s MP3s and training proceeds.
  • Sanity-check a non-audio (image) model still saves correctly (this branch only touches the audio output_ext path).

Made with Cursor

When saving generated audio samples, torchaudio/ffmpeg cannot encode
bfloat16/float16 waveforms (raises "No format found for dtype ...");
cast the waveform to float32 first. Also, the ACE-Step pipeline already
returns a [channels, time] tensor, so indexing image[0] dropped the
channel dimension and produced a 1D tensor, causing ffmpeg to fail with
"channel_layout=0x0". Normalize the waveform to a 2D [channels, time]
tensor before saving. With the default bf16 training dtype these two
issues prevented any ACE-Step sample from being written.

Co-authored-by: Cursor <cursoragent@cursor.com>