fix: adapt MiMo STT to V2.5 models and reject non-WAV audio payloads#9118
fix: adapt MiMo STT to V2.5 models and reject non-WAV audio payloads#9118tangtaizong666 wants to merge 1 commit into
Conversation
The MiMo-V2 series went offline on 2026-06-30, so the default STT model mimo-v2-omni fails for every default configuration. Switch the default to mimo-v2.5-asr, the dedicated speech recognition model whose official docs use exactly the bare input_audio payload this provider sends. For non-ASR multimodal models such as mimo-v2.5, the audio understanding docs require a text instruction alongside the audio, so restore the system/user transcription prompts for that model family only. Also validate that the resolved audio payload really is RIFF/WAVE before calling the API: when a platform voice file (e.g. Tencent SILK from QQ) slips through the WAV conversion chain unchanged, fail locally with an actionable error instead of the opaque HTTP 400 from the API. Fixes AstrBotDevs#9113
There was a problem hiding this comment.
Code Review
This pull request updates the default MiMo STT model to mimo-v2.5-asr following the deprecation of the v2 series, and implements local validation to reject non-WAV/MP3 payloads (such as Tencent SILK data) before sending them to the API. It also updates the payload construction to dynamically handle dedicated ASR models versus multimodal models, accompanied by comprehensive unit tests. The review feedback correctly identifies a potential bug in the base64 decoding logic of the audio header, where slicing the base64 string can cause padding errors and lead to false-positive validation failures for short audio files.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| try: | ||
| header = base64.b64decode(base64_data[:64]) | ||
| except Exception: | ||
| header = b"" |
There was a problem hiding this comment.
Slicing base64_data[:64] can result in a string whose length is not a multiple of 4 (especially if the audio payload is very short). When the length is not a multiple of 4, base64.b64decode raises a binascii.Error due to incorrect padding. This causes the try-except block to catch the exception and set header = b"", which ultimately raises a false-positive MiMoAPIError even for valid short WAV files. Adding proper padding to the sliced chunk before decoding resolves this issue.
try:
chunk = base64_data[:64]
padding = len(chunk) % 4
if padding:
chunk += "=" * (4 - padding)
header = base64.b64decode(chunk)
except Exception:
header = b""There was a problem hiding this comment.
Hey - I've left some high level feedback:
- The
_is_asr_modelcheck relies on a substring match for"asr"in the model name, which is brittle; consider using an explicit allowlist or a config flag so future model names (e.g.mimo-v3-speech) don’t accidentally get misclassified. _validate_wav_payloadenforces a RIFF/WAVE header on all audio, which will break if the resolver ever emits valid non-WAV formats (mp3/ogg) that the MiMo API accepts; it may be safer to pass in and branch on the resolvedmime_type/format rather than assuming WAV.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `_is_asr_model` check relies on a substring match for `"asr"` in the model name, which is brittle; consider using an explicit allowlist or a config flag so future model names (e.g. `mimo-v3-speech`) don’t accidentally get misclassified.
- `_validate_wav_payload` enforces a RIFF/WAVE header on all audio, which will break if the resolver ever emits valid non-WAV formats (mp3/ogg) that the MiMo API accepts; it may be safer to pass in and branch on the resolved `mime_type`/format rather than assuming WAV.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Fixes #9113
The MiMo STT provider is currently broken for every default configuration, in three stacking ways:
mimo-v2-omni(and the whole MiMo-V2 series) went offline on 2026-06-30 per the official deprecation notice, so requests with the default config always fail.mimo-v2.5, the audio understanding docs require a text instruction alongsideinput_audio; sending audio alone gets HTTP 400.audio/wavand sent anyway — producing exactly the opaque400 ... invalid audio format, only mp3/flac/m4a/wav/ogg are supportederror shown in the issue logs, with no hint about the real local cause.Modifications / 改动点
astrbot/core/provider/sources/mimo_api_common.pyDEFAULT_MIMO_STT_MODEL:mimo-v2-omni→mimo-v2.5-asr, the dedicated speech recognition model in the current model lineup. Chosen overmimo-v2.5because the official ASR docs use exactly the bareinput_audiopayload this provider already sends, and ASR is billed per audio hour rather than per token.DEFAULT_MIMO_STT_SYSTEM_PROMPT/DEFAULT_MIMO_STT_USER_PROMPT(the same strings removed in fix: handle MiMo STT audio and reasoning output #8938), now used only for non-ASR multimodal models — see below. No config surface is re-added.prepare_audio_input()now validates that the resolved payload really is RIFF/WAVE before calling the API. If not, it raisesMiMoAPIErrorlocally with the actual reason, including a dedicated message when the bytes are still un-converted Tencent SILK data (pointing at silk-python).astrbot/core/provider/sources/mimo_stt_api_source.py*asr*models keep the bareinput_audiopayload, matching the ASR docs and preserving the prompt-removal decision from fix: handle MiMo STT audio and reasoning output #8938;mimo-v2.5) get the system prompt + text instruction required by the audio understanding docs.astrbot/core/config/default.pymimo-v2.5-asr.tests/test_mimo_api_sources.pyNotes:
DEFAULT_MIMO_TTS_MODEL = "mimo-v2-tts"is hit by the same V2 shutdown, but this issue and PR are scoped to STT; TTS can be handled in a follow-up.The deeper conversion-chain issue (a
.wav-suffixed file with non-WAV content passesconvert_audio_format()untouched via its extension short-circuit) lives in sharedmedia_utilscode used by all platforms; this PR deliberately guards at the MiMo provider boundary instead of changing shared conversion semantics.This is NOT a breaking change. / 这不是一个破坏性变更。
(Users who explicitly configured a model keep it — only the default value changes, and the old default is already dead upstream.)
Screenshots or Test Results / 运行截图或测试结果
Verification steps:
python -m pytest tests/test_mimo_api_sources.py -q— the 3 new tests fail onmaster(old default model, no prompts formimo-v2.5, SILK bytes not rejected) and pass with this fix:make pr-test-neo(uv sync, repo-wide ruff format/check, Neo tests, startup smoke test):MediaResolverconversion chain (no mocks):Demo script output (default model, both payload shapes, mis-labeled .wav rejected, genuine .wav passes)
Step 4 reproduces the silent failure path from the issue logs (a
.wav-named file whose content never got converted): previously those bytes were sent to the API and came back as the confusing HTTP 400; now the provider fails locally with the real reason before any network call.Checklist / 检查清单
😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
/ 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。
👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
/ 我的更改经过了良好的测试,并已在上方提供了“验证步骤”和“运行截图”。
🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in
requirements.txtandpyproject.toml./ 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到
requirements.txt和pyproject.toml文件相应位置。😮 My changes do not introduce malicious code.
/ 我的更改没有引入恶意代码。
Summary by Sourcery
Update MiMo STT defaults and request shaping to work with current v2.5 models while validating audio payloads before sending them to the API.
Bug Fixes:
Enhancements:
Tests: