fix: adapt MiMo STT to V2.5 models and reject non-WAV audio payloads #9118

Open
tangtaizong666 wants to merge 1 commit into AstrBotDevs:master from tangtaizong666:fix/9113

@tangtaizong666 commented Jul 2, 2026

Fixes #9113

The MiMo STT provider is currently broken for every default configuration, in three stacking ways:

  1. Dead default model. The default mimo-v2-omni (and the whole MiMo-V2 series) went offline on 2026-06-30 per the official deprecation notice, so requests with the default config always fail.
  2. Bare audio is rejected by multimodal models. For the multimodal mimo-v2.5, the audio understanding docs require a text instruction alongside input_audio; sending audio alone gets HTTP 400.
  3. Non-WAV bytes are silently shipped to the API. When a platform voice file (e.g. Tencent SILK from QQ/NapCat) slips through the WAV conversion chain unchanged, the raw bytes are base64-encoded as audio/wav and sent anyway — producing exactly the opaque 400 ... invalid audio format, only mp3/flac/m4a/wav/ogg are supported error shown in the issue logs, with no hint about the real local cause.

Modifications

astrbot/core/provider/sources/mimo_api_common.py

  • DEFAULT_MIMO_STT_MODEL: mimo-v2-omni → mimo-v2.5-asr, the dedicated speech recognition model in the current model lineup. Chosen over mimo-v2.5 because the official ASR docs use exactly the bare input_audio payload this provider already sends, and ASR is billed per audio hour rather than per token.
  • Restored DEFAULT_MIMO_STT_SYSTEM_PROMPT / DEFAULT_MIMO_STT_USER_PROMPT (the same strings removed in fix: handle MiMo STT audio and reasoning output #8938), now used only for non-ASR multimodal models — see below. No config surface is re-added.
  • prepare_audio_input() now validates that the resolved payload really is RIFF/WAVE before calling the API. If not, it raises MiMoAPIError locally with the actual reason, including a dedicated message when the bytes are still un-converted Tencent SILK data (pointing at silk-python).
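
A minimal sketch of the header check described above, assuming the payload is available as a base64 string. The function name `validate_wav_base64`, the constant `SILK_MAGIC`, the stand-in `MiMoAPIError` class, and the error wording are illustrative and are not taken from the PR's diff.

```python
import base64


class MiMoAPIError(Exception):
    """Stand-in for the provider's error type; the real class lives in AstrBot."""


# Tencent SILK voice files start with "#!SILK_V3", optionally preceded by 0x02.
SILK_MAGIC = b"#!SILK_V3"


def validate_wav_base64(base64_data: str) -> None:
    """Raise MiMoAPIError unless the payload begins with a RIFF/WAVE header."""
    # 16 base64 characters decode to exactly 12 bytes: "RIFF", a 4-byte size, "WAVE".
    try:
        header = base64.b64decode(base64_data[:16])
    except ValueError:  # binascii.Error is a ValueError subclass
        header = b""

    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return
    if SILK_MAGIC in header:
        # Hypothetical wording; the PR's real message points at silk-python.
        raise MiMoAPIError(
            "Audio is still Tencent SILK data; it was not converted to WAV "
            "(is silk-python installed?)"
        )
    raise MiMoAPIError("Audio is not RIFF/WAVE (unrecognized audio bytes)")
```

Calling `validate_wav_base64("UklGRiQAAABXQVZF")` with the header bytes used in the demo output returns without raising, while SILK or arbitrary bytes fail before any network call is made.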

astrbot/core/provider/sources/mimo_stt_api_source.py

  • Payload construction is now split by model family:
    • dedicated *asr* models keep the bare input_audio payload, matching the ASR docs and preserving the prompt-removal decision from fix: handle MiMo STT audio and reasoning output #8938;
    • non-ASR multimodal models (e.g. mimo-v2.5) get the system prompt + text instruction required by the audio understanding docs.
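
The split could be sketched roughly as follows. The prompt strings are copied from the demo output in this PR, and the substring test for "asr" is the behavior the review below describes; the function names and the lowercasing step are assumptions rather than the provider's actual code.

```python
DEFAULT_MIMO_STT_SYSTEM_PROMPT = (
    "You are a speech transcription assistant. Transcribe the spoken content "
    "from the audio exactly and return only the transcription text."
)
DEFAULT_MIMO_STT_USER_PROMPT = (
    "Please transcribe the content of the audio and return only the "
    "transcription text."
)


def is_asr_model(model: str) -> bool:
    """Treat any model whose name contains "asr" as a dedicated ASR model."""
    return "asr" in model.lower()


def build_stt_messages(model: str, audio_data_url: str) -> list:
    """Build the chat messages for one transcription request."""
    audio_part = {"type": "input_audio", "input_audio": {"data": audio_data_url}}
    if is_asr_model(model):
        # Dedicated ASR models: bare audio, no prompts.
        return [{"role": "user", "content": [audio_part]}]
    # Multimodal models: system prompt plus a text instruction after the audio.
    return [
        {"role": "system", "content": DEFAULT_MIMO_STT_SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                audio_part,
                {"type": "text", "text": DEFAULT_MIMO_STT_USER_PROMPT},
            ],
        },
    ]
```

Keeping the audio part identical in both branches means only the surrounding prompts differ between model families.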

astrbot/core/config/default.py

  • STT provider template default model → mimo-v2.5-asr.

tests/test_mimo_api_sources.py

  • Three new regression tests: default model value, multimodal payload shape, and non-WAV payload rejection. STT test fixtures now use genuine RIFF/WAVE header bytes instead of arbitrary base64.
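
One way to produce a genuine RIFF/WAVE fixture is Python's standard `wave` module, as in the sketch below. The helper names `make_wav_bytes` and `to_data_url` are hypothetical; the PR's tests may build their fixtures differently.

```python
import base64
import io
import wave


def make_wav_bytes(seconds: float = 0.1, rate: int = 16000) -> bytes:
    """Return a short mono 16-bit PCM WAV file containing silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


def to_data_url(wav_bytes: bytes) -> str:
    """Encode WAV bytes as the data URL shape shown in the demo payloads."""
    return "data:audio/wav;base64," + base64.b64encode(wav_bytes).decode("ascii")
```

Because the bytes come from a real WAV writer, a fixture built this way passes a RIFF/WAVE header check, unlike arbitrary base64 text.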

Notes:

  • DEFAULT_MIMO_TTS_MODEL = "mimo-v2-tts" is hit by the same V2 shutdown, but this issue and PR are scoped to STT; TTS can be handled in a follow-up.

  • The deeper conversion-chain issue (a .wav-suffixed file with non-WAV content passes convert_audio_format() untouched via its extension short-circuit) lives in shared media_utils code used by all platforms; this PR deliberately guards at the MiMo provider boundary instead of changing shared conversion semantics.

  • This is NOT a breaking change.

(Users who explicitly configured a model keep it — only the default value changes, and the old default is already dead upstream.)

Screenshots or Test Results

Verification steps:

  1. python -m pytest tests/test_mimo_api_sources.py -q — the 3 new tests fail on master (old default model, no prompts for mimo-v2.5, SILK bytes not rejected) and pass with this fix:
master:   3 failed, 15 passed
this PR: 18 passed, 1 warning in 2.95s
  2. Related audio/media suites stay green:
$ python -m pytest tests/test_media_utils.py tests/test_platform_audio_media_resolver.py tests/test_agent_runner_media_resolver.py -q
56 passed, 3 warnings in 6.59s
  3. make pr-test-neo (uv sync, repo-wide ruff format/check, Neo tests, startup smoke test):
==> Running Ruff format check
478 files already formatted
==> Running Ruff lint check
All checks passed!
==> Running pytest
12 passed, 1 warning in 4.47s
==> Starting smoke test on http://localhost:6185
==> Smoke test passed
==> PR checks completed successfully
  4. End-to-end behavior demo through the real MediaResolver conversion chain (no mocks):
Demo script output (default model, both payload shapes, mis-labeled .wav rejected, genuine .wav passes)
=== 1) Default STT model (was: mimo-v2-omni, offline since 2026-06-30) ===
default model: mimo-v2.5-asr

=== 2) Payload for dedicated ASR model (bare audio per ASR docs) ===
[
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "data:audio/wav;base64,UklGRiQAAABXQVZF"
        }
      }
    ]
  }
]

=== 3) Payload for multimodal mimo-v2.5 (system + text per audio-understanding docs) ===
[
  {
    "role": "system",
    "content": "You are a speech transcription assistant. Transcribe the spoken content from the audio exactly and return only the transcription text."
  },
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "data:audio/wav;base64,UklGRiQAAABXQVZF"
        }
      },
      {
        "type": "text",
        "text": "Please transcribe the content of the audio and return only the transcription text."
      }
    ]
  }
]

=== 4) Mis-labeled .wav (unrecognized bytes) now fails fast locally ===
MiMoAPIError: Audio for MiMo STT could not be converted to WAV (unrecognized audio bytes): local media path name='tmpiy6u1x1s.wav' len=20

=== 5) Genuine WAV still passes the real conversion chain ===
OK, data url head: data:audio/wav;base64,UklGRqQMAABXQVZFZm10IBAAAAABAAEAgD4AAA

Step 4 reproduces the silent failure path from the issue logs (a .wav-named file whose content never got converted): previously those bytes were sent to the API and came back as the confusing HTTP 400; now the provider fails locally with the real reason before any network call.


Checklist

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.

  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.

  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.

  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Update MiMo STT defaults and request shaping to work with current v2.5 models while validating audio payloads before sending them to the API.

Bug Fixes:

  • Switch the default MiMo STT model to the v2.5 ASR model so default configurations use a live, dedicated speech recognition model.
  • Adjust multimodal MiMo STT requests so non-ASR models receive both audio and required transcription prompts, preventing API rejections.
  • Validate outgoing STT audio as RIFF/WAVE and raise local errors, including a specific message for unconverted Tencent SILK data, instead of silently sending invalid bytes to MiMo.

Enhancements:

  • Introduce shared STT system and user prompts for use with non-ASR multimodal MiMo models.

Tests:

  • Add regression tests for the new default STT model, multimodal payload structure, and rejection of non-WAV audio payloads, updating fixtures to use real WAV headers.

The MiMo-V2 series went offline on 2026-06-30, so the default STT model
mimo-v2-omni fails for every default configuration. Switch the default
to mimo-v2.5-asr, the dedicated speech recognition model whose official
docs use exactly the bare input_audio payload this provider sends.

For non-ASR multimodal models such as mimo-v2.5, the audio understanding
docs require a text instruction alongside the audio, so restore the
system/user transcription prompts for that model family only.

Also validate that the resolved audio payload really is RIFF/WAVE before
calling the API: when a platform voice file (e.g. Tencent SILK from QQ)
slips through the WAV conversion chain unchanged, fail locally with an
actionable error instead of the opaque HTTP 400 from the API.

Fixes AstrBotDevs#9113
@dosubot (bot) added the labels size:L (This PR changes 100-499 lines, ignoring generated files) and area:provider (The bug / feature is about AI Provider, Models, LLM Agent, LLM Agent Runner) on Jul 2, 2026.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the default MiMo STT model to mimo-v2.5-asr following the deprecation of the v2 series, and implements local validation to reject non-WAV/MP3 payloads (such as Tencent SILK data) before sending them to the API. It also updates the payload construction to dynamically handle dedicated ASR models versus multimodal models, accompanied by comprehensive unit tests. The review feedback correctly identifies a potential bug in the base64 decoding logic of the audio header, where slicing the base64 string can cause padding errors and lead to false-positive validation failures for short audio files.

Comment on lines +98 to +101

    try:
        header = base64.b64decode(base64_data[:64])
    except Exception:
        header = b""

medium

Slicing base64_data[:64] can result in a string whose length is not a multiple of 4 (especially if the audio payload is very short). When the length is not a multiple of 4, base64.b64decode raises a binascii.Error due to incorrect padding. This causes the try-except block to catch the exception and set header = b"", which ultimately raises a false-positive MiMoAPIError even for valid short WAV files. Adding proper padding to the sliced chunk before decoding resolves this issue.

    try:
        chunk = base64_data[:64]
        padding = len(chunk) % 4
        if padding:
            chunk += "=" * (4 - padding)
        header = base64.b64decode(chunk)
    except Exception:
        header = b""
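
To see when the padding matters, here is a small standalone sketch; the helper name `decode_header` is hypothetical. Since 64 is a multiple of 4, a correctly padded base64 string stays decodable after slicing, so the extra padding step mainly protects against input whose trailing `=` characters were stripped.

```python
import base64


def decode_header(base64_data: str, chars: int = 64) -> bytes:
    """Decode the first `chars` base64 characters, tolerating missing padding."""
    chunk = base64_data[:chars]
    remainder = len(chunk) % 4
    if remainder:
        # Restore the "=" padding that b64decode expects.
        chunk += "=" * (4 - remainder)
    try:
        return base64.b64decode(chunk)
    except ValueError:  # binascii.Error is a ValueError subclass
        return b""


wav_head = b"RIFF$\x00\x00\x00WAVEfmt "
padded = base64.b64encode(wav_head).decode("ascii")  # 24 characters, ends with "=="
unpadded = padded.rstrip("=")                        # 22 characters
```

Both `decode_header(padded)` and `decode_header(unpadded)` return the original 16 bytes, whereas calling `base64.b64decode(unpadded)` directly raises `binascii.Error`.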

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • The _is_asr_model check relies on a substring match for "asr" in the model name, which is brittle; consider using an explicit allowlist or a config flag so future model names (e.g. mimo-v3-speech) don’t accidentally get misclassified.
  • _validate_wav_payload enforces a RIFF/WAVE header on all audio, which will break if the resolver ever emits valid non-WAV formats (mp3/ogg) that the MiMo API accepts; it may be safer to pass in and branch on the resolved mime_type/format rather than assuming WAV.
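
Both suggestions could look roughly like the sketch below. The allowlist contents and the names `ASR_MODELS`, `is_asr_model`, and `sniff_audio_format` are hypothetical; the magic-byte checks cover the formats listed in the API error message quoted in the PR description (mp3/flac/m4a/wav/ogg).

```python
from typing import Optional

# Hypothetical allowlist; a config flag would serve the same purpose.
ASR_MODELS = frozenset({"mimo-v2.5-asr"})


def is_asr_model(model: str) -> bool:
    """Exact-match check instead of a substring match on "asr"."""
    return model.strip().lower() in ASR_MODELS


def sniff_audio_format(header: bytes) -> Optional[str]:
    """Guess the container format from the first bytes of the audio."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"OggS":
        return "ogg"
    if header[:4] == b"fLaC":
        return "flac"
    if header[4:8] == b"ftyp":
        return "m4a"  # MP4-family container
    if header[:3] == b"ID3":
        return "mp3"  # ID3v2 tag precedes the audio frames
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"  # bare MPEG frame sync
    return None
```

A validator built on `sniff_audio_format` could reject only payloads that return `None` (for example un-converted SILK bytes) and set the data URL's MIME type from the detected format instead of assuming WAV.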

Development

Successfully merging this pull request may close these issues.

[Bug] mimo-v2.5 STT API request is missing the system prompt and user text, causing audio transcription to return 400