Fix VAD eou detection is exiting early#4988
Fix VAD eou detection is exiting early#4988hudson-worden wants to merge 7 commits intolivekit:mainfrom
Conversation
…ng transcript, even though vad doesn't operate on transcripts.
|
we're currently working on addressing the test failures |
41bcd7f to
43096b7
Compare
|
|
||
| chat_ctx = chat_ctx.copy() | ||
| chat_ctx.add_message(role="user", content=self._audio_transcript) | ||
| if self._audio_transcript: |
There was a problem hiding this comment.
I believe chat_ctx only really ends up in observability / otel traces so that's the impact here
|
|
||
| if self._session._closing: | ||
| # add user input to chat context | ||
| if self._session._closing and info.new_transcript: |
There was a problem hiding this comment.
This is what actually fixed the tests.
There was a problem hiding this comment.
🟡 Missing empty-transcript guard at second _scheduling_paused check during session closing
The PR adds an info.new_transcript guard at line 1456 to prevent injecting blank user messages into the chat context when scheduling is paused and the session is closing. However, an analogous code path at line 1549 in _user_turn_completed_task was not updated with the same guard.
Root Cause and Impact
With the audio_recognition.py change (line 525), _run_eou_detection no longer returns early in VAD mode when STT is enabled but no transcript has arrived yet. This means on_end_of_turn can now be called with info.new_transcript == "". The PR correctly guards line 1456:
if self._session._closing and info.new_transcript:But the same pattern at agent_activity.py:1549 is left unguarded:
if self._session._closing:
self._agent._chat_ctx.items.append(user_message)
self._session._conversation_item_added(user_message)This path is reachable when _scheduling_paused is False during the synchronous on_end_of_turn (line 1449), allowing the async _user_turn_completed_task to be created at line 1484, but then _scheduling_paused becomes True before the async task executes line 1544. In that case, a user_message with empty content (content=[""]) is appended to the chat context, which is exactly the blank-message problem the PR is trying to fix.
Note that line 1523 already has the correct guard (if info.new_transcript != "":), confirming this is the intended pattern.
(Refers to lines 1549-1551)
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
happy to add it, but need some more context from a reviewer if this is necessary
|
|
||
| def _run_eou_detection(self, chat_ctx: llm.ChatContext, skip_reply: bool = False) -> None: | ||
| if self._stt and not self._audio_transcript and self._turn_detection_mode != "manual": | ||
| if self._stt and not self._audio_transcript and self._turn_detection_mode == "stt": |
There was a problem hiding this comment.
we have to wait for transcript because our turn detector utilizes text input to help determine end of turn.
There was a problem hiding this comment.
Hi @davidzhao, thanks for the reply are you referring to this?
There was a problem hiding this comment.
If that's the case, then I think based on this I can make a change
There was a problem hiding this comment.
yes, you can see it here, where it performs a EOU inference depending on recent chat context.
There was a problem hiding this comment.
cool then I think tightening up the guard should work!
| if ( | ||
| self._stt | ||
| and not self._audio_transcript | ||
| and ( | ||
| # if turn detection is based on stt | ||
| # OR if a turn detector is provided (e.g the MultilingualModel) | ||
| self._turn_detection_mode == "stt" or self._turn_detector is not None | ||
| ) | ||
| ): |
There was a problem hiding this comment.
this isn't quite right.. I would recommend writing it like this for readability:
if self._stt and not self.audio_transcript:
if self._turn_detection_mode == "stt":
return
if self._turn_detection_mode == "vad" and self._turn_detector is None:
returnThere was a problem hiding this comment.
I updated the if statement to be more readable, though it's different than the way you have it
if self._stt and not self._audio_transcript:
if self._turn_detection_mode == "stt":
# stt enabled but no transcript yet
return
if self._turn_detector is not None:
# a turn detector like (MultilingualModel) is provided but no transcript yet
return
in this case you mentioned
if self._turn_detection_mode == "vad" and self._turn_detector is None:
return
That is the case that we do want to pass through since no transcript is needed in that case. Unless I'm mistaken?
There was a problem hiding this comment.
with manual mode, it should always proceed immediately. so either explicitly checking for vad, or doing != "manual" like it was doing before
0d055ed to
a25deee
Compare
| def _eou_requires_transcript(self) -> bool: | ||
| if self._stt: | ||
| # while we aren't checking _turn_detector here, | ||
| # _turn_detector and _turn_detection_mode are mutually exclusive (such that if one is provided, the other must be None) | ||
| # e.g. if _turn_detector is provided, _turn_detection_mode is None, and vice versa | ||
| match self._turn_detection_mode: | ||
| case "stt" | "realtime_llm" | None: | ||
| return True | ||
| case "manual" | "vad": | ||
| return False | ||
| case _: | ||
| # If not specified then we assume it requires transcript | ||
| return True | ||
| else: | ||
| return False |
There was a problem hiding this comment.
Ok made this into it's own function to enumerate all of the possible values
There was a problem hiding this comment.
Before everything except for manual was essentially going through
the first case statement.
| match self._turn_detection_mode: | ||
| case "stt" | "realtime_llm" | None: | ||
| return True | ||
| case "manual" | "vad": |
There was a problem hiding this comment.
The addition of vad here is the fix
|
Maybe I am missing something. The current code skips EOU for VAD EOS without changing start speaking time, because STT transcript for that VAD speech might arrive later. When it arrives later, we call the same EOU function with the new transcript, at which time, it will be committed with the same start speaking time and reset correctly. |
| case "manual" | "vad": | ||
| return False |
There was a problem hiding this comment.
🟡 VAD mode with STT can send empty user messages to the LLM
The new _eou_requires_transcript() method correctly returns False for "vad" mode (allowing EOU detection to proceed without a transcript), but downstream code in _user_turn_completed_task (agent_activity.py:1511-1515) unconditionally creates a user_message with content=[info.new_transcript] where info.new_transcript can be "". This message is then passed to _generate_reply, potentially sending an empty user message to the LLM.
How this path is reached
When VAD mode is used with STT enabled:
- VAD detects END_OF_SPEECH before any STT transcript arrives →
_run_eou_detectionis called with_audio_transcript = "" _eou_requires_transcript()returnsFalsefor mode"vad"→ no early return (old code DID early return here because"vad" != "manual"was True)_run_eou_detectioncorrectly skips adding a blank message tochat_ctx(audio_recognition.py:548), but the_bounce_eou_taskstill fireson_end_of_turnwithnew_transcript=""on_end_of_turncreates_user_turn_completed_taskwhich builds auser_messagewith content[""]and proceeds to generate a reply
The min_interruption_words guard at agent_activity.py:1469-1481 only catches this when there IS a current interruptible speech. When the agent is idle, the empty message flows through unchecked.
Prompt for agents
In livekit-agents/livekit/agents/voice/audio_recognition.py, the _eou_requires_transcript() method returns False for 'vad' mode (line 535-536), which allows _run_eou_detection to proceed with an empty transcript. While this is correct for allowing VAD-based turn detection, the downstream consumer on_end_of_turn in agent_activity.py (line 1445) and _user_turn_completed_task (line 1491) don't guard against empty transcripts. Either:
1. In _eou_requires_transcript, return True for 'vad' mode when STT exists (reverting to old behavior for the 'no transcript yet' case), OR
2. In agent_activity.py _user_turn_completed_task (around line 1511-1515), add a guard to skip generating a reply when info.new_transcript is empty and the LLM is not a RealtimeModel, OR
3. In _run_eou_detection around line 548, when self._audio_transcript is empty and mode is 'vad', pass the empty transcript info but set skip_reply=True in the _EndOfTurnInfo so that on_end_of_turn knows not to trigger LLM generation.
Was this helpful? React with 👍 or 👎 to provide feedback.
|
I'm currently testing a few hypotheses after this comment. I still think something is up, but it may be b/c of something different than what I'm proposing here. I'm thinking if we went ahead with allowing VAD to pass through and do EOU detection (like this PR), we'd run into situations where STT would have provided a more complete I'm thinking the issue may be related to this instead. |
|
@chenghao-mou @davidzhao - Thanks for the review. This may have been misguided so I'll go ahead and close this PR. I've opened a separate PR with the fix for this particular issue. |
Issue:
There seems to be a case where _stt is defined and there's no transcript yet, but the
_turn_detection_modeis set to "vad". VAD doesn't need a transcript for eou detection, so the consequence is we're exiting earlier than necessary. And the consequence of that is thestarted_speaking_atis older than it should be b/c it's never reset hereHere's what impact the bug has. See the context below on how I obtained this.
It's showing that we invoked the skipped branch
Context:
Related issue
I've been looking at ways to improve these two attributes with
MetricsReport(started_speaking_atandstopped_speaking_at). I've been auditing discrepancies between recordings and where these values are set by using some custom otel tracing to visualize thestarted_speaking_atandstopped_speaking_atof the MetricsReport on the conversation items.pseudo-code
I'm also visualizing key decision points using custom spans in livekit.