Fix VAD eou detection is exiting early#4988

Closed
hudson-worden wants to merge 7 commits into livekit:main from hudson-worden:fix_eou_early_exit_vad_stt

Conversation

@hudson-worden hudson-worden commented Mar 3, 2026

Issue:
There seems to be a case where _stt is defined and there's no transcript yet, but _turn_detection_mode is set to "vad". VAD doesn't need a transcript for EOU detection, so we exit earlier than necessary. As a result, started_speaking_at is older than it should be, because it's never reset here.

Here's what impact the bug has. See the context below on how I obtained this.

[screenshot: otel trace visualization showing that the skipped branch was invoked]
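To make the failure mode concrete, here is a toy simulation (illustrative only, with hypothetical names — not livekit-agents code) of how skipping EOU detection leaves started_speaking_at stale:

```python
# Toy model of the timestamp bookkeeping described above: started_speaking_at
# is only reset when a turn is actually committed, so an early exit from EOU
# detection leaves it pointing at the previous utterance.
class TurnTracker:
    def __init__(self):
        self.started_speaking_at = None

    def on_start_of_speech(self, now: float) -> None:
        # only record the first start of the current (uncommitted) turn
        if self.started_speaking_at is None:
            self.started_speaking_at = now

    def on_end_of_turn(self, committed: bool) -> None:
        if committed:
            self.started_speaking_at = None  # reset for the next turn

tracker = TurnTracker()
tracker.on_start_of_speech(now=100.0)
tracker.on_end_of_turn(committed=False)  # EOU detection exited early: no reset
tracker.on_start_of_speech(now=200.0)    # new speech reuses the stale value
print(tracker.started_speaking_at)       # 100.0, older than it should be
```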

Context:
Related issue
I've been looking at ways to improve these two attributes with MetricsReport (started_speaking_at and stopped_speaking_at). I've been auditing discrepancies between recordings and where these values are set by using some custom otel tracing to visualize the started_speaking_at and stopped_speaking_at of the MetricsReport on the conversation items.

pseudo-code

from opentelemetry import trace

...

def _on_conversation_item_added(self, msg_ev: ConversationItemAddedEvent) -> None:
    if msg_ev.item.type == "message":
        tracer = trace.get_tracer(__name__)
        item_copy = msg_ev.item.copy()
        started_at = item_copy.metrics.get("started_speaking_at")
        stopped_at = item_copy.metrics.get("stopped_speaking_at")
        if started_at is not None and stopped_at is not None:
            start_ns = int(started_at * 1_000_000_000)
            end_ns = int(stopped_at * 1_000_000_000)

            span_name = "chat_item_message_assistant" if item_copy.role == "assistant" else "chat_item_message_user"
            if item_copy.interrupted:
                span_name += "_interrupted"
            span = tracer.start_span(
                span_name,
                start_time=start_ns,
            )
            span.set_attribute("item_role", item_copy.role)
            span.set_attribute("item_id", item_copy.id)
            span.set_attribute("item_type", item_copy.type)
            span.set_attribute("item_content", item_copy.content)
            span.set_attribute("item_interrupted", item_copy.interrupted)
            span.end(end_time=end_ns)

I'm also visualizing key decision points using custom spans in livekit.

…ng transcript, even though vad doesn't operate on transcripts.
@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

hudson-worden commented Mar 3, 2026

we're currently working on addressing the test failures

@hudson-worden hudson-worden force-pushed the fix_eou_early_exit_vad_stt branch from 41bcd7f to 43096b7 Compare March 4, 2026 00:25

chat_ctx = chat_ctx.copy()
chat_ctx.add_message(role="user", content=self._audio_transcript)
if self._audio_transcript:
Contributor Author

I believe chat_ctx only really ends up in observability / otel traces so that's the impact here


if self._session._closing:
# add user input to chat context
if self._session._closing and info.new_transcript:
Contributor Author

This is what actually fixed the tests.

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 4 additional findings in Devin Review.


🟡 Missing empty-transcript guard at second _scheduling_paused check during session closing

The PR adds an info.new_transcript guard at line 1456 to prevent injecting blank user messages into the chat context when scheduling is paused and the session is closing. However, an analogous code path at line 1549 in _user_turn_completed_task was not updated with the same guard.

Root Cause and Impact

With the audio_recognition.py change (line 525), _run_eou_detection no longer returns early in VAD mode when STT is enabled but no transcript has arrived yet. This means on_end_of_turn can now be called with info.new_transcript == "". The PR correctly guards line 1456:

if self._session._closing and info.new_transcript:

But the same pattern at agent_activity.py:1549 is left unguarded:

if self._session._closing:
    self._agent._chat_ctx.items.append(user_message)
    self._session._conversation_item_added(user_message)

This path is reachable when _scheduling_paused is False during the synchronous on_end_of_turn (line 1449), allowing the async _user_turn_completed_task to be created at line 1484, but then _scheduling_paused becomes True before the async task executes line 1544. In that case, a user_message with empty content (content=[""]) is appended to the chat context, which is exactly the blank-message problem the PR is trying to fix.

Note that line 1523 already has the correct guard (if info.new_transcript != "":), confirming this is the intended pattern.

(Refers to lines 1549-1551)


Contributor Author

happy to add it, but need some more context from a reviewer if this is necessary

@chenghao-mou chenghao-mou self-assigned this Mar 4, 2026

def _run_eou_detection(self, chat_ctx: llm.ChatContext, skip_reply: bool = False) -> None:
-    if self._stt and not self._audio_transcript and self._turn_detection_mode != "manual":
+    if self._stt and not self._audio_transcript and self._turn_detection_mode == "stt":
Member

we have to wait for the transcript because our turn detector utilizes text input to help determine end of turn.

Contributor Author

Hi @davidzhao, thanks for the reply. Are you referring to this?

Contributor Author

If that's the case, then I think based on this I can make a change

Member

yes, you can see it here, where it performs a EOU inference depending on recent chat context.


Comment on lines +525 to +533
if (
    self._stt
    and not self._audio_transcript
    and (
        # if turn detection is based on stt
        # OR if a turn detector is provided (e.g the MultilingualModel)
        self._turn_detection_mode == "stt" or self._turn_detector is not None
    )
):
Member

this isn't quite right. I would recommend writing it like this for readability:

if self._stt and not self.audio_transcript:
    if self._turn_detection_mode == "stt":
        return
    if self._turn_detection_mode == "vad" and self._turn_detector is None:
        return

Contributor Author
@hudson-worden hudson-worden Mar 5, 2026

I updated the if statement to be more readable, though it's different than the way you have it

a25deee

  if self._stt and not self._audio_transcript:
      if self._turn_detection_mode == "stt":
          # stt enabled but no transcript yet
          return
      if self._turn_detector is not None:
          # a turn detector like (MultilingualModel) is provided but no transcript yet
          return

in this case you mentioned

if self._turn_detection_mode == "vad" and self._turn_detector is None:
    return

That is the case that we do want to pass through since no transcript is needed in that case. Unless I'm mistaken?

Member

with manual mode, it should always proceed immediately. so either explicitly checking for vad, or doing != "manual" like it was doing before

@hudson-worden hudson-worden force-pushed the fix_eou_early_exit_vad_stt branch from 0d055ed to a25deee Compare March 5, 2026 16:16
Comment on lines +527 to +541
def _eou_requires_transcript(self) -> bool:
    if self._stt:
        # while we aren't checking _turn_detector here,
        # _turn_detector and _turn_detection_mode are mutually exclusive (such that if one is provided, the other must be None)
        # e.g. if _turn_detector is provided, _turn_detection_mode is None, and vice versa
        match self._turn_detection_mode:
            case "stt" | "realtime_llm" | None:
                return True
            case "manual" | "vad":
                return False
            case _:
                # If not specified then we assume it requires transcript
                return True
    else:
        return False
Contributor Author

Ok, made this into its own function to enumerate all of the possible values

Contributor Author

Before, everything except for manual was essentially going through the first case statement.

match self._turn_detection_mode:
    case "stt" | "realtime_llm" | None:
        return True
    case "manual" | "vad":
Contributor Author

The addition of vad here is the fix

@chenghao-mou
Member

Maybe I am missing something.

The current code skips EOU for VAD EOS without changing start speaking time, because STT transcript for that VAD speech might arrive later. When it arrives later, we call the same EOU function with the new transcript, at which time, it will be committed with the same start speaking time and reset correctly.

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 5 additional findings in Devin Review.

Comment on lines +535 to +536
case "manual" | "vad":
    return False

🟡 VAD mode with STT can send empty user messages to the LLM

The new _eou_requires_transcript() method correctly returns False for "vad" mode (allowing EOU detection to proceed without a transcript), but downstream code in _user_turn_completed_task (agent_activity.py:1511-1515) unconditionally creates a user_message with content=[info.new_transcript] where info.new_transcript can be "". This message is then passed to _generate_reply, potentially sending an empty user message to the LLM.

How this path is reached

When VAD mode is used with STT enabled:

  1. VAD detects END_OF_SPEECH before any STT transcript arrives → _run_eou_detection is called with _audio_transcript = ""
  2. _eou_requires_transcript() returns False for mode "vad" → no early return (old code DID early return here because "vad" != "manual" was True)
  3. _run_eou_detection correctly skips adding a blank message to chat_ctx (audio_recognition.py:548), but the _bounce_eou_task still fires on_end_of_turn with new_transcript=""
  4. on_end_of_turn creates _user_turn_completed_task which builds a user_message with content [""] and proceeds to generate a reply

The min_interruption_words guard at agent_activity.py:1469-1481 only catches this when there IS a current interruptible speech. When the agent is idle, the empty message flows through unchecked.

Prompt for agents
In livekit-agents/livekit/agents/voice/audio_recognition.py, the _eou_requires_transcript() method returns False for 'vad' mode (line 535-536), which allows _run_eou_detection to proceed with an empty transcript. While this is correct for allowing VAD-based turn detection, the downstream consumer on_end_of_turn in agent_activity.py (line 1445) and _user_turn_completed_task (line 1491) don't guard against empty transcripts. Either:

1. In _eou_requires_transcript, return True for 'vad' mode when STT exists (reverting to old behavior for the 'no transcript yet' case), OR
2. In agent_activity.py _user_turn_completed_task (around line 1511-1515), add a guard to skip generating a reply when info.new_transcript is empty and the LLM is not a RealtimeModel, OR
3. In _run_eou_detection around line 548, when self._audio_transcript is empty and mode is 'vad', pass the empty transcript info but set skip_reply=True in the _EndOfTurnInfo so that on_end_of_turn knows not to trigger LLM generation.
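Option 2 above could be sketched as a small guard (names taken from the finding text; hypothetical, not verified against agent_activity.py):

```python
def should_generate_reply(new_transcript: str, llm_is_realtime: bool) -> bool:
    # Skip LLM generation when VAD fired end-of-turn before any STT transcript
    # arrived; a realtime model consumes audio directly, so it can still reply
    # even without text.
    if not new_transcript and not llm_is_realtime:
        return False
    return True
```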

hudson-worden commented Mar 5, 2026

I'm currently testing a few hypotheses after this comment.

I still think something is up, but it may be because of something different from what I'm proposing here. I'm thinking that if we went ahead with allowing VAD to pass through and do EOU detection (like this PR), we'd run into situations where STT would have provided a more complete _audio_transcript after VAD declared END_OF_SPEECH?

I'm thinking the issue may be related to this instead.

@hudson-worden
Contributor Author

@chenghao-mou @davidzhao - Thanks for the review. This may have been misguided so I'll go ahead and close this PR.

I've opened a separate PR with the fix for this particular issue.

