Fix VAD eou detection is exiting early by hudson-worden · Pull Request #4988 · livekit/agents

hudson-worden · 2026-03-03T16:58:36Z

Issue:
There seems to be a case where _stt is defined and no transcript has arrived yet, but _turn_detection_mode is set to "vad". VAD doesn't need a transcript for eou detection, so we're exiting earlier than necessary. As a result, started_speaking_at is older than it should be, because it's never reset here.

Here's the impact of the bug. See the context below for how I obtained this.

It's showing that we invoked the skipped branch

Context:
Related issue
I've been looking at ways to improve two MetricsReport attributes: started_speaking_at and stopped_speaking_at. I've been auditing discrepancies between recordings and where these values are set, using custom otel tracing to visualize the started_speaking_at and stopped_speaking_at of the MetricsReport on the conversation items.

pseudo-code

from opentelemetry import trace

...

def _on_conversation_item_added(self, msg_ev: ConversationItemAddedEvent) -> None:
    if msg_ev.item.type == "message":
        tracer = trace.get_tracer(__name__)
        item_copy = msg_ev.item.copy()
        started_at = item_copy.metrics.get("started_speaking_at")
        stopped_at = item_copy.metrics.get("stopped_speaking_at")
        if started_at is not None and stopped_at is not None:
            start_ns = int(started_at * 1_000_000_000)
            end_ns = int(stopped_at * 1_000_000_000)

            span_name = "chat_item_message_assistant" if item_copy.role == "assistant" else "chat_item_message_user"
            if item_copy.interrupted:
                span_name += "_interrupted"
            span = tracer.start_span(
                span_name,
                start_time=start_ns,
            )
            span.set_attribute("item_role", item_copy.role)
            span.set_attribute("item_id", item_copy.id)
            span.set_attribute("item_type", item_copy.type)
            span.set_attribute("item_content", item_copy.content)
            span.set_attribute("item_interrupted", item_copy.interrupted)
            span.end(end_time=end_ns)

I'm also visualizing key decision points using custom spans in livekit.

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

hudson-worden · 2026-03-03T20:20:27Z

we're currently working on addressing the test failures

hudson-worden · 2026-03-04T00:27:40Z

livekit-agents/livekit/agents/voice/audio_recognition.py


        chat_ctx = chat_ctx.copy()
-        chat_ctx.add_message(role="user", content=self._audio_transcript)
+        if self._audio_transcript:


I believe chat_ctx only really ends up in observability / otel traces so that's the impact here

hudson-worden · 2026-03-04T00:27:50Z

livekit-agents/livekit/agents/voice/agent_activity.py


-            if self._session._closing:
-                # add user input to chat context
+            if self._session._closing and info.new_transcript:


This is what actually fixed the tests.

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-03-04T00:30:31Z

livekit-agents/livekit/agents/voice/agent_activity.py

🟡 Missing empty-transcript guard at second _scheduling_paused check during session closing

The PR adds an info.new_transcript guard at line 1456 to prevent injecting blank user messages into the chat context when scheduling is paused and the session is closing. However, an analogous code path at line 1549 in _user_turn_completed_task was not updated with the same guard.

Root Cause and Impact

With the audio_recognition.py change (line 525), _run_eou_detection no longer returns early in VAD mode when STT is enabled but no transcript has arrived yet. This means on_end_of_turn can now be called with info.new_transcript == "". The PR correctly guards line 1456:

if self._session._closing and info.new_transcript:

But the same pattern at agent_activity.py:1549 is left unguarded:

if self._session._closing:
    self._agent._chat_ctx.items.append(user_message)
    self._session._conversation_item_added(user_message)

This path is reachable when _scheduling_paused is False during the synchronous on_end_of_turn (line 1449), allowing the async _user_turn_completed_task to be created at line 1484, but then _scheduling_paused becomes True before the async task executes line 1544. In that case, a user_message with empty content (content=[""]) is appended to the chat context, which is exactly the blank-message problem the PR is trying to fix.

Note that line 1523 already has the correct guard (if info.new_transcript != "":), confirming this is the intended pattern.

(Refers to lines 1549-1551)
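
The race described in this comment can be reproduced in isolation. The following is a minimal sketch, not LiveKit code: `FakeActivity`, its fields, and `simulate` are hypothetical stand-ins that only mimic the shape of a synchronous `on_end_of_turn` scheduling an async task. It illustrates why a transcript guard checked at scheduling time may need to be repeated inside the task.

```python
import asyncio


class FakeActivity:
    """Hypothetical stand-in for the real activity object (not LiveKit code)."""

    def __init__(self) -> None:
        self.scheduling_paused = False
        self.closing = False
        self.chat_items = []

    def on_end_of_turn(self, new_transcript):
        # Synchronous check: scheduling is not paused yet, so a task is created.
        if self.scheduling_paused:
            return None
        return asyncio.ensure_future(self._user_turn_completed(new_transcript))

    async def _user_turn_completed(self, new_transcript):
        await asyncio.sleep(0)  # yield once, as real async work would
        if self.scheduling_paused:
            # The state may have changed since the task was created, so the
            # empty-transcript guard is applied again here.
            if self.closing and new_transcript:
                self.chat_items.append(new_transcript)
            return
        self.chat_items.append(new_transcript)


async def simulate(transcript):
    activity = FakeActivity()
    task = activity.on_end_of_turn(transcript)
    # The flags flip after the task was created but before it runs.
    activity.scheduling_paused = True
    activity.closing = True
    if task is not None:
        await task
    return activity.chat_items
```

With the guard in place, `asyncio.run(simulate(""))` leaves the chat items empty, while a non-empty transcript is still appended. Whether the real code path is reachable in this order is the open question raised in the comment above.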

happy to add it, but I need some more context from a reviewer on whether this is necessary

davidzhao · 2026-03-05T06:32:05Z

livekit-agents/livekit/agents/voice/audio_recognition.py


    def _run_eou_detection(self, chat_ctx: llm.ChatContext, skip_reply: bool = False) -> None:
-        if self._stt and not self._audio_transcript and self._turn_detection_mode != "manual":
+        if self._stt and not self._audio_transcript and self._turn_detection_mode == "stt":


we have to wait for transcript because our turn detector utilizes text input to help determine end of turn.

Hi @davidzhao, thanks for the reply. Are you referring to this?

If that's the case, then I think based on this I can make a change

yes, you can see it here, where it performs an EOU inference depending on recent chat context.

cool then I think tightening up the guard should work!

davidzhao · 2026-03-05T15:58:53Z

livekit-agents/livekit/agents/voice/audio_recognition.py

+        if (
+            self._stt
+            and not self._audio_transcript
+            and (
+                # if turn detection is based on stt
+                # OR if a turn detector is provided (e.g the MultilingualModel)
+                self._turn_detection_mode == "stt" or self._turn_detector is not None
+            )
+        ):


this isn't quite right.. I would recommend writing it like this for readability:

if self._stt and not self.audio_transcript:
    if self._turn_detection_mode == "stt":
        return
    if self._turn_detection_mode == "vad" and self._turn_detector is None:
        return

I updated the if statement to be more readable, though it's different than the way you have it

a25deee

if self._stt and not self._audio_transcript:
    if self._turn_detection_mode == "stt":
        # stt enabled but no transcript yet
        return
    if self._turn_detector is not None:
        # a turn detector like (MultilingualModel) is provided but no transcript yet
        return

in this case you mentioned

if self._turn_detection_mode == "vad" and self._turn_detector is None:
    return

That is the case that we do want to pass through since no transcript is needed in that case. Unless I'm mistaken?

with manual mode, it should always proceed immediately. so either explicitly checking for vad, or doing != "manual" like it was doing before
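
To make the disagreement in this thread easier to see, here is a small sketch that compares the guard variants as plain predicates. It is not LiveKit code: the function names are hypothetical, the mode values are the ones enumerated later in this PR, and every predicate assumes STT is enabled and no transcript has arrived yet. Whether each combination is reachable in practice is not established here.

```python
from itertools import product

MODES = ("stt", "vad", "realtime_llm", "manual", None)


def waits_original(mode, has_turn_detector):
    # Guard before the PR: every mode except "manual" waits for a transcript.
    return mode != "manual"


def waits_first_revision(mode, has_turn_detector):
    # First revision in this PR: only "stt" mode waits.
    return mode == "stt"


def waits_a25deee(mode, has_turn_detector):
    # Revision a25deee: "stt" mode waits, and so does any configured turn detector.
    return mode == "stt" or has_turn_detector


def differing_cases(old, new):
    """Return (mode, has_turn_detector) pairs where the two guards disagree."""
    return [
        (mode, td)
        for mode, td in product(MODES, (False, True))
        if old(mode, td) != new(mode, td)
    ]
```

Calling `differing_cases(waits_original, waits_a25deee)` lists the combinations whose behavior changes. It includes `("manual", True)`, where the a25deee form would start waiting, which is the case the last comment above says should always proceed immediately.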

hudson-worden · 2026-03-05T17:28:02Z

livekit-agents/livekit/agents/voice/audio_recognition.py

+    def _eou_requires_transcript(self) -> bool:
+        if self._stt:
+            # while we aren't checking _turn_detector here,
+            #   _turn_detector and _turn_detection_mode are mutually exclusive (such that if one is provided, the other must be None)
+            # e.g. if _turn_detector is provided, _turn_detection_mode is None, and vice versa
+            match self._turn_detection_mode:
+                case "stt" | "realtime_llm" | None:
+                    return True
+                case "manual" | "vad":
+                    return False
+                case _:
+                    # If not specified then we assume it requires transcript
+                    return True
+        else:
+            return False


Ok, made this into its own function to enumerate all of the possible values.

Before, everything except manual was essentially going through the first case statement.
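
As a sanity check on that claim, the sketch below restates the old guard and the new function as free functions and lists the modes where they disagree. It is a table-driven approximation for illustration only: `eou_requires_transcript`, `old_guard`, and `changed_modes` are hypothetical names, and the set-based lookup replaces the match statement from the diff.

```python
# Modes that proceed without a transcript in the new function.
NO_TRANSCRIPT_NEEDED = {"manual", "vad"}


def eou_requires_transcript(stt_enabled, mode):
    """Approximation of the new helper: does EOU detection wait for a transcript?"""
    if not stt_enabled:
        return False
    if mode in NO_TRANSCRIPT_NEEDED:
        return False
    # "stt", "realtime_llm", None, and any unknown value require a transcript.
    return True


def old_guard(stt_enabled, mode):
    """The pre-PR condition, minus the 'no transcript yet' part."""
    return bool(stt_enabled) and mode != "manual"


def changed_modes():
    modes = ["stt", "vad", "realtime_llm", "manual", None]
    return [m for m in modes if old_guard(True, m) != eou_requires_transcript(True, m)]
```

Under these assumptions, `changed_modes()` contains only `"vad"`, so the refactor changes behavior for that mode alone.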

hudson-worden · 2026-03-05T17:28:21Z

livekit-agents/livekit/agents/voice/audio_recognition.py

+            match self._turn_detection_mode:
+                case "stt" | "realtime_llm" | None:
+                    return True
+                case "manual" | "vad":


The addition of vad here is the fix

chenghao-mou · 2026-03-05T19:42:57Z

Maybe I am missing something.

The current code skips EOU for VAD EOS without changing start speaking time, because STT transcript for that VAD speech might arrive later. When it arrives later, we call the same EOU function with the new transcript, at which time, it will be committed with the same start speaking time and reset correctly.
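
The skip-then-commit flow described here can be modeled in a few lines. This is a hypothetical sketch rather than the real recognizer: `FakeRecognizer` and its methods are invented for illustration, and the only behavior taken from the comment is that the empty-transcript call leaves the start time untouched and the later call commits and resets it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FakeRecognizer:
    """Hypothetical model of the skip-then-commit flow (not LiveKit code)."""

    transcript: str = ""
    started_speaking_at: Optional[float] = None
    committed: List[Tuple[float, str]] = field(default_factory=list)

    def on_start_of_speech(self, now: float) -> None:
        # Only the first start of speech in a turn sets the timestamp.
        if self.started_speaking_at is None:
            self.started_speaking_at = now

    def on_vad_end_of_speech(self) -> None:
        self.run_eou_detection()

    def on_final_transcript(self, text: str) -> None:
        self.transcript = text
        self.run_eou_detection()

    def run_eou_detection(self) -> None:
        if not self.transcript:
            # Skip: the transcript may still arrive, so keep started_speaking_at.
            return
        assert self.started_speaking_at is not None
        self.committed.append((self.started_speaking_at, self.transcript))
        # Committing the turn resets the per-turn state.
        self.transcript = ""
        self.started_speaking_at = None
```

For example, after `on_start_of_speech(1.0)` and `on_vad_end_of_speech()`, nothing is committed and the start time is still `1.0`. A following `on_final_transcript("hi")` commits `(1.0, "hi")` and clears the start time.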

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-03-05T19:47:46Z

livekit-agents/livekit/agents/voice/audio_recognition.py

+                case "manual" | "vad":
+                    return False


🟡 VAD mode with STT can send empty user messages to the LLM

The new _eou_requires_transcript() method correctly returns False for "vad" mode (allowing EOU detection to proceed without a transcript), but downstream code in _user_turn_completed_task (agent_activity.py:1511-1515) unconditionally creates a user_message with content=[info.new_transcript] where info.new_transcript can be "". This message is then passed to _generate_reply, potentially sending an empty user message to the LLM.

How this path is reached

When VAD mode is used with STT enabled:

1. VAD detects END_OF_SPEECH before any STT transcript arrives → _run_eou_detection is called with _audio_transcript = ""
2. _eou_requires_transcript() returns False for mode "vad" → no early return (old code DID early return here because "vad" != "manual" was True)
3. _run_eou_detection correctly skips adding a blank message to chat_ctx (audio_recognition.py:548), but the _bounce_eou_task still fires on_end_of_turn with new_transcript=""
4. on_end_of_turn creates _user_turn_completed_task which builds a user_message with content [""] and proceeds to generate a reply

The min_interruption_words guard at agent_activity.py:1469-1481 only catches this when there IS a current interruptible speech. When the agent is idle, the empty message flows through unchecked.

Prompt for agents

In livekit-agents/livekit/agents/voice/audio_recognition.py, the _eou_requires_transcript() method returns False for 'vad' mode (line 535-536), which allows _run_eou_detection to proceed with an empty transcript. While this is correct for allowing VAD-based turn detection, the downstream consumer on_end_of_turn in agent_activity.py (line 1445) and _user_turn_completed_task (line 1491) don't guard against empty transcripts. Either:

1. In _eou_requires_transcript, return True for 'vad' mode when STT exists (reverting to old behavior for the 'no transcript yet' case), OR
2. In agent_activity.py _user_turn_completed_task (around line 1511-1515), add a guard to skip generating a reply when info.new_transcript is empty and the LLM is not a RealtimeModel, OR
3. In _run_eou_detection around line 548, when self._audio_transcript is empty and mode is 'vad', pass the empty transcript info but set skip_reply=True in the _EndOfTurnInfo so that on_end_of_turn knows not to trigger LLM generation.
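
The second option could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: `should_generate_reply` and the `llm_is_realtime` flag are hypothetical, and treating whitespace-only text as empty is an extra assumption that the comment does not make.

```python
def should_generate_reply(new_transcript: str, llm_is_realtime: bool) -> bool:
    """Hypothetical helper: decide whether a completed user turn warrants a reply."""
    if llm_is_realtime:
        # Per the suggestion above, the guard applies only when the LLM
        # is not a RealtimeModel.
        return True
    # Assumption: whitespace-only transcripts count as empty.
    return new_transcript.strip() != ""
```

A caller would check this before building the user message, so an idle agent in VAD mode does not send a blank user turn to a text LLM.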

hudson-worden · 2026-03-05T21:15:22Z

I'm currently testing a few hypotheses after this comment.

I still think something is up, but it may be b/c of something different than what I'm proposing here. I'm thinking if we went ahead with allowing VAD to pass through and do EOU detection (like this PR), we'd run into situations where STT would have provided a more complete _audio_transcript after VAD declared END_OF_SPEECH?

I'm thinking the issue may be related to this instead.

hudson-worden · 2026-03-06T15:52:35Z

@chenghao-mou @davidzhao - Thanks for the review. This may have been misguided so I'll go ahead and close this PR.

I've opened a separate PR with the fix for this particular issue.

Fix situation where VAD eou detection is exiting early due to a missing transcript, even though vad doesn't operate on transcripts.

044a8a2

hudson-worden mentioned this pull request Mar 3, 2026

llm.ChatMessage MetricsReport.started_speaking_at misalignment #4969

Open

devin-ai-integration bot reviewed Mar 3, 2026

Fix the empty message issue revealed in tests.

43096b7

hudson-worden force-pushed the fix_eou_early_exit_vad_stt branch from 41bcd7f to 43096b7 Compare March 4, 2026 00:25

hudson-worden commented Mar 4, 2026

devin-ai-integration bot reviewed Mar 4, 2026

Make the statements more explicit

b2221c0

chenghao-mou self-assigned this Mar 4, 2026

davidzhao reviewed Mar 5, 2026

Make sure to address the case where a turn_detector is specified

a25deee

hudson-worden force-pushed the fix_eou_early_exit_vad_stt branch from 0d055ed to a25deee Compare March 5, 2026 16:16

hudson-worden added 2 commits March 5, 2026 11:50

Break out the logic into it's own function.

6f0da8e

Fold int _stt check

222f92a

hudson-worden commented Mar 5, 2026

Empty commit to retrigger build after minor github outage

5b9b4d9

devin-ai-integration bot reviewed Mar 5, 2026

hudson-worden closed this Mar 6, 2026
