4 changes: 2 additions & 2 deletions livekit-agents/livekit/agents/voice/agent_activity.py
Contributor

🟡 Missing empty-transcript guard at second _scheduling_paused check during session closing

The PR adds an info.new_transcript guard at line 1456 to prevent injecting blank user messages into the chat context when scheduling is paused and the session is closing. However, an analogous code path at line 1549 in _user_turn_completed_task was not updated with the same guard.

Root Cause and Impact

With the audio_recognition.py change (line 525), _run_eou_detection no longer returns early in VAD mode when STT is enabled but no transcript has arrived yet. This means on_end_of_turn can now be called with info.new_transcript == "". The PR correctly guards line 1456:

if self._session._closing and info.new_transcript:

But the same pattern at agent_activity.py:1549 is left unguarded:

if self._session._closing:
    self._agent._chat_ctx.items.append(user_message)
    self._session._conversation_item_added(user_message)

This path is reachable when _scheduling_paused is False during the synchronous on_end_of_turn (line 1449), allowing the async _user_turn_completed_task to be created at line 1484, but then _scheduling_paused becomes True before the async task executes line 1544. In that case, a user_message with empty content (content=[""]) is appended to the chat context, which is exactly the blank-message problem the PR is trying to fix.

Note that line 1523 already has the correct guard (if info.new_transcript != "":), confirming this is the intended pattern.

(Refers to lines 1549-1551)
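The race described above can be reproduced in isolation. The following is only a sketch: `FakeSession`, `FakeActivity`, and `run_once` are hypothetical stand-ins, not the real `AgentActivity` code. It shows the synchronous check passing, the flags flipping before the task runs, and the blank message being appended unless the transcript guard is present.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeSession:
    """Hypothetical stand-in for the session; only what this sketch needs."""

    closing: bool = False
    chat_items: list[str] = field(default_factory=list)


class FakeActivity:
    """Hypothetical stand-in for AgentActivity, reduced to the two checks."""

    def __init__(self, session: FakeSession, guard_empty: bool) -> None:
        self.session = session
        self.scheduling_paused = False
        self.guard_empty = guard_empty  # True = apply the proposed guard

    def on_end_of_turn(self, transcript: str) -> asyncio.Task[None] | None:
        # First check (synchronous): scheduling is not paused yet,
        # so the async task is created.
        if self.scheduling_paused:
            return None
        return asyncio.create_task(self._user_turn_completed(transcript))

    async def _user_turn_completed(self, transcript: str) -> None:
        # Second check (asynchronous): the flag may have flipped by now.
        if self.scheduling_paused:
            has_text = transcript != ""
            if self.session.closing and (has_text or not self.guard_empty):
                self.session.chat_items.append(transcript)
            return


async def run_once(guard_empty: bool, transcript: str) -> list[str]:
    session = FakeSession()
    activity = FakeActivity(session, guard_empty)
    task = activity.on_end_of_turn(transcript)
    # The session starts closing after the task is created but before it runs.
    activity.scheduling_paused = True
    session.closing = True
    if task is not None:
        await task
    return session.chat_items
```

Calling `asyncio.run(run_once(guard_empty=False, transcript=""))` leaves a blank entry in `chat_items`, while the guarded variant leaves the list empty and still records non-empty transcripts.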

Contributor Author

Happy to add it, but I need more context from a reviewer on whether this is necessary.

@@ -1453,8 +1453,8 @@ def on_end_of_turn(self, info: _EndOfTurnInfo) -> bool:
                 extra={"user_input": info.new_transcript},
             )

-            if self._session._closing:
-                # add user input to chat context
+            if self._session._closing and info.new_transcript != "":
+                # add user input to chat context and skip blank messages
                 user_message = llm.ChatMessage(
                     role="user",
                     content=[info.new_transcript],
30 changes: 25 additions & 5 deletions livekit-agents/livekit/agents/voice/audio_recognition.py
@@ -69,7 +69,8 @@ async def predict_end_of_turn(
     ) -> float: ...


-TurnDetectionMode = Literal["stt", "vad", "realtime_llm", "manual"] | _TurnDetector
+TurnDetectionType = Literal["stt", "vad", "realtime_llm", "manual"]
+TurnDetectionMode = TurnDetectionType | _TurnDetector
 """
 The mode of turn detection to use.

@@ -121,7 +122,9 @@ def __init__(
         self._turn_detector = turn_detection if not isinstance(turn_detection, str) else None
         self._stt = stt
         self._vad = vad
-        self._turn_detection_mode = turn_detection if isinstance(turn_detection, str) else None
+        self._turn_detection_mode: TurnDetectionType | None = (
+            turn_detection if isinstance(turn_detection, str) else None
+        )
         self._vad_base_turn_detection = self._turn_detection_mode in ("vad", None)
         self._user_turn_committed = False  # true if user turn ended but EOU task not done

@@ -521,13 +524,30 @@ async def _on_vad_event(self, ev: vad.VADEvent) -> None:
                 chat_ctx = self._hooks.retrieve_chat_ctx().copy()
                 self._run_eou_detection(chat_ctx)

+    def _eou_requires_transcript(self) -> bool:
+        if self._stt:
+            # while we aren't checking _turn_detector here,
+            # _turn_detector and _turn_detection_mode are mutually exclusive (such that if one is provided, the other must be None)
+            # e.g. if _turn_detector is provided, _turn_detection_mode is None, and vice versa
+            match self._turn_detection_mode:
+                case "stt" | "realtime_llm" | None:
+                    return True
+                case "manual" | "vad":
Contributor Author

The addition of vad here is the fix

+                    return False
Comment on lines +535 to +536
Contributor

🟡 VAD mode with STT can send empty user messages to the LLM

The new _eou_requires_transcript() method correctly returns False for "vad" mode (allowing EOU detection to proceed without a transcript), but downstream code in _user_turn_completed_task (agent_activity.py:1511-1515) unconditionally creates a user_message with content=[info.new_transcript] where info.new_transcript can be "". This message is then passed to _generate_reply, potentially sending an empty user message to the LLM.

How this path is reached

When VAD mode is used with STT enabled:

  1. VAD detects END_OF_SPEECH before any STT transcript arrives → _run_eou_detection is called with _audio_transcript = ""
  2. _eou_requires_transcript() returns False for mode "vad" → no early return (old code DID early return here because "vad" != "manual" was True)
  3. _run_eou_detection correctly skips adding a blank message to chat_ctx (audio_recognition.py:548), but the _bounce_eou_task still fires on_end_of_turn with new_transcript=""
  4. on_end_of_turn creates _user_turn_completed_task which builds a user_message with content [""] and proceeds to generate a reply

The min_interruption_words guard at agent_activity.py:1469-1481 only catches this when there IS a current interruptible speech. When the agent is idle, the empty message flows through unchecked.

Prompt for agents
In livekit-agents/livekit/agents/voice/audio_recognition.py, the _eou_requires_transcript() method returns False for 'vad' mode (lines 535-536), which allows _run_eou_detection to proceed with an empty transcript. While this is correct for allowing VAD-based turn detection, the downstream consumers on_end_of_turn in agent_activity.py (line 1445) and _user_turn_completed_task (line 1491) don't guard against empty transcripts. Either:

1. In _eou_requires_transcript, return True for 'vad' mode when STT exists (reverting to old behavior for the 'no transcript yet' case), OR
2. In agent_activity.py _user_turn_completed_task (around line 1511-1515), add a guard to skip generating a reply when info.new_transcript is empty and the LLM is not a RealtimeModel, OR
3. In _run_eou_detection around line 548, when self._audio_transcript is empty and mode is 'vad', pass the empty transcript info but set skip_reply=True in the _EndOfTurnInfo so that on_end_of_turn knows not to trigger LLM generation.
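Options 2 and 3 both reduce to a single decision made before reply generation. Here is a minimal sketch of that decision, assuming a reduced, hypothetical `EndOfTurn` dataclass in place of the real `_EndOfTurnInfo` and a hypothetical helper `should_generate_reply`; neither is part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndOfTurn:
    """Hypothetical, reduced stand-in for _EndOfTurnInfo."""

    new_transcript: str
    skip_reply: bool = False


def should_generate_reply(info: EndOfTurn, llm_is_realtime: bool) -> bool:
    """Decide whether a completed user turn should trigger a reply."""
    # Option 3: the producer already marked this turn as not needing a reply.
    if info.skip_reply:
        return False
    # Option 2: a text LLM has nothing to respond to when the transcript is
    # empty. Assumption: a realtime model may still have audio to act on,
    # so the empty transcript alone does not block it.
    if info.new_transcript == "" and not llm_is_realtime:
        return False
    return True
```

With this shape, an empty transcript from VAD mode is dropped for a text LLM, while non-empty transcripts are unaffected.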
+                case _:
+                    # If not specified then we assume it requires transcript
+                    return True
+        else:
+            return False
Comment on lines +527 to +541
Contributor Author

OK, made this into its own function to enumerate all of the possible values.

Contributor Author

Before, everything except manual was essentially going through the first case statement.


     def _run_eou_detection(self, chat_ctx: llm.ChatContext, skip_reply: bool = False) -> None:
-        if self._stt and not self._audio_transcript and self._turn_detection_mode != "manual":
-            # stt enabled but no transcript yet
+        if not self._audio_transcript and self._eou_requires_transcript():
             return

         chat_ctx = chat_ctx.copy()
-        chat_ctx.add_message(role="user", content=self._audio_transcript)
+        if self._audio_transcript != "":
+            # only append when we have a transcript so we don't inject blank user messages
+            chat_ctx.add_message(role="user", content=self._audio_transcript)
         turn_detector = (
             self._turn_detector
             if self._audio_transcript and self._turn_detection_mode != "manual"