fix(room_io): disconnect locally before server delete to suppress spurious ERROR logs#6252
fix(room_io): disconnect locally before server delete to suppress spurious ERROR logs#6252tsushanth wants to merge 1 commit into
Conversation
…rious ERROR logs
When delete_room_on_close=True, the room is deleted from the server before the
agent's local RTC session closes its own connection. The rust-sdk introduced a
check (rust-sdks#1137) that logs ERROR when a publisher data channel closes while
the local session still considers itself connected. This produces three
misleading ERROR lines on every normal teardown:
publisher data channel '_reliable' closed unexpectedly
publisher data channel '_lossy' closed unexpectedly
publisher data channel '_data_track' closed unexpectedly
Fix: call room.disconnect() first so the local session transitions to the closed
state before the API delete triggers the server-side channel teardown. The
delete_room call is chained immediately after in the same async task, so the
net observable behaviour (room deleted on session close) is unchanged.
Fixes livekit#6250
There was a problem hiding this comment.
🚩 Pre-existing dead code: _close_session_atask is never assigned
The field self._close_session_atask is initialized to None at line 84 and used as a guard at line 406 (and not self._close_session_atask), but it is never assigned anywhere else in the codebase. This means the guard is always True (never prevents the close). This appears to be a leftover from a previous refactor and is not related to this PR's changes.
(Refers to line 84)
Was this helpful? React with 👍 or 👎 to provide feedback.
| async def _disconnect_then_delete() -> api.DeleteRoomResponse: | ||
| # Disconnect locally before issuing the server-side delete so the | ||
| # rust-sdk's connection-closed flag is set before the server | ||
| # closes the publisher data channels. Without this the channels | ||
| # are torn down from the remote side while the local session is | ||
| # still "connected", which triggers spurious ERROR-level logs: | ||
| # "publisher data channel '_reliable' closed unexpectedly" | ||
| # See https://github.com/livekit/agents/issues/6250 | ||
| await self._room.disconnect() | ||
| delete_fut = job_ctx.delete_room(room_name=self._room.name) | ||
| return await delete_fut |
There was a problem hiding this comment.
🟡 Room deletion task can now propagate unexpected exceptions, preventing cleanup of pending tasks
The disconnect-then-delete coroutine does not guard the disconnect call (await self._room.disconnect() at livekit-agents/livekit/agents/voice/room_io/room_io.py:491) against exceptions, so any failure there causes the task to reject with an unhandled exception.
Impact: If disconnect raises, the subsequent cleanup of pending tasks in aclose() is skipped, leaving background tasks uncancelled until process exit.
Old code guaranteed no exception; new code does not
The old code used job_ctx.delete_room() which internally wraps all operations in try/except (see livekit-agents/livekit/agents/job.py:632-643), so self._delete_room_task could never raise. In aclose() at livekit-agents/livekit/agents/voice/room_io/room_io.py:247-249, await asyncio.wait_for(task, timeout=...) only catches asyncio.TimeoutError. If _disconnect_then_delete propagates an exception from self._room.disconnect(), it bubbles up from aclose(), skipping the cleanup at lines 257-258 (await utils.aio.cancel_and_wait(*self._tasks)).
| async def _disconnect_then_delete() -> api.DeleteRoomResponse: | |
| # Disconnect locally before issuing the server-side delete so the | |
| # rust-sdk's connection-closed flag is set before the server | |
| # closes the publisher data channels. Without this the channels | |
| # are torn down from the remote side while the local session is | |
| # still "connected", which triggers spurious ERROR-level logs: | |
| # "publisher data channel '_reliable' closed unexpectedly" | |
| # See https://github.com/livekit/agents/issues/6250 | |
| await self._room.disconnect() | |
| delete_fut = job_ctx.delete_room(room_name=self._room.name) | |
| return await delete_fut | |
| async def _disconnect_then_delete() -> api.DeleteRoomResponse: | |
| # Disconnect locally before issuing the server-side delete so the | |
| # rust-sdk's connection-closed flag is set before the server | |
| # closes the publisher data channels. Without this the channels | |
| # are torn down from the remote side while the local session is | |
| # still "connected", which triggers spurious ERROR-level logs: | |
| # "publisher data channel '_reliable' closed unexpectedly" | |
| # See https://github.com/livekit/agents/issues/6250 | |
| try: | |
| await self._room.disconnect() | |
| except Exception: | |
| logger.warning("error disconnecting room before delete", exc_info=True) | |
| delete_fut = job_ctx.delete_room(room_name=self._room.name) | |
| return await delete_fut |
Was this helpful? React with 👍 or 👎 to provide feedback.
|
Thanks for creating the PR! The exception handling issue from Devin looks valid, could you fix that? type checking is failing too. |
chenghao-mou
left a comment
There was a problem hiding this comment.
Happy to merge it once the exception handling comment and CI error are fixed
Fixes #6250
Root cause
When
delete_room_on_close=True,_on_agent_session_closecallsjob_ctx.delete_room()while the local RTC session is still in theconnected state. The server-side delete tears down the publisher data
channels from the remote end. The rust-sdk introduced a check in
rust-sdks#1137 that logs at ERROR level whenever a publisher data channel
closes while the local session has not yet set its own closed flag.
The result is three misleading ERROR lines on every clean teardown:
The call completes cleanly — these are false positives triggered by a race
between the server-initiated close and the local state.
Fix
Call
room.disconnect()first so the local RTC session marks itselfclosed before the API delete sends the server-side channel teardown.
The delete is chained immediately after in the same async task, so the
observable behaviour (room is deleted on session close) is unchanged.
The existing
aclosepath already awaits_delete_room_task, so thenew coroutine-backed task is drained correctly there too.
Testing
Reproduce with
delete_room_on_close=True(the default) andctx.shutdown()called a few seconds after room creation. Before thispatch the three ERROR lines appear on every run; after it they are gone.