
fix(semantic): ensure memory processing always reports completion status#951

Open
deepakdevp wants to merge 2 commits into volcengine:main from deepakdevp:fix/memory-semantic-queue-stall

Conversation

@deepakdevp
Contributor

Summary

  • Fix memory semantic queue stalls where the pending backlog grows while processed stays at 0
  • Root cause: _process_memory_directory() had error paths that caught exceptions and returned silently. Those errors need to propagate to on_dequeue()'s exception handler, which calls report_error() and drives the circuit-breaker logic
  • Changed two silent early returns (ls failure, write failure) to re-raise as RuntimeError, which is now caught by on_dequeue()'s existing exception handler
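The pattern above can be sketched as a minimal, self-contained example. Names like QueueStats, process_silently, and process_with_reraise are illustrative stand-ins, not the actual openviking code; the point is how a swallowed exception strands the in_progress counter while a re-raise lets the caller settle the item either way.

```python
import logging

logger = logging.getLogger(__name__)

class QueueStats:
    """Toy queue counters standing in for the real semantic queue (illustrative)."""
    def __init__(self):
        self.in_progress = 0
        self.processed = 0
        self.errors = 0

def process_silently(stats, fail=True):
    # Old behavior: swallow the error and return early.
    # The completion callbacks never run, so in_progress stays stuck.
    stats.in_progress += 1
    try:
        if fail:
            raise OSError("ls failed")
        stats.processed += 1
        stats.in_progress -= 1
    except Exception as e:
        logger.warning("swallowed: %s", e)
        return  # silent early return: in_progress is never decremented

def process_with_reraise(stats, fail=True):
    # New behavior: wrap and re-raise so the caller (on_dequeue in the PR)
    # always reaches a completion callback, success or error.
    stats.in_progress += 1
    try:
        try:
            if fail:
                raise OSError("ls failed")
        except Exception as e:
            raise RuntimeError(f"Failed to list memory directory: {e}") from e
        stats.processed += 1
    except RuntimeError:
        stats.errors += 1  # report_error() path in the real code
    finally:
        stats.in_progress -= 1
```

Running both on a failing item shows the difference: the silent version leaves in_progress at 1 forever, while the re-raising version ends with in_progress back at 0 and one recorded error.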

Changes

  • semantic_processor.py: Error paths in _process_memory_directory() now re-raise instead of returning silently, so on_dequeue() can call report_error() for permanent errors or re-enqueue for transient ones
  • Two new tests verifying that an empty directory reports success and that an ls error reports an error

Fixes #864.

Test plan

  • 2 new tests pass (pytest tests/storage/test_memory_semantic_stall.py)
  • Ruff check and format clean
  • Empty directory path still correctly reports success (no regression)

_process_memory_directory() had early return paths that could bypass
report_success()/report_error() in on_dequeue(), leaving the queue's
in_progress counter permanently stuck. This caused the semantic queue
to appear stalled with pending items never being processed.

All code paths now properly propagate to the completion callbacks.

Fixes volcengine#864.
@github-actions

Failed to generate code suggestions for PR

Collaborator

@qin-ctx left a comment


Thanks for chasing this down. The direction is correct: the silent early returns in _process_memory_directory() really can bypass the queue completion callbacks and leave in_progress stuck.

I found one blocking issue and one follow-up test gap below.

    except Exception as e:
        logger.warning(f"Failed to list memory directory {dir_uri}: {e}")
-       return
+       raise RuntimeError(f"Failed to list memory directory {dir_uri}: {e}") from e
Collaborator


[Bug] (blocking) Re-raising here fixes the silent early return, but it does not actually make this failure path report an error in production. on_dequeue() still routes non-permanent exceptions through the re-enqueue branch, and classify_api_error() only recognizes 401/403/5xx/timeout patterns. That means common filesystem failures here, such as FileNotFoundError, permission denied, or local I/O errors, are classified as unknown, re-enqueued, and ultimately counted as success instead of hitting report_error().

So this PR removes the stuck in_progress symptom, but it does not guarantee the behavior intended in the PR description, and it can turn invalid memory URIs into infinite retries. Please either classify these directory read/write failures as permanent at the source, or extend the error-classification path so local filesystem failures are reported as queue errors rather than retried forever.

Contributor Author


Addressed in b8b504b. Added _PERMANENT_IO_ERRORS = (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError) with an isinstance check at the top of classify_api_error(), so filesystem errors are classified as "permanent" and hit report_error() instead of being re-enqueued. This prevents both the infinite retry loop and the false success counting.
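The classifier change described above can be sketched roughly as follows. This is a hypothetical simplified version of classify_api_error(): the real function also matches 401/403/5xx/timeout patterns, which are omitted here, and walking the __cause__ chain is inferred from the "chained cause detection" the author mentions (so a RuntimeError raised `from` a filesystem error still classifies as permanent).

```python
# Filesystem error types treated as permanent, per the PR author's description.
_PERMANENT_IO_ERRORS = (
    FileNotFoundError,
    PermissionError,
    IsADirectoryError,
    NotADirectoryError,
)

def classify_api_error(exc: BaseException) -> str:
    """Simplified sketch: 'permanent' for known filesystem errors, else 'unknown'."""
    cur = exc
    while cur is not None:
        # Check the exception and its __cause__ chain, so that
        # `raise RuntimeError(...) from e` still matches the underlying error.
        if isinstance(cur, _PERMANENT_IO_ERRORS):
            return "permanent"
        cur = cur.__cause__
    return "unknown"
```

Note that a plain OSError still classifies as "unknown" under this sketch: FileNotFoundError and friends are subclasses of OSError, not the other way around, which matches the reviewer's observation below about OSError("disk read failed").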

            return_value=None,
        ),
        patch(
            "openviking.storage.queuefs.semantic_processor.classify_api_error",
Collaborator


[Suggestion] (non-blocking) This test currently proves the desired behavior only because classify_api_error() is mocked to return "permanent". In the real code path, OSError("disk read failed") is classified as unknown, so on_dequeue() re-enqueues it instead of calling report_error(). Please add at least one test that exercises the real classifier behavior, and ideally a second one for the new write_file() failure path as well, so the tests match production semantics.

Contributor Author


Fixed in b8b504b. Removed the classify_api_error mock — tests now use real FileNotFoundError and PermissionError which the updated classifier handles as permanent. Also added a write-failure test path and 2 new tests in test_circuit_breaker.py verifying all 4 filesystem error types + chained cause detection.
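The shape of a mock-free test like those described above can be sketched with toy stand-ins. FakeQueue, on_dequeue, and classify here are all hypothetical simplifications, not the actual openviking signatures; the point is that the tests raise real FileNotFoundError/OSError instances and assert on which completion path fires, rather than mocking the classifier.

```python
class FakeQueue:
    """Records which completion path on_dequeue took (illustrative test double)."""
    def __init__(self):
        self.errors = []
        self.requeued = []
    def report_error(self, item, exc):
        self.errors.append((item, exc))
    def re_enqueue(self, item):
        self.requeued.append(item)

_PERMANENT = (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError)

def classify(exc):
    # Simplified classifier: permanent filesystem errors, including via __cause__.
    cur = exc
    while cur is not None:
        if isinstance(cur, _PERMANENT):
            return "permanent"
        cur = cur.__cause__
    return "unknown"

def on_dequeue(queue, item, process):
    # Simplified routing: permanent -> report_error, anything else -> re-enqueue.
    try:
        process(item)
    except Exception as e:
        if classify(e) == "permanent":
            queue.report_error(item, e)
        else:
            queue.re_enqueue(item)

def test_ls_failure_reports_error():
    q = FakeQueue()
    def process(item):
        # Real FileNotFoundError wrapped the way the PR wraps ls failures.
        raise RuntimeError("Failed to list memory directory") from FileNotFoundError(item)
    on_dequeue(q, "mem://missing", process)
    assert q.errors and not q.requeued

def test_unknown_error_is_requeued():
    q = FakeQueue()
    def process(item):
        raise OSError("disk read failed")  # plain OSError classifies as "unknown"
    on_dequeue(q, "mem://x", process)
    assert q.requeued == ["mem://x"] and not q.errors
```

Because nothing is patched, these tests break if the real classifier stops treating filesystem errors as permanent, which is exactly the production-semantics coverage the reviewer asked for.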

@qin-ctx self-assigned this Mar 25, 2026
…inite retry

Address review feedback: filesystem errors (FileNotFoundError,
PermissionError, IsADirectoryError, NotADirectoryError) are now
classified as permanent by classify_api_error(), so they hit
report_error() instead of being infinitely re-enqueued.

Tests updated to exercise real classifier behavior without mocking.

Development

Successfully merging this pull request may close these issues.

Memory semantic queue stalls on context_type=memory jobs; pending backlog grows while processed stays at 0
