
fix(semantic): ensure memory processing always reports completion status#951

Open
deepakdevp wants to merge 2 commits into volcengine:main from deepakdevp:fix/memory-semantic-queue-stall

Conversation

@deepakdevp
Contributor

Summary

  • Fix memory semantic queue stalls where the pending backlog grows while processed stays at 0
  • Root cause: _process_memory_directory() had error paths that caught exceptions and returned silently. Those errors need to propagate to on_dequeue()'s exception handler, which calls report_error() and drives the circuit-breaker logic
  • Changed two silent early returns (ls failure, write failure) to re-raise as RuntimeError, which is now caught by on_dequeue()'s existing exception handler
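The pattern above can be sketched as a minimal, self-contained example. Names like QueueStats, process_silently, and process_with_reraise are illustrative stand-ins, not the actual openviking code; the point is how a swallowed exception strands the in_progress counter while a re-raise lets the caller settle the item either way.

```python
import logging

logger = logging.getLogger(__name__)

class QueueStats:
    """Toy queue counters standing in for the real semantic queue (illustrative)."""
    def __init__(self):
        self.in_progress = 0
        self.processed = 0
        self.errors = 0

def process_silently(stats, fail=True):
    # Old behavior: swallow the error and return early.
    # The completion callbacks never run, so in_progress stays stuck.
    stats.in_progress += 1
    try:
        if fail:
            raise OSError("ls failed")
        stats.processed += 1
        stats.in_progress -= 1
    except Exception as e:
        logger.warning("swallowed: %s", e)
        return  # silent early return: in_progress is never decremented

def process_with_reraise(stats, fail=True):
    # New behavior: wrap and re-raise so the caller (on_dequeue in the PR)
    # always reaches a completion callback, success or error.
    stats.in_progress += 1
    try:
        try:
            if fail:
                raise OSError("ls failed")
        except Exception as e:
            raise RuntimeError(f"Failed to list memory directory: {e}") from e
        stats.processed += 1
    except RuntimeError:
        stats.errors += 1  # report_error() path in the real code
    finally:
        stats.in_progress -= 1
```

Running both on a failing item shows the difference: the silent version leaves in_progress at 1 forever, while the re-raising version ends with in_progress back at 0 and one recorded error.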

Changes

  • semantic_processor.py: Error paths in _process_memory_directory() now re-raise instead of returning silently, so on_dequeue() can call report_error() for permanent errors or re-enqueue for transient ones
  • Two new tests verifying that an empty directory reports success and that an ls error reports an error

Fixes #864.

Test plan

  • 2 new tests pass (pytest tests/storage/test_memory_semantic_stall.py)
  • Ruff check and format clean
  • Empty directory path still correctly reports success (no regression)

_process_memory_directory() had early return paths that could bypass
report_success()/report_error() in on_dequeue(), leaving the queue's
in_progress counter permanently stuck. This caused the semantic queue
to appear stalled with pending items never being processed.

All code paths now properly propagate to the completion callbacks.

Fixes volcengine#864.
@github-actions

Failed to generate code suggestions for PR

Collaborator

@qin-ctx left a comment


Thanks for chasing this down. The direction is correct: the silent early returns in _process_memory_directory() really can bypass the queue completion callbacks and leave in_progress stuck.

I found one blocking issue and one follow-up test gap below.

    except Exception as e:
        logger.warning(f"Failed to list memory directory {dir_uri}: {e}")
-       return
+       raise RuntimeError(f"Failed to list memory directory {dir_uri}: {e}") from e
Collaborator


[Bug] (blocking) Re-raising here fixes the silent early return, but it does not actually make this failure path report an error in production. on_dequeue() still routes non-permanent exceptions through the re-enqueue branch, and classify_api_error() only recognizes 401/403/5xx/timeout patterns. That means common filesystem failures here, such as FileNotFoundError, permission denied, or local I/O errors, are classified as unknown, re-enqueued, and ultimately counted as success instead of hitting report_error().

So this PR removes the stuck in_progress symptom, but it does not guarantee the behavior intended in the PR description, and it can turn invalid memory URIs into infinite retries. Please either classify these directory read/write failures as permanent at the source, or extend the error-classification path so local filesystem failures are reported as queue errors rather than retried forever.

Contributor Author


Addressed in b8b504b. Added _PERMANENT_IO_ERRORS = (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError) with an isinstance check at the top of classify_api_error(), so filesystem errors are classified as "permanent" and hit report_error() instead of being re-enqueued. This prevents both the infinite retry loop and the false success counting.
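The classifier change described above can be sketched roughly as follows. This is a hypothetical simplified version of classify_api_error(): the real function also matches 401/403/5xx/timeout patterns, which are omitted here, and walking the __cause__ chain is inferred from the "chained cause detection" the author mentions (so a RuntimeError raised `from` a filesystem error still classifies as permanent).

```python
# Filesystem error types treated as permanent, per the PR author's description.
_PERMANENT_IO_ERRORS = (
    FileNotFoundError,
    PermissionError,
    IsADirectoryError,
    NotADirectoryError,
)

def classify_api_error(exc: BaseException) -> str:
    """Simplified sketch: 'permanent' for known filesystem errors, else 'unknown'."""
    cur = exc
    while cur is not None:
        # Check the exception and its __cause__ chain, so that
        # `raise RuntimeError(...) from e` still matches the underlying error.
        if isinstance(cur, _PERMANENT_IO_ERRORS):
            return "permanent"
        cur = cur.__cause__
    return "unknown"
```

Note that a plain OSError still classifies as "unknown" under this sketch: FileNotFoundError and friends are subclasses of OSError, not the other way around, which matches the reviewer's observation below about OSError("disk read failed").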

            return_value=None,
        ),
        patch(
            "openviking.storage.queuefs.semantic_processor.classify_api_error",
Collaborator


[Suggestion] (non-blocking) This test currently proves the desired behavior only because classify_api_error() is mocked to return "permanent". In the real code path, OSError("disk read failed") is classified as unknown, so on_dequeue() re-enqueues it instead of calling report_error(). Please add at least one test that exercises the real classifier behavior, and ideally a second one for the new write_file() failure path as well, so the tests match production semantics.

Contributor Author


Fixed in b8b504b. Removed the classify_api_error mock — tests now use real FileNotFoundError and PermissionError which the updated classifier handles as permanent. Also added a write-failure test path and 2 new tests in test_circuit_breaker.py verifying all 4 filesystem error types + chained cause detection.
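The shape of a mock-free test like those described above can be sketched with toy stand-ins. FakeQueue, on_dequeue, and classify here are all hypothetical simplifications, not the actual openviking signatures; the point is that the tests raise real FileNotFoundError/OSError instances and assert on which completion path fires, rather than mocking the classifier.

```python
class FakeQueue:
    """Records which completion path on_dequeue took (illustrative test double)."""
    def __init__(self):
        self.errors = []
        self.requeued = []
    def report_error(self, item, exc):
        self.errors.append((item, exc))
    def re_enqueue(self, item):
        self.requeued.append(item)

_PERMANENT = (FileNotFoundError, PermissionError, IsADirectoryError, NotADirectoryError)

def classify(exc):
    # Simplified classifier: permanent filesystem errors, including via __cause__.
    cur = exc
    while cur is not None:
        if isinstance(cur, _PERMANENT):
            return "permanent"
        cur = cur.__cause__
    return "unknown"

def on_dequeue(queue, item, process):
    # Simplified routing: permanent -> report_error, anything else -> re-enqueue.
    try:
        process(item)
    except Exception as e:
        if classify(e) == "permanent":
            queue.report_error(item, e)
        else:
            queue.re_enqueue(item)

def test_ls_failure_reports_error():
    q = FakeQueue()
    def process(item):
        # Real FileNotFoundError wrapped the way the PR wraps ls failures.
        raise RuntimeError("Failed to list memory directory") from FileNotFoundError(item)
    on_dequeue(q, "mem://missing", process)
    assert q.errors and not q.requeued

def test_unknown_error_is_requeued():
    q = FakeQueue()
    def process(item):
        raise OSError("disk read failed")  # plain OSError classifies as "unknown"
    on_dequeue(q, "mem://x", process)
    assert q.requeued == ["mem://x"] and not q.errors
```

Because nothing is patched, these tests break if the real classifier stops treating filesystem errors as permanent, which is exactly the production-semantics coverage the reviewer asked for.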

@qin-ctx self-assigned this Mar 25, 2026
…inite retry

Address review feedback: filesystem errors (FileNotFoundError,
PermissionError, IsADirectoryError, NotADirectoryError) are now
classified as permanent by classify_api_error(), so they hit
report_error() instead of being infinitely re-enqueued.

Tests updated to exercise real classifier behavior without mocking.

Development

Successfully merging this pull request may close these issues.

Memory semantic queue stalls on context_type=memory jobs; pending backlog grows while processed stays at 0
