docs: at-least-once delivery research and competing RFCs for Cosmo Streams#2826
Conversation
Research document covering how Kafka, NATS JetStream, WebSockets, and GraphQL subscriptions handle at-least-once delivery, with an analysis of the current Cosmo streams delivery gap (broker→router is at-least-once via JetStream ack; router→client is best-effort).

Four competing RFCs, each as a separate markdown file:

- RFC-001: JetStream cursor-based resumption — expose stream sequence numbers as opaque cursors; honor them on reconnect to replay from the broker. Zero router state, JetStream-only.
- RFC-002: SSE transport with Last-Event-ID — new SSE subscription endpoint; browser-native reconnection + replay via Last-Event-ID. Covers JetStream and Kafka; no client library changes needed.
- RFC-003: Router-managed session buffer — router writes events to shared storage (JetStream KV or Redis) before acking the broker; client reconnects with a session token for replay. Covers all providers, including Redis and engine subscriptions.
- RFC-004: Schema-level @stream directive — schema-first delivery-guarantee declaration; auto-injects a cursor envelope and an afterCursor argument; acknowledgeEvent mutation for application-level acks. Introspectable and codegen-friendly.

https://claude.ai/code/session_0147P9CLn4HLMquNjoVbyKfS
Walkthrough

This PR adds foundational research and four RFC proposals for implementing at-least-once delivery guarantees in a GraphQL subscription router. It covers existing system patterns (Kafka, NATS, WebSockets, MQTT), outlines four distinct mechanisms (JetStream cursor resumption, SSE with Last-Event-ID, router session buffering, schema delivery directive), and documents failure modes and tradeoffs.

Changes: At-Least-Once Delivery: Research & RFC Proposals
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rfcs/rfc-002-sse-last-event-id.md`:
- Line 154: Update the RFC text to explicitly separate per-event IDs from
reconnection cursors: state that each SSE event's id field MUST be the
single-message position formatted as "<partition>:<offset>" (e.g., "0:1042");
the client is responsible for tracking the highest offset seen per partition
locally; on reconnect the client sends Last-Event-ID as a composite cursor
encoding per-partition progress like "0:100,1:55,2:220"; clarify that the server
MUST NOT emit a multi-partition composite cursor as the id for individual events
and add a short subsection titled "Kafka multi-partition cursor semantics"
describing these responsibilities and examples.
In `@rfcs/rfc-003-router-session-buffer.md`:
- Around line 290-295: Clarify the deduplication semantics in the "Router Crash
Mid-Buffer-Write" section: state explicitly whether the router does buffer-level
deduplication or always appends duplicates with new session sequence numbers; if
the router deduplicates, describe the exact matching mechanism (e.g., compare
incoming message's broker sequence and broker ID against an indexed field on the
session buffer entries and skip write if a match exists), reference the session
buffer, x-cosmo-seq header and broker sequence as the keys used for matching,
and add a brief note about how the router handles ack/ack-retry state when it
finds a duplicate to ensure consistent session sequencing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3e66ccd2-ce74-465e-9688-4cf7b2610e2e
📒 Files selected for processing (5)
- rfcs/at-least-once-research.md
- rfcs/rfc-001-jetstream-cursor-resumption.md
- rfcs/rfc-002-sse-last-event-id.md
- rfcs/rfc-003-router-session-buffer.md
- rfcs/rfc-004-schema-delivery-directive.md
> For JetStream, the SSE `id` is the raw uint64 stream sequence number (decimal string): `"42"`.
>
> For Kafka, where messages are identified by partition + offset, the cursor is `"<partition>:<offset>"`: `"0:1042"`. If the subscription covers multiple partitions, a cursor encodes the minimum offset per partition that the client has confirmed: `"0:100,1:55,2:220"`.
Clarify Kafka multi-partition cursor encoding for SSE.
Line 154 states that for multi-partition subscriptions, the cursor encodes "the minimum offset per partition" as "0:100,1:55,2:220". However, there's ambiguity about when this format is used:
- Per-event ID: each SSE event's `id:` field should represent that individual message's position (e.g., `id: 0:1042` for a message from partition 0, offset 1042).
- Reconnection cursor: the `Last-Event-ID` header sent on reconnect needs to encode the client's progress across all partitions.

These serve different purposes. For multi-partition Kafka subscriptions:

- Individual events should have `id: <partition>:<offset>`.
- The client must track the max offset per partition locally.
- On reconnect, the client should construct a composite cursor like `"0:100,1:55,2:220"` from its tracking state.
If the server is expected to send multi-partition cursors as individual event IDs, this needs clarification, as it's unclear which partition offsets would be included in events that originate from only one partition.
Consider adding a subsection detailing multi-partition Kafka cursor semantics and client responsibilities.
> ### Router Crash Mid-Buffer-Write
>
> If the router crashes after writing the event to the session buffer but before acking the broker:
> - The broker redelivers the message to another router instance (or the same, after restart).
> - The duplicate event is written to the buffer with a new session sequence number.
> - The client sees a duplicate. Client deduplication via `x-cosmo-seq` (if the event was already in the buffer with the same broker sequence, the router deduplicates before writing).
Clarify deduplication behavior after router crash.
Lines 294-295 appear contradictory. Line 294 states "The duplicate event is written to the buffer with a new session sequence number," but line 295 states "(if the event was already in the buffer with the same broker sequence, the router deduplicates before writing)."
Please clarify whether:
- The router always writes duplicates with new session sequence numbers (requiring client-side dedup), or
- The router performs buffer-level deduplication based on broker sequence before writing
If option 2, the mechanism for matching broker sequence to existing buffer entries should be detailed in the design section.
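One way to resolve the contradiction in favor of option 2 is to key buffer writes on the broker sequence. The sketch below is an assumption-laden illustration, not RFC-003's design: `SessionBuffer` is a hypothetical in-memory stand-in for the shared JetStream KV / Redis buffer, and the ack-retry behavior on a duplicate is one possible choice.

```go
package main

// BufferedEvent pairs a broker position with the session-scoped sequence
// the router assigns for client-side ordering (carried in x-cosmo-seq).
type BufferedEvent struct {
	BrokerSeq  uint64 // position assigned by the broker (e.g. JetStream stream sequence)
	SessionSeq uint64 // monotonically increasing per session
	Payload    []byte
}

// SessionBuffer is a hypothetical in-memory stand-in for the shared buffer;
// a real implementation would index brokerSeq in KV/Redis, not a local map.
type SessionBuffer struct {
	events  []BufferedEvent
	seen    map[uint64]uint64 // brokerSeq -> sessionSeq already assigned
	nextSeq uint64
}

func NewSessionBuffer() *SessionBuffer {
	return &SessionBuffer{seen: map[uint64]uint64{}}
}

// Append writes an event unless the broker sequence was already buffered.
// On redelivery after a crash-before-ack, the previously assigned session
// sequence is returned so the router can re-ack the broker without writing
// a second copy, keeping session sequencing consistent.
func (b *SessionBuffer) Append(brokerSeq uint64, payload []byte) (sessionSeq uint64, duplicate bool) {
	if s, ok := b.seen[brokerSeq]; ok {
		return s, true // deduplicated: ack the broker again, write nothing
	}
	b.nextSeq++
	b.seen[brokerSeq] = b.nextSeq
	b.events = append(b.events, BufferedEvent{
		BrokerSeq:  brokerSeq,
		SessionSeq: b.nextSeq,
		Payload:    payload,
	})
	return b.nextSeq, false
}
```

Under this scheme the client never sees a duplicate for a crash-before-ack redelivery; client-side `x-cosmo-seq` dedup remains only as a backstop for races between router instances sharing the buffer.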
Summary

This PR adds a research document and four competing RFC proposals for implementing at-least-once delivery guarantees in Cosmo Streams (GraphQL subscriptions over NATS JetStream, Kafka, and other providers).

Files added under rfcs/:

- at-least-once-research.md — comprehensive prior-art research covering Kafka, NATS JetStream, WebSocket patterns (MQTT, STOMP, Azure Web PubSub, Ably), GraphQL subscription protocols (graphql-ws, graphql-sse, Hasura, AppSync), SSE Last-Event-ID, and an analysis of the current Cosmo delivery gap.
- rfc-001-jetstream-cursor-resumption.md — expose JetStream stream sequence numbers as opaque cursors in Next message extensions; honor them on reconnect to replay from the broker; zero router state.
- rfc-002-sse-last-event-id.md — new SSE subscription transport using browser-native Last-Event-ID reconnection; covers JetStream and Kafka; no client library changes required.
- rfc-003-router-session-buffer.md — router writes events to a shared buffer (JetStream KV or Redis) before acking the broker; session-token replay on reconnect; covers all providers including Redis and engine subscriptions.
- rfc-004-schema-delivery-directive.md — schema-first @stream directive; auto-generates a StreamEvent envelope with cursor, eventId, an afterCursor argument, and an optional acknowledgeEvent mutation; introspectable and codegen-friendly.

Current delivery gap identified: broker→router is at-least-once (JetStream `msg.Ack()` post-dispatch); router→client is best-effort.

[RFC comparison table (mentioning EventSource, connectionParams, and the acknowledgeEvent mutation) and test plan did not survive extraction.]

https://claude.ai/code/session_0147P9CLn4HLMquNjoVbyKfS
Generated by Claude Code