docs: at-least-once delivery research and competing RFCs for Cosmo Streams #2826

Draft
jensneuse wants to merge 1 commit into `main` from `claude/at-least-once-delivery-research-uN1ap`

Conversation

jensneuse (Member) commented May 5, 2026

Summary

This PR adds a research document and four competing RFC proposals for implementing at-least-once delivery guarantees in Cosmo Streams (GraphQL subscriptions over NATS JetStream, Kafka, and other providers).

Files added under rfcs/

  • at-least-once-research.md — comprehensive prior art research covering Kafka, NATS JetStream, WebSocket patterns (MQTT, STOMP, Azure Web PubSub, Ably), GraphQL subscription protocols (graphql-ws, graphql-sse, Hasura, AppSync), SSE Last-Event-ID, and an analysis of the current Cosmo delivery gap.
  • rfc-001-jetstream-cursor-resumption.md — expose JetStream stream sequence numbers as opaque cursors in Next message extensions; honor on reconnect to replay from the broker; zero router state.
  • rfc-002-sse-last-event-id.md — new SSE subscription transport using browser-native Last-Event-ID reconnection; covers JetStream and Kafka; no client library changes required.
  • rfc-003-router-session-buffer.md — router writes events to a shared buffer (JetStream KV or Redis) before acking the broker; session token replay on reconnect; covers all providers including Redis and engine subscriptions.
  • rfc-004-schema-delivery-directive.md — schema-first @stream directive; auto-generates StreamEvent envelope with cursor, eventId, afterCursor arg, and optional acknowledgeEvent mutation; introspectable and codegen-friendly.
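
As a concrete illustration of RFC-001's opaque cursor (described in the review walkthrough below as base64-encoded JSON carrying the stream name and sequence), here is a minimal sketch; the exact JSON key names (`stream`, `seq`) and function names are assumptions, not part of the RFC text:

```python
import base64
import json

def encode_cursor(stream: str, seq: int) -> str:
    """Pack a JetStream stream name and sequence into an opaque base64url cursor."""
    raw = json.dumps({"stream": stream, "seq": seq}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> tuple[str, int]:
    """Inverse of encode_cursor; raises ValueError on malformed input."""
    try:
        obj = json.loads(base64.urlsafe_b64decode(cursor))
        return obj["stream"], int(obj["seq"])
    except (ValueError, KeyError) as exc:
        raise ValueError(f"malformed cursor: {cursor!r}") from exc
```

Because the cursor is opaque to clients, the router is free to change the encoding later without a client-side migration.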

Current delivery gap identified

| Leg | Status |
| --- | --- |
| Broker → Router (JetStream) | ✅ At-least-once (manual `msg.Ack()` post-dispatch) |
| Broker → Router (Kafka/Redis) | ❌ At-most-once |
| Router → Client (all providers) | ❌ Best-effort (ack before WebSocket write) |

RFC comparison

|  | RFC-001 | RFC-002 | RFC-003 | RFC-004 |
| --- | --- | --- | --- | --- |
| Provider coverage | JetStream only | JetStream + Kafka | All providers | All providers |
| Transport | WebSocket (existing) | SSE (new) | WebSocket | Any |
| Router state | None | None | Shared storage | None (or RFC-003 hybrid) |
| Client changes | Cursor in extensions | Switch to EventSource | Session token in connectionParams | Standard GraphQL variables |
| Schema visible | No | No | No | Yes (introspectable) |
| Ack mechanism | None | None | Explicit message or implicit | acknowledgeEvent mutation |

Test plan

  • Review research document for accuracy and completeness
  • Review each RFC for architectural soundness
  • Discuss which RFC(s) to pursue for implementation
  • Identify open questions to resolve before implementation begins

https://claude.ai/code/session_0147P9CLn4HLMquNjoVbyKfS


Generated by Claude Code

Summary by CodeRabbit

  • Documentation
    • Added RFC specifications for reliable GraphQL subscription delivery mechanisms including cursor-based event resumption for WebSocket connections, server-sent event replay via Last-Event-ID, router-managed session buffers for client reconnection, and schema-level delivery guarantee directives
    • Added reference research compiling at-least-once delivery patterns and implementation primitives from industry systems

Research document covering how Kafka, NATS JetStream, WebSockets,
and GraphQL subscriptions handle at-least-once delivery, with an
analysis of the current Cosmo streams delivery gap (broker→router
is at-least-once via JetStream ack; router→client is best-effort).

Four competing RFCs, each as a separate markdown file:

- RFC-001: JetStream cursor-based resumption — expose stream sequence
  numbers as opaque cursors; honor them on reconnect to replay from
  the broker. Zero router state, JetStream-only.

- RFC-002: SSE transport with Last-Event-ID — new SSE subscription
  endpoint; browser-native reconnection + replay via Last-Event-ID.
  Covers JetStream and Kafka; no client library changes needed.

- RFC-003: Router-managed session buffer — router writes events to
  shared storage (JetStream KV or Redis) before acking the broker;
  client reconnects with session token for replay. Covers all
  providers including Redis and engine subscriptions.

- RFC-004: Schema-level @stream directive — schema-first delivery
  guarantee declaration; auto-injects cursor envelope and afterCursor
  argument; acknowledgeEvent mutation for application-level acks.
  Introspectable and codegen-friendly.

https://claude.ai/code/session_0147P9CLn4HLMquNjoVbyKfS
coderabbitai Bot commented May 5, 2026

Walkthrough

This PR adds foundational research and four RFC proposals for implementing at-least-once delivery guarantees in a GraphQL subscription router. It covers existing system patterns (Kafka, NATS, WebSockets, MQTT), outlines four distinct mechanisms (JetStream cursor resumption, SSE with Last-Event-ID, router session buffering, schema delivery directive), and documents failure modes and tradeoffs.

Changes

At-Least-Once Delivery: Research & RFC Proposals

  • Foundational Research (rfcs/at-least-once-research.md): Compiles delivery guarantee taxonomy, prior art from Kafka (producer/consumer/transactions), NATS JetStream (streams/consumers/ack policies/EOS), WebSocket/MQTT/STOMP/Socket.IO reliability patterns, GraphQL subscription limitations, SSE replay via Last-Event-ID, analysis of current Cosmo gaps, general implementation building blocks (IDs, persistence, visibility/ack timeouts, backoff, DLQ, idempotency, session recovery), and references.
  • RFC-001: JetStream Cursor Resumption (rfcs/rfc-001-jetstream-cursor-resumption.md): Specifies opaque base64 JSON cursor format (stream name + sequence), delivery via extensions.x-cosmo-cursor on Next messages, client resume via x-cosmo-resume-cursor in ConnectionInit, router consumer creation at cursor.seq + 1 with gap signaling, ephemeral consumer lifecycle, ack timing tied to WebSocket write success, heartbeat cursor updates, and edge case handling (retention gaps, restarts, multi-instance, regression, mismatch).
  • RFC-002: SSE Last-Event-ID Transport (rfcs/rfc-002-sse-last-event-id.md): Defines new GET /graphql/stream SSE endpoint with native browser Last-Event-ID reconnection for at-least-once replay, cursor semantics per broker (JetStream sequence vs Kafka partition:offset), keep-alive comments, JetStream pull consumer replay and Kafka consumer seek, required HTTP/2 and CORS/proxy configuration, client integration examples, failure modes (retention gaps, large backlogs, duplicates, non-sticky routers, timeouts), and comparison versus RFC-001 WebSocket approach.
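
The RFC's reconnection story depends on standard client-side SSE bookkeeping: parse frames, skip keep-alive comments, and remember the last `id:` seen so it can be echoed back as `Last-Event-ID`. A minimal sketch of that bookkeeping (nothing Cosmo-specific is assumed; the wire format is plain SSE):

```python
def consume_sse(lines):
    """Parse raw SSE lines into (id, data) events; also return the last event
    id seen, which a client sends as the Last-Event-ID header on reconnect."""
    events, cur_id, data, last_id = [], None, [], None
    for line in lines:
        if line.startswith("id:"):
            cur_id = line[len("id:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line.startswith(":"):
            continue  # keep-alive comment frame, per the RFC's heartbeat scheme
        elif line == "":  # blank line terminates the current event
            if data:
                if cur_id is not None:
                    last_id = cur_id
                events.append((cur_id, "\n".join(data)))
            cur_id, data = None, []
    return events, last_id
```

Browsers do exactly this inside `EventSource`; the sketch only makes explicit what the router can rely on from the client side.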
  • RFC-003: Router-Managed Session Buffer (rfcs/rfc-003-router-session-buffer.md): Specifies per-client session buffer (JetStream KV, stream-per-session, or Redis) with x-cosmo-session-id token, monotonic x-cosmo-seq per session, optional explicit x-cosmo-ack, TTL expiry, cross-router replay via session token with embedded broker cursor, hard cap (1000 events) triggering close code 4400, soft cap (80%) backpressure pings, broker-to-buffer persistence ordering, failure handling (storage unavailability, mid-write crashes), and backward compatibility (session-absent = at-most-once).
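
The cap behavior summarized above (hard cap of 1000 events triggering close code 4400, soft cap at 80% triggering backpressure pings) can be sketched as follows; the in-memory list stands in for the JetStream KV/Redis backend, and the class and method names are illustrative, not from the RFC:

```python
HARD_CAP = 1000                  # per RFC-003: exceeding this closes with 4400
SOFT_CAP = int(HARD_CAP * 0.8)   # 80%: start sending backpressure pings

class SessionBuffer:
    """Per-client session buffer sketch with RFC-003's cap behavior."""
    def __init__(self):
        self.events = []     # (session_seq, payload) pairs
        self.next_seq = 1    # monotonic x-cosmo-seq per session

    def append(self, payload) -> str:
        """Returns 'ok', 'backpressure', or 'close-4400'."""
        if len(self.events) >= HARD_CAP:
            return "close-4400"      # hard cap reached: terminate the socket
        self.events.append((self.next_seq, payload))
        self.next_seq += 1
        return "backpressure" if len(self.events) >= SOFT_CAP else "ok"

    def ack(self, seq: int):
        """Explicit x-cosmo-ack: drop buffered events up to and including seq."""
        self.events = [(s, p) for s, p in self.events if s > seq]
```

The session sequence keeps advancing across acks, so replay after reconnect can be expressed purely as "everything with `x-cosmo-seq` greater than the last acked value".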
  • RFC-004: Schema Delivery Directive (rfcs/rfc-004-schema-delivery-directive.md): Introduces @stream directive for subscription fields with AT_LEAST_ONCE mode, router schema transformation adding afterCursor: String argument and wrapping payloads in auto-generated {FieldName}StreamEvent envelope (cursor, eventId, metadata), opaque base64url cursor semantics, optional acknowledgeEvent mutation for buffer GC, provider-specific seek/replay on afterCursor, failure modes (missing cursor, type collisions, large replays), opt-in backward compatibility, and comparisons with RFC-001/002/003.
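
A hypothetical before/after of the schema transformation RFC-004 describes. The field name `orderUpdates`, the directive argument name `mode`, and the `payload` wrapper field are illustrative assumptions; `afterCursor`, `cursor`, `eventId`, and `metadata` come from the summary above:

```graphql
# Author-written schema (field name "orderUpdates" is illustrative):
type Subscription {
  orderUpdates(accountId: ID!): Order @stream(mode: AT_LEAST_ONCE)
}

# Router-transformed schema: afterCursor argument added, payload wrapped
# in the auto-generated {FieldName}StreamEvent envelope:
type Subscription {
  orderUpdates(accountId: ID!, afterCursor: String): OrderUpdatesStreamEvent!
}

type OrderUpdatesStreamEvent {
  cursor: String!               # opaque base64url position, echoed back via afterCursor
  eventId: String!
  metadata: StreamEventMetadata # envelope field from the summary; shape assumed
  payload: Order!               # wrapper field name "payload" is an assumption
}
```

Because the envelope is part of the SDL, clients discover the delivery contract through introspection and codegen rather than out-of-band protocol extensions.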

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and clearly describes the main changes: adding research documentation and multiple RFC proposals for at-least-once delivery in Cosmo Streams. It accurately summarizes the primary purpose of the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |




coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rfcs/rfc-002-sse-last-event-id.md`:
- Line 154: Update the RFC text to explicitly separate per-event IDs from
reconnection cursors: state that each SSE event's id field MUST be the
single-message position formatted as "<partition>:<offset>" (e.g., "0:1042");
the client is responsible for tracking the highest offset seen per partition
locally; on reconnect the client sends Last-Event-ID as a composite cursor
encoding per-partition progress like "0:100,1:55,2:220"; clarify that the server
MUST NOT emit a multi-partition composite cursor as the id for individual events
and add a short subsection titled "Kafka multi-partition cursor semantics"
describing these responsibilities and examples.

In `@rfcs/rfc-003-router-session-buffer.md`:
- Around line 290-295: Clarify the deduplication semantics in the "Router Crash
Mid-Buffer-Write" section: state explicitly whether the router does buffer-level
deduplication or always appends duplicates with new session sequence numbers; if
the router deduplicates, describe the exact matching mechanism (e.g., compare
incoming message's broker sequence and broker ID against an indexed field on the
session buffer entries and skip write if a match exists), reference the session
buffer, x-cosmo-seq header and broker sequence as the keys used for matching,
and add a brief note about how the router handles ack/ack-retry state when it
finds a duplicate to ensure consistent session sequencing.

📥 Commits

Reviewing files that changed from the base of the PR and between e17a5d7 and bc9dc4a.

📒 Files selected for processing (5)
  • rfcs/at-least-once-research.md
  • rfcs/rfc-001-jetstream-cursor-resumption.md
  • rfcs/rfc-002-sse-last-event-id.md
  • rfcs/rfc-003-router-session-buffer.md
  • rfcs/rfc-004-schema-delivery-directive.md


Quoted from `rfcs/rfc-002-sse-last-event-id.md`, line 154:

> For JetStream, the SSE `id` is the raw uint64 stream sequence number (decimal string): `"42"`.
>
> For Kafka, where messages are identified by partition + offset, the cursor is `"<partition>:<offset>"`: `"0:1042"`. If the subscription covers multiple partitions, a cursor encodes the minimum offset per partition that the client has confirmed: `"0:100,1:55,2:220"`.
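
The composite Kafka cursor format quoted above ("partition:offset" pairs joined by commas) can be parsed and formatted with a couple of helpers; function names are illustrative:

```python
def parse_composite_cursor(cursor: str) -> dict[int, int]:
    """'0:100,1:55,2:220' -> {0: 100, 1: 55, 2: 220} (partition -> offset)."""
    out = {}
    for part in cursor.split(","):
        partition, offset = part.split(":")
        out[int(partition)] = int(offset)
    return out

def format_composite_cursor(offsets: dict[int, int]) -> str:
    """Inverse of parse_composite_cursor; partitions emitted in ascending order."""
    return ",".join(f"{p}:{o}" for p, o in sorted(offsets.items()))
```

Sorting by partition on output keeps the cursor canonical, so string comparison of two cursors is meaningful.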
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify Kafka multi-partition cursor encoding for SSE.

Line 154 states that for multi-partition subscriptions, the cursor encodes "the minimum offset per partition" as "0:100,1:55,2:220". However, there's ambiguity about when this format is used:

  1. Per-event ID: Each SSE event's id: field should represent that individual message's position (e.g., id: 0:1042 for a message from partition 0, offset 1042)
  2. Reconnection cursor: The Last-Event-ID header sent on reconnect needs to encode the client's progress across all partitions

These serve different purposes. For multi-partition Kafka subscriptions:

  • Individual events should have id: <partition>:<offset>
  • The client must track max offset per partition locally
  • On reconnect, the client should construct a composite cursor like "0:100,1:55,2:220" from its tracking state

If the server is expected to send multi-partition cursors as individual event IDs, this needs clarification, as it's unclear which partition offsets would be included in events that originate from only one partition.

Consider adding a subsection detailing multi-partition Kafka cursor semantics and client responsibilities.
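
The client responsibilities the comment proposes could be sketched like this (class name illustrative): track the maximum offset per partition from single-message event ids, then build the composite `Last-Event-ID` on reconnect:

```python
class KafkaCursorTracker:
    """Client-side tracking for multi-partition Kafka SSE subscriptions:
    each event id is a single '<partition>:<offset>'; the client keeps the
    highest offset seen per partition and constructs the composite cursor."""
    def __init__(self):
        self.max_offsets: dict[int, int] = {}

    def observe(self, event_id: str):
        """Record one per-event id such as '0:1042'."""
        partition, offset = (int(x) for x in event_id.split(":"))
        if offset > self.max_offsets.get(partition, -1):
            self.max_offsets[partition] = offset

    def last_event_id(self) -> str:
        """Composite cursor for the Last-Event-ID header on reconnect."""
        return ",".join(f"{p}:{o}" for p, o in sorted(self.max_offsets.items()))
```

This keeps the server side simple: it only ever emits single-message ids and only ever parses composite cursors.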


Comment on lines +290 to +295 of `rfcs/rfc-003-router-session-buffer.md`:

> ### Router Crash Mid-Buffer-Write
>
> If the router crashes after writing the event to the session buffer but before acking the broker:
> - The broker redelivers the message to another router instance (or the same, after restart).
> - The duplicate event is written to the buffer with a new session sequence number.
> - The client sees a duplicate. Client deduplication via `x-cosmo-seq` (if the event was already in the buffer with the same broker sequence, the router deduplicates before writing).
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify deduplication behavior after router crash.

Lines 294-295 appear contradictory. Line 294 states "The duplicate event is written to the buffer with a new session sequence number," but line 295 states "(if the event was already in the buffer with the same broker sequence, the router deduplicates before writing)."

Please clarify whether:

  1. The router always writes duplicates with new session sequence numbers (requiring client-side dedup), or
  2. The router performs buffer-level deduplication based on broker sequence before writing

If option 2, the mechanism for matching broker sequence to existing buffer entries should be detailed in the design section.
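
If option 2 were chosen, buffer-level deduplication keyed on the broker's identity for the message might look like this sketch (names illustrative; an in-memory list and set stand in for the session buffer and its index):

```python
class DedupSessionBuffer:
    """Option 2 sketch: dedup on (broker_id, broker_seq) before writing, so a
    message redelivered after a crash is never appended a second time and the
    per-session sequence stays consistent."""
    def __init__(self):
        self.entries = []    # (session_seq, broker_key, payload)
        self.seen = set()    # broker keys already buffered (the "index")
        self.next_seq = 1    # monotonic x-cosmo-seq per session

    def write(self, broker_id: str, broker_seq: int, payload) -> bool:
        """Append unless this broker message is already buffered. Returns True
        if written, False if skipped as a duplicate; in both cases the caller
        can safely ack (or re-ack) the broker afterwards."""
        key = (broker_id, broker_seq)
        if key in self.seen:
            return False     # redelivery: skip write, keep sequencing stable
        self.seen.add(key)
        self.entries.append((self.next_seq, key, payload))
        self.next_seq += 1
        return True
```

The key detail the RFC would still need to specify is where the `(broker_id, broker_seq)` index lives in JetStream KV or Redis so that a *different* router instance handling the redelivery can consult it.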

