
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 21, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 21, 2025 1:27am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 21, 2025 1:27am
rivet-inspector Ignored Ignored Preview Nov 21, 2025 1:27am
rivet-site Ignored Ignored Preview Nov 21, 2025 1:27am

Contributor Author

MasterPtato commented Nov 21, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry changed the base branch from 11-20-fix_gateway_prevent_gc_from_removing_hibernating_in_flight_req_check_actor_started_after_sub_when_hibernating to graphite-base/3508 November 21, 2025 00:50
@claude

claude bot commented Nov 21, 2025

Pull Request Review

This PR refactors message ID handling in the runner protocol. Overall the architectural changes are sound, but there is a critical bug that needs fixing.

Critical Issue - Protocol Conversion Bug

Location: engine/sdks/rust/runner-protocol/src/versioned.rs lines 264 and 597

In the v3_to_v2 conversion functions for both ToClient and ToServer, there is incorrect slice copying logic. The code copies from request_id into message_id before request_id is fully initialized, and writes message_index to the wrong array.

Current problematic code at line 264:

  • message_id[..8].copy_from_slice reads request_id[0..8] before bytes 8..10 of request_id have been written
  • message_index is written to request_id[8..10] instead of message_id[8..10]

This will break v2/v3 protocol compatibility.
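
The ordering this comment asks for can be sketched as below. This is a minimal illustration, not the code in versioned.rs: the MessageIdV3 struct, the v3_to_v2_ids function, and the 16-byte v2 layout (gateway_id in bytes 0..4, request_id in bytes 4..8, message_index little-endian in bytes 8..10 of message_id only) are assumptions inferred from this review.

```rust
/// Hypothetical v3 message ID. Field names follow the review; widths are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MessageIdV3 {
    gateway_id: [u8; 4],
    request_id: [u8; 4],
    message_index: u16,
}

/// Build the assumed 16-byte v2 `request_id` and `message_id` buffers.
fn v3_to_v2_ids(id: &MessageIdV3) -> ([u8; 16], [u8; 16]) {
    // Fully initialize request_id first: gateway_id, request_id, zero padding.
    let mut request_id = [0u8; 16];
    request_id[..4].copy_from_slice(&id.gateway_id);
    request_id[4..8].copy_from_slice(&id.request_id);

    // Only now copy the shared 8-byte prefix into message_id.
    let mut message_id = [0u8; 16];
    message_id[..8].copy_from_slice(&request_id[..8]);
    // The index is written to message_id, not request_id.
    message_id[8..10].copy_from_slice(&id.message_index.to_le_bytes());

    (request_id, message_id)
}
```

Under these assumptions, request_id keeps its trailing zero padding and only message_id carries the index, so the two buffers share their first 8 bytes and differ at most in bytes 8..10.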

Code Quality

Strengths:

  • Clean consolidation of utilities into rivet-runner-protocol crate
  • Good use of structured MessageId type instead of raw byte arrays
  • Proper error handling with context
  • Eliminates 86 lines of duplicate code

Minor Issues:

  • Inconsistent error macro usage in util.rs (line 25 vs 48)
  • Per CLAUDE.md, avoid glob imports from anyhow

Other Observations

Security: ID collision handling is reasonable with 100 retry attempts. By the birthday bound, random 32-bit request IDs start to collide with noticeable probability at around 65k (2^16) concurrent requests, but the current retry logic handles this gracefully.
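
The 65k figure can be checked with the standard birthday approximation. The sketch below is only arithmetic and assumes IDs are drawn uniformly at random; collision_probability is a hypothetical helper, not part of the PR.

```rust
/// Approximate probability that at least two of `n` uniformly random IDs
/// drawn from a space of 2^bits values collide:
/// p ≈ 1 - exp(-n(n-1) / (2 * 2^bits)).
fn collision_probability(n: u64, bits: u32) -> f64 {
    let space = 2f64.powi(bits as i32);
    let pairs = (n as f64) * (n as f64 - 1.0) / 2.0;
    1.0 - (-pairs / space).exp()
}
```

For example, collision_probability(65_536, 32) comes out just under 0.4, while collision_probability(1_000, 32) is on the order of 1e-4, which is why a bounded retry loop is a reasonable mitigation at moderate concurrency.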

Performance: Moving to structured types and fixed-size arrays is good for performance.

Test Coverage: No tests added for the new util.rs functions or protocol conversion logic. Recommend adding unit tests for these critical paths.

Architecture: Excellent separation of concerns moving ID utilities to protocol crate. Dependency changes appropriately reduce coupling.

Recommendations

  1. Fix the critical protocol conversion bug in versioned.rs lines 264 and 597
  2. Add unit tests for protocol conversion and ID utility functions
  3. Standardize anyhow error macro usage
  4. Verify end-to-end v2/v3 protocol interop after fixing the bug

Verdict

⚠️ Needs Changes - Fix the protocol conversion bug before merging.

@NathanFlurry NathanFlurry force-pushed the 11-20-chore_remove_msg_id_parts_from_protocol_move_id_utils_to_runner_protocol_crate branch from 37a57d1 to 66d58b1 Compare November 21, 2025 01:27
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3508 to 11-20-docs_runner_lifecycle November 21, 2025 01:28
@NathanFlurry NathanFlurry marked this pull request as ready for review November 21, 2025 01:28
@claude

claude bot commented Nov 21, 2025

Pull Request Review

This PR refactors message ID handling by removing parsing functions from the pegboard crate and consolidating ID utilities into the runner-protocol crate. Overall, this is a good architectural improvement that better encapsulates protocol concerns.

Positive Changes

  1. Better Separation of Concerns: Moving ID utilities to runner-protocol is the right call - these are protocol-level concerns, not pegboard-specific logic.

  2. Simplified Message ID Structure: The shift from building and parsing message IDs (which required error handling) to using a structured MessageId type directly is cleaner and safer.

  3. Reduced Complexity: Eliminating 86 lines of ID parsing/building code removes potential error paths and simplifies the codebase.

  4. Consistent Naming: The rename from inflight_requests to in_flight_requests improves consistency.

Code Quality Observations

Good practices:

  • Proper use of structured types instead of raw byte arrays with parsing
  • Clean migration path in version converters
  • Import consolidation in several files

Minor concerns:

  1. Error Message Change in tunnel_to_ws_task.rs:76 - Changed from "websocket message timeout" to "garbage collected". The original was clearer about why the connection was garbage collected.

  2. Protocol Version Compatibility: The versioned.rs conversion logic loses message_index information when converting from v2 to v3, because the value is hardcoded to 0 on line 189. This could cause issues if old clients reconnect.

  3. versioned.rs:264 places message_index at bytes 8-10 of request_id, but the comment on line 257 says v2.request_id should be "gateway_id 4 + request_id 4 + padding 8 zeros". The comment should be updated to reflect this.

Testing Recommendations

  1. Test message ID round-trip conversions between protocol versions
  2. Verify WebSocket reconnection after hibernation with old runners (v2 protocol)
  3. Ensure all id_to_string calls produce the expected hex format

Security and Performance

No security concerns identified. Performance should improve slightly by removing string parsing operations.

Summary

This is a solid refactoring that improves code organization. Main suggestions: clarify log messages and verify v2/v3 conversion comments match implementation.

Recommendation: Approve with minor suggestions

@claude

claude bot commented Nov 21, 2025

Pull Request Review

This PR refactors the message ID handling by consolidating ID utilities into the runner-protocol crate and introducing a structured MessageId type. Overall, this is a solid improvement that reduces code duplication and improves type safety. Here's my detailed feedback:

✅ Strengths

  1. Better code organization: Moving ID generation and parsing utilities from pegboard::tunnel::id to rivet_runner_protocol::util is a good consolidation that reduces coupling.

  2. Improved type safety: The new structured MessageId type (with gateway_id, request_id, message_index fields) is clearer and safer than the previous byte array manipulation approach.

  3. Cleaner dependencies: Removing the pegboard dependency from guard-core is a good architectural improvement, reducing circular dependencies.

  4. Consistent naming: The rename from inflight_requests to in_flight_requests improves consistency (line 353 in shared_state.rs).

  5. Protocol version compatibility: The versioned conversion logic properly handles the migration between v2 and v3 message formats.

🔍 Areas for Improvement

1. Confusing buffer naming in versioned.rs:264

// Line 264 in versioned.rs
request_id[8..10].copy_from_slice(&msg.message_id.message_index.to_le_bytes());

This line copies the 2 bytes of message_index into bytes 8 and 9 (the half-open range 8..10) of a 16-byte array, but the variable is named request_id. This is confusing: should it be message_id instead? The same pattern appears at line 597.

Recommendation: Add a comment explaining why message_index is being written to the request_id buffer, or refactor for clarity.

2. Incomplete util.rs documentation

The id_to_string function at line 12 is documented as "Convert a GatewayId to a hex string" and takes a &crate::GatewayId parameter named gateway_id, yet it can be used for any 4-byte ID type (RequestId, GatewayId, etc.).

Recommendation:

  • Either rename the function to something more generic like id_to_hex_string
  • Or update the documentation to clarify it works with any 4-byte ID
  • Consider adding a similar string_to_id utility for parsing
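
A generic hex formatter and its parsing counterpart might look like the sketch below. The Id4 alias, the function names, and the lowercase 8-character hex format are assumptions for illustration; the crate's real types and output format may differ.

```rust
/// Hypothetical 4-byte ID type; the crate's real alias may differ.
type Id4 = [u8; 4];

/// Format any 4-byte ID as 8 lowercase hex characters.
fn id_to_hex_string(id: &Id4) -> String {
    id.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Parse exactly 8 hex characters back into a 4-byte ID.
fn string_to_id(s: &str) -> Result<Id4, String> {
    // Checking for hex digits up front also rejects signs such as '+',
    // which u8::from_str_radix would otherwise accept.
    if s.len() != 8 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return Err(format!("expected 8 hex characters, got {:?}", s));
    }
    let mut id = [0u8; 4];
    for (i, byte) in id.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[i * 2..i * 2 + 2], 16)
            .map_err(|e| format!("invalid hex in {:?}: {}", s, e))?;
    }
    Ok(id)
}
```

Keeping the two functions as exact inverses makes a round-trip unit test trivial to write.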

3. Hard-coded message_index in version conversion

In versioned.rs lines 189 and 525, when converting from v2 to v3, the code sets message_index: 0:

message_index: 0,  // Hard-coded

Question: Is this correct behavior? If we're converting an existing v2 message that might have been mid-stream, should we preserve some message ordering information? This could potentially cause issues with message ordering during protocol version transitions.
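
If v2 peers do carry an index in the ID buffer, it could be recovered during conversion instead of being hard-coded. The sketch below assumes the layout implied elsewhere in this review (gateway_id in bytes 0..4, request_id in bytes 4..8, message_index little-endian in bytes 8..10); MessageIdV3 and v2_to_v3_id are hypothetical names, and whether real v2 messages populate those bytes is not established here.

```rust
use std::convert::TryInto;

/// Hypothetical v3 message ID; field widths are assumed.
#[derive(Debug, PartialEq, Eq)]
struct MessageIdV3 {
    gateway_id: [u8; 4],
    request_id: [u8; 4],
    message_index: u16,
}

/// Decode an assumed 16-byte v2 ID buffer, returning None for any other length.
fn v2_to_v3_id(bytes: &[u8]) -> Option<MessageIdV3> {
    // Reject buffers that are not exactly 16 bytes instead of panicking on a slice.
    let bytes: &[u8; 16] = bytes.try_into().ok()?;
    let gateway_id: [u8; 4] = bytes[0..4].try_into().ok()?;
    let request_id: [u8; 4] = bytes[4..8].try_into().ok()?;
    // Read the index back with the same endianness used when writing it.
    let message_index = u16::from_le_bytes([bytes[8], bytes[9]]);
    Some(MessageIdV3 { gateway_id, request_id, message_index })
}
```

If v2 messages never set bytes 8..10, this decodes to message_index 0, which matches the current hard-coded behavior.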

4. Message drop logged only at debug level

In shared_state.rs:310, when a message is received for a disconnected request, it only logs at debug level:

tracing::debug!(
    request_id=%protocol::util::id_to_string(&message_id.request_id),
    "in flight has already been disconnected, dropping message"
);

Recommendation: Consider using tracing::warn! instead since dropping messages could indicate a timing issue or bug.

5. Removed debug logging

Line 188 in shared_state.rs removed this debug log:

- tracing::debug!(?message_id, ?payload, "shared state send message");

Question: Was this intentional? This could make debugging tunnel message issues harder. If the payload is too large to log, consider adding structured logging with just the message ID and payload size.

6. Error message change

Line 76 in tunnel_to_ws_task.rs changes the error message from:

- tracing::warn!("websocket message timeout");
+ tracing::warn!("garbage collected");

Recommendation: "garbage collected" is less descriptive than "websocket message timeout". Consider keeping the original message or using "websocket garbage collected due to timeout" for clarity.

🔐 Security Considerations

  1. Random ID generation: The use of rand::random::<u32>() in util.rs is appropriate for generating unique IDs. However, these IDs may be predictable if used in security-sensitive contexts. Since they appear to be used for internal request tracking (not authentication), this should be acceptable.

  2. No validation on ID parsing: The code doesn't validate that IDs are well-formed when converting between versions. Consider adding bounds checking in the conversion functions.

🧪 Testing Recommendations

  1. Add unit tests for the new util::generate_gateway_id(), generate_request_id(), and id_to_string() functions
  2. Test the versioned protocol conversion logic, especially the v2↔v3 MessageId transformations
  3. Test edge cases like message_index overflow (u16 max value)
  4. Test that in-flight request tracking works correctly with the new MessageId structure
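
For the message_index overflow case, the two usual policies for a u16 counter are wrapping and checked increments. The helpers below are hypothetical and only illustrate the behavior a test would pin down; which policy the actual code uses is not stated in this PR.

```rust
/// Wrap around to 0 after u16::MAX.
fn next_index_wrapping(current: u16) -> u16 {
    current.wrapping_add(1)
}

/// Return None instead of wrapping once the u16 space is exhausted.
fn next_index_checked(current: u16) -> Option<u16> {
    current.checked_add(1)
}
```

A unit test would assert the chosen behavior at u16::MAX explicitly, so that a later change of policy shows up as a test failure rather than as a silent ordering bug.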

📝 Minor Nitpicks

  1. Import cleanup: Line 1 in versioned.rs has use anyhow::{Ok, Result, bail}; - importing Ok from anyhow is unusual. Consider using anyhow::Result and the standard Ok.

  2. Truncated commit message: The commit message reads "runner pro…", which appears truncated. The full message should be "runner protocol crate".

  3. Cargo.toml sorting: The guard-core Cargo.toml reorders some dependencies (rustls-pemfile before rustls). While this works, it breaks alphabetical ordering slightly.

✅ What's Working Well

  • All dependency changes look correct
  • The removal of the deleted tunnel/id.rs file is clean
  • Import paths are updated consistently across all affected files
  • The protocol version compatibility layer is well-structured
  • Database key structures properly updated to use new types

🎯 Summary

This is a well-executed refactoring that improves code organization and type safety. The main concerns are:

  1. Clarify the confusing buffer manipulation in version conversion (lines 264, 597)
  2. Consider the implications of hard-coded message_index values during conversion
  3. Review the removed debug logging and downgraded error messages

After addressing these items, this PR will be in excellent shape. Great work on maintaining backwards compatibility through the versioned protocol system!

Recommendation: Approve with minor revisions to address the confusing buffer manipulation and consider the logging changes.


Review generated with assistance from Claude Code

@graphite-app
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025
@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-20-chore_remove_msg_id_parts_from_protocol_move_id_utils_to_runner_protocol_crate branch November 21, 2025 02:43

3 participants