
Conversation

lxin (Owner) commented Aug 29, 2025

The 1st patch introduces a new socket option, QUIC_SOCKOPT_STREAM_PEELOFF, allowing applications to "peel off" an individual QUIC stream from a connection and obtain a dedicated socket for it.

The 2nd patch adds a sample test case demonstrating the use of the QUIC_SOCKOPT_STREAM_PEELOFF socket option and the stream socket APIs.

lxin added 2 commits August 29, 2025 17:53
This patch introduces a new socket option, QUIC_SOCKOPT_STREAM_PEELOFF,
allowing applications to "peel off" an individual QUIC stream from a
connection and obtain a dedicated socket for it.

1. Socket Option:

  The patch defines QUIC_SOCKOPT_STREAM_PEELOFF in uapi/linux/quic.h
  along with the quic_stream_peeloff structure, and implements
  quic_sock_get_stream_peeloff() to allocate a new quic_stream_sock,
  move frames from the parent socket's recv_list to the peeled stream,
  and return a file descriptor to userspace.

2. Function adjustment:

  Inqueue handling is extended with a frame_list in struct quic_stream,
  quic_inq_stream_tail() is updated to enqueue frames to either the
  parent recv_list or the stream's frame_list, quic_recvmsg and
  quic_sendmsg are updated to target individual streams via stream_id,
  and cleanup paths (STOP_SENDING, RESET_STREAM) now also purge frames
  from peeled stream frame lists.

3. Socket APIs:

  A dedicated quic_stream_proto and socket operations (sendmsg, recvmsg,
  poll, shutdown, release) are added to support peeled stream sockets.

This allows applications to manage individual QUIC streams through
dedicated stream-level sockets, similar to TCP substreams, enabling
simpler integration and fine-grained control.

Signed-off-by: Xin Long <[email protected]>
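
A minimal userspace sketch of the intended call pattern (a sketch, not quoted from the patch): the stream_id/fd field names in struct quic_stream_peeloff and the SOL_QUIC option level are assumptions here; the authoritative definitions live in uapi/linux/quic.h from the first patch.

        #include <stdint.h>
        #include <sys/socket.h>
        #include <linux/quic.h>

        /* Sketch only: the struct layout and option level are assumed. */
        static int peel_off_stream(int conn_fd, int64_t stream_id)
        {
                struct quic_stream_peeloff p = {};
                socklen_t len = sizeof(p);

                p.stream_id = stream_id;        /* which stream to detach from the connection */
                if (getsockopt(conn_fd, SOL_QUIC, QUIC_SOCKOPT_STREAM_PEELOFF, &p, &len))
                        return -1;              /* errno set by getsockopt() */
                return p.fd;                    /* dedicated socket for this one stream */
        }
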
This patch adds a sample testcase demonstrating the use of the
QUIC_SOCKOPT_STREAM_PEELOFF socket option.

On the send side, the testcase opens a stream using
getsockopt(QUIC_SOCKOPT_STREAM_OPEN), then peels it off with
getsockopt(QUIC_SOCKOPT_STREAM_PEELOFF), and finally sends data
through the new file descriptor.

On the receive side, the testcase detects stream creation via the
QUIC_EVENT_STREAM_UPDATE event, peels off the stream using
getsockopt(QUIC_SOCKOPT_STREAM_PEELOFF), and receives data through
the new file descriptor.

Signed-off-by: Xin Long <[email protected]>
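
For reference, a rough sketch of the send-side flow described above (QUIC_SOCKOPT_STREAM_OPEN and QUIC_SOCKOPT_STREAM_PEELOFF come from the commit text; struct quic_stream_info, its fields, and the peel_off_stream() helper from the earlier sketch are assumptions, not quotes from the sample code):

        #include <err.h>
        #include <stdint.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/socket.h>
        #include <linux/quic.h>

        static void send_on_peeled_stream(int conn_fd, int64_t stream_id)
        {
                struct quic_stream_info si = {};
                socklen_t len = sizeof(si);
                const char *msg = "hello over a peeled-off stream";
                int stream_fd;

                si.stream_id = stream_id;       /* open the stream on the connection socket */
                if (getsockopt(conn_fd, SOL_QUIC, QUIC_SOCKOPT_STREAM_OPEN, &si, &len))
                        err(1, "stream open");

                stream_fd = peel_off_stream(conn_fd, si.stream_id);     /* sketched earlier */
                if (stream_fd < 0)
                        err(1, "stream peeloff");

                /* After peeloff, the stream fd is used like an ordinary socket. */
                if (send(stream_fd, msg, strlen(msg), 0) < 0)
                        err(1, "send");
                close(stream_fd);
        }

The receive side mirrors this: wait for QUIC_EVENT_STREAM_UPDATE on the connection socket, peel the announced stream off the same way, then recv() on the returned fd.
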
struct quic_stream_sock {
        struct sock common;
        struct quic_stream *stream;
        struct sock *sk;

Ericson2314:

Could that be a struct quic_sock * instead?

lxin (Owner Author):

It could, but in these quic_stream_ops functions, it only needs to see it as 'struct sock'. If we use struct quic_sock, we still need to do struct sock *sk = &qs->inet.sk; first.

Thanks!

Ericson2314:

we still need to do struct sock *sk = &qs->inet.sk; first.

Just to be clear: because of the quic_sk function, I thought one just downcasts struct sock directly into struct quic_sock. Are you saying that in fact we need an offset to a different field?

it only needs to see it as struct sock

I think that is just because it is calling functions that will do the downcast anyways? Ultimately I defer to you on what the kernel idioms are. I am just thinking that if this socket must be a quic socket, and not any other kind of socket, we can use the narrower type to document that invariant.

lxin (Owner Author):

There is no offset needed to get the sock from a quic_sock; the kernel protocols just use &qs->inet.sk more commonly than (struct sock *)qs.

That is a good point; technically it doesn't really matter whether it's sock or quic_sock. Using quic_sock makes things clearer.
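
For readers following along, a small layout sketch of the point being made (assumed from the quic_sk() idiom mentioned above, not copied from the patch):

        #include <net/inet_sock.h>
        #include <net/sock.h>

        struct quic_sock {
                struct inet_sock inet;  /* inet.sk is the embedded struct sock, at offset 0 */
                /* protocol-private state follows */
        };

        static inline struct quic_sock *quic_sk(const struct sock *sk)
        {
                return (struct quic_sock *)sk;  /* a plain cast, no offset involved */
        }

So &qs->inet.sk and (struct sock *)qs yield the same pointer value; the choice is only about which spelling better documents the intent.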

Comment on lines +2436 to +2438
/* Do not peel off from one netns to another one. */
if (!net_eq(current->nsproxy->net_ns, sock_net(sk)))
        return -EINVAL;

Ericson2314:

Interesting. I would think that this would not be needed, as holding the "master socket" is enough privilege. But maybe there is some other interaction I am not aware of?

Sending the peeled off socket to a process in another net namespace is still fine, right?


Restated perhaps more clearly: I view namespace as existing to manage ambient authority for "legacy programs" that are used to having it. But in this case, this is a new interface, and holding the "master socket" strikes me as a sufficient capability such that there is no ambient authority being utilized in peeling off a socket.

lxin (Owner Author):

I think you're right, we don't need this check. I kind of mixed it up with SCTP:

https://lore.kernel.org/netdev/e637d68ce6f4f94dce8cb30c647e672ebb1f0b7b.1508253970.git.lucien.xin@gmail.com/

which is totally a different case.

Sending the peeled off socket to a process in another net namespace is still fine, right?

I think it should be fine, I need to try it out sometime.

Comment on lines +2441 to +2442
if (!stream || stream->peeled)
        return -EINVAL;

Ericson2314:

dup() on the peeled-off socket gets around this, right?

lxin (Owner Author):

dup() will not come down to the socket level; it only increases the refcount on the file struct and associates it with a new fd:

        struct file *file = fget_raw(fildes);   /* take another reference on the same struct file */

        if (file) {
                ret = get_unused_fd_flags(0);    /* only a new fd number is allocated */
                if (ret >= 0)
                        fd_install(ret, file);   /* point the new fd at the existing struct file */
                else
                        fput(file);
        }

Ericson2314:

Yes, for sure, we cannot have multiple separate underlying states associated with the same stream; that doesn't really make sense in theory either.

What I was getting at (perhaps too Socratically, sorry):

  1. It is fine to dup a stream socket, right? (assume the answer is yes, just like any other socket)
  2. If so, instead of making this an error, we could just find the existing socket and do what dup does (bump the ref count, assign a new fd number)

Since ultimately the stream socket is beholden to the master connection socket, I don't think it matters so much whether one is getting a "unique" (first-time peel off) or "non-unique" (bump the ref count for a prior peel off) fd for the stream?

Only counter-argument I have thought of: Maybe it would make close confusing? (I honestly forget, sorry, but close follows the refcount, whereas shutdown doesn't care about other referrers, right?)


Another way to look at the argument above: if peeling off creates a "projection or view" of the underlying multi-stream connection, it should be a monotonic and stateless operation (as far as observable behavior is concerned, not kernel-private data structures):

  • You can peel off the same stream as many times as you like
  • Refcount based operations like "close" don't care about peel off order, just total count --- the peel offs commute
  • Operations on the underlying connection don't care about how many times peel off has occurred because they don't see peeled-off "child" sockets, and their ref counts, at all.

lxin (Owner Author), Sep 18, 2025:

The problem is: socket and file must be 1:1, see sock_alloc_file():

        sock->file = file;
        file->private_data = sock;

unlike fd and file, which can be N:1.

Besides, does an application really need to peel off a stream multiple times?

Ericson2314, Sep 24, 2025:

I might be confused but I thought the analogy was:

file descriptor : struct file : file in the VFS
file descriptor : struct socket

so dup in both cases will increase the ref count, but never make a new struct file or struct socket. Only reopening the file in the VFS gets a new struct file.

(It would be fun to have such a "deep dup" to make a new file for things that support it; I guess the next best thing is open("/dev/fd/<N>", ...) or similar. But this is not what I meant by my suggestion.)

My suggestion was just to make peeloff monotonic via:

  • If there is no struct socket (and associated struct file) for a peeled off stream, make a new one (like today)
  • If there is one already, bump the refcount, allocate a new FD (but not a new struct socket!) and return that (see the sketch below)
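
If it helps make that concrete, a hypothetical fragment for quic_sock_get_stream_peeloff(), assuming the stream kept a pointer to the struct file created by the first peeloff (peeled_file is an invented field, not something in the patch):

        /* Hypothetical: reuse the existing stream socket instead of returning -EINVAL. */
        if (stream->peeled_file) {
                int fd = get_unused_fd_flags(0);

                if (fd >= 0) {
                        get_file(stream->peeled_file);          /* bump the file refcount, as dup() does */
                        fd_install(fd, stream->peeled_file);    /* hand out another fd for the same struct file */
                }
                return fd;
        }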

                        continue;
                list_move_tail(&frame->list, &stream->recv.frame_list);
        }
        sock_hold(sk); /* Keep the parent sk alive while the peeled stream is in use. */

Ericson2314:

Ah, very nice that this already exists. Are there other "parent-child socket" relationships like this already, or does it just exist for other purposes?

lxin (Owner Author):

I noticed that in TCP the request sk holds a refcnt on the listen sk in reqsk_alloc_noprof() and puts it in __reqsk_free(), but the difference is that a TCP request sk isn't related to an fd until it becomes a full sk, which no longer holds a refcnt on the listen sk.

We have to keep the parent sk alive, because these quic_stream_ops functions need to check if the parent sk is closed via quic_is_close(sk), and as you know there's no way to check if a sk memory/pointer is freed or not.

TBH, I don't really want to hold a refcnt on the parent sk, which is much larger than the stream socket. It should be freed ASAP after close(). But I don't have a safe way to access it in these quic_stream_ops functions. Please let me know if you have a better idea on this.

Ericson2314, Sep 13, 2025:

I am not actually a Linux kernel developer, so I am afraid I am not in the weeds enough to offer informed concrete suggestions. I'll offer some conceptual ideas, but you might have already thought of them:

  • This is the classic use-case for a weak reference, but I have no idea whether the socket ref counting infra supports that. Making it support that for the first time would be a big pain, so I am quite sympathetic on punting on that for now.

  • There is a more radical design where the only way to work with multiple streams is peeled-off sockets, and then each stream socket can "own" its state a bit more. The master connection still needs to be able to do unilateral closures, which would mean that it needs to be able to iterate through all the stream sockets and close them / reclaim their state.

    • Having the references instead go parent->child, for an easy "get all stream sockets", means an intermediate collection data structure, which adds more complexity. (c.f. SQL foreign keys usually go children -> parent, like the PR currently does, for the same reason of avoiding the collection.)
    • But instead of having weak references from parent to child (same problem as before), we could make it so that closing a child socket means finding its reference in the parent's collection and removing it.

lxin (Owner Author):

The master connection still needs to be able to do unilateral closures, which would mean that it needs to be able to iterate through all the stream sockets and close them / reclaim their state.

Yes, I think in this way close() on the master connection will be able to free its socket resources, but this will indeed increase the complexity.

lxin (Owner Author) commented Sep 18, 2025

@Ericson2314 I added this socket option to the I-D doc: https://www.ietf.org/archive/id/draft-lxin-quic-socket-apis-02.html#name-quic_sockopt_stream_peeloff

I'm planning to post this feature as a follow-up after upstream accepts the main QUIC code, since there is still some testing work to be done against peeloff sockets.

Thanks.

Ericson2314 commented:

Left some emojis before already, but just wanted to say, very excited for this!!
