
Conversation

lxin (Owner) commented Aug 29, 2025

The 1st patch introduces a new socket option, QUIC_SOCKOPT_STREAM_PEELOFF, allowing applications to "peel off" an individual QUIC stream from a connection and obtain a dedicated socket for it.

The 2nd patch adds a sample test case demonstrating the use of the QUIC_SOCKOPT_STREAM_PEELOFF socket option and the stream socket APIs.

lxin added 2 commits August 29, 2025 17:53
This patch introduces a new socket option, QUIC_SOCKOPT_STREAM_PEELOFF,
allowing applications to "peel off" an individual QUIC stream from a
connection and obtain a dedicated socket for it.

1. Socket Option:

  The patch defines QUIC_SOCKOPT_STREAM_PEELOFF in uapi/linux/quic.h
  along with the quic_stream_peeloff structure, and implements
  quic_sock_get_stream_peeloff() to allocate a new quic_stream_sock,
  move frames from the parent socket's recv_list to the peeled stream,
  and return a file descriptor to userspace.

2. Function adjustment:

  Inqueue handling is extended with a frame_list in struct quic_stream,
  quic_inq_stream_tail() is updated to enqueue frames to either the
  parent recv_list or the stream's frame_list, quic_recvmsg and
  quic_sendmsg are updated to target individual streams via stream_id,
  and cleanup paths (STOP_SENDING, RESET_STREAM) now also purge frames
  from peeled stream frame lists.

3. Socket APIs:

  A dedicated quic_stream_proto and socket operations (sendmsg, recvmsg,
  poll, shutdown, release) are added to support peeled stream sockets.

This allows applications to manage individual QUIC streams through
dedicated stream-level sockets, similar to TCP substreams, enabling
simpler integration and fine-grained control.

Signed-off-by: Xin Long <[email protected]>
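
A minimal userspace sketch of the intended call pattern (a sketch, not quoted from the patch): the stream_id/fd field names in struct quic_stream_peeloff and the SOL_QUIC option level are assumptions here; the authoritative definitions live in uapi/linux/quic.h from the first patch.

        #include <stdint.h>
        #include <sys/socket.h>
        #include <linux/quic.h>

        /* Sketch only: the struct layout and option level are assumed. */
        static int peel_off_stream(int conn_fd, int64_t stream_id)
        {
                struct quic_stream_peeloff p = {};
                socklen_t len = sizeof(p);

                p.stream_id = stream_id;        /* which stream to detach from the connection */
                if (getsockopt(conn_fd, SOL_QUIC, QUIC_SOCKOPT_STREAM_PEELOFF, &p, &len))
                        return -1;              /* errno set by getsockopt() */
                return p.fd;                    /* dedicated socket for this one stream */
        }
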
This patch adds a sample testcase demonstrating the use of the
QUIC_SOCKOPT_STREAM_PEELOFF socket option.

On the send side, the testcase opens a stream using
getsockopt(QUIC_SOCKOPT_STREAM_OPEN), then peels it off with
getsockopt(QUIC_SOCKOPT_STREAM_PEELOFF), and finally sends data
through the new file descriptor.

On the receive side, the testcase detects stream creation via the
QUIC_EVENT_STREAM_UPDATE event, peels off the stream using
getsockopt(QUIC_SOCKOPT_STREAM_PEELOFF), and receives data through
the new file descriptor.

Signed-off-by: Xin Long <[email protected]>
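
For reference, a rough sketch of the send-side flow described above (QUIC_SOCKOPT_STREAM_OPEN and QUIC_SOCKOPT_STREAM_PEELOFF come from the commit text; struct quic_stream_info, its fields, and the peel_off_stream() helper from the earlier sketch are assumptions, not quotes from the sample code):

        #include <err.h>
        #include <stdint.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/socket.h>
        #include <linux/quic.h>

        static void send_on_peeled_stream(int conn_fd, int64_t stream_id)
        {
                struct quic_stream_info si = {};
                socklen_t len = sizeof(si);
                const char *msg = "hello over a peeled-off stream";
                int stream_fd;

                si.stream_id = stream_id;       /* open the stream on the connection socket */
                if (getsockopt(conn_fd, SOL_QUIC, QUIC_SOCKOPT_STREAM_OPEN, &si, &len))
                        err(1, "stream open");

                stream_fd = peel_off_stream(conn_fd, si.stream_id);     /* sketched earlier */
                if (stream_fd < 0)
                        err(1, "stream peeloff");

                /* After peeloff, the stream fd is used like an ordinary socket. */
                if (send(stream_fd, msg, strlen(msg), 0) < 0)
                        err(1, "send");
                close(stream_fd);
        }

The receive side mirrors this: wait for QUIC_EVENT_STREAM_UPDATE on the connection socket, peel the announced stream off the same way, then recv() on the returned fd.
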
struct quic_stream_sock {
        struct sock common;
        struct quic_stream *stream;
        struct sock *sk;

Ericson2314:

Could that be a struct quic_sock * instead?

lxin (Owner Author):

It could, but in these quic_stream_ops functions, it only needs to see it as 'struct sock'. If we use struct quic_sock, we still need to do struct sock *sk = &qs->inet.sk; first.

Thanks!

Ericson2314:

we still need to do struct sock *sk = &qs->inet.sk; first.

Just to be clear: because of the quic_sk function, I thought one just downcasts struct sock directly into struct quic_sock. Are you saying that in fact we need an offset to a different field?

it only needs to see it as struct sock

I think that is just because it is calling functions that will do the downcast anyways? Ultimately I defer to you on what the kernel idioms are. I am just thinking that if this socket must be a quic socket, and not any other kind of socket, we can use the narrower type to document that invariant.

lxin (Owner Author):

There is no offset needed to get the sock from a quic_sock; the kernel protocols just use &qs->inet.sk more commonly than (struct sock *)qs.

That is a good point; technically it doesn't really matter whether it's sock or quic_sock. Using quic_sock makes things clearer.
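
For readers following along, a small layout sketch of the point being made (assumed from the quic_sk() idiom mentioned above, not copied from the patch):

        #include <net/inet_sock.h>
        #include <net/sock.h>

        struct quic_sock {
                struct inet_sock inet;  /* inet.sk is the embedded struct sock, at offset 0 */
                /* protocol-private state follows */
        };

        static inline struct quic_sock *quic_sk(const struct sock *sk)
        {
                return (struct quic_sock *)sk;  /* a plain cast, no offset involved */
        }

So &qs->inet.sk and (struct sock *)qs yield the same pointer value; the choice is only about which spelling better documents the intent.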

Comment on lines +2436 to +2438
/* Do not peel off from one netns to another one. */
if (!net_eq(current->nsproxy->net_ns, sock_net(sk)))
        return -EINVAL;

Ericson2314:

Interesting. I would think that this would not be needed, as holding the "master socket" is enough privilege. But maybe there is some other interaction I am not aware of?

Sending the peeled off socket to a process in another net namespace is still fine, right?


Restated perhaps more clearly: I view namespace as existing to manage ambient authority for "legacy programs" that are used to having it. But in this case, this is a new interface, and holding the "master socket" strikes me as a sufficient capability such that there is no ambient authority being utilized in peeling off a socket.

lxin (Owner Author):

I think you're right, we don't need this check. I kind of mixed it up with SCTP:

https://lore.kernel.org/netdev/e637d68ce6f4f94dce8cb30c647e672ebb1f0b7b.1508253970.git.lucien.xin@gmail.com/

which is totally a different case.

Sending the peeled off socket to a process in another net namespace is still fine, right?

I think it should be fine, I need to try it out sometime.

Comment on lines +2441 to +2442
if (!stream || stream->peeled)
        return -EINVAL;

Ericson2314:

dup() on the peeled-off socket gets around this, right?

lxin (Owner Author):

dup() will not come down to the socket level; it only increases the refcount on the file struct and associates it with a new fd:

        struct file *file = fget_raw(fildes);   /* take another reference on the same struct file */

        if (file) {
                ret = get_unused_fd_flags(0);    /* only a new fd number is allocated */
                if (ret >= 0)
                        fd_install(ret, file);   /* point the new fd at the existing struct file */
                else
                        fput(file);
        }

Ericson2314:

Yes, for sure, we cannot have multiple separate underlying states associated with the same stream; that doesn't really make sense in theory either.

What I was getting at (perhaps too Socratically, sorry):

  1. It is fine to dup a stream socket, right? (assume the answer is yes, just like any other socket)
  2. If so, instead of making this an error, we could just find the existing socket and do what dup does (bump the ref count, assign a new fd number)

Since ultimately the stream socket is beholden to the master connection socket, I don't think it matters so much whether one is getting a "unique" (first-time peel off) or "non-unique" (bump the ref count for a prior peel off) fd for the stream?

Only counter-argument I have thought of: Maybe it would make close confusing? (I honestly forget, sorry, but close follows the refcount, whereas shutdown doesn't care about other referrers, right?)


Another way to look at the argument above: if peeling off creates a "projection or view" of the underlying multi-stream connection, it should be a monotonic and stateless operation (as far as observable behavior is concerned, not kernel-private data structures):

  • You can peel off the same stream as many times as you like
  • Refcount based operations like "close" don't care about peel off order, just total count --- the peel offs commute
  • Operations on the underlying connection don't care about how many times peel off has occurred because they don't see peeled-off "child" sockets, and their ref counts, at all.

lxin (Owner Author), Sep 18, 2025:

The problem is: socket and file must be 1:1, see sock_alloc_file():

        sock->file = file;
        file->private_data = sock;

unlike fd and file, which can be N:1.

Besides, does an application really need to peel off a stream multiple times?

Ericson2314, Sep 24, 2025:

I might be confused but I thought the analogy was:

file descriptor : struct file : file in the VFS
file descriptor : struct socket

so dup in both cases will increase the ref count, but never make a new struct file or struct socket. Only reopening the file in the VFS gets a new struct file.

(It would be fun to have such a "deep dup" to make a new file for things that support it; I guess the next best thing is open("/dev/fd/<N>", ...) or similar. But this is not what I meant by my suggestion.)

My suggestion was just to make peeloff monotonic via:

  • If there is no struct socket (and associated struct file) for a peeled off stream, make a new one (like today)
  • If there is one already, bump the refcount, allocate a new FD (but not a new struct socket!) and return that (see the sketch below)
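
If it helps make that concrete, a hypothetical fragment for quic_sock_get_stream_peeloff(), assuming the stream kept a pointer to the struct file created by the first peeloff (peeled_file is an invented field, not something in the patch):

        /* Hypothetical: reuse the existing stream socket instead of returning -EINVAL. */
        if (stream->peeled_file) {
                int fd = get_unused_fd_flags(0);

                if (fd >= 0) {
                        get_file(stream->peeled_file);          /* bump the file refcount, as dup() does */
                        fd_install(fd, stream->peeled_file);    /* hand out another fd for the same struct file */
                }
                return fd;
        }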

                        continue;
                list_move_tail(&frame->list, &stream->recv.frame_list);
        }
        sock_hold(sk); /* Keep the parent sk alive while the peeled stream is in use. */

Ericson2314:

Ah, very nice that this already exists. Are there other "parent-child socket" relationships like this already, or does it just exist for other purposes?

lxin (Owner Author):

I noticed that in TCP the request sk holds a refcnt on the listen sk in reqsk_alloc_noprof() and puts it in __reqsk_free(), but the difference is that a TCP request sk isn't related to an fd until it becomes a full sk, which no longer holds a refcnt on the listen sk.

We have to keep the parent sk alive, because these quic_stream_ops functions need to check if the parent sk is closed via quic_is_close(sk), and as you know there's no way to check if a sk memory/pointer is freed or not.

TBH, I don't really want to hold a refcnt on the parent sk, which is much larger than the stream socket. It should be freed ASAP after close(). But I don't have a safe way to access it in these quic_stream_ops functions. Please let me know if you have a better idea on this.

Ericson2314, Sep 13, 2025:

I am not actually a Linux kernel developer, so I am afraid I am not in the weeds enough to offer informed concrete suggestions. I'll offer some conceptual ideas, but you might have already thought of them:

  • This is the classic use-case for a weak reference, but I have no idea whether the socket ref counting infra supports that. Making it support that for the first time would be a big pain, so I am quite sympathetic on punting on that for now.

  • There is a more radical design where the only way to work with multiple streams is peeled-off sockets, and then each stream socket can "own" its state a bit more. The master connection still needs to be able to do unilateral closures, which would mean that it needs to be able to iterate through all the stream sockets and close them / reclaim their state.

    • Having the references instead go parent->child, for an easy "get all stream sockets", means an intermediate collection data structure, which adds more complexity. (c.f. SQL foreign keys usually go children -> parent, like the PR currently does, for the same reason of avoiding the collection.)
    • But instead of having weak references from parent to child (same problem as before), we could make it so that closing a child socket means finding its reference in the parent's collection and removing it.

lxin (Owner Author):

The master connection still needs to be able to do unilateral closures, which would mean that it needs to be able to iterate through all the stream sockets and close them / reclaim their state.

Yes, I think in this way close() on the master connection will be able to free its socket resources, but this will indeed increase the complexity.

lxin (Owner Author) commented Sep 18, 2025

@Ericson2314 I added this socket option to the I-D doc: https://www.ietf.org/archive/id/draft-lxin-quic-socket-apis-02.html#name-quic_sockopt_stream_peeloff

I'm planning to post this feature as a follow-up after upstream accepts the main QUIC code, since there is still some testing work to be done against peeloff sockets.

Thanks.

Ericson2314 commented:

Left some emojis before already, but just wanted to say, very excited for this!!
