-
Notifications
You must be signed in to change notification settings - Fork 18
Add stream peeloff support #53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This patch introduces a new socket option, QUIC_SOCKOPT_STREAM_PEELOFF, allowing applications to "peel off" an individual QUIC stream from a connection and obtain a dedicated socket for it. 1. Socket Option: The patch defines QUIC_SOCKOPT_STREAM_PEELOFF in uapi/linux/quic.h along with the quic_stream_peeloff structure, and implements quic_sock_get_stream_peeloff() to allocate a new quic_stream_sock, move frames from the parent socket's recv_list to the peeled stream, and return a file descriptor to userspace. 2. Function ajustment: Inqueue handling is extended with a frame_list in struct quic_stream, quic_inq_stream_tail() is updated to enqueue frames to either the parent recv_list or the stream's frame_list, quic_recvmsg and quic_sendmsg are updated to target individual streams via stream_id, and cleanup paths (STOP_SENDING, RESET_STREAM) now also purge frames from peeled stream frame lists. 3. Socket APIs: A dedicated quic_stream_proto and socket operations (sendmsg, recvmsg, poll, shutdown, release) are added to support peeled stream sockets. This allows applications to manage high-level stream-level sockets similar to TCP substreams, enabling simpler integration and fine-grained control over QUIC streams. Signed-off-by: Xin Long <[email protected]>
This patch adds a sample testcase demonstrating the use of the QUIC_SOCKOPT_STREAM_PEELOFF socket option. On the send side, the testcase opens a stream using getsockopt(QUIC_SOCKOPT_STREAM_OPEN), then peels it off with getsockopt(QUIC_SOCKOPT_STREAM_PEELOFF), and finally sends data through the new file descriptor. On the receive side, the testcase detects stream creation via the QUIC_EVENT_STREAM_UPDATE event, peels off the stream using getsockopt(QUIC_SOCKOPT_STREAM_PEELOFF), and receives data through the new file descriptor. Signed-off-by: Xin Long <[email protected]>
struct quic_stream_sock { | ||
struct sock common; | ||
struct quic_stream *stream; | ||
struct sock *sk; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could that be a struct quic_sock *
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could, but in these quic_stream_ops functions, it only needs to see it as 'struct sock'. If use struct quic_sock
, we still need do struct sock *sk = &qs->inet.sk;
first.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we still need do
struct sock *sk = &qs->inet.sk;
first.
Just to be clear, because of the quic_sk
function, I thought one just downcasts struct sock
directly into struct quick_sock
. Are you saying that in fact we need an offset to a different field?
it only needs to see it as
struct sock
I think that is just because it is calling functions that will do the downcast anyways? Ultimately I defer to you on what the kernel idioms are. I am just thinking that if this socket must be a quic socket, and not any other kind of socket, we can use the narrower type to document that invariant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no offset to get sock from quic_sock, just the kernel protocols uses &qs->inet.sk; more commonly than (struct sock *)qs;
That is a good point, technically it doesn't really matter if it's sock or quic_sock. Using quic_sock makes things more clear.
/* Do not peel off from one netns to another one. */ | ||
if (!net_eq(current->nsproxy->net_ns, sock_net(sk))) | ||
return -EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I would think that this would not be needed, as holding the "master socket" is enough privileges. But maybe there is some other interaction I am not aware of?
Sending the peeled off socket to a process in another net namespace is still fine, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Restated perhaps more clearly: I view namespace as existing to manage ambient authority for "legacy programs" that are used to having it. But in this case, this is a new interface, and holding the "master socket" strikes me as a sufficient capability such that there is no ambient authority being utilized in peeling off a socket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right, we don't need this check. I kinda mixed with SCTP:
which is totally a different case.
Sending the peeled off socket to a process in another net namespace is still fine, right?
I think it should be fine, I need to try it out sometime.
if (!stream || stream->peeled) | ||
return -EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dup
on the peeled-off socket get around this, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dup() will not come to socket level, it only increase a refcnt to the file struct and related it with a new fd:
struct file *file = fget_raw(fildes);
if (file) {
ret = get_unused_fd_flags(0);
if (ret >= 0)
fd_install(ret, file);
else
fput(file);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for sure, cannot have multiple separate underlying states associated with the same stream, that doesn't really make sense in theory either.
What I was getting at (perhaps too Socratically, sorry):
- It is fine to dup a stream socket, right? (assume the answer is yes, just like any other socket)
- If so, instead of making this an error, we could instead just find the existing socket do what dup does (bump ref count, assign new fd number)
Since ultimately the stream socket is beholden to the master connection socket, I don't think it matters so much whether one is getting a "unique" (first time peal off) or "non-unique" (bump ref count for prior peel off) fd for the stream?
Only counter-argument I have thought of: Maybe it would make close
confusing? (I honestly forget, sorry, but close
follows the refcount, whereas shutdown
doesn't care about other referrers, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Another way to look at the argument above is that if peeling off creating a "projection or view" from the underlying multi-stream connection, it should be a monotonic and stateless (as far as observable behavior is concerned, not kernel-private data structure) operation:
- You can peel off the same stream as many times as you like
- Refcount based operations like "close" don't care about peel off order, just total count --- the peel offs commute
- Operations on the underlying connection don't care about how many times peel off has occurred because they don't see peeled-off "child" sockets, and their ref counts, at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is: socket and file must be 1:1, see sock_alloc_file():
sock->file = file;
file->private_data = sock;
unlike fd and file, can be N:1.
Besides, does an application really need to peel off a stream multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be confused but I thought the analogy was:
file descriptor : struct file
: file in the VFS
file descriptor : struct socket
so dup
in both cases will increase the ref count, but never make a new struct file
or struct socket
. Only reopening the file in the VFS gets a new struct file
.
(it would be fun have such a "deep dup" to make a new file for things that support it, I guess the next best thing is open("/dev/fd/<N>", ...)
or similar. But this is not what I meant by my suggestion.)
My suggestion was just to make peeloff monotonic via:
- If there is no
struct socket
(and associatedstruct file
) for a peeled off stream, make a new one (like today) - If there is is one already, bump the refcount, allocate a new FD (but not
struct socket
!) and return that
continue; | ||
list_move_tail(&frame->list, &stream->recv.frame_list); | ||
} | ||
sock_hold(sk); /* Keep the parent sk alive while the peeled stream is in use. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah very nice that is already exists. Are there other "parent-child socket" relationships like this already, or does it just exist for other purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed, in TCP, request sk hold a refcnt to listen sk in reqsk_alloc_noprof()
and put it in __reqsk_free()
, but the difference is TCP request sk doesn't related to a fd until it becomes a full sk which will not hold a refcnt to listen sk any longer.
We have to keep the parent sk alive, because these quic_stream_ops functions need to check if the parent sk is closed via quic_is_close(sk), and as you know there's no way to check if a sk memory/pointer is freed or not.
TBH, I don't really want to hold a refcnf to the parent sk, which is much larger than stream socket. It should be freed ASAP after close(). But I don't have a safe way to access it in these quic_stream_ops functions. Please let me know if you have a better idea on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not actually a Linux kernel developer, so I am afraid I am not in the weeds enough to offer informed concrete suggestions. I'll offer some conceptual ideas, but you might have already thought of them:
-
This is the classic use-case for a weak reference, but I have no idea whether the socket ref counting infra supports that. Making it support that for the first time would be a big pain, so I am quite sympathetic on punting on that for now.
-
There is a more radical design where the only way to work with multiple streams is pealed-off sockets, and then each stream socket can "own" its state a bit more. The master connection sill need to be able to do unliteral closures, which would mean that it needs to be able to iterate through all the stream sockets and close them / reclaim their state.
- Having the references instead go parent->child, for easy "get all stream sockets" means intermediate collection data structure, which adds more complexity. (c.f. SQL foreign keys are usually children -> parent like the PR currently for the same reason of avoiding the collection.)
- But instead of having weak references from parent to child (same problem as before) we can instead make it be that closing a child socket means finding its reference in the parent's collection, and removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The master connection sill need to be able to do unliteral closures, which would mean that it needs to be able to iterate through all the stream sockets and close them / reclaim their state.
Yes, I think in this way close() on master connection will be able to free its socket source, but this indeed will increase the complexity.
@Ericson2314 I added this socket option to the I-D doc: https://www.ietf.org/archive/id/draft-lxin-quic-socket-apis-02.html#name-quic_sockopt_stream_peeloff I'm planning to post this feature as a follow-up after upstream accepts the main QUIC code, since there is still some testing work to be done against peeloff sockets. Thanks. |
Left some emojis before already, but just wanted to say, very excited for this!! |
The 1st patch introduces a new socket option, QUIC_SOCKOPT_STREAM_PEELOFF, allowing applications to "peel off" an individual QUIC stream from a connection and obtain a dedicated socket for it.
The 2nd patch adds a sample test case demonstrating the use of the QUIC_SOCKOPT_STREAM_PEELOFF socket option, and the APIs of stream sockets.