Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cubic needs multiple stabilization steps after Hystart #1830

Open
huitema opened this issue Feb 3, 2025 · 3 comments
Open

Cubic needs multiple stabilization steps after Hystart #1830

huitema opened this issue Feb 3, 2025 · 3 comments

Comments

@huitema
Copy link
Collaborator

huitema commented Feb 3, 2025

Cubic connections start with an initial 'slow start' phase running Hystart. In theory, Hystart stops after sensing delay increases, and then the connection should continue from a safe point. In practice, we see something else, as shown in this snapshot of the initial connection phase in the qlog traces of a cubic run:

Image

We see three steps down, and we also see a series of packet losses continuing well after the first two steps. In this example, the Hystart process exits after the notification of delay increase, and immediately sets "ssthresh", "W_max" and "W_last_max" to that value. This proves over optimistic, and we see an immediate need to further reduce the transmission window.

The delay increase is not caused by the current value of the CWIN, but rather by the value of CWIN when the last acknowledged packet was sent. The CWIN will increase as acknowledgements are received after this point, but those increase are still "in the pipe line". We should set the safe value to one of:

  • the number of bytes in flight after the delay-causing acknowledgement packet was sent,
  • or, the total number of bytes acknowledged since that packet was sent,
  • or, if we cannot keep track of that, half the current value of the CWIN.

Instead, the apply the coefficient "beta cubic", which may be adequate during congestion avoidance but is not enough to stabilize the connection. This is probably wrong.

There is of course a risk that this is "too drastic", which means we should consider allowing a correction with HyStart++, see issue #1694 and comments from @hfstco.

@huitema
Copy link
Collaborator Author

huitema commented Feb 3, 2025

Tried implementing the divide by 2 simplification. Almost all tests pass, but the satellite cubic test completes in 8 seconds instead of 6.1 -- not quite an acceptable performance loss.

@huitema
Copy link
Collaborator Author

huitema commented Feb 3, 2025

The first attempt was too simple: we already had code managing the start of long delay connections, already lowering CWIN to a manageable value. We should not divide the window by 2 if that code is already applied. With that fix, all the tests pass. We do not observe performance regression. In fact, avoiding the "rebounds" also avoids packet losses, which improves performance overall. After the changes in PR #1831, we get the following:

Image

The PR makes two changes:

  1. use CWIN/2 as the target upong exit with HyStart
  2. enter recovery upon exit, to avoid reacting to packet losses caused by excess CWIN.

We can see on the traces that the "bounces" are gone, and the connection proceeds normally.

@hfstco
Copy link
Collaborator

hfstco commented Feb 4, 2025

Ok, just want to share my thoughts.

In general congestion feedback usually arrives 1 RTT later.

In the cubic_test test case the buffer is set to 1 BDP. (20ms RTT + 20ms buffer) The first drop highly depends on the buffer size of the path.

...and immediately sets "ssthresh", "W_max" and "W_last_max" to that value. This proves over optimistic...

In cubic_test HyStart triggers at 142ms. Maybe too late?

If we increase the buffer size of cubic_test in the master branch, these two steps will disappear without any changes.
Image

Otherwise, if we decrease the buffer size of cubic_test including the changes of #1831, these steps appear again.
Image

However, in my opinion these steps depend on the buffer size of the path. If the buffer can absorb the overshoot or not.

Instead, the apply the coefficient "beta cubic", which may be adequate during congestion avoidance but is not enough to stabilize the connection. This is probably wrong.

Currently, picoquic reduces the CWND by cubic_beta = 7/8 (W_max), right? RFC 9438 recommends 0.7 for CUBIC. NewReno sets the ssthresh by 0.5 or /2. However, even if we change the cubic_beta these would not avoid the first step. But maybe could avoid the second one.

When I compare the two versions of cubic_test (master vs. fix-cubic-start-bounce), we increase the gap sending more data after HyStart, which allows our buffer to drain. And this gap depends reduction of CWND and on the ACKs received (until bytes_in_flight <= CWND).
Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants