fix: write i/o timeout on SSH connection #39
Conversation
(force-pushed from b335a61 to 80dfa95)
Windows linting seems to have timed out. I cannot find where to restart the job manually.
c.Conn = &writeTimeoutConn{
	Conn:    conn,
	timeout: timeout,
This is using what the user believes is the connect timeout and not the read / write timeout; same for the SSH version.
Given the user is passing in a read / write / dial timeout, the best option might be to just clarify its use. Thoughts?
When using NewClient() with the option Timeout() the behavior didn't change. This is also documented for the Timeout option: "Timeout sets read / write / dial timeout for a TeamSpeak 3 Client." I added some documentation to the Connect methods for clarification, though.
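For reference, a minimal sketch of how that option is typically passed; the address is a placeholder and the NewClient / Timeout signatures are assumed from the option documentation quoted above:

```go
package main

import (
	"log"
	"time"

	ts3 "github.com/multiplay/go-ts3"
)

func main() {
	// Timeout applies to read, write, and dial, per the option's documentation;
	// it is not a connect-only timeout. Address and port are placeholders.
	c, err := ts3.NewClient("127.0.0.1:10011", ts3.Timeout(10*time.Second))
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
}
```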
(force-pushed from 33eced8 to 2cea98c)
Looks like this might have caused a deadlock on Windows, see:
https://github.com/Unity-Technologies/go-ts3/actions/runs/4244839889/jobs/7379651078
Yeah, just noticed that too. Yesterday all tests ran on my Windows machine; now they don't. I'll investigate.
I think I found the issue. From what I can tell it had nothing to do with my changes, though they might have amplified the problem. It could happen in the rare case when the server connection was closed while a command was already executing. In the test it was freezing on the send to c.work, so I added a select around the write to c.work.
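A standalone sketch of the kind of guard described above; the channel names mirror the fields visible in the goroutine dump below, and the actual change in the PR may differ:

```go
package main

import (
	"errors"
	"fmt"
)

// errClosed is a placeholder error for this sketch.
var errClosed = errors.New("connection closed")

// send races the command hand-off against a disconnect signal so the caller
// doesn't block forever once the reading goroutine has gone away.
func send(work chan<- string, disconnect <-chan struct{}, cmd string) error {
	select {
	case work <- cmd:
		return nil
	case <-disconnect:
		return errClosed
	}
}

func main() {
	work := make(chan string)         // unbuffered, like a worker queue
	disconnect := make(chan struct{}) // closed when the connection goes away
	close(disconnect)                 // simulate an already-closed connection

	fmt.Println(send(work, disconnect, "quit")) // prints "connection closed"
}
```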
This doesn't look right, as there should never be a need to time out writes to that channel.
Looking at the previous error log I see there are 3 active clients, and the one running the close doesn't have any handlers, which looks like the underlying cause.
goroutine 41 [chan send]: (client: 0xc000097540)
github.com/multiplay/go-ts3.(*Client).ExecCmd(0xc000097540, 0xc000095b40)
D:/a/go-ts3/go-ts3/client.go:271 +0xc5
c.work <- cmd.String() // cmd = quit from c.Close()
goroutine 23 [chan send]: (client: 0xc000097400)
github.com/multiplay/go-ts3.(*Client).messageHandler(0xc000097400)
D:/a/go-ts3/go-ts3/client.go:231 +0x445
c.err <- err
goroutine 24 [chan send]: (client: 0xc000097400)
github.com/multiplay/go-ts3.(*Client).process(...)
D:/a/go-ts3/go-ts3/client.go:256
c.err <- err
goroutine 13 [chan send]: (client: 0xc0000974a0)
github.com/multiplay/go-ts3.(*Client).messageHandler(0xc0000974a0)
D:/a/go-ts3/go-ts3/client.go:231 +0x445
c.err <- err
goroutine 14 [chan send]: (client: 0xc0000974a0)
github.com/multiplay/go-ts3.(*Client).process(...)
D:/a/go-ts3/go-ts3/client.go:256
c.err <- err
Given this I wonder if the real cause is a race processing the error written to c.err in messageHandler for the disconnect case. If so this might fix it:
diff --git a/client.go b/client.go
index d9c5a7d..ed369a5 100644
--- a/client.go
+++ b/client.go
@@ -228,11 +228,12 @@ func (c *Client) messageHandler() {
}
} else {
err := c.scanErr()
- c.err <- err
if errors.Is(err, io.ErrUnexpectedEOF) {
close(c.disconnect)
+ c.err <- err
return
}
+ c.err <- err
}
}
}
Can you check that?
You are right, your suggestion seems to have fixed the issue as well. Digging into it, it pretty much boils down to what you suspected. I ran the tests a few hundred times and they never failed. I also noticed that when an SSH connection disconnects, it returns a wrapped EOF error.
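For illustration, a small example of why the wrapping matters for the errors.Is check used in the diff above; the exact way the SSH transport wraps the error is an assumption here:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

func main() {
	// Assumed shape of the error: io.EOF wrapped with context via %w.
	err := fmt.Errorf("ssh connection: read: %w", io.EOF)

	fmt.Println(errors.Is(err, io.EOF))              // true: errors.Is sees through the wrapping
	fmt.Println(errors.Is(err, io.ErrUnexpectedEOF)) // false: a check for ErrUnexpectedEOF won't match
}
```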
Since linting seems to always time out for Windows, I'm wondering if the step is even necessary. Shouldn't linting be independent of the OS, or do you have a different experience @stevenh? Otherwise I would suggest disabling the step.
Something odd going on there, might want to add a …. Linting can be OS dependent, so I do want to keep it.
Unfortunately still seeing random failures, so that change to disconnect handling doesn't fully fix it.
This is with some additional error wrapping to determine where the EOF was being triggered.
=== RUN TestClientDeadlineSSH
client_test.go:223:
Error Trace: multiplay/go-ts3/client_test.go:223
Error: Received unexpected error:
header: client: scan: ssh connection: read: EOF
Test: TestClientDeadlineSSH
--- FAIL: TestClientDeadlineSSH (0.11s)
panic: test timed out after 10s
running tests:
TestClientDisconnectSSH (10s)
goroutine 25 [running]:
testing.(*M).startAlarm.func1()
/usr/local/go/src/testing/testing.go:2241 +0x219
created by time.goFunc
/usr/local/go/src/time/sleep.go:176 +0x48
goroutine 1 [chan receive]:
testing.(*T).Run(0xc00013cb60, {0x8357b0, 0x17}, 0x85fb78)
/usr/local/go/src/testing/testing.go:1630 +0x82e
testing.runTests.func1(0x0?)
/usr/local/go/src/testing/testing.go:2036 +0x8e
testing.tRunner(0xc00013cb60, 0xc000167b48)
/usr/local/go/src/testing/testing.go:1576 +0x217
testing.runTests(0xc0001212c0?, {0xa86440, 0x1f, 0x1f}, {0x1c?, 0x1e?, 0xa8b020?})
/usr/local/go/src/testing/testing.go:2034 +0x87d
testing.(*M).Run(0xc0001212c0)
/usr/local/go/src/testing/testing.go:1906 +0xb45
main.main()
_testmain.go:107 +0x2ea
goroutine 6 [chan send]:
github.com/multiplay/go-ts3.(*Client).ExecCmd(0xc0001214a0, 0xc00010de48)
multiplay/go-ts3/client.go:272 +0xc5
github.com/multiplay/go-ts3.(*Client).Exec(...)
multiplay/go-ts3/client.go:263
github.com/multiplay/go-ts3.TestClientDisconnectSSH(0xc00013cd00)
multiplay/go-ts3/client_test.go:117 +0x245
testing.tRunner(0xc00013cd00, 0x85fb78)
/usr/local/go/src/testing/testing.go:1576 +0x217
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1629 +0x806
goroutine 7 [IO wait]:
internal/poll.runtime_pollWait(0x7fb289939230, 0x72)
/usr/local/go/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc000028498, 0x514e7035306a6e01?, 0x0)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000028480)
/usr/local/go/src/internal/poll/fd_unix.go:614 +0x425
net.(*netFD).accept(0xc000028480)
/usr/local/go/src/net/fd_unix.go:172 +0x4a
net.(*TCPListener).accept(0xc000012438)
/usr/local/go/src/net/tcpsock_posix.go:148 +0x45
net.(*TCPListener).Accept(0xc000012438)
/usr/local/go/src/net/tcpsock.go:297 +0x68
github.com/multiplay/go-ts3.(*server).serve(0xc000016360)
multiplay/go-ts3/mockserver_test.go:181 +0xdb
created by github.com/multiplay/go-ts3.(*server).Start
multiplay/go-ts3/mockserver_test.go:156 +0xa7
goroutine 40 [chan send]:
github.com/multiplay/go-ts3.(*Client).messageHandler(0xc0001214a0)
multiplay/go-ts3/client.go:233 +0x51e
created by github.com/multiplay/go-ts3.NewClient
multiplay/go-ts3/client.go:203 +0x1033
exit status 2
FAIL github.com/multiplay/go-ts3 10.020s
The more I look into it, the more possible concurrency issues I find. Also, debugging concurrent code is not that easy, especially if errors only happen sometimes. As an example: …
Also, I am not sure whether the failing tests are caused by the code or maybe by the mock server.
Yes, I spotted that too. There is also an issue with the association of the error returned from …. I'll have a look to see what else I can find.
I've had a look and I think this PR addresses all the concurrency issues introduced by the notification PR. If you can try that out @HalloTschuess, that would be great.
With #40 the freezing issues have been fixed. However, i/o timeouts are still a little bit confusing. Here's what I found out (using the fix/cmd-processing branch from #40): …
I thought the read should never time out, because the deadline is disabled with …. I'm wondering what a good approach would be for an application that keeps listening for incoming events without timing out. I can think of the following options (one illustrative sketch follows below): …
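One possible approach, purely as an illustration and not necessarily one of the options meant above: refresh the deadlines by sending a cheap command on a ticker. The interval, the "version" command, and the Exec return shape are assumptions.

```go
package keepalive

import (
	"context"
	"log"
	"time"

	ts3 "github.com/multiplay/go-ts3"
)

// keepAlive sends a cheap command at a fixed interval so the connection's
// read/write deadlines keep being refreshed while the application only
// listens for notifications.
func keepAlive(ctx context.Context, c *ts3.Client, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if _, err := c.Exec("version"); err != nil {
				log.Printf("keepalive failed: %v", err)
				return
			}
		}
	}
}
```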
I noticed freezing when using notifications without regularly sending commands on an SSH connection.
Simple example: … (this code will hang after the default timeout of 10s).
Some digging revealed that when reading from an ssh.Channel there is actually also some data written to the underlying connection. Unfortunately, this write times out when no command was sent within the timeout period. The reason is that the write deadline is not reset when reading, since this wasn't necessary until now. This problem is fixed by always setting the write deadline when writing to the underlying connection.
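A minimal sketch of what such a wrapper could look like, based on the snippet quoted in the review above and on this description; the actual implementation in the PR may differ in detail:

```go
package ts3

import (
	"net"
	"time"
)

// writeTimeoutConn wraps a net.Conn and refreshes the write deadline on every
// write, so writes triggered internally by reads (as happens on an
// ssh.Channel) don't hit a deadline that was set when the last command went out.
type writeTimeoutConn struct {
	net.Conn
	timeout time.Duration
}

func (c *writeTimeoutConn) Write(p []byte) (int, error) {
	if c.timeout > 0 {
		if err := c.Conn.SetWriteDeadline(time.Now().Add(c.timeout)); err != nil {
			return 0, err
		}
	}
	return c.Conn.Write(p)
}
```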