fix: write i/o timeout on SSH connection #39

Open · wants to merge 6 commits into master

Conversation

HalloTschuess
Contributor

I noticed freezing when using notifications without regularly sending commands on an SSH connection.

Simple example: This code will hang after the default timeout of 10s.

package main

import (
	"fmt"

	"github.com/multiplay/go-ts3"
	"golang.org/x/crypto/ssh"
)

func main() {
	client, err := ts3.NewClient("localhost:10022", ts3.SSH(&ssh.ClientConfig{
		User:            "serveradmin",
		Auth:            []ssh.AuthMethod{ssh.Password("password")},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}))
	if err != nil {
		panic(err)
	}

	if err := client.Use(1); err != nil {
		panic(err)
	}

	if err := client.Register(ts3.TextServerEvents); err != nil {
		panic(err)
	}

	for n := range client.Notifications() {
		fmt.Printf("%+v\n", n)
	}
}

Some digging revealed that when reading from an ssh.Channel, some data is also written to the underlying connection. Unfortunately, this write times out when no command has been sent within the timeout period. The reason is that the write deadline was not reset when reading, since that wasn't necessary until now.

This problem is fixed by always setting the write deadline when writing to the underlying connection.
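A minimal sketch of that approach, assuming a wrapper around net.Conn as hinted at by the review snippet further down; the actual field names and error handling in the PR may differ:

// Sketch only, assuming this lives alongside the client in the ts3 package.
package ts3

import (
	"net"
	"time"
)

// writeTimeoutConn wraps a net.Conn and refreshes the write deadline before
// every write, so reads that trigger internal writes (such as SSH window
// adjustments) no longer hit a stale deadline.
type writeTimeoutConn struct {
	net.Conn
	timeout time.Duration
}

// Write resets the write deadline and then forwards to the wrapped connection.
func (c *writeTimeoutConn) Write(p []byte) (int, error) {
	if err := c.Conn.SetWriteDeadline(time.Now().Add(c.timeout)); err != nil {
		return 0, err
	}
	return c.Conn.Write(p)
}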

@HalloTschuess changed the title from "fix: write i/o timeout on ssh connection" to "fix: write i/o timeout on SSH connection" on Feb 21, 2023
@HalloTschuess
Contributor Author

Windows linting seems to have timed out. I cannot find where to restart the job manually.


c.Conn = &writeTimeoutConn{
	Conn:    conn,
	timeout: timeout,

Contributor

This is using what the user believes is the connect timeout, not the read/write timeout; the same applies to the SSH version.

Given the user is passing in a read/write/dial timeout, the best option might be to just clarify its use. Thoughts?

Contributor Author

When using NewClient() with the Timeout() option, the behavior didn't change. This is also documented for the Timeout option: "Timeout sets read / write / dial timeout for a TeamSpeak 3 Client."

I added some documentation to the Connect methods for clarification though.

@stevenh (Contributor) left a comment

Looks like this might have caused a deadlock on Windows, see:
https://github.com/Unity-Technologies/go-ts3/actions/runs/4244839889/jobs/7379651078

@HalloTschuess
Contributor Author

Yeah, just noticed that too. Yesterday all tests ran on my Windows machine; now they don't. I'll investigate.

@HalloTschuess
Contributor Author

I think I found the issue. From what I can tell it had nothing to do with my changes, though they might have amplified the problem.

It could happen in the rare case where the server connection was closed while a command was already executing. In the test it was freezing on c.Close(), which executes the quit command. Since the server is shut down shortly before that, there can be a short window where the connection is not yet disconnected. This leads to a blocking write to the channel c.work in c.ExecCmd().

I added a select around the write to c.work with an additional check for c.disconnect and a timeout (see the sketch below).
This might also fix #21, but I'm not sure since I cannot reproduce the freezing problem stated there.
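A rough illustration of the guarded send described above, written as a standalone helper; the helper name, error values, and timeout handling are hypothetical, not the PR's exact code:

// Illustration only; names and error values are hypothetical.
package ts3

import (
	"errors"
	"time"
)

// queueCommand shows the guarded send: the write to the work channel is
// abandoned if the client disconnects first or the timeout expires.
func queueCommand(work chan<- string, disconnect <-chan struct{}, timeout time.Duration, cmd string) error {
	select {
	case work <- cmd:
		return nil
	case <-disconnect:
		return errors.New("client disconnected")
	case <-time.After(timeout):
		return errors.New("timed out queueing command")
	}
}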

@stevenh (Contributor) left a comment

This doesn't look right, as there should never be a need to time out writes to that channel.

Looking at the previous error log I see there are 3 active clients, and the one running the close doesn't have any handlers, which looks like the underlying cause.

goroutine 41 [chan send]: (client: 0xc000097540)
github.com/multiplay/go-ts3.(*Client).ExecCmd(0xc000097540, 0xc000095b40)
	D:/a/go-ts3/go-ts3/client.go:271 +0xc5
	c.work <- cmd.String() // cmd = quit from c.Close()

goroutine 23 [chan send]: (client: 0xc000097400)
github.com/multiplay/go-ts3.(*Client).messageHandler(0xc000097400)
	D:/a/go-ts3/go-ts3/client.go:231 +0x445
	c.err <- err

goroutine 24 [chan send]: (client: 0xc000097400)
github.com/multiplay/go-ts3.(*Client).process(...)
	D:/a/go-ts3/go-ts3/client.go:256
	c.err <- err

goroutine 13 [chan send]: (client: 0xc0000974a0)
github.com/multiplay/go-ts3.(*Client).messageHandler(0xc0000974a0)
	D:/a/go-ts3/go-ts3/client.go:231 +0x445
	c.err <- err

goroutine 14 [chan send]: (client: 0xc0000974a0)
github.com/multiplay/go-ts3.(*Client).process(...)
	D:/a/go-ts3/go-ts3/client.go:256
	c.err <- err

Given this, I wonder if the real cause is a race processing the error written to c.err in messageHandler for the disconnect case. If so, this might fix it:

diff --git a/client.go b/client.go
index d9c5a7d..ed369a5 100644
--- a/client.go
+++ b/client.go
@@ -228,11 +228,12 @@ func (c *Client) messageHandler() {
                        }
                } else {
                        err := c.scanErr()
-                       c.err <- err
                        if errors.Is(err, io.ErrUnexpectedEOF) {
                                close(c.disconnect)
+                               c.err <- err
                                return
                        }
+                       c.err <- err
                }
        }
 }

Can you check that?

@HalloTschuess
Contributor Author

You are right, your suggestion seems to have fixed the issue as well. Digging into it, it pretty much boils down to what you suspected. I ran the tests a few hundred times and they never failed.

I also noticed that when an SSH connection disconnects, it returns a wrapped io.EOF instead of io.ErrUnexpectedEOF. This is now checked as well.
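A minimal sketch of that broadened check, using errors.Is against both sentinel errors; the helper name is hypothetical and the PR's actual condition may differ:

// Illustration only; the helper name is hypothetical.
package ts3

import (
	"errors"
	"io"
)

// isDisconnect reports whether err indicates the server connection went away:
// plain connections surface io.ErrUnexpectedEOF, while a disconnected SSH
// connection returns a wrapped io.EOF.
func isDisconnect(err error) bool {
	return errors.Is(err, io.ErrUnexpectedEOF) || errors.Is(err, io.EOF)
}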

@HalloTschuess
Contributor Author

Since linting seems to always time out for Windows, I'm wondering if the step is even necessary. Shouldn't linting be independent of the OS, or do you have a different experience, @stevenh?

Otherwise I would suggest disabling the step.

@stevenh
Contributor

stevenh commented Feb 22, 2023

> Since linting seems to always time out for Windows, I'm wondering if the step is even necessary. Shouldn't linting be independent of the OS, or do you have a different experience, @stevenh?
>
> Otherwise I would suggest disabling the step.

Something odd going on there; might want to add a --timeout 2m to the Windows lint, but why it's taking so long randomly I don't know.

Linting can be OS dependent, so I do want to keep it.

@stevenh (Contributor) left a comment

Unfortunately still seeing random failures, so that change to disconnect handling doesn't fully fix it.

This is with some additional error wrapping to determine where the EOF was being triggered.

=== RUN   TestClientDeadlineSSH
    client_test.go:223:
                Error Trace:    multiplay/go-ts3/client_test.go:223
                Error:          Received unexpected error:
                                header: client: scan: ssh connection: read: EOF
                Test:           TestClientDeadlineSSH
--- FAIL: TestClientDeadlineSSH (0.11s)
panic: test timed out after 10s
running tests:
        TestClientDisconnectSSH (10s)

goroutine 25 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/go/src/testing/testing.go:2241 +0x219
created by time.goFunc
        /usr/local/go/src/time/sleep.go:176 +0x48

goroutine 1 [chan receive]:
testing.(*T).Run(0xc00013cb60, {0x8357b0, 0x17}, 0x85fb78)
        /usr/local/go/src/testing/testing.go:1630 +0x82e
testing.runTests.func1(0x0?)
        /usr/local/go/src/testing/testing.go:2036 +0x8e
testing.tRunner(0xc00013cb60, 0xc000167b48)
        /usr/local/go/src/testing/testing.go:1576 +0x217
testing.runTests(0xc0001212c0?, {0xa86440, 0x1f, 0x1f}, {0x1c?, 0x1e?, 0xa8b020?})
        /usr/local/go/src/testing/testing.go:2034 +0x87d
testing.(*M).Run(0xc0001212c0)
        /usr/local/go/src/testing/testing.go:1906 +0xb45
main.main()
        _testmain.go:107 +0x2ea

goroutine 6 [chan send]:
github.com/multiplay/go-ts3.(*Client).ExecCmd(0xc0001214a0, 0xc00010de48)
        multiplay/go-ts3/client.go:272 +0xc5
github.com/multiplay/go-ts3.(*Client).Exec(...)
        multiplay/go-ts3/client.go:263
github.com/multiplay/go-ts3.TestClientDisconnectSSH(0xc00013cd00)
        multiplay/go-ts3/client_test.go:117 +0x245
testing.tRunner(0xc00013cd00, 0x85fb78)
        /usr/local/go/src/testing/testing.go:1576 +0x217
created by testing.(*T).Run
        /usr/local/go/src/testing/testing.go:1629 +0x806

goroutine 7 [IO wait]:
internal/poll.runtime_pollWait(0x7fb289939230, 0x72)
        /usr/local/go/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc000028498, 0x514e7035306a6e01?, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc000028480)
        /usr/local/go/src/internal/poll/fd_unix.go:614 +0x425
net.(*netFD).accept(0xc000028480)
        /usr/local/go/src/net/fd_unix.go:172 +0x4a
net.(*TCPListener).accept(0xc000012438)
        /usr/local/go/src/net/tcpsock_posix.go:148 +0x45
net.(*TCPListener).Accept(0xc000012438)
        /usr/local/go/src/net/tcpsock.go:297 +0x68
github.com/multiplay/go-ts3.(*server).serve(0xc000016360)
        multiplay/go-ts3/mockserver_test.go:181 +0xdb
created by github.com/multiplay/go-ts3.(*server).Start
        multiplay/go-ts3/mockserver_test.go:156 +0xa7

goroutine 40 [chan send]:
github.com/multiplay/go-ts3.(*Client).messageHandler(0xc0001214a0)
        multiplay/go-ts3/client.go:233 +0x51e
created by github.com/multiplay/go-ts3.NewClient
        multiplay/go-ts3/client.go:203 +0x1033
exit status 2
FAIL    github.com/multiplay/go-ts3     10.020s

@HalloTschuess
Contributor Author

The more I look into it, the more possible concurrency issues I find. Also, debugging concurrent code is not that easy, especially when errors only happen sometimes.

As an example:
c.err is only read in c.ExecCmd(), but c.err can receive errors even when no command is currently being executed. This can happen if the connection is unexpectedly disconnected, and it leads to a blocking send on c.err that nobody will ever receive. That goroutine therefore leaks, as it never ends.
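One hypothetical way to avoid that leak, sketched for illustration only and not the fix that was eventually adopted in the library: make the send non-blocking and drop the error when no ExecCmd call is waiting.

// Illustration only; not the library's actual fix.
package ts3

// reportErr tries to hand err to a waiting ExecCmd call; if nothing is reading
// from the errs channel, the error is dropped instead of blocking this
// goroutine forever.
func reportErr(errs chan<- error, err error) {
	select {
	case errs <- err:
	default:
		// no command in flight; discard rather than leak the goroutine
	}
}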

@HalloTschuess
Contributor Author

Also, I am not sure whether the failing tests are because of the code or maybe the mockserver.
On my machine at least, it always takes a few hundred runs until one of the test cases fails, and usually it is because of an EOF, which might suggest the mockserver failing (maybe).

@stevenh
Contributor

stevenh commented Feb 23, 2023

> The more I look into it, the more possible concurrency issues I find. Also, debugging concurrent code is not that easy, especially when errors only happen sometimes.
>
> As an example: c.err is only read in c.ExecCmd(), but c.err can receive errors even when no command is currently being executed. This can happen if the connection is unexpectedly disconnected, and it leads to a blocking send on c.err that nobody will ever receive. That goroutine therefore leaks, as it never ends.

Yes, I spotted that too.

There is also an issue with the association of the error returned from ExecCmd if a keep-alive request fails.

I'll have a look to see what else I can find.

@stevenh
Contributor

stevenh commented Feb 24, 2023

I've had a look and I think this PR addresses all the concurrency issues introduced by the notification PR.

If you can try that out @HalloTschuess that would be great.

@P4sca1

P4sca1 commented Apr 13, 2023

With #40 the freezing issues have been fixed. However, i/o timeouts are still a little bit confusing. Here's what I found out (using the fix/cmd-processing branch from #40):

  • The example code in the first post of this issue will exit on the first message that is read after 10 seconds (the default timeout).
  • The scanner.Scan() method in messageHandler will return false when that happens, and scanner.Err() indicates a deadline-exceeded error with err.Op = "write".

I thought the read should never time out, because the read deadline is disabled with SetReadDeadline(time.Time{}). However, the read can still time out when the write deadline is exceeded, because ssh.channel.ReadExtended calls ssh.channel.adjustWindow, which writes to the connection.

I'm wondering what would be a good approach for an application that keeps listening for incoming events without timing out. I can think of the following options:

  1. Disable the write deadline. This might not be a good idea, because real timeouts (the TS server going down) could no longer be detected.
  2. Set the keep-alive interval to a value that is less than the timeout (see the sketch below). Is this a good default? Are there any alternatives?
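
A rough sketch of option 2, assuming the client exposed a keep-alive option next to Timeout; the ts3.KeepAlive name here is hypothetical and not a confirmed part of the library's API:

package main

import (
	"time"

	"github.com/multiplay/go-ts3"
	"golang.org/x/crypto/ssh"
)

func main() {
	// ts3.KeepAlive is hypothetical; Timeout and SSH are the options already
	// referenced in this discussion.
	client, err := ts3.NewClient("localhost:10022",
		ts3.Timeout(10*time.Second),
		ts3.KeepAlive(5*time.Second), // keep-alive shorter than the read/write timeout
		ts3.SSH(&ssh.ClientConfig{
			User:            "serveradmin",
			Auth:            []ssh.AuthMethod{ssh.Password("password")},
			HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		}),
	)
	if err != nil {
		panic(err)
	}
	defer client.Close()
}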
