Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(daemon): Improve cross-platform sockets using pid suffix #1620

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

SMotaal
Copy link
Member

@SMotaal SMotaal commented Jun 8, 2023

Rerunning yarn test can fail on some systems when a previous run terminates without properly cleaning up the daemon/tmp/endo.sock.

Within the scope of testing and separate from the implementation, the issue can be viewed as a naming conflict, which can be addressed by PID-suffixing the .sock path.

On non-win32 platforms, the .sock is currently being placed inside the worktree, a discrepancy which further complicates this issue, and that can be addressed by moving the .sock outside the worktree, specifically os.tmpdir() given the purpose.


Additional Details

This is particularly a problem on macOS which I encounter with different versions of node, including 14 and 19.

The less-than-ideal workaround for now is to use docker with two separate -v mounts to ensure proper .sock handling:

NODE=19; SRC=$(basename $(pwd)); docker run -v $(pwd):/$SRC -w /$SRC --tmpfs /$SRC/packages/daemon/tmp -it --rm node:$NODE yarn --cache-folder .yarn/cache workspace @endo/daemon test

@SMotaal SMotaal marked this pull request as ready for review June 8, 2023 12:11
@SMotaal
Copy link
Member Author

SMotaal commented Jun 8, 2023

Unsure if I should rebase (👍) or update (👎) before landing from my end to avoid unverified commits.

@SMotaal SMotaal force-pushed the smotaal/daemon-test-cross-platform-socket-pid-suffixing branch from 1630693 to 1d5748f Compare June 10, 2023 09:41
@SMotaal SMotaal force-pushed the smotaal/daemon-test-cross-platform-socket-pid-suffixing branch from 1d5748f to 6c379f4 Compare June 11, 2023 19:27
Copy link
Member

@michaelfig michaelfig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

? raw`\\?\pipe\endo-test.sock`
: path.join(dirname, 'tmp/endo.sock'),
? raw`\\?\pipe\endo-test-${process.pid}.sock`
: path.join(os.tmpdir(), `endo-test-${process.pid}.sock`),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an OS tmpdir makes it harder to inspect the logs when the tests fail. If we go this route, there at least needs to be a t.log so the developer can rendezvous with the dump, but I’m inclined to keep it local. There’s a corresponding entry in .gitignore.

Copy link
Member

@michaelfig michaelfig Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just the socket path, and nothing else? Or is there some bizarre derivation of the log path from the socket path I'm missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. You are correct. This only affects the location of the socket or named pipe, not the log dump.

@@ -29,8 +30,8 @@ const locator = {
cachePath: path.join(dirname, 'tmp/cache'),
sockPath:
process.platform === 'win32'
? raw`\\?\pipe\endo-test.sock`
: path.join(dirname, 'tmp/endo.sock'),
? raw`\\?\pipe\endo-test-${process.pid}.sock`
Copy link
Member

@kriskowal kriskowal Jun 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m curious whether the problem can be better addressed by ensuring that Endo stomps over the existing named pipe, since restarting without a graceful shutdown will affect end users as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, if handling the named pipe gracefully on Windows is not in the cards, we may need to adjust some other behavior around @endo/where for discovering the rendezvous location of the Endo named pipe by looking first at the PID file to infer the PID in the named pipe.

Copy link
Member

@kriskowal kriskowal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m interested in a precise reproduction so we can tell whether this is a problem with the design of the tests, or a broader problem that will affect normal usage.

I the problem affects normal usage, I would like to consider an alternate approach of using a retry loop with backoff around the socket listener, since it’s likely that we’re seeing this problem only because of a delay between when the daemon stops and when its socket gets cleaned up.

The retry loop would be around the attempt to listen, with bounded back-off and full jitter. That is, the first attempt should be fast and it should settle into a reasonable mean polling frequency, but also randomize to avoid synchronizing with other loops.

Some care should be taken since concurrent attempts to start the daemon are likely to occur in practice. In the CLI, every command that uses a daemon client will attempt to start the daemon if it’s not already running. These should not thrash. One should win the race and both should connect to the same daemon instance.

@SMotaal
Copy link
Member Author

SMotaal commented Jun 12, 2023

From the perspective I am coming from, I saw one test for daemon using the locator object. Based on the discussions so far, it is becoming clearer that there many technical complexities involved which are relevant for the debugging and reproduction purposes.

In that sense, I'd like to propose a direction where there are tests for the specific moving parts, to ensure there is enough clarity on what is going on when something breaks, so that the enduser can attempt to figure out the problem or at least know what to report in an issue.

I was working with limited insights and bandwidth, and did not explicitly determine the causes of the failures. I explored different workarounds at the different times and those suggest that the test is starting in an environment that did not meet the preconditions where it could create the necessary socket.

The most stable workaround of course where docker's tmpfs suggested that this was not an implementation bug, merely a testing complexity, and possibly the need for further refinements. The discussions brought further context about implementation complexity which was not apparent from the tests in daemon at this point.

It is reasonable to pretest any assumptions about the platform or the environment ahead of time. Otherwise, the runtime throwing an error when the only parameter is a locator invites the first rational step of changing this locator. All tests passing with such a change and without any warnings or clear indications in the code to urge caution, this invites further assumptions and seemingly rational steps.

With all the moving parts involved, assumptions will feel like insights, until there is insight that they were not 😅

Meanwhile, I feel strongly inclined that passing a locator argument is concrete enough that it ought to fail with an error message indicating if it needs to meet certain conditions, a warning if the behaviour may be desirable under one or more of several scenarios.

Ultimately, I would rather we figure out a direction for a PR that would allow better diagnostics and concrete insights, to better separate implementation, platform and environment concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants