test(daemon): Improve cross-platform sockets using pid suffix #1620

SMotaal · 2023-06-08T11:53:03Z

Rerunning yarn test can fail on some systems when a previous run terminates without properly cleaning up the daemon/tmp/endo.sock.

Within the scope of testing and separate from the implementation, the issue can be viewed as a naming conflict, which can be addressed by PID-suffixing the .sock path.

On non-win32 platforms, the .sock is currently being placed inside the worktree, a discrepancy which further complicates this issue, and that can be addressed by moving the .sock outside the worktree, specifically os.tmpdir() given the purpose.

Additional Details

This is particularly a problem on macOS which I encounter with different versions of node, including 14 and 19.

The less-than-ideal workaround for now is to use docker with two separate -v mounts to ensure proper .sock handling:

NODE=19; SRC=$(basename $(pwd)); docker run -v $(pwd):/$SRC -w /$SRC --tmpfs /$SRC/packages/daemon/tmp -it --rm node:$NODE yarn --cache-folder .yarn/cache workspace @endo/daemon test

SMotaal · 2023-06-08T13:28:40Z

Unsure if I should rebase (👍) or update (👎) before landing from my end to avoid unverified commits.

michaelfig

LGTM!

kriskowal · 2023-06-11T19:39:27Z

packages/daemon/test/test-endo.js

-      ? raw`\\?\pipe\endo-test.sock`
-      : path.join(dirname, 'tmp/endo.sock'),
+      ? raw`\\?\pipe\endo-test-${process.pid}.sock`
+      : path.join(os.tmpdir(), `endo-test-${process.pid}.sock`),


Using an OS tmpdir makes it harder to inspect the logs when the tests fail. If we go this route, there at least needs to be a t.log so the developer can rendezvous with the dump, but I’m inclined to keep it local. There’s a corresponding entry in .gitignore.

Isn't this just the socket path, and nothing else? Or is there some bizarre derivation of the log path from the socket path I'm missing?

Oh, I see. You are correct. This only affects the location of the socket or named pipe, not the log dump.

kriskowal · 2023-06-11T19:40:32Z

packages/daemon/test/test-endo.js

@@ -29,8 +30,8 @@ const locator = {
  cachePath: path.join(dirname, 'tmp/cache'),
  sockPath:
    process.platform === 'win32'
-      ? raw`\\?\pipe\endo-test.sock`
-      : path.join(dirname, 'tmp/endo.sock'),
+      ? raw`\\?\pipe\endo-test-${process.pid}.sock`


I’m curious whether the problem can be better addressed by ensuring that Endo stomps over the existing named pipe, since restarting without a graceful shutdown will affect end users as well.

Or, if handling the named pipe gracefully on Windows is not in the cards, we may need to adjust some other behavior around @endo/where for discovering the rendezvous location of the Endo named pipe by looking first at the PID file to infer the PID in the named pipe.

kriskowal

I’m interested in a precise reproduction so we can tell whether this is a problem with the design of the tests, or a broader problem that will affect normal usage.

I the problem affects normal usage, I would like to consider an alternate approach of using a retry loop with backoff around the socket listener, since it’s likely that we’re seeing this problem only because of a delay between when the daemon stops and when its socket gets cleaned up.

The retry loop would be around the attempt to listen, with bounded back-off and full jitter. That is, the first attempt should be fast and it should settle into a reasonable mean polling frequency, but also randomize to avoid synchronizing with other loops.

Some care should be taken since concurrent attempts to start the daemon are likely to occur in practice. In the CLI, every command that uses a daemon client will attempt to start the daemon if it’s not already running. These should not thrash. One should win the race and both should connect to the same daemon instance.

SMotaal · 2023-06-12T11:09:59Z

From the perspective I am coming from, I saw one test for daemon using the locator object. Based on the discussions so far, it is becoming clearer that there many technical complexities involved which are relevant for the debugging and reproduction purposes.

In that sense, I'd like to propose a direction where there are tests for the specific moving parts, to ensure there is enough clarity on what is going on when something breaks, so that the enduser can attempt to figure out the problem or at least know what to report in an issue.

I was working with limited insights and bandwidth, and did not explicitly determine the causes of the failures. I explored different workarounds at the different times and those suggest that the test is starting in an environment that did not meet the preconditions where it could create the necessary socket.

The most stable workaround of course where docker's tmpfs suggested that this was not an implementation bug, merely a testing complexity, and possibly the need for further refinements. The discussions brought further context about implementation complexity which was not apparent from the tests in daemon at this point.

It is reasonable to pretest any assumptions about the platform or the environment ahead of time. Otherwise, the runtime throwing an error when the only parameter is a locator invites the first rational step of changing this locator. All tests passing with such a change and without any warnings or clear indications in the code to urge caution, this invites further assumptions and seemingly rational steps.

With all the moving parts involved, assumptions will feel like insights, until there is insight that they were not 😅

Meanwhile, I feel strongly inclined that passing a locator argument is concrete enough that it ought to fail with an error message indicating if it needs to meet certain conditions, a warning if the behaviour may be desirable under one or more of several scenarios.

Ultimately, I would rather we figure out a direction for a PR that would allow better diagnostics and concrete insights, to better separate implementation, platform and environment concerns.

SMotaal marked this pull request as ready for review June 8, 2023 12:11

SMotaal force-pushed the smotaal/daemon-test-cross-platform-socket-pid-suffixing branch from 1630693 to 1d5748f Compare June 10, 2023 09:41

test(daemon): Improve cross-platform sockets using pid suffix

6c379f4

SMotaal force-pushed the smotaal/daemon-test-cross-platform-socket-pid-suffixing branch from 1d5748f to 6c379f4 Compare June 11, 2023 19:27

michaelfig approved these changes Jun 11, 2023

View reviewed changes

kriskowal reviewed Jun 11, 2023

View reviewed changes

kriskowal requested changes Jun 11, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(daemon): Improve cross-platform sockets using pid suffix #1620

test(daemon): Improve cross-platform sockets using pid suffix #1620

SMotaal commented Jun 8, 2023 •

edited

Loading

SMotaal commented Jun 8, 2023

michaelfig left a comment

kriskowal Jun 11, 2023

michaelfig Jun 11, 2023 •

edited

Loading

kriskowal Jun 11, 2023

kriskowal Jun 11, 2023 •

edited

Loading

kriskowal Jun 11, 2023

kriskowal left a comment

SMotaal commented Jun 12, 2023 •

edited

Loading

test(daemon): Improve cross-platform sockets using pid suffix #1620

Are you sure you want to change the base?

test(daemon): Improve cross-platform sockets using pid suffix #1620

Conversation

SMotaal commented Jun 8, 2023 • edited Loading

SMotaal commented Jun 8, 2023

michaelfig left a comment

Choose a reason for hiding this comment

kriskowal Jun 11, 2023

Choose a reason for hiding this comment

michaelfig Jun 11, 2023 • edited Loading

Choose a reason for hiding this comment

kriskowal Jun 11, 2023

Choose a reason for hiding this comment

kriskowal Jun 11, 2023 • edited Loading

Choose a reason for hiding this comment

kriskowal Jun 11, 2023

Choose a reason for hiding this comment

kriskowal left a comment

Choose a reason for hiding this comment

SMotaal commented Jun 12, 2023 • edited Loading

SMotaal commented Jun 8, 2023 •

edited

Loading

michaelfig Jun 11, 2023 •

edited

Loading

kriskowal Jun 11, 2023 •

edited

Loading

SMotaal commented Jun 12, 2023 •

edited

Loading