Skip to content

Subcompilation fails when build-exe executed in parallel #9439

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
zeramorphic opened this issue Jul 22, 2021 · 5 comments · Fixed by #10290
Closed

Subcompilation fails when build-exe executed in parallel #9439

zeramorphic opened this issue Jul 22, 2021 · 5 comments · Fixed by #10290
Labels
bug Observed behavior contradicts documented or intended behavior frontend Tokenization, parsing, AstGen, Sema, and Liveness.
Milestone

Comments

@zeramorphic
Copy link

zeramorphic commented Jul 22, 2021

Executing the zig compiler (specifically zig build-exe) multiple times in parallel can occasionally cause subcompilations to fail. This issue is present on both Ubuntu and Windows, running from the master branch. A sample of error messages:

Locally on Windows:
error: sub-compilation of compiler_rt failed
    C:\Users\<name>\.quill\compiler-deps\zig\lib\zig\std\special\compiler_rt\compareXf2.zig:10:21: error: unable to load 'C:\\Users\\<name>\\AppData\\Local\\zig\\o\\1e050970caac7a47b47d926ff10336fa\builtin.zig': UnexpectedEndOfFile

CI on Ubuntu:
error: sub-compilation of compiler_rt failed
    /opt/hostedtoolcache/zig/zig-linux-x86_64-0.9.0-dev.453+7ef854682/x64/lib/std/special/compiler_rt/mulodi4.zig:6:25: error: unable to load '/home/runner/.cache/zig/o/7b7987c94a9484a14de4c027d3e08478/builtin.zig': UnexpectedEndOfFile

CI on Windows:
error: sub-compilation of libssp failed
    C:\hostedtoolcache\windows\zig\zig-windows-x86_64-0.9.0-dev.477+00e944f71\x64\lib\zig\std\start.zig:8:22: error: unable to load 'C:\\Users\\runneradmin\\AppData\\Local\\zig\\o\\0e84b4fe2fc2fbd9bba0df21f28dc564\builtin.zig': UnexpectedEndOfFile

This can happen during CI runs (1, 2). In this project, I'm using zig as essentially a portable linker, so the test suite calls build-exe dozens of times in different threads. Without changes to my code, the same test can succeed or fail seemingly at random.

At a guess, it looks like there might be a race condition between different invocations of the compiler writing to and reading from files.

@alexrp
Copy link
Member

alexrp commented Jul 22, 2021

Out of curiosity, could you try setting the ZIG_LOCAL_CACHE_DIR environment variable to a directory local to each compilation and see if it still happens?

@zeramorphic
Copy link
Author

Running the same tests (with unique ZIG_LOCAL_CACHE_DIR) about 10 times eventually caused an error. This time it was in the linking stage.

lld-link: error: could not open 'C:\Users\<name>\AppData\Local\zig\o\acb93446373230c63e3bbe4c5551b85b\user32.lib': no such file or directory
error: LLDReportedFailure

Some files were kept in ZIG_LOCAL_CACHE_DIR, but evidently not all of the files that Zig uses.

@andrewrk andrewrk added the bug Observed behavior contradicts documented or intended behavior label Jul 22, 2021
@andrewrk andrewrk added this to the 0.8.1 milestone Jul 22, 2021
@andrewrk andrewrk added the frontend Tokenization, parsing, AstGen, Sema, and Liveness. label Jul 22, 2021
@andrewrk andrewrk modified the milestones: 0.8.1, 0.9.1 Aug 31, 2021
@andrewrk andrewrk modified the milestones: 0.9.1, 0.9.0 Nov 20, 2021
@andrewrk
Copy link
Member

Reminder to ping the thread of #10157 when this is solved and make sure it solved that issue too.

@andrewrk
Copy link
Member

Reminder to ping the thread of #10158 when this is solved and make sure it solved that issue too.

@andrewrk
Copy link
Member

andrewrk commented Dec 4, 2021

I've spent a few hours looking into this today and it'll take me another day or two but I've diagnosed the problems and I know how to solve them. The bad news is that fixing #9836 is not possible without hamstringing our std.fs API guarantees, but the good news is that #9836 does not appear to be the cause of this bug. It's something more related to the innards of the compiler. Anyway I'm confident I can fix it soon and I'm actively working on it.

andrewrk added a commit that referenced this issue Dec 6, 2021
All Zig code is eligible to `@import("builtin")` which is mapped to a
generated file, build.zig, based on the target and other settings.

Zig invocations which share the same target settings will generate the
same builtin.zig file and thus the path to builtin.zig is in a shared
cache folder, and different projects can sometimes use the same file.

Before this commit, this led to race conditions where multiple
invocations of `zig` would race to write this file. If one process
wanted to *read* the file while the other process *wrote* the file, the
reading process could observe a truncated or partially written
builtin.zig file.

This commit makes the following improvements:
 - limitations:
   - avoid clobbering the inode, mtime in the hot path
   - avoid creating a partially written file
   - builtin.zig needs to be on disk for debug info / stack trace purposes
   - don't mark the task as complete until the file is finished being populated
     (possibly by an external process)
 - strategy:
   - create the `@import("builtin")` `Module.File` during the AstGen
     work, based on generating the contents in memory rather than
     loading from disk.
   - write builtin.zig in a separate task that doesn't have
     to complete until the end of the AstGen work queue so that it
     can be done in parallel with everything else.
   - when writing the file, first stat the file path. If it exists, we are done.
   - otherwise, write the file to a temp file in the same directory and atomically
     rename it into place (clobbering the inode, mtime in the cold path).
 - summary:
   - all limitations respected
   - hot path: one stat() syscall that happens in a worker thread

This required adding a missing function to the standard library:
`std.fs.Dir.statFile`. In this commit, it does open() and then fstat()
which is two syscalls. It should be improved in a future commit to only
make one.

Fixes #9439.
andrewrk added a commit that referenced this issue Dec 7, 2021
All Zig code is eligible to `@import("builtin")` which is mapped to a
generated file, build.zig, based on the target and other settings.

Zig invocations which share the same target settings will generate the
same builtin.zig file and thus the path to builtin.zig is in a shared
cache folder, and different projects can sometimes use the same file.

Before this commit, this led to race conditions where multiple
invocations of `zig` would race to write this file. If one process
wanted to *read* the file while the other process *wrote* the file, the
reading process could observe a truncated or partially written
builtin.zig file.

This commit makes the following improvements:
 - limitations:
   - avoid clobbering the inode, mtime in the hot path
   - avoid creating a partially written file
   - builtin.zig needs to be on disk for debug info / stack trace purposes
   - don't mark the task as complete until the file is finished being populated
     (possibly by an external process)
 - strategy:
   - create the `@import("builtin")` `Module.File` during the AstGen
     work, based on generating the contents in memory rather than
     loading from disk.
   - write builtin.zig in a separate task that doesn't have
     to complete until the end of the AstGen work queue so that it
     can be done in parallel with everything else.
   - when writing the file, first stat the file path. If it exists, we are done.
   - otherwise, write the file to a temp file in the same directory and atomically
     rename it into place (clobbering the inode, mtime in the cold path).
 - summary:
   - all limitations respected
   - hot path: one stat() syscall that happens in a worker thread

This required adding a missing function to the standard library:
`std.fs.Dir.statFile`. In this commit, it does open() and then fstat()
which is two syscalls. It should be improved in a future commit to only
make one.

Fixes #9439.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Observed behavior contradicts documented or intended behavior frontend Tokenization, parsing, AstGen, Sema, and Liveness.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants