Use a lock file to avoid exceptions due to concurrent symlink creation #2851
base: develop
Conversation
I'm unlikely to have time to really dive into this in the next couple of weeks but from first read:
Once the program holding the lock closes the file when leaving the 'with' statement that opened it, the other one can grab the lock. There's no explicit unlocking needed, if that's what you were looking for. I can add an explicit unlock if you want. Actually, while running the 3-liner after copying and pasting it into the python interpreter prompt, it also prints out information in each loop:
Otherwise it's easy to add a print statement into the loop, if run from a .py file, to see that they all run fine.
You really need sustained concurrency for a while to trigger it. Oddly, we ran into this issue quite quickly when running concurrent test cases on GitHub Actions.
I am not as familiar with the code as you are. I am primarily trying to address the issue that we were seeing at the level of the root cause. There we can set a lock file with fine granularity that only serializes access to the specific resource that we need to protect from concurrent access. However, it's quite possible that locking at a higher level could be needed if concurrent processes were to conflict on shared files.
What we need is a file that is there for everyone to grab a lock on, and that doesn't get removed while concurrent processes/threads are trying to grab the lock. Part of the code that we are trying to protect from concurrency deals with removing an existing symlink first (presumably because it is 'wrong') and then setting it again. The removal of the symlink would be a problem.

Also, grabbing a lock on the 'current' file itself causes that file to be created and remain at 0 bytes, while with a separate lock file I get the current file with its content of 5413 bytes. So it seemed better to create an explicit lock file. I agree that it's not nice to have these lock files permanently lying around after they have been used, but I don't see another easy alternative. Other choices could be:
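As a concrete sketch of the approach being discussed: the commit message mentions fcntl.lockf-based locking for Linux/Mac and a different implementation on Windows. The helper below is my illustrative reconstruction of that idea, not the PR's exact code; the function name and the msvcrt details are assumptions.

```python
# Sketch of a cross-platform lock_file() helper in the spirit of the
# Stack Overflow approach the commit message cites. Name and Windows
# details are illustrative assumptions, not the PR's exact code.
import os

if os.name == "nt":
    import msvcrt

    def lock_file(f):
        # Lock the first byte of the file; msvcrt.LK_LOCK retries for a
        # while and raises OSError if the lock cannot be obtained.
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
else:
    import fcntl

    def lock_file(f):
        # Block until an exclusive lock is obtained; the lock is released
        # automatically when the file object is closed.
        fcntl.lockf(f, fcntl.LOCK_EX)
```

A caller would then serialize the critical section with `with open(lock_path, "wb") as f: lock_file(f); ...`, relying on the implicit close at the end of the with-block to release the lock.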
ah apologies, I missed the relation to the file handle -- this means I will have to take a closer look at the overall need for locking though, as this lock is very short-lived. Thanks for thinking out loud about the lock file issue as well.
I'm returning to this after a longer pause (sorry about that): summarizing the situation for myself and others:
So the remaining question seems to be: is it useful to merge this PR (fixing a real problem, but potentially later exposing the same users to issue 2), or should I try to implement the locking more extensively? I will have a look at the code today/tomorrow and see if a more comprehensive change makes sense... if not, I think this PR is still useful (even if I might still not recommend running 2 ngclients at the same time without further work)
I'm leaning towards accepting this PR and not trying something fancier:
- This PR is quite contained and less likely to fail in unexpected ways: if I add higher-level locking, there is a chance that we end up debugging locks that are not released when we expect -- and debugging nondeterministically broken locking is the worst
- If this does not completely fix the issue, we're not really worse off than we are now -- the added complexity seems acceptable
I left some review comments (only the file name issue seems really important). Let me know what you think.
Also I'm pretty sure there are lint issues in the code (we have pretty strict linter settings) but for some reason GitHub does not show me a button to approve the test run...
tuf/ngclient/updater.py
Outdated
@@ -66,6 +66,7 @@
 from tuf.api.metadata import Root, Snapshot, TargetFile, Targets, Timestamp
 from tuf.ngclient._internal.trusted_metadata_set import TrustedMetadataSet
 from tuf.ngclient.config import EnvelopeType, UpdaterConfig
+from tuf.ngclient.file_lock import lock_file
-from tuf.ngclient.file_lock import lock_file
+from tuf.ngclient._internal.file_lock import lock_file
maybe hide this in _internal so we make it absolutely clear these hacks are not API.
I will move it.
tuf/ngclient/updater.py
Outdated
os.remove(linkname)
os.symlink(current, linkname)

with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
I think using the version number here is potentially harmful (the lock file is now <metadata_dir>/root_history/<version>.root.json.lck) -- but we are actually trying to protect access to the unversioned symlink.
-with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
+with open(os.path.join(self._dir, ".root.json.lck"), "wb") as f:
(I also added the leading dot but that's just a suggestion)
Sounds good. Let me change this.
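For illustration, the protected section could end up looking roughly like the sketch below. This is self-contained rather than the PR's actual code: the POSIX fcntl lock is inlined instead of calling the PR's lock_file() helper, and the function name and existence check are my assumptions.

```python
# Rough sketch of the symlink update protected by a dedicated lock file.
# POSIX-only (fcntl); function name and details are illustrative.
import fcntl
import os

def update_root_symlink(metadata_dir, current, linkname):
    # Hold an exclusive lock while the symlink is removed and recreated,
    # so two processes cannot race between the two steps.
    lock_path = os.path.join(metadata_dir, ".root.json.lck")
    with open(lock_path, "wb") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)
        if os.path.lexists(linkname):
            os.remove(linkname)
        os.symlink(current, linkname)
    # closing f releases the lock
```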
tuf/ngclient/updater.py
Outdated
os.symlink(current, linkname)

with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
    lock_file(f)
A comment explaining why lock_file() is called might make sense here (maybe mention the Windows issues specifically, for anyone who tries to repro in the future)
I have run into some issues on Windows. When I dump filenames, it shows me something like this:

C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\root_history\12.root.json

I cannot access this file. When trying to walk the directory structure, it ends for me at C:\...\Local, so there's no sigstore directory that I could cd into. Then I tried to switch to locking via a single file at %TEMP%\lck -- it works 'better' but strangely is not 100% reliable like it is on Linux.
Because of Windows I would need to switch to the filelock library, which is already available when installing, for example, the dev or build requirements. As it looks, I may still need to either switch entirely to different file paths for locking or at least use %TEMP% file paths on Windows.
Without any locking, it fails quickly on Windows. With locking (from the filelock package, for example) it works better, but not 100%: after some time (several minutes) it fails again. Very odd.
hmm, let me check if the directory is guaranteed to exist at this point...
- Using TEMP is not a solution: the reason for running this code is to create the files in the metadata directory, so that has to work; otherwise the code is useless
- I'm a bit confused: can you reproduce the "missing directory" issue with code from main (as in, doing whatever you did to get the failure, but without your changes)? We do run tests on Windows, so I would expect this code to work there... but Windows has never been an important platform, so it's not impossible that there are platform bugs
Feel free to include the full error in your copy-paste (including the exception name) if you need to include another one.
I thought maybe filelock has a subtle bug (it's deprecated), so I moved on to the fasteners package, but it does not solve the above PermissionError problem either. I tried to protect 12.root.json from concurrent access with 3 different locks now. I am not sure where else it could be accessed such that this PermissionError could occur. The loop solves it, but it should not be necessary:
lck = os.path.join(self._dir, ".root.json.lck")
ipl = InterProcessLock(create_lockfile(lck))
ipl.acquire()
try:
    while True:
        try:
            os.replace(temp_file.name, filename)
            break
        except PermissionError as pe:
            print(f"{pe}")
finally:
    ipl.release()
another question: does the "Access is denied" only happen when you run your parallel execution test case?
Because if it does only happen in that case, then the "Access is denied" actually means "the file is used by another process", meaning we actually need the more extensive locking process: handling just the symlink is not enough
> another question: does the "Access is denied" only happen when you run your parallel execution test case?

Yes. And this is the only reason I see it failing now, when I let the exception propagate.

> Because if it does only happen in that case, then the "Access is denied" actually means "the file is used by another process", meaning we actually need the more extensive locking process: handling just the symlink is not enough

That's what I guessed as well, and that's the reason I now have 3 locks. Though maybe there is still another path where this file is being used concurrently. The test is still the simple 2-liner loop... if you have an idea...
Thanks for testing, I will have a look at the code: I think multiple separate locks during refresh() are unappealing, but maybe setting up a single lock for the whole method is OK... I still need to think of something reasonable for download_target(), as I'm sure that will conflict in the same way -- and in that case, locking for the whole method might not be appropriate.
But (unless you solve everything in the meantime) I'll have a look with fresh eyes tomorrow and hopefully come up with something.
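If the coarser, single-lock-per-method approach wins out, one possible shape is sketched below. This is a POSIX-only sketch (fcntl) with hypothetical names; as noted in the thread above, Windows would need a different primitive.

```python
# Hypothetical sketch: one lock held for the duration of a whole
# high-level operation, instead of several fine-grained locks.
import contextlib
import fcntl
import os

@contextlib.contextmanager
def refresh_lock(metadata_dir):
    # A single lock file in the metadata directory serializes whole
    # operations across processes.
    path = os.path.join(metadata_dir, ".refresh.lck")
    with open(path, "wb") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)
        yield  # critical section; lock released when f closes
```

A caller would wrap the whole method, e.g. `with refresh_lock(metadata_dir): updater.refresh()`, so that two concurrent refreshes never overlap.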
It looks like I can't rerun the tests in this PR: if you add any new commits here, the option should become available again.
Yes, the new file does trigger ruff quite a bit :) You can run the linter locally with

There are also test failures (because there is now an unexpected file in the metadata dir). I think fixing the expected results in the tests is the correct path forward there. You can run tests locally with

Let me know if you want me to handle any of this.
We have seen exceptions being raised from _update_root_symlink() at the level of the sigstore-python library when multiple concurrent threads were creating symlinks in this function with the same symlink name (in a test environment running tests concurrently). To avoid this issue, have each thread open a lock file and create an exclusive lock on it to serialize access to the removal and creation of the symlink.

The reproducer for this issue, which should be run in 2 or more python interpreters concurrently, looks like this:

    from sigstore import sign
    while True:
        sign.TrustedRoot.production()

Use fcntl.lockf-based locking for Linux and Mac and a different implementation on Windows. The source originally comes from a discussion on Stack Overflow (link below).

Resolves: theupdateframework#2836
Link: https://stackoverflow.com/questions/489861/locking-a-file-in-python
Signed-off-by: Stefan Berger <[email protected]>
force-pushed from 2258629 to 1823a9e
Description of the changes being introduced by the pull request:
Fixes #2836