Use a lock file to avoid exceptions due to concurrent symlink creation #2851


Open · stefanberger wants to merge 1 commit into develop

Conversation


@stefanberger commented Jul 15, 2025

We have seen exceptions raised from _update_root_symlink() at the level of the sigstore-python library when multiple concurrent threads were creating symlinks with the same name in this function (in a test environment running tests concurrently). To avoid this issue, have each thread open a lock file and take an exclusive lock on it to serialize the removal and re-creation of the symlink.

The reproducer for this issue, which should be run in two or more Python interpreters concurrently, looks like this:

from sigstore import sign
while True:
    sign.TrustedRoot.production()

Use fcntl.lockf-based locking on Linux and Mac and a different implementation on Windows. The approach originally comes from a discussion on Stack Overflow (link below).
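
For reference, a minimal sketch of what such a cross-platform lock_file() helper could look like, following the pattern from the linked Stack Overflow discussion (the PR's actual implementation may differ in its details):

    import sys

    if sys.platform == "win32":
        import msvcrt

        def lock_file(f):
            # Lock one byte at the start of the file. msvcrt.locking()
            # with LK_LOCK retries roughly once per second for about ten
            # attempts before raising OSError.
            f.seek(0)
            msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
    else:
        import fcntl

        def lock_file(f):
            # Exclusive advisory lock; blocks until granted and is
            # released automatically when the file is closed.
            fcntl.lockf(f, fcntl.LOCK_EX)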

Resolves: #2836
Link: https://stackoverflow.com/questions/489861/locking-a-file-in-python

Description of the changes being introduced by the pull request:

Fixes #2836

@stefanberger requested a review from a team as a code owner July 15, 2025 17:22
@jku (Member) commented Jul 16, 2025

The reproducer for this issue, which should be run in two or more Python interpreters concurrently, looks like this:

from sigstore import sign
while True:
    sign.TrustedRoot.production()

I'm unlikely to have time to really dive into this in the next couple of weeks but from first read:

  • I assume with your code one of these test programs will continue happily and the other one will freeze completely forever, until the first program is killed? In fact you don't even need loops, just running sign.TrustedRoot.production() once per process probably does it as long as the first process remains running. I feel like you can't lock files in a library without ever unlocking them
  • Like I said in a comment, I would like to consider a higher level lock, or at least analyze the risks... the symlink call is what you are seeing an issue with but it might not be the only place where multiple updaters could conflict. Options might include taking a lock for Updater lifetime (also risking locking "forever" accidentally) or taking a lock while the top-level methods are running...
  • leaving the lock file in the directory feels wrong (existence of specific lock files usually implies that the lock is active -- unlike how fcntl.LOCK_EX works), do you think there is a way to use e.g. the symlink file as the lock file?

@stefanberger (Author)

  • I assume with your code one of these test programs will continue happily and the other one will freeze completely forever,

Once the program holding the lock closes the file when leaving the 'with' statement that opened it, the other one can grab the lock. There's no explicit unlocking needed, if that's what you were looking for. I can add an explicit unlock if you want. Actually, when running the 3-liner after copying and pasting it into the Python interpreter prompt, it also prints out information in each loop:

<sigstore._internal.trust.TrustedRoot object at 0x7f23a7c3b5b0>
<sigstore._internal.trust.TrustedRoot object at 0x7f23a7cb2a90>

Otherwise it's easy to add a print statement to the loop when running from a .py file to see that they all run fine.
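
To illustrate the point (a sketch mirroring the lines from this PR's diff; the exact code may differ): the lock lives exactly as long as the file handle, so closing the file is what releases it.

    with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
        lock_file(f)  # blocks until the exclusive lock is granted
        os.remove(linkname)            # drop the existing symlink...
        os.symlink(current, linkname)  # ...and recreate it
    # leaving the 'with' block closes f, which releases the lock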

until the first program is killed? In fact you don't even need loops, just running sign.TrustedRoot.production() once per process probably does it as long as the first process remains running. I feel like you can't lock files in a library without ever unlocking them

You really need the concurrency for a while. Oddly, we ran into this issue quite quickly when running concurrent test cases on GitHub Actions.

Like I said in a comment, I would like to consider a higher level lock, or at least analyze the risks... the symlink call is what you are seeing an issue with but it might not be the only place where multiple updaters could conflict. Options might include taking a lock

I am not as familiar with the code as you are. I am primarily trying to address the issue we were seeing at the level of its root cause. There we can set a lock file with fine granularity that only serializes access to the specific resource we need to protect from concurrent access. However, it's quite possible that locking at a higher level could be needed if concurrent processes were to conflict on shared files.

leaving the lock file in the directory feels wrong (existence of specific lock files usually implies that the lock is active -- unlike how fcntl.LOCK_EX works), do you think there is a way to use e.g. the symlink file as the lock file?

What we need is a file that is there for everyone to grab a lock on and that doesn't get removed while concurrent processes/threads are trying to grab the lock. Part of the code we are trying to protect from concurrency deals with removing an existing symlink first (presumably because it is 'wrong') and then setting it again. The removal of the symlink would therefore be a problem if the symlink itself served as the lock file.

Also, grabbing a lock on the 'current' file causes that file to be created and remain at 0 bytes of content, while with the separate lock file I get the current file with its 5413 bytes of content. So it seemed better to create an explicit lock file.

I agree that it's not nice to have these lock files permanently lying around after they have been used, but I don't see another easy alternative. Other choices could be:

  • a directory somewhere under f"{Path.home()}/.local/share" that is dedicated to lock files
  • a directory under the /tmp path (Windows %TEMP%) that is dedicated to lock files
  • either of the above with a single (coarse-grained) lock file for all file-locking needs

@jku (Member) commented Jul 16, 2025

Once the program holding the lock closes the file when leaving the 'with' statement that opened it, the other one can grab the lock.

ah apologies, I missed the relation to the file handle -- this means I will have to take a closer look at the overall need for locking though, as this lock is very short-lived.

Thanks for thinking out loud the lock file issue as well.

@jku (Member) commented Aug 14, 2025

I'm returning to this after a longer pause (sorry about that): summarizing the situation for myself and others:

  • There is user demand for running multiple instances of python-tuf ngclient with the same metadatadir, typically as part of completely different processes
  • The issues with running multiple instances are:
    1. the use of a symlink for "root.json" (added as part of Cache all root metadata versions #2767) creates a situation that easily fails on Windows
    2. more generally, modifying files in the metadata dir from two processes sounds like a bad idea
  • This PR provides a simple solution to issue 1, fixing the immediately user-visible problem, but does not attempt to solve issue 2. The solution seems pretty solid (as in, I was incorrect in my earlier comments), although I'm not familiar with the Windows side
  • Maybe issue 2 is not as significant as I feared: since we are not trying to protect against malicious modifications, in most cases we can expect both ngclient processes to attempt to update the files to the same end state. As long as the file writes are atomic, we should be ok (see the sketch after this list)
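
For illustration, a minimal sketch of such an atomic write (write_atomic is a hypothetical name, not ngclient code):

    import os
    import tempfile

    def write_atomic(path: str, data: bytes) -> None:
        # Write to a temporary file in the same directory, then rename it
        # into place. os.replace() is atomic on POSIX; on Windows it can
        # raise PermissionError if another process holds the target open.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise

That Windows caveat is worth keeping in mind: it is exactly the PermissionError failure mode that comes up later in this thread.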

So the remaining question seems to be: is it useful to merge this PR (fixing a real problem, but potentially later exposing the same users to issue 2), or should I try to implement the locking more extensively? I will have a look at the code today/tomorrow and see if a more comprehensive change makes sense... if not, I think this PR is still useful (even if I might still not recommend running 2 ngclients at the same time without further work)

@jku (Member) left a comment:

I'm leaning towards accepting this PR and not trying something fancier:

  • This PR is quite contained and less likely to fail in unexpected ways: if I add higher-level locking, there is a chance that we end up debugging locks that are not released when we expect -- and debugging nondeterministically broken locking is the worst
  • if this does not completely fix the issue, we're not really worse off than we are now -- the added complexity seems acceptable

I left some review comments (only the file name issue seems really important). Let me know what you think.

Also I'm pretty sure there are lint issues in the code (we have pretty strict linter settings) but for some reason GitHub does not show me a button to approve the test run...

@@ -66,6 +66,7 @@
from tuf.api.metadata import Root, Snapshot, TargetFile, Targets, Timestamp
from tuf.ngclient._internal.trusted_metadata_set import TrustedMetadataSet
from tuf.ngclient.config import EnvelopeType, UpdaterConfig
from tuf.ngclient.file_lock import lock_file
@jku (Member):

Suggested change
from tuf.ngclient.file_lock import lock_file
from tuf.ngclient._internal.file_lock import lock_file

maybe hide this in _internal so we make it absolutely clear these hacks are not API.

@stefanberger (Author):

I will move it.

os.remove(linkname)
os.symlink(current, linkname)

with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
@jku (Member):

I think using the version number here is potentially harmful (the lock file is now <metadata_dir>/root_history/<version>.root.json.lck) -- but we are actually trying to protect access to the unversioned symlink.

Suggested change
with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
with open(os.path.join(self._dir, ".root.json.lck"), "wb") as f:

(I also added the leading dot but that's just a suggestion)

@stefanberger (Author):

Sounds good. Let me change this.

os.symlink(current, linkname)

with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
lock_file(f)
@jku (Member):

A comment explaining why lock_file() is called might make sense here (maybe mention the Windows issues specifically for anyone who tries to repro in the future).

@stefanberger (Author):

I have run into some issues on Windows. When I dump filenames, it shows me something like this:

C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\root_history\12.root.json

I cannot access this file. When trying to walk the directory structure, it ends for me at C:\...\Local, so there's no sigstore directory that I could cd into. Then I tried to switch to locking via a single file at %TEMP%\lck -- it works 'better' but strangely not 100% reliably like it does on Linux.

@stefanberger (Author):

Because of Windows I would need to switch to the filelock library, which is already available when installing, for example, the dev or build requirements. As it looks, I may still need to either switch entirely to different file paths for locking or at least use %TEMP% file paths on Windows.
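
A minimal sketch of what the filelock-based variant could look like (the lock path and surrounding names mirror the diff above and are illustrative):

    from filelock import FileLock

    lock = FileLock(os.path.join(self._dir, ".root.json.lck"))
    with lock:  # acquired on entry, released on exit
        os.remove(linkname)
        os.symlink(current, linkname)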

@stefanberger (Author):

Without any locking, it fails quickly on Windows. With locking (from the filelock package, for example) it works better, but not 100%: after some time (several minutes) it fails again. Very odd.

@jku (Member):

hmm, let me check if the directory is guaranteed to exist at this point...

@jku (Member):

  • using TEMP is not a solution: the reason for running this code is to create the files in the metadata directory so that has to work, otherwise the code is useless
  • I'm a bit confused, can you reproduce the "missing directory" issue with code from main (as in doing whatever you did to get the failure but without your changes)? We do run tests on windows so I would expect this code to work there ... but Windows has never been an important platform so it's not impossible that there are platform bugs

Feel free to include the full error in your copy-paste (including the exception name) if you need to include another one.

@stefanberger (Author):

I thought maybe filelock has a subtle bug (it's deprecated), so I moved on to the fasteners package, but it does not solve the above PermissionError problem either. I tried to protect 12.root.json from concurrent access with 3 different locks now. I am not sure where else it could be accessed so that this PermissionError could occur. The loop below solves it, but it should not be necessary.

            # Serialize the os.replace() with an inter-process lock from the
            # fasteners package; create_lockfile() appears to be a local
            # helper that ensures the lock file exists.
            lck = os.path.join(self._dir, ".root.json.lck")
            ipl = InterProcessLock(create_lockfile(lck))
            ipl.acquire()
            try:
                # Retry until the destination is no longer held open by
                # another process -- the workaround that should not be needed.
                while True:
                    try:
                        os.replace(temp_file.name, filename)
                        break
                    except PermissionError as pe:
                        print(f"{pe}")
            finally:
                ipl.release()

@jku (Member):

another question: does the "Access is denied" only happen when you run your parallel execution test case?

Because if it does only happen in that case, then the "Access is denied" actually means "the file is used by another process", meaning we actually need the more extensive locking process: handling just the symlink is not enough
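
A quick, hypothetical standalone demo of that Windows behavior (not ngclient code): os.replace() onto a file that any process holds open fails on Windows, while the same call succeeds on POSIX.

    import os

    with open("new.json", "w") as f:
        f.write("{}")
    with open("target.json", "w") as held:  # keep the target open
        try:
            os.replace("new.json", "target.json")
            print("replaced (POSIX behavior)")
        except PermissionError as e:
            print(f"Windows: target in use: {e}")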

@stefanberger (Author):

another question: does the "Access is denied" only happen when you run your parallel execution test case?

Yes. And this is the only reason I see it failing now when I let the exception propagate.

Because if it does only happen in that case, then the "Access is denied" actually means "the file is used by another process", meaning we actually need the more extensive locking process: handling just the symlink is not enough

That's what I guessed as well, and that's the reason I now have 3 locks. Though maybe there is still another path where this file is being used concurrently. The test is still the simple two-line loop... if you have an idea...

@jku (Member):

Thanks for testing, I will have a look at the code: I think multiple separate locks during refresh() is unappealing, but maybe setting up a single lock for the whole method is ok... Still need to think of something reasonable for download_target(), as I'm sure that will conflict in the same way -- and in that case locking for the whole method might not be appropriate.

But (unless you solve everything in the meantime) I'll have a look with fresh eyes tomorrow and hopefully come up with something.

@jku (Member) commented Aug 15, 2025

It looks like I can't rerun the tests in this PR: if you add any new commits here, the option should become available again.

Also I'm pretty sure there are lint issues in the code (we have pretty strict linter settings)

Yes, the new file does trigger ruff quite a bit :) You can run the linter locally with tox -e lint (after installing tox).

There are also test failures (because there is now an unexpected file in the metadata dir). I think fixing the expected results in the tests is the correct path forward there. You can run tests locally with tox -e py.

Let me know if you want me to handle any of this.

The commit message repeats the description above, with these trailers:

Resolves: theupdateframework#2836
Link: https://stackoverflow.com/questions/489861/locking-a-file-in-python
Signed-off-by: Stefan Berger <[email protected]>
Successfully merging this pull request may close these issues:

  • ngclient: Be better with concurrent instances