Use a lock file to avoid exceptions due to concurrent symlink creation #2851


Open · stefanberger wants to merge 1 commit into develop

Conversation


@stefanberger commented Jul 15, 2025

We have seen exceptions raised from _update_root_symlink() at the level of the sigstore-python library when multiple concurrent threads were creating symlinks with the same name in this function (in a test environment running tests concurrently). To avoid this issue, have each thread open a lock file and take an exclusive lock on it to serialize the removal and re-creation of the symlink.

The reproducer for this issue, which should be run in two or more Python interpreters concurrently, looks like this:

from sigstore import sign
while True:
    sign.TrustedRoot.production()

Use fcntl.lockf-based locking on Linux and Mac and a different implementation on Windows. The approach originally comes from a discussion on Stack Overflow (link below).
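
For reference, a minimal sketch of what such a cross-platform lock_file() helper could look like, following the pattern from the linked Stack Overflow discussion (the PR's actual implementation may differ in its details):

    import sys

    if sys.platform == "win32":
        import msvcrt

        def lock_file(f):
            # Lock one byte at the start of the file. msvcrt.locking()
            # with LK_LOCK retries roughly once per second for about ten
            # attempts before raising OSError.
            f.seek(0)
            msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)
    else:
        import fcntl

        def lock_file(f):
            # Exclusive advisory lock; blocks until granted and is
            # released automatically when the file is closed.
            fcntl.lockf(f, fcntl.LOCK_EX)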

Resolves: #2836
Link: https://stackoverflow.com/questions/489861/locking-a-file-in-python

Description of the changes being introduced by the pull request:

Fixes #2836

@stefanberger requested a review from a team as a code owner July 15, 2025 17:22
@jku (Member) commented Jul 16, 2025

The reproducer for this issue, which should be run in two or more Python interpreters concurrently, looks like this:

from sigstore import sign
while True:
    sign.TrustedRoot.production()

I'm unlikely to have time to really dive into this in the next couple of weeks but from first read:

  • I assume with your code one of these test programs will continue happily and the other one will freeze completely forever, until the first program is killed? In fact you don't even need loops, just running sign.TrustedRoot.production() once per process probably does it as long as the first process remains running. I feel like you can't lock files in a library without ever unlocking them
  • Like I said in a comment, I would like to consider a higher level lock, or at least analyze the risks... the symlink call is what you are seeing an issue with but it might not be the only place where multiple updaters could conflict. Options might include taking a lock for Updater lifetime (also risking locking "forever" accidentally) or taking a lock while the top-level methods are running...
  • leaving the lock file in the directory feels wrong (existence of specific lock files usually implies that the lock is active -- unlike how fcntl.LOCK_EX works), do you think there is a way to use e.g. the symlink file as the lock file?

@stefanberger (Author)

  • I assume with your code one of these test programs will continue happily and the other one will freeze completely forever,

Once the program holding the lock closes the file when leaving the 'with' statement that opened it, the other one can grab the lock. There's no explicit unlocking needed, if that's what you were looking for. I can add an explicit unlock if you want. Actually, when running the 3-liner after copying and pasting it into the Python interpreter prompt, it also prints out information in each loop:

<sigstore._internal.trust.TrustedRoot object at 0x7f23a7c3b5b0>
<sigstore._internal.trust.TrustedRoot object at 0x7f23a7cb2a90>

Otherwise it's easy to add a print statement to the loop when running from a .py file to see that they all run fine.
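
To illustrate the point (a sketch mirroring the lines from this PR's diff; the exact code may differ): the lock lives exactly as long as the file handle, so closing the file is what releases it.

    with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
        lock_file(f)  # blocks until the exclusive lock is granted
        os.remove(linkname)            # drop the existing symlink...
        os.symlink(current, linkname)  # ...and recreate it
    # leaving the 'with' block closes f, which releases the lock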

until the first program is killed? In fact you don't even need loops, just running sign.TrustedRoot.production() once per process probably does it as long as the first process remains running. I feel like you can't lock files in a library without ever unlocking them

You really need the concurrency for a while. Oddly, we ran into this issue quite quickly when running concurrent test cases on GitHub Actions.

Like I said in a comment, I would like to consider a higher level lock, or at least analyze the risks... the symlink call is what you are seeing an issue with but it might not be the only place where multiple updaters could conflict. Options might include taking a lock

I am not as familiar with the code as you are. I am primarily trying to address the issue we were seeing at the level of its root cause. There we can set a lock file with fine granularity that only serializes access to the specific resource we need to protect from concurrent access. However, it's quite possible that locking at a higher level could be needed if concurrent processes were to conflict on shared files.

leaving the lock file in the directory feels wrong (existence of specific lock files usually implies that the lock is active -- unlike how fcntl.LOCK_EX works), do you think there is a way to use e.g. the symlink file as the lock file?

What we need is a file that is there for everyone to grab a lock on and that doesn't get removed while concurrent processes/threads are trying to grab the lock. Part of the code we are trying to protect from concurrency deals with removing an existing symlink first (presumably because it is 'wrong') and then setting it again. The removal of the symlink would therefore be a problem if the symlink itself served as the lock file.

Also, grabbing a lock on the 'current' file causes that file to be created and remain at 0 bytes of content, while with the separate lock file I get the current file with its 5413 bytes of content. So it seemed better to create an explicit lock file.

I agree that it's not nice to have these lock files permanently lying around after they have been used, but I don't see another easy alternative. Other choices could be:

  • a directory somewhere under f"{Path.home()}/.local/share" that is dedicated to lock files
  • a directory under the /tmp path (Windows %TEMP%) that is dedicated to lock files
  • either of the above with a single (coarse-grained) lock file for all file-locking needs

@jku (Member) commented Jul 16, 2025

Once the program holding the lock closes the file when leaving the 'with' statement that opened it, the other one can grab the lock.

ah apologies, I missed the relation to the file handle -- this means I will have to take a closer look at the overall need for locking though, as this lock is very short-lived.

Thanks for thinking out loud the lock file issue as well.

@jku (Member) commented Aug 14, 2025

I'm returning to this after a longer pause (sorry about that): summarizing the situation for myself and others:

  • There is user demand for running multiple instances of python-tuf ngclient with the same metadatadir, typically as part of completely different processes
  • The issues with running multiple instances are:
    1. the use of a symlink for "root.json" (added as part of Cache all root metadata versions #2767) creates a situation that easily fails on Windows
    2. more generally, modifying files in the metadata dir from two processes sounds like a bad idea
  • This PR provides a simple solution to issue 1, fixing the immediately user-visible problem, but does not attempt to solve issue 2. The solution seems pretty solid (as in, I was incorrect in my earlier comments), although I'm not familiar with the Windows side
  • Maybe issue 2 is not as significant as I feared: since we are not trying to protect against malicious modifications, in most cases we can expect both ngclient processes to attempt to update the files to the same end state. As long as the file writes are atomic, we should be ok (see the sketch after this list)
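
For illustration, a minimal sketch of such an atomic write (write_atomic is a hypothetical name, not ngclient code):

    import os
    import tempfile

    def write_atomic(path: str, data: bytes) -> None:
        # Write to a temporary file in the same directory, then rename it
        # into place. os.replace() is atomic on POSIX; on Windows it can
        # raise PermissionError if another process holds the target open.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise

That Windows caveat is worth keeping in mind: it is exactly the PermissionError failure mode that comes up later in this thread.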

So the remaining question seems to be: is it useful to merge this PR (fixing a real problem, but potentially later exposing the same users to issue 2), or should I try to implement the locking more extensively? I will have a look at the code today/tomorrow and see if a more comprehensive change makes sense... if not, I think this PR is still useful (even if I might still not recommend running 2 ngclients at the same time without further work)

@jku (Member) left a comment:

I'm leaning towards accepting this PR and not trying something fancier:

  • This PR is quite contained and less likely to fail in unexpected ways: if I add higher-level locking, there is a chance that we end up debugging locks that are not released when we expect -- and debugging nondeterministically broken locking is the worst
  • if this does not completely fix the issue, we're not really worse off than we are now -- the added complexity seems acceptable

I left some review comments (only the file name issue seems really important). Let me know what you think.

Also I'm pretty sure there are lint issues in the code (we have pretty strict linter settings) but for some reason GitHub does not show me a button to approve the test run...

@@ -66,6 +66,7 @@
from tuf.api.metadata import Root, Snapshot, TargetFile, Targets, Timestamp
from tuf.ngclient._internal.trusted_metadata_set import TrustedMetadataSet
from tuf.ngclient.config import EnvelopeType, UpdaterConfig
from tuf.ngclient.file_lock import lock_file
@jku (Member):

Suggested change
from tuf.ngclient.file_lock import lock_file
from tuf.ngclient._internal.file_lock import lock_file

maybe hide this in _internal so we make it absolutely clear these hacks are not API.

@stefanberger (Author):

I will move it.

os.remove(linkname)
os.symlink(current, linkname)

with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
@jku (Member):

I think using the version number here is potentially harmful (the lock file is now <metadata_dir>/root_history/<version>.root.json.lck) -- but we are actually trying to protect access to the unversioned symlink.

Suggested change
with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
with open(os.path.join(self._dir, ".root.json.lck"), "wb") as f:

(I also added the leading dot but that's just a suggestion)

@stefanberger (Author):

Sounds good. Let me change this.

os.symlink(current, linkname)

with open(os.path.join(self._dir, current + ".lck"), "wb") as f:
lock_file(f)
@jku (Member):

A comment explaining why lock_file() is called might make sense here (maybe mention the Windows issues specifically for anyone who tries to repro in the future).

@stefanberger (Author):

I have run into some issues on Windows. When I dump filenames, it shows me something like this:

C:\Users\StefanBerger\AppData\Local\sigstore\sigstore-python\tuf\https%3A%2F%2Ftuf-repo-cdn.sigstore.dev\root_history\12.root.json

I cannot access this file. When trying to walk the directory structure, it ends for me at C:\...\Local, so there's no sigstore directory that I could cd into. Then I tried to switch to locking via a single file at %TEMP%\lck -- it works 'better' but strangely not 100% reliably like it does on Linux.

@stefanberger (Author):

Because of Windows I would need to switch to the filelock library, which is already available when installing, for example, the dev or build requirements. As it looks, I may still need to either switch entirely to different file paths for locking or at least use %TEMP% file paths on Windows.
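
A minimal sketch of what the filelock-based variant could look like (the lock path and surrounding names mirror the diff above and are illustrative):

    from filelock import FileLock

    lock = FileLock(os.path.join(self._dir, ".root.json.lck"))
    with lock:  # acquired on entry, released on exit
        os.remove(linkname)
        os.symlink(current, linkname)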

@stefanberger (Author):

Without any locking, it fails quickly on Windows. With locking (from the filelock package, for example) it works better, but not 100%: after some time (several minutes) it fails again. Very odd.

@jku (Member):

hmm, let me check if the directory is guaranteed to exist at this point...

@jku (Member):

  • using TEMP is not a solution: the reason for running this code is to create the files in the metadata directory so that has to work, otherwise the code is useless
  • I'm a bit confused, can you reproduce the "missing directory" issue with code from main (as in doing whatever you did to get the failure but without your changes)? We do run tests on windows so I would expect this code to work there ... but Windows has never been an important platform so it's not impossible that there are platform bugs

Feel free to include the full error in your copy-paste (including the exception name) if you need to include another one.

@stefanberger (Author):

I thought maybe filelock has a subtle bug (it's deprecated), so I moved on to the fasteners package, but it does not solve the above PermissionError problem either. I tried to protect 12.root.json from concurrent access with 3 different locks now. I am not sure where else it could be accessed so that this PermissionError could occur. The loop below solves it, but it should not be necessary.

            # Serialize the os.replace() with an inter-process lock from the
            # fasteners package; create_lockfile() appears to be a local
            # helper that ensures the lock file exists.
            lck = os.path.join(self._dir, ".root.json.lck")
            ipl = InterProcessLock(create_lockfile(lck))
            ipl.acquire()
            try:
                # Retry until the destination is no longer held open by
                # another process -- the workaround that should not be needed.
                while True:
                    try:
                        os.replace(temp_file.name, filename)
                        break
                    except PermissionError as pe:
                        print(f"{pe}")
            finally:
                ipl.release()

@jku (Member):

another question: does the "Access is denied" only happen when you run your parallel execution test case?

Because if it does only happen in that case, then the "Access is denied" actually means "the file is used by another process", meaning we actually need the more extensive locking process: handling just the symlink is not enough
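
A quick, hypothetical standalone demo of that Windows behavior (not ngclient code): os.replace() onto a file that any process holds open fails on Windows, while the same call succeeds on POSIX.

    import os

    with open("new.json", "w") as f:
        f.write("{}")
    with open("target.json", "w") as held:  # keep the target open
        try:
            os.replace("new.json", "target.json")
            print("replaced (POSIX behavior)")
        except PermissionError as e:
            print(f"Windows: target in use: {e}")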

@stefanberger (Author):

another question: does the "Access is denied" only happen when you run your parallel execution test case?

Yes. And this is the only reason I see it failing now when I let the exception propagate.

Because if it does only happen in that case, then the "Access is denied" actually means "the file is used by another process", meaning we actually need the more extensive locking process: handling just the symlink is not enough

That's what I guessed as well, and that's the reason I now have 3 locks. Though maybe there is still another path where this file is being used concurrently. The test is still the simple two-line loop... if you have an idea...

@jku (Member):

Thanks for testing, I will have a look at the code: I think multiple separate locks during refresh() is unappealing, but maybe setting up a single lock for the whole method is ok... Still need to think of something reasonable for download_target(), as I'm sure that will conflict in the same way -- and in that case locking for the whole method might not be appropriate.

But (unless you solve everything in the meantime) I'll have a look with fresh eyes tomorrow and hopefully come up with something.

@jku (Member) commented Aug 15, 2025

It looks like I can't rerun the tests in this PR: if you add any new commits here, the option should become available again.

Also I'm pretty sure there are lint issues in the code (we have pretty strict linter settings)

Yes, the new file does trigger ruff quite a bit :) You can run the linter locally with tox -e lint (after installing tox).

There are also test failures (because there is now an unexpected file in the metadata dir). I think fixing the expected results in the tests is the correct path forward there. You can run tests locally with tox -e py.

Let me know if you want me to handle any of this.

The commit message repeats the description above, with these trailers:

Resolves: theupdateframework#2836
Link: https://stackoverflow.com/questions/489861/locking-a-file-in-python
Signed-off-by: Stefan Berger <[email protected]>
Successfully merging this pull request may close these issues:

  • ngclient: Be better with concurrent instances