Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible concurrency issues with resolve on a cluster #19

Open
amilsted opened this issue Oct 30, 2023 · 11 comments · May be fixed by #45
Open

Possible concurrency issues with resolve on a cluster #19

amilsted opened this issue Oct 30, 2023 · 11 comments · May be fixed by #45
Labels
bug Something isn't working

Comments

@amilsted
Copy link
Contributor

I think I am seeing multiple processes trying to simultaneously resolve julia versions and dependencies, mirroring JuliaPy/CondaPkg.jl#40. Maybe locking is needed here too?

@cjdoris cjdoris added the bug Something isn't working label Oct 31, 2023
@cjdoris
Copy link
Collaborator

cjdoris commented Oct 31, 2023

Yeah makes sense.

@amilsted
Copy link
Contributor Author

Oh, by the way, I wonder if "pidfile" locking makes sense on a cluster? I think it probably doesn't, since there is no global list of processes.

@cjdoris
Copy link
Collaborator

cjdoris commented Nov 2, 2023

I just took a look at the source:

https://github.com/vtjnash/Pidfile.jl/blob/67c3f23b25d4e43f1f5785ce77c2cac6952fefcf/src/Pidfile.jl#L128

Pidfile.jl uses both the PID and hostname, which presumably means it works with shared filesystems too.

@amilsted
Copy link
Contributor Author

amilsted commented Nov 2, 2023

Ah, that's smart! Much better than https://github.com/mosquito/python-pidfile/blob/master/pidfile/pidfile.py

https://github.com/keiranmraine/pidlock sees to use hostname too.

@cjdoris
Copy link
Collaborator

cjdoris commented Nov 2, 2023

Oh yeah of course this is a python package so need a python implementation of pidfiles.

@cjdoris
Copy link
Collaborator

cjdoris commented Nov 2, 2023

pidlock looks good thanks

@amilsted
Copy link
Contributor Author

amilsted commented Nov 2, 2023

Btw, is it really necessary to write out the meta file every time? Perhaps we could check for changes first?

@amilsted
Copy link
Contributor Author

The situation is much better now that we fall back on hashing juliapkg.json files when there is an mtime mismatch and also don't write out the meta file if there have been no changes. Although there are still situations in which concurrency issues would occur (if multiple processes are simultaneously trying to do the initial setup of an env), perhaps it is now enough to recommend users always initialize on one process before e.g. spawning many cluster jobs.

@cjdoris
Copy link
Collaborator

cjdoris commented Nov 21, 2024

Good to know thanks, I don't do any distributed programming myself. I'll leave this issue open as a reminder to do pidlocking though, it's still a good idea.

Recommending resolving first makes sense. Could put something into the readme and also the message that appears when waiting for the lock.

@willow-ahrens
Copy link

willow-ahrens commented Dec 16, 2024

I was getting a similar error in CI when multiple pytest runners all tried to import juliapkg. We ended up with FileExistsError because multiple runners had deleted and then tried to copytree a julia installation into the conda environment. We fixed this using your recommended approach (call import juliapkg on one process first).

However, a pidfile fix would be more robust, and not require downstream packages to be careful about including packages which depend on juliapkg. I am willing to contribute a PR, and want to double check that there is the will to review. The approach would be:

  1. use PidLock
  2. lock the installer (everything in install_julia)
  3. lock the resolver (everything in resolve)
  4. keep the lock file in the environment directory (the same place that we would put install)
  5. if the lock cannot be aquired after 1 second, prints a debug message that we are waiting on a julia install, and suggests that we might consider installing on one thread first.

@cjdoris
Copy link
Collaborator

cjdoris commented Jan 1, 2025

Yes a PR for this would be great thanks. Your plan sounds good.

@willow-ahrens willow-ahrens linked a pull request Jan 6, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants