Checkpoint fails for containers with deleted-but-open files in --overlay2=none mode #11425

Open
andrew-anthropic opened this issue Feb 3, 2025 · 8 comments
Labels
type: bug Something isn't working

Comments

@andrew-anthropic

Description

We're seeing failures when checkpointing some workloads where applications in the container keep a file descriptor open to a deleted file. This seems to be a common pattern in many third-party applications, which makes it a significant operational issue for us. Would it be possible to support this, or at least provide a flag that turns this into a warning rather than aborting the checkpoint entirely?

Error Message:

encoding error: gofer.dentry(...).beforeSave: deleted and invalidated dentries can't be restored

Example Paths:
We've seen this with various applications, including:

  • Browser temp files: /tmp/.org.chromium.Chromium.uG5Ddr
  • Graphics applications: root/.xpaint/tmp/XPaint-oEMdmH
  • NPM logs: root/.npm/_logs/2025-01-30T17_48_32_236Z-debug-0.log
  • Application startup files: root/startup
  • Language runtimes: /tmp/language-exchange

Here is a copy of the stacktrace

Steps to reproduce

No response

runsc version

version release-20240916.0

docker version (if using docker)

uname

No response

kubectl (if using Kubernetes)

repo state (if built from source)

No response

runsc debug logs (if available)

@andrew-anthropic andrew-anthropic added the type: bug Something isn't working label Feb 3, 2025
@ayushr2
Collaborator

ayushr2 commented Feb 3, 2025

Yeah this is a known issue. cc @fvoznika

@l1n

l1n commented Feb 3, 2025

Yeah, I think this is a documented gap - curious though whether it's easy to fix or swallow in some way?

@ayushr2
Collaborator

ayushr2 commented Feb 4, 2025

The error you are seeing is coming from

func (d *dentry) beforeSave() {
	if d.vfsd.IsDead() {
		panic(fmt.Sprintf("gofer.dentry(%q).beforeSave: deleted and invalidated dentries can't be restored", genericDebugPathname(d.fs, d)))
	}
}

From the 5 applications you have listed above, it seems that all such crashes are coming from the rootfs. (Note that /tmp is also considered part of rootfs if the container image has a non-empty /tmp.)

Could you confirm whether you are using rootfs overlay? runsc enables it by default; you have to explicitly turn it off with --overlay2=none. If you are using rootfs overlay, then this might be easier to support, since with rootfs overlay the gofer lower layer filesystem doesn't change. I can look into this.

Note that this will be harder to support with non-rootfs gofer mounts (like bind mounts).

@xiangbin-hu

We have --overlay2=none set as it is helpful for us to expose and modify the file diffs in the container's filesystem.

@ayushr2
Collaborator

ayushr2 commented Feb 4, 2025

I verified that this issue does not occur with rootfs overlay.

If rootfs overlay is disabled, then changes to the rootfs are propagated to the host. So are you migrating the rootfs after checkpoint to the restore site? Supporting deleted file restore in gofer filesystem might be tricky. On restore, we may need to create the file on the host, fill it with the file contents, open an FD to it and then delete it again. cc @nixprime any better ideas?

> helpful for us to expose and modify the file diffs in the container's filesystem.

When using rootfs overlay, the filesystem diff is stored in gVisor tmpfs. How do you want to modify this diff? Maybe it is possible to restore the container, modify the filesystem and then checkpoint again? With rootfs overlay, you don't have to worry about filesystem migration and it also prevents issues like this one (deleted file FD from host).

@ayushr2 ayushr2 changed the title Checkpoint fails for containers with deleted-but-open files Checkpoint fails for containers with deleted-but-open files in --overlay2=none mode Feb 4, 2025
@xiangbin-hu

Here are some use cases enabled by --overlay2=none:

  1. Restoring a filesystem-only "checkpoint" onto a host with a different CPU.
  2. We load some ad hoc data into the container, let the container run some processes, and then export the overlay fs data from the host for post-processing after the container has terminated.

@ayushr2
Collaborator

ayushr2 commented Feb 4, 2025

> we load some adhoc data to container, let the container run some processes, and then export the overlay fs data from the host for post processing after the container has been terminated.

Why do you need to checkpoint the container in this case? Consider runsc pause which will pause the container. It can then be resumed with runsc resume. You can pause the container and export the host overlayfs for post processing.
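
As a rough sketch of that pause, export, resume flow, the Go program below shells out to `runsc pause` and `runsc resume` around an export step. The helper names `runscCmd` and `pauseExportResume` are hypothetical, and the export callback is a placeholder for whatever copies the host-side filesystem. It assumes `runsc` is on `PATH`; global flags such as a custom state root are omitted and would need to be added for a real deployment.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runscCmd builds "<bin> <verb> <containerID>", e.g. "runsc pause abc".
func runscCmd(bin, verb, containerID string) *exec.Cmd {
	return exec.Command(bin, verb, containerID)
}

// pauseExportResume pauses the container, runs export while it is paused,
// and then resumes it. The container is resumed even if export fails.
func pauseExportResume(bin, containerID string, export func() error) error {
	if err := runscCmd(bin, "pause", containerID).Run(); err != nil {
		return fmt.Errorf("pause: %w", err)
	}
	exportErr := export()
	if err := runscCmd(bin, "resume", containerID).Run(); err != nil {
		return fmt.Errorf("resume: %w", err)
	}
	return exportErr
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: pause-export <container-id>")
		os.Exit(2)
	}
	err := pauseExportResume("runsc", os.Args[1], func() error {
		// Placeholder: copy the container's host-side filesystem here.
		fmt.Println("exporting filesystem while the container is paused")
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```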

> restore a filesystem only "checkpoint" onto a host with a different CPU

Sorry, could you define what a "filesystem only checkpoint" is? If you do not need the gVisor checkpoint image, then there is no need to checkpoint the container. You can just pause it, take the "filesystem checkpoint" and use it.

@xiangbin-hu

Yeah, what you said makes sense. But at the same time, we still have a use case where we need to support the normal gVisor checkpoint image.
