feat(helm): move dind to sidecar #3842


Merged
merged 13 commits into from
Apr 23, 2025

Conversation

velkovb
Contributor

@velkovb velkovb commented Dec 11, 2024

Run dind container as a Sidecar

Currently, in dind mode we run two containers in the pod, each with its own lifecycle. For dind this can lead to problems, because the process in the runner container depends on the Docker socket provided by the dind container. We hit this multiple times: we wanted to wait for jobs to complete, and the runner container stayed alive because we handle SIGTERM manually there, but the dind container died and the job failed. When dind runs as a sidecar container it stays alive as long as the pod is alive, and no custom SIGTERM handling is needed.

Since 1.29, Kubernetes enables sidecar containers by default, so this PR moves the dind container to a sidecar when the cluster version is above 1.29.
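In pod-spec terms, the change amounts to the following pattern (a sketch, not the chart's exact template; names and images are illustrative):

```yaml
# Native-sidecar pattern (Kubernetes 1.29+): an init container with
# restartPolicy: Always runs for the whole pod lifetime and is stopped
# only after the main containers exit.
spec:
  initContainers:
    - name: dind
      image: docker:dind
      securityContext:
        privileged: true
      restartPolicy: Always     # this field is what makes it a sidecar
      volumeMounts:
        - name: dind-sock
          mountPath: /var/run
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      env:
        - name: DOCKER_HOST
          value: unix:///var/run/docker.sock
      volumeMounts:
        - name: dind-sock
          mountPath: /var/run
  volumes:
    - name: dind-sock
      emptyDir: {}
```

Because the sidecar outlives the runner container, the Docker socket stays available for the full duration of a job.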

@velkovb velkovb marked this pull request as ready for review December 12, 2024 10:03
@velkovb velkovb requested review from mumoshu, toast-gear, rentziass and a team as code owners December 12, 2024 10:03
@chadxz

chadxz commented Jan 5, 2025

I think this may fix an issue I'm sometimes seeing with our runners. In some jobs, we were seeing this error message in the "Initialize Containers" workflow step:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

@velkovb
Contributor Author

velkovb commented Jan 6, 2025

I think this may fix an issue I'm sometimes seeing with our runners. In some jobs, we were seeing this error message in the "Initialize Containers" workflow step:

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

@chadxz This is exactly the issue we were seeing, and this change fixed it for us.

@chadxz

chadxz commented Jan 7, 2025

I tried using this but realized it conflicts with my manual dind setup in values.yaml. However, I was able to move my dind container from "containers" to "initContainers" and set restartPolicy: Always to replicate this behavior.

worked!

@velkovb
Contributor Author

velkovb commented Jan 7, 2025

I tried using this but realized it conflicts with my manual dind setup in values.yaml. However, I was able to move my dind container from "containers" to "initContainers" and set restartPolicy: Always to replicate this behavior.

worked!

That's probably the case for most people. It is in our custom dind setup as well, but I believe it should actually be the default option.

@velkovb velkovb requested a review from nikola-jokic as a code owner January 17, 2025 16:51
@saienduri

saienduri commented Feb 3, 2025

I tried using this but realized it conflicts with my manual dind setup in values.yaml. However, I was able to move my dind container from "containers" to "initContainers" and set restartPolicy: Always to replicate this behavior.

worked!

Hi @chadxz, I'm seeing the same problem and am really curious how you were able to do this. From my understanding, init containers have to run to completion before the main containers are started. How do you make sure the daemon is still running by the time we get to the main runner container?

Update: I see that setting restartPolicy: Always turns it into a sidecar container, which means it stays running.

@chadxz

chadxz commented Feb 3, 2025

@saienduri as you found, the restart policy is the trick

@danielmanor

@chadxz @velkovb I'm hitting this issue intermittently in the "Initialize Containers" workflow step. I've done what you're suggesting, but it didn't solve it. I'm using Kubernetes 1.30.

In case I'm missing something, that's the template.spec I've been using:

template:
  spec:
    initContainers:
      - name: init-dind-externals
        image: ghcr.io/actions/actions-runner:latest
        command:
          ["cp", "-r", "/home/runner/externals/.", "/home/runner/tmpDir/"]
        volumeMounts:
          - name: dind-externals
            mountPath: /home/runner/tmpDir
      - name: dind
        image: docker:dind
        args:
          - dockerd
          - --host=unix:///var/run/docker.sock
          - --group=$(DOCKER_GROUP_GID)
        env:
          - name: DOCKER_GROUP_GID
            value: "123"
        securityContext:
          privileged: true
        restartPolicy: Always
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
          - name: dind-externals
            mountPath: /home/runner/externals
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: DOCKER_HOST
            value: unix:///var/run/docker.sock
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: dind-sock
            mountPath: /var/run
    volumes:
      - name: work
        emptyDir: {}
      - name: dind-sock
        emptyDir: {}
      - name: dind-externals
        emptyDir: {}

@velkovb
Contributor Author

velkovb commented Apr 16, 2025

Is there any chance for this to be reviewed? @nikola-jokic

@FabioDavidF

Bump

@nikola-jokic
Collaborator

Hey @velkovb,

Can you please fix the test as well?

@velkovb
Contributor Author

velkovb commented Apr 23, 2025

@nikola-jokic Thanks for bringing it up. I didn't realise there were tests. I ran them locally and they pass now.

Collaborator

@nikola-jokic nikola-jokic left a comment


LGTM, thank you for your contribution!

@nikola-jokic nikola-jokic merged commit 32f19ac into actions:master Apr 23, 2025
17 checks passed
@velkovb velkovb deleted the feat/dind-sidecar branch April 23, 2025 12:52
@elocke

elocke commented Apr 27, 2025

We've been struggling with this "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?" error for a very long time; I can't believe I stumbled on this PR! Any chance this fix could get released ASAP?

@nikola-jokic
Collaborator

Hey @elocke,

You can use these changes before we issue the next release by not using our official chart, but a local copy of it with the changes you need.

@elocke

elocke commented Apr 30, 2025

Gave it a shot; unfortunately AWS EKS does not support sidecar init containers yet :/ Thanks for your work!

@velkovb
Contributor Author

velkovb commented Apr 30, 2025

@elocke we are on EKS and it works just fine. You need to be above 1.29 for it to work.

@elocke

elocke commented Apr 30, 2025

@elocke we are on EKS and it works just fine. You need to be above 1.29 for it to work.

Very interesting; we're on 1.32 and AWS support told me it's not supported. The restartPolicy in the init container is dropped from the live manifest.

Even this test doesn't work for me:

apiVersion: v1
kind: Pod
metadata:
  name: sidecar-test-behavior
  labels:
    admission.datadoghq.com/enabled: "false"

spec:
  initContainers:
  - name: sidecar-init
    image: docker:dind
    args: ["dockerd", "--host=unix:///var/run/docker.sock"]
    securityContext:
      privileged: true
    restartPolicy: Always
    volumeMounts:
    - name: sock
      mountPath: /var/run
  containers:
  - name: main
    image: busybox:1.28
    command: ["sleep", "300"]
    volumeMounts:
    - name: sock
      mountPath: /var/run
  volumes:
  - name: sock
    emptyDir: {}

EDIT: figured it out, it was an old mutating webhook that was stripping the field. Thanks again. However, we're now seeing this error: ERROR: listing workers for Build: failed to list workers: Unavailable: connection error: desc = "error reading server preface: read unix @->/var/run/docker.sock: use of closed network connection"

@dee-kryvenko

Just noticed this change in the release notes. Why in the world are you assuming that dockerd does not need to handle SIGTERM gracefully? How did you test that theory, and what proof do you have?

@sorliem

sorliem commented Jun 13, 2025

This example in values.yaml is confusing. Wouldn't you need the docker:dind container to always run as a sidecar and not as an init container? If your action is doing docker builds in the runner, the docker daemon must be alive for the duration of the build; am I missing something?

edit, nevermind

@dmitrijrub

Hi, I have issues running dind-rootless.
The EphemeralRunner CRD gets the error:
Failed to create the pod: Pod "blablabla" is invalid: spec.initContainers[0].name: Duplicate value: "dind"

Can you update the official documentation on how to run rootless ARC?
https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/deploying-runner-scale-sets-with-actions-runner-controller#example-running-dind-rootless

@djabalbuena

Hi,

I wanted to ask whether restartPolicy: Always is already enough to fix the Docker daemon connectivity issue (Cannot connect to the Docker daemon at unix:///var/run/docker.sock), or whether probing is necessary to ensure the dind container is fully initialized before the runner starts.

For example, Ken Muse describes an approach where the dind container uses a startupProbe with the docker info command to verify that the Docker daemon is ready. This ensures the runner doesn't start prematurely and avoids connectivity issues.

Would love to hear your thoughts or experiences on whether restartPolicy: Always alone is sufficient or if probing adds significant value.

Thanks!
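For concreteness, the probe described above might look like this (a sketch under stated assumptions: the exec command and timing values are illustrative, not the chart's actual template). Kubernetes does not start the subsequent containers until a sidecar's startup probe succeeds, so this gates the runner on dockerd actually answering:

```yaml
# Sketch: gate the runner on the Docker daemon being ready, in addition
# to restartPolicy: Always keeping the sidecar alive for the pod's lifetime.
initContainers:
  - name: dind
    image: docker:dind
    securityContext:
      privileged: true
    restartPolicy: Always
    startupProbe:
      exec:
        command: ["docker", "info"]   # succeeds only once dockerd is up
      initialDelaySeconds: 1
      periodSeconds: 2
      failureThreshold: 30            # allow up to ~60s for startup
```

restartPolicy: Always fixes the lifecycle (dind no longer dies before the runner), while the probe addresses ordering at startup; they solve different halves of the connectivity problem.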
