diff --git a/README.md b/README.md
Lines changed: 5 additions & 135 deletions
diff --git a/latest-config-files/values.yaml b/config-files/iree-org/azure-linux-scale.yaml
latest-config-files/values.yaml renamed to config-files/iree-org/azure-linux-scale.yaml
Lines changed: 34 additions & 19 deletions
diff --git a/legacy-config-files/horizontal-scale.yaml b/legacy-config-files/horizontal-scale.yaml
Lines changed: 0 additions & 15 deletions
diff --git a/legacy-config-files/runner-controller.yaml b/legacy-config-files/runner-controller.yaml
Lines changed: 0 additions & 42 deletions
diff --git a/legacy-config-files/runner-deployment.yaml b/legacy-config-files/runner-deployment.yaml
Lines changed: 0 additions & 17 deletions
@@ -1,8 +1,8 @@
-# Azure-AKS-ARC-Setup
+# ARC Setup
 
-Documentation for bringing up an Azure Kubernetes cluster integrated with GitHub Actions Runner Controller for IREE Project
+Documentation for bringing up a Kubernetes cluster integrated with GitHub Actions Runner Controller.
 
-### Step 1: Create Azure Kubernetes Service
+### Step 1: Create Azure Kubernetes Service (skip if kubernetes already setup on bare metal or other CSP)
 
 Search for Kubernetes Service in the top search bar in Azure Portal. Once in, now click on Create -> Kubernetes Cluster. 
 Choose your resource group and cluster name and proceed with default options for Basics.
@@ -19,7 +19,7 @@ I went with this VM because out of all the 48 core ones, it is the only one that
 
 For the rest of the cluster creation options you can choose the default.
 
-### Step 2: Login to your Cluster
+### Step 2: Login to your Cluster (skip if kubernetes already setup on bare metal or other CSP)
 
 Now, to configure the cluster and all the services you need to connect to the cluster.
 You can do this in your own local dev environment (just make sure you have kube, helm, and azure cli installed)
@@ -44,7 +44,7 @@ helm install arc --namespace "arc" --create-namespace oci://ghcr.io/actions/acti
 ### Step 4: Configure and Deploy Runner Scale Set
 
 ```
-helm upgrade --install "azure-linux-scale"     --namespace "<namespace_name_for_runners>"     --create-namespace     --set githubConfigUrl="<link_to_your_github_repo_or_org>"     --set githubConfigSecret.github_token="<your_PAT_token>"     oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set -f values.yaml
+helm upgrade --install "azure-linux-scale"     --namespace "<namespace_name_for_runners>"     --create-namespace  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set -f config-file.yaml
 ```
 
 Please use the values.yaml file from `latest-config-files` folder in this repo for the above command.
@@ -58,133 +58,3 @@ The scaling setup is basically the same as the legacy documentation below, so pl
 Also, docker in docker is setup, so in our github workflows we can specify images to use if we want (iree uses cpubuilder_ubuntu_jammy image for example), but as done in iree-turbine, we can just run workflows using the preconfigured custom image here without further setup and that works too.
 
 And you're done (just make sure label matches installation name in workflow) :)
-
-# Legacy ARC Instructions (still works)
-
-### Step 3: Install Cert Manager
-
-```
-helm repo add jetstack https://charts.jetstack.io
-helm repo update
-helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.15.3 --set crds.enabled=true
-```
-
-Cert-Manager is a Kubernetes add-on that automates the management and issuance of TLS (Transport Layer Security) certificates.
-This is used for security reasons.
-
-### Step 4: Install Github ARC and Authenticate
-
-I do this using a personal token. So, if you don't have one, create a github token with these permissions:
-
-```
-repo (all)
-admin:org (all) (mandatory for organization-wide runner)
-admin:enterprise (all) (mandatory for enterprise-wide runner)
-admin:public_key - read:public_key
-admin:repo_hook - read:repo_hook
-admin:org_hook
-notifications
-workflow
-```
-
-We will also be adding a webhook server as part of installing the actions-runner-controller, so we need to create a secret for the server to authenticate the github webhooks coming in.
-
-```
-kubectl create namespace actions-runner-system
-kubectl create secret generic github-selfhosted-webhook-token -n actions-runner-system --from-literal=SELFHOSTED_GITHUB_WEBHOOK_SECRET_TOKEN=<your_webhook_secret>
-```
-
-Then, use the following command to install the github ARC
-
-```
-helm repo add actions-runner-controller https://actions-runner-controller.github.io/actions-runner-controller
-helm repo update
-helm upgrade --install --namespace actions-runner-system --set=authSecret.create=true --set=authSecret.github_token="<your_token>"  --wait actions-runner-controller actions-runner-controller/actions-runner-controller -f runner-controller.yaml
-```
-
-The yaml file used above configures the actions runner controller service and the webhook server. I've added the yaml file I used (`runner-controller.yaml`) to this repo.
-Here we tell it to configure a bunch of things for the runner controller, and we give it a docker image to use.
-I've set it up to use `summerwind/actions-runner:ubuntu-22.04` which is the latest one provided by the github actions controller with dind enabled.
-This works fine for us and passes all iree-turbine jobs (with no docker) and the iree jobs (these use multiple docker images and work through dind)
-
-### Step 5: Configure GitHub Webhooks
-
-I've set this up to use webhooks to drive the overall scaling of our cluster.
-This scaling is performed based on the number of webhook events received from GitHub.
-Here's an image on how that overall process works:
-
-![image](https://github.com/user-attachments/assets/b11266c5-0c80-4a34-aa18-19a4da255965)
-
-
-To configure this, first we need to expose the github-webhook server created above to the public, so it can receive from GitHub API.
-To do this, get the current configuration if the server using this command:
-`kubectl get svc actions-runner-controller-github-webhook-server -n actions-runner-system -o yaml > current-config.yaml`
-
-Then, open up current-config.yaml and change spec type from `ClusterIP` to `LoadBalancer` in the yaml file and also delete the following lines which aren't neccesary after the switch.
-Also change `http` to `https` in the config.
-```
-clusterIP: 10.0.11.74
-  clusterIPs:
-  - 10.0.11.74
-  internalTrafficPolicy: Cluster
-  ipFamilies:
-  - IPv4
-  ipFamilyPolicy: SingleStack
-```
-TODO(saienduri): Find a way to just configure it with a load balancer initially (just webhook server, not the service)
-
-Then, to actually update the service to use the updated config:
-```
-kubectl apply -f current-config.yaml
-```
-
-Now that the server and webhook secret have been configured, you can go to the github org/repo to set up the github side of things.
-Go to "Settings" -> "Webhooks".
-Create a new webhook with address `http://<external-ip>/webhooks` and the content type as `application/json`.
-Then in the secret section add the secret that we added earlier.
-For events, you can pick "Let me select individual events" and then choose push, workflow, and workflow jobs.
-If you don't know the external IP of the webhook server you can run:
-`kubectl get svc -n actions-runner-system`
-
-<img width="566" alt="image" src="https://github.com/user-attachments/assets/76e5d247-c5dd-4aef-aba1-374b789ce7f8">
-
-
-### Step 6: Deploy the Runners
-
-Here, we deploy the runners.
-Specifically, we tell the actions runner controller how much resources we need (45 cores, 50 GB).
-We also give it a runner label that we use in the actual workflow `runs-on:` (I use azure-linux in the yaml)
-You can use the yaml in this repo (runner-deployment.yaml) in the following command:
-
-`kubectl apply -f runner-deployment.yaml`
-
-### Step 7: Configure HRA
-
-This is to configure GitHub Actions Runner Controller's HorizontalRunnerAutoscaler (HRA).
-With the GitHub Actions Runner Controller in a Kubernetes cluster, each runner corresponds to a single container within a pod, and each pod only runs one runner.
-This particular design of the Actions Runner Controller makes sure that each runner operates in its own isolated environment, for the best security of concurrent CI jobs running.
-So, you can think of HRA as a specialized version of HPA, and we don't need it in the GitHub ARC context.
-Here, we tell HRA to scale the GitHub Actions runners based on the webhooks we configured earlier.
-Specifically, we trigger an autoscale everytime there is a webhook event for a workflow, so a runner will be requested.
-It will also downscale appropriately.
-You can use the yaml in this repo (horizontal-scale.yaml) for the following command:
-
-`kubectl apply -f horizontal-scale.yaml`
-
-Basically there are two levels of autoscaling.
-HRA adjusts the number of pods to meet the runner demand.
-If the number of pods increases beyond the capacity of the current nodes, the Cluster Autoscaler (the thing we setup at the very start) steps in to scale up the node pool, adding more nodes to provide the necessary resources for the additional pods.
-
-
-Now, change your workflows appropriately to match the labels set in the runner-deployment.yaml and enjoy the AKS + ARC magic :)
-
-
-
-
-
-
-
-
-
-
-
@@ -1,3 +1,8 @@
+# Cluster: Azure SaiScale Kubernetes Cluster
+# Deployment command:
+# helm upgrade --install "azure-linux-scale" --namespace "arc-runners" --create-namespace oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set -f <path-to-this-file>
+githubConfigUrl: https://github.com/iree-org
+githubConfigSecret: "iree-secret"
 ## maxRunners is the max number of runners the auto scaling runner set will scale up to.
 maxRunners: 30
 
@@ -16,11 +21,39 @@ template:
         volumeMounts:
           - name: dind-externals
             mountPath: /home/runner/tmpDir
+      - name: dind
+        image: ghcr.io/saienduri/dind:main
+        restartPolicy: Always
+        command: ["sh", "-c"]
+        args:
+          - |
+            dockerd --host=unix:///var/run/docker.sock --group=${DOCKER_GROUP_GID} &
+            until docker info >/dev/null 2>&1; do sleep 5; done
+            tail -f /dev/null
+        env:
+          - name: DOCKER_GROUP_GID
+            value: "123"
+        securityContext:
+          privileged: true
+        volumeMounts:
+          - name: work
+            mountPath: /home/runner/_work
+          - name: dind-sock
+            mountPath: /var/run
+          - name: dind-externals
+            mountPath: /home/runner/externals
     containers:
       - name: runner
         image: ghcr.io/saienduri/ghascale:main
         imagePullPolicy: Always
-        command: ["/home/runner/run.sh"]
+        command:
+          - /bin/sh
+          - -c
+          - |
+            # Wait for Docker to be ready before starting runner
+            echo "Waiting for docker..."
+            until docker info >/dev/null 2>&1; do sleep 5; done
+            /home/runner/run.sh
         resources:
           requests:
             cpu: 40000m
@@ -33,24 +66,6 @@ template:
             mountPath: /home/runner/_work
           - name: dind-sock
             mountPath: /var/run
-      - name: dind
-        image: docker:dind
-        args:
-          - dockerd
-          - --host=unix:///var/run/docker.sock
-          - --group=$(DOCKER_GROUP_GID)
-        env:
-          - name: DOCKER_GROUP_GID
-            value: "123"
-        securityContext:
-          privileged: true
-        volumeMounts:
-          - name: work
-            mountPath: /home/runner/_work
-          - name: dind-sock
-            mountPath: /var/run
-          - name: dind-externals
-            mountPath: /home/runner/externals
     volumes:
       - name: work
         emptyDir: {}
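
Note on the wait-for-Docker loops added in this change: both the `dind` container and the `runner` container poll `docker info` every 5 seconds with no upper bound, so a Docker daemon that never starts would leave the runner pod waiting indefinitely. The sketch below shows one possible bounded variant of the same loop. It is not part of this change; the helper name `wait_for` and the 300-second timeout in the usage comment are assumptions chosen for illustration.

```shell
#!/bin/sh
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND until it succeeds or until at least
# TIMEOUT_SECONDS of sleep time have accumulated.
# Returns 0 once COMMAND succeeds, 1 on timeout.
wait_for() {
  timeout="$1"
  interval="$2"
  shift 2
  elapsed=0
  # Same polling pattern as the diff: discard the probe's output and retry.
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${elapsed}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Possible use in the runner container command (timeout value is an assumption):
#   wait_for 300 5 docker info && /home/runner/run.sh
```

With a bounded wait, the container exits with a non-zero status when Docker never becomes ready, which lets Kubernetes surface the failure instead of leaving the runner silently idle.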