Fix the local development environment and update documentation (#92)
Fix the local development environment using Astro CLI and a Kind Kubernetes cluster, and update the documentation.

While implementing #81, I faced several issues in the local development environment. Unfortunately, the existing documentation and configuration did not allow developers to run the example DAGs locally.

One of the main issues was that Airflow (running in Docker via Astro CLI) could not connect to Kind properly. Once that was solved, another critical problem was that Airflow could not access the Ray clusters created in the Kind Kubernetes cluster.

Some of the issues faced include:
* Inconsistent connection naming
* Missing Kind configuration
* Missing Docker overrides for Astro CLI
* macOS-specific Docker/Kind networking behavior

After applying all these changes, I was able to successfully run all the example DAGs locally:
![Screenshot 2024-11-26 at 14 42 21](https://github.com/user-attachments/assets/4c35e3e9-604b-4458-9107-f4945ffa2a67)

As illustrated below:

![Screenshot 2024-11-26 at 14 42 34](https://github.com/user-attachments/assets/f2a8a700-7d0a-43c6-a5f1-ea74d6f8b54b)

![Screenshot 2024-11-26 at 14 42 47](https://github.com/user-attachments/assets/cfe0324f-1325-4d2a-9ed5-68d8f7f61d63)

![Screenshot 2024-11-26 at 14 42 58](https://github.com/user-attachments/assets/23f1dfa5-1c7a-46b6-9096-34459bdc5047)

![Screenshot 2024-11-26 at 14 43 08](https://github.com/user-attachments/assets/6d407da4-68d9-41dc-8be7-0ddd62844308)

With this PR, I hope to save other Ray provider developers time.
tatiana authored Nov 27, 2024
1 parent 1abf239 commit 7eba460
Showing 9 changed files with 352 additions and 9 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/test.yml
@@ -135,6 +135,9 @@ jobs:
       - name: Run integration tests
         run: |
           hatch run tests.py${{ matrix.python-version }}-${{ matrix.airflow-version }}:test-integration
+        env:
+          RAY_SPEC_FILENAME: "ray-gke.yaml"
+
       - name: Upload coverage to Github
         uses: actions/upload-artifact@v4
         with:
5 changes: 4 additions & 1 deletion dev/dags/ray_taskflow_example.py
@@ -1,3 +1,4 @@
+import os
 from datetime import datetime
 from pathlib import Path
 
@@ -6,7 +7,9 @@
 from ray_provider.decorators import ray
 
 CONN_ID = "ray_conn"
-RAY_SPEC = Path(__file__).parent / "scripts/ray.yaml"
+RAY_SPEC_FILENAME = os.getenv("RAY_SPEC_FILENAME", "ray.yaml")
+RAY_SPEC = Path(__file__).parent / "scripts" / RAY_SPEC_FILENAME
+
 FOLDER_PATH = Path(__file__).parent / "ray_scripts"
 RAY_TASK_CONFIG = {
     "conn_id": CONN_ID,
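The spec-file selection introduced in this hunk can be sketched in isolation. The helper name `resolve_ray_spec` and its `base_dir` parameter are hypothetical and exist only for illustration; the environment variable name, the `scripts` folder, and the `ray.yaml` default are taken from the diff.

```python
import os
from pathlib import Path


def resolve_ray_spec(base_dir: Path, default: str = "ray.yaml") -> Path:
    """Return the RayCluster spec path, honouring RAY_SPEC_FILENAME if it is set.

    Hypothetical helper: the DAG in this commit performs the same lookup inline.
    """
    # Fall back to the local-development spec when the variable is unset.
    filename = os.getenv("RAY_SPEC_FILENAME", default)
    return base_dir / "scripts" / filename
```

With the variable unset, `resolve_ray_spec(Path("dev/dags"))` points at `dev/dags/scripts/ray.yaml`; with `RAY_SPEC_FILENAME=ray-gke.yaml`, as set in the CI workflow, it points at `dev/dags/scripts/ray-gke.yaml`.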
2 changes: 1 addition & 1 deletion dev/dags/ray_taskflow_example_existing_cluster.py
@@ -5,7 +5,7 @@
 
 from ray_provider.decorators import ray
 
-CONN_ID = "ray_job"
+CONN_ID = "ray_conn"
 FOLDER_PATH = Path(__file__).parent / "ray_scripts"
 RAY_TASK_CONFIG = {
     "conn_id": CONN_ID,
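Because the commit message lists inconsistent connection naming among the problems, a small check along the following lines could help catch a recurrence. This is a sketch and not part of the commit: `find_conn_ids` is a hypothetical helper, and it assumes each example DAG assigns `CONN_ID` as a plain string literal at the start of a line.

```python
from __future__ import annotations

import re
from pathlib import Path

# Matches module-level assignments such as: CONN_ID = "ray_conn"
CONN_ID_PATTERN = re.compile(r'^CONN_ID\s*=\s*["\']([^"\']+)["\']', re.MULTILINE)


def find_conn_ids(dags_dir: Path) -> dict[str, str]:
    """Map each DAG file name to the CONN_ID literal it defines (hypothetical helper)."""
    conn_ids = {}
    for dag_file in sorted(dags_dir.glob("*.py")):
        match = CONN_ID_PATTERN.search(dag_file.read_text())
        if match:
            # Files without a CONN_ID assignment are simply skipped.
            conn_ids[dag_file.name] = match.group(1)
    return conn_ids
```

Checking that `set(find_conn_ids(Path("dev/dags")).values())` contains a single element would flag a mismatch such as the `ray_job` / `ray_conn` one fixed in this hunk.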
62 changes: 62 additions & 0 deletions dev/dags/scripts/ray-gke.yaml
@@ -0,0 +1,62 @@
+apiVersion: ray.io/v1
+kind: RayCluster
+metadata:
+  name: raycluster-complete
+spec:
+  rayVersion: "2.10.0"
+  enableInTreeAutoscaling: true
+  headGroupSpec:
+    serviceType: LoadBalancer
+    rayStartParams:
+      dashboard-host: "0.0.0.0"
+      block: "true"
+    template:
+      metadata:
+        labels:
+          ray-node-type: head
+      spec:
+        containers:
+        - name: ray-head
+          image: rayproject/ray-ml:latest
+          resources:
+            limits:
+              cpu: 1
+              memory: 3Gi
+            requests:
+              cpu: 1
+              memory: 3Gi
+          lifecycle:
+            preStop:
+              exec:
+                command: ["/bin/sh","-c","ray stop"]
+          ports:
+          - containerPort: 6379
+            name: gcs
+          - containerPort: 8265
+            name: dashboard
+          - containerPort: 10001
+            name: client
+          - containerPort: 8000
+            name: serve
+          - containerPort: 8080
+            name: metrics
+  workerGroupSpecs:
+  - groupName: small-group
+    replicas: 1
+    minReplicas: 1
+    maxReplicas: 2
+    rayStartParams:
+      block: "true"
+    template:
+      metadata:
+      spec:
+        containers:
+        - name: machine-learning
+          image: rayproject/ray-ml:latest
+          resources:
+            limits:
+              cpu: 1
+              memory: 1Gi
+            requests:
+              cpu: 1
+              memory: 1Gi
10 changes: 7 additions & 3 deletions dev/dags/scripts/ray.yaml
@@ -1,7 +1,7 @@
 apiVersion: ray.io/v1
 kind: RayCluster
 metadata:
-  name: raycluster-complete
+  name: airflow-raycluster
 spec:
   rayVersion: "2.10.0"
   enableInTreeAutoscaling: true
@@ -15,9 +15,11 @@ spec:
         labels:
           ray-node-type: head
       spec:
+        imagePullSecrets:
+        - name: my-registry-secret
         containers:
         - name: ray-head
-          image: rayproject/ray-ml:latest
+          image: rayproject/ray:2.20.0-aarch64
           resources:
             limits:
               cpu: 1
@@ -50,9 +52,11 @@ spec:
     template:
       metadata:
       spec:
+        imagePullSecrets:
+        - name: my-registry-secret
         containers:
         - name: machine-learning
-          image: rayproject/ray-ml:latest
+          image: rayproject/ray:2.20.0-aarch64
           resources:
             limits:
               cpu: 1
18 changes: 18 additions & 0 deletions dev/docker-compose.override.yml
@@ -0,0 +1,18 @@
+version: '3.8'
+
+services:
+  webserver:
+    networks:
+      - kind
+
+  scheduler:
+    networks:
+      - kind
+
+  triggerer:
+    networks:
+      - kind
+
+networks:
+  kind:
+    external: true
16 changes: 16 additions & 0 deletions dev/kind-config.yaml
@@ -0,0 +1,16 @@
+kind: Cluster
+name: local
+apiVersion: kind.x-k8s.io/v1alpha4
+networking:
+  apiServerAddress: "0.0.0.0"
+  apiServerPort: 6443
+nodes:
+  - role: control-plane
+    kubeadmConfigPatchesJSON6902:
+      - group: kubeadm.k8s.io
+        version: v1beta3
+        kind: ClusterConfiguration
+        patch: |
+          - op: add
+            path: /apiServer/certSANs/-
+            value: host.docker.internal
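When debugging whether the API server configured in `kind-config.yaml` is reachable from inside an Airflow container, a plain TCP probe can serve as a first diagnostic. The sketch below is an illustration and not part of the commit: `can_connect` is a hypothetical helper, and a successful TCP connection says nothing about TLS certificates or Kubernetes credentials.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For the configuration in this commit, the call would be `can_connect("host.docker.internal", 6443)`, using the certificate SAN and API server port from `kind-config.yaml`; whether that host name resolves depends on the Docker setup.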
