
[feat][plugin] support creating RayCluster with config file #3225

Merged: 2 commits into ray-project:master from config-file, Apr 10, 2025

Conversation

davidxia
Contributor

@davidxia davidxia commented Mar 25, 2025

As described in the Ray Kubectl Plugin 1.4.0 Wishlist.

Here are the resulting RayCluster YAMLs after running kubectl ray create cluster dxia-test --file /path/to/file.yaml --dry-run on various config files.

RayCluster from a minimal config file

Config file:

worker-groups:
- replicas: 1
  gpu: 1
Resulting RayCluster:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: dxia-test
  namespace: default
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    template:
      spec:
        containers:
        - image: rayproject/ray:2.41.0
          name: ray-head
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              memory: 4Gi
            requests:
              cpu: "2"
              memory: 4Gi
  rayVersion: 2.41.0
  workerGroupSpecs:
  - groupName: default-group
    rayStartParams:
      metrics-export-port: "8080"
    replicas: 1
    template:
      spec:
        containers:
        - image: rayproject/ray:2.41.0
          name: ray-worker
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
RayCluster from a full config file

Config file:

namespace: hyperkube
name: dxia-test

labels:
  foo: bar
annotations:
  dead: beef

ray-version: 2.44.0
image: rayproject/ray:2.44.0

head:
  cpu: 3
  memory: 5Gi
  gpu: 0
  ephemeral-storage: 8Gi
  ray-start-params:
    metrics-export-port: 8082
  node-selectors:
    foo: bar
    baz: qux

worker-groups:
- name: cpu-workers
  replicas: 1
  cpu: 2
  memory: 4Gi
  gpu: 0
  ephemeral-storage: 12Gi
  ray-start-params:
    metrics-export-port: 8081
  node-selectors:
    hi: there
- name: gpu-workers
  replicas: 1
  cpu: 3
  memory: 6Gi
  gpu: 1
  ephemeral-storage: 13Gi
  ray-start-params:
    metrics-export-port: 8081

gke:
  # Cloud Storage FUSE options
  gcsfuse:
    # Required bucket name
    bucket-name: my-bucket
    # Required mount path where bucket will be mounted in both head and worker nodes
    mount-path: /mnt/cluster_storage
    # See the Cloud Storage FUSE CLI file docs for all supported mount options.
    # https://cloud.google.com/storage/docs/cloud-storage-fuse/cli-options#options
    mount-options: "implicit-dirs,uid=1000,gid=100"
    # Optional resource configs for Cloud Storage FUSE CSI driver sidecar container
    resources:
      cpu: 250m
      memory: 256Mi
      ephemeral-storage: 5Gi
    # Optional volume attributes for Cloud Storage FUSE CSI driver
    # from the following page excluding the ones that Google recommends you set as mount options.
    # https://cloud.google.com/kubernetes-engine/docs/reference/cloud-storage-fuse-csi-driver/volume-attr
    disable-metrics: true
    gcsfuse-metadata-prefetch-on-mount: false
    skip-csi-bucket-access-check: false
Resulting RayCluster:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    dead: beef
  labels:
    foo: bar
  name: dxia-test
  namespace: hyperkube
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
      metrics-export-port: "8082"
    template:
      metadata:
        annotations:
          gke-gcsfuse/cpu-request: 250m
          gke-gcsfuse/ephemeral-storage-limit: 5Gi
          gke-gcsfuse/ephemeral-storage-request: 5Gi
          gke-gcsfuse/memory-limit: 256Mi
          gke-gcsfuse/memory-request: 256Mi
      spec:
        containers:
        - image: rayproject/ray:2.44.0
          name: ray-head
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          resources:
            limits:
              ephemeral-storage: 8Gi
              memory: 5Gi
            requests:
              cpu: "3"
              ephemeral-storage: 8Gi
              memory: 5Gi
          volumeMounts:
          - mountPath: /mnt/cluster_storage
            name: cluster-storage
        nodeSelector:
          baz: qux
          foo: bar
        volumes:
        - csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: my-bucket
              disableMetrics: "true"
              gcsfuseMetadataPrefetchOnMount: "false"
              mountOptions: implicit-dirs,uid=1000,gid=100
              skipCSIBucketAccessCheck: "false"
          name: cluster-storage
  rayVersion: 2.44.0
  workerGroupSpecs:
  - groupName: cpu-workers
    rayStartParams:
      metrics-export-port: "8081"
    replicas: 1
    template:
      metadata:
        annotations:
          gke-gcsfuse/cpu-request: 250m
          gke-gcsfuse/ephemeral-storage-limit: 5Gi
          gke-gcsfuse/ephemeral-storage-request: 5Gi
          gke-gcsfuse/memory-limit: 256Mi
          gke-gcsfuse/memory-request: 256Mi
      spec:
        containers:
        - image: rayproject/ray:2.44.0
          name: ray-worker
          resources:
            limits:
              ephemeral-storage: 12Gi
              memory: 4Gi
            requests:
              cpu: "2"
              ephemeral-storage: 12Gi
              memory: 4Gi
          volumeMounts:
          - mountPath: /mnt/cluster_storage
            name: cluster-storage
        nodeSelector:
          hi: there
        volumes:
        - csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: my-bucket
              disableMetrics: "true"
              gcsfuseMetadataPrefetchOnMount: "false"
              mountOptions: implicit-dirs,uid=1000,gid=100
              skipCSIBucketAccessCheck: "false"
          name: cluster-storage
  - groupName: gpu-workers
    rayStartParams:
      metrics-export-port: "8081"
    replicas: 1
    template:
      metadata:
        annotations:
          gke-gcsfuse/cpu-request: 250m
          gke-gcsfuse/ephemeral-storage-limit: 5Gi
          gke-gcsfuse/ephemeral-storage-request: 5Gi
          gke-gcsfuse/memory-limit: 256Mi
          gke-gcsfuse/memory-request: 256Mi
      spec:
        containers:
        - image: rayproject/ray:2.44.0
          name: ray-worker
          resources:
            limits:
              ephemeral-storage: 13Gi
              memory: 6Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "3"
              ephemeral-storage: 13Gi
              memory: 6Gi
              nvidia.com/gpu: "1"
          volumeMounts:
          - mountPath: /mnt/cluster_storage
            name: cluster-storage
        volumes:
        - csi:
            driver: gcsfuse.csi.storage.gke.io
            volumeAttributes:
              bucketName: my-bucket
              disableMetrics: "true"
              gcsfuseMetadataPrefetchOnMount: "false"
              mountOptions: implicit-dirs,uid=1000,gid=100
              skipCSIBucketAccessCheck: "false"
          name: cluster-storage

depends on #3238
closes #3142

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(


// GKEConfig represents GKE-specific configuration
type GKEConfig struct {
	GCSFuse *GCSFuseConfig `yaml:"gcsfuse,omitempty"`
}
Collaborator

This is awesome :) do you think it's required to allow configuration of the service account to make this work?

Contributor Author

Thanks, will test! By the way, a lot of these lines were added by a Claude model in my Cursor IDE, and I haven't checked that they make sense or work yet. :)

Contributor Author

> do you think it's required to allow configuration of the service account to make this work?

I think so. How do you think that should look? Should we add kubectl ray create cluster --service-account KSA_NAME or something in a separate PR?

@davidxia davidxia force-pushed the config-file branch 2 times, most recently from 206b973 to 271fe11 Compare March 27, 2025 13:06
@davidxia davidxia force-pushed the config-file branch 6 times, most recently from 0848b28 to a7d1dc5 Compare March 28, 2025 13:57
@davidxia davidxia changed the title create cluster with config file [feat][plugin] support creating RayCluster with config file Mar 28, 2025
Comment on lines 119 to 151
if *options.configFlags.Namespace == "" {
	*options.configFlags.Namespace = "default"
}

Contributor Author

Moved to Run() because we process the config file there, and it might specify the namespace too.

@@ -132,7 +205,7 @@ func (options *CreateClusterOptions) Complete(cmd *cobra.Command, args []string)
return nil
}

-func (options *CreateClusterOptions) Validate() error {
+func (options *CreateClusterOptions) Validate(cmd *cobra.Command) error {
Contributor Author

pass cmd because switchesIncompatibleWithConfigFilePresent(cmd) needs it



Comment on lines +62 to +82
type GCSFuse struct {
	MountOptions                   *string           `yaml:"mount-options,omitempty"`
	DisableMetrics                 *bool             `yaml:"disable-metrics,omitempty"`
	GCSFuseMetadataPrefetchOnMount *bool             `yaml:"gcsfuse-metadata-prefetch-on-mount,omitempty"`
	SkipCSIBucketAccessCheck       *bool             `yaml:"skip-csi-bucket-access-check,omitempty"`
	Resources                      *GCSFuseResources `yaml:"resources,omitempty"`
	BucketName                     string            `yaml:"bucket-name"`
	MountPath                      string            `yaml:"mount-path"`
}
Contributor Author

lmk if I'm missing anything here

Collaborator

I think we still need a way to specify service accounts, which is usually necessary for IAM binding with workload identity

Contributor Author

Does a service-account field at the top level of the config make sense? See here.

If specified, the RayCluster head and all worker group Pod templates would have that as their serviceAccountName? Do we need any validation?
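A minimal sketch of what that propagation could look like. The types and function name here are hypothetical stand-ins, not kuberay's real applyconfiguration types; the only behavior taken from the thread is "copy one top-level service-account value into the head and every worker group Pod template".

```go
package main

import "fmt"

// podSpec and rayClusterSketch are simplified stand-ins for the real
// RayCluster Pod template types.
type podSpec struct {
	ServiceAccountName string
}

type rayClusterSketch struct {
	Head         podSpec
	WorkerGroups []podSpec
}

// applyServiceAccount copies a top-level service-account value into the
// head and all worker group Pod templates. An empty value means the field
// was omitted, so the default service account is kept.
func applyServiceAccount(rc *rayClusterSketch, sa string) {
	if sa == "" {
		return
	}
	rc.Head.ServiceAccountName = sa
	for i := range rc.WorkerGroups {
		rc.WorkerGroups[i].ServiceAccountName = sa
	}
}

func main() {
	rc := rayClusterSketch{WorkerGroups: make([]podSpec, 2)}
	applyServiceAccount(&rc, "my-ksa")
	fmt.Println(rc.Head.ServiceAccountName, rc.WorkerGroups[1].ServiceAccountName)
}
```

Validation beyond a syntactic name check is hard here, since whether the ServiceAccount exists can only be verified against the live cluster.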

Contributor Author

Added in 93533ad

const (
	volumeName = "cluster-storage"
)
Contributor Author

I hardcoded the name to match the one used in Distributed checkpointing with KubeRay and GCSFuse. Let me know if this should be exposed to the user and made configurable.

@davidxia davidxia force-pushed the config-file branch 4 times, most recently from fecb922 to 56a8f79 Compare March 28, 2025 18:09
const (
	volumeName = "cluster-storage"
)
Contributor Author

Should this be configurable?

Collaborator

@andrewsykim andrewsykim left a comment

LGTM! @chiayi @MortalHappiness can you take a look too?

@davidxia davidxia force-pushed the config-file branch 2 times, most recently from dd6db04 to fdbfac8 Compare April 2, 2025 17:27
Comment on lines +104 to +118
cmd.Flags().StringToStringVar(&options.labels, "labels", nil, "K8s labels (e.g. --labels app=ray,env=dev)")
cmd.Flags().StringToStringVar(&options.annotations, "annotations", nil, "K8s annotations (e.g. --annotations ttl-hours=24,owner=chthulu)")
Contributor Author

reordered flags to be in more logical groupings

Comment on lines -98 to +124
-cmd.Flags().StringVar(&options.headCPU, "head-cpu", "2", "number of CPUs in the Ray head")
-cmd.Flags().StringVar(&options.headMemory, "head-memory", "4Gi", "amount of memory in the Ray head")
-cmd.Flags().StringVar(&options.headGPU, "head-gpu", "0", "number of GPUs in the Ray head")
-cmd.Flags().StringVar(&options.headEphemeralStorage, "head-ephemeral-storage", "", "amount of ephemeral storage in the Ray head")
+cmd.Flags().StringVar(&options.headCPU, "head-cpu", util.DefaultHeadCPU, "number of CPUs in the Ray head")
+cmd.Flags().StringVar(&options.headMemory, "head-memory", util.DefaultHeadMemory, "amount of memory in the Ray head")
+cmd.Flags().StringVar(&options.headGPU, "head-gpu", util.DefaultHeadGPU, "number of GPUs in the Ray head")
+cmd.Flags().StringVar(&options.headEphemeralStorage, "head-ephemeral-storage", util.DefaultHeadEphemeralStorage, "amount of ephemeral storage in the Ray head")
Contributor Author

The meaningful change here is moving the default values into constants in util so they stay DRY and can be used both as the flag defaults here and as the config-file defaults.
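A sketch of the shared defaults being described. The constant names match the new flag registrations in the diff, and the values mirror the old hard-coded flag defaults visible there ("2", "4Gi", "0", ""); in the PR these live in the util package.

```go
package main

import "fmt"

// Shared defaults used both for CLI flag defaults and config-file
// defaults. Values are taken from the previous hard-coded flag defaults.
const (
	DefaultHeadCPU              = "2"
	DefaultHeadMemory           = "4Gi"
	DefaultHeadGPU              = "0"
	DefaultHeadEphemeralStorage = "" // unset by default
)

func main() {
	fmt.Println(DefaultHeadCPU, DefaultHeadMemory, DefaultHeadGPU)
}
```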

@@ -16,8 +22,12 @@ import (
rayv1ac "github.com/ray-project/kuberay/ray-operator/pkg/client/applyconfiguration/ray/v1"
)

type RayClusterSpecObject struct {
Contributor Author

renamed to RayClusterConfig

Contributor Author

changes here are the result of renaming RayClusterSpecObject to RayClusterConfig

Comment on lines -29 to +40
-HeadCPU *string `yaml:"head-cpu,omitempty"`
-HeadGPU *string `yaml:"head-gpu,omitempty"`
-HeadMemory *string `yaml:"head-memory,omitempty"`
-HeadEphemeralStorage *string `yaml:"head-ephemeral-storage,omitempty"`
-HeadRayStartParams map[string]string `yaml:"head-ray-start-params,omitempty"`
-HeadNodeSelectors map[string]string `yaml:"head-node-selectors,omitempty"`
+Head *Head `yaml:"head,omitempty"`
Contributor Author

nested head attributes into Head struct and removed head- prefix from the YAML keys.

Comment on lines 41 to 49
	WorkerCPU              *string           `yaml:"worker-cpu,omitempty"`
	WorkerGPU              *string           `yaml:"worker-gpu,omitempty"`
	WorkerMemory           *string           `yaml:"worker-memory,omitempty"`
	WorkerEphemeralStorage *string           `yaml:"worker-ephemeral-storage,omitempty"`
	WorkerReplicas         *int32            `yaml:"worker-replicas,omitempty"`
	WorkerRayStartParams   map[string]string `yaml:"worker-ray-start-params,omitempty"`
	WorkerNodeSelectors    map[string]string `yaml:"worker-node-selectors,omitempty"`
Contributor Author

removed worker- prefix

Comment on lines 46 to 67
type Head struct {
	CPU              *string           `yaml:"cpu,omitempty"`
	GPU              *string           `yaml:"gpu,omitempty"`
	Memory           *string           `yaml:"memory,omitempty"`
	EphemeralStorage *string           `yaml:"ephemeral-storage,omitempty"`
	RayStartParams   map[string]string `yaml:"ray-start-params,omitempty"`
	NodeSelectors    map[string]string `yaml:"node-selectors,omitempty"`
}

type WorkerGroup struct {
	Name             *string           `yaml:"name,omitempty"`
	CPU              *string           `yaml:"cpu,omitempty"`
	GPU              *string           `yaml:"gpu,omitempty"`
	Memory           *string           `yaml:"memory,omitempty"`
	EphemeralStorage *string           `yaml:"ephemeral-storage,omitempty"`
	RayStartParams   map[string]string `yaml:"ray-start-params,omitempty"`
	NodeSelectors    map[string]string `yaml:"node-selectors,omitempty"`
	Replicas         int32             `yaml:"replicas"`
}
Member

Have you considered using an embedded struct with `yaml:",inline"` for these two structs? The fields are almost the same.

Contributor Author

Good idea. I tried to see how it looks here. It gets DRYer in the struct definition, but it actually results in ~50 more lines of code because we have to add more lines to set the attributes now.
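For readers unfamiliar with the pattern being discussed, a minimal sketch of the embedded-struct approach, assuming the field set shown earlier in the thread; the struct names here are illustrative, not the PR's final code. With `yaml:",inline"` the promoted fields keep flat YAML keys instead of nesting under a sub-key.

```go
package main

import "fmt"

// nodeGroupBase holds the fields Head and WorkerGroup share.
type nodeGroupBase struct {
	CPU              *string           `yaml:"cpu,omitempty"`
	GPU              *string           `yaml:"gpu,omitempty"`
	Memory           *string           `yaml:"memory,omitempty"`
	EphemeralStorage *string           `yaml:"ephemeral-storage,omitempty"`
	RayStartParams   map[string]string `yaml:"ray-start-params,omitempty"`
	NodeSelectors    map[string]string `yaml:"node-selectors,omitempty"`
}

type head struct {
	nodeGroupBase `yaml:",inline"`
}

type workerGroup struct {
	nodeGroupBase `yaml:",inline"`
	Name          *string `yaml:"name,omitempty"`
	Replicas      int32   `yaml:"replicas"`
}

func main() {
	cpu := "2"
	wg := workerGroup{Replicas: 1}
	wg.CPU = &cpu // field promoted from the embedded struct
	fmt.Println(*wg.CPU, wg.Replicas)
}
```

The trade-off davidxia mentions is visible here: field access still works via promotion, but constructing values now takes extra assignments instead of one flat literal.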

@davidxia davidxia force-pushed the config-file branch 2 times, most recently from f709e6f to 5de0fb8 Compare April 7, 2025 02:54
func (options *CreateClusterOptions) Validate(cmd *cobra.Command) error {
	if options.configFile != "" {
		if err := flagsIncompatibleWithConfigFilePresent(cmd); err != nil {
Contributor Author

Decided to error if the user mixes a config file with CLI flags, to keep things simple and avoid merge or override logic for now.
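A stdlib-only sketch of that validation decision. The real plugin checks which flags changed via cobra/pflag; here changedFlags stands in for that set, and both the function name and the allow-list are illustrative assumptions, not the PR's exact code.

```go
package main

import "fmt"

// checkConfigFileConflicts rejects cluster-shaping flags when a config
// file is given, instead of attempting any merge/override logic.
func checkConfigFileConflicts(configFile string, changedFlags []string) error {
	if configFile == "" {
		return nil // no config file: flags are the only input, all allowed
	}
	// Flags that remain meaningful alongside --file (assumed set).
	allowed := map[string]bool{
		"file":    true,
		"dry-run": true,
		"context": true,
	}
	for _, f := range changedFlags {
		if !allowed[f] {
			return fmt.Errorf("flag --%s is incompatible with --file; set it in the config file instead", f)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkConfigFileConflicts("cluster.yaml", []string{"file", "head-cpu"}))
	fmt.Println(checkConfigFileConflicts("cluster.yaml", []string{"file", "dry-run"}))
}
```

Erroring out, rather than merging, keeps precedence rules out of the UX entirely: the config file is authoritative whenever it is present.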

Member

SGTM for this.

@@ -21,22 +23,24 @@ func ValidateResourceQuantity(value string, name string) error {
return nil
}

-func ValidateTPUNodeSelector(numOfHosts int32, nodeSelector map[string]string) error {
+func ValidateTPU(tpu *string, numOfHosts *int32, nodeSelector map[string]string) error {
Contributor Author

Refactored this function to include the TPU string check so callers don't have to do it themselves.

@@ -30,23 +30,59 @@ func TestValidateResourceQuantity(t *testing.T) {
}

 func TestValidateTPUNodeSelector(t *testing.T) {
-	tests := []struct {
+	tests := map[string]struct {
Contributor Author

add a name to each test case
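The change switches the test table from an anonymous slice to a map keyed by case name, so a failure reports which case broke. An illustrative version of the pattern, runnable outside the testing package; validateReplicas is a stand-in, not the plugin's real validation function:

```go
package main

import "fmt"

// validateReplicas is a hypothetical stand-in for a validator under test.
func validateReplicas(n int32) error {
	if n < 0 {
		return fmt.Errorf("replicas must be non-negative, got %d", n)
	}
	return nil
}

func main() {
	// Map keys double as case names, the point of the refactor; in a real
	// _test.go file each entry would run under t.Run(name, ...).
	tests := map[string]struct {
		replicas int32
		wantErr  bool
	}{
		"one replica is valid":         {replicas: 1, wantErr: false},
		"zero replicas is valid":       {replicas: 0, wantErr: false},
		"negative replicas is invalid": {replicas: -1, wantErr: true},
	}
	for name, tc := range tests {
		err := validateReplicas(tc.replicas)
		if (err != nil) != tc.wantErr {
			fmt.Printf("%s: unexpected result: %v\n", name, err)
		}
	}
	fmt.Println("all cases checked")
}
```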

Member

@MortalHappiness MortalHappiness left a comment

LGTM!

Contributor

@chiayi chiayi left a comment

LGTM!

@kevin85421
Member

@davidxia would you mind fixing the conflict?

@davidxia
Contributor Author

> @davidxia would you mind fixing the conflict?

yup, updated

Member

@kevin85421 kevin85421 left a comment

Stamp

@kevin85421 kevin85421 merged commit 099bf61 into ray-project:master Apr 10, 2025
21 checks passed

Successfully merging this pull request may close these issues.

[Feat][kubectl-plugin] Config file for creating RayClusters