[feat][plugin] support creating RayCluster with config file #3225
Conversation
// GKEConfig represents GKE-specific configuration
type GKEConfig struct {
	GCSFuse *GCSFuseConfig `yaml:"gcsfuse,omitempty"`
This is awesome :) do you think it's required to allow configuration of the service account to make this work?
Thanks, will test! By the way, a lot of these lines were added by the Claude model in my Cursor IDE, and I haven't checked yet that they make sense or work. :)
do you think it's required to allow configuration of the service account to make this work?
I think so. How do you think that should look? Should we add kubectl ray create cluster --service-account KSA_NAME, or something in a separate PR?
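For illustration, a minimal sketch of how such a --service-account flag could be registered and applied to the generated Pod specs; the option struct, helper, and flag description are hypothetical, not taken from this PR:

```go
package sketch

import (
	"github.com/spf13/cobra"
	corev1 "k8s.io/api/core/v1"
)

// createClusterOptions is a stand-in for the plugin's real options struct.
type createClusterOptions struct {
	serviceAccount string
}

// newCreateClusterCommand registers a hypothetical --service-account flag.
func newCreateClusterCommand() *cobra.Command {
	opts := &createClusterOptions{}
	cmd := &cobra.Command{
		Use:   "cluster NAME",
		Short: "Create a RayCluster",
		RunE: func(cmd *cobra.Command, args []string) error {
			// ... build the RayCluster spec here, then stamp the service
			// account onto every Pod template if one was given.
			return nil
		},
	}
	cmd.Flags().StringVar(&opts.serviceAccount, "service-account", "",
		"Kubernetes service account for the Ray head and worker Pods")
	return cmd
}

// applyServiceAccount sets serviceAccountName on the given Pod specs.
func applyServiceAccount(ksa string, podSpecs ...*corev1.PodSpec) {
	if ksa == "" {
		return
	}
	for _, spec := range podSpecs {
		spec.ServiceAccountName = ksa
	}
}
```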
Force-pushed from 206b973 to 271fe11
Force-pushed from 0848b28 to a7d1dc5
if *options.configFlags.Namespace == "" {
	*options.configFlags.Namespace = "default"
}
move to Run(), because we process the config file there, which might specify a namespace too
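A small sketch of the precedence that defaulting in Run() would allow, assuming a hypothetical namespace field in the parsed config:

```go
package sketch

// resolveNamespace sketches the precedence implied above: an explicit
// --namespace flag wins, then a namespace from the config file (a
// hypothetical field), then "default".
func resolveNamespace(flagNamespace string, configNamespace *string) string {
	if flagNamespace != "" {
		return flagNamespace
	}
	if configNamespace != nil && *configNamespace != "" {
		return *configNamespace
	}
	return "default"
}
```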
@@ -132,7 +205,7 @@ func (options *CreateClusterOptions) Complete(cmd *cobra.Command, args []string)
 	return nil
 }

-func (options *CreateClusterOptions) Validate() error {
+func (options *CreateClusterOptions) Validate(cmd *cobra.Command) error {
pass cmd because switchesIncompatibleWithConfigFilePresent(cmd) needs it
type GCSFuse struct {
	MountOptions                   *string           `yaml:"mount-options,omitempty"`
	DisableMetrics                 *bool             `yaml:"disable-metrics,omitempty"`
	GCSFuseMetadataPrefetchOnMount *bool             `yaml:"gcsfuse-metadata-prefetch-on-mount,omitempty"`
	SkipCSIBucketAccessCheck       *bool             `yaml:"skip-csi-bucket-access-check,omitempty"`
	Resources                      *GCSFuseResources `yaml:"resources,omitempty"`
	BucketName                     string            `yaml:"bucket-name"`
	MountPath                      string            `yaml:"mount-path"`
}
lmk if I'm missing anything here
I think we still need a way to specify service accounts, which is usually necessary for IAM binding with workload identity
Does a service-account field at the top level of the config make sense? See here. If specified, the RayCluster head and all worker group Pod templates would use it as their serviceAccountName. Do we need any validation?
Added in 93533ad
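For illustration, a sketch of what a top-level service-account field plus a syntax-only check might look like; the type and function names are hypothetical stand-ins:

```go
package sketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

// rayClusterConfig is a trimmed stand-in for the real config type, showing
// only the top-level service-account field discussed above.
type rayClusterConfig struct {
	ServiceAccount *string `yaml:"service-account,omitempty"`
}

// validateServiceAccount only checks that the value is a syntactically valid
// Kubernetes object name; whether the ServiceAccount actually exists is left
// to the API server at apply time.
func validateServiceAccount(cfg *rayClusterConfig) error {
	if cfg.ServiceAccount == nil || *cfg.ServiceAccount == "" {
		return nil
	}
	if errs := validation.IsDNS1123Subdomain(*cfg.ServiceAccount); len(errs) > 0 {
		return fmt.Errorf("invalid service-account %q: %v", *cfg.ServiceAccount, errs)
	}
	return nil
}
```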
Annotations map[string]string
RayClusterSpecObject
const (
	volumeName = "cluster-storage"
I hardcoded the name, using the same one as in Distributed checkpointing with KubeRay and GCSFuse. Let me know if this should be exposed to the user and made configurable.
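For reference, a sketch of the kind of GCSFuse CSI volume that hardcoded name would attach to; the helper is hypothetical, while the driver name and bucketName attribute are the ones GKE's Cloud Storage FUSE CSI driver uses:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// volumeName mirrors the hardcoded name discussed above.
const volumeName = "cluster-storage"

// gcsFuseVolume builds the volume and mount a generated Pod spec could carry.
func gcsFuseVolume(bucketName, mountPath string) (corev1.Volume, corev1.VolumeMount) {
	vol := corev1.Volume{
		Name: volumeName,
		VolumeSource: corev1.VolumeSource{
			CSI: &corev1.CSIVolumeSource{
				Driver: "gcsfuse.csi.storage.gke.io",
				VolumeAttributes: map[string]string{
					"bucketName": bucketName,
				},
			},
		},
	}
	mount := corev1.VolumeMount{Name: volumeName, MountPath: mountPath}
	return vol, mount
}
```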
Force-pushed from fecb922 to 56a8f79
Force-pushed from 753f653 to dbb7621
type RayClusterSpecObject struct {
	Context *string `yaml:"context,omitempty"`
const (
	volumeName = "cluster-storage"
Should this be configurable?
LGTM! @chiayi @MortalHappiness can you take a look too?
Force-pushed from dd6db04 to fdbfac8
cmd.Flags().StringToStringVar(&options.labels, "labels", nil, "K8s labels (e.g. --labels app=ray,env=dev)")
cmd.Flags().StringToStringVar(&options.annotations, "annotations", nil, "K8s annotations (e.g. --annotations ttl-hours=24,owner=chthulu)")
reordered the flags into more logical groupings
-cmd.Flags().StringVar(&options.headCPU, "head-cpu", "2", "number of CPUs in the Ray head")
-cmd.Flags().StringVar(&options.headMemory, "head-memory", "4Gi", "amount of memory in the Ray head")
-cmd.Flags().StringVar(&options.headGPU, "head-gpu", "0", "number of GPUs in the Ray head")
-cmd.Flags().StringVar(&options.headEphemeralStorage, "head-ephemeral-storage", "", "amount of ephemeral storage in the Ray head")
+cmd.Flags().StringVar(&options.headCPU, "head-cpu", util.DefaultHeadCPU, "number of CPUs in the Ray head")
+cmd.Flags().StringVar(&options.headMemory, "head-memory", util.DefaultHeadMemory, "amount of memory in the Ray head")
+cmd.Flags().StringVar(&options.headGPU, "head-gpu", util.DefaultHeadGPU, "number of GPUs in the Ray head")
+cmd.Flags().StringVar(&options.headEphemeralStorage, "head-ephemeral-storage", util.DefaultHeadEphemeralStorage, "amount of ephemeral storage in the Ray head")
The meaningful change here is moving the default values into constants in util so they stay DRY and can be used both as the flag defaults here and as the config-file defaults.
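A sketch of the idea, assuming the constants carry the old flag defaults as their values; the valueOrDefault helper is hypothetical:

```go
package util

// Shared defaults so the flag registration and the config-file handling stay
// DRY; the values mirror the previous flag defaults shown in the diff above,
// and the exact values in the PR may differ.
const (
	DefaultHeadCPU              = "2"
	DefaultHeadMemory           = "4Gi"
	DefaultHeadGPU              = "0"
	DefaultHeadEphemeralStorage = ""
)

// valueOrDefault shows how a config value could fall back to the same
// constant the CLI flag uses.
func valueOrDefault(v *string, def string) string {
	if v != nil && *v != "" {
		return *v
	}
	return def
}
```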
@@ -16,8 +22,12 @@ import (
	rayv1ac "github.com/ray-project/kuberay/ray-operator/pkg/client/applyconfiguration/ray/v1"
)

type RayClusterSpecObject struct {
renamed to RayClusterConfig
changes here are the result of renaming RayClusterSpecObject to RayClusterConfig
-HeadCPU *string `yaml:"head-cpu,omitempty"`
-HeadGPU *string `yaml:"head-gpu,omitempty"`
-HeadMemory *string `yaml:"head-memory,omitempty"`
-HeadEphemeralStorage *string `yaml:"head-ephemeral-storage,omitempty"`
-HeadRayStartParams map[string]string `yaml:"head-ray-start-params,omitempty"`
-HeadNodeSelectors map[string]string `yaml:"head-node-selectors,omitempty"`
+Head *Head `yaml:"head,omitempty"`
nested the head attributes into a Head struct and removed the head- prefix from the YAML keys
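For illustration, a trimmed sketch of how a config file with the nested head key would unmarshal with gopkg.in/yaml.v3; the struct names here are hypothetical stand-ins for the real config types:

```go
package sketch

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Trimmed, hypothetical slice of the config types, just enough to show the
// nested head key without the old head- prefixes.
type headConfig struct {
	CPU    *string `yaml:"cpu,omitempty"`
	GPU    *string `yaml:"gpu,omitempty"`
	Memory *string `yaml:"memory,omitempty"`
}

type clusterConfig struct {
	Head *headConfig `yaml:"head,omitempty"`
}

func parseExample() error {
	raw := []byte(`
head:
  cpu: "2"
  memory: 4Gi
`)
	var cfg clusterConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		return err
	}
	fmt.Println(*cfg.Head.CPU, *cfg.Head.Memory) // 2 4Gi
	return nil
}
```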
WorkerCPU *string `yaml:"worker-cpu,omitempty"`
WorkerGPU *string `yaml:"worker-gpu,omitempty"`
WorkerMemory *string `yaml:"worker-memory,omitempty"`
WorkerEphemeralStorage *string `yaml:"worker-ephemeral-storage,omitempty"`
WorkerReplicas *int32 `yaml:"worker-replicas,omitempty"`
WorkerRayStartParams map[string]string `yaml:"worker-ray-start-params,omitempty"`
WorkerNodeSelectors map[string]string `yaml:"worker-node-selectors,omitempty"`
removed the worker- prefix
type Head struct {
	CPU              *string           `yaml:"cpu,omitempty"`
	GPU              *string           `yaml:"gpu,omitempty"`
	Memory           *string           `yaml:"memory,omitempty"`
	EphemeralStorage *string           `yaml:"ephemeral-storage,omitempty"`
	RayStartParams   map[string]string `yaml:"ray-start-params,omitempty"`
	NodeSelectors    map[string]string `yaml:"node-selectors,omitempty"`
}

WorkerGroups []WorkerGroupConfig `yaml:"worker-groups,omitempty"`

type WorkerGroup struct {
	Name             *string           `yaml:"name,omitempty"`
	CPU              *string           `yaml:"cpu,omitempty"`
	GPU              *string           `yaml:"gpu,omitempty"`
	Memory           *string           `yaml:"memory,omitempty"`
	EphemeralStorage *string           `yaml:"ephemeral-storage,omitempty"`
	RayStartParams   map[string]string `yaml:"ray-start-params,omitempty"`
	NodeSelectors    map[string]string `yaml:"node-selectors,omitempty"`
	Replicas         int32             `yaml:"replicas"`
}
Have you considered using an embedded struct with yaml:",inline" for these two structs? The fields are almost the same.
Good idea. I tried to see how it looks here. It gets DRYer in the struct definitions, but it actually results in ~50 more lines of code because we have to add more lines to set the attributes now.
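For reference, a minimal sketch of the inline-embedding variant being discussed; NodeGroupSpec is an illustrative name for the shared fields:

```go
package sketch

// NodeGroupSpec holds the fields shared by the head and the worker groups.
type NodeGroupSpec struct {
	CPU              *string           `yaml:"cpu,omitempty"`
	GPU              *string           `yaml:"gpu,omitempty"`
	Memory           *string           `yaml:"memory,omitempty"`
	EphemeralStorage *string           `yaml:"ephemeral-storage,omitempty"`
	RayStartParams   map[string]string `yaml:"ray-start-params,omitempty"`
	NodeSelectors    map[string]string `yaml:"node-selectors,omitempty"`
}

// Head and WorkerGroup embed the shared spec; yaml:",inline" keeps the YAML
// keys flat, so config files look the same as without the embedding.
type Head struct {
	NodeGroupSpec `yaml:",inline"`
}

type WorkerGroup struct {
	Name          *string `yaml:"name,omitempty"`
	Replicas      int32   `yaml:"replicas"`
	NodeGroupSpec `yaml:",inline"`
}
```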
Force-pushed from f709e6f to 5de0fb8
}
func (options *CreateClusterOptions) Validate(cmd *cobra.Command) error {
	if options.configFile != "" {
		if err := flagsIncompatibleWithConfigFilePresent(cmd); err != nil {
decided to error if the user tries to mix the config file with CLI flags, to keep it simple and avoid merge or override logic for now
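A sketch of what that check could look like with cobra, using Flags().Visit to find flags the user explicitly set; the allow-list below is an assumption:

```go
package sketch

import (
	"fmt"

	"github.com/spf13/cobra"
	"github.com/spf13/pflag"
)

// flagsIncompatibleWithConfigFilePresent sketches the check described above:
// if --file is used, reject any other explicitly set flag rather than trying
// to merge flag and config-file values.
func flagsIncompatibleWithConfigFilePresent(cmd *cobra.Command) error {
	allowed := map[string]bool{
		"file":    true,
		"dry-run": true,
	}
	var incompatible []string
	cmd.Flags().Visit(func(f *pflag.Flag) {
		if !allowed[f.Name] {
			incompatible = append(incompatible, "--"+f.Name)
		}
	})
	if len(incompatible) > 0 {
		return fmt.Errorf("these flags cannot be used together with a config file: %v", incompatible)
	}
	return nil
}
```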
SGTM for this.
@@ -21,22 +23,24 @@ func ValidateResourceQuantity(value string, name string) error {
 	return nil
 }

-func ValidateTPUNodeSelector(numOfHosts int32, nodeSelector map[string]string) error {
+func ValidateTPU(tpu *string, numOfHosts *int32, nodeSelector map[string]string) error {
refactored this function to include the logic for checking the TPU string so the caller doesn't have to
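A rough sketch of what the combined check might look like; the validation rules and GKE node selector keys here are assumptions, not necessarily what the PR implements:

```go
package sketch

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// GKE node selector keys commonly used for TPU slices.
const (
	tpuAcceleratorKey = "cloud.google.com/gke-tpu-accelerator"
	tpuTopologyKey    = "cloud.google.com/gke-tpu-topology"
)

// validateTPU first decides whether TPUs were requested at all, so callers
// no longer have to parse the TPU quantity before calling it.
func validateTPU(tpu *string, numOfHosts *int32, nodeSelector map[string]string) error {
	if tpu == nil || *tpu == "" || *tpu == "0" {
		return nil // no TPUs requested, nothing to validate
	}
	if _, err := resource.ParseQuantity(*tpu); err != nil {
		return fmt.Errorf("invalid TPU quantity %q: %w", *tpu, err)
	}
	if nodeSelector[tpuAcceleratorKey] == "" || nodeSelector[tpuTopologyKey] == "" {
		return fmt.Errorf("requesting TPUs requires the %s and %s node selectors", tpuAcceleratorKey, tpuTopologyKey)
	}
	if numOfHosts != nil && *numOfHosts <= 0 {
		return fmt.Errorf("num-of-hosts must be at least 1 when TPUs are requested")
	}
	return nil
}
```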
@@ -30,23 +30,59 @@ func TestValidateResourceQuantity(t *testing.T) {
 }

 func TestValidateTPUNodeSelector(t *testing.T) {
-	tests := []struct {
+	tests := map[string]struct {
add a name to each test case
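For illustration, the map-keyed, table-driven pattern this refers to, reusing the validateTPU sketch from above:

```go
package sketch

import "testing"

// The map key is the test case name, and t.Run reports it on failure.
func TestValidateTPUSketch(t *testing.T) {
	tests := map[string]struct {
		tpu     string
		wantErr bool
	}{
		"no TPUs requested":               {tpu: "", wantErr: false},
		"TPUs without required selectors": {tpu: "4", wantErr: true},
	}
	for name, tc := range tests {
		t.Run(name, func(t *testing.T) {
			err := validateTPU(&tc.tpu, nil, nil)
			if (err != nil) != tc.wantErr {
				t.Errorf("got err=%v, wantErr=%v", err, tc.wantErr)
			}
		})
	}
}
```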
LGTM!
LGTM!
@davidxia would you mind fixing the conflict?
as described in the [Ray Kubectl Plugin 1.4.0 Wishlist][1].

closes ray-project#3142

Signed-off-by: David Xia <[email protected]>

[1]: https://docs.google.com/document/d/18V5_7SGUzS0LGlaJB7xus04omyr3drKkK4vV6Kz14jY/edit?tab=t.0#heading=h.tuqyq1b6b2x7
for all Pods in RayCluster Signed-off-by: David Xia <[email protected]>
yup, updated
Stamp
as described in the Ray Kubectl Plugin 1.4.0 Wishlist.

Here are the resulting RayCluster YAMLs after running

kubectl ray create cluster dxia-test --file /path/to/file.yaml --dry-run

on various config files.

RayCluster from a minimal config file
RayCluster from a full config file

depends on #3238
closes #3142