Rework distributed inference docs for LWS + RDMA. (#49)
* PR: Distributed Inference Rework + RDMA Docs
- Rework distributed inference docs for LWS.
- Add docs for deploying and using RDMA-connected nodes in the cluster.
- Update docs for deploying blueprints to specific nodes.
* Update deployment documentation for blueprints and multi-node inference
- Refine JSON formatting in the blueprint deployment section for clarity.
- Add a new section on using RDMA with multi-node inference.
- Update terminology from "Kuberay Operator" to "LWS Operator" for consistency.
---------
Co-authored-by: grantneumanoracle <[email protected]>
docs/common_workflows/deploying_blueprints_onto_specific_nodes/README.md
Assumption: the node exists and you are installing OCI AI Blueprints alongside the existing node.

## Label Nodes

If you have existing node pools in your original OKE cluster that you'd like Blueprints to be able to use, follow these steps after the stack is finished:
1. Find the private IP address of the node you'd like to add.
   - Console:
     - Go to the OKE cluster in the console as you did above.
     - Click on "Node pools".
     - Click on the pool containing the node you want to add.
     - Identify the private IP address of the node under "Nodes" on that page.
   - Command line with `kubectl` (assumes cluster access is set up):
     - Run `kubectl get nodes`.
     - Run `kubectl describe node <nodename>` on each node until you find the node you want to add.
     - The private IP appears under the `Name` field in the output of `kubectl get nodes`.
2. Go to the stack and click "Application information". Click the API Url.
3. Log in with the `Admin Username` and `Admin Password` from the Application information tab.
4. Click the link next to "deployment", which takes you to a page with a "Deployment List" and a content box.
5. Paste in the sample blueprint JSON found [here](../../sample_blueprints/add_node_to_control_plane.json).
6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above (see the sketch after these steps).
7. Click "POST". This is a fast operation.
8. Wait about 20 seconds and refresh the page. It should look like:

This will simulate the labels OCI AI Blueprints uses in a shared node pool. If you want to add a second node to that same pool, apply those labels to the next node by following the same process.
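As a concrete illustration of step 6: the only field you change in the linked sample blueprint is `recipe_node_name`. The fragment below is a minimal, hypothetical sketch with a placeholder IP; every other field should be copied unchanged from `add_node_to_control_plane.json`.

```json
{
  "recipe_node_name": "10.0.10.123"
}
```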
### Adding additional labels

To add any additional labels to nodes that you may wish to use later to specify deployment targets, the `recipe_node_labels` field can take an arbitrary number of labels to apply to a given node. For example, in the blueprint JSON you could add the following:
```json
"recipe_node_labels": {
  "key1": "value1",
  "key2": "value2",
  "key3": "value3"
}
```
## Deploy a blueprint

Now that you have artificially created a shared node pool using the node labels above, you can deploy a blueprint onto those nodes.
Note: In the example above, we specified `recipe_nvidia_gpu_count` as 4, which means we want to use 4 of the GPUs on the node.

Note: We set `shared_node_pool_custom_node_selectors` to "a10pool" to match the name of the shared node pool we created with the existing node. Here, we could also include any of the additional labels applied earlier to target specific nodes for work.

Note: We set `recipe_use_shared_node_pool` to true so that the blueprint (previously called a recipe) uses the shared node pool behavior.
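As a rough sketch, these node-targeting fields might look like the fragment below in a blueprint JSON. Showing the selector as a map of label key/value pairs is an assumption; confirm the exact key (the placeholder `example-label-key` below) and the expected value format against the sample blueprints and the labels actually applied to your pool.

```json
{
  "recipe_use_shared_node_pool": true,
  "recipe_nvidia_gpu_count": 4,
  "shared_node_pool_custom_node_selectors": {
    "example-label-key": "a10pool"
  }
}
```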
docs/multi_node_inference/README.md
Use multi-node inference whenever you are trying to use a very large model that does not fit into the GPU memory of a single node.
4. Determine which shapes you have access to and how much GPU memory each instance of that shape has: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm (ex: VM.GPU2.1 has 16 GB of GPU memory per instance). Note that, as of right now, you must use the same shape across the entire node pool when using multi-node inference; mixing shape types within the node pool used for the multi-node inference blueprint is not supported.
5. Divide the total GPU memory needed (from Step #3) by the GPU memory per instance of the shape you chose in Step #4, and round up to the nearest whole number. This is the total number of nodes you will need in your node pool for the given shape and model.
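A hypothetical worked example of the sizing math: if Step #3 says you need roughly 160 GB of GPU memory and you choose VM.GPU.A10.2 (2 A10 GPUs x 24 GB = 48 GB per instance), then 160 / 48 ≈ 3.34, which rounds up to a node pool of 4 nodes for that model and shape.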
## RDMA + Multinode Inference

Want to use RDMA with multinode inference? [See here for details](../deploy_ai_blueprints_onto_hpc_cluster).

## How to use it?

We use [vLLM](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [Ray](https://github.com/ray-project/ray), with the [LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) operator managing state between the nodes.

In order to use multi-node inference in an OCI Blueprint, first deploy a shared node pool with Blueprints using [this recipe](../sample_blueprints/shared_node_pool_A10_VM.json).

Then, use the following blueprint to deploy the serving software: [LINK](../sample_blueprints/multinode_inference_VM_A10.json)

The blueprint creates a LeaderWorkerSet made up of one head node and a set of worker nodes. The head node is identical to the worker nodes (in terms of its ability to run workloads), except that it also runs singleton processes responsible for cluster management.

More documentation on LWS terminology can be found [here](https://lws.sigs.k8s.io/docs/).

## Required Blueprint Parameters

The following parameters are required:
-`"recipe_mode": "service"` -> recipe_mode must be set to `service`
72
54
73
-
-`head_node_cpu_mem_in_gbs` : the amount of CPU memory allocated to the head node (must match `worker_node_cpu_mem_in_gbs`)
55
+
-`"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:ray2430_vllmv083"` -> currently, the only image we have supporting distributed inference.
74
56
75
-
-`num_worker_nodes` : the number of worker nodes you want to deploy (must be equal to `blueprint_node_pool_size` - 1)
57
+
-`recipe_container_port` -> the port to access the inference endpoint
76
58
77
-
-`worker_node_num_cpus` : the number of OCPUs allocated to the head node (must match `head_node_num_cpus`)
59
+
-`deployment_name` -> name of this deployment
78
60
79
-
-`worker_node_num_gpus` : the number of GPUs allocated the head node (must match `head_node_num_gpus`)
61
+
-`recipe_replica_count` -> the number of replicas (copies) of your blueprint.
80
62
81
-
-`worker_node_cpu_mem_in_gbs` : the amount of CPU memory allocated to the head node (must match `head_node_cpu_mem_in_gbs`)
63
+
-`recipe_node_shape` -> OCI name of the Compute shape chosen (use exact names as found here: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm)
82
64
83
-
-[OPTIONAL]`redis_port` : the port to use for Redis inside the cluster (default is 6379)
65
+
-`input_object_storage` (plus the parameters required inside this object). volume_size_in_gbs creates a block volume to store your model, so ensure this is sufficient to hold your model (roughly 1.5x model size).
84
66
85
-
-[OPTIONAL]`dashboard_port` : port on which the Ray dashboard will be available on inside the cluster (default is 8265)
67
+
-`recipe_ephemeral_storage_size` -> size of the attached block volume that will be used to store any ephemeral data (a separate block volume is managed by input_object_storage to house model).
86
68
87
-
-[OPTIONAL]`metrics_export_port`: port where metrics are exposed from inside the cluster (default is 8080)
69
+
-`recipe_nvidia_gpu_count` -> the number of GPUs per node (since head and worker nodes are identical, it is the number of GPUs in the shape you have specified. Ex: VM.GPU.A10.2 would have 2 GPUs)
88
70
89
-
-[OPTIONAL]`rayclient_server_port`: Ray client server port for external connections (default is 10001)
71
+
-`recipe_use_shared_node_pool` -> `true` - currently, multinode inference is only available on shared node pool deployments (for compatibility with RDMA shapes).
90
72
91
-
-[OPTIONAL]`head_image_uri`: Container image for the head node of the ray cluster (default is `vllm/vllm-openai:v0.7.2`)
73
+
-`multinode_num_nodes_to_use_from_shared_pool` -> the total number of nodes (as an integer) you want to use to serve this model. This number must be less than the size of the shared node pool, and will only use schedulable nodes in the pool.
92
74
93
-
-[OPTIONAL]`worker_image_uri`: Container image for the worker nodes of the ray cluster (default is `vllm/vllm-openai:v0.7.2`)
75
+
-[OPTIONAL]`"multinode_rdma_enabled_in_shared_pool": "true"` -> If you have deployed an HPC cluster with RDMA enabled for node pools - [see here for details](../deploy_ai_blueprints_onto_hpc_cluster) - enable RDMA communication between nodes (currently only supported for BM.GPU.H100.8). This will fail validation if RDMA is not supported for shape type, or node is missing appropriate labels described in linked doc.
94
76
95
-
-[OPTIONAL]`rayjob_image_uri`: Container image for the K8s Job that is applied after the head and worker nodes are in ready state (in the future, we will change this to be a RayJob CRD but are using K8s Job for now) (default is `vllm/vllm-openai:v0.7.2`)
77
+
-[OPTIONAL]`recipe_readiness_probe_params` -> Readiness probe to ensure that service is ready to serve requests. Parameter details found [here](../startup_liveness_readiness_probes/README.md).
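To make the parameter list above concrete, here is a minimal, hypothetical blueprint sketch assembled from those fields. All values (port, sizes, node counts, shape, deployment name) are placeholders, and `input_object_storage` is shown with only the `volume_size_in_gbs` field mentioned above; take the full object, exact field names, and value types from the [sample blueprint](../sample_blueprints/multinode_inference_VM_A10.json) rather than from this sketch.

```json
{
  "recipe_mode": "service",
  "deployment_name": "multinode-vllm-demo",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:ray2430_vllmv083",
  "recipe_container_port": "8000",
  "recipe_replica_count": 1,
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_nvidia_gpu_count": 2,
  "recipe_use_shared_node_pool": true,
  "multinode_num_nodes_to_use_from_shared_pool": 2,
  "recipe_ephemeral_storage_size": 100,
  "input_object_storage": {
    "volume_size_in_gbs": 500
  }
}
```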
## Requirements

- **LWS Operator Installed** = Make sure that the LeaderWorkerSet (LWS) operator is installed (this is installed via the Resource Manager). Any OCI AI Blueprints installation from before 4/17/25 will need to be reinstalled via the latest quickstarts release to ensure LWS is installed in your OCI AI Blueprints instance.
- **Same shape for worker and head nodes** = The cluster must be uniform with regard to node shape and size (same shape, number of GPUs, number of CPUs, etc.) for the worker nodes and head node.
- **Chosen shape must have GPUs** = No CPU inferencing is available at the moment.
- We only provide one distributed inference image, which contains vLLM + Ray and some custom launching logic for LWS. Other frameworks may work, but they are untested.
# Quickstart Guide: Multi-Node Inference

Follow these 6 simple steps to deploy multi-node inference using OCI AI Blueprints.
1. **Deploy your shared node pool**
   - Deploy a shared node pool containing at least 2 nodes for inference. Note: existing shared node pools can be used!
   - As a template, follow [this BM.A10](../sample_blueprints/shared_node_pool_A10_BM.json) or [this VM.A10](../sample_blueprints/shared_node_pool_A10_VM.json).
2. **Create Your Deployment Blueprint**
   - Create a JSON configuration (blueprint) that defines your multi-node inference deployment. Key parameters include:
     - `multinode_num_nodes_to_use_from_shared_pool` (number of nodes to use from the pool per replica)
   - Refer to the [sample blueprint for parameter value examples](../sample_blueprints/multinode_inference_VM_A10.json).
   - Refer to the [Required Blueprint Parameters](#required-blueprint-parameters) section for full parameter details.
3. **Deploy the Blueprint via OCI AI Blueprints**
   - Deploy the blueprint JSON via the `deployment` POST API.
4. **Monitor Your Deployment**
   - Check deployment status using OCI AI Blueprints' logs via the `deployment_logs` API endpoint.
5. **Verify Cluster Endpoints**
   - Once deployed, locate your service endpoint:
     - **API Inference Endpoint:** accessible via `https://<deployment_name>.<assigned_service_endpoint>.nip.io`
6. **Start Inference and Scale as Needed**
   - Test your deployment by sending a sample API request:
```bash
curl --request GET --location 'https://<deployment_name>.<assigned_service_endpoint>.nip.io/v1/models'
```
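Once `/v1/models` responds, you can send an OpenAI-style request to the same endpoint (vLLM exposes an OpenAI-compatible API, e.g. `/v1/completions`). The request body below is a hypothetical example; replace `model` with the model name or path actually served by your deployment.

```json
{
  "model": "<model-name-or-path-served-by-your-deployment>",
  "prompt": "What is RDMA and why does multi-node inference benefit from it?",
  "max_tokens": 128,
  "temperature": 0.2
}
```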