
Commit d0520c1

Rework distributed inference docs for LWS + RDMA. (#49)
* PR: Distributed Inference Rework + RDMA Docs
  - Rework distributed inference docs for LWS.
  - Add docs for deploying / using RDMA connected nodes in cluster.
  - Update docs for deploying blueprints to specific nodes.

* Update deployment documentation for blueprints and multi-node inference
  - Refine JSON formatting in the blueprint deployment section for clarity.
  - Add a new section on using RDMA with multi-node inference.
  - Update terminology from "Kuberay Operator" to "LWS Operator" for consistency.

---------

Co-authored-by: grantneumanoracle <[email protected]>
1 parent d61e432 commit d0520c1

11 files changed: +362 -117 lines changed

docs/common_workflows/deploying_blueprints_onto_specific_nodes/README.md

Lines changed: 51 additions & 11 deletions
````diff
@@ -6,16 +6,51 @@ Assumption: the node exists and you are installing OCI AI Blueprints alongside t
 
 ## Label Nodes
 
-As a first step, we will tell OCI AI Blueprints about the node by manually labeling them and turning it in a shared node pool. Make sure to have the node ip address.
+If you have existing node pools in your original OKE cluster that you'd like Blueprints to be able to use, follow these steps after the stack is finished:
 
-Let's pretend I wanted to create the shared node pool named "a100pool". We will use this in the examples going forward.
+1. Find the private IP address of the node you'd like to add.
+   - Console:
+     - Go to the OKE cluster in the console like you did above
+     - Click on "Node pools"
+     - Click on the pool with the node you want to add
+     - Identify the private IP address of the node under "Nodes" on the page.
+   - Command line with `kubectl` (assumes cluster access is set up):
+     - run `kubectl get nodes`
+     - run `kubectl describe node <nodename>` on each node until you find the node you want to add
+     - The private IP appears under the `Name` field of the output of `kubectl get nodes`.
+2. Go to the stack and click "Application information". Click the API Url.
+3. Log in with the `Admin Username` and `Admin Password` in the Application information tab.
+4. Click the link next to "deployment", which will take you to a page with a "Deployment List" and a content box.
+5. Paste in the sample blueprint json found [here](../../sample_blueprints/add_node_to_control_plane.json).
+6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above.
+7. Click "POST". This is a fast operation.
+8. Wait about 20 seconds and refresh the page. It should look like:
 
-```bash
-kubectl label node <node_ip> corrino=a100pool
-kubectl label node <node_ip> corrino/pool-shared-any=true
+```json
+[
+  {
+    "mode": "update",
+    "recipe_id": null,
+    "creation_date": "2025-03-28 11:12 AM UTC",
+    "deployment_uuid": "750a________cc0bfd",
+    "deployment_name": "startupaddnode",
+    "deployment_status": "completed",
+    "deployment_directive": "commission"
+  }
+]
 ```
 
-This will actually simulate the labels OCI AI Blueprints uses in a shared pool. If you want to add a second node to that same pool, you'd just add those labels to the next node following the same process.
+### Adding additional labels
+
+To add any additional labels to nodes that you may wish to use later to specify deployment targets, the `recipe_node_labels` field can take an arbitrary number of labels to apply to a given node. For example, in the blueprint json, you could add the following:
+
+```json
+"recipe_node_labels": {
+  "key1": "value1",
+  "key2": "value2",
+  "key3": "value3"
+}
+```
 
 ## Deploy a blueprint
 
````
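
For reference, once the add-node deployment reports `completed`, you can confirm from the cluster side that the node picked up its labels. This is a minimal sketch, assuming `kubectl` access to the cluster; the node name `10.0.0.12` is a placeholder for the private IP found in step 1:

```bash
# Show nodes with their internal IPs (the Name column is the private IP).
kubectl get nodes -o wide

# Inspect the labels on the node you just added, including any
# recipe_node_labels you supplied (10.0.0.12 is a placeholder).
kubectl get node 10.0.0.12 -o json | jq '.metadata.labels'
```
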

````diff
@@ -25,21 +60,26 @@ Now that you have artifically created a shared node pool using the node labels a
 {
   "recipe_id": "example",
   "recipe_mode": "service",
-  "deployment_name": "a100 deployment",
+  "deployment_name": "a10 deployment",
   "recipe_use_shared_node_pool": true,
-  "recipe_shared_node_pool_selector": "a100pool",
   "recipe_image_uri": "hashicorp/http-echo",
   "recipe_container_command_args": ["-text=corrino"],
   "recipe_container_port": "5678",
-  "recipe_node_shape": "BM.GPU.A100-v2.8",
+  "recipe_node_shape": "BM.GPU.A10.4",
   "recipe_replica_count": 1,
-  "recipe_nvidia_gpu_count": 4
+  "recipe_nvidia_gpu_count": 4,
+  "shared_node_pool_custom_node_selectors": [
+    {
+      "key": "corrino",
+      "value": "a10pool"
+    }
+  ]
 }
 ```
 
 Note: In the example above, we specified `recipe_nvidia_gpu_count` as 4 which means we want to use 4 of the GPUs on the node.
 
-Note: We set `recipe_shared_node_pool_selector` to "a100pool" to match the name of the shared node pool we created with the exisiting node.
+Note: We set `shared_node_pool_custom_node_selectors` to "a10pool" to match the name of the shared node pool we created with the existing node. Here, we could also add any of the additional labels applied earlier to target specific nodes for work.
 
 Note: We set `recipe_use_shared_node_pool` to true so that we are using the shared node mode behavior for the blueprint (previously called recipe).
````
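
If you prefer to submit this blueprint from the command line instead of the browsable "deployment" page, a rough sketch follows. The endpoint path and basic-auth usage are assumptions based on the steps above (the API Url and admin credentials come from the stack's "Application information" tab); adapt them to however your instance authenticates:

```bash
# Assumed values -- replace with the API Url and credentials from
# "Application information" in your stack.
API_URL="https://<your-blueprints-api-url>"

# POST the blueprint JSON (saved locally as blueprint.json) to the
# deployment endpoint. Basic auth here is an assumption, not a guarantee.
curl -s -X POST "$API_URL/deployment" \
  -u "<Admin Username>:<Admin Password>" \
  -H "Content-Type: application/json" \
  -d @blueprint.json
```
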

docs/custom_blueprints/blueprint_json_schema.json

Lines changed: 35 additions & 0 deletions
````diff
@@ -208,6 +208,20 @@
       },
       "additionalProperties": false
     },
+    "recipe_readiness_probe_params": {
+      "type": "object",
+      "properties": {
+        "failure_threshold": { "type": "number" },
+        "endpoint_path": { "type": "string" },
+        "port": { "type": "integer" },
+        "scheme": { "type": "string" },
+        "initial_delay_seconds": { "type": "number" },
+        "period_seconds": { "type": "number" },
+        "success_threshold": { "type": "integer" },
+        "timeout_seconds": { "type": "number" }
+      },
+      "additionalProperties": false
+    },
     "recipe_container_port": {
       "type": "string"
     },
````
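
For illustration, a blueprint that wants a readiness probe could fill these fields in roughly as below. The values and the `/health` path are placeholders chosen for the example, not defaults taken from this schema:

```json
"recipe_readiness_probe_params": {
  "endpoint_path": "/health",
  "port": 5678,
  "scheme": "HTTP",
  "initial_delay_seconds": 20,
  "period_seconds": 10,
  "timeout_seconds": 5,
  "success_threshold": 1,
  "failure_threshold": 3
}
```
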
````diff
@@ -356,6 +370,21 @@
     "shared_node_pool_mig_config": {
       "type": "string"
     },
+    "shared_node_pool_custom_node_selectors": {
+      "type": "array",
+      "items": {
+        "additionalProperties": false,
+        "required": ["key", "value"],
+        "properties": {
+          "key": {
+            "type": "string"
+          },
+          "value": {
+            "type": "string"
+          }
+        }
+      }
+    },
     "mig_resource_request": {
       "type": "string"
     },
@@ -368,6 +397,12 @@
         "type": "string"
       }
     },
+    "multinode_num_nodes_to_use_from_shared_pool": {
+      "type": "integer"
+    },
+    "multinode_rdma_enabled_in_shared_pool": {
+      "type": "boolean"
+    },
     "recipe_node_pool_name": {
       "type": "string"
     },
````

docs/multi_node_inference/README.md

Lines changed: 50 additions & 71 deletions
````diff
@@ -30,124 +30,103 @@ Use multi-node inference whenever you are trying to use a very large model that
 4. Determine which shapes you have access to and how much GPU memory each instance of that shape has: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm (ex: VM.GPU2.1 has 16 GB of GPU memory per instance). Note that as of right now, you must use the same shape across the entire node pool when using multi-node inference. Mix and match of shape types is not supported within the node pool used for the multi-node inference blueprint.
 5. Divide the total GPU memory size needed (from Step #3) by the amount of GPU memory per instance of the shape you chose in Step #4. Round up to the nearest whole number. This will be the total number of nodes you will need in your node pool for the given shape and model.
 
````
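
A quick worked example of the step 5 arithmetic, with made-up numbers: if steps 1-3 put the total GPU memory requirement at roughly 160 GB and each node of the chosen shape offers 96 GB of GPU memory (for example, four 24 GB GPUs), then ceil(160 / 96) = 2 nodes.

```bash
# Ceiling division: total GPU memory needed / GPU memory per node.
# 160 GB needed, 96 GB per node (hypothetical numbers) -> prints 2.
echo $(( (160 + 96 - 1) / 96 ))
```
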

````diff
-## How to use it?
-
-We are using [vLLM](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [KubeRay](https://github.com/ray-project/kuberay?tab=readme-ov-file) which is the Kubernetes operator for [Ray applications](https://github.com/ray-project/ray).
-
-In order to use multi-node inference in an OCI Blueprint, use the following blueprint as a starter: [LINK](../sample_blueprints/multinode_inference_VM_A10.json)
-
-The blueprint creates a RayCluster which is made up of one head node and worker nodes. The head node is identical to other worker nodes (in terms of ability to run workloads on it), except that it also runs singleton processes responsible for cluster management.
-
-More documentation on RayCluster terminology [here](https://docs.ray.io/en/latest/cluster/key-concepts.html#ray-cluster).
-
-## Required Blueprint Parameters
-
-The following parameters are required:
-
-- `"blueprint_mode": "raycluster"` -> blueprint_mode must be set to raycluster
-
-- `blueprint_container_port` -> the port to access the inference endpoint
-
-- `deployment_name` -> name of this deployment
+## RDMA + Multinode Inference
 
-- `blueprint_node_shape` -> OCI name of the Compute shape chosen (use exact names as found here: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm)
+Want to use RDMA with multinode inference? [See here for details](../deploy_ai_blueprints_onto_hpc_cluster)
 
-- `input_object_storage` (plus the parameters required inside this object)
+## How to use it?
 
-- `blueprint_node_pool_size` -> the number of physical nodes to launch (will be equal to `num_worker_nodes` plus 1 for the head node)
+We are using [vLLM](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [Ray](https://github.com/ray-project/ray) with the [LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) operator to manage state between multiple nodes.
 
-- `blueprint_node_boot_volume_size_in_gbs` -> size of boot volume for each node launched in the node pool (make sure it is at least 1.5x the size of your model)
+In order to use multi-node inference in an OCI Blueprint, first deploy a shared node pool with Blueprints using [this recipe](../sample_blueprints/shared_node_pool_A10_VM.json).
 
-- `blueprint_ephemeral_storage_size` -> size of the attached block volume that will be used to store the model for reference by each node (make sure it is at least 1.5x the size of your model)
+Then, use the following blueprint to deploy the serving software: [LINK](../sample_blueprints/multinode_inference_VM_A10.json)
 
-- `blueprint_nvidia_gpu_count` -> the number of GPUs per node (since head and worker nodes are identical, it is the number of GPUs in the shape you have specified. Ex: VM.GPU.A10.2 would have 2 GPUs)
+The blueprint creates a LeaderWorkerSet which is made up of one head node and worker nodes. The head node is identical to other worker nodes (in terms of ability to run workloads on it), except that it also runs singleton processes responsible for cluster management.
 
-- `"blueprint_raycluster_params"` object -> which includes the following properties:
+More documentation on LWS terminology [here](https://lws.sigs.k8s.io/docs/).
 
-  - `model_path_in_container` : the file path to the model in the container
+## Required Blueprint Parameters
 
-  - `head_node_num_cpus` : the number of OCPUs allocated to the head node (must match `worker_node_num_cpus`)
+The following parameters are required:
 
-  - `head_node_num_gpus` : the number of GPUs allocated the head node (must match `worker_node_num_gpus`)
+- `"recipe_mode": "service"` -> recipe_mode must be set to `service`
 
-  - `head_node_cpu_mem_in_gbs` : the amount of CPU memory allocated to the head node (must match `worker_node_cpu_mem_in_gbs`)
+- `"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:ray2430_vllmv083"` -> currently the only image we provide that supports distributed inference.
 
-  - `num_worker_nodes` : the number of worker nodes you want to deploy (must be equal to `blueprint_node_pool_size` - 1)
+- `recipe_container_port` -> the port to access the inference endpoint
 
-  - `worker_node_num_cpus` : the number of OCPUs allocated to the head node (must match `head_node_num_cpus`)
+- `deployment_name` -> name of this deployment
 
-  - `worker_node_num_gpus` : the number of GPUs allocated the head node (must match `head_node_num_gpus`)
+- `recipe_replica_count` -> the number of replicas (copies) of your blueprint.
 
-  - `worker_node_cpu_mem_in_gbs` : the amount of CPU memory allocated to the head node (must match `head_node_cpu_mem_in_gbs`)
+- `recipe_node_shape` -> OCI name of the Compute shape chosen (use exact names as found here: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm)
 
-  - [OPTIONAL] `redis_port` : the port to use for Redis inside the cluster (default is 6379)
+- `input_object_storage` (plus the parameters required inside this object). `volume_size_in_gbs` creates a block volume to store your model, so ensure this is sufficient to hold your model (roughly 1.5x model size).
 
-  - [OPTIONAL] `dashboard_port` : port on which the Ray dashboard will be available on inside the cluster (default is 8265)
+- `recipe_ephemeral_storage_size` -> size of the attached block volume that will be used to store any ephemeral data (a separate block volume is managed by `input_object_storage` to house the model).
 
-  - [OPTIONAL] `metrics_export_port`: port where metrics are exposed from inside the cluster (default is 8080)
+- `recipe_nvidia_gpu_count` -> the number of GPUs per node (since head and worker nodes are identical, it is the number of GPUs in the shape you have specified. Ex: VM.GPU.A10.2 would have 2 GPUs)
 
-  - [OPTIONAL] `rayclient_server_port`: Ray client server port for external connections (default is 10001)
+- `recipe_use_shared_node_pool` -> must be `true`; currently, multinode inference is only available on shared node pool deployments (for compatibility with RDMA shapes).
 
-  - [OPTIONAL] `head_image_uri`: Container image for the head node of the ray cluster (default is `vllm/vllm-openai:v0.7.2`)
+- `multinode_num_nodes_to_use_from_shared_pool` -> the total number of nodes (as an integer) you want to use to serve this model. This number must be less than the size of the shared node pool, and only schedulable nodes in the pool will be used.
 
-  - [OPTIONAL] `worker_image_uri`: Container image for the worker nodes of the ray cluster (default is `vllm/vllm-openai:v0.7.2`)
+- [OPTIONAL] `"multinode_rdma_enabled_in_shared_pool": "true"` -> If you have deployed an HPC cluster with RDMA enabled for node pools - [see here for details](../deploy_ai_blueprints_onto_hpc_cluster) - this enables RDMA communication between nodes (currently only supported for BM.GPU.H100.8). Validation will fail if RDMA is not supported for the shape type, or if the node is missing the appropriate labels described in the linked doc.
 
-  - [OPTIONAL] `rayjob_image_uri`: Container image for the K8s Job that is applied after the head and worker nodes are in ready state (in the future, we will change this to be a RayJob CRD but are using K8s Job for now) (default is `vllm/vllm-openai:v0.7.2`)
+- [OPTIONAL] `recipe_readiness_probe_params` -> Readiness probe to ensure that the service is ready to serve requests. Parameter details found [here](../startup_liveness_readiness_probes/README.md).
 
````
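
To make the parameter list concrete, here is a rough sketch of how the scalar fields might fit together in a single blueprint. Every value is an illustrative placeholder (a 2-GPU A10 VM shape spread across 2 nodes of the shared pool); `input_object_storage` and `recipe_ephemeral_storage_size` are omitted because their exact sub-fields are best copied from the linked sample blueprint:

```json
{
  "recipe_mode": "service",
  "deployment_name": "multinode-vllm-example",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:ray2430_vllmv083",
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_container_port": "8000",
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_use_shared_node_pool": true,
  "multinode_num_nodes_to_use_from_shared_pool": 2
}
```
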

````diff
 ## Requirements
 
-- **Kuberay Operator Installed** = Make sure that the kuberay operator is installed (this is installed via the Resource Manager if the Kuberay option is selected - default is selected). Any OCI AI Blueprints installation before 2/24/25 will need to be reinstalled via the latest quickstarts release in order to ensure Kuberay is installed in your OCI AI Blueprints instance.
+- **LWS Operator Installed** = Make sure that the LeaderWorkerSet (LWS) operator is installed (this is installed via the Resource Manager). Any OCI AI Blueprints installation before 4/17/25 will need to be reinstalled via the latest quickstarts release in order to ensure LWS is installed in your OCI AI Blueprints instance.
 
 - **Same shape for worker and head nodes** = Cluster must be uniform in regards to node shape and size (same shape, number of GPUs, number of CPUs etc.) for the worker nodes and head nodes.
 
 - **Chosen shape must have GPUs** = no CPU inferencing is available at the moment
 
-- Only job supported right now using Ray cluster and OCI Blueprints is vLLM Distributed Inference. This will change in the future.
-
-- All nodes in the multi-node inferencing blueprint's node pool will be allocated to Ray (subject to change). You cannot assign just a portion; the entire node pool is reserved for the Ray cluster.
-
-## Interacting with Ray Cluster
-
-Once the multi-node inference blueprint has been successfully deployed, you will have access to the following URLs:
-
-1. **Ray Dashboard:** Ray provides a web-based dashboard for monitoring and debugging Ray applications. The visual representation of the system state, allows users to track the performance of applications and troubleshoot issues.
-   **To find the URL for the API Inference Endpoint:** Go to `workspace` API endpoint and the URL will be under "blueprints" object. The object will be labeled `<deployment_name>-raycluster-dashboard`. The format for the URL is `<deployment_name>.<assigned_service_endpoint>.com`
-   **Example URL:** `https://dashboard.rayclustervmtest10.132-226-50-64.nip.io`
-
-2. **API Inference Endpoint:** This is the API endpoint you will use to do inferencing across the multiple nodes. It follows the [OpenAI API spec](https://platform.openai.com/docs/api-reference/introduction)
-   **To find the URL for the API Inference Endpoint:** Go to `workspace` API endpoint and the URL will be under "recipes" object. The object will be labeled `<deployment_name>-raycluster-app`. The format for the URL is `<deployment_name>.<assigned_service_endpoint>.com`
-   **Example curl command:** `curl --request GET --location 'rayclustervmtest10.132-226-50-64.nip.io/v1/models'`
+- We only provide one distributed inference image, which contains vLLM + Ray and some custom launching with LWS. It is possible that other frameworks work, but they are untested.
````
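
A quick way to sanity-check the first requirement from inside the cluster is to look for the LeaderWorkerSet CRD and a running LWS controller. This is a generic `kubectl` sketch; the grep patterns are heuristics, not exact names taken from this repo:

```bash
# The LWS operator installs a LeaderWorkerSet CRD and a controller pod.
kubectl get crd | grep -i leaderworkerset
kubectl get pods -A | grep -i lws
```
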

````diff
 # Quickstart Guide: Multi-Node Inference
 
-Follow these 6 simple steps to deploy your multi-node RayCluster using OCI AI Blueprints.
+Follow these 6 simple steps to deploy multi-node inference using OCI AI Blueprints.
 
-1. **Create Your Deployment Blueprint**
+1. **Deploy your shared node pool**
+   - Deploy a shared node pool containing at least 2 nodes for inference. Note: existing shared node pools can be used!
+   - As a template, follow [this BM.A10](../sample_blueprints/shared_node_pool_A10_BM.json) or [this VM.A10](../sample_blueprints/shared_node_pool_A10_VM.json).
+2. **Create Your Deployment Blueprint**
    - Create a JSON configuration (blueprint) that defines your RayCluster. Key parameters include:
-     - `"recipe_mode": "raycluster"`
+     - `"recipe_mode": "service"`
      - `deployment_name`, `recipe_node_shape`, `recipe_container_port`
      - `input_object_storage` (and its required parameters)
-     - `recipe_node_pool_size` (head node + worker nodes)
      - `recipe_nvidia_gpu_count` (GPUs per node)
-     - A nested `"recipe_raycluster_params"` object with properties like `model_path_in_container`, `head_node_num_cpus`, `head_node_num_gpus`, `head_node_cpu_mem_in_gbs`, `num_worker_nodes`, etc.
+     - `multinode_num_nodes_to_use_from_shared_pool` (number of nodes to use from the pool per replica)
    - Refer to the [sample blueprint for parameter value examples](../sample_blueprints/multinode_inference_VM_A10.json)
    - Refer to the [Required Blueprint Parameters](#Required_Blueprint_Parameters) section for full parameter details.
-   - Ensure that the head and worker nodes are provisioned uniformly, as required by the cluster's configuration.
-2. **Deploy the Blueprint via OCI AI Blueprints**
+3. **Deploy the Blueprint via OCI AI Blueprints**
    - Deploy the blueprint json via the `deployment` POST API
-3. **Monitor Your Deployment**
+4. **Monitor Your Deployment**
    - Check deployment status using OCI AI Blueprint's logs via the `deployment_logs` API endpoint
-4. **Verify Cluster Endpoints**
+5. **Verify Cluster Endpoints**
 
    - Once deployed, locate your service endpoints:
-     - **Ray Dashboard:** Typically available at `https://dashboard.<deployment_name>.<assigned_service_endpoint>.com`
-     - **API Inference Endpoint:** Accessible via `https://<deployment_name>.<assigned_service_endpoint>.com`
-     - Use these URLs to confirm that the cluster is running and ready to handle inference requests.
+     - **API Inference Endpoint:** Accessible via `https://<deployment_name>.<assigned_service_endpoint>.nip.io`
+
+6. **Start Inference and Scale as Needed**
 
-5. **Start Inference and Scale as Needed**
    - Test your deployment by sending a sample API request:
+
 ```bash
-curl --request GET --location 'https://dashboard.<deployment_name>.<assigned_service_endpoint>.com/v1/models'
+curl -L 'https://<deployment_name>.<assigned_service_endpoint>.nip.io/metrics'
+...
+curl -L https://<deployment_name>.<assigned_service_endpoint>.nip.io/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "/models",
+    "prompt": "San Francisco is a",
+    "max_tokens": 512,
+    "temperature": 0
+  }' | jq
+
 ```
 
 Happy deploying!
````
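
One extra smoke test you can run: because the server exposes an OpenAI-compatible API (vLLM's `/v1/models` route, not a route shown in the quickstart above), listing the served models is a cheap way to confirm the deployment is up. The hostname placeholders match the examples above:

```bash
# List the models the endpoint is serving (OpenAI-compatible route).
curl -sL 'https://<deployment_name>.<assigned_service_endpoint>.nip.io/v1/models' | jq
```
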
