
Commit d0520c1

Rework distributed inference docs for LWS + RDMA. (#49)
* PR: Distributed Inference Rework + RDMA Docs
  - Rework distributed inference docs for LWS.
  - Add docs for deploying / using RDMA connected nodes in cluster.
  - Update docs for deploying blueprints to specific nodes.

* Update deployment documentation for blueprints and multi-node inference
  - Refine JSON formatting in the blueprint deployment section for clarity.
  - Add a new section on using RDMA with multi-node inference.
  - Update terminology from "Kuberay Operator" to "LWS Operator" for consistency.

---------

Co-authored-by: grantneumanoracle <[email protected]>
1 parent d61e432 commit d0520c1

11 files changed: +362 -117 lines changed

docs/common_workflows/deploying_blueprints_onto_specific_nodes/README.md

Lines changed: 51 additions & 11 deletions
````diff
@@ -6,16 +6,51 @@ Assumption: the node exists and you are installing OCI AI Blueprints alongside t
 
 ## Label Nodes
 
-As a first step, we will tell OCI AI Blueprints about the node by manually labeling them and turning it in a shared node pool. Make sure to have the node ip address.
+If you have existing node pools in your original OKE cluster that you'd like Blueprints to be able to use, follow these steps after the stack is finished:
 
-Let's pretend I wanted to create the shared node pool named "a100pool". We will use this in the examples going forward.
+1. Find the private IP address of the node you'd like to add.
+   - Console:
+     - Go to the OKE cluster in the console like you did above
+     - Click on "Node pools"
+     - Click on the pool with the node you want to add
+     - Identify the private IP address of the node under "Nodes" on the page.
+   - Command line with `kubectl` (assumes cluster access is set up):
+     - run `kubectl get nodes`
+     - run `kubectl describe node <nodename>` on each node until you find the node you want to add
+     - The private IP appears under the `Name` field of the output of `kubectl get nodes`.
+2. Go to the stack and click "Application information". Click the API Url.
+3. Log in with the `Admin Username` and `Admin Password` in the Application information tab.
+4. Click the link next to "deployment", which will take you to a page with a "Deployment List" and a content box.
+5. Paste in the sample blueprint json found [here](../../sample_blueprints/add_node_to_control_plane.json).
+6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above.
+7. Click "POST". This is a fast operation.
+8. Wait about 20 seconds and refresh the page. It should look like:
 
-```bash
-kubectl label node <node_ip> corrino=a100pool
-kubectl label node <node_ip> corrino/pool-shared-any=true
+```json
+[
+  {
+    "mode": "update",
+    "recipe_id": null,
+    "creation_date": "2025-03-28 11:12 AM UTC",
+    "deployment_uuid": "750a________cc0bfd",
+    "deployment_name": "startupaddnode",
+    "deployment_status": "completed",
+    "deployment_directive": "commission"
+  }
+]
 ```
 
-This will actually simulate the labels OCI AI Blueprints uses in a shared pool. If you want to add a second node to that same pool, you'd just add those labels to the next node following the same process.
+### Adding additional labels
+
+To add any additional labels to nodes that you may wish to use later to specify deployment targets, the `recipe_node_labels` field can take an arbitrary number of labels to apply to a given node. For example, in the blueprint json, you could add the following:
+
+```json
+"recipe_node_labels": {
+  "key1": "value1",
+  "key2": "value2",
+  "key3": "value3"
+}
+```
 
 ## Deploy a blueprint
 
````
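
For reference, once the add-node deployment reports `completed`, you can confirm from the cluster side that the node picked up its labels. This is a minimal sketch, assuming `kubectl` access to the cluster; the node name `10.0.0.12` is a placeholder for the private IP found in step 1:

```bash
# Show nodes with their internal IPs (the Name column is the private IP).
kubectl get nodes -o wide

# Inspect the labels on the node you just added, including any
# recipe_node_labels you supplied (10.0.0.12 is a placeholder).
kubectl get node 10.0.0.12 -o json | jq '.metadata.labels'
```
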

````diff
@@ -25,21 +60,26 @@ Now that you have artifically created a shared node pool using the node labels a
 {
   "recipe_id": "example",
   "recipe_mode": "service",
-  "deployment_name": "a100 deployment",
+  "deployment_name": "a10 deployment",
   "recipe_use_shared_node_pool": true,
-  "recipe_shared_node_pool_selector": "a100pool",
   "recipe_image_uri": "hashicorp/http-echo",
   "recipe_container_command_args": ["-text=corrino"],
   "recipe_container_port": "5678",
-  "recipe_node_shape": "BM.GPU.A100-v2.8",
+  "recipe_node_shape": "BM.GPU.A10.4",
   "recipe_replica_count": 1,
-  "recipe_nvidia_gpu_count": 4
+  "recipe_nvidia_gpu_count": 4,
+  "shared_node_pool_custom_node_selectors": [
+    {
+      "key": "corrino",
+      "value": "a10pool"
+    }
+  ]
 }
 ```
 
 Note: In the example above, we specified `recipe_nvidia_gpu_count` as 4 which means we want to use 4 of the GPUs on the node.
 
-Note: We set `recipe_shared_node_pool_selector` to "a100pool" to match the name of the shared node pool we created with the exisiting node.
+Note: We set `shared_node_pool_custom_node_selectors` to "a10pool" to match the name of the shared node pool we created with the existing node. Here, we could also add any of the additional labels applied earlier to target specific nodes for work.
 
 Note: We set `recipe_use_shared_node_pool` to true so that we are using the shared node mode behavior for the blueprint (previously called recipe).
````
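
If you prefer to submit this blueprint from the command line instead of the browsable "deployment" page, a rough sketch follows. The endpoint path and basic-auth usage are assumptions based on the steps above (the API Url and admin credentials come from the stack's "Application information" tab); adapt them to however your instance authenticates:

```bash
# Assumed values -- replace with the API Url and credentials from
# "Application information" in your stack.
API_URL="https://<your-blueprints-api-url>"

# POST the blueprint JSON (saved locally as blueprint.json) to the
# deployment endpoint. Basic auth here is an assumption, not a guarantee.
curl -s -X POST "$API_URL/deployment" \
  -u "<Admin Username>:<Admin Password>" \
  -H "Content-Type: application/json" \
  -d @blueprint.json
```
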

docs/custom_blueprints/blueprint_json_schema.json

Lines changed: 35 additions & 0 deletions
````diff
@@ -208,6 +208,20 @@
       },
       "additionalProperties": false
     },
+    "recipe_readiness_probe_params": {
+      "type": "object",
+      "properties": {
+        "failure_threshold": { "type": "number" },
+        "endpoint_path": { "type": "string" },
+        "port": { "type": "integer" },
+        "scheme": { "type": "string" },
+        "initial_delay_seconds": { "type": "number" },
+        "period_seconds": { "type": "number" },
+        "success_threshold": { "type": "integer" },
+        "timeout_seconds": { "type": "number" }
+      },
+      "additionalProperties": false
+    },
     "recipe_container_port": {
       "type": "string"
     },
````
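
For illustration, a blueprint that wants a readiness probe could fill these fields in roughly as below. The values and the `/health` path are placeholders chosen for the example, not defaults taken from this schema:

```json
"recipe_readiness_probe_params": {
  "endpoint_path": "/health",
  "port": 5678,
  "scheme": "HTTP",
  "initial_delay_seconds": 20,
  "period_seconds": 10,
  "timeout_seconds": 5,
  "success_threshold": 1,
  "failure_threshold": 3
}
```
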
````diff
@@ -356,6 +370,21 @@
     "shared_node_pool_mig_config": {
       "type": "string"
     },
+    "shared_node_pool_custom_node_selectors": {
+      "type": "array",
+      "items": {
+        "additionalProperties": false,
+        "required": ["key", "value"],
+        "properties": {
+          "key": {
+            "type": "string"
+          },
+          "value": {
+            "type": "string"
+          }
+        }
+      }
+    },
     "mig_resource_request": {
       "type": "string"
     },
@@ -368,6 +397,12 @@
         "type": "string"
       }
     },
+    "multinode_num_nodes_to_use_from_shared_pool": {
+      "type": "integer"
+    },
+    "multinode_rdma_enabled_in_shared_pool": {
+      "type": "boolean"
+    },
     "recipe_node_pool_name": {
       "type": "string"
     },
````

docs/multi_node_inference/README.md

Lines changed: 50 additions & 71 deletions
````diff
@@ -30,124 +30,103 @@ Use multi-node inference whenever you are trying to use a very large model that
 4. Determine which shapes you have access to and how much GPU memory each instance of that shape has: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm (ex: VM.GPU2.1 has 16 GB of GPU memory per instance). Note that as of right now, you must use the same shape across the entire node pool when using multi-node inference. Mix and match of shape types is not supported within the node pool used for the multi-node inference blueprint.
 5. Divide the total GPU memory size needed (from Step #3) by the amount of GPU memory per instance of the shape you chose in Step #4. Round up to the nearest whole number. This will be the total number of nodes you will need in your node pool for the given shape and model.
 
````
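
A quick worked example of the step 5 arithmetic, with made-up numbers: if steps 1-3 put the total GPU memory requirement at roughly 160 GB and each node of the chosen shape offers 96 GB of GPU memory (for example, four 24 GB GPUs), then ceil(160 / 96) = 2 nodes.

```bash
# Ceiling division: total GPU memory needed / GPU memory per node.
# 160 GB needed, 96 GB per node (hypothetical numbers) -> prints 2.
echo $(( (160 + 96 - 1) / 96 ))
```
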

````diff
-## How to use it?
-
-We are using [vLLM](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [KubeRay](https://github.com/ray-project/kuberay?tab=readme-ov-file) which is the Kubernetes operator for [Ray applications](https://github.com/ray-project/ray).
-
-In order to use multi-node inference in an OCI Blueprint, use the following blueprint as a starter: [LINK](../sample_blueprints/multinode_inference_VM_A10.json)
-
-The blueprint creates a RayCluster which is made up of one head node and worker nodes. The head node is identical to other worker nodes (in terms of ability to run workloads on it), except that it also runs singleton processes responsible for cluster management.
-
-More documentation on RayCluster terminology [here](https://docs.ray.io/en/latest/cluster/key-concepts.html#ray-cluster).
-
-## Required Blueprint Parameters
-
-The following parameters are required:
-
-- `"blueprint_mode": "raycluster"` -> blueprint_mode must be set to raycluster
-
-- `blueprint_container_port` -> the port to access the inference endpoint
-
-- `deployment_name` -> name of this deployment
+## RDMA + Multinode Inference
 
-- `blueprint_node_shape` -> OCI name of the Compute shape chosen (use exact names as found here: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm)
+Want to use RDMA with multinode inference? [See here for details](../deploy_ai_blueprints_onto_hpc_cluster)
 
-- `input_object_storage` (plus the parameters required inside this object)
+## How to use it?
 
-- `blueprint_node_pool_size` -> the number of physical nodes to launch (will be equal to `num_worker_nodes` plus 1 for the head node)
+We are using [vLLM](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) and [Ray](https://github.com/ray-project/ray) with the [LeaderWorkerSet (LWS)](https://github.com/kubernetes-sigs/lws) operator to manage state between multiple nodes.
 
-- `blueprint_node_boot_volume_size_in_gbs` -> size of boot volume for each node launched in the node pool (make sure it is at least 1.5x the size of your model)
+In order to use multi-node inference in an OCI Blueprint, first deploy a shared node pool with Blueprints using [this recipe](../sample_blueprints/shared_node_pool_A10_VM.json).
 
-- `blueprint_ephemeral_storage_size` -> size of the attached block volume that will be used to store the model for reference by each node (make sure it is at least 1.5x the size of your model)
+Then, use the following blueprint to deploy the serving software: [LINK](../sample_blueprints/multinode_inference_VM_A10.json)
 
-- `blueprint_nvidia_gpu_count` -> the number of GPUs per node (since head and worker nodes are identical, it is the number of GPUs in the shape you have specified. Ex: VM.GPU.A10.2 would have 2 GPUs)
+The blueprint creates a LeaderWorkerSet which is made up of one head node and worker nodes. The head node is identical to other worker nodes (in terms of ability to run workloads on it), except that it also runs singleton processes responsible for cluster management.
 
-- `"blueprint_raycluster_params"` object -> which includes the following properties:
+More documentation on LWS terminology [here](https://lws.sigs.k8s.io/docs/).
 
-  - `model_path_in_container` : the file path to the model in the container
+## Required Blueprint Parameters
 
-  - `head_node_num_cpus` : the number of OCPUs allocated to the head node (must match `worker_node_num_cpus`)
+The following parameters are required:
 
-  - `head_node_num_gpus` : the number of GPUs allocated the head node (must match `worker_node_num_gpus`)
+- `"recipe_mode": "service"` -> recipe_mode must be set to `service`
 
-  - `head_node_cpu_mem_in_gbs` : the amount of CPU memory allocated to the head node (must match `worker_node_cpu_mem_in_gbs`)
+- `"recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:ray2430_vllmv083"` -> currently the only image we provide that supports distributed inference.
 
-  - `num_worker_nodes` : the number of worker nodes you want to deploy (must be equal to `blueprint_node_pool_size` - 1)
+- `recipe_container_port` -> the port to access the inference endpoint
 
-  - `worker_node_num_cpus` : the number of OCPUs allocated to the head node (must match `head_node_num_cpus`)
+- `deployment_name` -> name of this deployment
 
-  - `worker_node_num_gpus` : the number of GPUs allocated the head node (must match `head_node_num_gpus`)
+- `recipe_replica_count` -> the number of replicas (copies) of your blueprint.
 
-  - `worker_node_cpu_mem_in_gbs` : the amount of CPU memory allocated to the head node (must match `head_node_cpu_mem_in_gbs`)
+- `recipe_node_shape` -> OCI name of the Compute shape chosen (use exact names as found here: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm)
 
-  - [OPTIONAL] `redis_port` : the port to use for Redis inside the cluster (default is 6379)
+- `input_object_storage` (plus the parameters required inside this object). `volume_size_in_gbs` creates a block volume to store your model, so ensure this is sufficient to hold your model (roughly 1.5x model size).
 
-  - [OPTIONAL] `dashboard_port` : port on which the Ray dashboard will be available on inside the cluster (default is 8265)
+- `recipe_ephemeral_storage_size` -> size of the attached block volume that will be used to store any ephemeral data (a separate block volume is managed by `input_object_storage` to house the model).
 
-  - [OPTIONAL] `metrics_export_port`: port where metrics are exposed from inside the cluster (default is 8080)
+- `recipe_nvidia_gpu_count` -> the number of GPUs per node (since head and worker nodes are identical, it is the number of GPUs in the shape you have specified. Ex: VM.GPU.A10.2 would have 2 GPUs)
 
-  - [OPTIONAL] `rayclient_server_port`: Ray client server port for external connections (default is 10001)
+- `recipe_use_shared_node_pool` -> must be `true`; currently, multinode inference is only available on shared node pool deployments (for compatibility with RDMA shapes).
 
-  - [OPTIONAL] `head_image_uri`: Container image for the head node of the ray cluster (default is `vllm/vllm-openai:v0.7.2`)
+- `multinode_num_nodes_to_use_from_shared_pool` -> the total number of nodes (as an integer) you want to use to serve this model. This number must be less than the size of the shared node pool, and only schedulable nodes in the pool will be used.
 
-  - [OPTIONAL] `worker_image_uri`: Container image for the worker nodes of the ray cluster (default is `vllm/vllm-openai:v0.7.2`)
+- [OPTIONAL] `"multinode_rdma_enabled_in_shared_pool": "true"` -> If you have deployed an HPC cluster with RDMA enabled for node pools - [see here for details](../deploy_ai_blueprints_onto_hpc_cluster) - this enables RDMA communication between nodes (currently only supported for BM.GPU.H100.8). Validation will fail if RDMA is not supported for the shape type, or if the node is missing the appropriate labels described in the linked doc.
 
-  - [OPTIONAL] `rayjob_image_uri`: Container image for the K8s Job that is applied after the head and worker nodes are in ready state (in the future, we will change this to be a RayJob CRD but are using K8s Job for now) (default is `vllm/vllm-openai:v0.7.2`)
+- [OPTIONAL] `recipe_readiness_probe_params` -> Readiness probe to ensure that the service is ready to serve requests. Parameter details found [here](../startup_liveness_readiness_probes/README.md).
 
````
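
To make the parameter list concrete, here is a rough sketch of how the scalar fields might fit together in a single blueprint. Every value is an illustrative placeholder (a 2-GPU A10 VM shape spread across 2 nodes of the shared pool); `input_object_storage` and `recipe_ephemeral_storage_size` are omitted because their exact sub-fields are best copied from the linked sample blueprint:

```json
{
  "recipe_mode": "service",
  "deployment_name": "multinode-vllm-example",
  "recipe_image_uri": "iad.ocir.io/iduyx1qnmway/corrino-devops-repository:ray2430_vllmv083",
  "recipe_node_shape": "VM.GPU.A10.2",
  "recipe_container_port": "8000",
  "recipe_replica_count": 1,
  "recipe_nvidia_gpu_count": 2,
  "recipe_use_shared_node_pool": true,
  "multinode_num_nodes_to_use_from_shared_pool": 2
}
```
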

````diff
 ## Requirements
 
-- **Kuberay Operator Installed** = Make sure that the kuberay operator is installed (this is installed via the Resource Manager if the Kuberay option is selected - default is selected). Any OCI AI Blueprints installation before 2/24/25 will need to be reinstalled via the latest quickstarts release in order to ensure Kuberay is installed in your OCI AI Blueprints instance.
+- **LWS Operator Installed** = Make sure that the LeaderWorkerSet (LWS) operator is installed (this is installed via the Resource Manager). Any OCI AI Blueprints installation before 4/17/25 will need to be reinstalled via the latest quickstarts release in order to ensure LWS is installed in your OCI AI Blueprints instance.
 
 - **Same shape for worker and head nodes** = Cluster must be uniform in regards to node shape and size (same shape, number of GPUs, number of CPUs etc.) for the worker nodes and head nodes.
 
 - **Chosen shape must have GPUs** = no CPU inferencing is available at the moment
 
-- Only job supported right now using Ray cluster and OCI Blueprints is vLLM Distributed Inference. This will change in the future.
-
-- All nodes in the multi-node inferencing blueprint's node pool will be allocated to Ray (subject to change). You cannot assign just a portion; the entire node pool is reserved for the Ray cluster.
-
-## Interacting with Ray Cluster
-
-Once the multi-node inference blueprint has been successfully deployed, you will have access to the following URLs:
-
-1. **Ray Dashboard:** Ray provides a web-based dashboard for monitoring and debugging Ray applications. The visual representation of the system state, allows users to track the performance of applications and troubleshoot issues.
-   **To find the URL for the API Inference Endpoint:** Go to `workspace` API endpoint and the URL will be under "blueprints" object. The object will be labeled `<deployment_name>-raycluster-dashboard`. The format for the URL is `<deployment_name>.<assigned_service_endpoint>.com`
-   **Example URL:** `https://dashboard.rayclustervmtest10.132-226-50-64.nip.io`
-
-2. **API Inference Endpoint:** This is the API endpoint you will use to do inferencing across the multiple nodes. It follows the [OpenAI API spec](https://platform.openai.com/docs/api-reference/introduction)
-   **To find the URL for the API Inference Endpoint:** Go to `workspace` API endpoint and the URL will be under "recipes" object. The object will be labeled `<deployment_name>-raycluster-app`. The format for the URL is `<deployment_name>.<assigned_service_endpoint>.com`
-   **Example curl command:** `curl --request GET --location 'rayclustervmtest10.132-226-50-64.nip.io/v1/models'`
+- We only provide one distributed inference image, which contains vLLM + Ray and some custom launching with LWS. It is possible that other frameworks work, but they are untested.
````
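
A quick way to sanity-check the first requirement from inside the cluster is to look for the LeaderWorkerSet CRD and a running LWS controller. This is a generic `kubectl` sketch; the grep patterns are heuristics, not exact names taken from this repo:

```bash
# The LWS operator installs a LeaderWorkerSet CRD and a controller pod.
kubectl get crd | grep -i leaderworkerset
kubectl get pods -A | grep -i lws
```
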

````diff
 # Quickstart Guide: Multi-Node Inference
 
-Follow these 6 simple steps to deploy your multi-node RayCluster using OCI AI Blueprints.
+Follow these 6 simple steps to deploy multi-node inference using OCI AI Blueprints.
 
-1. **Create Your Deployment Blueprint**
+1. **Deploy your shared node pool**
+   - Deploy a shared node pool containing at least 2 nodes for inference. Note: existing shared node pools can be used!
+   - As a template, follow [this BM.A10](../sample_blueprints/shared_node_pool_A10_BM.json) or [this VM.A10](../sample_blueprints/shared_node_pool_A10_VM.json).
+2. **Create Your Deployment Blueprint**
    - Create a JSON configuration (blueprint) that defines your RayCluster. Key parameters include:
-     - `"recipe_mode": "raycluster"`
+     - `"recipe_mode": "service"`
      - `deployment_name`, `recipe_node_shape`, `recipe_container_port`
      - `input_object_storage` (and its required parameters)
-     - `recipe_node_pool_size` (head node + worker nodes)
      - `recipe_nvidia_gpu_count` (GPUs per node)
-     - A nested `"recipe_raycluster_params"` object with properties like `model_path_in_container`, `head_node_num_cpus`, `head_node_num_gpus`, `head_node_cpu_mem_in_gbs`, `num_worker_nodes`, etc.
+     - `multinode_num_nodes_to_use_from_shared_pool` (number of nodes to use from the pool per replica)
    - Refer to the [sample blueprint for parameter value examples](../sample_blueprints/multinode_inference_VM_A10.json)
    - Refer to the [Required Blueprint Parameters](#Required_Blueprint_Parameters) section for full parameter details.
-   - Ensure that the head and worker nodes are provisioned uniformly, as required by the cluster's configuration.
-2. **Deploy the Blueprint via OCI AI Blueprints**
+3. **Deploy the Blueprint via OCI AI Blueprints**
    - Deploy the blueprint json via the `deployment` POST API
-3. **Monitor Your Deployment**
+4. **Monitor Your Deployment**
    - Check deployment status using OCI AI Blueprint's logs via the `deployment_logs` API endpoint
-4. **Verify Cluster Endpoints**
+5. **Verify Cluster Endpoints**
 
    - Once deployed, locate your service endpoints:
-     - **Ray Dashboard:** Typically available at `https://dashboard.<deployment_name>.<assigned_service_endpoint>.com`
-     - **API Inference Endpoint:** Accessible via `https://<deployment_name>.<assigned_service_endpoint>.com`
-     - Use these URLs to confirm that the cluster is running and ready to handle inference requests.
+     - **API Inference Endpoint:** Accessible via `https://<deployment_name>.<assigned_service_endpoint>.nip.io`
+
+6. **Start Inference and Scale as Needed**
 
-5. **Start Inference and Scale as Needed**
    - Test your deployment by sending a sample API request:
+
 ```bash
-curl --request GET --location 'https://dashboard.<deployment_name>.<assigned_service_endpoint>.com/v1/models'
+curl -L 'https://<deployment_name>.<assigned_service_endpoint>.nip.io/metrics'
+...
+curl -L https://<deployment_name>.<assigned_service_endpoint>.nip.io/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "/models",
+    "prompt": "San Francisco is a",
+    "max_tokens": 512,
+    "temperature": 0
+  }' | jq
+
 ```
 
 Happy deploying!
````
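
One extra smoke test you can run: because the server exposes an OpenAI-compatible API (vLLM's `/v1/models` route, not a route shown in the quickstart above), listing the served models is a cheap way to confirm the deployment is up. The hostname placeholders match the examples above:

```bash
# List the models the endpoint is serving (OpenAI-compatible route).
curl -sL 'https://<deployment_name>.<assigned_service_endpoint>.nip.io/v1/models' | jq
```
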
