23 changes: 5 additions & 18 deletions docs/en/installation/ai-cluster.mdx
Original file line number Diff line number Diff line change
@@ -234,7 +234,7 @@ Once **Knative Operator** is installed, you need to create the `KnativeServing`
spec:
  # For ACP 4.0, use version 1.18.1
  # For ACP 4.1 and above, use version 1.19.6
  version: "1.18.1" # [!code callout]
  version: "1.19.6" # [!code callout]
  config:
    deployment:
      registries-skipping-tag-resolving: kind.local,ko.local,dev.local,private-registry # [!code callout]
@@ -343,10 +343,10 @@ In **Administrator** view:

13. Under **Model Catalog** section, configure the following parameters:

- **Database Password Secret Namespace**: Namespace of the secret containing the PostgreSQL password for Model Catalog.
- **Database Password Secret Name**: Name of the secret containing the PostgreSQL password for Model Catalog.
- **Database Password Secret Namespace**: Namespace of the Secret containing the PostgreSQL password for Model Catalog.
- **Database Password Secret Name**: Name of the Secret containing the PostgreSQL password for Model Catalog.

Create the secret before creating the Alauda AI instance. If you use the following example, set **Database Password Secret Namespace** to `aml-operator` and **Database Password Secret Name** to `model-catalog`.
Create this Secret before creating the Alauda AI instance. If you use the following example, set **Database Password Secret Namespace** to `aml-operator` and **Database Password Secret Name** to `model-catalog`.

```yaml
apiVersion: v1
@@ -363,23 +363,10 @@ In **Administrator** view:

1. `metadata.name` is the value for **Database Password Secret Name**.
2. `metadata.namespace` is the value for **Database Password Secret Namespace**.
3. `stringData.password` is the PostgreSQL password in plain text. Kubernetes stores it as base64-encoded `data.password` after the Secret is created.
3. `stringData.password` is the PostgreSQL password in plain text. Kubernetes encodes it into `data.password` when the Secret is created, so you do not need to base64-encode the value yourself.

</Callouts>

After creation, the stored Secret has a base64-encoded `data.password` field, for example:

```yaml
apiVersion: v1
data:
  password: cGc=
kind: Secret
metadata:
  name: model-catalog
  namespace: aml-operator
type: Opaque
```

- **Model OCI Registry Address**: Registry address hosting model OCI artifacts for Model Catalog. The default value is `build-harbor.alauda.cn`.

This registry stores the model OCI images used by Model Catalog. Use Harbor or another production-mode OCI registry with HTTPS access enabled. The Harbor project or repository used for Model Catalog must allow anonymous pull access from inference cluster nodes.
Original file line number Diff line number Diff line change
@@ -136,8 +136,8 @@ You'll need to create the corresponding inference runtime `ClusterServingRuntime
- bash
resources:
  limits:
    cpu: 2
    memory: 6Gi
    cpu: 2 # [!code callout]
    memory: 6Gi # [!code callout]
  requests:
    cpu: 2
    memory: 6Gi
@@ -154,6 +154,10 @@ You'll need to create the corresponding inference runtime `ClusterServingRuntime
version: "1"

```
<Callouts>
1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
</Callouts>
* **Tip**: Make sure to replace the `image` field value with the path to your actual prepared runtime image. You can also modify the `annotations.cpaas.io/display-name` field to **customize the display name** of the runtime in the AI Platform UI.

2. **Apply the YAML File to Create the Resource**:
@@ -251,8 +255,8 @@ spec:
name: kserve-container
resources:
  limits:
    cpu: 2
    memory: 6Gi
    cpu: 2 # [!code callout]
    memory: 6Gi # [!code callout]
  requests:
    cpu: 2
    memory: 6Gi
@@ -282,6 +286,11 @@ spec:

```

<Callouts>
1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
</Callouts>

### Triton Inference Server

The Triton Inference Server runtime is designed for NVIDIA GPUs and supports multiple model formats. Similar to MLServer, you need to create the `ClusterServingRuntime` resource first, then create your inference service.
@@ -314,8 +323,8 @@ spec:
name: kserve-container
resources:
  limits:
    cpu: 2
    memory: 6Gi
    cpu: 2 # [!code callout]
    memory: 6Gi # [!code callout]
  requests:
    cpu: 2
    memory: 6Gi
@@ -340,6 +349,11 @@ spec:
version: "1"
```

<Callouts>
1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
</Callouts>

**Usage Instructions:**

1. **Create the ClusterServingRuntime**: Apply the YAML configuration above using `kubectl apply -f triton-runtime.yaml`
@@ -357,6 +371,79 @@ This example was validated on `Ascend 910B4`. It should also work with other
Ascend NPU models, but you should adjust the resource key, image, and related
version fields according to your actual environment.

#### Modelcar Permission Modes for Ascend 910 \{#modelcar-permission-modes-for-ascend-910}

Alauda AI runs KServe Modelcar with non-root UID `1000` by default. This
default is designed for the platform security baseline and works for common GPU
deployments, such as NVIDIA GPU inference services. It also works with the
community `vLLM-ascend` image in validated Ascend 910 single-card deployments.

Ascend 910 single-node multi-card inference has additional requirements because
`vLLM-ascend` uses HCCL for distributed initialization. In this scenario, choose
one of the following Modelcar permission modes according to the runtime image
and validation result:

| Scenario | Namespace PSA `Enforce` | Modelcar UID | Runtime image requirement | Recommendation |
| --- | --- | --- | --- | --- |
| Ascend 910 single-card inference | `restricted` | `1000` | Community `vLLM-ascend` image | Keep the default non-root mode |
| Ascend 910 single-node multi-card inference with a UID `1000` compatible image | `restricted` | `1000` | The image allows UID `1000` to access Ascend devices and supports HCCL initialization | Use non-root mode after validation |
| Ascend 910 single-node multi-card inference with the community image when UID `1000` compatibility is not validated | `baseline` | `0` | Community `vLLM-ascend` image | Use root mode |

:::warning
These modes are mutually exclusive in one cluster because the Modelcar UID is
configured in the cluster-level KServe Modelcar settings. Keep the default
non-root UID `1000` for general workloads and for validated Ascend 910
single-card services. Switch to root mode only for Ascend 910 multi-card
`vLLM-ascend` deployments that require it.
:::

##### Non-root Mode

Non-root mode keeps the platform default Modelcar UID `1000`, and the namespace
can keep PSA `Enforce` set to `restricted`.

Use this mode for Ascend 910 single-card `vLLM-ascend` inference with the
community image. For Ascend 910 single-node multi-card inference, use this mode
only after validating that the image supports UID `1000` and HCCL
initialization. The image must allow the runtime process to access Ascend device
files, and TP>1 / HCCL initialization may require a matching UID `1000` user
entry in `/etc/passwd`.

##### Root Mode

Root mode is a compatibility option for Ascend 910 single-node multi-card
`vLLM-ascend` inference when the selected image cannot run reliably with the
default Modelcar UID `1000`.

Root mode changes the cluster-level Modelcar UID to `0`. This affects Modelcar
workloads in the cluster, not only a single inference service. Do not use root
mode for single-card deployments unless required by your image or environment.

To enable root mode:

1. In the platform console, go to **Project** > **Namespace**, select the
namespace used by the inference service, and set Pod Security Admission
`Enforce` to `baseline`.
2. Set KServe Modelcar UID to `0` in the `AmlCluster` configuration:

```yaml
spec:
  components:
    kserve:
      values:
        kserve:
          storage:
            uidModelcar: 0
```

For root-mode deployments, do not reuse a runtime configuration that forces a
non-root UID, such as `runAsNonRoot: true` or `runAsUser: 1000`. The
`ClusterServingRuntime` example below is for non-root mode.

When root mode is no longer required, restore the cluster-level Modelcar UID to
the platform default non-root value `1000` and set namespace PSA according to
your workload security requirements.
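
The scenario table above can be read as a small decision rule. The following Python sketch is only an illustration of that rule, not part of the platform: `ModelcarMode` and `choose_modelcar_mode` are hypothetical names, and the `uid_1000_validated` flag stands for the manual validation that the text asks you to perform on your image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelcarMode:
    """Settings implied by one row of the scenario table (hypothetical type)."""
    psa_enforce: str   # namespace Pod Security Admission `Enforce` level
    modelcar_uid: int  # value to use for the cluster-level `uidModelcar`


def choose_modelcar_mode(card_count: int, uid_1000_validated: bool) -> ModelcarMode:
    """Hypothetical helper that mirrors the Ascend 910 scenario table.

    card_count: number of Ascend 910 cards used by one inference service.
    uid_1000_validated: True only if the image was verified to access the
        Ascend devices and complete HCCL initialization as UID 1000.
    """
    if card_count < 1:
        raise ValueError("card_count must be at least 1")
    # Single-card services keep the platform default non-root mode.
    if card_count == 1:
        return ModelcarMode("restricted", 1000)
    # Multi-card services stay non-root only after validation.
    if uid_1000_validated:
        return ModelcarMode("restricted", 1000)
    # Otherwise fall back to root mode, which affects the whole cluster.
    return ModelcarMode("baseline", 0)


print(choose_modelcar_mode(1, False))
print(choose_modelcar_mode(8, False))
```

Because the Modelcar UID is a cluster-level setting, treat the result as a per-cluster decision rather than a per-service one.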

**1. ClusterServingRuntime**

```yaml
@@ -435,6 +522,10 @@ spec:
value: '{{ index .Annotations "aml-model-repo" }}'
- name: GPU_MEMORY_UTILIZATION
value: "0.95"
- name: HOME # [!code callout]
value: /tmp
- name: USER # [!code callout]
value: vllm
image: quay.io/ascend/vllm-ascend:v0.18.0rc1
name: kserve-container
ports:
@@ -443,8 +534,8 @@ spec:
protocol: TCP
resources:
  limits:
    cpu: 2
    memory: 6Gi
    cpu: 2 # [!code callout]
    memory: 6Gi # [!code callout]
  requests:
    cpu: 2
    memory: 6Gi
@@ -455,7 +546,7 @@ spec:
- ALL
privileged: false
runAsNonRoot: true
runAsUser: 65534
runAsUser: 1000 # [!code callout]
seccompProfile:
type: RuntimeDefault
startupProbe:
@@ -486,10 +577,23 @@ spec:
name: devshm
```

<Callouts>
1. `HOME` points temporary files and caches to `/tmp`, which is writable for the runtime container.
2. `USER` prevents Python's `getpass.getuser()` fallback from querying `/etc/passwd` for the container UID. This avoids startup failures in images that run as a non-root UID without a matching passwd entry, while still allowing `torch_npu` to auto-load normally. For TP>1 / HCCL initialization, still validate whether the image needs an actual UID `1000` entry in `/etc/passwd`.
3. `runAsUser: 1000` aligns the runtime container with the platform default Modelcar UID and the common Ascend device permission model.
4. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
5. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
</Callouts>
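
To see why setting `USER` avoids the passwd lookup, the following Python sketch reproduces the environment of the runtime container in a local process. It is a minimal illustration under stated assumptions: the environment is replaced with only the two variables from the runtime above, and it relies on the documented behavior of `getpass.getuser()`, which checks `LOGNAME`, `USER`, `LNAME`, and `USERNAME` before falling back to the password database.

```python
import getpass
import os
from unittest import mock

# Simulate the runtime container environment: only the variables set in the
# ClusterServingRuntime are present (clear=True drops everything else).
with mock.patch.dict(os.environ, {"HOME": "/tmp", "USER": "vllm"}, clear=True):
    # getpass.getuser() returns the first non-empty value among LOGNAME,
    # USER, LNAME and USERNAME, so the passwd fallback is never reached.
    resolved_user = getpass.getuser()
    # On POSIX systems, "~" expands to the HOME variable when it is set.
    resolved_home = os.path.expanduser("~")

print(resolved_user)
print(resolved_home)
```

This only covers the Python-level lookup. Native libraries that read the password database directly are not affected by `USER`, which is why multi-card HCCL initialization still needs separate validation.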

**2. Required Changes to the InferenceService Example**

When publishing an inference service with `vLLM-ascend`, make the following required
changes to your `InferenceService` example:
The `HOME` and `USER` environment variables are set in the `ClusterServingRuntime`
above so every service that uses the runtime inherits them. When publishing an
inference service with `vLLM-ascend`, choose the `InferenceService`
security context according to the Modelcar permission mode.

For non-root mode, add `fsGroup: 1000` and `supplementalGroups: [1000]` so the
service can access Ascend device files through the expected group permissions:

```yaml
kind: InferenceService
@@ -506,17 +610,14 @@ metadata:
spec:
  predictor:
    model:
      env:
        - name: HOME # [!code callout]
          value: /tmp
      modelFormat:
        name: transformers
      protocolVersion: v2
      resources:
        limits:
          cpu: "4"
          cpu: "4" # [!code callout]
          huawei.com/Ascend910B4: "1"
          memory: 16Gi
          memory: 16Gi # [!code callout]
        requests:
          cpu: "2"
          memory: 8Gi
@@ -531,12 +632,26 @@ spec:
```

<Callouts>
1. `HOME` points temporary files and caches to `/tmp`, which is writable for the runtime container.
2. `fsGroup: 1000` makes the mounted files inherit group `1000`, helping align file permissions with the group that is allowed to access Ascend devices.
3. `supplementalGroups: [1000]` adds the container process to group `1000`, so it can access Ascend devices and related mounted files with the expected group permissions.

1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
3. `fsGroup: 1000` makes the mounted files inherit group `1000`, helping align file permissions with the group that is allowed to access Ascend devices.
4. `supplementalGroups: [1000]` adds the container process to group `1000`, so it can access Ascend devices and related mounted files with the expected group permissions.
</Callouts>
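
When debugging group permissions in non-root mode, it can help to compare the groups of the running process with the group that owns a device or mounted file. The following Python sketch is a hypothetical diagnostic, not a platform tool: `group_access_report` is an illustrative name, and the demo uses a temporary file because the actual Ascend device path (for example a node under `/dev`) depends on your driver installation and is an assumption here.

```python
import os
import stat
import tempfile


def group_access_report(path: str) -> dict:
    """Report whether the current process shares the group that owns `path`."""
    info = os.stat(path)
    # Supplementary groups plus the effective GID of this process.
    process_groups = set(os.getgroups()) | {os.getegid()}
    group_rw = bool(info.st_mode & stat.S_IRGRP) and bool(info.st_mode & stat.S_IWGRP)
    return {
        "path": path,
        "owner_gid": info.st_gid,
        "process_groups": sorted(process_groups),
        "group_matches": info.st_gid in process_groups,
        "group_can_read_write": group_rw,
    }


# Demo on a temporary file; inside the runtime container you would pass the
# path of the device file or mounted model directory instead.
with tempfile.NamedTemporaryFile() as handle:
    print(group_access_report(handle.name))
```

With `supplementalGroups: [1000]` applied, group `1000` would be expected to appear in `process_groups` when the script runs inside the container.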

For root-mode Modelcar deployments, the UID `1000` group settings are not
required. Keep the `InferenceService` security context minimal:

```yaml
spec:
predictor:
model:
runtime: aml-vllm-ascend-0.18.0rc1
storageUri: oci://<registry>/<repository>:<tag>
securityContext:
seccompProfile:
type: RuntimeDefault
```

### MindIE (Ascend NPU)

MindIE is specifically designed for Huawei Ascend hardware. Its configuration differs significantly in resource management and metadata.
@@ -835,8 +950,8 @@ spec:
name: kserve-container
resources:
  limits:
    cpu: 2
    memory: 6Gi
    cpu: 2 # [!code callout]
    memory: 6Gi # [!code callout]
  requests:
    cpu: 2
    memory: 6Gi
@@ -862,6 +977,11 @@ spec:

```

<Callouts>
1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
</Callouts>

**2. Mandatory Annotations for InferenceService**

Unlike other runtimes, MindIE **must** include the following annotations in the
@@ -887,5 +1007,5 @@ Before proceeding, refer to this table to understand the specific requirements f
| **Xinference** | CPU / NVIDIA GPU | transformers, pytorch | **Must** set `MODEL_FAMILY` environment variable |
| **MLServer** | CPU / NVIDIA GPU | sklearn, xgboost, mlflow | Standard configuration |
| **Triton** | NVIDIA GPU | triton (TensorFlow, PyTorch, ONNX, etc.) | Standard configuration |
| **vLLM-ascend** | Huawei Ascend NPU (validated on 910B4) | transformers | **Must** add `HOME`, `fsGroup`, and `supplementalGroups` to the `InferenceService` |
| **vLLM-ascend** | Huawei Ascend NPU (validated on 910B4) | transformers | **Must** set `HOME` and `USER` in the `ClusterServingRuntime`, choose the proper Modelcar permission mode for Ascend 910 single-card or multi-card deployments, and add UID `1000` group settings only for non-root mode |
| **MindIE** | Huawei Ascend NPU (validated on 310P) | mindspore, transformers | **Must** add the required NPU annotations to the `InferenceService` |
Original file line number Diff line number Diff line change
@@ -203,8 +203,11 @@ If the service fails to start or remains in an unready state, you can troublesho
### Common Issues

1. **Permission Errors**: Ensure the model files in the image have proper permissions
2. **Registry Authentication**: Verify that the cluster has access to the container registry
2. **Ascend 910 vLLM-ascend Permission Mode**: For Huawei Ascend 910
`vLLM-ascend` deployments, especially single-node multi-card services, see
[Modelcar Permission Modes for Ascend 910](./custom_inference_runtime.mdx#modelcar-permission-modes-for-ascend-910).
3. **Registry Authentication**: Verify that the cluster has access to the container registry

## Conclusion

Using KServe Modelcar (OCI container-based model storage) provides an efficient way to deploy models on the Alauda AI platform. By following the steps outlined in this guide, you can package your models as OCI images and deploy them with faster startup times and improved resource utilization.
Using KServe Modelcar (OCI container-based model storage) provides an efficient way to deploy models on the Alauda AI platform. By following the steps outlined in this guide, you can package your models as OCI images and deploy them with faster startup times and improved resource utilization.