alauda · fyuan1316 · May 23, 2026 · May 23, 2026 · May 23, 2026 · May 26, 2026
diff --git a/docs/en/installation/ai-cluster.mdx b/docs/en/installation/ai-cluster.mdx
@@ -234,7 +234,7 @@ Once **Knative Operator** is installed, you need to create the `KnativeServing`
    spec:
      # For ACP 4.0, use version 1.18.1
      # For ACP 4.1 and above, use version 1.19.6
-     version: "1.18.1" # [!code callout]
+     version: "1.19.6" # [!code callout]
      config:
        deployment:
          registries-skipping-tag-resolving: kind.local,ko.local,dev.local,private-registry # [!code callout]
@@ -343,10 +343,10 @@ In **Administrator** view:
 
 13. Under **Model Catalog** section, configure the following parameters:
 
-    - **Database Password Secret Namespace**: Namespace of the secret containing the PostgreSQL password for Model Catalog.
-    - **Database Password Secret Name**: Name of the secret containing the PostgreSQL password for Model Catalog.
+    - **Database Password Secret Namespace**: Namespace of the Secret containing the PostgreSQL password for Model Catalog.
+    - **Database Password Secret Name**: Name of the Secret containing the PostgreSQL password for Model Catalog.
 
-      Create the secret before creating the Alauda AI instance. If you use the following example, set **Database Password Secret Namespace** to `aml-operator` and **Database Password Secret Name** to `model-catalog`.
+      Create this Secret before creating the Alauda AI instance. If you use the following example, set **Database Password Secret Namespace** to `aml-operator` and **Database Password Secret Name** to `model-catalog`.
 
       ```yaml
       apiVersion: v1
@@ -363,23 +363,10 @@ In **Administrator** view:
 
       1. `metadata.name` is the value for **Database Password Secret Name**.
       2. `metadata.namespace` is the value for **Database Password Secret Namespace**.
-      3. `stringData.password` is the PostgreSQL password in plain text. Kubernetes stores it as base64-encoded `data.password` after the Secret is created.
+      3. `stringData.password` is the PostgreSQL password in plain text. Kubernetes encodes it into `data.password` when the Secret is created, so you do not need to base64-encode the value yourself.
 
       </Callouts>
 
-      After creation, the stored Secret has a base64-encoded `data.password` field, for example:
-
-      ```yaml
-      apiVersion: v1
-      data:
-        password: cGc=
-      kind: Secret
-      metadata:
-        name: model-catalog
-        namespace: aml-operator
-      type: Opaque
-      ```
-
     - **Model OCI Registry Address**: Registry address hosting model OCI artifacts for Model Catalog. The default value is `build-harbor.alauda.cn`.
 
       This registry stores the model OCI images used by Model Catalog. Use Harbor or another production-mode OCI registry with HTTPS access enabled. The Harbor project or repository used for Model Catalog must allow anonymous pull access from inference cluster nodes.
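
Review note on the `stringData.password` callout above: the removed example stored `password: cGc=`, which is the base64 form of the placeholder password `pg`. The sketch below only illustrates that encoding with Python's standard `base64` module. The helper names `encode_secret_value` and `decode_secret_value` are hypothetical and are not part of Alauda AI or Kubernetes tooling.

```python
import base64


def encode_secret_value(plain_text: str) -> str:
    """Return the base64 form that a Secret's `data` field holds for a plain-text value."""
    return base64.b64encode(plain_text.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Decode a value read back from a Secret's `data` field."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    stored = encode_secret_value("pg")  # "pg" is the placeholder password from the removed example
    print(stored)
    print(decode_secret_value(stored))
```

This can help when checking a value read back from `data.password` against the password you supplied through `stringData`.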

diff --git a/docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx b/docs/en/model_inference/inference_service/how_to/custom_inference_runtime.mdx
@@ -136,8 +136,8 @@ You'll need to create the corresponding inference runtime `ClusterServingRuntime
             - bash
             resources:
               limits:
-                cpu: 2
-                memory: 6Gi
+                cpu: 2 # [!code callout]
+                memory: 6Gi # [!code callout]
               requests:
                 cpu: 2
                 memory: 6Gi
@@ -154,6 +154,10 @@ You'll need to create the corresponding inference runtime `ClusterServingRuntime
               version: "1"
 
         ```
+        <Callouts>
+        1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+        2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+        </Callouts>
         * **Tip**: Make sure to replace the `image` field value with the path to your actual prepared runtime image. You can also modify the `annotations.cpaas.io/display-name` field to **customize the display name** of the runtime in the AI Platform UI.
 
 2.  **Apply the YAML File to Create the Resource**:
@@ -251,8 +255,8 @@ spec:
       name: kserve-container
       resources:
         limits:
-          cpu: 2
-          memory: 6Gi
+          cpu: 2 # [!code callout]
+          memory: 6Gi # [!code callout]
         requests:
           cpu: 2
           memory: 6Gi
@@ -282,6 +286,11 @@ spec:
 
  ```
 
+<Callouts>
+1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+</Callouts>
+
 ### Triton Inference Server
 
 The Triton Inference Server runtime is designed for NVIDIA GPUs and supports multiple model formats. Similar to MLServer, you need to create the `ClusterServingRuntime` resource first, then create your inference service.
@@ -314,8 +323,8 @@ spec:
       name: kserve-container
       resources:
         limits:
-          cpu: 2
-          memory: 6Gi
+          cpu: 2 # [!code callout]
+          memory: 6Gi # [!code callout]
         requests:
           cpu: 2
           memory: 6Gi
@@ -340,6 +349,11 @@ spec:
       version: "1"
 ```
 
+<Callouts>
+1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+</Callouts>
+
 **Usage Instructions:**
 
 1. **Create the ClusterServingRuntime**: Apply the YAML configuration above using `kubectl apply -f triton-runtime.yaml`
@@ -357,6 +371,79 @@ This example was validated on `Ascend 910B4`. It should also work with other
 Ascend NPU models, but you should adjust the resource key, image, and related
 version fields according to your actual environment.
 
+#### Modelcar Permission Modes for Ascend 910 \{#modelcar-permission-modes-for-ascend-910}
+
+Alauda AI runs KServe Modelcar as non-root UID `1000` by default. This
+default follows the platform security baseline and works for common GPU
+deployments, such as NVIDIA GPU inference services. It also works with the
+community `vLLM-ascend` image in validated Ascend 910 single-card deployments.
+
+Ascend 910 single-node multi-card inference has additional requirements because
+`vLLM-ascend` uses HCCL for distributed initialization. In this scenario, choose
+one of the following Modelcar permission modes according to the runtime image
+and validation result:
+
+| Scenario | Namespace PSA `Enforce` | Modelcar UID | Runtime image requirement | Recommendation |
+| --- | --- | --- | --- | --- |
+| Ascend 910 single-card inference | `restricted` | `1000` | Community `vLLM-ascend` image | Keep the default non-root mode |
+| Ascend 910 single-node multi-card inference with a UID `1000` compatible image | `restricted` | `1000` | The image allows UID `1000` to access Ascend devices and supports HCCL initialization | Use non-root mode after validation |
+| Ascend 910 single-node multi-card inference with the community image when UID `1000` compatibility is not validated | `baseline` | `0` | Community `vLLM-ascend` image | Use root mode |
+
+:::warning
+These modes are mutually exclusive in one cluster because the Modelcar UID is
+configured in the cluster-level KServe Modelcar settings. Keep the default
+non-root UID `1000` for general workloads and for validated Ascend 910
+single-card services. Switch to root mode only for Ascend 910 multi-card
+`vLLM-ascend` deployments that require it.
+:::
+
+##### Non-root Mode
+
+Non-root mode keeps the platform default Modelcar UID `1000`, and the namespace
+can keep PSA `Enforce` set to `restricted`.
+
+Use this mode for Ascend 910 single-card `vLLM-ascend` inference with the
+community image. For Ascend 910 single-node multi-card inference, use this mode
+only after validating that the image supports UID `1000` and HCCL
+initialization. The image must allow the runtime process to access Ascend device
+files, and HCCL initialization with tensor parallelism greater than 1 (TP>1)
+may require a `/etc/passwd` entry for UID `1000`.
+
+##### Root Mode
+
+Root mode is a compatibility option for Ascend 910 single-node multi-card
+`vLLM-ascend` inference when the selected image cannot run reliably with the
+default Modelcar UID `1000`.
+
+Root mode changes the cluster-level Modelcar UID to `0`. This affects Modelcar
+workloads in the cluster, not only a single inference service. Do not use root
+mode for single-card deployments unless required by your image or environment.
+
+To enable root mode:
+
+1. In the platform console, go to **Project** > **Namespace**, select the
+   namespace used by the inference service, and set Pod Security Admission
+   `Enforce` to `baseline`.
+2. Set KServe Modelcar UID to `0` in the `AmlCluster` configuration:
+
+   ```yaml
+   spec:
+     components:
+       kserve:
+         values:
+           kserve:
+             storage:
+               uidModelcar: 0
+   ```
+
+For root-mode deployments, do not reuse a runtime configuration that forces a
+non-root UID, such as `runAsNonRoot: true` or `runAsUser: 1000`. The
+`ClusterServingRuntime` example below is for non-root mode.
+
+When root mode is no longer required, restore the cluster-level Modelcar UID to
+the platform default non-root value `1000` and set namespace PSA according to
+your workload security requirements.
+
 **1. ClusterServingRuntime**
 
 ```yaml
@@ -435,6 +522,10 @@ spec:
           value: '{{ index .Annotations "aml-model-repo" }}'
         - name: GPU_MEMORY_UTILIZATION
           value: "0.95"
+        - name: HOME # [!code callout]
+          value: /tmp
+        - name: USER # [!code callout]
+          value: vllm
       image: quay.io/ascend/vllm-ascend:v0.18.0rc1
       name: kserve-container
       ports:
@@ -443,8 +534,8 @@ spec:
           protocol: TCP
       resources:
         limits:
-          cpu: 2
-          memory: 6Gi
+          cpu: 2 # [!code callout]
+          memory: 6Gi # [!code callout]
         requests:
           cpu: 2
           memory: 6Gi
@@ -455,7 +546,7 @@ spec:
             - ALL
         privileged: false
         runAsNonRoot: true
-        runAsUser: 65534
+        runAsUser: 1000 # [!code callout]
         seccompProfile:
           type: RuntimeDefault
       startupProbe:
@@ -486,10 +577,23 @@ spec:
       name: devshm
 ```
 
+<Callouts>
+1. `HOME` directs temporary files and caches to `/tmp`, which the runtime container can write to.
+2. `USER` prevents Python's `getpass.getuser()` fallback from querying `/etc/passwd` for the container UID. This avoids startup failures in images that run as a non-root UID without a matching passwd entry, while still allowing `torch_npu` to auto-load normally. For TP>1 / HCCL initialization, still validate whether the image needs an actual UID `1000` entry in `/etc/passwd`.
+3. `runAsUser: 1000` aligns the runtime container with the platform default Modelcar UID and the common Ascend device permission model.
+4. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+5. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+</Callouts>
+
 **2. Required Changes to the InferenceService Example**
 
-When publishing an inference service with `vLLM-ascend`, make the following required
-changes to your `InferenceService` example:
+The `HOME` and `USER` environment variables are set in the `ClusterServingRuntime`
+above so every service that uses the runtime inherits them. When publishing an
+inference service with `vLLM-ascend`, choose the `InferenceService`
+security context according to the Modelcar permission mode.
+
+For non-root mode, add `fsGroup: 1000` and `supplementalGroups: [1000]` so the
+service can access Ascend device files through the expected group permissions:
 
 ```yaml
 kind: InferenceService
@@ -506,17 +610,14 @@ metadata:
 spec:
   predictor:
     model:
-      env:
-        - name: HOME # [!code callout]
-          value: /tmp
       modelFormat:
         name: transformers
       protocolVersion: v2
       resources:
         limits:
-          cpu: "4"
+          cpu: "4" # [!code callout]
           huawei.com/Ascend910B4: "1"
-          memory: 16Gi
+          memory: 16Gi # [!code callout]
         requests:
           cpu: "2"
           memory: 8Gi
@@ -531,12 +632,26 @@ spec:
 ```
 
 <Callouts>
-1. `HOME` points temporary files and caches to `/tmp`, which is writable for the runtime container.
-2. `fsGroup: 1000` makes the mounted files inherit group `1000`, helping align file permissions with the group that is allowed to access Ascend devices.
-3. `supplementalGroups: [1000]` adds the container process to group `1000`, so it can access Ascend devices and related mounted files with the expected group permissions.
-
+1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+3. `fsGroup: 1000` makes the mounted files inherit group `1000`, helping align file permissions with the group that is allowed to access Ascend devices.
+4. `supplementalGroups: [1000]` adds the container process to group `1000`, so it can access Ascend devices and related mounted files with the expected group permissions.
 </Callouts>
 
+For root-mode Modelcar deployments, the UID `1000` group settings are not
+required. Keep the `InferenceService` security context minimal:
+
+```yaml
+spec:
+  predictor:
+    model:
+      runtime: aml-vllm-ascend-0.18.0rc1
+      storageUri: oci://<registry>/<repository>:<tag>
+    securityContext:
+      seccompProfile:
+        type: RuntimeDefault
+```
+
 ### MindIE (Ascend NPU)
 
 MindIE is specifically designed for Huawei Ascend hardware. Its configuration differs significantly in resource management and metadata.
@@ -835,8 +950,8 @@ spec:
       name: kserve-container
       resources:
         limits:
-          cpu: 2
-          memory: 6Gi
+          cpu: 2 # [!code callout]
+          memory: 6Gi # [!code callout]
         requests:
           cpu: 2
           memory: 6Gi
@@ -862,6 +977,11 @@ spec:
 
 ```
 
+<Callouts>
+1. Set the CPU limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+2. Set the memory limit according to the actual model size, runtime engine, hardware type, and expected workload in your environment.
+</Callouts>
+
 **2.Mandatory Annotations for InferenceService**
 
 Unlike other runtimes, MindIE **must** include the following annotations in the
@@ -887,5 +1007,5 @@ Before proceeding, refer to this table to understand the specific requirements f
 | **Xinference** | CPU / NVIDIA GPU | transformers, pytorch | **Must** set `MODEL_FAMILY` environment variable |
 | **MLServer** | CPU / NVIDIA GPU | sklearn, xgboost, mlflow | Standard configuration |
 | **Triton** | NVIDIA GPU | triton (TensorFlow, PyTorch, ONNX, etc.) | Standard configuration |
-| **vLLM-ascend** | Huawei Ascend NPU (validated on 910B4) | transformers | **Must** add `HOME`, `fsGroup`, and `supplementalGroups` to the `InferenceService` |
+| **vLLM-ascend** | Huawei Ascend NPU (validated on 910B4) | transformers | **Must** set `HOME` and `USER` in the `ClusterServingRuntime`, choose the proper Modelcar permission mode for Ascend 910 single-card or multi-card deployments, and add UID `1000` group settings only for non-root mode |
 | **MindIE** | Huawei Ascend NPU (validated on 310P) | mindspore, transformers | **Must** add the required NPU annotations to the `InferenceService` |
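
Review note on the `USER` callout in the `vLLM-ascend` runtime: Python's `getpass.getuser()` checks the `LOGNAME`, `USER`, `LNAME`, and `USERNAME` environment variables in that order before it falls back to the password database. The sketch below isolates the effect of `USER` under that assumption. The helper `resolve_user` is hypothetical and only demonstrates the lookup; it does not reproduce the `vLLM-ascend` startup path.

```python
import getpass
import os


def resolve_user(env_user: str) -> str:
    """Return what getpass.getuser() reports when only USER is set."""
    # Clear the other variables that getpass.getuser() consults,
    # so the result depends on USER alone.
    for name in ("LOGNAME", "LNAME", "USERNAME"):
        os.environ.pop(name, None)
    os.environ["USER"] = env_user
    return getpass.getuser()


if __name__ == "__main__":
    # With USER set, no /etc/passwd lookup for the container UID is needed.
    print(resolve_user("vllm"))
```

If the image sets `LOGNAME`, that value takes precedence over `USER`, so it may be worth checking the image environment when validating the runtime.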
diff --git a/docs/en/model_inference/inference_service/how_to/using_modelcar.mdx b/docs/en/model_inference/inference_service/how_to/using_modelcar.mdx
@@ -203,8 +203,11 @@ If the service fails to start or remains in an unready state, you can troublesho
 ### Common Issues
 
 1. **Permission Errors**: Ensure the model files in the image have proper permissions
-2. **Registry Authentication**: Verify that the cluster has access to the container registry
+2. **Ascend 910 vLLM-ascend Permission Mode**: For Huawei Ascend 910
+   `vLLM-ascend` deployments, especially single-node multi-card services, see
+   [Modelcar Permission Modes for Ascend 910](./custom_inference_runtime.mdx#modelcar-permission-modes-for-ascend-910).
+3. **Registry Authentication**: Verify that the cluster has access to the container registry
 
 ## Conclusion
 
-Using KServe Modelcar (OCI container-based model storage) provides an efficient way to deploy models in Alauda AI platform. By following the steps outlined in this guide, you can package your models as OCI images and deploy them with faster startup times and improved resource utilization.
+Using KServe Modelcar (OCI container-based model storage) provides an efficient way to deploy models on the Alauda AI platform. By following the steps outlined in this guide, you can package your models as OCI images and deploy them with faster startup times and improved resource utilization.
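
Review note on the permission-mode table added in `custom_inference_runtime.mdx`: the three rows reduce to a small decision rule. The sketch below encodes that rule as read from the table. The names `ModelcarMode` and `choose_modelcar_mode` are hypothetical, and the function is an illustration of the table, not a platform API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelcarMode:
    name: str          # "non-root" or "root"
    psa_enforce: str   # namespace Pod Security Admission `Enforce` level
    modelcar_uid: int  # cluster-level KServe Modelcar UID


NON_ROOT = ModelcarMode("non-root", "restricted", 1000)
ROOT = ModelcarMode("root", "baseline", 0)


def choose_modelcar_mode(card_count: int, uid_1000_validated: bool) -> ModelcarMode:
    """Pick a mode for an Ascend 910 single-node vLLM-ascend service.

    card_count: number of Ascend cards the service uses on the node.
    uid_1000_validated: True if the image was validated to access Ascend
        devices and complete HCCL initialization as UID 1000.
    """
    if card_count < 1:
        raise ValueError("card_count must be at least 1")
    if card_count == 1:
        # Single-card services keep the default non-root mode.
        return NON_ROOT
    # Multi-card services stay non-root only after validation.
    return NON_ROOT if uid_1000_validated else ROOT


if __name__ == "__main__":
    print(choose_modelcar_mode(1, False))
    print(choose_modelcar_mode(4, True))
    print(choose_modelcar_mode(4, False))
```

Because the Modelcar UID is a cluster-level setting, the result applies to every Modelcar workload in the cluster, not only to the service being evaluated.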