Commit c1515d5

Merge pull request #568 from elizjo/main
Corrected: Additional Details Added to Aqua Troubleshooting Documentation For Auth Errors
2 parents 5017e6c + 9fcbf19 commit c1515d5

1 file changed

+145
-52
lines changed

ai-quick-actions/troubleshooting-tips.md

+145-52
@@ -1,4 +1,38 @@
1-
# Model Deployment
1+
2+
3+
<!-- TOC -->
4+
<!-- /TOC -->
5+
6+
- [Troubleshooting Model Deployment](#troubleshooting-model-deployment)
7+
- [Logs](#logs)
8+
- [Understanding GPU requirement for models](#understanding-gpu-requirement-for-models)
9+
- [Issues and Resolutions](#issues-and-resolutions)
10+
- [Service Timeout Error](#service-timeout-error)
11+
- [Out of Memory (OOM) Error](#out-of-memory-oom-error)
12+
- [Trusting Remote Code](#trusting-remote-code)
13+
- [Architecture Not Supported](#architecture-not-supported)
14+
- [Capacity Issues](#capacity-issues)
15+
- [Chat payload is Not Working](#chat-payload-is-not-working)
16+
- [Image Payload is Not Working](#image-payload-is-not-working)
17+
- [Prompt Completion Payload is Not Working](#prompt-completion-payload-is-not-working)
18+
- [Authorization Issues](#authorization-issues)
19+
- [Types of Authorization Errors](#types-of-authorization-errors)
20+
- [Create Model](#create-model)
21+
- [List Models](#list-models)
22+
- [Create Model Deployment](#create-model-deployment)
23+
- [List Model Deployment](#list-model-deployment)
24+
- [Create Model Version Sets](#create-model-version-sets)
25+
- [List Model Version Sets](#list-model-version-sets)
26+
- [Create Job](#create-job)
27+
- [Create Job Run](#create-job-run)
28+
- [List Log Groups](#list-log-groups)
29+
- [List Data Science Private Endpoints](#list-data-science-private-endpoints)
30+
- [Get Namespace](#get-namespace)
31+
- [Put Object](#put-object)
32+
- [List Buckets](#list-buckets)
33+
- [Update Model](#update-model)
34+
- [Evaluation and Fine Tuning](#evaluation-and-fine-tuning)
35+
# Troubleshooting Model Deployment
236

337
## Logs
438

@@ -29,7 +63,7 @@ If logs are attached, run the `ads watch` command to retrieve the logs. Once log
2963

3064
Here are some frequently encountered issues. A deployment can fail for reasons not listed here, but these are the most common ones -
3165

32-
#### Out of Memory (OOM) error.
66+
#### Out of Memory (OOM) Error
3367

3468
Check the error message in the logs to determine whether you need to allocate more GPUs or limit the context length. Here are some tips
3569

@@ -57,7 +91,7 @@ If your log message appears like the above, then you have two options -
5791

5892
2) Try quantization:
5993

60-
You can reduce the memory footprint of the model by enabling quantization. Here are the steps to enable quantization -
94+
You can reduce the memory footprint of the model by enabling quantization. Follow these steps to enable quantization -
6195
1. Go to create model deployment and select the model you want to deploy
6296
2. Click on advanced section
6397
3. Enter the quantization option as described in the documentation of the inference container. For example, if you are using vLLM, enter `--quantization` for Name and `fp8` for Value. This loads the model in 8-bit, reducing the memory requirement by half. You can also try `--quantization bitsandbytes` and `--load-format bitsandbytes` to load in 4 bits.
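
To get a feel for how much quantization helps, the following is a rough back-of-the-envelope sketch in Python. It assumes the unquantized weights are stored in 16 bits and counts only the weights; the KV cache, activations, and runtime overhead are ignored. The function name and the 7B parameter count are illustrative and do not come from AI Quick Actions or vLLM.

```python
# Approximate bytes needed per parameter at each precision (assumption:
# the unquantized baseline is 16-bit weights).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "4bit": 0.5}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the approximate size of the model weights in GiB.

    Only the weights are counted; KV cache and activations need extra memory.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


if __name__ == "__main__":
    params = 7e9  # hypothetical 7B-parameter model
    for precision in ("fp16", "fp8", "4bit"):
        print(f"{precision}: {weight_memory_gib(params, precision):.1f} GiB")
```

Treat the result as a lower bound when choosing a shape, since the context length also drives GPU memory use.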
@@ -86,7 +120,7 @@ If you see such a message, constrain the context length by -
86120
2. Click on advanced section
87121
3. Add name as `--max-model-len` and for value, use the hint in the log. As per the above log we can set `37696`. It is better to leave some headroom and choose a lower value.
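
One way to "leave some room" is to subtract a fixed percentage from the hint in the log. The sketch below is only an illustration: the helper name and the 10% default are assumptions, not a recommendation from vLLM or AI Quick Actions.

```python
def pick_max_model_len(hint_from_log: int, headroom_percent: int = 10) -> int:
    """Return a context length below the value hinted in the log.

    headroom_percent is a hypothetical safety margin; tune it for your model.
    """
    if hint_from_log <= 0:
        raise ValueError("hint_from_log must be positive")
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    # Integer arithmetic avoids floating-point rounding surprises.
    return hint_from_log * (100 - headroom_percent) // 100


if __name__ == "__main__":
    # 37696 is the hint taken from the sample log above.
    print(pick_max_model_len(37696))
```

The returned number is what you would enter as the value for `--max-model-len`.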
88122

89-
#### Trusting remote code
123+
#### Trusting Remote Code
90124

91125
Sometimes the inference container does not have native support for the model, but the model can still be side loaded using the code provided by the model provider. In such cases, the error message could look like the one below -
92126

@@ -99,7 +133,7 @@ If you see such a message, -
99133
3. Add name as `--trust-remote-code` and leave value as blank.
100134

101135

102-
#### Architecture Not supported
136+
#### Architecture Not Supported
103137

104138
vLLM container may not support the model that you are trying to load. Here is a sample log snippet in such cases -
105139

@@ -110,75 +144,134 @@ Exiting vLLM.
110144
In such cases, you will have to follow the [BYOC](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/LLM/deploy-llm-byoc.md) approach. Check [here](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/ai-quick-actions/ai-quick-actions-containers.md) for the containers supported by AI Quick Actions.
111145

112146
Visit [vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html) to know what models are supported.
113-
114147
If you are using Text Generation Inference, visit [TGI Support models page](https://huggingface.co/docs/text-generation-inference/en/supported_models)
115148

116149
### Capacity Issues
117150

118-
You see a message "There is currently no capacity for the specified shape. Choose a different shape or region". This happens because currently all the instances of the selected shape are in use in that region. This is different from the limits.
151+
You see a message "There is currently no capacity for the specified shape. Choose a different shape or region". This happens because all the instances of the selected shape are currently in use in that region. This is different from the limits.
119152

120153
The shapes are provisioned from a common pool by default. You could create a capacity reservation for more predictable availability of the shape. More information [here](https://docs.oracle.com/en-us/iaas/data-science/using/gpu-using.htm#gpu-use-reserve)
121154

122-
### Chat payload is not working
155+
### Chat payload is Not Working
123156
TODO
124157

125-
### Image Payload not working
158+
### Image Payload is Not Working
126159
TODO
127160

128-
### Prompt completion payload is not working
161+
### Prompt Completion Payload is Not Working
129162
TODO
130-
131163
# Authorization Issues
132164

133-
Authorization issues arise due to missing policy. Please refer to [policy document](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/ai-quick-actions/policies/README.md) to setup policies. We strongly encourage using ORM option mentioned in the policy document.
165+
Authorization issues arise due to missing policy and/or using non-versioned OCI Object Storage Buckets with AQUA.
166+
1. Set up policies for AQUA as seen [here](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/ai-quick-actions/policies/README.md)
167+
- We strongly encourage using the ORM option (automated rather than manual setup of policies) mentioned in the policy document.
168+
2. The notebook session has to be in the **same compartment** as the one defined by the dynamic group.
169+
- The dynamic group definition used while setting up ORM stack identifies the notebook from where AI Quick Actions is being used.
170+
3. Ensure that the bucket used with AQUA has object versioning enabled.
171+
172+
![object versioning](./web_assets/object-versioning.png)
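
Besides checking the console, the versioning state can also be read programmatically. The sketch below assumes the OCI Python SDK (`oci`) is available in the notebook session and that resource principal authentication is configured; the helper name is hypothetical. The function only relies on a client exposing `get_bucket`, whose response carries the bucket's `versioning` field.

```python
def bucket_versioning_enabled(client, namespace: str, bucket_name: str) -> bool:
    """Return True when the bucket reports versioning as 'Enabled'.

    `client` is expected to behave like oci.object_storage.ObjectStorageClient.
    """
    bucket = client.get_bucket(namespace, bucket_name).data
    return bucket.versioning == "Enabled"


# Usage inside a notebook session (assumption: `oci` is installed and the
# session can authenticate with a resource principal):
#
#   import oci
#   signer = oci.auth.signers.get_resource_principals_signer()
#   client = oci.object_storage.ObjectStorageClient(config={}, signer=signer)
#   namespace = client.get_namespace().data
#   print(bucket_versioning_enabled(client, namespace, "<your-bucket-name>"))
```

If this call itself fails with an authorization error, the `read buckets` and `read objectstorage-namespaces` policies are the likely gap.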
134173

135-
**Note**: `<Your dynamic group>` in the policy below has to be replaced with dynamic group that you defined while using the ORM stack.
136174

137-
If you see authorization issues after setting up the policies here are possible cases -
138-
1. The dynamic group definition used while setting up ORM stack identifies the notebook from where AI quick actions is being used. The notebook session has to be in the same compartment as the one defined by the dynamic group.
139-
2. If the UI is not able to list the buckets or fetch namespace you maybe missing following policy -
140-
```
141-
Allow dynamic-group <Your dynamic group> to read buckets in compartment <your-compartment-name>
142-
Allow dynamic-group <Your dynamic group> to read objectstorage-namespaces in compartment <your-compartment-name>
143-
```
144-
3. While registering the model, AI Quick Actions is not able to reach the object storage location specified -
145-
```
146-
Allow dynamic-group <Your dynamic group> to manage object-family in compartment <your-compartment-name> where any {target.bucket.name='<your-bucket-name>'}
147-
```
148-
4. While registering the model, AI Quick Actions is not able to create model in model catalog -
149-
```
150-
Allow dynamic-group <Your dynamic group> to manage data-science-models in compartment <your-compartment-name>
151-
```
152-
5. Unable to fetch model details for fine tuned models -
175+
## Types of Authorization Errors
176+
If you still see authorization issues after setting up the policies, confirming that the notebook is in the **same compartment** as the one defined by the dynamic group, and enabling versioning on the bucket, check the following cases:
177+
178+
#### Create Model
179+
1. AI Quick Actions is not able to reach the object storage location specified when registering the model.
180+
```
181+
Allow dynamic-group <Your dynamic group> to manage object-family in compartment <your-compartment-name> where any {target.bucket.name='<your-bucket-name>'}
182+
```
183+
1. AI Quick Actions is not able to create a model in the model catalog. Ensure that the policy below is in place.
184+
```
185+
Allow dynamic-group <Your dynamic group> to manage data-science-models in compartment <your-compartment-name>
186+
```
187+
1. The AQUA UI currently does not support adding freeform tags. Use the AQUA CLI to register a model with freeform tags.
188+
189+
```
190+
ads aqua model register --model <model-ocid> --os_path <oss-path> --download_from_hf True --compartment_id ocid1.compartment.xxx --defined_tags '{"key1":"value1", ...}' --freeform_tags '{"key1":"value1", ...}'
191+
```
192+
#### List Models
193+
If you see an authorization error related to listing, creating, or registering models, ensure that the policy below is in place.
194+
```
195+
Allow dynamic-group <Your dynamic group> to manage data-science-models in compartment <your-compartment-name>
196+
```
197+
198+
#### Create Model Deployment
199+
#### List Model Deployment
200+
If you see an authorization error related to creating, listing, or managing model deployments, ensure that the policy below is in place.
201+
```
202+
Allow dynamic-group <Your dynamic group> to manage data-science-model-deployments in compartment <your-compartment-name>
203+
```
204+
205+
#### Create Model Version Sets
206+
#### List Model Version Sets
207+
If you are unable to create a model version set or fetch model version set information during the fine tuning or evaluation step, ensure the policy below is in place.
208+
```
209+
Allow dynamic-group <Your dynamic group> to manage data-science-modelversionsets in compartment <your-compartment-name>
210+
```
211+
212+
#### Create Job
213+
Unable to create a job during evaluation or fine tuning. Ensure the policy below is in place.
214+
```
215+
Allow dynamic-group <Your dynamic group> to manage data-science-jobs in compartment <your-compartment-name>
216+
```
217+
218+
#### Create Job Run
219+
Unable to create a job run during fine tuning or evaluation. Ensure the policy below is in place.
220+
```
221+
Allow dynamic-group <Your dynamic group> to manage data-science-job-runs in compartment <your-compartment-name>
222+
```
223+
224+
#### List Log Groups
225+
If the dropdown for log group or log does not show anything and gives an authorization error, ensure the policy below is in place.
226+
```
227+
Allow dynamic-group <Your dynamic group> to use logging-family in compartment <your-compartment-name>
228+
```
229+
#### List Data Science Private Endpoints
230+
If the UI does not list the private endpoints in the specified compartment and shows an authorization error, ensure the policy below is in place.
231+
```
232+
Allow dynamic-group <Your dynamic group> to use virtual-network-family in compartment <your-compartment-name>
233+
```
234+
235+
#### Get Namespace
236+
If the UI is unable to fetch the namespace or list object storage buckets, ensure the policies below are in place.
237+
```
238+
Allow dynamic-group <Your dynamic group> to read buckets in compartment <your-compartment-name>
239+
Allow dynamic-group <Your dynamic group> to read objectstorage-namespaces in compartment <your-compartment-name>
240+
```
241+
242+
#### Put Object
243+
If an object storage bucket (with object versioning enabled) cannot be accessed, ensure these policies are in place.
244+
```
245+
Allow dynamic-group <Your dynamic group> to manage object-family in compartment <your-compartment-name> where any {target.bucket.name='<your-bucket-name>'}
246+
Allow dynamic-group <Your dynamic group> to read buckets in compartment <your-compartment-name>
247+
Allow dynamic-group <Your dynamic group> to read objectstorage-namespaces in compartment <your-compartment-name>
248+
```
249+
250+
#### List Buckets
251+
If the UI is unable to list buckets, ensure the following:
252+
- If using custom networking, configure a NAT gateway and a service gateway (SGW).
253+
- Ensure the policy below is in place.
254+
```
255+
Allow dynamic-group <Your dynamic group> to read buckets in compartment <your-compartment-name>
256+
```
257+
258+
#### Update Model
259+
If an error occurs when submitting the UI form while creating a fine-tuned model deployment, add the following policy.
260+
```
261+
Allow dynamic-group <Your dynamic group> to use tag-namespaces in tenancy
262+
```
263+
264+
#### Evaluation and Fine Tuning
265+
1. Unable to fetch model details for fine tuned models
153266
```
154267
Allow dynamic-group <Your dynamic group> to manage data-science-models in compartment <your-compartment-name>
155268
```
156-
6. Unable to create a model version set or not able to fetch model version set information during fine tuning or evaluation step -
157-
```
158-
Allow dynamic-group <Your dynamic group> to manage data-science-modelversionsets in compartment <your-compartment-name>
159-
```
160-
7. Unable to fetch resource limits information where you select shape -
269+
2. Unable to fetch resource limits information when selecting instance shape -
161270
```
162271
Allow dynamic-group <Your dynamic group> to read resource-availability in compartment <your-compartment-name>
163272
```
164-
8. The dropdown for log group or log does not show anything and gives authorization error -
165-
```
166-
Allow dynamic-group <Your dynamic group> to use logging-family in compartment <your-compartment-name>
167-
```
168-
9. Unable to list any VCN or subnet while creating Fine Tuning job or Evaluation Job -
273+
3. Unable to list any VCN or subnet while creating Fine Tuning job or Evaluation Job -
169274
```
170275
Allow dynamic-group <Your dynamic group> to use virtual-network-family in compartment <your-compartment-name>
171276
```
172-
10. Authorization error related to listing, creating or managing model deployments -
173-
```
174-
Allow dynamic-group <Your dynamic group> to manage data-science-model-deployments in compartment <your-compartment-name>
175-
```
176-
11. Allowing AI Quick Actions to use defined tags -
177-
```
178-
Allow dynamic-group <Your dynamic group> to use tag-namespaces in tenancy
179-
```
180-
12. Unable to create finetuning or evaluation jobs - create_job
181-
```
182-
Allow dynamic-group <Your dynamic group> to manage data-science-jobs in compartment <your-compartment-name>
183-
Allow dynamic-group <Your dynamic group> to manage data-science-job-runs in compartment <your-compartment-name>
184-
```
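
When setting policies up manually instead of through ORM, it can help to render every statement from this document for one dynamic group, compartment, and bucket in a single pass, so that none is missed. The following Python sketch does only string formatting; the function name and the example values are hypothetical, and the statements are the ones listed in this document.

```python
# Compartment-scoped grants collected from the sections above.
COMPARTMENT_GRANTS = [
    "manage data-science-models",
    "manage data-science-model-deployments",
    "manage data-science-modelversionsets",
    "manage data-science-jobs",
    "manage data-science-job-runs",
    "use logging-family",
    "use virtual-network-family",
    "read buckets",
    "read objectstorage-namespaces",
    "read resource-availability",
]


def render_policies(dynamic_group: str, compartment: str, bucket: str) -> list[str]:
    """Return the policy statements with the placeholders filled in."""
    prefix = f"Allow dynamic-group {dynamic_group} to"
    statements = [
        f"{prefix} {grant} in compartment {compartment}"
        for grant in COMPARTMENT_GRANTS
    ]
    # Bucket-scoped access used when registering models.
    statements.append(
        f"{prefix} manage object-family in compartment {compartment} "
        f"where any {{target.bucket.name='{bucket}'}}"
    )
    # Defined tags are granted at the tenancy level.
    statements.append(f"{prefix} use tag-namespaces in tenancy")
    return statements


if __name__ == "__main__":
    # Example values are placeholders, not real resource names.
    for line in render_policies("my-dynamic-group", "my-compartment", "my-bucket"):
        print(line)
```

Comparing this output with the statements in your policy is a quick way to spot the one that is missing.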
277+
