ai-quick-actions/troubleshooting-tips.md
+145-52
@@ -1,4 +1,38 @@
1
-
# Model Deployment
1
+
2
+
3
+
<!-- TOC -->
4
+
<!-- /TOC -->
5
+
6
+
-[Troubleshooting Model Deployment](#troubleshooting-model-deployment)
7
+
-[Logs](#logs)
8
+
-[Understanding GPU requirement for models](#understanding-gpu-requirement-for-models)
9
+
-[Issues and Resolutions](#issues-and-resolutions)
10
+
-[Service Timeout Error](#service-timeout-error)
11
+
-[Out of Memory (OOM) Error](#out-of-memory-oom-error)
12
+
-[Trusting Remote Code](#trusting-remote-code)
13
+
-[Architecture Not Supported](#architecture-not-supported)
14
+
-[Capacity Issues](#capacity-issues)
15
+
-[Chat payload is Not Working](#chat-payload-is-not-working)
16
+
-[Image Payload is Not Working](#image-payload-is-not-working)
17
+
-[Prompt Completion Payload is Not Working](#prompt-completion-payload-is-not-working)
18
+
-[Authorization Issues](#authorization-issues)
19
+
-[Types of Authorization Errors](#types-of-authorization-errors)
20
+
-[Create Model](#create-model)
21
+
-[List Models](#list-models)
22
+
-[Create Model Deployment](#create-model-deployment)
23
+
-[List Model Deployment](#list-model-deployment)
24
+
-[Create Model Version Sets](#create-model-version-sets)
25
+
-[List Model Version Sets](#list-model-version-sets)
26
+
-[Create Job](#create-job)
27
+
-[Create Job Run](#create-job-run)
28
+
-[List Log Groups](#list-log-groups)
29
+
-[List Data Science Private Endpoints](#list-data-science-private-endpoints)
30
+
-[Get Namespace](#get-namespace)
31
+
-[Put Object](#put-object)
32
+
-[List Buckets](#list-buckets)
33
+
-[Update Model](#update-model)
34
+
-[Evaluation and Fine Tuning](#evaluation-and-fine-tuning)
35
+
# Troubleshooting Model Deployment
2
36
3
37
## Logs
4
38
@@ -29,7 +63,7 @@ If logs are attached, run the `ads watch` command to retrieve the logs. Once log
29
63
30
64
Here are some frequently encountered issues. A deployment can fail for reasons not listed here, but these are the most common ones -
31
65
32
-
#### Out of Memory (OOM) error.
66
+
#### Out of Memory (OOM) Error
33
67
34
68
Check the error message in the logs to determine whether you need to allocate more GPUs or limit the context length. Here are some tips
35
69
@@ -57,7 +91,7 @@ If your log message appears like the above, then you have two options -
57
91
58
92
2) Try quantization:
59
93
60
-
You can reduce the memory footprint of the model by enabling quantization. Here are the steps to enable quantization -
94
+
You can reduce the memory footprint of the model by enabling quantization. Here are the steps to enable quantization -
61
95
1. Go to create model deployment and select the model you want to deploy
62
96
2. Click on advanced section
63
97
3. Input the quantization option as per the documentation of the inference container. For example, if you are using vLLM, you can input `--quantization` for Name and `fp8` for Value. This loads the model in 8-bit, roughly halving the memory requirement compared with 16-bit weights. You can try `--quantization bitsandbytes` and `--load-format bitsandbytes` to load in 4 bits.
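
To get a feel for how much quantization helps, the sketch below estimates the memory needed for the model weights alone at different precisions. It is a rough back-of-the-envelope calculation, not something AI Quick Actions or vLLM reports: the bytes-per-parameter table and the 7B parameter count are assumptions for illustration, and the KV cache, activations, and runtime overhead are ignored.

```python
# Approximate bytes per parameter for common weight formats (assumed values).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return the approximate GiB needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[precision] / (1024 ** 3)


if __name__ == "__main__":
    params = 7e9  # hypothetical 7B-parameter model
    for precision in ("fp16", "fp8", "int4"):
        print(f"{precision}: {weight_memory_gib(params, precision):.1f} GiB")
```

If the estimate for the chosen precision is already close to the total GPU memory of the shape, expect an OOM error once the KV cache is allocated.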
@@ -86,7 +120,7 @@ If you see such a message, constrain the context length by -
86
120
2. Click on advanced section
87
121
3. Add the name as `--max-model-len` and, for the value, use the hint in the log. As per the above log, the upper bound is `37696`. It is better to leave some room and choose a lower value.
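
One way to "leave some room" is to take the limit reported in the log and reduce it by a fixed margin. The helper below is only a sketch of that idea: the 10% margin and the rounding to a multiple of 256 are arbitrary assumptions, not values recommended by vLLM or AI Quick Actions.

```python
def pick_max_model_len(hint_tokens: int, margin: float = 0.1, multiple: int = 256) -> int:
    """Pick a --max-model-len value below the limit reported in the log."""
    if hint_tokens <= 0:
        raise ValueError("hint_tokens must be positive")
    if not 0 <= margin < 1:
        raise ValueError("margin must be in [0, 1)")
    # Shrink the hint by the safety margin.
    reduced = int(hint_tokens * (1 - margin))
    # Round down to a tidy multiple when that does not collapse to zero.
    rounded = reduced - reduced % multiple
    return rounded if rounded > 0 else max(1, reduced)


if __name__ == "__main__":
    # 37696 is the limit shown in the sample log above.
    print(pick_max_model_len(37696))
```

The returned number is what you would enter as the value for `--max-model-len` in the advanced section.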
88
122
89
-
#### Trusting remote code
123
+
#### Trusting Remote Code
90
124
91
125
Sometimes the inference container does not have native support for the model, but the model can still be side-loaded using the code provided by the model provider. In such cases, the error message could look like the one below -
92
126
@@ -99,7 +133,7 @@ If you see such a message, -
99
133
3. Add the name as `--trust-remote-code` and leave the value blank.
100
134
101
135
102
-
#### Architecture Not supported
136
+
#### Architecture Not Supported
103
137
104
138
The vLLM container may not support the model that you are trying to load. Here is a sample log snippet for such a case -
105
139
@@ -110,75 +144,134 @@ Exiting vLLM.
110
144
In such cases, you will have to follow the [BYOC](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/LLM/deploy-llm-byoc.md) approach. Check [here](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/ai-quick-actions/ai-quick-actions-containers.md) for the containers supported by AI Quick Actions.
111
145
112
146
Visit [vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html) to see which models are supported.
113
-
114
147
If you are using Text Generation Inference, visit the [TGI supported models page](https://huggingface.co/docs/text-generation-inference/en/supported_models).
115
148
116
149
### Capacity Issues
117
150
118
-
You see a message "There is currently no capacity for the specified shape. Choose a different shape or region". This happens because currently all the instances of the selected shape are in use in that region. This is different from the limits.
151
+
You may see the message "There is currently no capacity for the specified shape. Choose a different shape or region". This happens because all the instances of the selected shape are currently in use in that region. This is different from reaching a service limit.
119
152
120
153
The shapes are provisioned from a common pool by default. You can create a capacity reservation for more predictable availability of the shape. More information is available [here](https://docs.oracle.com/en-us/iaas/data-science/using/gpu-using.htm#gpu-use-reserve).
121
154
122
-
### Chat payload is not working
155
+
### Chat Payload is Not Working
123
156
TODO
124
157
125
-
### Image Payload not working
158
+
### Image Payload is Not Working
126
159
TODO
127
160
128
-
### Prompt completion payload is not working
161
+
### Prompt Completion Payload is Not Working
129
162
TODO
130
-
131
163
# Authorization Issues
132
164
133
-
Authorization issues arise due to missing policy. Please refer to [policy document](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/ai-quick-actions/policies/README.md) to setup policies. We strongly encourage using ORM option mentioned in the policy document.
165
+
Authorization issues arise from missing policies and/or from using non-versioned OCI Object Storage buckets with AQUA.
166
+
1. Set up policies for AQUA as described [here](https://github.com/oracle-samples/oci-data-science-ai-samples/blob/main/ai-quick-actions/policies/README.md).
167
+
- We strongly encourage using the ORM option (automated rather than manual setup of policies) mentioned in the policy document.
168
+
2. The notebook session has to be in the **same compartment** as the one defined by the dynamic group.
169
+
- The dynamic group definition used while setting up ORM stack identifies the notebook from where AI Quick Actions is being used.
170
+
3. Ensure that the bucket used with AQUA has object versioning enabled
**Note**: `<Your dynamic group>` in the policy below has to be replaced with the dynamic group that you defined while using the ORM stack.
136
174
137
-
If you see authorization issues after setting up the policies here are possible cases -
138
-
1. The dynamic group definition used while setting up ORM stack identifies the notebook from where AI quick actions is being used. The notebook session has to be in the same compartment as the one defined by the dynamic group.
139
-
2. If the UI is not able to list the buckets or fetch namespace you maybe missing following policy -
140
-
```
141
-
Allow dynamic-group <Your dynamic group> to read buckets in compartment <your-compartment-name>
142
-
Allow dynamic-group <Your dynamic group> to read objectstorage-namespaces in compartment <your-compartment-name>
143
-
```
144
-
3. While registering the model, AI Quick Actions is not able to reach the object storage location specified -
145
-
```
146
-
Allow dynamic-group <Your dynamic group> to manage object-family in compartment <your-compartment-name> where any {target.bucket.name='<your-bucket-name>'}
147
-
```
148
-
4. While registering the model, AI Quick Actions is not able to create model in model catalog -
149
-
```
150
-
Allow dynamic-group <Your dynamic group> to manage data-science-models in compartment <your-compartment-name>
151
-
```
152
-
5. Unable to fetch model details for fine tuned models -
175
+
## Types of Authorization Errors
176
+
If you still see authorization issues after setting up the policies, confirming that the notebook is in the **same compartment** as the one defined by the dynamic group, and verifying that the bucket is versioned, check the following cases:
177
+
178
+
#### Create Model
179
+
1. AI Quick Actions is not able to reach the object storage location specified when registering the model.
180
+
```
181
+
Allow dynamic-group <Your dynamic group> to manage object-family in compartment <your-compartment-name> where any {target.bucket.name='<your-bucket-name>'}
182
+
```
183
+
1. AI Quick Actions is not able to create the model in the model catalog. Ensure that the policy below is in place.
184
+
```
185
+
Allow dynamic-group <Your dynamic group> to manage data-science-models in compartment <your-compartment-name>
186
+
```
187
+
1. The AQUA UI currently does not support adding freeform tags. Use the AQUA CLI to register a model with freeform tags.
188
+
189
+
```
190
+
ads aqua model register --model <model-ocid> --os_path <oss-path> --download_from_hf True --compartment_id ocid1.compartment.xxx --defined_tags '{"key1":"value1", ...}' --freeform_tags '{"key1":"value1", ...}'
191
+
```
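
Quoting the JSON tag values by hand is error-prone, so the command can also be assembled programmatically. The sketch below uses only the flags shown in the command above and the Python standard library; the function name and the example values in the usage section are hypothetical, and the tag dictionaries are passed through as-is.

```python
import json
import shlex


def build_register_command(model, os_path, compartment_id, defined_tags, freeform_tags):
    """Build the `ads aqua model register` command line as a single shell-safe string."""
    args = [
        "ads", "aqua", "model", "register",
        "--model", model,
        "--os_path", os_path,
        "--download_from_hf", "True",
        "--compartment_id", compartment_id,
        # json.dumps produces the JSON text; shlex.join adds the shell quoting.
        "--defined_tags", json.dumps(defined_tags),
        "--freeform_tags", json.dumps(freeform_tags),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    print(build_register_command(
        model="<model-ocid>",
        os_path="<oss-path>",
        compartment_id="ocid1.compartment.xxx",
        defined_tags={"key1": "value1"},
        freeform_tags={"key1": "value1"},
    ))
```

The printed string can be pasted into a terminal in the notebook session.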
192
+
#### List Models
193
+
If you see an authorization error related to listing, creating, or registering models, ensure that the policy below is in place.
194
+
```
195
+
Allow dynamic-group <Your dynamic group> to manage data-science-models in compartment <your-compartment-name>
196
+
```
197
+
198
+
#### Create Model Deployment
199
+
#### List Model Deployment
200
+
If you see an authorization error related to creating, listing, or managing model deployments, ensure that the policy below is in place.
201
+
```
202
+
Allow dynamic-group <Your dynamic group> to manage data-science-model-deployments in compartment <your-compartment-name>
203
+
```
204
+
205
+
#### Create Model Version Sets
206
+
#### List Model Version Sets
207
+
If you are unable to create a model version set or fetch model version set information during the fine tuning or evaluation step, ensure that the policy below is in place.
208
+
```
209
+
Allow dynamic-group <Your dynamic group> to manage data-science-modelversionsets in compartment <your-compartment-name>
210
+
```
211
+
212
+
#### Create Job
213
+
Unable to create a job during evaluation or fine tuning. Ensure the policy below is in place.
214
+
```
215
+
Allow dynamic-group <Your dynamic group> to manage data-science-jobs in compartment <your-compartment-name>
216
+
```
217
+
218
+
#### Create Job Run
219
+
Unable to create a job run during fine tuning or evaluation. Ensure the policy below is in place.
220
+
```
221
+
Allow dynamic-group <Your dynamic group> to manage data-science-job-runs in compartment <your-compartment-name>
222
+
```
223
+
224
+
#### List Log Groups
225
+
If the dropdown for log group or log does not show anything and gives an authorization error, ensure that the policy below is in place.
226
+
```
227
+
Allow dynamic-group <Your dynamic group> to use logging-family in compartment <your-compartment-name>
228
+
```
229
+
#### List Data Science Private Endpoints
230
+
If the UI shows an authorization error and does not list the private endpoints in the specified compartment, ensure that the policy below is in place.
231
+
```
232
+
Allow dynamic-group <Your dynamic group> to use virtual-network-family in compartment <your-compartment-name>
233
+
```
234
+
235
+
#### Get Namespace
236
+
If the UI is unable to fetch the namespace or list object storage buckets, ensure that the policies below are in place.
237
+
```
238
+
Allow dynamic-group <Your dynamic group> to read buckets in compartment <your-compartment-name>
239
+
Allow dynamic-group <Your dynamic group> to read objectstorage-namespaces in compartment <your-compartment-name>
240
+
```
241
+
242
+
#### Put Object
243
+
If an object storage bucket (with Object Versioning enabled) cannot be accessed, ensure that these policies are in place.
244
+
```
245
+
Allow dynamic-group <Your dynamic group> to manage object-family in compartment <your-compartment-name> where any {target.bucket.name='<your-bucket-name>'}
246
+
Allow dynamic-group <Your dynamic group> to read buckets in compartment <your-compartment-name>
247
+
Allow dynamic-group <Your dynamic group> to read objectstorage-namespaces in compartment <your-compartment-name>
248
+
```
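
Before debugging policies, it can help to confirm that the bucket really has versioning enabled. The sketch below assumes the OCI Python SDK (`oci`) is installed and that `client` is an `oci.object_storage.ObjectStorageClient`; the helper name is hypothetical, and how you authenticate (config file or resource principal) depends on your environment.

```python
def bucket_versioning_enabled(client, bucket_name: str) -> bool:
    """Return True if the bucket reports object versioning as enabled.

    `client` is assumed to behave like oci.object_storage.ObjectStorageClient.
    """
    # The namespace is required for every Object Storage bucket call.
    namespace = client.get_namespace().data
    bucket = client.get_bucket(namespace_name=namespace, bucket_name=bucket_name).data
    # The versioning field is expected to be "Enabled", "Suspended", or "Disabled".
    return bucket.versioning == "Enabled"


# Example usage (assumes an OCI config file at the default location):
#   import oci
#   client = oci.object_storage.ObjectStorageClient(oci.config.from_file())
#   print(bucket_versioning_enabled(client, "<your-bucket-name>"))
```

If this call itself fails with an authorization error, the `read buckets` and `read objectstorage-namespaces` policies above are the first thing to check.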
249
+
250
+
#### List Buckets
251
+
If the UI is unable to list buckets, ensure the following:
252
+
- If using custom networking, configure a NAT gateway and a service gateway (SGW)
253
+
- Ensure the policy below is in place
254
+
```
255
+
Allow dynamic-group <Your dynamic group> to read buckets in compartment <your-compartment-name>
256
+
```
257
+
258
+
#### Update Model
259
+
If an error occurs when submitting the UI form while creating a fine-tuned model deployment, add the following policy.
260
+
```
261
+
Allow dynamic-group <Your dynamic group> to use tag-namespaces in tenancy
262
+
```
263
+
264
+
#### Evaluation and Fine Tuning
265
+
1. Unable to fetch model details for fine tuned models
153
266
```
154
267
Allow dynamic-group <Your dynamic group> to manage data-science-models in compartment <your-compartment-name>
155
268
```
156
-
6. Unable to create a model version set or not able to fetch model version set information during fine tuning or evaluation step -
157
-
```
158
-
Allow dynamic-group <Your dynamic group> to manage data-science-modelversionsets in compartment <your-compartment-name>
159
-
```
160
-
7. Unable to fetch resource limits information where you select shape -
269
+
2. Unable to fetch resource limits information when selecting instance shape -
161
270
```
162
271
Allow dynamic-group <Your dynamic group> to read resource-availability in compartment <your-compartment-name>
163
272
```
164
-
8. The dropdown for log group or log does not show anything and gives authorization error -
165
-
```
166
-
Allow dynamic-group <Your dynamic group> to use logging-family in compartment <your-compartment-name>
167
-
```
168
-
9. Unable to list any VCN or subnet while creating Fine Tuning job or Evaluation Job -
273
+
3. Unable to list any VCN or subnet while creating Fine Tuning job or Evaluation Job -
169
274
```
170
275
Allow dynamic-group <Your dynamic group> to use virtual-network-family in compartment <your-compartment-name>
171
276
```
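
Since every statement in this guide follows the same pattern, the placeholders can be filled in programmatically when policies are set up manually. The sketch below is a plain string-templating helper, not part of AI Quick Actions or the ORM stack: the function name is hypothetical, and the verb and resource pairs are copied from the policy statements in this guide.

```python
# (verb, resource) pairs taken from the policy statements in this guide.
COMPARTMENT_POLICIES = [
    ("manage", "data-science-models"),
    ("manage", "data-science-model-deployments"),
    ("manage", "data-science-modelversionsets"),
    ("manage", "data-science-jobs"),
    ("manage", "data-science-job-runs"),
    ("use", "logging-family"),
    ("use", "virtual-network-family"),
    ("read", "buckets"),
    ("read", "objectstorage-namespaces"),
    ("read", "resource-availability"),
]


def render_policies(dynamic_group, compartment, bucket=None):
    """Return the policy statements with the placeholders filled in."""
    statements = [
        f"Allow dynamic-group {dynamic_group} to {verb} {resource} in compartment {compartment}"
        for verb, resource in COMPARTMENT_POLICIES
    ]
    # Tag namespaces are granted at the tenancy level, as in the Update Model policy.
    statements.append(f"Allow dynamic-group {dynamic_group} to use tag-namespaces in tenancy")
    if bucket:
        # Bucket-scoped access used when registering models.
        statements.append(
            f"Allow dynamic-group {dynamic_group} to manage object-family in compartment "
            f"{compartment} where any {{target.bucket.name='{bucket}'}}"
        )
    return statements


if __name__ == "__main__":
    # The group, compartment, and bucket names are placeholders.
    for line in render_policies("my-dynamic-group", "my-compartment", bucket="my-bucket"):
        print(line)
```

The ORM option remains the recommended way to set up policies; a helper like this is only useful for reviewing or comparing what should be in place.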
172
-
10. Authorization error related to listing, creating or managing model deployments -
173
-
```
174
-
Allow dynamic-group <Your dynamic group> to manage data-science-model-deployments in compartment <your-compartment-name>
175
-
```
176
-
11. Allowing AI Quick Actions to use defined tags -
177
-
```
178
-
Allow dynamic-group <Your dynamic group> to use tag-namespaces in tenancy
179
-
```
180
-
12. Unable to create finetuning or evaluation jobs - create_job
181
-
```
182
-
Allow dynamic-group <Your dynamic group> to manage data-science-jobs in compartment <your-compartment-name>
183
-
Allow dynamic-group <Your dynamic group> to manage data-science-job-runs in compartment <your-compartment-name>