Sagemaker endpoint doesn't use GPU (instance ml.g4dn.xlarge) #117

Open
@jypucca

Description

I have spent the whole day trying to deploy a custom HF model to a SageMaker endpoint and make sure it uses the GPU, with no luck so far, so I'm hoping to get some insight here.

Here's my script for the model deployment:

from sagemaker.huggingface import HuggingFaceModel

# GPU DLC image (PyTorch 1.13.1, Transformers 4.26.0, CUDA 11.7, Python 3.9)
img_url_old_lib = '763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04'

huggingface_model = HuggingFaceModel(
    model_data=model_uri,          # S3 URI of the packaged model artifact
    role=role,                     # SageMaker execution role
    source_dir='code',
    entry_point='inference.py',
    name='hf-inference-1-13-gpu',
    image_uri=img_url_old_lib
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,        # number of instances
    instance_type='ml.g4dn.xlarge',  # has 1 GPU (NVIDIA T4)
    endpoint_name='hf-inference-1-13-gpu',
)
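
For reference, this is roughly the equivalent call that lets the SDK resolve the container from version arguments instead of a hard-coded image_uri (the transformers_version variant I also mention trying further down; this is a sketch, and the version strings have to match a published HF DLC combination):

from sagemaker.huggingface import HuggingFaceModel

# Sketch: let the SageMaker SDK pick the DLC image from version arguments.
# The GPU vs. CPU image is then chosen based on the instance_type at deploy time.
huggingface_model = HuggingFaceModel(
    model_data=model_uri,
    role=role,
    source_dir='code',
    entry_point='inference.py',
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
)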

And here's my code/inference.py script:

import io
import os

import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def model_fn(model_dir, context=None):
    """
    Load the model for inference
    """

    model_path = os.path.join(model_dir, 'model/')

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    print("Loaded Processor")

    model = VisionEncoderDecoderModel.from_pretrained(model_path)
    print("Loaded Model")

    model_dict = {'model': model.to(device), 'processor': processor}

    return model_dict

def predict_fn(images, model, context=None):
    """
    Apply model to the incoming request
    """

    images = [Image.open(io.BytesIO(content)) for content in images]
    print("Opened Image")

    processor = model['processor']
    model = model['model']

    pixel_values = processor(images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)
    generated_ids = model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
    print("Generated Text: " + str(generated_text))

    return generated_text

I've read the threads here and here, and followed the suggestions made by @philschmid. I tried changing the transformers_version argument, but the endpoint still doesn't use the GPU (see pic below). I tested the model in a SageMaker notebook on the same GPU instance (ml.g4dn.xlarge) and can confirm the inference code does use the GPU there as expected. So I'm not sure why it doesn't use the GPU when it's deployed to the endpoint with the Docker image.
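
For debugging, a few extra log lines in model_fn should show in the endpoint's CloudWatch logs whether the container sees the GPU at all; a minimal sketch (the helper name and messages are my own, not part of the original inference.py):

import torch

# Sketch: log CUDA visibility so CloudWatch shows whether the container
# sees the GPU or the model silently fell back to CPU.
def log_gpu_visibility():
    print("torch.cuda.is_available():", torch.cuda.is_available())
    print("torch.cuda.device_count():", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("GPU 0:", torch.cuda.get_device_name(0))

Calling this at the top of model_fn and checking the endpoint logs would at least confirm whether the problem is CUDA not being visible inside the container or the model not being moved to the GPU.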

I'd appreciate any help on this, thanks!
[Screenshot: GPU monitor]
