Sagemaker endpoint doesn't use GPU (instance ml.g4dn.xlarge) #117

Open
jypucca opened this issue Apr 18, 2024 · 1 comment

jypucca commented Apr 18, 2024

I've spent the whole day trying to deploy a custom HF model to a SageMaker endpoint and make sure it uses the GPU, with no luck so far. I'm hoping to get some insight here.

Here's my script for the model deployment:

from sagemaker.huggingface import HuggingFaceModel

# GPU inference image: PyTorch 1.13.1, transformers 4.26.0, CUDA 11.7
img_url_old_lib = '763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:1.13.1-transformers4.26.0-gpu-py39-cu117-ubuntu20.04'

huggingface_model = HuggingFaceModel(
    model_data=model_uri,        # S3 URI of the packaged model artifact
    role=role,                   # SageMaker execution role
    source_dir='code',
    entry_point='inference.py',
    name='hf-inference-1-13-gpu',
    image_uri=img_url_old_lib
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,        # number of instances
    instance_type='ml.g4dn.xlarge',  # has 1 GPU
    endpoint_name='hf-inference-1-13-gpu',
)

And here's my code/inference.py script:

import io
import os

import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def model_fn(model_dir, context=None):
    """
    Load the model for inference
    """
    model_path = os.path.join(model_dir, 'model/')

    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    print("Loaded Processor")

    model = VisionEncoderDecoderModel.from_pretrained(model_path)
    print("Loaded Model")

    # Move the model to the GPU if one is visible to the container
    model_dict = {'model': model.to(device), 'processor': processor}

    return model_dict

def predict_fn(images, model, context=None):
    """
    Apply model to the incoming request
    """
    images = [Image.open(io.BytesIO(content)) for content in images]
    print("Opened Image")

    processor = model['processor']
    model = model['model']

    pixel_values = processor(images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)
    generated_ids = model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
    print("Generated Text: " + str(generated_text))

    return generated_text

I've read the threads here and here and followed the suggestions made by @philschmid: I tried changing the transformers_version argument, but the endpoint still doesn't use the GPU (see the screenshot below). I also tested the model in a SageMaker notebook on the same GPU instance type (ml.g4dn.xlarge) and can confirm the inference code does use the GPU as expected, so I'm not sure why it doesn't when deployed to an endpoint with this Docker image.
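
For reference, one of the variants I tried was letting the SDK resolve the container from framework versions instead of passing image_uri explicitly, roughly like this (the exact versions are just an example matching the image above):

huggingface_model = HuggingFaceModel(
    model_data=model_uri,
    role=role,
    source_dir='code',
    entry_point='inference.py',
    transformers_version='4.26',  # example versions matching the 1.13 / 4.26 DLC above
    pytorch_version='1.13',
    py_version='py39',
)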

I'd appreciate any help on this, thanks!
[Screenshot: GPU monitor showing no GPU utilization on the endpoint]

jypucca (Author) commented Apr 18, 2024

I figured it out: I needed to pass the env parameter.
With something like the following, it now runs on the GPU!

env = {'HF_TASK': 'image-to-text'}

huggingface_model = HuggingFaceModel(
    model_data=model_uri,
    env=env,
    source_dir='./code',
    entry_point='inference.py',
    role=role,
    name='hf-inference-2-1-gpu-v4',
    image_uri=img_url
)
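
The deploy call itself is unchanged from the first script; for completeness, roughly:

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge',            # 1 GPU
    endpoint_name='hf-inference-2-1-gpu-v4',   # endpoint name here is illustrative
)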
