When running this snippet from HuggingFace:
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch
model_id = "google/paligemma-3b-mix-224"
device = "cuda:0"
dtype = torch.bfloat16
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
    revision="bfloat16",
).eval()
processor = AutoProcessor.from_pretrained(model_id)
# Instruct the model to create a caption in Spanish
prompt = "caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
I get the following errors:
You are passing both text and images to PaliGemmaProcessor. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add <image> tokens in the very beginning of your text and <bos> token after that. For this call, we will infer how many images each text has and add special tokens.
and:
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same
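The first message is only a warning about missing image placeholder tokens in the prompt; the RuntimeError is the actual failure and points at a dtype mismatch between the float32 pixel values produced by the processor and the bfloat16 model weights. Below is a minimal sketch of a possible workaround, reusing model, processor, image, and dtype from the snippet above, and assuming the warning's advice to prepend an <image> token applies and that the mismatch can be resolved by casting the image tensor to the model's dtype (not a confirmed fix):

# Prepend an <image> placeholder, as the warning recommends
# (assumption: one <image> token per image, at the start of the prompt).
prompt = "<image>caption es"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Cast the image tensor to the model's dtype to avoid the float32 vs. bfloat16
# mismatch (assumption: pixel_values is returned as float32 by the processor).
model_inputs["pixel_values"] = model_inputs["pixel_values"].to(dtype)

input_len = model_inputs["input_ids"].shape[-1]
with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    decoded = processor.decode(generation[0][input_len:], skip_special_tokens=True)
    print(decoded)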