Problem description
Background
React Native ExecuTorch today provides first‑class hooks for pure‑vision models (e.g. useClassification, useObjectDetection, useOCR) and pure‑language models (useLLM), but there is no built‑in way to load and run a single multimodal ("vision‑language") checkpoint such as LLaVA‑1.5 or BLIP‑2.
Why Multimodal?
- LLaVA, BLIP‑2, Flamingo and similar models can ingest images + text prompts and produce visually grounded responses.
- Enabling this on‑device in React Native would unlock powerful offline scenarios (e.g. photo Q&A, visual assistants) with privacy and low latency.
Proposed solution
A new hook, e.g.:

const { result, generate, isLoading } = useVisionLLM({
  modelSource: require('../assets/llava-1.5.pte'),
  tokenizerSource: require('../assets/llava-tokenizer.json'),
});
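For illustration, a rough usage sketch of how this could look inside a screen. This is only a sketch under assumptions: useVisionLLM does not exist yet, and the generate(image, prompt) signature, the PhotoQA component, and the imageUri prop are hypothetical placeholders, not library API.

import React from 'react';
import { Button, Text, View } from 'react-native';
// Hypothetical import: useVisionLLM is the hook proposed above, not an existing export
import { useVisionLLM } from 'react-native-executorch';

export function PhotoQA({ imageUri }: { imageUri: string }) {
  // Same hook shape as in the proposal above
  const { result, generate, isLoading } = useVisionLLM({
    modelSource: require('../assets/llava-1.5.pte'),
    tokenizerSource: require('../assets/llava-tokenizer.json'),
  });

  const ask = () => {
    // Assumed signature generate(image, prompt); the exact API is open for discussion
    generate(imageUri, 'What is in this photo?');
  };

  return (
    <View>
      <Button title="Ask" disabled={isLoading} onPress={ask} />
      <Text>{result}</Text>
    </View>
  );
}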
Alternative solutions
Or extend useLLM to accept an image tensor:
const { result, isReady } = useLLM({
  modelSource: require('…/llava.pte'),
  imageInput: myImageTensor,
  prompt: "Describe what you see",
});
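One trade-off between the two shapes worth noting: extending useLLM as above fixes the image (and prompt) at hook-configuration time, whereas a dedicated useVisionLLM could accept the image per generate() call, which fits camera/gallery flows where images arrive at runtime. A rough TypeScript sketch of the two hypothetical option shapes, purely to make the comparison concrete (all type names are placeholders, not existing library types):

// Placeholder: a model/tokenizer source could be a remote URL or a local require(...) asset
type ResourceSource = string | number;

// Dedicated hook: the image is supplied later, per generation call
type UseVisionLLMOptions = {
  modelSource: ResourceSource;
  tokenizerSource: ResourceSource;
};

// Extended useLLM: image (and prompt) are fixed when the hook is configured
type ExtendedUseLLMOptions = {
  modelSource: ResourceSource;
  tokenizerSource?: ResourceSource;
  imageInput?: unknown; // tensor / pixel-buffer representation to be decided
  prompt?: string;
};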
Benefits to React Native ExecuTorch
- One unified multimodal hook removes boilerplate for separately running useImageEmbeddings + useLLM.
- Enables richer on‑device AI experiences (visual QA, instruction following, AR captions) in pure React Native apps.
Additional context
No response