This prototype accepts an image or a video, detects human keypoints, runs object detection, generates zoomed-in captions for each detected region, and returns a 20-point descriptive summary.
Features
- Accepts images or videos of any size
- Extracts human keypoints (MediaPipe Holistic)
- Runs object detection (Ultralytics YOLO)
- Produces global caption and zoomed-in captions (BLIP)
- Returns 20 numbered descriptive points
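The sketch below shows one way these pieces could fit together for a single frame. It is illustrative only: the model names (yolov8n.pt, Salesforce/blip-image-captioning-base) and the function names are assumptions, and the real logic lives in processing.py.

```python
# Minimal, hypothetical sketch of the per-frame analysis pipeline.
import cv2
import mediapipe as mp
from PIL import Image
from ultralytics import YOLO
from transformers import BlipProcessor, BlipForConditionalGeneration

yolo = YOLO("yolov8n.pt")  # object detector; weights download on first use
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
holistic = mp.solutions.holistic.Holistic(static_image_mode=True)

def caption(pil_image):
    """Generate a BLIP caption for a PIL image."""
    inputs = blip_processor(images=pil_image, return_tensors="pt")
    out = blip_model.generate(**inputs, max_new_tokens=30)
    return blip_processor.decode(out[0], skip_special_tokens=True)

def analyze_frame(bgr_frame):
    """Run keypoints, detection, and global + zoomed-in captions on one BGR frame."""
    rgb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB)
    keypoints = holistic.process(rgb)          # pose / face / hand landmarks
    detections = yolo(bgr_frame)[0]            # YOLO results for the frame
    global_caption = caption(Image.fromarray(rgb))
    zoomed_captions = []
    for x1, y1, x2, y2 in detections.boxes.xyxy.int().tolist():
        zoomed_captions.append(caption(Image.fromarray(rgb[y1:y2, x1:x2])))
    return keypoints, detections, global_caption, zoomed_captions
```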
Requirements
- Python 3.9+
- NVIDIA GPU recommended for performance but not required
Quick start (local)
- Create and activate a virtualenv:
  python -m venv .venv
  source .venv/bin/activate   # macOS / Linux
  .venv\Scripts\activate      # Windows
- Install dependencies:
  pip install -r requirements.txt
- Run the app:
  uvicorn app:app --reload --host 0.0.0.0 --port 8000
- Open http://localhost:8000/ and upload an image or a video.
Notes
- Models (BLIP, YOLO) download weights on first run; this may take time.
- If you want to avoid heavy models, you can disable object detection or captioning in processing.py.
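For example, such toggles might look like the following near the top of processing.py; the actual flag names in this repository may differ.

```python
# Hypothetical feature flags -- check processing.py for the real names.
ENABLE_OBJECT_DETECTION = True   # set False to skip YOLO and its weight download
ENABLE_CAPTIONING = True         # set False to skip BLIP captions
```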
API
- POST /analyze (multipart form) with a field file containing an image or a video.
- Returns JSON:
  {
    "keypoints": {...},
    "objects": [...],
    "global_caption": "...",
    "zoomed_captions": [...],
    "description_points": ["1. ...", "... up to 20"]
  }
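A minimal client call might look like this, assuming the server is running locally on port 8000; sample.jpg is just a placeholder filename.

```python
# Hypothetical client example for POST /analyze.
import requests

with open("sample.jpg", "rb") as f:
    resp = requests.post("http://localhost:8000/analyze", files={"file": f})
print(resp.json()["description_points"])
```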