A full-stack AI-powered application that lets users upload a video and ask context-aware questions about it using CLIP, FAISS, and LLaVA via the Segmind API.
- Upload a Video via a React + Tailwind frontend.
- Frame Extraction: Key frames are extracted using OpenCV.
- Embedding: Frames are embedded using Hugging Face’s CLIP model.
- Indexing: Embeddings are stored and queried with FAISS.
- Questioning: User questions are semantically matched to the most relevant frames.
- Answering: Segmind's LLaVA API generates answers using the retrieved context. (Each stage of this pipeline is sketched in the Python examples below.)
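The sketches below walk through the pipeline in Python. First, frame extraction: a minimal version that samples one frame every `interval` frames with OpenCV (the sampling interval and key-frame strategy here are assumptions; the repo's actual selection logic may differ).

```python
import cv2

def extract_frames(video_path: str, interval: int = 30) -> list:
    """Sample one frame every `interval` frames from the video."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            frames.append(frame)  # BGR numpy array
        idx += 1
    cap.release()
    return frames
```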
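Next, embedding and indexing: the sampled frames are encoded with the Hugging Face CLIP model and added to a FAISS index. Using `IndexFlatIP` over L2-normalised vectors (i.e. cosine similarity) is an assumption, not necessarily the index type the repo uses.

```python
import cv2
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def embed_frames(frames) -> np.ndarray:
    """Encode OpenCV BGR frames into L2-normalised CLIP image embeddings."""
    images = [Image.fromarray(cv2.cvtColor(f, cv2.COLOR_BGR2RGB)) for f in frames]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.numpy().astype("float32")

frames = extract_frames("video.mp4")
embeddings = embed_frames(frames)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on unit vectors
index.add(embeddings)
```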
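Finally, retrieval and answering: the question is embedded with CLIP's text encoder (the same vector space as the frames), the nearest frame is pulled from the FAISS index, and both are sent to Segmind's LLaVA API. The endpoint URL, header, and payload field names below are assumptions; consult Segmind's API documentation for the exact contract.

```python
import base64
import os

import cv2
import requests
import torch

def answer_question(question, frames, index, model, processor, k=1):
    """Retrieve the frame most relevant to `question` and ask LLaVA about it."""
    # Embed the question with CLIP's text encoder.
    inputs = processor(text=[question], return_tensors="pt", padding=True)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = (q / q.norm(dim=-1, keepdim=True)).numpy().astype("float32")
    _, ids = index.search(q, k)

    # Encode the best-matching frame as base64 JPEG for the HTTP request.
    _, buf = cv2.imencode(".jpg", frames[ids[0][0]])
    image_b64 = base64.b64encode(buf).decode("utf-8")

    # Hypothetical Segmind LLaVA call: endpoint and payload fields are
    # assumptions, check Segmind's API docs for the real contract.
    resp = requests.post(
        "https://api.segmind.com/v1/llava-13b",
        headers={"x-api-key": os.environ["SEGMIND_API_KEY"]},
        json={"prompt": question, "images": image_b64},
    )
    resp.raise_for_status()
    return resp.text
```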
- CLIP (Hugging Face `openai/clip-vit-large-patch14`)
- FAISS (vector similarity search)
- Segmind LLaVA API
- OpenCV (frame extraction)
- Python, PIL, Torch
- React + Tailwind CSS
- Fetch API (HTTP-based interaction)
- Python ≥ 3.9
- Node.js ≥ 18
- Segmind LLaVA API Key
```bash
# Clone the repo
git clone https://github.com/coderuhaan2004/VideoQA.git
cd VideoQA/backend

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt
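
# Provide your Segmind API key (the variable name SEGMIND_API_KEY is
# illustrative; match whatever app.py actually reads)
export SEGMIND_API_KEY="your-api-key"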

# Run the server
python app.py
```
```bash
# Navigate to the frontend
cd VideoQA/chatbot

# Install dependencies
npm install

# Run the dev server
npm run dev
```