This project implements a Retrieval-Augmented Generation (RAG) pipeline for efficient document processing and knowledge retrieval. It extracts text and tables from PDFs using the Unstructured library, stores the raw PDFs in Redis, and embeds the extracted content, indexing the embeddings in PGVector for semantic search. A MultiVector Retriever fetches relevant context, which is then passed to an LLM (a GPT model) to answer queries.
- Unstructured Document Processing: Extracts text and tables from PDFs.
- Redis for Raw Storage: Provides persistent storage for raw PDFs with fast retrieval.
- PGVector for Vector Storage: Indexes and retrieves high-dimensional embeddings for similarity search.
- MultiVector Retriever: Links small indexed chunks back to their parent documents, so searches match fine-grained embeddings but return full document context.
- LLM Integration: Uses a GPT model to generate responses based on retrieved context.
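The multi-vector pattern above can be illustrated with a toy, dependency-free sketch: small child chunks are indexed for similarity search, but a hit maps back (via a parent document id) to the full document in a separate docstore. In the real pipeline the index is PGVector and the docstore is Redis; the bag-of-words "embeddings" and in-memory dicts below are stand-ins for illustration only.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: bag-of-words counts instead of a real model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Docstore: parent doc id -> full document (Redis in the real system).
docstore = {
    "doc-1": "Full text of the quarterly revenue report, including tables.",
    "doc-2": "Full text of the employee onboarding handbook.",
}

# Vector index: (chunk embedding, parent doc id) pairs
# (PGVector in the real system).
index = [
    (embed("quarterly revenue grew by ten percent"), "doc-1"),
    (embed("revenue tables for each region"), "doc-1"),
    (embed("onboarding steps for new employees"), "doc-2"),
]

def retrieve(query: str) -> str:
    q = embed(query)
    _, doc_id = max(index, key=lambda pair: cosine(q, pair[0]))
    # Return the *parent* document, not the matched chunk.
    return docstore[doc_id]

print(retrieve("what was the revenue?"))
```

The key design point is the indirection: embedding small chunks improves match precision, while returning the parent document gives the LLM enough surrounding context to answer from.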
- Python
- unstructured
- pgvector
- redis
- langchain
- openai
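Assuming pip, the dependencies above can be installed with:

```shell
# PDF parsing in unstructured may additionally require the "pdf" extra,
# i.e. pip install "unstructured[pdf]"
pip install unstructured pgvector redis langchain openai
```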
- Redis: For raw PDF storage
- PostgreSQL + PGVector: For embedding storage and similarity search
- LLM: A GPT model (via the OpenAI API or a local model)
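A minimal environment sketch for wiring these services together; the variable names and connection strings below are illustrative assumptions, not names the code necessarily reads.

```shell
# Illustrative environment variables (names are assumptions; adjust to your setup).
export OPENAI_API_KEY="<your-api-key>"         # OpenAI API credentials
export REDIS_URL="redis://localhost:6379/0"    # raw PDF docstore
export PGVECTOR_URL="postgresql://user:pass@localhost:5432/rag"  # embeddings database
```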