A tool for scraping documentation websites and performing intelligent Q&A using agentic RAG (Retrieval-Augmented Generation).
- Website Crawling: Automatically crawls documentation websites, with support for sitemap.xml
- Semantic Chunking: Intelligently splits content into meaningful chunks while preserving context (a minimal sketch follows this list)
- Rich Metadata: Extracts and stores metadata like topics, technologies, and content types
- Vector Search: Uses OpenAI embeddings for semantic search
- Agentic RAG: Leverages LLMs for intelligent question answering with context
- Source Management: Manage multiple documentation sources independently
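
The semantic chunking above can be pictured with a minimal sketch: pack paragraphs into chunks up to a size limit and never cut inside a fenced code block. The function name and the 1500-character limit are illustrative, not this project's actual API.

```python
# Illustrative chunker: packs paragraphs up to a size limit and keeps
# fenced code blocks intact. Name and limit are assumptions, not the
# project's real interface.
def chunk_text(text, max_chars=1500):
    chunks, current, in_code = [], "", False
    for block in text.split("\n\n"):
        # Start a new chunk when this block would overflow the current one,
        # but never while inside an unclosed code fence.
        if current and not in_code and len(current) + len(block) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += block + "\n\n"
        if block.count("```") % 2 == 1:
            in_code = not in_code  # an odd number of fences toggles state
    if current.strip():
        chunks.append(current.strip())
    return chunks
```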
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/agentic-scrape-and-qa.git
  cd agentic-scrape-and-qa
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Set up your environment variables by copying the example:
  ```bash
  cp .env.example .env
  ```
- Edit `.env` with your API keys and configure your Supabase database (a sample sketch follows these steps)
- Run the SQL setup script (`site_pages.sql`) in your Supabase database
- Run the program:
  ```bash
  python agentic_rag.py
  ```
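
A sample `.env` might look like the following; the variable names are assumptions based on the services this tool uses, so defer to `.env.example` for the actual keys:

```
# Hypothetical variable names -- check .env.example for the real ones
OPENAI_API_KEY=sk-...
SUPABASE_URL=https://your-project.supabase.co
SUPABASE_SERVICE_KEY=your-service-role-key
```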
- Crawl a New Website (sketched after this list)
  - Enter the base URL (e.g., https://docs.example.com)
  - Provide a unique identifier for this documentation set
  - The system crawls the pages, extracts their content, and stores it with metadata
- Q&A on Existing Documentation (sketched after this list)
  - Select from the available documentation sets
  - Ask questions in natural language
  - Get context-aware answers with source URLs
- Manage Documentation Sets (sketched after this list)
  - View all stored documentation sets
  - Delete specific sets when needed
  - Clean up outdated content
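
To make the crawl workflow concrete, here is a minimal sketch of sitemap-based URL discovery using `requests` and the standard library; it illustrates the approach, not this tool's internals:

```python
import requests
import xml.etree.ElementTree as ET

def discover_urls(base_url):
    """Return page URLs from sitemap.xml, or just the base URL if absent."""
    resp = requests.get(base_url.rstrip("/") + "/sitemap.xml", timeout=10)
    if resp.status_code != 200:
        return [base_url]  # no sitemap: fall back to crawling from the root
    root = ET.fromstring(resp.content)
    # Sitemap entries are <url><loc>...</loc></url> in the sitemap namespace.
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    return [loc.text for loc in root.findall(".//sm:loc", ns)]
```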
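
The Q&A workflow follows a standard embed-search-answer loop. In the sketch below, the `match_site_pages` RPC, its parameters, and the model names are assumptions about what `site_pages.sql` and the tool configure, not confirmed details:

```python
import os
from openai import OpenAI
from supabase import create_client

client = OpenAI()  # reads OPENAI_API_KEY from the environment
supabase = create_client(os.environ["SUPABASE_URL"],
                         os.environ["SUPABASE_SERVICE_KEY"])

def answer(question, source):
    # 1. Embed the question (model choice is illustrative).
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    # 2. Vector-similarity search in Supabase; the RPC name and its
    #    parameters are assumptions about what site_pages.sql defines.
    rows = supabase.rpc("match_site_pages", {
        "query_embedding": emb,
        "match_count": 5,
        "source": source,
    }).execute().data
    context = "\n\n".join(row["content"] for row in rows)
    # 3. Ask the LLM to answer from the retrieved context only.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Answer using only this context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return chat.choices[0].message.content, [row["url"] for row in rows]
```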
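
Managing documentation sets reduces to querying and deleting rows by their source identifier. The table and column names below are assumptions based on the `site_pages.sql` setup script:

```python
# Sketch of source management against a hypothetical site_pages table
# whose metadata JSON column carries a "source" identifier.
def list_sources(supabase):
    rows = supabase.table("site_pages").select("metadata").execute().data
    return sorted({row["metadata"].get("source") for row in rows})

def delete_source(supabase, source):
    # "metadata->>source" is PostgREST syntax for filtering on a JSON field.
    supabase.table("site_pages").delete().eq(
        "metadata->>source", source
    ).execute()
```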
- Uses OpenAI embeddings for semantic search
- Stores content and metadata in Supabase
- Implements vector similarity search
- Preserves code blocks and formatting
- Handles pagination and rate limiting (a retry sketch follows this list)
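
Rate limiting can be handled with a politeness delay plus backoff on HTTP 429. A minimal illustrative helper (not this project's actual crawler code):

```python
import time
import requests

def polite_get(url, max_retries=3, delay=1.0):
    """GET with a politeness delay and exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        time.sleep(delay)  # fixed delay between requests to the same site
        resp = requests.get(url, timeout=10)
        if resp.status_code == 429:
            # Honor Retry-After if sent; otherwise back off 1s, 2s, 4s...
            time.sleep(float(resp.headers.get("Retry-After", 2 ** attempt)))
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```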
- Python 3.8+
- OpenAI API key
- Supabase account
- Packages listed in requirements.txt