Scrape websites, add them to Supabase with LLM-generated tags and metadata, and then use agentic RAG for Q&A

discopops/agentic-scrape-and-qa

Agentic Scrape and QA

A tool for scraping documentation websites and performing intelligent Q&A using agentic RAG (Retrieval-Augmented Generation).

Features

  • Website Crawling: Automatically crawls documentation websites, with support for sitemap.xml
  • Semantic Chunking: Intelligently splits content into meaningful chunks while preserving context
  • Rich Metadata: Extracts and stores metadata like topics, technologies, and content types
  • Vector Search: Uses OpenAI embeddings for semantic search
  • Agentic RAG: Leverages LLMs for intelligent question answering with context
  • Source Management: Manage multiple documentation sources independently
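As an illustration of the chunking idea listed above, here is a minimal sketch of a splitter that breaks text on blank lines but never splits inside a fenced code block. It is not taken from this repository: the function name `chunk_text` and the `max_chars` parameter are hypothetical, and the project's actual chunker may work differently.

```python
from typing import List

FENCE = "`" * 3  # marker that opens and closes a fenced code block


def chunk_text(text: str, max_chars: int = 1000) -> List[str]:
    """Split text into chunks on blank lines, keeping fenced code blocks whole."""
    # Pass 1: group lines into blocks separated by blank lines,
    # treating everything between two fences as a single block.
    blocks: List[str] = []
    current: List[str] = []
    in_code = False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_code = not in_code
        if line.strip() == "" and not in_code:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))

    # Pass 2: pack consecutive blocks into chunks of at most max_chars.
    # A single block longer than max_chars is kept intact as its own chunk.
    chunks: List[str] = []
    buf = ""
    for block in blocks:
        candidate = block if not buf else buf + "\n\n" + block
        if buf and len(candidate) > max_chars:
            chunks.append(buf)
            buf = block
        else:
            buf = candidate
    if buf:
        chunks.append(buf)
    return chunks
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps each chunk readable on its own, which tends to help both embedding quality and the answers built from the retrieved chunks.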

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/agentic-scrape-and-qa.git
cd agentic-scrape-and-qa
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up your environment variables by copying the example:
cp .env.example .env
  4. Edit .env with your API keys and configure your Supabase database

  5. Run the SQL setup script in your Supabase database (site_pages.sql)
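Before running the program, it can help to confirm that the keys from .env are actually visible to Python. The sketch below assumes the variables are named `OPENAI_API_KEY`, `SUPABASE_URL`, and `SUPABASE_KEY`; those names are a guess, so check .env.example for the real ones. The helper `missing_env_vars` is hypothetical and not part of this repository.

```python
import os
from typing import List, Mapping, Optional

# Assumed variable names; adjust to match .env.example.
REQUIRED_VARS = ["OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_KEY"]


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```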

Usage

Run the program:

python agentic_rag.py

Available options:

  1. Crawl a New Website

    • Enter the base URL (e.g., https://docs.example.com)
    • Provide a unique identifier for this documentation
    • The system crawls the pages, extracts their content, and stores it with metadata
  2. Q&A on Existing Documentation

    • Select from available documentation sets
    • Ask questions naturally
    • Get context-aware responses with source URLs
  3. Manage Documentation Sets

    • View all stored documentation sets
    • Delete specific sets when needed
    • Clean up outdated content
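For the crawl option above, sitemap support could look roughly like the following sketch, which extracts page URLs from the text of a sitemap.xml file using only the standard library. The function name `urls_from_sitemap` is hypothetical, fetching the file over HTTP is left out, and the repository's own crawler may discover pages differently.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

When the file is a sitemap index, the returned URLs point to further sitemaps rather than to pages, so a crawler would need to fetch and parse those in turn.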

Technical Details

  • Uses OpenAI embeddings for semantic search
  • Stores content and metadata in Supabase
  • Implements vector similarity search
  • Preserves code blocks and formatting
  • Handles pagination and rate limiting
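To make the vector similarity search concrete, here is a small in-memory sketch that ranks stored chunks by cosine similarity to a query embedding. In this project the search presumably runs inside the Supabase database, so this only illustrates the ranking idea. The names `cosine_similarity` and `top_k` are hypothetical, and the two-dimensional vectors in the usage lines stand in for real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (url, score) pairs most similar to the query, best first."""
    scored = [(url, cosine_similarity(query, vec)) for url, vec in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy data: URLs paired with made-up 2-dimensional "embeddings".
    docs = [
        ("https://docs.example.com/a", [1.0, 0.0]),
        ("https://docs.example.com/b", [0.0, 1.0]),
    ]
    print(top_k([1.0, 0.1], docs, k=1))
```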

Requirements

  • Python 3.8+
  • OpenAI API key
  • Supabase account
  • Packages listed in requirements.txt
