# wiki-rag

A Python-based Retrieval-Augmented Generation (RAG) pipeline using Wikipedia as the knowledge base. This project demonstrates how to fetch, preprocess, embed, store, and query Wikipedia content to enhance the capabilities of a large language model (LLM).
## Table of Contents

- Project Overview
- Features
- Directory Structure
- Prerequisites
- Installation
- Configuration
- Usage
- Examples
- Contributing
- Roadmap
- License
## Project Overview

wiki-rag is a modular pipeline designed to:
- Fetch relevant Wikipedia articles via the MediaWiki API.
- Preprocess and chunk text for efficient embedding and retrieval.
- Embed content using your choice of embedding model (e.g., OpenAI embeddings).
- Index embeddings into a FAISS vector store for similarity search.
- Query the vector store and feed results into a Gemini-based LLM agent for concise, context-aware answers.
This setup enables a lightweight, production-ready RAG system using open, freely available wiki data.
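To make these stages concrete, below is a minimal, self-contained sketch of the embed → index → retrieve → answer loop (fetching and chunking are elided). The toy corpus, model names, and prompt are illustrative assumptions rather than the project's actual defaults, and it expects `OPENAI_API_KEY` and `GEMINI_API_KEY` in the environment:

```python
# Illustrative sketch only — not wiki-rag's actual API. Embeds a toy corpus,
# indexes it in FAISS, retrieves the closest chunks, and asks Gemini.
import os

import faiss
import numpy as np
import google.generativeai as genai
from openai import OpenAI

chunks = [
    "Quantum computing uses qubits, which can exist in superposition.",
    "Shor's algorithm factors integers efficiently on a quantum computer.",
    "Classical bits are always either 0 or 1.",
]

# 1. Embed the chunks (reads OPENAI_API_KEY from the environment).
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-ada-002", input=chunks)
vectors = np.array([d.embedding for d in resp.data], dtype="float32")

# 2. Index the vectors in FAISS (exact L2 search).
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# 3. Embed the query and retrieve the top-2 nearest chunks.
query = "How do quantum computers differ from classical ones?"
q = client.embeddings.create(model="text-embedding-ada-002", input=[query])
q_vec = np.array([q.data[0].embedding], dtype="float32")
_, ids = index.search(q_vec, 2)
context = "\n".join(chunks[i] for i in ids[0])

# 4. Have Gemini answer from the retrieved context only.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")
print(model.generate_content(f"Context:\n{context}\n\nQuestion: {query}").text)
```

In the full pipeline, the hardcoded chunks are replaced by fetched and preprocessed Wikipedia text, and the index is persisted to `VECTOR_STORE_PATH`.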
## Features

- 🔍 Wikipedia Fetcher: Interface with the MediaWiki API to retrieve pages, categories, or search results (a minimal fetch sketch follows this list).
- 🔧 Text Preprocessor: Clean, split, and organize raw text into chunks suitable for embedding.
- 🔢 Embeddings Module: Plug in any embedding provider (default: OpenAI).
- 💾 FAISS Vector Store: Fast, on-disk or in-memory vector indexing and similarity search.
- 🤖 LLM Agent: Use Google Gemini (or another LLM) to generate synthesized responses based on retrieved context.
- ⚙️ Configurable: Environment-driven settings for API keys, database URLs, chunk sizes, and more.
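As a taste of the fetcher, here is a small sketch that pulls a plain-text extract through the MediaWiki API with `requests`. The wrapper in `wikipedia_api/` may expose a different interface; `fetch_extract` is a hypothetical helper name:

```python
# Sketch of fetching article text via the public MediaWiki API;
# wiki-rag's own wrapper may differ.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_extract(title: str) -> str:
    """Return the plain-text extract of a Wikipedia page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    data = requests.get(API_URL, params=params, timeout=10).json()
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

print(fetch_extract("Quantum computing")[:300])
```

The same `api.php` endpoint also serves keyword search (`list=search`) and category listings (`list=categorymembers`).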
## Directory Structure

```text
wiki-rag/
├── database/          # Database connection and utilities
├── embedder/          # Embedding generation scripts
├── faiss_processor/   # FAISS index building and querying
├── gemini_agent/      # LLM agent orchestration
├── preprocessor/      # Text cleaning, chunking, and metadata
├── warehouse/         # Raw data storage (e.g., local cache of articles)
├── wikipedia_api/     # MediaWiki API wrappers
├── main.py            # Entry point: orchestrates the end-to-end pipeline or exposes an API
├── requirements.txt   # Python dependencies
└── .gitignore         # Files and directories to ignore
```
## Prerequisites

- Python ≥ 3.8
- pip or Poetry
- API keys:
  - `OPENAI_API_KEY` (for embeddings)
  - `GEMINI_API_KEY` (for LLM queries)
## Installation

- Clone the repo:

  ```bash
  git clone https://github.com/xreedev/wiki-rag.git
  cd wiki-rag
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Configuration

Create a `.env` file in the root directory (and ensure it is in `.gitignore`):

```env
OPENAI_API_KEY=your_openai_key
GEMINI_API_KEY=your_gemini_key

# Optional overrides
WIKI_API_URL=https://en.wikipedia.org/w/api.php
EMBED_MODEL=text-embedding-ada-002
CHUNK_SIZE=500
VECTOR_STORE_PATH=./faiss_index
```
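At runtime these values are presumably read from the environment; a minimal loading sketch with python-dotenv is below. The actual mechanism in wiki-rag may differ, and the fallback defaults are simply the values shown above:

```python
# Reading the settings above with python-dotenv (pip install python-dotenv).
# Variable names mirror the .env file; the fallback defaults are assumptions.
import os

from dotenv import load_dotenv

load_dotenv()  # loads .env from the current working directory

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # required
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]  # required
WIKI_API_URL = os.getenv("WIKI_API_URL", "https://en.wikipedia.org/w/api.php")
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-ada-002")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
VECTOR_STORE_PATH = os.getenv("VECTOR_STORE_PATH", "./faiss_index")
```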
## Usage

Run the pipeline end-to-end:
```bash
python main.py \
  --query "Evolution of quantum computing" \
  --top_k 5 \
  --source wiki
```

Or start an API server (if implemented in `main.py`):

```bash
uvicorn main:app --reload
```
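If you do serve the pipeline over HTTP, a minimal sketch of what `main:app` could look like with FastAPI follows. Here `answer_question` is a hypothetical stand-in for the retrieval-plus-Gemini call, and only the `/ask` route from the examples below is assumed:

```python
# A hypothetical shape for main:app — not necessarily how wiki-rag wires it.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="wiki-rag")

class AskRequest(BaseModel):
    question: str

def answer_question(question: str) -> str:
    # Placeholder: retrieve context from the FAISS index and call the
    # Gemini agent here.
    return f"(stub answer for: {question!r})"

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    return {"answer": answer_question(req.question)}
```

With this shape, the curl call in the Examples section returns a JSON body like `{"answer": "..."}`.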
## Examples

- Local Search:

  ```bash
  python main.py --query "Python concurrency" --mode local
  ```

- Server Mode:

  ```bash
  curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"question": "What is RAG?"}'
  ```
## Contributing

- Fork the repo.
- Create a feature branch:

  ```bash
  git checkout -b feature/awesome
  ```

- Commit your changes:

  ```bash
  git commit -m "Add awesome feature"
  ```

- Push:

  ```bash
  git push origin feature/awesome
  ```

- Open a pull request.
Please open issues for bugs or feature requests.
## Roadmap

- Add Docker support for easy deployment
- Implement automated tests (pytest)
- CI/CD pipeline (GitHub Actions)
- Support multi-language wiki sources
- Real-time streaming responses from LLM
## License

This project is licensed under the MIT License. See LICENSE for details.