# wiki-rag

A Python-based Retrieval-Augmented Generation (RAG) pipeline using Wikipedia as the knowledge base. This project demonstrates how to fetch, preprocess, embed, store, and query Wikipedia content to enhance the capabilities of a large language model (LLM).
## Table of Contents

- Project Overview
- Features
- Directory Structure
- Prerequisites
- Installation
- Configuration
- Usage
- Examples
- Contributing
- Roadmap
- License
## Project Overview

wiki-rag is a modular pipeline designed to:
- Fetch relevant Wikipedia articles via the MediaWiki API.
- Preprocess and chunk text for efficient embedding and retrieval.
- Embed content using your choice of embedding model (e.g., OpenAI embeddings).
- Index embeddings into a FAISS vector store for similarity search.
- Query the vector store and feed results into a Gemini-based LLM agent for concise, context-aware answers.
This setup enables a lightweight, production-ready RAG system using open, freely available wiki data.
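To make these stages concrete, below is a minimal, self-contained sketch of the embed → index → retrieve → answer loop (fetching and chunking are elided). The toy corpus, model names, and prompt are illustrative assumptions rather than the project's actual defaults, and it expects `OPENAI_API_KEY` and `GEMINI_API_KEY` in the environment:

```python
# Illustrative sketch only — not wiki-rag's actual API. Embeds a toy corpus,
# indexes it in FAISS, retrieves the closest chunks, and asks Gemini.
import os

import faiss
import numpy as np
import google.generativeai as genai
from openai import OpenAI

chunks = [
    "Quantum computing uses qubits, which can exist in superposition.",
    "Shor's algorithm factors integers efficiently on a quantum computer.",
    "Classical bits are always either 0 or 1.",
]

# 1. Embed the chunks (reads OPENAI_API_KEY from the environment).
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-ada-002", input=chunks)
vectors = np.array([d.embedding for d in resp.data], dtype="float32")

# 2. Index the vectors in FAISS (exact L2 search).
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# 3. Embed the query and retrieve the top-2 nearest chunks.
query = "How do quantum computers differ from classical ones?"
q = client.embeddings.create(model="text-embedding-ada-002", input=[query])
q_vec = np.array([q.data[0].embedding], dtype="float32")
_, ids = index.search(q_vec, 2)
context = "\n".join(chunks[i] for i in ids[0])

# 4. Have Gemini answer from the retrieved context only.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")
print(model.generate_content(f"Context:\n{context}\n\nQuestion: {query}").text)
```

In the full pipeline, the hardcoded chunks are replaced by fetched and preprocessed Wikipedia text, and the index is persisted to `VECTOR_STORE_PATH`.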
## Features

- 🔍 Wikipedia Fetcher: Interface with the MediaWiki API to retrieve pages, categories, or search results (a minimal fetch sketch follows this list).
- 🔧 Text Preprocessor: Clean, split, and organize raw text into chunks suitable for embedding.
- 🔢 Embeddings Module: Plug in any embedding provider (default: OpenAI).
- 💾 FAISS Vector Store: Fast, on-disk or in-memory vector indexing and similarity search.
- 🤖 LLM Agent: Use Google Gemini (or another LLM) to generate synthesized responses based on retrieved context.
- ⚙️ Configurable: Environment-driven settings for API keys, database URLs, chunk sizes, and more.
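As a taste of the fetcher, here is a small sketch that pulls a plain-text extract through the MediaWiki API with `requests`. The wrapper in `wikipedia_api/` may expose a different interface; `fetch_extract` is a hypothetical helper name:

```python
# Sketch of fetching article text via the public MediaWiki API;
# wiki-rag's own wrapper may differ.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_extract(title: str) -> str:
    """Return the plain-text extract of a Wikipedia page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    data = requests.get(API_URL, params=params, timeout=10).json()
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

print(fetch_extract("Quantum computing")[:300])
```

The same `api.php` endpoint also serves keyword search (`list=search`) and category listings (`list=categorymembers`).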
## Directory Structure

```text
wiki-rag/
├── database/          # Database connection and utilities
├── embedder/          # Embedding generation scripts
├── faiss_processor/   # FAISS index building and querying
├── gemini_agent/      # LLM agent orchestration
├── preprocessor/      # Text cleaning, chunking, and metadata
├── warehouse/         # Raw data storage (e.g., local cache of articles)
├── wikipedia_api/     # MediaWiki API wrappers
├── main.py            # Entry point: orchestrates the end-to-end pipeline or exposes an API
├── requirements.txt   # Python dependencies
└── .gitignore         # Files and directories to ignore
```
## Prerequisites

- Python ≥ 3.8
- pip or Poetry
- API keys:
  - `OPENAI_API_KEY` (for embeddings)
  - `GEMINI_API_KEY` (for LLM queries)
## Installation

- Clone the repo:

  ```bash
  git clone https://github.com/xreedev/wiki-rag.git
  cd wiki-rag
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Configuration

Create a `.env` file in the root directory (and ensure it is in `.gitignore`):

```env
OPENAI_API_KEY=your_openai_key
GEMINI_API_KEY=your_gemini_key

# Optional overrides
WIKI_API_URL=https://en.wikipedia.org/w/api.php
EMBED_MODEL=text-embedding-ada-002
CHUNK_SIZE=500
VECTOR_STORE_PATH=./faiss_index
```
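At runtime these values are presumably read from the environment; a minimal loading sketch with python-dotenv is below. The actual mechanism in wiki-rag may differ, and the fallback defaults are simply the values shown above:

```python
# Reading the settings above with python-dotenv (pip install python-dotenv).
# Variable names mirror the .env file; the fallback defaults are assumptions.
import os

from dotenv import load_dotenv

load_dotenv()  # loads .env from the current working directory

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # required
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]  # required
WIKI_API_URL = os.getenv("WIKI_API_URL", "https://en.wikipedia.org/w/api.php")
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-ada-002")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))
VECTOR_STORE_PATH = os.getenv("VECTOR_STORE_PATH", "./faiss_index")
```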
## Usage

Run the pipeline end-to-end:
```bash
python main.py \
  --query "Evolution of quantum computing" \
  --top_k 5 \
  --source wiki
```

Or start an API server (if implemented in `main.py`):

```bash
uvicorn main:app --reload
```
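If you do serve the pipeline over HTTP, a minimal sketch of what `main:app` could look like with FastAPI follows. Here `answer_question` is a hypothetical stand-in for the retrieval-plus-Gemini call, and only the `/ask` route from the examples below is assumed:

```python
# A hypothetical shape for main:app — not necessarily how wiki-rag wires it.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="wiki-rag")

class AskRequest(BaseModel):
    question: str

def answer_question(question: str) -> str:
    # Placeholder: retrieve context from the FAISS index and call the
    # Gemini agent here.
    return f"(stub answer for: {question!r})"

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    return {"answer": answer_question(req.question)}
```

With this shape, the curl call in the Examples section returns a JSON body like `{"answer": "..."}`.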
## Examples

- Local Search:

  ```bash
  python main.py --query "Python concurrency" --mode local
  ```

- Server Mode:

  ```bash
  curl -X POST http://localhost:8000/ask \
    -H "Content-Type: application/json" \
    -d '{"question": "What is RAG?"}'
  ```
## Contributing

- Fork the repo.
- Create a feature branch:

  ```bash
  git checkout -b feature/awesome
  ```

- Commit your changes:

  ```bash
  git commit -m "Add awesome feature"
  ```

- Push:

  ```bash
  git push origin feature/awesome
  ```

- Open a pull request.
Please open issues for bugs or feature requests.
## Roadmap

- Add Docker support for easy deployment
- Implement automated tests (pytest)
- CI/CD pipeline (GitHub Actions)
- Support multi-language wiki sources
- Real-time streaming responses from LLM
## License

This project is licensed under the MIT License. See LICENSE for details.