Website Archiver & Change Detection System

A production-ready website monitoring and archival system built on ArchiveBox. It automatically discovers pages, snapshots them, and tracks changes across websites, with full-text search, cryptographic verification, and optional distributed storage.

Overview

This project implements a website watcher that combines ArchiveBox's archival capabilities with content-hash change detection, optional distributed storage, and cryptographic verification. It is suited to compliance monitoring, research archival, competitive intelligence, and preserving important web content.

Key Features

Intelligent Discovery: Automatic sitemap parsing and internal link crawling with robots.txt compliance
Change Detection: Content-hash-based change tracking with stable diff algorithms
Full-Text Search: SQLite FTS5-powered search across all archived versions
Distributed Storage: Optional IPFS integration for decentralized content preservation
Cryptographic Verification: Merkle trees and content anchoring for authenticity
Production Ready: Systemd timers, Docker support, Prometheus metrics, health checks
Web Interface: Flask-based UI for search, monitoring, and site management
Automated Scheduling: Configurable intervals (default: every 2 hours)
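
As an illustration of the content-hash change tracking listed above, here is a minimal sketch using only the Python standard library. The function names (normalize, content_hash, diff_snapshots) and the whitespace normalization rule are assumptions made for this example; they are not taken from the project's source, which may hash and diff differently.

```python
import difflib
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace runs and drop blank lines so cosmetic reflows
    do not register as changes (an assumed normalization rule)."""
    return "\n".join(
        " ".join(line.split()) for line in text.splitlines() if line.strip()
    )


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def diff_snapshots(old: str, new: str) -> list[str]:
    """Unified diff between two snapshots; empty when the hashes match."""
    if content_hash(old) == content_hash(new):
        return []
    return list(
        difflib.unified_diff(
            normalize(old).splitlines(),
            normalize(new).splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
```

Comparing hashes first keeps the common "nothing changed" case cheap; the diff is only computed when the digests differ.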

Technology Stack

Category	Technologies
Core	Python 3.10+, ArchiveBox, SQLite (FTS5)
Web	Flask, BeautifulSoup4, lxml, readability-lxml
Scheduling	APScheduler, Systemd timers
Crypto	PyNaCl, Cryptography, Merkle trees
Monitoring	Prometheus, Health checks
Storage	IPFS (optional), SQLite

Quick Start

# Prerequisites: Install ArchiveBox
pip install archivebox && archivebox init

# Clone and install
git clone https://github.com/jayhemnani9910/webcrawler.git
cd webcrawler
pip install -r requirements.txt

# Add a site and run
python -m src.main add-site https://example.com
python -m src.main run

# Search archived content
python -m src.main search "query"

# Launch web UI
python -m src.main web  # http://localhost:5000
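
For readers curious how an FTS5-backed search such as the one behind the search command might work, the sketch below builds a small in-memory index with Python's sqlite3 module. The table name snapshots_fts and its columns are hypothetical; the project's actual schema is defined in src/db.py and may differ. It also assumes the local SQLite build includes the FTS5 extension.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory FTS5 index from (url, body) pairs."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table and column names, chosen for this example only.
    conn.execute("CREATE VIRTUAL TABLE snapshots_fts USING fts5(url, body)")
    conn.executemany(
        "INSERT INTO snapshots_fts (url, body) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


def search(conn, query, limit=10):
    """Return (url, snippet) pairs, best match first."""
    sql = (
        # snippet() marks matches in column 1 (body) with [ and ].
        "SELECT url, snippet(snapshots_fts, 1, '[', ']', '...', 8) "
        "FROM snapshots_fts WHERE snapshots_fts MATCH ? "
        "ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()
```

Passing the query as a bound parameter avoids SQL injection, though the string is still interpreted as FTS5 query syntax, so user input containing quotes or operators may need escaping.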

Production Deployment

Docker

docker compose up -d --build

# Production with Prometheus + Grafana
docker compose -f docker-compose.prod.yml up -d

Systemd

sudo ./install_service.sh
sudo systemctl enable website-watcher.timer
sudo systemctl start website-watcher.timer

API

# Search endpoint
GET /api/search?q={query}

# Health check
GET /health

# Prometheus metrics
GET /metrics
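
The endpoints above can be called from any HTTP client. The following sketch uses Python's standard library; it assumes the API is served on the same port as the web UI (http://localhost:5000) and that the search endpoint returns JSON, neither of which is confirmed here. The helper names search_url and fetch_json are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the API shares the web UI's host and port.
BASE_URL = "http://localhost:5000"


def search_url(query: str, base_url: str = BASE_URL) -> str:
    """Build the search endpoint URL with the query safely encoded."""
    return f"{base_url}/api/search?{urlencode({'q': query})}"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the web server running, fetch_json(search_url("pricing")) would issue the search request; the shape of the returned data depends on the server.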

Project Structure

webcrawler/
├── src/
│   ├── main.py              # CLI and scheduler
│   ├── crawler.py           # Discovery and orchestration
│   ├── archivebox_interface.py
│   ├── db.py                # SQLite schema
│   ├── crypto.py            # Cryptographic utilities
│   ├── merkle.py            # Merkle tree implementation
│   └── ipfs_interface.py    # IPFS storage layer
├── systemd/                 # Service units
├── docker-compose.yml
└── scripts/backup_db.sh
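
To give a sense of what a module like merkle.py might compute, here is a generic Merkle root sketch over a list of snapshot digests. It is not the project's implementation: the function name merkle_root, the use of SHA-256, the 0x00/0x01 domain-separation prefixes, and the rule of duplicating the last node on odd levels are all assumptions chosen for illustration.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """SHA-256 digest as raw bytes."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> str:
    """Hex Merkle root over the leaves; odd levels duplicate the last node."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    # Prefix leaves with 0x00 and internal nodes with 0x01 so a leaf
    # can never be confused with an internal node.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [
            _h(b"\x01" + level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()
```

Changing any leaf changes the root, which is what makes a single published root usable as a tamper-evident summary of many snapshots. Note that duplicating the last node means a list ending in a repeated leaf can share a root with the shorter list, so a real implementation may prefer a different padding rule.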

Use Cases

Compliance monitoring and regulatory tracking
Research archival and citation preservation
Competitive intelligence
Content preservation before deletion
Change auditing with verifiable records

License

MIT
