A high-performance, concurrent web crawler designed to extract product URLs from e-commerce websites. Built with Python, asyncio, and Playwright.
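At its core, loading a page in a headless browser and collecting its links with Playwright's async API looks roughly like the following. This is a minimal sketch for orientation, not the project's actual crawl loop; the URL and selector are placeholders:

```python
import asyncio

from playwright.async_api import async_playwright


async def fetch_links(url: str) -> list[str]:
    """Open a page in headless Chromium and collect its outgoing links."""
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        # Grab the href of every anchor tag on the page.
        links = await page.eval_on_selector_all(
            "a[href]", "els => els.map(el => el.href)"
        )
        await browser.close()
        return links


if __name__ == "__main__":
    print(asyncio.run(fetch_links("https://example.com")))
```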
## Prerequisites

- Python 3.9 or higher
- Make

## Setup

```bash
make setup
```

This will:
- Create a virtual environment
- Install all dependencies
- Set up Playwright browser
- Configure development tools

## Common Commands

```bash
# Run the crawler
make run

# Run tests
make test

# Format code
make format
```

## Project Structure

```
src/
├── core/                # Core crawler functionality
│   ├── crawler.py       # Main crawler implementation
│   └── director.py      # Multi-domain orchestration
├── utils/               # Utility functions
│   └── url_utils.py     # URL processing utilities
└── config/              # Configuration
    └── patterns.py      # URL pattern definitions
```
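
The split between `director.py` and `crawler.py` suggests a fan-out design: the director schedules one crawler per domain and gathers their results concurrently. Here is a minimal sketch of that shape, assuming the real classes differ in detail (`Crawler` below is a placeholder):

```python
import asyncio


class Crawler:
    """Stand-in for the per-domain crawler in core/crawler.py."""

    def __init__(self, domain: str) -> None:
        self.domain = domain

    async def run(self) -> list[str]:
        # The real crawler would drive Playwright here and return
        # the product URLs it discovered on this domain.
        return []


class CrawlDirector:
    """Fans out one crawler per domain and gathers the results."""

    async def _crawl_all(self, domains: list[str]) -> dict[str, list[str]]:
        crawlers = [Crawler(domain) for domain in domains]
        # Run every crawler concurrently on the event loop.
        results = await asyncio.gather(*(c.run() for c in crawlers))
        return dict(zip(domains, results))

    def execute_crawlers(self, domains: list[str]) -> dict[str, list[str]]:
        # Synchronous entry point, matching the usage example below.
        return asyncio.run(self._crawl_all(domains))
```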

## Usage

```python
from core.director import CrawlDirector

# Initialize and run
director = CrawlDirector()
results = director.execute_crawlers(["example.com"])

# Results are saved to results.json
```

## Development

```bash
make help        # Show all available commands
make check       # Run all checks (tests and linting)
make dev-clean   # Clean all generated files
make dev-setup   # Set up the development environment
```

## Configuration

Edit patterns in `src/config/patterns.py`:

```python
import re

# Add product URL patterns
PRODUCT_PATTERNS = [
    re.compile(r'(.*/product/.*)'),  # Standard product paths
    re.compile(r'(.*/p/.*)'),        # Short product paths
]

# Add URLs to ignore
PATTERNS_TO_IGNORE = [
    re.compile(r'(.*/(about).*)'),   # About pages
    re.compile(r'(.*/(cart).*)'),    # Shopping cart
]
```

## Features

- Concurrent crawling of multiple domains
- Smart product URL detection
- Configurable URL patterns (see the matching sketch below)
- Automatic result saving
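
To make the pattern configuration concrete, the product and ignore lists might be combined along these lines; `is_product_url` is a hypothetical helper, not necessarily how `src/utils/url_utils.py` implements the check:

```python
import re

PRODUCT_PATTERNS = [
    re.compile(r'(.*/product/.*)'),
    re.compile(r'(.*/p/.*)'),
]
PATTERNS_TO_IGNORE = [
    re.compile(r'(.*/(about).*)'),
    re.compile(r'(.*/(cart).*)'),
]


def is_product_url(url: str) -> bool:
    """Keep a URL only if it matches a product pattern and no ignore pattern."""
    if any(pattern.match(url) for pattern in PATTERNS_TO_IGNORE):
        return False
    return any(pattern.match(url) for pattern in PRODUCT_PATTERNS)


assert is_product_url("https://shop.example/product/123")
assert not is_product_url("https://shop.example/cart")
```

Checking the ignore list first keeps pages like the cart out of the results even if they would also happen to match a product pattern.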

## License

MIT License