A high-performance, concurrent web crawler designed to extract product URLs from e-commerce websites. Built with Python, asyncio, and Playwright.
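At its core, loading a page in a headless browser and collecting its links with Playwright's async API looks roughly like the following. This is a minimal sketch for orientation, not the project's actual crawl loop; the URL and selector are placeholders:

```python
import asyncio

from playwright.async_api import async_playwright


async def fetch_links(url: str) -> list[str]:
    """Open a page in headless Chromium and collect its outgoing links."""
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="domcontentloaded")
        # Grab the href of every anchor tag on the page.
        links = await page.eval_on_selector_all(
            "a[href]", "els => els.map(el => el.href)"
        )
        await browser.close()
        return links


if __name__ == "__main__":
    print(asyncio.run(fetch_links("https://example.com")))
```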
## Prerequisites

- Python 3.9 or higher
- Make

## Setup

```bash
make setup
```

This will:
- Create a virtual environment
- Install all dependencies
- Set up Playwright browser
- Configure development tools

## Common Commands

```bash
# Run the crawler
make run

# Run tests
make test

# Format code
make format
```

## Project Structure

```
src/
├── core/                # Core crawler functionality
│   ├── crawler.py       # Main crawler implementation
│   └── director.py      # Multi-domain orchestration
├── utils/               # Utility functions
│   └── url_utils.py     # URL processing utilities
└── config/              # Configuration
    └── patterns.py      # URL pattern definitions
```
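
The split between `director.py` and `crawler.py` suggests a fan-out design: the director schedules one crawler per domain and gathers their results concurrently. Here is a minimal sketch of that shape, assuming the real classes differ in detail (`Crawler` below is a placeholder):

```python
import asyncio


class Crawler:
    """Stand-in for the per-domain crawler in core/crawler.py."""

    def __init__(self, domain: str) -> None:
        self.domain = domain

    async def run(self) -> list[str]:
        # The real crawler would drive Playwright here and return
        # the product URLs it discovered on this domain.
        return []


class CrawlDirector:
    """Fans out one crawler per domain and gathers the results."""

    async def _crawl_all(self, domains: list[str]) -> dict[str, list[str]]:
        crawlers = [Crawler(domain) for domain in domains]
        # Run every crawler concurrently on the event loop.
        results = await asyncio.gather(*(c.run() for c in crawlers))
        return dict(zip(domains, results))

    def execute_crawlers(self, domains: list[str]) -> dict[str, list[str]]:
        # Synchronous entry point, matching the usage example below.
        return asyncio.run(self._crawl_all(domains))
```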

## Usage

```python
from core.director import CrawlDirector

# Initialize and run
director = CrawlDirector()
results = director.execute_crawlers(["example.com"])

# Results are saved to results.json
```

## Development

```bash
make help        # Show all available commands
make check       # Run all checks (tests and linting)
make dev-clean   # Clean all generated files
make dev-setup   # Set up the development environment
```

## Configuration

Edit patterns in `src/config/patterns.py`:

```python
import re

# Add product URL patterns
PRODUCT_PATTERNS = [
    re.compile(r'(.*/product/.*)'),  # Standard product paths
    re.compile(r'(.*/p/.*)'),        # Short product paths
]

# Add URLs to ignore
PATTERNS_TO_IGNORE = [
    re.compile(r'(.*/(about).*)'),   # About pages
    re.compile(r'(.*/(cart).*)'),    # Shopping cart
]
```

## Features

- Concurrent crawling of multiple domains
- Smart product URL detection
- Configurable URL patterns (see the matching sketch below)
- Automatic result saving
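
To make the pattern configuration concrete, the product and ignore lists might be combined along these lines; `is_product_url` is a hypothetical helper, not necessarily how `src/utils/url_utils.py` implements the check:

```python
import re

PRODUCT_PATTERNS = [
    re.compile(r'(.*/product/.*)'),
    re.compile(r'(.*/p/.*)'),
]
PATTERNS_TO_IGNORE = [
    re.compile(r'(.*/(about).*)'),
    re.compile(r'(.*/(cart).*)'),
]


def is_product_url(url: str) -> bool:
    """Keep a URL only if it matches a product pattern and no ignore pattern."""
    if any(pattern.match(url) for pattern in PATTERNS_TO_IGNORE):
        return False
    return any(pattern.match(url) for pattern in PRODUCT_PATTERNS)


assert is_product_url("https://shop.example/product/123")
assert not is_product_url("https://shop.example/cart")
```

Checking the ignore list first keeps pages like the cart out of the results even if they would also happen to match a product pattern.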

## License

MIT License