A Node.js application built with TypeScript that uses RabbitMQ for asynchronous job processing and Puppeteer for web scraping.
This application provides a robust infrastructure for scheduling and processing web scraping jobs asynchronously. It uses a producer-consumer pattern with RabbitMQ as the message broker (a minimal producer sketch follows the list below):
- Producer: Accepts scraping jobs via an HTTP API and enqueues them in RabbitMQ.
- Consumer: Consumes jobs from the queue and processes them with Puppeteer.
- Scraper: Extracts data from web pages including titles, meta descriptions, and links.
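In outline, the producer side can be sketched as follows (a minimal sketch using amqplib; the queue name is illustrative and not taken from the actual source, while the payload fields match the example request later in this README):

import amqp from 'amqplib';

// Publish a scraping job to the queue (queue name is illustrative).
async function enqueueJob(job: { bookingId: string; urls: string[] }): Promise<void> {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  // Durable queue so jobs survive a broker restart.
  await channel.assertQueue('scrape_jobs', { durable: true });
  // Persistent delivery so the message is written to disk.
  channel.sendToQueue('scrape_jobs', Buffer.from(JSON.stringify(job)), { persistent: true });
  await channel.close();
  await connection.close();
}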
The application follows a functional programming approach with TypeScript for type safety:
- Express API: Accepts job requests and forwards them to the producer (sketched after this list).
- RabbitMQ: Manages the job queue for reliable delivery.
- Puppeteer: Headless browser for web scraping.
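At the HTTP boundary, the API route might look roughly like this (a hedged sketch reusing the enqueueJob helper from the producer sketch above; the /jobs path and port match the curl example below, but the validation and response shape are illustrative):

import express from 'express';

const app = express();
app.use(express.json());

// Accept a scraping job over HTTP and hand it to the producer.
app.post('/jobs', async (req, res) => {
  const { bookingId, urls } = req.body;
  if (!bookingId || !Array.isArray(urls)) {
    res.status(400).json({ error: 'bookingId and urls are required' });
    return;
  }
  await enqueueJob({ bookingId, urls }); // from the producer sketch above
  res.status(202).json({ status: 'queued' });
});

app.listen(3000);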
Prerequisites:
- Node.js (v14 or higher)
- npm or yarn
- RabbitMQ Server
Install and set up RabbitMQ on Ubuntu:
# Install RabbitMQ
sudo apt install rabbitmq-server
# Start RabbitMQ service
sudo systemctl start rabbitmq-server
# Enable RabbitMQ to start on boot
sudo systemctl enable rabbitmq-server
# Verify RabbitMQ is running
sudo systemctl status rabbitmq-server
Clone the repository:
git clone https://github.com/semsion/node-rabbitmq-scraper.git
cd node-rabbitmq-scraper
Install dependencies:
npm install
Set up environment variables:
cp .env.example .env # The example values can be used for testing
Build and start the application:

npm run build
npm start

For development with automatic reloading:
npm run dev

Use the REST API to send a scraping job:
curl -X POST http://localhost:3000/jobs \
-H "Content-Type: application/json" \
-d '{
"bookingId": "123456",
"urls": [
"https://www.example.com",
"https://developer.mozilla.org",
"https://github.com"
]
}'

Check the console output for detailed logs of the scraping process. RabbitMQ provides various monitoring tools:
# Enable RabbitMQ management interface
sudo rabbitmq-plugins enable rabbitmq_management
# Access the management interface at http://localhost:15672
# Default credentials: guest/guest

Project structure:

node-rabbitmq-scraper
├── src/
│   ├── app.ts        # Main application entry point
│   ├── config/       # Configuration files
│   ├── consumer/     # Message queue consumer
│   ├── models/       # Data models
│   ├── producer/     # Message queue producer
│   ├── services/     # Core services (RabbitMQ, scraper)
│   └── utils/        # Utility functions
├── test/             # Test files
├── package.json      # Node.js dependencies
└── tsconfig.json     # TypeScript configuration
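Based on the example request above, the data model carried through the queue likely resembles the following (a sketch; the actual definitions live in src/models/, and the interface name here is illustrative):

// Shape of a scraping job as submitted via POST /jobs.
interface ScrapeJob {
  bookingId: string; // caller-supplied identifier for correlating results
  urls: string[];    // pages to scrape
}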
The main scraping logic is defined in:
- jobConsumer.ts - Contains the core scraping functionality using Puppeteer (sketched below)
- scraper.ts - Provides a simpler, standalone scraping implementation
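Stripped to its essentials, the consume-and-scrape loop works along these lines (a simplified sketch assuming the illustrative queue name from earlier; the real jobConsumer.ts handles more cases):

import amqp from 'amqplib';
import puppeteer from 'puppeteer';

async function startConsumer(): Promise<void> {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('scrape_jobs', { durable: true });
  await channel.prefetch(1); // process one job at a time per consumer

  await channel.consume('scrape_jobs', async (msg) => {
    if (!msg) return;
    try {
      const job = JSON.parse(msg.content.toString());
      const browser = await puppeteer.launch({ headless: true });
      for (const url of job.urls) {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle2' });
        console.log(await page.title()); // the real consumer extracts more data
        await page.close();
      }
      await browser.close();
      channel.ack(msg); // acknowledge only after successful processing
    } catch (err) {
      console.error('Job failed:', err);
      channel.nack(msg, false, false); // reject without requeueing
    }
  });
}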
To customize what gets scraped, modify the page evaluation functions in these files. For example, in jobConsumer.ts, you can enhance the page evaluation function to extract specific elements:
// Example: Extract product information from an e-commerce site
const productData = await page.evaluate(() => {
return {
title: document.querySelector('.product-title')?.textContent?.trim(),
price: document.querySelector('.product-price')?.textContent?.trim(),
description: document.querySelector('.product-description')?.textContent?.trim(),
imageUrl: document.querySelector('.product-image')?.getAttribute('src')
};
});

Key features:

- Scalability: Scales horizontally by running multiple consumers via tools like PM2, Docker, or Kubernetes
- Functional Programming: Uses a functional approach with TypeScript for better testability
- Asynchronous Processing: Non-blocking job processing with RabbitMQ
- Error Handling: Robust error handling with message acknowledgment
- Configurability: Reads configuration values from src/config/index.ts (see the sketch below)
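As a rough illustration, the config module might look like this (a sketch assuming dotenv-style variables; the actual names in src/config/index.ts may differ):

import dotenv from 'dotenv';

dotenv.config(); // load values from .env

// Hypothetical variable names; check src/config/index.ts for the real ones.
export const config = {
  rabbitmqUrl: process.env.RABBITMQ_URL ?? 'amqp://localhost',
  queueName: process.env.QUEUE_NAME ?? 'scrape_jobs',
  port: Number(process.env.PORT ?? 3000),
};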
Run the test suite with:
npm test

License: MIT