Node RabbitMQ Scraper

A Node.js application built with TypeScript that uses RabbitMQ for asynchronous job processing and Puppeteer for web scraping.

Overview

This application provides a robust infrastructure for scheduling and processing web scraping jobs asynchronously. It uses a producer-consumer pattern with RabbitMQ as the message broker:

  1. Producer: Accepts scraping jobs via an HTTP API and enqueues them in RabbitMQ.
  2. Consumer: Consumes jobs from the queue and processes them with Puppeteer.
  3. Scraper: Extracts data from web pages including titles, meta descriptions, and links.
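
The wiring between these pieces can be pictured with a minimal sketch of the producer side, assuming the amqplib client (the queue name and payload shape here are illustrative, not the project's actual values):

import amqp from 'amqplib';

// Minimal sketch of the producer side, assuming the amqplib client;
// queue name and payload shape are illustrative, not the project's exact values.
export const enqueueJob = async (job: { bookingId: string; urls: string[] }) => {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('scrape_jobs', { durable: true });   // durable queue survives broker restarts
  channel.sendToQueue('scrape_jobs', Buffer.from(JSON.stringify(job)), { persistent: true });
  await channel.close();
  await connection.close();
};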

Architecture

The application follows a functional programming approach with TypeScript for type safety:

  • Express API: Accepts job requests and forwards them to the producer.
  • RabbitMQ: Manages the job queue for reliable delivery.
  • Puppeteer: Headless browser for web scraping.
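
As a rough picture of the API layer: the route path matches the Usage example below, while the import path and the enqueueJob helper (from the sketch above) are assumptions rather than the project's exact code.

import express from 'express';
import { enqueueJob } from './producer';   // hypothetical path; see the producer sketch above

// Rough sketch of the Express entry point that forwards jobs to the producer.
const app = express();
app.use(express.json());

app.post('/jobs', async (req, res) => {
  const { bookingId, urls } = req.body;
  if (!bookingId || !Array.isArray(urls)) {
    res.status(400).json({ error: 'bookingId and urls[] are required' });
    return;
  }
  await enqueueJob({ bookingId, urls });                 // hand the job off to RabbitMQ
  res.status(202).json({ status: 'queued', bookingId });
});

app.listen(3000, () => console.log('API listening on port 3000'));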

Installation

Prerequisites

  • Node.js (v14 or higher)
  • npm or yarn
  • RabbitMQ Server

RabbitMQ Installation

Install and set up RabbitMQ on Ubuntu:

# Install RabbitMQ
sudo apt install rabbitmq-server

# Start RabbitMQ service
sudo systemctl start rabbitmq-server

# Enable RabbitMQ to start on boot
sudo systemctl enable rabbitmq-server

# Verify RabbitMQ is running
sudo systemctl status rabbitmq-server

Application Setup

  1. Clone the repository:

    git clone https://github.com/semsion/node-rabbitmq-scraper.git
    cd node-rabbitmq-scraper
  2. Install dependencies:

    npm install
  3. Set up environment variables:

    cp .env.example .env
    
    # The example values can be used for testing
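
The variables themselves are documented in .env.example; as a rough idea of how such values are typically surfaced to the rest of the app (the variable names below are assumptions, not the file's actual contents):

// Sketch of a config module along the lines of src/config/index.ts;
// the variable names are assumptions - check .env.example for the real ones.
import dotenv from 'dotenv';

dotenv.config();

export const config = {
  rabbitmqUrl: process.env.RABBITMQ_URL ?? 'amqp://localhost',
  queueName: process.env.QUEUE_NAME ?? 'scrape_jobs',
  port: Number(process.env.PORT ?? 3000),
};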

Running the Application

Build the TypeScript Code

npm run build

Start the Application

npm start

For development with automatic reloading:

npm run dev

Usage

Sending Scraping Jobs

Use the REST API to send a scraping job:

curl -X POST http://localhost:3000/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "bookingId": "123456",
    "urls": [
      "https://www.example.com",
      "https://developer.mozilla.org",
      "https://github.com"
    ]
  }'

Monitoring

Check the console output for detailed logs of the scraping process. RabbitMQ also ships a web-based management interface for inspecting queues and message rates:

# Enable RabbitMQ management interface
sudo rabbitmq-plugins enable rabbitmq_management

# Access the management interface at http://localhost:15672
# Default credentials: guest/guest
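
Beyond the web UI, queue depth can also be checked programmatically through the management plugin's HTTP API; a quick sketch, assuming a local broker with the default guest/guest credentials:

// Quick look at queue depth via the management plugin's HTTP API
// (plugin must be enabled; uses global fetch, available in Node 18+).
const auth = 'Basic ' + Buffer.from('guest:guest').toString('base64');

const listQueues = async () => {
  const res = await fetch('http://localhost:15672/api/queues', { headers: { Authorization: auth } });
  const queues: Array<{ name: string; messages: number; consumers: number }> = await res.json();
  for (const q of queues) {
    console.log(`${q.name}: ${q.messages} message(s), ${q.consumers} consumer(s)`);
  }
};

listQueues().catch(console.error);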

Project Structure

node-rabbitmq-scraper
├── src/
│   ├── app.ts                 # Main application entry point
│   ├── config/                # Configuration files
│   ├── consumer/              # Message queue consumer
│   ├── models/                # Data models
│   ├── producer/              # Message queue producer
│   ├── services/              # Core services (RabbitMQ, scraper)
│   └── utils/                 # Utility functions
├── test/                      # Test files
├── package.json               # Node.js dependencies
└── tsconfig.json              # TypeScript configuration

Customizing Scraping Logic

The main scraping logic is defined in:

  1. jobConsumer.ts - Contains the core scraping functionality using Puppeteer
  2. scraper.ts - Provides a simpler implementation for scraping

To customize what gets scraped, modify the page evaluation functions in these files. For example, in jobConsumer.ts you could extend the evaluation to extract specific elements:

// Example: Extract product information from an e-commerce site
const productData = await page.evaluate(() => {
  return {
    title: document.querySelector('.product-title')?.textContent?.trim(),
    price: document.querySelector('.product-price')?.textContent?.trim(),
    description: document.querySelector('.product-description')?.textContent?.trim(),
    imageUrl: document.querySelector('.product-image')?.getAttribute('src')
  };
});
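
For reference, the default extraction described above (titles, meta descriptions, and links) boils down to a page.evaluate call along these lines; this is a sketch, not the exact code in jobConsumer.ts:

// Sketch of the default extraction: page title, meta description, and links.
const pageData = await page.evaluate(() => ({
  title: document.title,
  metaDescription: document.querySelector('meta[name="description"]')?.getAttribute('content') ?? null,
  links: Array.from(document.querySelectorAll('a[href]')).map(a => (a as HTMLAnchorElement).href),
}));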

Key Features

  • Scalability: Scales horizontally by running multiple consumers via tools like PM2, Docker, or Kubernetes
  • Functional Programming: Uses a functional approach with TypeScript for better testability
  • Asynchronous Processing: Non-blocking job processing with RabbitMQ
  • Error Handling: Robust error handling with message acknowledgment (see the sketch after this list)
  • Configurability: Centralizes configuration values in src/config/index.ts
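
The acknowledgment flow in the consumer follows the usual amqplib pattern; roughly as below, where the queue name and job handling are placeholders rather than the project's exact code:

import amqp from 'amqplib';

// Rough sketch of the consumer's ack/nack flow, assuming the amqplib client;
// queue name and job handling are placeholders, not the project's exact code.
const startConsumer = async () => {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('scrape_jobs', { durable: true });
  channel.prefetch(1);                                  // hand each consumer one job at a time

  await channel.consume('scrape_jobs', async (msg) => {
    if (!msg) return;
    try {
      const job = JSON.parse(msg.content.toString());
      console.log('Processing job', job.bookingId);     // the real consumer runs Puppeteer here
      channel.ack(msg);                                 // success: remove the message from the queue
    } catch (err) {
      console.error('Job failed:', err);
      channel.nack(msg, false, false);                  // failure: reject without requeueing
    }
  }, { noAck: false });
};

startConsumer().catch(console.error);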

Testing

Run the test suite with:

npm test

License

MIT
