AWS Lambda Chrome Batch Download Starter

A starter kit for running headful (GUI mode) Chrome on AWS Lambda to batch-download web pages via a Chrome extension. Downloaded HTML is returned with gzip-compressed, Base64-encoded content.

Features

  • 🚀 Headful Chrome automation in AWS Lambda
  • 📦 Chrome extension integration (SingleFile format)
  • 🔄 Fuzzy matching of URLs to downloaded files
  • 🗜️ Gzip compression + Base64 encoding of content
  • 📁 Atomic file operations with proper cleanup
  • 🐳 Dockerized Lambda environment

Prerequisites

  • AWS Account with Lambda access
  • Docker installed locally
  • Python 3.10+
  • AWS CLI configured

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/musaspacecadet-aws_lambda_chrome_starter.git
cd musaspacecadet-aws_lambda_chrome_starter
  2. Build the Docker image:
docker build -t lambda-chrome-batch .
  3. Install the Python requirements:
pip install -r requirements.txt

Configuration

Set environment variables in app.py. In Lambda, /tmp is the only writable path, so both directories must live under it:

os.environ['DOWNLOAD_DIR'] = '/tmp/snapshots'  # Lambda writable dir
os.environ['EXTENSION_DIR'] = '/tmp/unpacked_extension'  # unpacked SingleFile extension
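
A minimal sketch of how the handler side might consume these settings; the actual app.py may read them differently:

import os

# Fall back to the documented defaults if the variables are unset.
download_dir = os.environ.get('DOWNLOAD_DIR', '/tmp/snapshots')
extension_dir = os.environ.get('EXTENSION_DIR', '/tmp/unpacked_extension')
os.makedirs(download_dir, exist_ok=True)   # /tmp is the only writable path in Lambda
os.makedirs(extension_dir, exist_ok=True)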

Local Testing

  1. Start the Lambda Runtime Interface Emulator:
docker run -p 9000:8080 lambda-chrome-batch
  2. In another terminal, run the test script (or invoke the emulator directly, as sketched below):
python test.py
  3. Check the generated HTML files in the output_html/ directory
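
As an alternative to test.py, you can POST an event straight to the emulator. The sketch below uses only the standard library and the standard RIE invocation path; the event shape matches the Usage section:

import json
import urllib.request

event = {'urls': ['https://example.com']}
req = urllib.request.Request(
    'http://localhost:9000/2015-03-31/functions/function/invocations',
    data=json.dumps(event).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)
with urllib.request.urlopen(req) as resp:
    response = json.loads(resp.read())
print(list(response.get('url_mappings', {})))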

Deployment

  1. Create an ECR repository in AWS
  2. Push the Docker image:
aws ecr get-login-password | docker login --username AWS --password-stdin YOUR_ECR_URI
docker tag lambda-chrome-batch:latest YOUR_ECR_URI/lambda-chrome-batch:latest
docker push YOUR_ECR_URI/lambda-chrome-batch:latest
  3. Create a Lambda function from the container image (a boto3 sketch follows)
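
A minimal boto3 sketch for step 3, assuming the image is already in ECR; the role ARN, timeout, and memory values below are illustrative placeholders to tune for your workload:

import boto3

client = boto3.client('lambda')
client.create_function(
    FunctionName='lambda-chrome-batch',
    PackageType='Image',
    Code={'ImageUri': 'YOUR_ECR_URI/lambda-chrome-batch:latest'},
    Role='arn:aws:iam::123456789012:role/your-lambda-execution-role',  # placeholder
    Timeout=300,      # headful Chrome needs a generous timeout
    MemorySize=2048,  # and memory
)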

Usage

Lambda event format:

{
  "urls": [
    "https://example.com",
    "https://github.com",
    "https://google.com"
  ]
}

Sample response:

{
  "url_mappings": {
    "https://example.com": {
      "filename": "d18c3abb...html",
      "content": "H4sIAAAAAAAEAI2S227j..."
    }
  }
}
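
Each content field is gzip-compressed HTML that has been Base64-encoded (per the Features list), so a client reverses the two steps to recover the page. A short sketch, assuming response is the parsed JSON shown above:

import base64
import gzip

entry = response['url_mappings']['https://example.com']
html = gzip.decompress(base64.b64decode(entry['content'])).decode('utf-8')
with open(entry['filename'], 'w', encoding='utf-8') as f:
    f.write(html)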

Customization

  1. Timeout Settings: Adjust max_wait_time in main()
  2. Extension: Modify mpiodijhokgodhhofbcjdecpffjipkle.crx
  3. Matching Logic: Tune thresholds in the FileMatcher class (see the sketch after this list)
  4. Compression: Modify the gzip/Base64 encoding in get_url_mapping_with_content()
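
For item 3, the sketch below illustrates the kind of threshold-based fuzzy matching involved. It uses difflib rather than the repository's actual FileMatcher implementation, and the function name and default threshold are hypothetical:

from difflib import SequenceMatcher

def match_url_to_file(url: str, filenames: list[str], threshold: float = 0.6) -> str | None:
    # Return the best-scoring filename, or None if nothing clears the threshold.
    best, best_score = None, threshold
    for name in filenames:
        score = SequenceMatcher(None, url, name).ratio()
        if score >= best_score:
            best, best_score = name, score
    return best

Raising the threshold makes matching stricter (fewer false positives, more misses); lowering it does the opposite.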

Troubleshooting

Common Issues:

  • ⏱️ Timeouts: Increase the Lambda timeout/memory settings
  • 🔒 File Permissions: Ensure write access to the /tmp directory
  • 🖥️ Headful Issues: Test with a visible browser locally first
  • 🔍 Content Matching: Adjust the fuzzy match thresholds

Debugging Tips:

  1. Check CloudWatch logs
  2. Test locally with test.py
  3. Inspect downloaded files in /tmp/snapshots
  4. Enable verbose Chrome logging via --enable-logging=stderr (a sketch follows this list)
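
For tip 4, a hedged sketch of passing the logging flags, assuming a Selenium-style ChromeOptions object; adapt the flag wiring to whichever driver the project actually uses:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument('--enable-logging=stderr')
opts.add_argument('--v=1')  # raise Chrome's log verbosity
driver = webdriver.Chrome(options=opts)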

License

MIT License - See LICENSE for details

Contributions

PRs welcome! Please:

  1. Open an issue first for major changes
  2. Update tests accordingly
  3. Maintain coding style consistency

Happy scraping! 🕷️


Disclaimer: Always scrape responsibly and respect websites’ terms of service and robots.txt files.