AWS Lambda Chrome Batch Download Starter

A starter kit for running headful (GUI mode) Chrome on AWS Lambda to batch-download web pages via a Chrome extension. Downloaded HTML is returned with gzip-compressed, Base64-encoded content.

Features

  • 🚀 Headful Chrome automation in AWS Lambda
  • 📦 Chrome extension integration (SingleFile format)
  • 🔄 Fuzzy matching of URLs to downloaded files
  • 🗜️ Gzip compression + Base64 encoding of content
  • 📁 Atomic file operations with proper cleanup
  • 🐳 Dockerized Lambda environment

Prerequisites

  • AWS Account with Lambda access
  • Docker installed locally
  • Python 3.10+
  • AWS CLI configured

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/musaspacecadet-aws_lambda_chrome_starter.git
cd musaspacecadet-aws_lambda_chrome_starter
  2. Build the Docker image:
docker build -t lambda-chrome-batch .
  3. Install the Python requirements:
pip install -r requirements.txt

Configuration

Set environment variables in app.py. In Lambda, /tmp is the only writable path, so both directories must live under it:

os.environ['DOWNLOAD_DIR'] = '/tmp/snapshots'  # Lambda writable dir
os.environ['EXTENSION_DIR'] = '/tmp/unpacked_extension'  # unpacked SingleFile extension
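
A minimal sketch of how the handler side might consume these settings; the actual app.py may read them differently:

import os

# Fall back to the documented defaults if the variables are unset.
download_dir = os.environ.get('DOWNLOAD_DIR', '/tmp/snapshots')
extension_dir = os.environ.get('EXTENSION_DIR', '/tmp/unpacked_extension')
os.makedirs(download_dir, exist_ok=True)   # /tmp is the only writable path in Lambda
os.makedirs(extension_dir, exist_ok=True)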

Local Testing

  1. Start the Lambda Runtime Interface Emulator:
docker run -p 9000:8080 lambda-chrome-batch
  2. In another terminal, run the test script (or invoke the emulator directly, as sketched below):
python test.py
  3. Check the generated HTML files in the output_html/ directory
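
As an alternative to test.py, you can POST an event straight to the emulator. The sketch below uses only the standard library and the standard RIE invocation path; the event shape matches the Usage section:

import json
import urllib.request

event = {'urls': ['https://example.com']}
req = urllib.request.Request(
    'http://localhost:9000/2015-03-31/functions/function/invocations',
    data=json.dumps(event).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)
with urllib.request.urlopen(req) as resp:
    response = json.loads(resp.read())
print(list(response.get('url_mappings', {})))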

Deployment

  1. Create an ECR repository in AWS
  2. Push the Docker image:
aws ecr get-login-password | docker login --username AWS --password-stdin YOUR_ECR_URI
docker tag lambda-chrome-batch:latest YOUR_ECR_URI/lambda-chrome-batch:latest
docker push YOUR_ECR_URI/lambda-chrome-batch:latest
  3. Create a Lambda function from the container image (a boto3 sketch follows)
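
A minimal boto3 sketch for step 3, assuming the image is already in ECR; the role ARN, timeout, and memory values below are illustrative placeholders to tune for your workload:

import boto3

client = boto3.client('lambda')
client.create_function(
    FunctionName='lambda-chrome-batch',
    PackageType='Image',
    Code={'ImageUri': 'YOUR_ECR_URI/lambda-chrome-batch:latest'},
    Role='arn:aws:iam::123456789012:role/your-lambda-execution-role',  # placeholder
    Timeout=300,      # headful Chrome needs a generous timeout
    MemorySize=2048,  # and memory
)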

Usage

Lambda event format:

{
  "urls": [
    "https://example.com",
    "https://github.com",
    "https://google.com"
  ]
}

Sample response:

{
  "url_mappings": {
    "https://example.com": {
      "filename": "d18c3abb...html",
      "content": "H4sIAAAAAAAEAI2S227j..."
    }
  }
}
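
Each content field is gzip-compressed HTML that has been Base64-encoded (per the Features list), so a client reverses the two steps to recover the page. A short sketch, assuming response is the parsed JSON shown above:

import base64
import gzip

entry = response['url_mappings']['https://example.com']
html = gzip.decompress(base64.b64decode(entry['content'])).decode('utf-8')
with open(entry['filename'], 'w', encoding='utf-8') as f:
    f.write(html)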

Customization

  1. Timeout Settings: Adjust max_wait_time in main()
  2. Extension: Modify mpiodijhokgodhhofbcjdecpffjipkle.crx
  3. Matching Logic: Tune thresholds in the FileMatcher class (see the sketch after this list)
  4. Compression: Modify the gzip/Base64 encoding in get_url_mapping_with_content()
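
For item 3, the sketch below illustrates the kind of threshold-based fuzzy matching involved. It uses difflib rather than the repository's actual FileMatcher implementation, and the function name and default threshold are hypothetical:

from difflib import SequenceMatcher

def match_url_to_file(url: str, filenames: list[str], threshold: float = 0.6) -> str | None:
    # Return the best-scoring filename, or None if nothing clears the threshold.
    best, best_score = None, threshold
    for name in filenames:
        score = SequenceMatcher(None, url, name).ratio()
        if score >= best_score:
            best, best_score = name, score
    return best

Raising the threshold makes matching stricter (fewer false positives, more misses); lowering it does the opposite.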

Troubleshooting

Common Issues:

  • ⏱️ Timeouts: Increase the Lambda timeout/memory settings
  • 🔒 File Permissions: Ensure write access to the /tmp directory
  • 🖥️ Headful Issues: Test with a visible browser locally first
  • 🔍 Content Matching: Adjust the fuzzy match thresholds

Debugging Tips:

  1. Check CloudWatch logs
  2. Test locally with test.py
  3. Inspect downloaded files in /tmp/snapshots
  4. Enable verbose Chrome logging via --enable-logging=stderr (a sketch follows this list)
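
For tip 4, a hedged sketch of passing the logging flags, assuming a Selenium-style ChromeOptions object; adapt the flag wiring to whichever driver the project actually uses:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument('--enable-logging=stderr')
opts.add_argument('--v=1')  # raise Chrome's log verbosity
driver = webdriver.Chrome(options=opts)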

License

MIT License - See LICENSE for details

Contributions

PRs welcome! Please:

  1. Open an issue first for major changes
  2. Update tests accordingly
  3. Maintain coding style consistency

Happy scraping! 🕷️


Disclaimer: Always scrape responsibly and respect websites’ terms of service and robots.txt files.