A starter kit for running Chrome in GUI (headful) mode on AWS Lambda to batch-download web pages via a Chrome extension. Downloaded HTML is saved with gzip-compressed content storage.
Features:
- 🚀 Headful Chrome automation in AWS Lambda
- 📦 Chrome extension integration (SingleFile format)
- 🔄 Fuzzy matching of URLs to downloaded files
- 🗜️ Gzip compression + Base64 encoding of content
- 📁 Atomic file operations with proper cleanup
- 🐳 Dockerized Lambda environment
Prerequisites:
- AWS account with Lambda access
- Docker installed locally
- Python 3.10+
- AWS CLI configured
Installation:
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/musaspacecadet-aws_lambda_chrome_starter.git
  cd musaspacecadet-aws_lambda_chrome_starter
  ```
- Build the Docker image:
  ```bash
  docker build -t lambda-chrome-batch .
  ```
- Install Python requirements:
  ```bash
  pip install -r requirements.txt
  ```
- Set environment variables in `app.py`:
  ```python
  os.environ['DOWNLOAD_DIR'] = '/tmp/snapshots'  # Lambda's writable directory
  os.environ['EXTENSION_DIR'] = '/tmp/unpacked_extension'
  ```
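On Lambda, only `/tmp` is writable, and the feature list above mentions atomic file operations. A minimal sketch of how snapshots might be written safely — the helper name `write_snapshot_atomically` is hypothetical, not the project's actual API:

```python
import os
import tempfile

# Lambda functions may only write under /tmp, so the snapshot dir lives there.
DOWNLOAD_DIR = os.environ.get("DOWNLOAD_DIR", "/tmp/snapshots")

def write_snapshot_atomically(filename: str, html: str) -> str:
    """Write HTML via a temp file + rename, so a concurrent reader never
    observes a half-written snapshot (hypothetical helper)."""
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=DOWNLOAD_DIR, suffix=".part")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(html)
        final_path = os.path.join(DOWNLOAD_DIR, filename)
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
        return final_path
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial file on any failure
        raise
```

The rename-based pattern matters here because the extension and the handler touch the same directory, and a partially flushed file would otherwise be picked up as a finished download.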
Local testing:
- Start the Lambda runtime interface emulator:
  ```bash
  docker run -p 9000:8080 lambda-chrome-batch
  ```
- In another terminal, run the test script:
  ```bash
  python test.py
  ```
- Check the generated HTML files in the `output_html/` directory
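The runtime interface emulator exposes the standard Lambda invoke endpoint on port 9000. A minimal stdlib-only client along the lines of what `test.py` presumably does (the helper name and exact behaviour are assumptions):

```python
import json
import urllib.request

# Standard invoke path exposed by the AWS Lambda runtime interface emulator.
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"

def invoke_local(urls):
    """POST a batch-download event to the local emulator and return the
    parsed JSON response (hypothetical helper; test.py may differ)."""
    payload = json.dumps({"urls": urls}).encode("utf-8")
    req = urllib.request.Request(
        RIE_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the container from the previous step to be running):
# result = invoke_local(["https://example.com"])
```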
Deployment:
- Create an ECR repository in AWS
- Push the Docker image:
  ```bash
  aws ecr get-login-password | docker login --username AWS --password-stdin YOUR_ECR_URI
  docker tag lambda-chrome-batch:latest YOUR_ECR_URI/lambda-chrome-batch:latest
  docker push YOUR_ECR_URI/lambda-chrome-batch:latest
  ```
- Create the Lambda function from the container image
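For the last step, a sketch of the parameters a container-image function needs, in boto3 terms. The role ARN is a placeholder, and the timeout/memory values are assumptions — headful Chrome generally needs generous limits (see Troubleshooting below):

```python
# Placeholders: substitute your own ECR URI and IAM role ARN.
lambda_config = {
    "FunctionName": "lambda-chrome-batch",
    "PackageType": "Image",  # container-image deployment, not a zip package
    "Code": {"ImageUri": "YOUR_ECR_URI/lambda-chrome-batch:latest"},
    "Role": "arn:aws:iam::123456789012:role/lambda-chrome-role",  # placeholder
    "Timeout": 300,      # seconds; browser startup + page loads are slow
    "MemorySize": 2048,  # MB; Chrome is memory-hungry
}

# With boto3 installed and AWS credentials configured, this would create it:
# import boto3
# boto3.client("lambda").create_function(**lambda_config)
```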
Lambda event format:
```json
{
  "urls": [
    "https://example.com",
    "https://github.com",
    "https://google.com"
  ]
}
```
Sample response:
```json
{
  "url_mappings": {
    "https://example.com": {
      "filename": "d18c3abb...html",
      "content": "H4sIAAAAAAAEAI2S227j..."
    }
  }
}
```
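The `content` field is the page HTML, gzip-compressed and then Base64-encoded for JSON transport; a client reverses the two steps. A sketch of both directions — the encode side illustrates what `get_url_mapping_with_content()` presumably does, though its actual implementation may differ:

```python
import base64
import gzip

def encode_content(html: str) -> str:
    """gzip-compress the HTML, then Base64-encode it (sketch of the
    server-side step; the real function may differ in detail)."""
    return base64.b64encode(gzip.compress(html.encode("utf-8"))).decode("ascii")

def decode_content(b64_gzip: str) -> str:
    """Client side: Base64-decode, gunzip, and decode back to HTML text."""
    return gzip.decompress(base64.b64decode(b64_gzip)).decode("utf-8")
```

Note that gzip output starts with the magic bytes `1f 8b`, which Base64-encode to the `H4sI...` prefix visible in the sample response above.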
Configuration:
- Timeout Settings: Adjust `max_wait_time` in `main()`
- Extension: Modify `mpiodijhokgodhhofbcjdecpffjipkle.crx`
- Matching Logic: Tune thresholds in the `FileMatcher` class
- Compression: Modify the gzip/Base64 encoding in `get_url_mapping_with_content()`
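To make the `FileMatcher` thresholds concrete, here is a minimal sketch of threshold-based fuzzy matching using stdlib `difflib`. The function name, normalisation, and default threshold are illustrative assumptions — the project's actual matching logic may differ:

```python
import difflib
from urllib.parse import urlparse

def best_match(url: str, filenames: list[str], threshold: float = 0.4):
    """Pick the downloaded file whose name most resembles the URL's
    host + path; return None when nothing clears the threshold."""
    parsed = urlparse(url)
    key = (parsed.netloc + parsed.path).strip("/")
    scored = [
        (difflib.SequenceMatcher(None, key, name).ratio(), name)
        for name in filenames
    ]
    score, name = max(scored)
    return name if score >= threshold else None
```

Raising the threshold makes matching stricter (fewer false positives, more unmatched URLs); lowering it does the opposite — the same trade-off applies when tuning `FileMatcher`.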
Common Issues:
- ⏱️ Timeouts: Increase Lambda timeout/memory settings
- 🔒 File Permissions: Ensure /tmp directory write access
- 🖥️ Headful Chrome issues: Test with a visible browser locally first
- 🔍 Content Matching: Adjust fuzzy match thresholds
Debugging Tips:
- Check CloudWatch logs
- Test locally with `test.py`
- Inspect downloaded files in `/tmp/snapshots`
- Enable verbose Chrome logging via `--enable-logging=stderr`
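For that last tip, an illustrative flag set for launching headful Chrome with verbose logging and the unpacked extension loaded. This is a sketch of real Chromium switches, not the project's actual launch code, and the binary name may differ in your image:

```python
def chrome_command(extension_dir: str, user_data_dir: str) -> list[str]:
    """Assemble a debuggable headful Chrome invocation (illustrative)."""
    return [
        "google-chrome",                       # binary name may vary by image
        f"--user-data-dir={user_data_dir}",
        f"--load-extension={extension_dir}",   # unpacked extension dir
        "--enable-logging=stderr",             # send Chrome logs to stderr
        "--v=1",                               # raise verbosity
        "--no-first-run",
        "--no-default-browser-check",
    ]
```

Running it is then a matter of passing the list to `subprocess.Popen` and watching stderr in the CloudWatch (or local Docker) logs.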
MIT License - See LICENSE for details
PRs welcome! Please:
- Open issue first for major changes
- Update tests accordingly
- Maintain coding style consistency
Happy scraping! 🕷️
Disclaimer: Always scrape responsibly and respect websites’ terms of service and robots.txt files.