A starter kit for running Chrome in GUI (headful) mode on AWS Lambda to batch-download web pages via a Chrome extension. Downloaded HTML is saved with gzip-compressed content storage.
Features:
- 🚀 Headful Chrome automation in AWS Lambda
- 📦 Chrome extension integration (SingleFile format)
- 🔄 Fuzzy matching of URLs to downloaded files
- 🗜️ Gzip compression + Base64 encoding of content
- 📁 Atomic file operations with proper cleanup
- 🐳 Dockerized Lambda environment
Prerequisites:
- AWS account with Lambda access
- Docker installed locally
- Python 3.10+
- AWS CLI configured
Installation:
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/musaspacecadet-aws_lambda_chrome_starter.git
  cd musaspacecadet-aws_lambda_chrome_starter
  ```
- Build the Docker image:
  ```bash
  docker build -t lambda-chrome-batch .
  ```
- Install Python requirements:
  ```bash
  pip install -r requirements.txt
  ```
- Set environment variables in `app.py`:
  ```python
  os.environ['DOWNLOAD_DIR'] = '/tmp/snapshots'  # Lambda's writable directory
  os.environ['EXTENSION_DIR'] = '/tmp/unpacked_extension'
  ```
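On Lambda, only `/tmp` is writable, and the feature list above mentions atomic file operations. A minimal sketch of how snapshots might be written safely — the helper name `write_snapshot_atomically` is hypothetical, not the project's actual API:

```python
import os
import tempfile

# Lambda functions may only write under /tmp, so the snapshot dir lives there.
DOWNLOAD_DIR = os.environ.get("DOWNLOAD_DIR", "/tmp/snapshots")

def write_snapshot_atomically(filename: str, html: str) -> str:
    """Write HTML via a temp file + rename, so a concurrent reader never
    observes a half-written snapshot (hypothetical helper)."""
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=DOWNLOAD_DIR, suffix=".part")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(html)
        final_path = os.path.join(DOWNLOAD_DIR, filename)
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
        return final_path
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial file on any failure
        raise
```

The rename-based pattern matters here because the extension and the handler touch the same directory, and a partially flushed file would otherwise be picked up as a finished download.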
Local testing:
- Start the Lambda runtime interface emulator:
  ```bash
  docker run -p 9000:8080 lambda-chrome-batch
  ```
- In another terminal, run the test script:
  ```bash
  python test.py
  ```
- Check the generated HTML files in the `output_html/` directory
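The runtime interface emulator exposes the standard Lambda invoke endpoint on port 9000. A minimal stdlib-only client along the lines of what `test.py` presumably does (the helper name and exact behaviour are assumptions):

```python
import json
import urllib.request

# Standard invoke path exposed by the AWS Lambda runtime interface emulator.
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"

def invoke_local(urls):
    """POST a batch-download event to the local emulator and return the
    parsed JSON response (hypothetical helper; test.py may differ)."""
    payload = json.dumps({"urls": urls}).encode("utf-8")
    req = urllib.request.Request(
        RIE_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the container from the previous step to be running):
# result = invoke_local(["https://example.com"])
```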
Deployment:
- Create an ECR repository in AWS
- Push the Docker image:
  ```bash
  aws ecr get-login-password | docker login --username AWS --password-stdin YOUR_ECR_URI
  docker tag lambda-chrome-batch:latest YOUR_ECR_URI/lambda-chrome-batch:latest
  docker push YOUR_ECR_URI/lambda-chrome-batch:latest
  ```
- Create the Lambda function from the container image
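For the last step, a sketch of the parameters a container-image function needs, in boto3 terms. The role ARN is a placeholder, and the timeout/memory values are assumptions — headful Chrome generally needs generous limits (see Troubleshooting below):

```python
# Placeholders: substitute your own ECR URI and IAM role ARN.
lambda_config = {
    "FunctionName": "lambda-chrome-batch",
    "PackageType": "Image",  # container-image deployment, not a zip package
    "Code": {"ImageUri": "YOUR_ECR_URI/lambda-chrome-batch:latest"},
    "Role": "arn:aws:iam::123456789012:role/lambda-chrome-role",  # placeholder
    "Timeout": 300,      # seconds; browser startup + page loads are slow
    "MemorySize": 2048,  # MB; Chrome is memory-hungry
}

# With boto3 installed and AWS credentials configured, this would create it:
# import boto3
# boto3.client("lambda").create_function(**lambda_config)
```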
Lambda event format:
```json
{
  "urls": [
    "https://example.com",
    "https://github.com",
    "https://google.com"
  ]
}
```
Sample response:
```json
{
  "url_mappings": {
    "https://example.com": {
      "filename": "d18c3abb...html",
      "content": "H4sIAAAAAAAEAI2S227j..."
    }
  }
}
```
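The `content` field is the page HTML, gzip-compressed and then Base64-encoded for JSON transport; a client reverses the two steps. A sketch of both directions — the encode side illustrates what `get_url_mapping_with_content()` presumably does, though its actual implementation may differ:

```python
import base64
import gzip

def encode_content(html: str) -> str:
    """gzip-compress the HTML, then Base64-encode it (sketch of the
    server-side step; the real function may differ in detail)."""
    return base64.b64encode(gzip.compress(html.encode("utf-8"))).decode("ascii")

def decode_content(b64_gzip: str) -> str:
    """Client side: Base64-decode, gunzip, and decode back to HTML text."""
    return gzip.decompress(base64.b64decode(b64_gzip)).decode("utf-8")
```

Note that gzip output starts with the magic bytes `1f 8b`, which Base64-encode to the `H4sI...` prefix visible in the sample response above.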
Configuration:
- Timeout Settings: Adjust `max_wait_time` in `main()`
- Extension: Modify `mpiodijhokgodhhofbcjdecpffjipkle.crx`
- Matching Logic: Tune thresholds in the `FileMatcher` class
- Compression: Modify the gzip/Base64 encoding in `get_url_mapping_with_content()`
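To make the `FileMatcher` thresholds concrete, here is a minimal sketch of threshold-based fuzzy matching using stdlib `difflib`. The function name, normalisation, and default threshold are illustrative assumptions — the project's actual matching logic may differ:

```python
import difflib
from urllib.parse import urlparse

def best_match(url: str, filenames: list[str], threshold: float = 0.4):
    """Pick the downloaded file whose name most resembles the URL's
    host + path; return None when nothing clears the threshold."""
    parsed = urlparse(url)
    key = (parsed.netloc + parsed.path).strip("/")
    scored = [
        (difflib.SequenceMatcher(None, key, name).ratio(), name)
        for name in filenames
    ]
    score, name = max(scored)
    return name if score >= threshold else None
```

Raising the threshold makes matching stricter (fewer false positives, more unmatched URLs); lowering it does the opposite — the same trade-off applies when tuning `FileMatcher`.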
Common Issues:
- ⏱️ Timeouts: Increase Lambda timeout/memory settings
- 🔒 File Permissions: Ensure /tmp directory write access
- 🖥️ Headful Chrome issues: Test with a visible browser locally first
- 🔍 Content Matching: Adjust fuzzy match thresholds
Debugging Tips:
- Check CloudWatch logs
- Test locally with `test.py`
- Inspect downloaded files in `/tmp/snapshots`
- Enable verbose Chrome logging via `--enable-logging=stderr`
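For that last tip, an illustrative flag set for launching headful Chrome with verbose logging and the unpacked extension loaded. This is a sketch of real Chromium switches, not the project's actual launch code, and the binary name may differ in your image:

```python
def chrome_command(extension_dir: str, user_data_dir: str) -> list[str]:
    """Assemble a debuggable headful Chrome invocation (illustrative)."""
    return [
        "google-chrome",                       # binary name may vary by image
        f"--user-data-dir={user_data_dir}",
        f"--load-extension={extension_dir}",   # unpacked extension dir
        "--enable-logging=stderr",             # send Chrome logs to stderr
        "--v=1",                               # raise verbosity
        "--no-first-run",
        "--no-default-browser-check",
    ]
```

Running it is then a matter of passing the list to `subprocess.Popen` and watching stderr in the CloudWatch (or local Docker) logs.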
MIT License - See LICENSE for details
PRs welcome! Please:
- Open issue first for major changes
- Update tests accordingly
- Maintain coding style consistency
Happy scraping! 🕷️
Disclaimer: Always scrape responsibly and respect websites’ terms of service and robots.txt files.