AWS Serverless File Processing Pipeline

📌 Overview

This project implements an event-driven, serverless backend designed to automate file processing. When a file is uploaded to an Amazon S3 bucket, the system instantly triggers a validation and metadata extraction workflow, storing the results in Amazon DynamoDB.

The architecture adheres to core cloud-native principles—scalability, least-privilege security, and cost-efficiency—by operating entirely within the AWS Free Tier.


🏗 System Architecture

The pipeline follows a decoupled, event-driven flow, as illustrated below; a short event-parsing sketch follows the diagram:

  1. Trigger: A user or system uploads a file to the Amazon S3 bucket.
  2. Event: S3 emits an s3:ObjectCreated:* event notification, which invokes the AWS Lambda function.
  3. Compute: The Lambda function executes the logic to validate the file and extract metadata.
  4. Storage: Processed metadata is persisted into an Amazon DynamoDB table.
  5. Observability: Lambda streams execution logs and errors to Amazon CloudWatch for monitoring.
[Architecture Diagram]
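
As context for steps 2 and 3, here is a minimal sketch (not the repository's code) of how a handler typically unpacks the event payload that S3 delivers to Lambda; the record layout shown is the standard shape of s3:ObjectCreated notifications:

```python
import urllib.parse

def lambda_handler(event, context):
    """Invoked by S3 with a batch of ObjectCreated records."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event payloads (spaces become '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        print(f"Received s3://{bucket}/{key} ({size} bytes)")
```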

🛠 Tech Stack

  • Storage: Amazon S3 (Simple Storage Service)
  • Compute: AWS Lambda (Python 3.9 / Boto3 SDK)
  • Database: Amazon DynamoDB (NoSQL)
  • Security: AWS IAM (Identity and Access Management)
  • Monitoring: Amazon CloudWatch Logs

🚀 Key Features

  • Event-Driven: Processing starts the moment a file lands in S3, with no polling loop or batch delay.
  • Optimized Storage: Efficient DynamoDB schema design for fast metadata retrieval.
  • Security-First: Implements least-privilege access using granular IAM policies.
  • Validation Logic: Automated checks for file types and constraints prior to storage.
  • Full Observability: Real-time debugging and historical execution tracking via CloudWatch.

🔧 Implementation Details

1. Amazon S3 Bucket (Source)

A dedicated S3 bucket, shown below as file-upload-bucket-nep-1, serves as the landing zone for raw files. It is configured to send an event notification on s3:ObjectCreated:* actions to invoke the backend Lambda function.

[S3 Bucket Overview]
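
As a sketch, this trigger can be wired up with Boto3 roughly as follows. The Lambda ARN is a placeholder, and the function must separately grant s3.amazonaws.com permission to invoke it (via lambda add-permission):

```python
import boto3

s3 = boto3.client("s3")

# Placeholder ARN; replace with the deployed function's ARN.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:file-processor"

# Route all ObjectCreated events from the bucket to the Lambda function.
s3.put_bucket_notification_configuration(
    Bucket="file-upload-bucket-nep-1",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": LAMBDA_ARN,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```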

2. IAM Role & Security

To ensure secure execution, a custom IAM role was created for the Lambda function with strictly scoped permissions (a policy sketch follows the list):

  • s3:GetObject: Read access limited strictly to the source S3 bucket.
  • dynamodb:PutItem: Write access limited to the metadata DynamoDB table.
  • logs:*: Permissions to write execution events to CloudWatch logs.
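
A rough sketch of what that policy could look like, expressed as a Python dict for use with Boto3. The account ID, region, and FileMetadata table name are assumptions, and the logs:* permission from the list above is narrowed here to the three actions Lambda actually needs:

```python
import json

# Placeholder account ID, region, and table name throughout.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Read uploads from the source bucket only.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::file-upload-bucket-nep-1/*",
        },
        {   # Write metadata items to the one table.
            "Effect": "Allow",
            "Action": "dynamodb:PutItem",
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/FileMetadata",
        },
        {   # Emit execution logs to CloudWatch.
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "arn:aws:logs:*:*:*",
        },
    ],
}

print(json.dumps(POLICY, indent=2))
```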

3. AWS Lambda Function (Processor)

The core logic is handled by a Python 3.9 Lambda function shown below. The code initializes Boto3 clients for S3 and DynamoDB, parses the incoming event to retrieve the bucket name and file key, and passes them to a processing handler.

[Lambda Function Console]
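
The repository's exact source isn't reproduced here; the following is a minimal sketch of the described flow, assuming a FileMetadata table, a small MIME-type whitelist, and a size cap (all three are illustrative, not taken from the repo):

```python
import datetime
import urllib.parse
import uuid

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("FileMetadata")  # assumed table name

ALLOWED_TYPES = {"application/pdf", "text/plain"}  # illustrative whitelist
MAX_SIZE_BYTES = 5 * 1024 * 1024                   # illustrative 5 MB cap


def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Read object metadata without downloading the body.
    head = s3.head_object(Bucket=bucket, Key=key)
    file_type = head.get("ContentType", "application/octet-stream")
    file_size = head["ContentLength"]

    # Validation: reject unexpected types and oversized files.
    if file_type not in ALLOWED_TYPES or file_size > MAX_SIZE_BYTES:
        raise ValueError(f"Validation failed for {key}: {file_type}, {file_size} bytes")

    # Persist the extracted metadata.
    table.put_item(
        Item={
            "documentId": str(uuid.uuid4()),
            "fileName": key,
            "fileType": file_type,
            "fileSize": file_size,
            "status": "PROCESSED",
            "uploadedAt": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
    )
    print(f"File {key} processed with status PROCESSED")
```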

4. Amazon DynamoDB (Metadata Storage)

Extracted metadata is stored based on the schema defined below. The screenshot confirms that files uploaded to S3 (like sample.pdf and test.txt) have been successfully processed and their details populated in the table with a status of PROCESSED.

Table Schema:

| Attribute  | Type        | Description                              |
|------------|-------------|------------------------------------------|
| documentId | String (PK) | Unique identifier (UUID)                 |
| fileName   | String      | Original name of the file                |
| fileType   | String      | MIME type (e.g., application/pdf)        |
| fileSize   | Number      | Size in bytes                            |
| status     | String      | Processing state (PROCESSED or FAILED)   |
| uploadedAt | String      | ISO 8601 timestamp                       |

[DynamoDB Table Items]
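
Because documentId is the partition key, a file's metadata is retrieved with a single get_item call. A sketch, with an illustrative key value and the assumed table name:

```python
import boto3

table = boto3.resource("dynamodb").Table("FileMetadata")  # assumed table name

# Point lookup by partition key (the documentId value here is illustrative).
resp = table.get_item(Key={"documentId": "3f1b2c64-0000-4000-8000-000000000000"})
item = resp.get("Item")
if item:
    print(item["fileName"], item["status"], item["fileSize"])
```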

📊 Testing & Validation

The system was validated across the scenarios below. The CloudWatch logs provide proof of a successful execution flow, and a repeatable smoke-test sketch follows the list.

  1. Success Path: Uploaded sample.pdf. The logs confirm the function was triggered and logged the extracted metadata: "File sample.pdf processed with status PROCESSED, words=446".
  2. Error Handling: Verified that invalid files (e.g., exceeding size limits) are caught by validation logic and logged as errors without corrupting the database.
  3. Security: Verified that the Lambda role cannot access resources outside its defined scope.
[CloudWatch Logs]
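
A minimal way to re-run the success path end to end (bucket and table names as assumed above; the sleep is a crude stand-in for the asynchronous trigger's latency):

```python
import time

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("FileMetadata")  # assumed table name

# Upload a local test file; this fires the ObjectCreated trigger.
s3.upload_file("sample.pdf", "file-upload-bucket-nep-1", "sample.pdf")

time.sleep(10)  # crude wait for the asynchronous Lambda invocation

# Confirm a metadata item landed in the table (a scan is fine at test scale).
items = table.scan()["Items"]
assert any(i["fileName"] == "sample.pdf" for i in items), "no metadata item found"
print("Smoke test passed.")
```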

💡 Key Learnings

  • Event-Driven Architectures: Designing systems that react asynchronously to state changes rather than polling.
  • Cloud Security Posture: The critical importance of defining granular IAM policies to minimize the attack surface.
  • NoSQL Data Modeling: Designing efficient, single-table schemas in DynamoDB for operational workloads.
  • Serverless Operations: Managing the lifecycle and monitoring of Functions-as-a-Service (FaaS).

🛠 Future Roadmap

  • AI/ML Integration: Incorporate Amazon Textract for deep document OCR and data extraction.
  • Enhanced Security: Integrate antivirus scanning on S3 uploads prior to triggering the processing workflow.
  • User Notifications: Use Amazon SNS to send email or SMS alerts upon successful processing or failures.
  • Frontend Application: Develop a modern Next.js web interface for users to upload files and visualize processed metadata.
