This Terraform solution deploys a fully automated LLM inference service on AWS EC2, using vLLM to expose OpenAI-compatible API endpoints. It supports GPU acceleration for high-performance language model inference.
- Controlled Deployment Process: Added version-based EC2 instance replacement mechanism for zero-downtime deployments
- Idempotent Infrastructure: Fixed multiple idempotency issues to ensure consistent Terraform plans
- HTTPS Support: Added support for secure HTTPS endpoints using Let's Encrypt certificates
- Enhanced Monitoring: Added comprehensive diagnostics and health check scripts
- Improved Error Handling: Better token retrieval and service startup logic
- Force Delete Support: Added support for clean terraform destroy operations
- Maintenance Commands: Added output with useful maintenance commands for easier troubleshooting
The solution includes the following components:
- VPC: Secure network environment with public and private subnets
- EC2 Instance: Running the latest Ubuntu AMI with Docker pre-installed
- ECR Repository: For storing the inference application Docker images
- Route53 DNS: Optional DNS record configuration for easy access
- IAM Roles: Properly scoped permissions for the EC2 instance
- Security Groups: Configured for secure access to the application
The infrastructure is designed to be fully idempotent, ensuring that repeated `terraform apply` runs with unchanged inputs result in no modifications. Key features:
- Fixed timestamps for user_data and other time-sensitive values
- Explicit resource references to avoid lookup changes
- Lifecycle configurations to stabilize resource dependencies (illustrated in the sketch after this list)
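For illustration, a minimal sketch of this pattern, with hypothetical resource and file names rather than this module's actual code:

```hcl
# Sketch only: names and structure are assumptions, not this repo's code.
resource "aws_instance" "inference" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  # A static file (rather than templatefile() with timestamp()) keeps the
  # user_data hash stable across plans.
  user_data = file("${path.module}/scripts/user_data.sh")

  lifecycle {
    # Ignore attributes that drift outside Terraform's control, so repeated
    # plans with unchanged inputs stay empty.
    ignore_changes = [ami]
  }
}
```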
The solution uses a version-based mechanism for controlled infrastructure updates:
- EC2 instances are versioned via the `ec2_instance_version` variable
- Replacements create new instances before destroying old ones (`create_before_destroy`; see the sketch after this list)
- Elastic IPs ensure stable endpoints during replacements
- Outputs track version information for monitoring
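A minimal sketch of how such a mechanism can be wired up with standard Terraform constructs (resource names here are illustrative, not the module's actual code):

```hcl
variable "ec2_instance_version" {
  description = "Bump to force a controlled instance replacement"
  type        = number
  default     = 1
}

# terraform_data captures the version so it can act as a replacement trigger.
resource "terraform_data" "instance_version" {
  input = var.ec2_instance_version
}

resource "aws_instance" "inference" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  lifecycle {
    create_before_destroy = true
    # Incrementing ec2_instance_version replaces the instance.
    replace_triggered_by = [terraform_data.instance_version]
  }
}

resource "aws_eip" "inference" {
  domain = "vpc"
}

# Re-associating the EIP keeps the public endpoint stable across replacements.
resource "aws_eip_association" "inference" {
  instance_id   = aws_instance.inference.id
  allocation_id = aws_eip.inference.id
}
```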
Detailed documentation is available in the `docs/` directory:
- Architecture Document: System design and components
- Operations Guide: Deployment and maintenance instructions
- Development Roadmap: Current status and future plans
- API Reference: API endpoints documentation
- Customization Guide: How to customize the solution
- AWS CLI configured with appropriate credentials
- Terraform 1.7.0 or later
- Docker (for local testing, optional)
- An existing Route53 hosted zone (if DNS records are needed)
- A HuggingFace token stored in SSM Parameter Store (default path: `/inference/hf_token`; an example command follows this list)
- (Optional) GPU-enabled EC2 instance for production use
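The token parameter can be created with the AWS CLI, for example (placeholder token value; adjust the region to match your deployment):

```bash
aws ssm put-parameter \
  --name /inference/hf_token \
  --type SecureString \
  --value "hf_xxxxxxxxxxxxxxxxxxxx" \
  --region us-west-2
```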
Before initializing Terraform, create an S3 bucket to store the Terraform state:
```bash
aws s3api create-bucket \
  --bucket ob-lq-live-inference-solution-terraform-state-us-west-2 \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2

# Enable S3 bucket versioning for state recovery
aws s3api put-bucket-versioning \
  --bucket ob-lq-live-inference-solution-terraform-state-us-west-2 \
  --versioning-configuration Status=Enabled

# Enable S3 bucket encryption for security
aws s3api put-bucket-encryption \
  --bucket ob-lq-live-inference-solution-terraform-state-us-west-2 \
  --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'
```
- Clone this repository
- Update `terraform.tfvars` with your desired configuration
- Make sure the S3 bucket name in `backend.tf` matches the bucket you created (see the example backend configuration after this list)
- Deploy with Terraform:

  ```bash
  terraform init
  terraform apply
  ```

- Access your API using the outputs provided by Terraform
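For reference, `backend.tf` should point at the bucket created above; a sketch (the state `key` shown here is illustrative):

```hcl
terraform {
  backend "s3" {
    bucket  = "ob-lq-live-inference-solution-terraform-state-us-west-2"
    key     = "inference-solution/terraform.tfstate" # illustrative key
    region  = "us-west-2"
    encrypt = true
  }
}
```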
Edit `terraform.tfvars` to customize (a sample follows the list):
- AWS region
- Environment name (production, staging, etc.)
- Domain name for DNS records
- Admin IPs allowed to access management endpoints
- Instance type and other EC2 parameters
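A sample `terraform.tfvars` along these lines (apart from `gpu_instance_type`, which appears later in this README, the variable names are assumptions; check the module's variable definitions for the exact names):

```hcl
# Illustrative values; variable names are assumptions based on the list above.
aws_region        = "us-west-2"
environment       = "production"
domain_name       = "inference.example.com"
admin_ips         = ["203.0.113.10/32"]
gpu_instance_type = "g4dn.xlarge"
```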
The solution comes with a pre-configured vLLM setup. To customize:
- Update `terraform.tfvars` to change the model and instance parameters:

  ```hcl
  # Example: Switch to GPU and use a different model
  use_gpu_instance  = true
  gpu_instance_type = "g4dn.xlarge"
  model_id          = "meta-llama/Llama-2-7b-chat-hf"
  max_model_len     = 4096
  ```

- If needed, modify the API proxy in `app/server.js`
- Run `terraform apply` to rebuild and redeploy
For detailed configuration options, see the Customization Guide.
After deployment, Terraform provides detailed outputs (re-readable at any time, as shown after this list), including:
- API endpoint URLs (IP-based and domain-based)
- SSH connection string
- ECR repository URL
- Detailed resource IDs and information
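Outputs can be re-read without re-applying (the output name in the second command is illustrative):

```bash
# Show all outputs
terraform output

# Show a single output, e.g. the API endpoint
terraform output api_endpoint
```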
- Modify the application code in the `app/` directory
- Run `terraform apply` to rebuild and push the Docker image
- The EC2 instance will automatically pull the latest image via a cron job (sketched below)
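Conceptually, the refresh script the cron job runs looks like the following sketch (registry URL, image name, and script layout are assumptions, not the repo's actual user_data):

```bash
#!/usr/bin/env bash
# Sketch of a cron-driven image refresh; replace ACCOUNT_ID with your own.
set -euo pipefail
REGISTRY="ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com"

# Authenticate Docker against ECR using the instance's IAM role.
aws ecr get-login-password --region us-west-2 |
  docker login --username AWS --password-stdin "$REGISTRY"

# Pull the latest application image; the running container can then be
# restarted against it.
docker pull "$REGISTRY/inference-app:latest"
```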
- Edit `terraform.tfvars` and increment the `ec2_instance_version` value:

  ```hcl
  # Change this from the current value (e.g., from 1 to 2)
  ec2_instance_version = 2
  ```

- Run `terraform apply`
- Terraform will:
  - Create a new EC2 instance with the updated configuration
  - Wait for the new instance to be ready
  - Move the Elastic IP to the new instance
  - Terminate the old instance
This approach enables zero-downtime deployments while maintaining the same public IP address.
To remove all resources:
```bash
terraform destroy
```
- The EC2 instance uses IMDSv2 for enhanced security
- The security group restricts access to configured IP ranges
- All data volumes are encrypted (the relevant instance settings are sketched below)
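In Terraform terms, these protections correspond to instance settings like the following sketch (the resource name is illustrative; the attributes are standard `aws_instance` options):

```hcl
resource "aws_instance" "inference" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = var.instance_type

  metadata_options {
    http_endpoint = "enabled"
    http_tokens   = "required" # enforce IMDSv2
  }

  root_block_device {
    encrypted = true # encrypt the data volume at rest
  }
}
```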
For detailed troubleshooting steps, refer to the Operations Guide.
Common issues:
- Check CloudWatch logs for application issues
- SSH to the instance using the provided connection string
- Use the AWS SSM Session Manager for secure console access
- If you encounter Terraform state issues:
  - Verify the S3 bucket exists and is accessible
  - Check that the bucket name in `backend.tf` matches the created bucket
  - Ensure you have proper permissions to read/write to the bucket
  - For state lock issues, you may need to manually release locks in S3 using the AWS console (see the `force-unlock` example below)
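Terraform can also release a stale lock directly, given the lock ID printed in the error message:

```bash
# Releases the lock with the given ID; use only when no apply is running.
terraform force-unlock <LOCK_ID>
```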
This project is actively maintained. See the Development Roadmap for information about current status, planned features, and technical debt.
We use semantic versioning for release management. To tag a new version:
```bash
# List existing tags
git tag -l

# Create a new tag (locally)
git tag -a v1.0.0 -m "Initial stable release"

# Push the tag to the remote repository
git push origin v1.0.0

# Push all tags
git push origin --tags
```
To checkout a specific tag:
```bash
# Create a branch from a tag
git checkout -b branch-name v1.0.0

# Or view the code at a specific tag without creating a branch
git checkout v1.0.0
```
The version number format follows semantic versioning:
- MAJOR version for incompatible API changes (v1.0.0 → v2.0.0)
- MINOR version for backward-compatible functionality additions (v1.0.0 → v1.1.0)
- PATCH version for backward-compatible bug fixes (v1.0.0 → v1.0.1)
- Complete and test your changes
- Update documentation to reflect changes
- Create a git tag following semantic versioning
- Push the tag to the repository
- Create a detailed release note in your repository management system
See the LICENSE file for details.