Complete AWS solution for deploying an Apache Flink 1.20 streaming application using Amazon Managed Service for Apache Flink. The application continuously generates streaming data at 2 records per second and writes it to an S3 table.
Status: ✅ Production-ready
Region: us-east-2 (configured via AWS CLI default profile)
Last Updated: October 10, 2025
- Solution Architecture
- Components
- Prerequisites
- Quick Start
- Scripts Reference
- Application Details
- Data Structure
- Monitoring & Operations
- Troubleshooting
- Clean Up
┌─────────────────────────────────────────────────────────────────┐
│ AWS Cloud (us-east-2) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Amazon Managed Service for Apache Flink │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Flink Application (datahose-app) │ │ │
│ │ │ - Apache Flink 1.20 │ │ │
│ │ │ - Java 11 (SDKMAN: 11.0.28-amzn) │ │ │
│ │ │ - DataStream API │ │ │
│ │ │ - Streaming Mode (Continuous) │ │ │
│ │ │ - Data Generator: 2 records/sec │ │ │
│ │ │ - Rolling Policy: 30s inactivity / 2min │ │ │
│ │ │ - Checkpointing: 60s │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ │ writes data continuously │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ S3 Data Bucket: tm-data-bucket-20251010 │ │
│ │ └── datafall/ (S3 Table with 'foams' column) │ │
│ │ └── 2025-10-10--17/ │ │
│ │ ├── part-xxx-0 (finalized) │ │
│ │ ├── part-xxx-1 (finalized) │ │
│ │ └── .part-xxx.inprogress (in-progress) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ S3 Application Bucket: tm-streaming-app-bucket-20251010 │
│ │ └── datahose-app.jar (31 MB, versioned) │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ IAM Role: datahose-app-flink-role │ │
│ │ └── Policy: datahose-app-flink-policy │ │
│ │ - S3 Read/Write │ │
│ │ - CloudWatch Logs/Metrics │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ CloudWatch Logs: /aws/kinesis-analytics/datahose-app │ │
│ │ - Retention: 7 days │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
- Runtime: Apache Flink 1.20 (STREAMING mode)
- Language: Java 11 (managed via SDKMAN)
- Source: DataGeneratorSource (unbounded, 2 records/second)
- Sink: S3 FileSink with rolling policy
- Checkpointing: Every 60 seconds
- Data Format: Text records with timestamp
- Parallelism: 1 (running on 1 KPU, Kinesis Processing Unit)
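Once deployed, the runtime and parallelism above can be confirmed against the live application. This is a read-only check and assumes the application has already been created by the scripts described later:

```bash
# Confirm the deployed Flink runtime (expected: FLINK-1_20) and parallelism settings
aws kinesisanalyticsv2 describe-application \
  --application-name datahose-app \
  --query 'ApplicationDetail.[RuntimeEnvironment, ApplicationConfigurationDescription.FlinkApplicationConfigurationDescription.ParallelismConfigurationDescription]'
```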
- Application Bucket: `tm-streaming-app-bucket-20251010`
  - Stores versioned JAR file (31 MB)
  - Versioning enabled for rollback capability
- Data Bucket: `tm-data-bucket-20251010`
  - Stores streaming output in the `datafall/` table
  - Versioning enabled
  - Rolling files based on inactivity (30s) or time (2min)
- Role: `datahose-app-flink-role`
  - Service: `kinesisanalytics.amazonaws.com`
  - Trust relationship configured for Managed Flink (spot checks follow this list)
- Policy: `datahose-app-flink-policy`
  - S3: GetObject, PutObject, ListBucket
  - CloudWatch: CreateLogGroup, CreateLogStream, PutLogEvents, PutMetricData
  - EC2/VPC: DescribeVpcs, DescribeSubnets, etc. (for VPC access)
- Log Group: `/aws/kinesis-analytics/datahose-app`
  - Retention: 7 days
  - Log Types: Application logs, checkpoint logs, error logs
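The role trust relationship, policy attachment, and log retention listed above can be spot-checked with a few read-only CLI calls (these assume the resources have already been created by `iac_create.sh`):

```bash
# Trust policy: should allow kinesisanalytics.amazonaws.com to assume the role
aws iam get-role --role-name datahose-app-flink-role \
  --query 'Role.AssumeRolePolicyDocument'

# Attached policies: should include datahose-app-flink-policy
aws iam list-attached-role-policies --role-name datahose-app-flink-role

# Log group retention: should report 7 days
aws logs describe-log-groups \
  --log-group-name-prefix /aws/kinesis-analytics/datahose-app \
  --query 'logGroups[].[logGroupName,retentionInDays]'
```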
- AWS CLI (v2 or later)
  aws --version   # If not installed: https://aws.amazon.com/cli/
- Maven (3.x or later)
  mvn --version   # If not installed: https://maven.apache.org/install.html
- Java 11 (via SDKMAN recommended)
  # Install SDKMAN
  curl -s "https://get.sdkman.io" | bash
  source "$HOME/.sdkman/bin/sdkman-init.sh"
  # Install Java 11
  sdk install java 11.0.28-amzn
  sdk use java 11.0.28-amzn
- jq (for JSON parsing)
  # Fedora/RHEL
  sudo dnf install jq
  # Ubuntu/Debian
  sudo apt-get install jq
  # macOS
  brew install jq
- Configure AWS Credentials
  aws configure
  You'll need:
  - AWS Access Key ID
  - AWS Secret Access Key
  - Default region: us-east-2
  - Default output format: json
- Set Region (critical!)
  aws configure set region us-east-2
  Verify:
  aws configure get region   # Should output: us-east-2
- Required AWS Permissions
  Your AWS user/role needs (a pre-flight check follows this list):
  - IAM: CreateRole, DeleteRole, CreatePolicy, DeletePolicy, AttachRolePolicy, DetachRolePolicy
  - S3: CreateBucket, DeleteBucket, PutObject, GetObject, ListBucket
  - Kinesis Analytics: CreateApplication, DeleteApplication, StartApplication, StopApplication
  - CloudWatch: CreateLogGroup, DeleteLogGroup, PutRetentionPolicy
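As the pre-flight check mentioned above, you can confirm which identity the CLI is using and simulate a sample of the required actions. This is a sketch: `simulate-principal-policy` expects an IAM user or role ARN, so replace the ARN if you work through an assumed-role session, and note that Managed Flink actions use the `kinesisanalytics:` prefix.

```bash
# Identity the CLI is authenticated as
aws sts get-caller-identity

# Simulate a sample of the required actions against that identity
aws iam simulate-principal-policy \
  --policy-source-arn "$(aws sts get-caller-identity --query Arn --output text)" \
  --action-names iam:CreateRole s3:CreateBucket kinesisanalytics:CreateApplication logs:CreateLogGroup \
  --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
  --output table
```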
cd datahose-app
./setup-env.sh
This script:
- Initializes SDKMAN and Java 11
- Loads Flink configuration
- Sets up AWS region from your CLI profile
- Displays available commands
./iac_create.sh
This creates:
- S3 buckets (application + data)
- IAM role and policy
- CloudWatch log group
- Configuration file at `/tmp/flink-config.env`
Expected output:
[INFO] Infrastructure Creation Complete!
[INFO] Resources Created:
✓ S3 Bucket (Application): tm-streaming-app-bucket-20251010
✓ S3 Bucket (Data): tm-data-bucket-20251010
✓ IAM Role: datahose-app-flink-role
✓ CloudWatch Log Group: /aws/kinesis-analytics/datahose-app
# Load configuration
source /tmp/flink-config.env
# Build and deploy
./cicd.sh
This script:
- Initializes Java 11 via SDKMAN
- Builds the application with Maven (creates 31 MB JAR)
- Uploads JAR to S3
- Creates/updates Flink application
- Starts the application in STREAMING mode
- Displays initial logs
Expected output:
[INFO] Build complete: target/datahose-app.jar (31 MB)
[INFO] JAR uploaded to S3
[INFO] Flink application created/updated
[INFO] Application starting...
[INFO] Application status: RUNNING
./verify.sh
Performs 6 health checks:
- ✓ AWS credentials
- ✓ S3 buckets exist
- ✓ IAM role exists
- ✓ CloudWatch log group exists
- ✓ Flink application status (RUNNING)
- ✓ Recent logs available
# Watch application logs
aws logs tail /aws/kinesis-analytics/datahose-app --follow
# Check S3 data files
aws s3 ls s3://tm-data-bucket-20251010/datafall/ --recursive
# View sample data
aws s3 cp s3://tm-data-bucket-20251010/datafall/2025-10-10--17/part-0-0 - | head -10
Purpose: Initialize development environment
What it does:
- Loads SDKMAN and sets Java 11
- Loads Flink configuration from `/tmp/flink-config.env`
- Configures AWS region from CLI profile
- Displays available commands
Usage:
./setup-env.sh
Output:
- Environment variables loaded
- Java version confirmed
- AWS region confirmed
Purpose: Create all AWS infrastructure
What it does:
- Creates S3 bucket for application JAR with versioning
- Creates S3 bucket for data sink with versioning
- Creates CloudWatch log group with 7-day retention
- Creates IAM role with trust policy for Kinesis Analytics
- Creates IAM policy with S3, CloudWatch, and VPC permissions
- Attaches policy to role
- Saves configuration to `/tmp/flink-config.env`
Usage:
./iac_create.sh
Configuration saved:
export APP_NAME="datahose-app"
export STREAMING_APP_BUCKET="tm-streaming-app-bucket-20251010"
export DATA_BUCKET="tm-data-bucket-20251010"
export REGION="us-east-2"
export FLINK_ROLE_ARN="arn:aws:iam::ACCOUNT_ID:role/datahose-app-flink-role"
Resources Created:
- S3 buckets (versioned)
- IAM role and policy
- CloudWatch log group
- Configuration file
Purpose: Build application and deploy to Managed Flink
What it does:
- Initializes Java 11 via SDKMAN
- Builds Maven project (`mvn clean package`)
- Uploads JAR to S3 with versioning
- Creates Flink application (if first deployment)
- Updates Flink application (if already exists)
- Stops application if running before update
- Starts application in STREAMING mode
- Monitors deployment status
- Displays initial logs
Usage:
# Load configuration first
source /tmp/flink-config.env
# Run CI/CD
./cicd.sh
Build Output:
- JAR file: `target/datahose-app.jar` (31 MB)
- Uploaded to: `s3://tm-streaming-app-bucket-20251010/datahose-app.jar`
Application Versions:
- Each deployment increments version number
- Previous versions retained in S3 (versioning enabled)
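Because the application bucket is versioned, the retained JAR versions can be listed directly; this is a read-only check against the bucket created by `iac_create.sh`:

```bash
# List every retained version of the application JAR
aws s3api list-object-versions \
  --bucket tm-streaming-app-bucket-20251010 \
  --prefix datahose-app.jar \
  --query 'Versions[].[VersionId,LastModified,IsLatest]' \
  --output table
```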
Purpose: Verify all resources and application health
What it does: Performs 6 checks:
- AWS credentials configured
- S3 buckets exist and accessible
- IAM role exists
- CloudWatch log group exists
- Flink application status
- Recent logs available
Usage:
./verify.sh
Example Output:
╔════════════════════════════════════════════════════════╗
║ Flink Application Verification Report ║
╚════════════════════════════════════════════════════════╝
=== AWS Credentials ===
[✓] Account ID: 047472788728
[✓] User/Role: arn:aws:iam::047472788728:user/username
=== S3 Buckets ===
[✓] Application bucket: tm-streaming-app-bucket-20251010
[✓] Data bucket: tm-data-bucket-20251010
=== IAM Resources ===
[✓] IAM Role: datahose-app-flink-role
=== CloudWatch Logs ===
[✓] Log group: /aws/kinesis-analytics/datahose-app
=== Flink Application ===
[✓] Application: datahose-app
[✓] Status: RUNNING
[✓] Version: 3
=== Recent Logs ===
[✓] Found 150 log entries in the last 10 minutes
=== Summary ===
Health Score: 6/6 checks passed
[✓] System is healthy
Purpose: Destroy all AWS resources (with confirmation)
What it does:
- Stops Flink application if running
- Deletes Flink application
- Deletes all S3 objects (including versions)
- Deletes S3 buckets
- Detaches and deletes IAM policy
- Deletes IAM role
- Deletes CloudWatch log group
- Removes configuration file
Usage:
# Interactive mode (with confirmation prompt)
./iac_destroy.sh
# Force mode (skip confirmation)
./iac_destroy.sh --force
Safety Features:
- Requires explicit "yes" confirmation
- Shows list of resources before deletion
- Handles versioned S3 objects properly
- Gracefully handles missing resources
Warning: This is destructive and cannot be undone!
File: src/main/java/org/muralis/datahose/StreamingApp.java
Key Features:
- Unbounded Data Generator
  DataGeneratorSource<String> source = new DataGeneratorSource<>(
      index -> String.format("Record-%d: Data from foams column at %s", index, Instant.now()),
      Long.MAX_VALUE,                    // Unbounded stream
      RateLimiterStrategy.perSecond(2),  // 2 records/second
      Types.STRING                       // Type information for the emitted String records
  );
- S3 File Sink
  FileSink<String> sink = FileSink
      .forRowFormat(new Path("s3://tm-data-bucket-20251010/datafall"),
                    new SimpleStringEncoder<String>("UTF-8"))
      .withRollingPolicy(
          DefaultRollingPolicy.builder()
              .withRolloverInterval(Duration.ofMinutes(2))
              .withInactivityInterval(Duration.ofSeconds(30))
              .withMaxPartSize(1024 * 1024)  // 1 MB
              .build()
      )
      .build();
- Checkpointing
  env.enableCheckpointing(60000);  // Every 60 seconds
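Managed Service for Apache Flink also governs checkpointing at the service level (its default interval is 60 seconds, matching the value above). Once the application exists, the effective settings can be read back as a quick sanity check:

```bash
# Inspect the checkpoint configuration the service applies to the application
aws kinesisanalyticsv2 describe-application \
  --application-name datahose-app \
  --query 'ApplicationDetail.ApplicationConfigurationDescription.FlinkApplicationConfigurationDescription.CheckpointConfigurationDescription'
```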
DataGeneratorSource (2 rec/sec)
↓
Stream<String>
↓
S3 FileSink (rolling every 30s-2min)
↓
s3://tm-data-bucket-20251010/datafall/
Each record contains:
Record-{index}: Data from foams column at {ISO-8601 timestamp}
Example:
Record-0: Data from foams column at 2025-10-10T17:15:23.456Z
Record-1: Data from foams column at 2025-10-10T17:15:23.956Z
Record-2: Data from foams column at 2025-10-10T17:15:24.456Z
Location: s3://tm-data-bucket-20251010/datafall/
Schema:
| Column | Type | Description |
|---|---|---|
| foams | VARCHAR | Streaming data content with timestamp |
Directory Structure:
datafall/
├── 2025-10-10--17/
│ ├── part-0-0 # Finalized file (~7.2 KB)
│ ├── part-0-1 # Finalized file (~7.2 KB)
│ ├── part-0-2 # Finalized file (~7.2 KB)
│ └── .part-0-3.inprogress.xyz # Currently being written
├── 2025-10-10--18/
│ └── ...
File Properties:
- Naming: `part-{subtask}-{fileIndex}`
- In-Progress: `.part-{subtask}-{fileIndex}.inprogress.{uuid}`
- Format: Plain text (UTF-8)
- Rolling: New file every 30s (inactivity) or 2min (max time)
- Size: ~7.2 KB per finalized file (at 2 rec/sec for 30-120s)
Method 1: Direct S3 read
# List all data files
aws s3 ls s3://tm-data-bucket-20251010/datafall/ --recursive
# Download and view specific file
aws s3 cp s3://tm-data-bucket-20251010/datafall/2025-10-10--17/part-0-0 - | head -20
# Count total records in a file
aws s3 cp s3://tm-data-bucket-20251010/datafall/2025-10-10--17/part-0-0 - | wc -l
Method 2: S3 Select (SQL-like)
# Query data using S3 Select
aws s3api select-object-content \
--bucket tm-data-bucket-20251010 \
--key "datafall/2025-10-10--17/part-0-0" \
--expression "SELECT * FROM S3Object[*][*] s LIMIT 10" \
--expression-type SQL \
--input-serialization '{"CSV": {"FileHeaderInfo": "NONE"}}' \
--output-serialization '{"CSV": {}}' \
output.csv
cat output.csv
Method 3: AWS Athena (for larger datasets)
-- Create external table
CREATE EXTERNAL TABLE datafall (
foams STRING
)
LOCATION 's3://tm-data-bucket-20251010/datafall/';
-- Query data
SELECT COUNT(*) FROM datafall;
SELECT * FROM datafall LIMIT 10;
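The same queries can also be submitted from the CLI rather than the Athena console. This is a sketch: the `OutputLocation` prefix below is only illustrative (the scripts do not create it), and the table is assumed to live in the `default` database.

```bash
# Submit the query and capture its execution ID
QUERY_ID=$(aws athena start-query-execution \
  --query-string "SELECT COUNT(*) FROM datafall;" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://tm-data-bucket-20251010/athena-results/ \
  --query 'QueryExecutionId' --output text)

# Poll until the state is SUCCEEDED, then fetch the results
aws athena get-query-execution --query-execution-id "$QUERY_ID" \
  --query 'QueryExecution.Status.State'
aws athena get-query-results --query-execution-id "$QUERY_ID"
```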
# Check application status
aws kinesisanalyticsv2 describe-application \
--application-name datahose-app \
--query 'ApplicationDetail.ApplicationStatus'
# Possible statuses: READY, STARTING, RUNNING, STOPPING, DELETING
# Tail logs in real-time
aws logs tail /aws/kinesis-analytics/datahose-app --follow
# Get last 100 log entries
aws logs tail /aws/kinesis-analytics/datahose-app --since 10m
# Filter for errors
aws logs filter-log-events \
--log-group-name /aws/kinesis-analytics/datahose-app \
--filter-pattern "ERROR"
# Count files in S3
aws s3 ls s3://tm-data-bucket-20251010/datafall/ --recursive | wc -l
# Show recent files
aws s3 ls s3://tm-data-bucket-20251010/datafall/ --recursive | tail -10
# Calculate total data size
aws s3 ls s3://tm-data-bucket-20251010/datafall/ --recursive --summarize | grep "Total Size"
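Beyond logs and S3 listings, the service publishes metrics to CloudWatch under the `AWS/KinesisAnalytics` namespace. A sketch of sampling KPU usage for the last hour (the `date -d` syntax is GNU date; adjust on macOS):

```bash
# Average KPU consumption over the last hour, in 5-minute buckets
aws cloudwatch get-metric-statistics \
  --namespace AWS/KinesisAnalytics \
  --metric-name KPUs \
  --dimensions Name=Application,Value=datahose-app \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```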
# Stop gracefully
aws kinesisanalyticsv2 stop-application \
--application-name datahose-app
# Force stop
aws kinesisanalyticsv2 stop-application \
--application-name datahose-app \
--force
# Start application (after stopping)
aws kinesisanalyticsv2 start-application \
--application-name datahose-app \
--run-configuration '{"ApplicationRestoreConfiguration":{"ApplicationRestoreType":"RESTORE_FROM_LATEST_SNAPSHOT"}}'
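Before restoring from a snapshot, you can confirm that snapshots exist for the application (this returns an empty list unless snapshots are enabled in the application configuration):

```bash
# List available snapshots for the application
aws kinesisanalyticsv2 list-application-snapshots \
  --application-name datahose-app
```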
# After modifying code, rebuild and redeploy
mvn clean package
./cicd.sh
# Checkpoints are managed by Managed Service for Apache Flink in service-owned storage;
# the application automatically restores from the latest checkpoint on restart.
Symptom: Status stuck in STARTING or transitions to RESTARTING
Diagnosis:
# Check recent logs
aws logs tail /aws/kinesis-analytics/datahose-app --since 10m | grep -i error
# Check application details
aws kinesisanalyticsv2 describe-application \
--application-name datahose-app
Common Causes:
- S3 Path Issue: Bucket name or path incorrect in source code
  - Fix: Update `StreamingApp.java` with the correct bucket name, then rebuild and redeploy
- IAM Permissions: Role doesn't have S3 write access
  - Check the attached policy: `aws iam list-attached-role-policies --role-name datahose-app-flink-role`
  - Verify S3 permissions are present
- JAR File Corrupt: Upload failed or build issue
  - Rebuild: `mvn clean package`
  - Re-upload: `./cicd.sh`
Symptom: Application running but no files in S3
Diagnosis:
# Check if data bucket exists
aws s3 ls s3://tm-data-bucket-20251010/
# Check application logs for errors
aws logs tail /aws/kinesis-analytics/datahose-app --since 30m | grep -i "s3\|error"
# Verify rolling policy timing
# Files appear after 30s inactivity OR 2min max interval
Possible Causes:
- Timing: Wait at least 2 minutes after start for first file
- Path Issue: Check application logs for S3 write errors
- Permissions: Verify IAM role has PutObject permission
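For the last cause, the role's effective S3 permissions can be simulated without redeploying anything; `FLINK_ROLE_ARN` comes from `/tmp/flink-config.env`:

```bash
# Check whether the Flink role is allowed to write into the data prefix
source /tmp/flink-config.env
aws iam simulate-principal-policy \
  --policy-source-arn "$FLINK_ROLE_ARN" \
  --action-names s3:PutObject \
  --resource-arns "arn:aws:s3:::tm-data-bucket-20251010/datafall/*" \
  --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
  --output table
```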
Symptom: Resources not found or "Access Denied" errors
Diagnosis:
# Check configured region
aws configure get region
# Should output: us-east-2
Fix:
# Set correct region
aws configure set region us-east-2
# Re-run scripts
./iac_create.sh
./cicd.sh
Symptom: Build fails with Java version errors
Diagnosis:
java -version
# Should show: openjdk version "11.x.x"
Fix:
# Use SDKMAN to set Java 11
sdk use java 11.0.28-amzn
# Or set JAVA_HOME manually
export JAVA_HOME=/path/to/java11
Symptom: `mvn clean package` fails
Diagnosis:
# Check Maven version
mvn --version
# Check pom.xml exists
ls -l pom.xml
Fix:
# Clean Maven cache
mvn clean
# Rebuild
mvn package -DskipTests
# If dependencies fail, update Maven
sdk install maven 3.9.11
Symptom: bash: ./script.sh: Permission denied
Fix:
# Make scripts executable
chmod +x *.sh
# Or run with bash
bash iac_create.sh
# Interactive mode with confirmation
./iac_destroy.sh
# Force mode (skip confirmation)
./iac_destroy.sh --force
This safely removes:
- Flink application
- S3 buckets (all objects including versions)
- IAM role and policy
- CloudWatch log group
If the destroy script fails:
# 1. Stop and delete Flink application
CREATE_TS=$(aws kinesisanalyticsv2 describe-application \
--application-name datahose-app \
--query 'ApplicationDetail.CreateTimestamp' --output text)
aws kinesisanalyticsv2 stop-application --application-name datahose-app --force
aws kinesisanalyticsv2 delete-application \
--application-name datahose-app \
--create-timestamp "$CREATE_TS"
# 2. Delete S3 buckets (including all versions)
aws s3 rb s3://tm-streaming-app-bucket-20251010 --force
aws s3 rb s3://tm-data-bucket-20251010 --force
# 3. Delete IAM resources
aws iam detach-role-policy \
--role-name datahose-app-flink-role \
--policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/datahose-app-flink-policy
aws iam delete-policy \
--policy-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):policy/datahose-app-flink-policy
aws iam delete-role --role-name datahose-app-flink-role
# 4. Delete CloudWatch log group
aws logs delete-log-group --log-group-name /aws/kinesis-analytics/datahose-app
# 5. Clean up local files
rm -f /tmp/flink-config.env
Estimated Monthly Cost (us-east-2):
| Service | Usage | Estimated Cost |
|---|---|---|
| Managed Flink | 1 KPU, 24/7 | ~$45/month |
| S3 Storage | ~100 GB/month (at 2 rec/sec) | ~$2.30/month |
| S3 Requests | PUT/GET operations | ~$0.50/month |
| CloudWatch Logs | 7-day retention, moderate logs | ~$2/month |
| Data Transfer | Minimal (S3 same-region) | ~$0.50/month |
| Total | | ~$50/month |
Cost Optimization Tips:
- Stop application when not needed:
aws kinesisanalyticsv2 stop-application --application-name datahose-app
- Reduce log retention: Modify `LOG_RETENTION_DAYS` in `iac_create.sh`
- Clean up old S3 data: Set up lifecycle policies (example below)
- Monitor via AWS Cost Explorer
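A minimal lifecycle rule for the clean-up tip above might look like the following; the 30-day and 7-day windows are illustrative values, not something the provided scripts configure:

```bash
# Expire streaming output under datafall/ after 30 days and prune old object versions
aws s3api put-bucket-lifecycle-configuration \
  --bucket tm-data-bucket-20251010 \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-datafall",
        "Filter": { "Prefix": "datafall/" },
        "Status": "Enabled",
        "Expiration": { "Days": 30 },
        "NoncurrentVersionExpiration": { "NoncurrentDays": 7 }
      }
    ]
  }'
```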
datahose-app/
├── src/
│ ├── main/
│ │ └── java/
│ │ └── org/
│ │ └── muralis/
│ │ └── datahose/
│ │ └── StreamingApp.java # Main application
│ └── test/
│ └── java/
│ └── org/
│ └── muralis/
│ └── datahose/
├── target/
│ └── datahose-app.jar # Built artifact (31 MB)
├── pom.xml # Maven configuration
├── iac_create.sh # Create infrastructure
├── iac_destroy.sh # Destroy infrastructure
├── cicd.sh # Build and deploy
├── verify.sh # Health check
├── setup-env.sh # Environment setup
└── README.md # This file
- Apache Flink 1.20 - Stream processing framework
- Java 11 - Programming language (SDKMAN: 11.0.28-amzn)
- Maven 3.x - Build tool
- AWS Managed Service for Apache Flink - Serverless Flink runtime
- Amazon S3 - Object storage for code and data
- AWS IAM - Identity and access management
- Amazon CloudWatch - Logging and monitoring
- AWS CLI - Infrastructure management
- Bash - Automation scripts
# Environment
./setup-env.sh # Initialize environment
source /tmp/flink-config.env # Load configuration
# Infrastructure
./iac_create.sh # Create all resources
./iac_destroy.sh # Destroy all resources
# Deployment
./cicd.sh # Build and deploy
./verify.sh # Health check
# Monitoring
aws kinesisanalyticsv2 describe-application --application-name datahose-app
aws logs tail /aws/kinesis-analytics/datahose-app --follow
aws s3 ls s3://tm-data-bucket-20251010/datafall/ --recursive
# Operations
aws kinesisanalyticsv2 stop-application --application-name datahose-app
aws kinesisanalyticsv2 start-application --application-name datahose-app
This project is for educational and demonstration purposes.
Last Updated: October 10, 2025
Version: 1.0
Region: us-east-2