This project automates the creation and management of S3Tables that integrate with AWS analytics services such as Amazon Athena, EMR, AWS Glue Data Catalog, and AWS Lake Formation. Using Terraform, the infrastructure is provisioned to build a scalable, cost-effective data lake solution.
The AWS CLI must be set up and credentials configured
- Replace and export with your correct value:
export TF_VAR_aws_user="?"
- Create infrastructure:
terraform -chdir="./terraform" init -upgrade
terraform -chdir="./terraform" apply --auto-approve
- Get the output variables:
export EMR_APP_ID="$(terraform -chdir="./terraform" output -raw emr_app_id)"
export REGION="$(terraform -chdir="./terraform" output -raw region)"
export ACCOUNT_ID="$(terraform -chdir="./terraform" output -raw account_id)"
export BUCKET_AUX="$(terraform -chdir="./terraform" output -raw bucket_aux)"
export TABLEBUCKET_NAME="$(terraform -chdir="./terraform" output -raw table_bucket)"
export TABLEBUCKET_ARN="$(terraform -chdir="./terraform" output -raw table_bucket_arn)"
export NAMESPACE="$(terraform -chdir="./terraform" output -raw namespace)"
export TABLE="$(terraform -chdir="./terraform" output -raw table)"
export EMR_ROLE="$(terraform -chdir="./terraform" output -raw emr_role)"
export GROUP_CLOUDWATCH="$(terraform -chdir="./terraform" output -raw group_cloudwatch)"
export ATHENA_WORKGROUP="$(terraform -chdir="./terraform" output -raw athena_workgroup)"
- Create Glue Data Catalog (New terraform resource requested hashicorp/terraform-provider-aws#40725):
aws glue create-catalog \
--region $REGION \
--cli-input-json "{
\"Name\": \"s3tablescatalog\",
\"CatalogInput\": {
\"FederatedCatalog\": {
\"Identifier\": \"arn:aws:s3tables:${REGION}:${ACCOUNT_ID}:bucket/*\",
\"ConnectionName\": \"aws:s3tables\"
},
\"CreateDatabaseDefaultPermissions\":[],
\"CreateTableDefaultPermissions\":[]
}
}"
- Give permission (Terraform bug: hashicorp/terraform-provider-aws#40724):
aws lakeformation grant-permissions \
--region $REGION \
--cli-input-json \
"{
\"Principal\": {
\"DataLakePrincipalIdentifier\": \"arn:aws:iam::${ACCOUNT_ID}:user/${TF_VAR_aws_user}\"
},
\"Resource\": {
\"Table\": {
\"CatalogId\": \"${ACCOUNT_ID}:s3tablescatalog/${TABLEBUCKET_NAME}\",
\"DatabaseName\": \"${NAMESPACE}\",
\"Name\": \"${TABLE}\"
}
},
\"Permissions\": [
\"ALL\"
]
}"
- Upload code and jar:
#Download and upload jar
curl --create-dirs -O --output-dir ./libs/ https://repo1.maven.org/maven2/software/amazon/s3tables/s3-tables-catalog-for-iceberg-runtime/0.1.3/s3-tables-catalog-for-iceberg-runtime-0.1.3.jar
aws s3 cp ./libs/s3-tables-catalog-for-iceberg-runtime-0.1.3.jar s3://$BUCKET_AUX/libs/
#Upload pyspark scripts
aws s3 cp ./src/main_emr.py s3://$BUCKET_AUX/src/
- EMR serverless start job:
emr_job=$(aws emr-serverless start-job-run \
--name pyspark-job \
--application-id $EMR_APP_ID \
--execution-role-arn $EMR_ROLE \
--job-driver "{
\"sparkSubmit\": {
\"entryPoint\": \"s3://${BUCKET_AUX}/src/main_emr.py\",
\"entryPointArguments\": [
\"--tablebucket_arn\", \"${TABLEBUCKET_ARN}\",
\"--namespace\", \"${NAMESPACE}\",
\"--table\", \"${TABLE}\"
],
\"sparkSubmitParameters\": \"--jars s3://${BUCKET_AUX}/libs/s3-tables-catalog-for-iceberg-runtime-0.1.3.jar\"
}
}" \
--configuration-overrides "{
\"monitoringConfiguration\": {
\"cloudWatchLoggingConfiguration\": {
\"enabled\": true,
\"logGroupName\": \"$GROUP_CLOUDWATCH\",
\"logStreamNamePrefix\": \"pyspark-job\"
}
}
}" \
--region $REGION | jq -r '.jobRunId')
#Get status, wait until "SUCCESS"
aws emr-serverless get-job-run \
--application-id $EMR_APP_ID \
--job-run-id $emr_job \
--region $REGION | jq -r '.jobRun.state'
- Query with Athena:
query_id=$(aws athena start-query-execution \
--query-string "SELECT * FROM \"s3tablescatalog/${TABLEBUCKET_NAME}\".\"${NAMESPACE}\".\"${TABLE}\" LIMIT 10" \
--work-group $ATHENA_WORKGROUP \
--region $REGION | jq -r '.QueryExecutionId')
#Results
aws athena get-query-results \
--query-execution-id $query_id \
--region $REGION | jq -r '.ResultSet.Rows[] | [.Data[].VarCharValue] | @tsv'
aws glue delete-catalog \
--region $REGION \
--catalog-id s3tablescatalog
aws emr-serverless stop-application \
--application-id $EMR_APP_ID \
--region $REGION
terraform -chdir="./terraform" destroy --auto-approve