aws-iceberg-s3tables

Overview

This project automates the creation and management of Amazon S3 Tables that integrate with AWS analytics services such as Amazon Athena, Amazon EMR, the AWS Glue Data Catalog, and AWS Lake Formation. Terraform provisions the infrastructure for a scalable, cost-effective data lake.

Documentation

Architecture

Prerequisites

The AWS CLI must be installed and configured with credentials. The commands below also use Terraform, curl, and jq.
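
The commands in this README call `aws`, `terraform`, `curl`, and `jq`. A minimal sketch for confirming they are on the PATH before starting; `check_tools` is a hypothetical helper for illustration and is not part of this repository.

```shell
# Report any required command-line tool that is not on PATH.
# Prints the missing tool names to stderr and returns non-zero if any are absent.
check_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "Missing tools:$missing" >&2
    return 1
  fi
  echo "All required tools found"
}

# Usage: check_tools aws terraform curl jq
```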

Installation

Replace the placeholder with your IAM user name and export it:

export TF_VAR_aws_user="?"

Create infrastructure:

terraform -chdir="./terraform" init -upgrade
terraform -chdir="./terraform" apply --auto-approve

Export the Terraform outputs as environment variables:

export EMR_APP_ID="$(terraform -chdir="./terraform" output -raw emr_app_id)"
export REGION="$(terraform -chdir="./terraform" output -raw region)"
export ACCOUNT_ID="$(terraform -chdir="./terraform" output -raw account_id)"
export BUCKET_AUX="$(terraform -chdir="./terraform" output -raw bucket_aux)"
export TABLEBUCKET_NAME="$(terraform -chdir="./terraform" output -raw table_bucket)"
export TABLEBUCKET_ARN="$(terraform -chdir="./terraform" output -raw table_bucket_arn)"
export NAMESPACE="$(terraform -chdir="./terraform" output -raw namespace)"
export TABLE="$(terraform -chdir="./terraform" output -raw table)"
export EMR_ROLE="$(terraform -chdir="./terraform" output -raw emr_role)"
export GROUP_CLOUDWATCH="$(terraform -chdir="./terraform" output -raw group_cloudwatch)"
export ATHENA_WORKGROUP="$(terraform -chdir="./terraform" output -raw athena_workgroup)"
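
Because `export VAR="$(command)"` does not stop the shell when the command fails, a missing Terraform output leaves the variable empty and the error only surfaces in a later step. A minimal sketch of an early check; `require_vars` is a hypothetical helper, not part of this repository.

```shell
# Fail early if any named variable is unset or empty.
# Checks every name so that all problems are reported in one pass.
require_vars() {
  status=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Variable $name is empty or unset" >&2
      status=1
    fi
  done
  return $status
}

# Usage:
# require_vars EMR_APP_ID REGION ACCOUNT_ID BUCKET_AUX TABLEBUCKET_NAME \
#   TABLEBUCKET_ARN NAMESPACE TABLE EMR_ROLE GROUP_CLOUDWATCH ATHENA_WORKGROUP
```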

Create the Glue Data Catalog with the AWS CLI (a Terraform resource for this has been requested in hashicorp/terraform-provider-aws#40725):

aws glue create-catalog \
--region $REGION \
--cli-input-json "{
  \"Name\": \"s3tablescatalog\",
  \"CatalogInput\": {
    \"FederatedCatalog\": {
      \"Identifier\": \"arn:aws:s3tables:${REGION}:${ACCOUNT_ID}:bucket/*\",
      \"ConnectionName\": \"aws:s3tables\"
    },
    \"CreateDatabaseDefaultPermissions\":[],
    \"CreateTableDefaultPermissions\":[]
  }
}"
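
The escaped quotes in the request body are easy to get wrong when editing. As an alternative sketch, the same JSON can be produced with a here-document, which still expands shell variables but needs no escaping; `catalog_input_json` is a hypothetical helper name.

```shell
# Print the create-catalog request body for a given region and account ID.
# The unquoted EOF delimiter lets the shell expand ${region} and ${account_id}.
catalog_input_json() {
  region="$1"
  account_id="$2"
  cat <<EOF
{
  "Name": "s3tablescatalog",
  "CatalogInput": {
    "FederatedCatalog": {
      "Identifier": "arn:aws:s3tables:${region}:${account_id}:bucket/*",
      "ConnectionName": "aws:s3tables"
    },
    "CreateDatabaseDefaultPermissions": [],
    "CreateTableDefaultPermissions": []
  }
}
EOF
}

# Usage:
# aws glue create-catalog --region "$REGION" \
#   --cli-input-json "$(catalog_input_json "$REGION" "$ACCOUNT_ID")"
```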

Grant Lake Formation permissions on the table with the AWS CLI (workaround for a Terraform bug: hashicorp/terraform-provider-aws#40724):

aws lakeformation grant-permissions \
--region $REGION \
--cli-input-json \
"{
    \"Principal\": {
        \"DataLakePrincipalIdentifier\": \"arn:aws:iam::${ACCOUNT_ID}:user/${TF_VAR_aws_user}\"
    },
    \"Resource\": {
        \"Table\": {
            \"CatalogId\": \"${ACCOUNT_ID}:s3tablescatalog/${TABLEBUCKET_NAME}\",
            \"DatabaseName\": \"${NAMESPACE}\",
            \"Name\": \"${TABLE}\"
        }
    },
    \"Permissions\": [
        \"ALL\"
    ]
}"

Upload the JAR and the PySpark script:

# Download the S3 Tables catalog runtime JAR and upload it
curl --create-dirs -O --output-dir ./libs/ https://repo1.maven.org/maven2/software/amazon/s3tables/s3-tables-catalog-for-iceberg-runtime/0.1.3/s3-tables-catalog-for-iceberg-runtime-0.1.3.jar
aws s3 cp ./libs/s3-tables-catalog-for-iceberg-runtime-0.1.3.jar s3://$BUCKET_AUX/libs/

# Upload the PySpark script
aws s3 cp ./src/main_emr.py s3://$BUCKET_AUX/src/

Start the EMR Serverless job:

emr_job=$(aws emr-serverless start-job-run \
--name pyspark-job \
--application-id $EMR_APP_ID \
--execution-role-arn $EMR_ROLE \
--job-driver "{
  \"sparkSubmit\": {
    \"entryPoint\": \"s3://${BUCKET_AUX}/src/main_emr.py\",
    \"entryPointArguments\": [
      \"--tablebucket_arn\", \"${TABLEBUCKET_ARN}\",
      \"--namespace\", \"${NAMESPACE}\",
      \"--table\", \"${TABLE}\"
    ],
    \"sparkSubmitParameters\": \"--jars s3://${BUCKET_AUX}/libs/s3-tables-catalog-for-iceberg-runtime-0.1.3.jar\"
  }
}" \
--configuration-overrides "{
  \"monitoringConfiguration\": {
    \"cloudWatchLoggingConfiguration\": {
      \"enabled\": true,
      \"logGroupName\": \"$GROUP_CLOUDWATCH\",
      \"logStreamNamePrefix\": \"pyspark-job\"
    }
  }
}" \
--region $REGION | jq -r '.jobRunId')

# Check the job state; repeat until it reports "SUCCESS"
aws emr-serverless get-job-run \
--application-id $EMR_APP_ID \
--job-run-id $emr_job \
--region $REGION  | jq -r '.jobRun.state'
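
Rather than re-running the status command by hand, the check can be wrapped in a polling loop. A minimal sketch, assuming a 15-second default interval and using the AWS CLI's `--query` and `--output text` options in place of `jq`; `wait_for_job` is a hypothetical helper name. It treats `SUCCESS` as success and `FAILED` or `CANCELLED` as failure.

```shell
# Poll an EMR Serverless job run until it reaches a terminal state.
# Arguments: application ID, job run ID, region, optional poll interval (seconds).
# Returns 0 on SUCCESS, 1 on FAILED or CANCELLED, 2 if the AWS CLI call fails.
wait_for_job() {
  app_id="$1"
  job_id="$2"
  region="$3"
  interval="${4:-15}"
  while true; do
    state=$(aws emr-serverless get-job-run \
      --application-id "$app_id" \
      --job-run-id "$job_id" \
      --region "$region" \
      --query 'jobRun.state' --output text) || return 2
    echo "Job state: $state"
    case "$state" in
      SUCCESS) return 0 ;;
      FAILED|CANCELLED) return 1 ;;
    esac
    sleep "$interval"
  done
}

# Usage: wait_for_job "$EMR_APP_ID" "$emr_job" "$REGION"
```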

Query the table with Athena:

query_id=$(aws athena start-query-execution \
--query-string "SELECT * FROM \"s3tablescatalog/${TABLEBUCKET_NAME}\".\"${NAMESPACE}\".\"${TABLE}\" LIMIT 10" \
--work-group $ATHENA_WORKGROUP \
--region $REGION | jq -r '.QueryExecutionId')

# Fetch the results once the query has finished
aws athena get-query-results \
--query-execution-id $query_id \
--region $REGION | jq -r '.ResultSet.Rows[] | [.Data[].VarCharValue] | @tsv'
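
`start-query-execution` returns as soon as the query is submitted, so fetching results immediately can fail while the query is still queued or running. A minimal sketch of a wait step using `aws athena get-query-execution`; `wait_for_query` is a hypothetical helper name and the 2-second default interval is an assumption.

```shell
# Poll an Athena query until it reaches a terminal state.
# Arguments: query execution ID, region, optional poll interval (seconds).
# Returns 0 on SUCCEEDED, 1 on FAILED or CANCELLED, 2 if the AWS CLI call fails.
wait_for_query() {
  qid="$1"
  region="$2"
  interval="${3:-2}"
  while true; do
    state=$(aws athena get-query-execution \
      --query-execution-id "$qid" \
      --region "$region" \
      --query 'QueryExecution.Status.State' --output text) || return 2
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED|CANCELLED) echo "Query ended in state $state" >&2; return 1 ;;
    esac
    sleep "$interval"
  done
}

# Usage: wait_for_query "$query_id" "$REGION" && aws athena get-query-results ...
```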

Clean resources

aws glue delete-catalog \
--region $REGION \
--catalog-id s3tablescatalog

aws emr-serverless stop-application \
--application-id $EMR_APP_ID \
--region $REGION

terraform -chdir="./terraform" destroy --auto-approve
