Event-driven serverless ETL in AWS

Contextual overview

The management of an electronic toll-collection company wants to consolidate toll plaza billing data in a data warehouse that always holds the most up-to-date transactions.

Architecture diagram

(See the architecture diagram image included in the repository.)

Project objectives

1. To collect the raw data, we will use an Amazon S3 landing bucket.
2. To create a data catalog, an AWS Glue crawler will crawl the S3 bucket JSON files.
3. To receive a notification whenever a PUT event happens, we will configure an Amazon S3 Event Notification that sends a message to an AWS Lambda function.
4. To ingest the data from the data lake into the data warehouse, we will use an AWS Glue workflow, invoked by that Lambda function, which launches a Glue ETL job (a minimal sketch of the function follows this list).
5. To access the data, end users can use the built-in Amazon Redshift query editor, a SQL client tool, or the Redshift Data API.
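
As a rough illustration of objectives 3 and 4, the Lambda function behind the S3 notification only needs to start the Glue workflow. Below is a minimal, hedged sketch of what such a "start_workflow_function" could look like; it assumes the workflow is named redshift_workflow (as created later in this guide), and the actual boto3 script in this repository may differ.

```python
# start_workflow_function (sketch) - invoked by the S3 event notification.
# Assumption: the Glue workflow is named "redshift_workflow".
import json
import boto3

glue = boto3.client("glue")
WORKFLOW_NAME = "redshift_workflow"


def lambda_handler(event, context):
    # The S3 notification delivers one record per PUT; the object keys are only logged here.
    keys = [record["s3"]["object"]["key"] for record in event.get("Records", [])]
    run = glue.start_workflow_run(Name=WORKFLOW_NAME)
    return {
        "statusCode": 200,
        "body": json.dumps({"workflow_run_id": run["RunId"], "objects": keys}),
    }
```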

Reproducibility guidelines

Required setup
1. Download the query included in the "create_table.txt" file.
2. In S3, create a landing bucket and a staging bucket.
3. Create several Lambda functions using the boto3 scripts I uploaded above.
4. Upload the "sample_data_toll_application.json" file manually to your landing bucket, or run the "toll_plaza_application.py" Lambda function to generate random toll data and send it to your landing bucket automatically (a hedged sketch of such a generator follows this list).
5. Create a Redshift cluster with the following configurations:
- Node type: dc2.large
- Number of nodes: 1
6. Create a secret in AWS Secrets Manager named tollclusterSecret and associate it with the Redshift cluster.
7. Create a VPC with the following configurations:
- IPv4 CIDR block: 10.0.0.0/16.
- Tenancy: default.
- No IPv6.
- 4 subnets and 5 route tables across two availability zones in the same region.
- DNS options: enable DNS hostnames and enable DNS resolution.
8. Create a Redshift security group with the following configurations:
- Inbound rule type: Redshift / source: custom 0.0.0.0/0 / description: Allow access to TCP port 5439.
- Inbound rule type: All traffic / source: custom sg-00d547bf7a5580bab / description: Allow access to self SG.
- Outbound rule type: All traffic / destination: custom 0.0.0.0/0 / description: Allow all outbound traffic by default.
9. Create an IAM service role for AWS Glue named IAMGlueServiceRole.
10. Create an S3 crawler and an associated workflow that uses it.
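
For step 4, a hedged sketch of a toll-data generator is shown below. The bucket name, the number of records, and every field other than transaction_id and transaction_amount (which the ETL job later maps to bigint and double) are assumptions; the actual toll_plaza_application.py in this repository may differ.

```python
# toll_plaza_application (sketch) - generates random toll records and uploads them
# to the landing bucket as a JSON file. Bucket name, key pattern and the plaza_id /
# transaction_time fields are assumptions for illustration only.
import json
import random
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
LANDING_BUCKET = "your-landing-bucket"  # replace with the bucket created in step 2


def lambda_handler(event, context):
    records = [
        {
            "transaction_id": random.randint(1, 10_000_000),            # bigint in Redshift
            "transaction_amount": round(random.uniform(1.0, 25.0), 2),  # double in Redshift
            "plaza_id": random.randint(1, 50),
            "transaction_time": datetime.now(timezone.utc).isoformat(),
        }
        for _ in range(100)
    ]
    key = f"toll_data_{uuid.uuid4()}.json"
    # JSON Lines keeps each record on its own line, which the Glue crawler classifies easily.
    s3.put_object(Bucket=LANDING_BUCKET, Key=key,
                  Body="\n".join(json.dumps(r) for r in records))
    return {"statusCode": 200, "body": f"uploaded s3://{LANDING_BUCKET}/{key}"}
```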
Configure an Amazon S3 event notification to invoke an AWS Lambda function
1. Click on your S3 landing bucket and go to the Properties tab.
2. Scroll down to Event Notifications and click on "create event notification".
3. Use the following configurations:
- Name: s3Events
- Suffix: .json
- Under Object Creation, choose Put.
- Under Destination, choose Lambda function.
- Select the "start_workflow_function" from the dropdown menu.
- Save the changes to finish this step (a boto3 equivalent of this configuration is sketched below).
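
The same notification can also be set up with boto3; the sketch below mirrors the console configuration above. The bucket name, region and account ID are placeholders, and S3 must additionally be granted permission to invoke the function (the console adds this permission automatically).

```python
# Sketch: configure the s3Events notification with boto3 instead of the console.
# Placeholders: bucket name, region, account ID.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="your-landing-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "s3Events",
                "LambdaFunctionArn": "arn:aws:lambda:<region>:<account-id>:function:start_workflow_function",
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".json"}]}},
            }
        ]
    },
)
```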
Create an AWS Glue crawler to read schema from Amazon Redshift
1. Open Redshift and select the cluster you created earlier.
2. Copy the JDBC URL of your Redshift cluster.
3. Open the query editor inside your Redshift cluster.
4. Click "Connect to Database" and use the following configurations:
- Under Connection, choose "Create new connection"
- Under Authentication, choose Temporary credentials
- Under Cluster, select the one you created for this project
- Under Database name, type "toll_db"
- Under Database user, select "admin".
- Click connect and finish this step.
5. Under Resources on the left, make sure "toll_db" and "public" are selected.
6. Open the "create_table.txt" file and paste the content of the query to the query editor on the right.
7. Run the query and confirm that the table was created by checking the resource tree on the left.
8. Navigate to the AWS Secrets Manager, click Secrets on the left and then click on the secret you created earlier.
9. Under Secret Value, click on "Retrieve secret value".
10. Copy the password and keep it available in a text editor.
11. Navigate to the AWS VPC console, click on the VPC you created earlier and copy the ID of one of the subnets whose name starts with "database".
12. Navigate to AWS Glue and under Data Connections, create a connection with the following configurations:
- Name: redshift_conn.
- Connection type: JDBC.
- JDBC URL: the JDBC URL you copied from the Redshift cluster page.
- Credential type: Username and password.
- Username: admin.
- Password: the password you copied earlier.
- Expand Network options; under VPC, choose the VPC you created earlier.
- Subnets: choose the subnet ID you copied earlier.
- Click Create connection to finish this step.
13. In AWS Glue, under the Data Catalog section, select Crawlers.
14. Click Create Crawler and input the following configurations:
- Name: Redshift-Crawler.
- Data source configuration: choose "Not yet" (the data is not yet mapped to Glue tables).
- Click Add a data source and input the following configurations:
- Data Source: JDBC.
- Connection: redshift_conn.
- Include path: toll_db/public/%.
- Click add a JDBC source.
- Click Next.
- Choose the IAMGlueServiceRole.
- Click Next.
- Under Output Configuration, choose toll-raw-db as the target database.
- Under Crawler schedule, within Frequency choose On-demand.
- Click Next and then Create Crawler.
15. Refresh the crawler page and then run the crawler.
16. Once the status of the crawler run displays "completed", click on Data Catalog / Databases / Tables.
17. Review the two available tables and their respective classifications to finish this step (a boto3 sketch of the connection and crawler setup follows).
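
The connection and crawler above can also be created with boto3. In the sketch below, the JDBC URL, password, subnet, security group and availability zone are placeholders you would fill with the values copied in the previous steps.

```python
# Sketch: create the redshift_conn connection and the Redshift-Crawler with boto3.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "redshift_conn",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://<cluster-endpoint>:5439/toll_db",
            "USERNAME": "admin",
            "PASSWORD": "<password from tollclusterSecret>",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "<database-subnet-id>",
            "SecurityGroupIdList": ["<redshift-security-group-id>"],
            "AvailabilityZone": "<subnet-availability-zone>",
        },
    }
)

glue.create_crawler(
    Name="Redshift-Crawler",
    Role="IAMGlueServiceRole",
    DatabaseName="toll-raw-db",
    Targets={"JdbcTargets": [{"ConnectionName": "redshift_conn", "Path": "toll_db/public/%"}]},
)

# No schedule is set, so the crawler runs on demand.
glue.start_crawler(Name="Redshift-Crawler")
```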
Create an AWS Glue ETL job to move data from an S3 bucket to Amazon Redshift
1. Click ETL Jobs.
2. Use the following configurations to create the job and then click create:
- Visual with a source and target.
- Source: S3.
- Target: Amazon Redshift.
3. On the visual editor, click the S3 step and input the following:
- S3 source type: Data Catalog table.
- Database: toll-raw-db.
- Table: the one with the prefix "landing_bucket".
4. Under the second step (the transform) in the visual editor, edit the schema mapping:
- Choose bigint as the data type for transaction_id and double for transaction_amount.
5. Under the last step in the visual editor, choose the following:
- Redshift access type: Glue Data Catalog tables.
- Database: toll-raw-db.
- Table: toll_db_public_toll_table.
- Expand Performance and Security, click Browse S3 and choose the staging bucket.
6. Under Job Details:
- Name: s3_to_redshift_job.
- IAM Role: IAMGlueServiceRole.
- Requested number of workers: 3.
- Generate job insights checked.
- Number of retries: 1.
- Job timeout: 15 minutes.
- Click Save to finish this step (a boto3 equivalent of these job settings is sketched below).
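
A hedged boto3 equivalent of these job settings is shown below. The console's visual editor normally generates and stores the ETL script for you; the script location, Glue version and worker type here are assumptions.

```python
# Sketch: create s3_to_redshift_job with boto3 (the ScriptLocation is a placeholder
# for the script the visual editor would generate).
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="s3_to_redshift_job",
    Role="IAMGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<your-glue-assets-bucket>/scripts/s3_to_redshift_job.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",   # assumption
    WorkerType="G.1X",   # assumption
    NumberOfWorkers=3,
    MaxRetries=1,
    Timeout=15,          # minutes
)
```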
Create an AWS Glue workflow to crawl raw data from the S3 bucket and then run the AWS Glue ETL job
1. Under Data Integration and ETL on the AWS Glue main page, click Workflows (orchestration).
2. Create a new workflow with the following configurations:
- Name: redshift_workflow.
- Max concurrency: 2.
3. On the newly created workflow page, click add trigger:
- Choose add new.
- Name: redshift-workflows-start.
- Trigger type: on-demand.
4. On the workflow canvas, click add node:
- Under crawlers, select s3_crawler.
5. Click the s3_crawler node and add a trigger:
- Under add new, input the following name: s3-crawler-event.
- Trigger type: event.
- Trigger logic: start after ANY watched event.
6. On the workflow canvas, click on add node to the right:
- Under Jobs, select s3_to_redshift_job.
7. Review the flow: the S3 event notification starts the workflow, the workflow's first trigger runs s3_crawler, and upon its completion the s3-crawler-event trigger runs s3_to_redshift_job.
8. Go to AWS Lambda, click on the toll_plaza_application function.
9. Configure a test event:
- Event name: TestEvent and then click save.
10. Click Test and review the results to confirm that the randomly generated data was uploaded successfully.
11. Return to the Glue workflow page and wait until the workflow finishes running to complete this step (a boto3 sketch of the workflow and its triggers follows).
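
The workflow wiring above can also be expressed in boto3, which makes the trigger logic explicit. Note that the console's "event" trigger corresponds to a CONDITIONAL trigger in the API; the names below match the ones used in this guide.

```python
# Sketch: create the workflow and its two triggers with boto3.
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="redshift_workflow", MaxConcurrentRuns=2)

# On-demand trigger: runs the S3 crawler when a workflow run is started
# (for example, by start_workflow_function).
glue.create_trigger(
    Name="redshift-workflows-start",
    WorkflowName="redshift_workflow",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "s3_crawler"}],
)

# "Event" trigger in the console = CONDITIONAL in the API: run the ETL job
# once the crawler succeeds.
glue.create_trigger(
    Name="s3-crawler-event",
    WorkflowName="redshift_workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "ANY",
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "s3_crawler",
                "CrawlState": "SUCCEEDED",
            }
        ],
    },
    Actions=[{"JobName": "s3_to_redshift_job"}],
)
```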
Query the data using the Amazon Redshift query editor
1. Navigate to the query editor page in your Redshift cluster and click Connect to database, using the recent connection.
2. Click the ellipsis to the right of toll_table and select the Preview data option.
3. You can now query the data in Redshift however you wish, for example through the Data API as sketched below.
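
For the Data API route mentioned in the project objectives, a hedged example follows; the cluster identifier and secret ARN are placeholders, and the aggregate query simply assumes the transaction_amount column defined by create_table.txt.

```python
# Sketch: query toll_table through the Redshift Data API.
import time
import boto3

rsd = boto3.client("redshift-data")

resp = rsd.execute_statement(
    ClusterIdentifier="<your-cluster-identifier>",
    Database="toll_db",
    SecretArn="arn:aws:secretsmanager:<region>:<account-id>:secret:tollclusterSecret-<suffix>",
    Sql="SELECT COUNT(*) AS transactions, SUM(transaction_amount) AS revenue FROM public.toll_table;",
)

# Poll until the statement finishes, then fetch the result set.
while True:
    status = rsd.describe_statement(Id=resp["Id"])["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if status == "FINISHED":
    print(rsd.get_statement_result(Id=resp["Id"])["Records"])
```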
