Demo project for a sample data analytics solution showing various integrations.
This repo contains a sample project for a data analytics solution that pulls data from various sources.
Sources:
- API (to mimic streaming data)
- Flat files (XML/JSON/CSV)
ETL:
- Serverless functions: Lambda to query the API and save the results to AWS (DynamoDB or S3); see the ingest sketch after this overview
- Spark using Glue
End product:
- Query from S3 with Athena (Presto) + visuals in QuickSight
- ML pipeline with the model served as an endpoint (API)
- Data-as-a-service API
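
A minimal sketch of the ingest Lambda from the ETL step above, assuming the source API URL and the target bucket come from environment variables (the names and defaults below are placeholders):

```python
import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")
API_URL = os.environ.get("SOURCE_API_URL", "https://example.com/data")  # placeholder source API
BUCKET = os.environ.get("RAW_BUCKET", "demo-raw-data")                  # placeholder bucket name


def handler(event, context):
    # Pull one batch from the source API (mimicking a streaming poll).
    with urllib.request.urlopen(API_URL) as response:
        payload = json.loads(response.read())

    # Land the raw payload in S3, keyed by the Lambda request id.
    key = f"raw/{context.aws_request_id}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode("utf-8"))
    return {"saved_to": f"s3://{BUCKET}/{key}"}
```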
Project description - build-out:
- 01: Infrastructure as Code (IaC) project bootstrap with Terraform
- 02: Set up a Terraform CI/CD pipeline with GitHub Actions (GHA), using OpenID Connect for authentication from GHA to AWS
- 03: Intro to standardized formatting -> pre-commit and standard styling/formatting for Terraform, integrated into the GHA pipeline
- 04: CodeArtifact setup and why you would want it => a pipeline packaging your code into your private (pip) artifact repository (CodeBuild or GHA)
- 05: Python formatting with Black/Flake8 (PEP 8) via pre-commit, plus unit tests (pytest), integrated with the pipeline from 04.
- 06: Pattern for variable load using a queue - one Lambda writes to SQS, another Lambda pulls asynchronously from SQS and saves to DynamoDB (a 3-tier web architecture example with separate scaling; see the queue sketch after this list)
- 07: Streaming - Spark Streaming, Kafka, or Kinesis? (TBD)
- 08: Data Catalog and crawler (S3/DynamoDB and RDS); flat-file CSV loaded into a Postgres database (DMS => RDS) (or a graph with Neptune?)
- 09: Set up 3 Spark processing jobs with Glue that pre-process data from DynamoDB, S3 and RDS/Neptune => save to S3 (see the Glue job sketch after this list). Perhaps also a Lambda to orchestrate various things in the next step.
- 10a: Orchestration of these 3 jobs with Airflow => deploy Airflow with managed Airflow on AWS (MWAA), and set up a pipeline and GHA deployment towards it (see the DAG sketch after this list)
- 10b: Same as 10a but with Step Functions?
- 11: Data testing (Great Expectations) - visual reports and/or included in the pipeline?
- 12: Load towards Redshift/Snowflake? Add to the orchestration
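
A minimal sketch of the queue pattern from step 06, assuming a producer Lambda that writes to SQS and a consumer Lambda triggered by the SQS event source mapping that writes to DynamoDB; the queue URL, table name, and event shape are placeholders:

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")

QUEUE_URL = os.environ.get("QUEUE_URL", "https://sqs.eu-west-1.amazonaws.com/123456789012/demo-load")  # placeholder
TABLE = dynamodb.Table(os.environ.get("TABLE_NAME", "demo-items"))  # placeholder table name


def producer_handler(event, context):
    # Push incoming records onto the queue so the write side can scale independently.
    records = event.get("records", [])
    for record in records:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(record))
    return {"queued": len(records)}


def consumer_handler(event, context):
    # Triggered by SQS; each record body is one queued item.
    for record in event["Records"]:
        item = json.loads(record["body"])
        TABLE.put_item(Item=item)  # numeric values may need conversion to Decimal
    return {"written": len(event["Records"])}
```

A minimal sketch of one of the Glue jobs from step 09, assuming the source table is already registered in the Glue Data Catalog (the database, table, column, and output path below are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a catalogued source (e.g. a table created by the crawler in step 08).
source = glue_context.create_dynamic_frame.from_catalog(
    database="demo_catalog", table_name="demo_source"
)

# Example pre-processing: drop rows with a missing id, then write Parquet to S3.
cleaned = source.toDF().dropna(subset=["id"])
cleaned.write.mode("overwrite").parquet("s3://demo-processed-bucket/demo_source/")

job.commit()
```

A minimal sketch of the orchestration DAG from step 10a, assuming Airflow 2.x on MWAA with the Amazon provider installed; the DAG id and Glue job names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="demo_glue_orchestration",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # One task per Glue job from step 09 (placeholder job names).
    dynamo_job = GlueJobOperator(task_id="preprocess_dynamo", job_name="demo-preprocess-dynamo")
    s3_job = GlueJobOperator(task_id="preprocess_s3", job_name="demo-preprocess-s3")
    rds_job = GlueJobOperator(task_id="preprocess_rds", job_name="demo-preprocess-rds")

    # Run the three pre-processing jobs in sequence (they could also run in parallel).
    dynamo_job >> s3_job >> rds_job
```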
Extension A (ML application):
- 01: Feature store setup and pre-processing job (Spark?)
- 02: Model training pipeline (SageMaker Pipelines), saving to a model registry, etc. (MLflow or standard SageMaker?)
- 03: Model hosting on an endpoint, or something else? (See the invocation sketch after this list.)
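
A minimal sketch of calling a hosted model endpoint from Extension A, step 03, assuming a deployed SageMaker real-time endpoint that accepts JSON; the endpoint name and payload shape are placeholders and depend on the serving container:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")


def predict(features, endpoint_name="demo-model-endpoint"):  # placeholder endpoint name
    # Invoke the real-time endpoint; the expected request format depends on the model container.
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"instances": [features]}),
    )
    return json.loads(response["Body"].read())
```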
Extension B (Dashboard):
- 01: Query data in S3 with Athena and load it into QuickSight (see the query sketch after this list)
- 02: Data into Amazon OpenSearch (Elastic stack)
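
A minimal sketch of querying the S3 data with Athena via boto3 (Extension B, step 01); the database, output location, and table name are placeholders, and QuickSight would instead point a dataset at the same Athena table:

```python
import time

import boto3

athena = boto3.client("athena")


def run_query(sql, database="demo_catalog", output="s3://demo-athena-results/"):  # placeholders
    execution_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]

    # Poll until the query finishes, then fetch the first page of results.
    while True:
        status = athena.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=execution_id)


# Example usage against a table produced by the Glue jobs above (placeholder table name):
# results = run_query("SELECT id, COUNT(*) AS n FROM demo_source GROUP BY id")
```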
Extension C (Data as a Service):
- 01: Make data available through an API (API Gateway with some backend (Lambda or ..?)); see the handler sketch below
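
A minimal sketch of the data-as-a-service backend, assuming an API Gateway Lambda proxy integration in front of a DynamoDB table; the table name and key schema are placeholders:

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("demo-analytics-results")  # placeholder table name


def handler(event, context):
    # With the Lambda proxy integration, query string parameters arrive on the event.
    params = event.get("queryStringParameters") or {}
    item_id = params.get("id")
    if not item_id:
        return {"statusCode": 400, "body": json.dumps({"error": "missing 'id' parameter"})}

    result = table.get_item(Key={"id": item_id})
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # default=str handles DynamoDB Decimal values in the response.
        "body": json.dumps(result.get("Item", {}), default=str),
    }
```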