This project builds a data lakehouse solution for the STEDI team, using sensor data to train a machine learning model. The core device is the STEDI Step Trainer, a hardware device that helps users practice balance exercises while collecting sensor data for machine learning.
The STEDI Step Trainer, paired with a mobile app, serves dual purposes:
- Training: Assists users in balance exercises.
- Data Collection: Gathers step data through sensors for a machine learning algorithm.
A significant number of early adopters are already using the Step Trainer and mobile app, providing a rich dataset for analysis.
- The Step Trainer uses motion sensors to record the distance of the detected object.
- The mobile app uses a phone's accelerometer to detect motion in the X, Y, and Z directions.
- Data privacy is paramount. Only data from customers who have consented to share it for research should be used for the machine learning model.
- Extract data from both the Step Trainer sensors and mobile app.
- Curate the data into an AWS-based data lakehouse.
- Enable Data Scientists to train a machine learning model.
- Anonymize the final curated table so that it is not subject to GDPR or other privacy regulations.
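
One simple way to approach the anonymization objective is to drop directly identifying columns before the curated table is published. The sketch below is plain Python rather than the actual Glue job, and the field names in `PII_FIELDS` are assumptions about the schema, not taken from this project.

```python
# Hypothetical set of identifying columns; adjust to the real schema.
PII_FIELDS = frozenset({"user", "email", "customerName", "phone", "birthDay"})


def anonymize(rows, pii_fields=PII_FIELDS):
    """Return copies of the rows with identifying fields removed."""
    return [
        {key: value for key, value in row.items() if key not in pii_fields}
        for row in rows
    ]


# Illustrative row only; not real STEDI data.
sample_rows = [
    {"user": "a@example.com", "timeStamp": 1000, "x": 1.0, "distanceFromObject": 250},
]
anonymized_rows = anonymize(sample_rows)
```

The input rows are left unchanged, so the same helper can be applied as a last step without affecting upstream tables.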
The lakehouse solution is developed using:
- Python and Spark for processing.
- AWS Glue for data integration.
- AWS Athena for interactive querying.
- AWS S3 for data storage.
STEDI provides three JSON data sources:
- `customer`: data from fulfillment and the STEDI website
- `step_trainer`: records from the motion sensor
- `accelerometer`: data from the mobile app
1. Ingested data from the data source repository into a landing zone S3 bucket using AWS CloudShell.
- `customer_landing` table: The `customer_landing` data contains multiple rows with a blank `shareWithResearchAsOfDate`.
- customer_landing_to_trusted.py has a node that drops rows that do not have data in the `shareWithResearchAsOfDate` column.
- accelerometer_landing.py has a node that inner joins the `customer_trusted` data with the `accelerometer_landing` data by emails.
- step_trainer_landing.py has a node that inner joins the `step_trainer_landing` data with the `customer_curated` data by serial numbers.
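
The consent filter and the email join described above can be sketched in plain Python. This is not the Glue job itself, only an illustration of the logic; the key names `user` and `email`, the sample rows, and the choice to keep only accelerometer columns after the join are assumptions.

```python
def has_research_consent(customer):
    """True when the record carries a non-blank consent date."""
    value = customer.get("shareWithResearchAsOfDate")
    return value is not None and str(value).strip() != ""


def inner_join(left, right, left_key, right_key):
    """Inner join two lists of dicts; right-hand fields win on name clashes."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[left_key], [])
    ]


# Illustrative rows only; not real STEDI data.
customer_landing = [
    {"email": "a@example.com", "serialNumber": "SN-1",
     "shareWithResearchAsOfDate": 1655296193599},
    {"email": "b@example.com", "serialNumber": "SN-2",
     "shareWithResearchAsOfDate": None},
]
accelerometer_landing = [
    {"user": "a@example.com", "timeStamp": 1000, "x": 1.0, "y": 0.0, "z": -1.0},
    {"user": "b@example.com", "timeStamp": 1000, "x": 0.0, "y": 1.0, "z": 0.0},
]

# Landing to trusted: keep only customers who consented to research use.
customer_trusted = [c for c in customer_landing if has_research_consent(c)]

# Keep accelerometer readings that belong to a consenting customer.
joined = inner_join(accelerometer_landing, customer_trusted, "user", "email")
ACCELEROMETER_FIELDS = ("user", "timeStamp", "x", "y", "z")
accelerometer_trusted = [
    {field: row[field] for field in ACCELEROMETER_FIELDS} for row in joined
]
```

The step trainer join by serial number follows the same pattern, with the serial number field as the key on both sides.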
customer_trusted
accelerometer_trusted
step_trainer_trusted
- Record count check for each table.
- `customer_trusted` data has no rows where `shareWithResearchAsOfDate` is blank.
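
However the checks are actually run (for example as Athena queries), their logic amounts to counting rows per table and counting blank consent values. A minimal plain-Python sketch, with made-up sample rows and assumed field names:

```python
def count_blank(rows, field):
    """Count rows where the field is missing, None, or an empty string."""
    return sum(1 for row in rows if row.get(field) in (None, ""))


# Illustrative rows only; real counts come from the trusted-zone tables.
trusted_tables = {
    "customer_trusted": [
        {"email": "a@example.com", "shareWithResearchAsOfDate": 1655296193599},
    ],
    "accelerometer_trusted": [
        {"user": "a@example.com", "timeStamp": 1000},
        {"user": "a@example.com", "timeStamp": 2000},
    ],
    "step_trainer_trusted": [
        {"serialNumber": "SN-1", "sensorReadingTime": 1000},
    ],
}

# Record count per table.
record_counts = {name: len(rows) for name, rows in trusted_tables.items()}

# The trusted customer table should contain no blank consent dates.
blank_consent_rows = count_blank(
    trusted_tables["customer_trusted"], "shareWithResearchAsOfDate"
)
```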
- customer_trusted_to_curated.py has a node that inner joins the `customer_trusted` data with the `accelerometer_trusted` data by emails.
- machine_learning_curated.py has a node that inner joins the `step_trainer_trusted` data with the `accelerometer_trusted` data by sensor reading time and timestamps.
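
The timestamp join behind the machine learning table can be sketched as follows. This is plain Python standing in for the Glue join node; the field names `sensorReadingTime`, `timeStamp`, and `distanceFromObject` and the sample rows are assumptions, and an exact-match join on time is assumed.

```python
def join_on_reading_time(step_trainer_rows, accelerometer_rows):
    """Inner join step trainer and accelerometer rows on equal timestamps."""
    by_time = {}
    for reading in accelerometer_rows:
        by_time.setdefault(reading["timeStamp"], []).append(reading)

    joined = []
    for step in step_trainer_rows:
        # Only step trainer rows with a matching accelerometer reading survive.
        for reading in by_time.get(step["sensorReadingTime"], []):
            joined.append({**step, **reading})
    return joined


# Illustrative rows only; not real STEDI data.
step_trainer_trusted = [
    {"sensorReadingTime": 1000, "serialNumber": "SN-1", "distanceFromObject": 250},
    {"sensorReadingTime": 3000, "serialNumber": "SN-1", "distanceFromObject": 260},
]
accelerometer_trusted = [
    {"user": "a@example.com", "timeStamp": 1000, "x": 1.0, "y": 0.0, "z": -1.0},
    {"user": "a@example.com", "timeStamp": 2000, "x": 0.5, "y": 0.5, "z": 0.0},
]

machine_learning_curated = join_on_reading_time(
    step_trainer_trusted, accelerometer_trusted
)
```

Indexing the accelerometer rows by timestamp first keeps the join linear in the number of rows instead of comparing every pair.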
customer_curated
machine_learning_curated