End to end ELT pipeline from a data source to any data warehouse/data source using Python, SQL to answer - Find the patient(s) with the most generated minutes?
This project showcases an end-to-end ELT (Extract, Load, Transform) pipeline implemented in Python and SQL. The goal is to answer the question: "Find the patient(s) with the most generated minutes." The pipeline extracts data from CSV files, manipulates it using SQL, and generates an output CSV file containing patient information with the highest total minutes.
The following datasets are provided:
steps.csv- Columns: ID, External_id, Steps, Submission_time, Updated_at
exercises.csv- Columns: ID, External_id, minutes, completed_at, Updated_at
patients.csv- Columns: Patient_id, First_name, Last_name, Country
- Steps are converted to minutes using the formula:
minutes = steps * 0.002 - A patient can submit steps multiple times and complete multiple exercises.
- Python is used for data extraction and injection, while SQL is used for data manipulation.
- Output may contain multiple rows if multiple patients have the same highest total minutes.
- Python: Used for data extraction (Pandas library) and loading.
- SQL: Used for data manipulation (SQLite database).
- Git: Version control system for project management.
- Technologies: Docker, cloud services, AirFlow, DBT were not used in this solution.
- Clone the repository:
git clone <repository_url> - Run the Python script:
python src/main.py - Check the output CSV file in the
output/directory.
This project demonstrates a simple yet effective ELT pipeline using Python and SQL to analyze patient data and identify those with the highest total minutes generated.