
Azure-hosted data pipeline using the Medallion Architecture, PySpark, and Databricks Workflows


Welcome to Azure Data Processing Project!

> Hey there! I am Shubham Dalvi


「 I am a data engineer with a passion for big data, distributed computing, cloud solutions, and data visualization 」



About Me


✌️   Enjoy solving data problems

❤️   Passionate about big data technologies, cloud platforms, and data visualizations

📧   Reach me: [email protected]


Formula 1 Data Pipeline

Skills and Technologies

Python · PySpark · Pandas · Matplotlib · Azure · Delta Lake · Databricks · Power BI


Project Overview

This project demonstrates the implementation of an end-to-end data processing pipeline for Formula 1 data using Azure Databricks and other Azure services. By following the modern data lakehouse architecture with Bronze, Silver, and Gold layers, the solution handles raw data ingestion, cleansing, transformation, and analytics-ready dataset creation for insights.

The architecture integrates Azure Data Lake Storage Gen2, Azure Key Vault, and Azure SQL Database, alongside Delta Lake for scalable, secure, and reliable data processing.

Table of Contents

  • Technologies Used
  • Skills Demonstrated
  • Azure Architecture
  • Data Flow
  • Usage Instructions

Technologies Used

  • Azure Blob Storage: For scalable and secure object storage.
  • Azure Data Lake Storage Gen2 (ADLS): For raw and processed data storage.
  • Azure Databricks: For data processing and orchestration.
  • Databricks Workflows: For orchestrating and scheduling Databricks jobs.
  • Delta Lake: To ensure ACID transactions, schema enforcement, and time travel.
  • Azure DevOps: For CI/CD and pipeline automation.
  • Azure Key Vault: For secure secrets management.
  • Azure SQL Database: For structured data storage.
  • Power BI: For data visualization and insights.
  • Python: For scripting and processing logic.

Skills Demonstrated

  • Cloud Integration: Using Azure services to build a robust data pipeline.
  • ETL Pipeline Design: Automating data ingestion, cleansing, and transformation.
  • Data Lakehouse Implementation: Adopting the Bronze, Silver, and Gold layers for processing.
  • Delta Lake Features: Leveraging schema validation, time travel, and ACID compliance.
  • Data Visualization: Creating reports and dashboards in Power BI.
  • Secure Data Handling: Using Azure Key Vault for managing credentials and secrets.
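
As a rough illustration of the Delta Lake points above, the snippet below shows schema enforcement and time travel on a Delta table from a Databricks notebook (where `spark` is predefined). The path and columns are placeholders for illustration, not the project's actual tables.

```python
from pyspark.sql import Row

silver_path = "/mnt/silver_layer_gen2/drivers"  # placeholder path following the project's layer convention

# Example cleansed records (placeholder schema, not the project's actual F1 tables).
clean_df = spark.createDataFrame([
    Row(driver_id=1, name="Lewis Hamilton", points=413.0),
    Row(driver_id=2, name="Max Verstappen", points=454.0),
])

# Schema enforcement: Delta rejects appends whose schema does not match the existing table
# unless schema evolution is explicitly enabled with mergeSchema.
(clean_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "false")
    .save(silver_path))

# Time travel: read the table as it existed at an earlier version.
v0_df = spark.read.format("delta").option("versionAsOf", 0).load(silver_path)

# Each ACID commit appears as one row in the table history.
spark.sql(f"DESCRIBE HISTORY delta.`{silver_path}`").show(truncate=False)
```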

Azure Architecture

The architecture follows the modern data lakehouse approach, integrating technologies such as Azure Blob Storage, Azure DevOps, and Databricks Workflows alongside the core services:

Architecture Diagram

  1. Bronze Layer (Raw Data):

    • Data ingestion from multiple sources: CSV, Excel, Parquet, SQL, APIs, and streaming data.
    • Stored in ADLS Gen2 under /mnt/bronze_layer_gen2/ (see the ingestion sketch after this list).
  2. Silver Layer (Cleansed Data):

    • Data cleansing, schema validation, and type standardization.
    • Delta Lake ensures ACID compliance.
    • Stored in ADLS Gen2 under /mnt/silver_layer_gen2/.
  3. Gold Layer (Analytics-Ready Data):

    • Fact and dimension tables for business reporting.
    • Stored in ADLS Gen2 under /mnt/gold_layer_gen2/.
  4. Visualization and Analytics:

    • Power BI dashboards and reports for insights.
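
The following PySpark sketch ties the layer paths above to a minimal Bronze ingestion step. Only the /mnt/..._layer_gen2/ paths come from the architecture description; the source file name, landing folder, and metadata columns are assumptions for illustration.

```python
from pyspark.sql import functions as F

# Layer paths as described above (mounted ADLS Gen2 containers).
bronze_path = "/mnt/bronze_layer_gen2"
silver_path = "/mnt/silver_layer_gen2"
gold_path = "/mnt/gold_layer_gen2"

# Bronze: land raw source data as-is, adding only ingestion metadata.
# "circuits.csv" and the landing folder are illustrative names.
raw_df = (spark.read
    .option("header", "true")
    .csv(f"{bronze_path}/landing/circuits.csv"))

bronze_df = (raw_df
    .withColumn("ingestion_date", F.current_timestamp())
    .withColumn("source_file", F.input_file_name()))

# Write the raw records to the Bronze layer as a Delta table.
(bronze_df.write
    .format("delta")
    .mode("append")
    .save(f"{bronze_path}/circuits"))
```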

Data Flow

Orchestration and CI/CD Automation

  • Azure DevOps: Enables CI/CD pipelines, ensuring seamless deployment and integration of data workflows.
  • Databricks Workflows: Manages job scheduling, task orchestration, and dependency handling within the data pipeline.
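
The workflow itself can be defined as a multi-task job. Below is a hedged sketch of a Databricks Jobs API 2.1 payload submitted from Python; the job name, notebook paths, cluster ID, and schedule are illustrative placeholders, not the project's actual configuration.

```python
import requests

# Placeholders for the workspace URL and a personal access token.
DATABRICKS_HOST = "https://<workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# Illustrative job definition: three notebook tasks chained Bronze -> Silver -> Gold.
job_payload = {
    "name": "f1-medallion-pipeline",
    "tasks": [
        {
            "task_key": "ingest_bronze",
            "notebook_task": {"notebook_path": "/Repos/pipeline/ingest_bronze"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "transform_silver",
            "depends_on": [{"task_key": "ingest_bronze"}],
            "notebook_task": {"notebook_path": "/Repos/pipeline/transform_silver"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "aggregate_gold",
            "depends_on": [{"task_key": "transform_silver"}],
            "notebook_task": {"notebook_path": "/Repos/pipeline/aggregate_gold"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
    # Daily run at 06:00 UTC; task ordering is expressed through depends_on.
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_payload,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id
```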

Detailed Data Processing Steps:

  1. Data Ingestion:

    • Data from multiple sources (APIs, CSV files, and SQL tables) is ingested into the Bronze layer using Azure Blob Storage and Azure Databricks notebooks.
    • Databricks Workflows handle automated scheduling and orchestration, ensuring tasks are executed in the correct sequence.
  2. Data Cleansing and Transformation:

    • Bronze-to-Silver processing cleanses raw data and applies schema validation.
    • Silver-to-Gold processing aggregates data and implements business logic (see the PySpark sketch after this list).
  3. Analytics and Reporting:

    • Data is queried from the Gold layer to build dashboards and insights in Power BI.
    • Power BI connects to Azure Databricks and the Gold layer datasets for interactive reporting.
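
A minimal sketch of the Bronze-to-Silver and Silver-to-Gold steps above. The layer paths follow the architecture section; the table names, columns, and aggregation logic are placeholders, not the project's actual business rules.

```python
from pyspark.sql import functions as F

bronze_path = "/mnt/bronze_layer_gen2"
silver_path = "/mnt/silver_layer_gen2"
gold_path = "/mnt/gold_layer_gen2"

# Bronze -> Silver: deduplicate, standardize types, and drop invalid rows (illustrative columns).
results_silver = (spark.read.format("delta").load(f"{bronze_path}/results")
    .dropDuplicates(["result_id"])
    .withColumn("points", F.col("points").cast("double"))
    .withColumn("race_id", F.col("race_id").cast("int"))
    .filter(F.col("race_id").isNotNull()))

(results_silver.write
    .format("delta")
    .mode("overwrite")
    .save(f"{silver_path}/results"))

# Silver -> Gold: aggregate into an analytics-ready table consumed by Power BI.
driver_standings_gold = (results_silver
    .groupBy("race_year", "driver_name")
    .agg(F.sum("points").alias("total_points"),
         F.count(F.when(F.col("position") == 1, True)).alias("wins")))

(driver_standings_gold.write
    .format("delta")
    .mode("overwrite")
    .save(f"{gold_path}/driver_standings"))
```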

Usage Instructions

  1. Setup Azure Resources:

    • Configure Azure Databricks workspace and clusters.
    • Set up ADLS Gen2 for data storage.
    • Configure Azure Key Vault for secrets management (a mount sketch using a Key Vault-backed secret scope follows these steps).
  2. Implement Data Processing:

    • Create notebooks for data ingestion and cleansing.
    • Implement Delta Lake tables for data storage.
    • Develop transformation logic for Silver and Gold layers.
  3. Visualization and Reporting:

    • Use Power BI to connect to Gold layer datasets.
    • Build interactive dashboards for business insights.
  4. Monitor Pipeline:

    • Use Azure Monitor and Databricks logging for monitoring.
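
For step 1, here is a hedged sketch of mounting an ADLS Gen2 container from a Databricks notebook with service principal credentials pulled from an Azure Key Vault-backed secret scope. The storage account, container, scope, and secret names are all placeholders.

```python
# Service principal credentials are read from a Key Vault-backed Databricks secret scope.
# "kv-scope" and the secret names are placeholders.
client_id = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
tenant_id = dbutils.secrets.get(scope="kv-scope", key="sp-tenant-id")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint": f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the bronze container; repeat for the silver and gold containers.
# "<storage-account>" is a placeholder for the ADLS Gen2 account name.
dbutils.fs.mount(
    source="abfss://bronze@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/bronze_layer_gen2",
    extra_configs=configs,
)
```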

Feel free to contribute or reach out if you have any suggestions or improvements!
