> Hey there! I'm Shubham Dalvi
「 I am a data engineer with a passion for big data, distributed computing, cloud solutions, and data visualization 」
✌️ Enjoy solving data problems
❤️ Passionate about big data technologies, cloud platforms, and data visualization
📧 Reach me: [email protected]
This project demonstrates an end-to-end data processing pipeline for Formula 1 data built on Azure Databricks and other Azure services. Following the modern data lakehouse architecture with Bronze, Silver, and Gold layers, the solution handles raw data ingestion, cleansing, transformation, and the creation of analytics-ready datasets.
The architecture integrates Azure Data Lake Storage Gen2, Azure Key Vault, and Azure SQL Database, alongside Delta Lake for scalable, secure, and reliable data processing.
- Azure Blob Storage: For scalable and secure object storage.
- Azure Data Lake Storage Gen2 (ADLS): For raw and processed data storage.
- Azure Databricks: For data processing and orchestration.
- Databricks Workflows: For orchestrating and scheduling Databricks jobs.
- Delta Lake: To ensure ACID transactions, schema enforcement, and time travel.
- Azure DevOps: For CI/CD and pipeline automation.
- Azure Key Vault: For secure secrets management.
- Azure SQL Database: For structured data storage.
- Power BI: For data visualization and insights.
- Python: For scripting and processing logic.
- Cloud Integration: Using Azure services to build a robust data pipeline.
- ETL Pipeline Design: Automating data ingestion, cleansing, and transformation.
- Data Lakehouse Implementation: Adopting the Bronze, Silver, and Gold layers for processing.
- Delta Lake Features: Leveraging schema validation, time travel, and ACID compliance (see the time-travel sketch after this list).
- Data Visualization: Creating reports and dashboards in Power BI.
- Secure Data Handling: Using Azure Key Vault for managing credentials and secrets.
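As a quick illustration of the time travel feature, the sketch below reads two versions of the same Delta table on a Databricks cluster (where `spark` is predefined). The table path and version number are assumptions for illustration, not this project's actual tables.

```python
# A minimal sketch of Delta Lake time travel; path and version are illustrative.
df_current = spark.read.format("delta").load("/mnt/silver_layer_gen2/races")

# Read an earlier snapshot of the same table (assumes version 0 exists).
df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/silver_layer_gen2/races")
)
```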
The architecture follows the modern data lakehouse approach, integrating technologies such as Azure Blob Storage, Azure DevOps, and Databricks Workflows alongside the core services:
- Bronze Layer (Raw Data):
  - Data ingestion from multiple sources: CSV, Excel, Parquet, SQL, APIs, and streaming data.
  - Stored in ADLS Gen2 under `/mnt/bronze_layer_gen2/` (see the ingestion sketch below).
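A minimal ingestion sketch for the Bronze layer, assuming the container is already mounted at `/mnt/bronze_layer_gen2/` and that a `circuits.csv` file (a hypothetical example source) exists there:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema so malformed rows are caught at read time rather than
# silently inferred; the columns shown are illustrative.
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
])

circuits_df = (
    spark.read
    .option("header", True)
    .schema(circuits_schema)
    .csv("/mnt/bronze_layer_gen2/circuits.csv")
)
```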
- Silver Layer (Cleansed Data):
  - Data cleansing, schema validation, and type standardization.
  - Delta Lake ensures ACID compliance.
  - Stored in ADLS Gen2 under `/mnt/silver_layer_gen2/` (see the cleansing sketch below).
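A sketch of the Bronze-to-Silver step, reusing the hypothetical `circuits_df` from the ingestion sketch above; the cleansing rules are illustrative:

```python
from pyspark.sql import functions as F

silver_df = (
    circuits_df
    .dropDuplicates(["circuitId"])                     # remove duplicate keys
    .withColumn("country", F.upper(F.col("country")))  # standardize casing
    .withColumn("ingestion_date", F.current_timestamp())
)

# Writing in Delta format provides ACID guarantees and schema enforcement
# on subsequent appends.
silver_df.write.format("delta").mode("overwrite").save(
    "/mnt/silver_layer_gen2/circuits"
)
```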
- Gold Layer (Analytics-Ready Data):
  - Fact and dimension tables for business reporting.
  - Stored in ADLS Gen2 under `/mnt/gold_layer_gen2/` (see the aggregation sketch below).
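A Silver-to-Gold aggregation sketch; the `results` table and its columns are assumptions standing in for the project's actual fact tables:

```python
from pyspark.sql import functions as F

results_df = spark.read.format("delta").load("/mnt/silver_layer_gen2/results")

# Aggregate race results into a driver-standings fact table.
driver_standings_df = (
    results_df
    .groupBy("race_year", "driver_name")
    .agg(
        F.sum("points").alias("total_points"),
        F.count(F.when(F.col("position") == 1, True)).alias("wins"),
    )
)

driver_standings_df.write.format("delta").mode("overwrite").save(
    "/mnt/gold_layer_gen2/driver_standings"
)
```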
- Visualization and Analytics:
  - Power BI dashboards and reports for insights.
- Azure DevOps: Enables CI/CD pipelines, ensuring seamless deployment and integration of data workflows.
- Databricks Workflows: Manages job scheduling, task orchestration, and dependency handling within the data pipeline.
- Data is ingested into the Bronze layer from Azure Blob Storage using Azure Databricks.
- Databricks Workflows handle automated scheduling and orchestration, ensuring tasks are executed in the correct sequence (see the sequencing sketch below).
- Azure Databricks notebooks handle ingestion from the individual sources: APIs, CSV files, and SQL tables.
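In this project the sequencing is handled by Databricks Workflows; the sketch below expresses the same idea in a driver notebook with `dbutils.notebook.run`, using hypothetical notebook names:

```python
# Run the ingestion notebooks one after another; each call blocks until
# the child notebook finishes or the 600-second timeout is hit.
for notebook in ["ingest_circuits", "ingest_races", "ingest_results"]:
    result = dbutils.notebook.run(notebook, 600)
    print(f"{notebook} returned: {result}")
```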
- Data Cleansing and Transformation:
  - Bronze-to-Silver processing cleanses raw data and applies the schema.
  - Silver-to-Gold processing aggregates data and implements business logic (see the merge sketch below).
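For incremental Silver-to-Gold updates, a Delta `MERGE` (upsert) is a common pattern; this sketch reuses the hypothetical `driver_standings_df` from the Gold layer sketch above:

```python
from delta.tables import DeltaTable

gold_table = DeltaTable.forPath(spark, "/mnt/gold_layer_gen2/driver_standings")

# Update rows that already exist for the key, insert the rest.
(
    gold_table.alias("tgt")
    .merge(
        driver_standings_df.alias("src"),
        "tgt.race_year = src.race_year AND tgt.driver_name = src.driver_name",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```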
- Analytics and Reporting:
  - Data is queried from the Gold layer to build dashboards and insights in Power BI.
  - Power BI connects to Azure Databricks and the Gold layer datasets for interactive reporting (see the query sketch below).
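A sketch of the kind of Gold layer query that backs a report; the view name and filter are illustrative:

```python
spark.read.format("delta").load(
    "/mnt/gold_layer_gen2/driver_standings"
).createOrReplaceTempView("driver_standings")

top_drivers = spark.sql("""
    SELECT race_year, driver_name, total_points
    FROM driver_standings
    WHERE race_year = 2020
    ORDER BY total_points DESC
    LIMIT 10
""")
top_drivers.show()
```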
- Setup Azure Resources:
  - Configure the Azure Databricks workspace and clusters.
  - Set up ADLS Gen2 for data storage.
  - Configure Azure Key Vault for secrets management (see the mount sketch below).
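A sketch of mounting ADLS Gen2 with service principal credentials pulled from a Key Vault-backed secret scope; the scope name, secret names, and `<storage-account>` placeholder are assumptions to replace with your own values:

```python
# Secrets never appear in the notebook; they are resolved at runtime from
# the Key Vault-backed scope (here hypothetically named "f1-scope").
client_id = dbutils.secrets.get(scope="f1-scope", key="client-id")
tenant_id = dbutils.secrets.get(scope="f1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="f1-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://bronze@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/bronze_layer_gen2",
    extra_configs=configs,
)
```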
- Implement Data Processing:
  - Create notebooks for data ingestion and cleansing.
  - Implement Delta Lake tables for data storage (see the table-registration sketch below).
  - Develop transformation logic for the Silver and Gold layers.
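Registering a Delta location as a metastore table makes it queryable by name; the database and table names below are illustrative:

```python
spark.sql("CREATE DATABASE IF NOT EXISTS f1_silver")
spark.sql("""
    CREATE TABLE IF NOT EXISTS f1_silver.circuits
    USING DELTA
    LOCATION '/mnt/silver_layer_gen2/circuits'
""")
```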
- Visualization and Reporting:
  - Use Power BI to connect to Gold layer datasets.
  - Build interactive dashboards for business insights.
- Monitor Pipeline:
  - Use Azure Monitor and Databricks logging for monitoring (see the logging sketch below).
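A lightweight logging sketch for a notebook task; Python's standard `logging` output lands in the Databricks driver logs, which can be forwarded for monitoring. The table path is illustrative:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("f1_pipeline")

logger.info("Silver validation started")
try:
    row_count = (
        spark.read.format("delta")
        .load("/mnt/silver_layer_gen2/circuits")
        .count()
    )
    logger.info("Silver circuits table has %d rows", row_count)
except Exception:
    logger.exception("Silver validation failed")
    raise
```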
Feel free to contribute or reach out if you have any suggestions or improvements!