> Hey there! I'm Shubham Dalvi
「 I am a data engineer with a passion for big data, distributed computing, cloud solutions, and data visualization 」
✌️ Enjoy solving data problems
❤️ Passionate about big data technologies, cloud platforms, and data visualization
📧 Reach me: [email protected]
This project demonstrates an end-to-end data processing pipeline for Formula 1 data built on Azure Databricks and other Azure services. Following the modern data lakehouse architecture with Bronze, Silver, and Gold layers, the solution handles raw data ingestion, cleansing, transformation, and the creation of analytics-ready datasets.
The architecture integrates Azure Data Lake Storage Gen2, Azure Key Vault, and Azure SQL Database, alongside Delta Lake for scalable, secure, and reliable data processing.
- Azure Blob Storage: For scalable and secure object storage.
- Azure Data Lake Storage Gen2 (ADLS): For raw and processed data storage.
- Azure Databricks: For data processing and orchestration.
- Databricks Workflows: For orchestrating and scheduling Databricks jobs.
- Delta Lake: To ensure ACID transactions, schema enforcement, and time travel.
- Azure DevOps: For CI/CD and pipeline automation.
- Azure Key Vault: For secure secrets management.
- Azure SQL Database: For structured data storage.
- Power BI: For data visualization and insights.
- Python: For scripting and processing logic.
- Cloud Integration: Using Azure services to build a robust data pipeline.
- ETL Pipeline Design: Automating data ingestion, cleansing, and transformation.
- Data Lakehouse Implementation: Adopting the Bronze, Silver, and Gold layers for processing.
- Delta Lake Features: Leveraging schema validation, time travel, and ACID compliance (see the time-travel sketch after this list).
- Data Visualization: Creating reports and dashboards in Power BI.
- Secure Data Handling: Using Azure Key Vault for managing credentials and secrets.
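As a quick illustration of the time travel feature, the sketch below reads two versions of the same Delta table on a Databricks cluster (where `spark` is predefined). The table path and version number are assumptions for illustration, not this project's actual tables.

```python
# A minimal sketch of Delta Lake time travel; path and version are illustrative.
df_current = spark.read.format("delta").load("/mnt/silver_layer_gen2/races")

# Read an earlier snapshot of the same table (assumes version 0 exists).
df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/silver_layer_gen2/races")
)
```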
The architecture follows the modern data lakehouse approach, integrating technologies such as Azure Blob Storage, Azure DevOps, and Databricks Workflows alongside the core services:
- Bronze Layer (Raw Data):
  - Data ingestion from multiple sources: CSV, Excel, Parquet, SQL, APIs, and streaming data.
  - Stored in ADLS Gen2 under `/mnt/bronze_layer_gen2/` (see the ingestion sketch below).
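A minimal ingestion sketch for the Bronze layer, assuming the container is already mounted at `/mnt/bronze_layer_gen2/` and that a `circuits.csv` file (a hypothetical example source) exists there:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema so malformed rows are caught at read time rather than
# silently inferred; the columns shown are illustrative.
circuits_schema = StructType([
    StructField("circuitId", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("location", StringType(), True),
    StructField("country", StringType(), True),
])

circuits_df = (
    spark.read
    .option("header", True)
    .schema(circuits_schema)
    .csv("/mnt/bronze_layer_gen2/circuits.csv")
)
```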
- Silver Layer (Cleansed Data):
  - Data cleansing, schema validation, and type standardization.
  - Delta Lake ensures ACID compliance.
  - Stored in ADLS Gen2 under `/mnt/silver_layer_gen2/` (see the cleansing sketch below).
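A sketch of the Bronze-to-Silver step, reusing the hypothetical `circuits_df` from the ingestion sketch above; the cleansing rules are illustrative:

```python
from pyspark.sql import functions as F

silver_df = (
    circuits_df
    .dropDuplicates(["circuitId"])                     # remove duplicate keys
    .withColumn("country", F.upper(F.col("country")))  # standardize casing
    .withColumn("ingestion_date", F.current_timestamp())
)

# Writing in Delta format provides ACID guarantees and schema enforcement
# on subsequent appends.
silver_df.write.format("delta").mode("overwrite").save(
    "/mnt/silver_layer_gen2/circuits"
)
```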
- Gold Layer (Analytics-Ready Data):
  - Fact and dimension tables for business reporting.
  - Stored in ADLS Gen2 under `/mnt/gold_layer_gen2/` (see the aggregation sketch below).
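A Silver-to-Gold aggregation sketch; the `results` table and its columns are assumptions standing in for the project's actual fact tables:

```python
from pyspark.sql import functions as F

results_df = spark.read.format("delta").load("/mnt/silver_layer_gen2/results")

# Aggregate race results into a driver-standings fact table.
driver_standings_df = (
    results_df
    .groupBy("race_year", "driver_name")
    .agg(
        F.sum("points").alias("total_points"),
        F.count(F.when(F.col("position") == 1, True)).alias("wins"),
    )
)

driver_standings_df.write.format("delta").mode("overwrite").save(
    "/mnt/gold_layer_gen2/driver_standings"
)
```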
- Visualization and Analytics:
  - Power BI dashboards and reports for insights.
- Azure DevOps: Enables CI/CD pipelines, ensuring seamless deployment and integration of data workflows.
- Databricks Workflows: Manages job scheduling, task orchestration, and dependency handling within the data pipeline.
- Data is ingested into the Bronze layer from Azure Blob Storage using Azure Databricks.
- Databricks Workflows handle automated scheduling and orchestration, ensuring tasks are executed in the correct sequence (see the sequencing sketch below).
- Azure Databricks notebooks handle ingestion from the individual sources: APIs, CSV files, and SQL tables.
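In this project the sequencing is handled by Databricks Workflows; the sketch below expresses the same idea in a driver notebook with `dbutils.notebook.run`, using hypothetical notebook names:

```python
# Run the ingestion notebooks one after another; each call blocks until
# the child notebook finishes or the 600-second timeout is hit.
for notebook in ["ingest_circuits", "ingest_races", "ingest_results"]:
    result = dbutils.notebook.run(notebook, 600)
    print(f"{notebook} returned: {result}")
```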
- Data Cleansing and Transformation:
  - Bronze-to-Silver processing cleanses raw data and applies the schema.
  - Silver-to-Gold processing aggregates data and implements business logic (see the merge sketch below).
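For incremental Silver-to-Gold updates, a Delta `MERGE` (upsert) is a common pattern; this sketch reuses the hypothetical `driver_standings_df` from the Gold layer sketch above:

```python
from delta.tables import DeltaTable

gold_table = DeltaTable.forPath(spark, "/mnt/gold_layer_gen2/driver_standings")

# Update rows that already exist for the key, insert the rest.
(
    gold_table.alias("tgt")
    .merge(
        driver_standings_df.alias("src"),
        "tgt.race_year = src.race_year AND tgt.driver_name = src.driver_name",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```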
- Analytics and Reporting:
  - Data is queried from the Gold layer to build dashboards and insights in Power BI.
  - Power BI connects to Azure Databricks and the Gold layer datasets for interactive reporting (see the query sketch below).
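A sketch of the kind of Gold layer query that backs a report; the view name and filter are illustrative:

```python
spark.read.format("delta").load(
    "/mnt/gold_layer_gen2/driver_standings"
).createOrReplaceTempView("driver_standings")

top_drivers = spark.sql("""
    SELECT race_year, driver_name, total_points
    FROM driver_standings
    WHERE race_year = 2020
    ORDER BY total_points DESC
    LIMIT 10
""")
top_drivers.show()
```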
- Setup Azure Resources:
  - Configure the Azure Databricks workspace and clusters.
  - Set up ADLS Gen2 for data storage.
  - Configure Azure Key Vault for secrets management (see the mount sketch below).
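A sketch of mounting ADLS Gen2 with service principal credentials pulled from a Key Vault-backed secret scope; the scope name, secret names, and `<storage-account>` placeholder are assumptions to replace with your own values:

```python
# Secrets never appear in the notebook; they are resolved at runtime from
# the Key Vault-backed scope (here hypothetically named "f1-scope").
client_id = dbutils.secrets.get(scope="f1-scope", key="client-id")
tenant_id = dbutils.secrets.get(scope="f1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="f1-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://bronze@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/bronze_layer_gen2",
    extra_configs=configs,
)
```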
- Implement Data Processing:
  - Create notebooks for data ingestion and cleansing.
  - Implement Delta Lake tables for data storage (see the table-registration sketch below).
  - Develop transformation logic for the Silver and Gold layers.
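Registering a Delta location as a metastore table makes it queryable by name; the database and table names below are illustrative:

```python
spark.sql("CREATE DATABASE IF NOT EXISTS f1_silver")
spark.sql("""
    CREATE TABLE IF NOT EXISTS f1_silver.circuits
    USING DELTA
    LOCATION '/mnt/silver_layer_gen2/circuits'
""")
```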
- Visualization and Reporting:
  - Use Power BI to connect to Gold layer datasets.
  - Build interactive dashboards for business insights.
- Monitor Pipeline:
  - Use Azure Monitor and Databricks logging for monitoring (see the logging sketch below).
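A lightweight logging sketch for a notebook task; Python's standard `logging` output lands in the Databricks driver logs, which can be forwarded for monitoring. The table path is illustrative:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("f1_pipeline")

logger.info("Silver validation started")
try:
    row_count = (
        spark.read.format("delta")
        .load("/mnt/silver_layer_gen2/circuits")
        .count()
    )
    logger.info("Silver circuits table has %d rows", row_count)
except Exception:
    logger.exception("Silver validation failed")
    raise
```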
Feel free to contribute or reach out if you have any suggestions or improvements!