This project demonstrates a scalable, end-to-end data pipeline for processing and reporting retail order data using Azure cloud services, Delta Lake, and Databricks.
The project automates data ingestion, transformation, and consumption for a retail system. The final output is made available for downstream analytics in Power BI.
The architecture leverages:
- Azure Data Factory: For orchestrating ingestion from GitHub and storing raw data in Azure Data Lake Gen2.
- Databricks & Delta Live Tables: For creating multi-layered data pipelines (Bronze, Silver, Gold) using PySpark notebooks.
- Power BI: For reporting and visualization.
- GitHub: As the source repository for ingestion and version control.
The Databricks workflow is orchestrated as shown below:
| Layer  | Description |
|--------|-------------|
| Lookup | Initial lookup data load |
| Bronze | Raw ingestion using Auto Loader (sketched below) |
| Silver | Cleansed and filtered data tables (Customers, Orders, Products) |
| Gold   | Curated, business-ready dimension and fact tables |
| Fact   | Star schema constructed for analytical reporting |
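As an illustration of the Bronze step, here is a minimal Delta Live Tables sketch using Auto Loader. The container path, file format, and table name are assumptions for this example, not values taken from the repository.

```python
# Minimal Delta Live Tables sketch of the Bronze ingestion step.
# The source path, file format, and table name are illustrative assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="bronze_orders",
    comment="Raw retail orders ingested incrementally with Auto Loader",
)
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")               # Auto Loader source
        .option("cloudFiles.format", "csv")                 # assumed raw format
        .option("cloudFiles.inferColumnTypes", "true")      # infer schema types
        .load("abfss://raw@<storage>.dfs.core.windows.net/orders/")
        .withColumn("_ingested_at", F.current_timestamp())  # audit column
    )
```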
The notebooks follow the medallion architecture:
```
retail_orders/
├── lookup Notebook.python
├── Bronze Layer.python
├── Silver_Customers.python
├── Silver_Orders.python
├── Silver_Products.python
├── Silver_Regions.python
├── Gold_Customers.python
├── Gold_Products.python
└── Gold Orders.python
```
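The Silver notebooks apply cleansing rules such as deduplication, null filtering, and type normalization. The sketch below shows what a notebook like Silver_Orders might contain; the column names (order_id, order_date, amount) and table names are hypothetical.

```python
# Sketch of a Silver-layer cleansing notebook (e.g. Silver_Orders).
# Table names and columns (order_id, order_date, amount) are assumptions.
from pyspark.sql import functions as F

bronze = spark.read.table("bronze_orders")                   # read raw layer

silver_orders = (
    bronze
    .dropDuplicates(["order_id"])                            # remove replays
    .filter(F.col("order_id").isNotNull())                   # drop bad rows
    .withColumn("order_date", F.to_date("order_date"))       # normalize types
    .withColumn("amount", F.col("amount").cast("double"))
)

silver_orders.write.format("delta").mode("overwrite") \
    .saveAsTable("silver_orders")                            # cleansed table
```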
- Azure Data Factory: Orchestration
- Azure Data Lake Storage Gen2: Data storage layer
- Databricks & Spark (Delta Live Tables): Data transformation
- Power BI: Visualization and reporting
- Delta Lake: Storage format providing ACID transactions, time travel, and versioning (see the example after this list)
- GitHub: CI/CD & source control
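Delta Lake's versioning makes it possible to query any table as of an earlier version or timestamp. A quick time-travel example (the table name is illustrative):

```python
# Time travel with Delta Lake: query an earlier version of a table.
# "silver_orders" is an illustrative table name.
previous = spark.sql("SELECT * FROM silver_orders VERSION AS OF 0")
previous.show()

# Equivalent DataFrame API against a path-based Delta table:
# spark.read.format("delta").option("versionAsOf", 0).load("/delta/silver_orders")
```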
The final Gold Layer tables and Fact_Orders are pushed to Power BI for analysis. The star schema ensures performance optimization for downstream consumers and reporting tools.
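A sketch of how Fact_Orders could be assembled by resolving the Gold dimension keys; the dimension tables, join keys, and columns are assumptions for illustration.

```python
# Sketch of building a star-schema fact table from the Gold dimensions.
# Dimension/table names and key columns are illustrative assumptions.
orders   = spark.read.table("silver_orders")
dim_cust = spark.read.table("gold_customers")    # holds customer_key
dim_prod = spark.read.table("gold_products")     # holds product_key

fact_orders = (
    orders
    .join(dim_cust, "customer_id", "left")       # resolve surrogate keys
    .join(dim_prod, "product_id", "left")
    .select(
        "order_id", "customer_key", "product_key",
        "order_date", "amount",
    )
)

fact_orders.write.format("delta").mode("overwrite") \
    .saveAsTable("fact_orders")                  # consumed by Power BI
```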
- Import the `.dbc` archive into your Databricks workspace.
- Set up the workflows as shown in the pipeline image.
- Configure your Azure Data Lake Gen2 and GitHub connectors (a connection sketch follows this list).
- Trigger the pipeline manually or via job scheduler.
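One common way to configure the Azure Data Lake Gen2 connection from Databricks is service-principal OAuth via Spark configuration, with credentials pulled from a secret scope. The storage account, scope, and key names below are placeholders:

```python
# Example ADLS Gen2 connection via a service principal (OAuth).
# <storage-account>, the secret scope, and secret key names are placeholders.
storage = "<storage-account>"
client_id     = dbutils.secrets.get("retail-scope", "sp-client-id")
client_secret = dbutils.secrets.get("retail-scope", "sp-client-secret")
tenant_id     = dbutils.secrets.get("retail-scope", "tenant-id")

spark.conf.set(f"fs.azure.account.auth.type.{storage}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
```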
- Fully orchestrated and automated data pipeline
- Delta Live Tables for scalable processing
- Adheres to the medallion architecture (Bronze → Silver → Gold)
- Seamless integration from ingestion to visualization