Machine Learning At Scale (Spark, Spark SQL, Spark ML)

Access the project via Databricks here

Flight delays disrupt airline and airport scheduling, inconvenience passengers, and cause large economic losses. As a result, there is growing interest in predicting delays in advance to optimize operations and improve customer satisfaction. The objective of this playground project is to predict flight departure delays two hours before departure, at scale. The project explores a series of data transformations and ML pipelines in Apache Spark (using Databricks), and concludes with the challenges faced along the way and the key lessons learned.
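Predicting two hours ahead means a feature is only usable if it was known at least two hours before the scheduled departure. The following is a minimal plain-Python sketch of that cutoff rule, not the notebook's actual Spark code. The function name `latest_usable_observation`, the `observed_at` and `wind_speed` fields, and the sample values are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Lead time taken from the project objective: predict two hours before departure.
PREDICTION_LEAD = timedelta(hours=2)


def latest_usable_observation(scheduled_departure, observations, lead=PREDICTION_LEAD):
    """Return the most recent observation recorded at or before the cutoff.

    The cutoff is the scheduled departure minus the lead time. Anything
    recorded after it would leak information that is not available when
    the prediction is made. Returns None if nothing qualifies.
    """
    cutoff = scheduled_departure - lead
    usable = [obs for obs in observations if obs["observed_at"] <= cutoff]
    return max(usable, key=lambda obs: obs["observed_at"], default=None)


# Hypothetical sample data: three observations before a 12:00 departure.
departure = datetime(2015, 1, 1, 12, 0)
observations = [
    {"observed_at": datetime(2015, 1, 1, 9, 0), "wind_speed": 4.0},
    {"observed_at": datetime(2015, 1, 1, 10, 0), "wind_speed": 6.5},
    {"observed_at": datetime(2015, 1, 1, 11, 0), "wind_speed": 9.0},
]

chosen = latest_usable_observation(departure, observations)
print(chosen["wind_speed"])  # 6.5: the 11:00 reading falls inside the two-hour window
```

In Spark the same rule would typically be expressed as a join condition on timestamps rather than a Python loop; the sketch only shows the logic of the cutoff.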

The Databricks notebook is connected to AWS, where it can create and manage compute and VPC resources. The notebook reads its data from a mounted S3 bucket on AWS.

Datasets used in the project include the following:

The project can be directly accessed via Spark Playground - Flight Delay Prediction. This repository also contains the .dbc and .py versions of the Databricks notebook.