The purpose of the data engineering capstone project is to give you a chance to combine what you've learned throughout the program. This project will be an important part of your portfolio that will help you achieve your data engineering-related career goals.
In this project, you can choose to complete the project provided for you, or define the scope and data for a project of your own design. Either way, you'll be expected to go through the same steps outlined below.
In the Udacity-provided project, you'll work with four datasets. The main dataset covers immigration to the United States, and the supplementary datasets cover airport codes, U.S. city demographics, and temperature. You're also welcome to enrich the project with additional data if you'd like to set your project apart.
- Python 3.6 or above
- Git set up and configured for SSH
- Docker (if running locally)
- Clone the repository by running
git clone [email protected]:seetdev/dend-capstone.git
- Go into the cloned folder
- Create folders for model_data, raw_sas_data, sas_data, and staging_data
- Build the Docker image by running
docker build --tag udacity-dend/pyspark-notebook .
- Start the Docker container by running
docker run --rm -d -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work --name spark udacity-dend/pyspark-notebook
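
The folder-creation step in the list above does not come with a command. A minimal sketch, assuming the four folders belong at the root of the cloned repository (the source does not state where they should live), could look like this:

```shell
# Assumption: run this from the root of the cloned repository.
# -p creates each folder only if it is missing, so re-running is safe.
mkdir -p model_data raw_sas_data sas_data staging_data

# List the folders to confirm they exist.
ls -d model_data raw_sas_data sas_data staging_data
```

Because the docker run command mounts the current directory at /home/jovyan/work, folders created this way before starting the container should show up under that path inside it.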