Designing a Cassandra Database for Sparkify's Music Streaming Analytics

Description

This project builds a NoSQL database in Apache Cassandra for Sparkify, a music streaming startup. The goal is to analyze the song and user activity data collected by the Sparkify app and to make that data easy to query, so that user preferences can be understood.

Installation

  • Python 3.7+
  • Apache Cassandra
  • Cassandra Python Driver

Usage

  1. Clone this repository.
  2. Execute Data_Modeling_with_Cassandra.ipynb to preprocess the data and interact with the database.

Project Overview

This project entails creating tables in Apache Cassandra to facilitate efficient querying on song play data for Sparkify’s analytics team. The ETL pipeline is developed using Python, and it processes data residing in a directory of CSV files to create a streamlined CSV file, which is then used to insert data into Apache Cassandra tables.
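The consolidation step described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the function name `combine_event_files` and its parameters are hypothetical, and the list of columns to keep must be supplied by the caller, since the exact column names are not given here.

```python
import csv
import glob
import os


def combine_event_files(input_dir, output_path, columns):
    """Merge every CSV file under input_dir into one CSV holding only `columns`.

    Hypothetical helper for illustration; returns the number of data rows written.
    """
    # Sort the paths so the output order is reproducible across runs.
    paths = sorted(glob.glob(os.path.join(input_dir, "**", "*.csv"), recursive=True))
    # Never read the output file itself if it lives inside input_dir.
    target = os.path.abspath(output_path)
    paths = [p for p in paths if os.path.abspath(p) != target]

    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=columns)
        writer.writeheader()
        for path in paths:
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    # Skip rows in which none of the kept columns has a value.
                    if not any(row.get(c) for c in columns):
                        continue
                    writer.writerow({c: row.get(c) or "" for c in columns})
                    rows_written += 1
    return rows_written
```

A call such as `combine_event_files("event_data", "event_datafile_new.csv", ["artist", "song"])` would write one combined file; the two column names in that call are placeholders, not the project's confirmed schema.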

Datasets

The dataset used is event_data, a collection of CSV files partitioned by date. It contains details such as artist name, user name, song details, and user location. After processing these files, the denormalized data appear as follows:

[Figure: Sample of the denormalized data]

Project Steps

  1. Develop an ETL pipeline to process and transform event_data files to create a denormalized dataset.
  2. Create the Apache Cassandra database.
  3. Model the database tables based on the required queries.
  4. Create the tables and load the data into them.
  5. Run the provided queries to verify the model's effectiveness in answering analytics queries.

Files

  • Sparkify-Project-Notebook.ipynb - Jupyter notebook containing ETL pipeline, Apache Cassandra database, tables setup, and test queries.
  • event_datafile_new.csv - The preprocessed CSV file, generated by combining event_data files.

Acknowledgments

This project is part of the Data Engineering Nanodegree Program provided by Udacity.