This project uses Kafka-Python, a client for the Apache Kafka distributed stream processing system, to stream Bitcoin prices, and Apache Spark for real-time price analytics.
## Install Spark

1. Go to the official Apache Spark website, download Spark, and extract it into a new empty folder `C:\Spark`.
2. Create a folder `C:\Hadoop` and copy the `bin` folder for your specific Hadoop version from this repository into it.
3. Download Java if it is not yet installed.
4. Set the following system variables: `SPARK_HOME` as `C:\Spark\spark-3.5.1-bin-hadoop3` (adjust to the Spark version you downloaded), `HADOOP_HOME` as `C:\Hadoop`, `JAVA_HOME` as `C:\Program Files\Java\jre-1.8`, and `SPARK_LOCAL_IP` as `127.0.0.1`. (Once Python is installed, a quick check of these variables is sketched after this list.)
5. Add `%SPARK_HOME%\bin` and `%HADOOP_HOME%\bin` to the system `Path` environment variable.
6. Open a Command Prompt window in administrator mode and execute the `C:\Spark\spark-3.5.1-bin-hadoop3\bin\spark-shell` command.
7. If the Spark logo appears, the installation was successful.
8. Open a web browser and navigate to http://localhost:4040/. The Apache Spark shell Web UI will show up.
9. To test Spark in the shell, execute:
   - `val data = List("Test")`
   - `var t = sc.parallelize(data)` will return `t: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24`.
   - `t.collect()` will return `res0: Array[String] = Array(Test)`.
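Once a Python interpreter is available (the environment setup is covered further below), the variables configured in steps 4 and 5 can be double-checked with a short script. This is only an illustrative sanity check of my own, not part of the original instructions:

```python
import os

# Print the Spark-related variables configured in the steps above;
# a "<not set>" value means the system variable was not picked up.
for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME", "SPARK_LOCAL_IP"):
    print(f"{name} = {os.environ.get(name, '<not set>')}")
```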
## Install Kafka

1. Go to the Apache Kafka downloads page and click on Binary downloads.
2. Extract the downloaded archive into a newly created empty folder `C:\Kafka`.
3. In the `config` folder:
   - In `zookeeper.properties`, change `dataDir` to `dataDir=C:/Kafka/kafka_2.13-3.7.0/zookeeper-data`.
   - In `server.properties`, change `log.dirs` to `log.dirs=C:/Kafka/kafka_2.13-3.7.0/kafka-logs`.
## Use Kafka

Open three separate Command Prompt windows, all under the `C:\Kafka\kafka_2.13-3.7.0` folder.

1. Start ZooKeeper: `.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties`.
2. Once ZooKeeper is fully up, start the Kafka server: `.\bin\windows\kafka-server-start.bat .\config\server.properties`.
3. Create a topic by executing `.\bin\windows\kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic bitcoin-prices`. It will return a `Created topic bitcoin-prices` message (a quick check from Python is sketched below).
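As an optional sanity check (my addition, not one of the original steps), and assuming the Kafka-Python client mentioned in the introduction is installed, you can confirm the broker is reachable and the topic exists:

```python
from kafka import KafkaConsumer

# Connect to the local broker started above and list the topics it knows about.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print("bitcoin-prices" in consumer.topics())  # expected: True
consumer.close()
```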
## Set Up the Python Environment

1. Use Anaconda to create a new environment for this project: `conda create -n kafka-project python=3.9`.
2. After installing all the dependencies in `requirement.txt`, add an environment variable `PYSPARK_PYTHON` with your desired Python path to your system environment (an alternative in-script approach is sketched after this list).
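As an illustration (an assumption on my part, not the project's required setup), `PYSPARK_PYTHON` can also be pointed at the active interpreter from inside a script, and a tiny local job confirms that PySpark starts correctly:

```python
import os
import sys

# Equivalent in effect to the PYSPARK_PYTHON system variable set above:
# point Spark's Python workers at the interpreter of this conda environment.
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)

from pyspark.sql import SparkSession

# Minimal local smoke test, mirroring the earlier spark-shell check.
spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
print(spark.sparkContext.parallelize(["Test"]).collect())  # expected: ['Test']
spark.stop()
```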
## Run the Project

1. Run Spark (see 6. in the Install Spark section).
2. Run the Kafka server (see 2. in the Use Kafka section).
3. Execute the `producer.py` file; it will scrape and stream the Bitcoin price in real time.
4. Execute the `consumer.py` file; it will receive the data and do the data processing.

The console output will look like the following (illustrative sketches of both scripts are given after the sample output):
```
+--------------------+-----------------+
|              window|       mean_price|
+--------------------+-----------------+
|{2024-03-22 01:47...|65759.99907678064|
+--------------------+-----------------+
```
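For orientation, here is a minimal sketch of what a producer along these lines can look like. It is not the project's actual `producer.py`: the price source (CoinGecko's public endpoint), the field names `price`/`timestamp`, and the 5-second polling interval are assumptions.

```python
import json
import time
from datetime import datetime, timezone

import requests
from kafka import KafkaProducer

# Serialize each record as UTF-8 JSON so the Spark consumer can parse it.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical price source; the real producer.py may scrape a different API.
URL = "https://api.coingecko.com/api/v3/simple/price?ids=bitcoin&vs_currencies=usd"

while True:
    price = requests.get(URL, timeout=10).json()["bitcoin"]["usd"]
    record = {
        "price": float(price),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
    }
    producer.send("bitcoin-prices", record)  # topic created in the Use Kafka section
    producer.flush()
    time.sleep(5)  # assumed polling interval
```

And here is a sketch of a Structured Streaming consumer that produces windowed means like the console output above. Again, this is an illustration under assumptions (the record schema, one-minute windows, and the `spark-sql-kafka` package coordinates), not the repository's `consumer.py`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("bitcoin-consumer")
    # The Kafka connector must match your Spark/Scala build.
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1")
    .getOrCreate()
)

# Assumed JSON layout of each Kafka message produced above.
schema = StructType([
    StructField("price", DoubleType()),
    StructField("timestamp", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "bitcoin-prices")
    .load()
)

prices = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", schema).alias("d"))
    .select(F.to_timestamp("d.timestamp").alias("event_time"), "d.price")
)

# Mean price per one-minute event-time window, printed to the console.
windowed = (
    prices.withWatermark("event_time", "2 minutes")
    .groupBy(F.window("event_time", "1 minute").alias("window"))
    .agg(F.avg("price").alias("mean_price"))
)

query = windowed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```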