This project uses Kafka-Python, a client for the Apache Kafka distributed stream processing system, to stream Bitcoin prices, and Apache Spark for real-time price analytics.
## Install Spark

1. Go to the official Apache Spark website, download Spark, and extract it into a new empty folder `C:\Spark`.
2. Create a folder `C:\Hadoop` and copy the `bin` folder for your specific Hadoop version from this repository into it.
3. Download Java if it is not yet installed.
4. Set the following system variables: `SPARK_HOME` as `C:\Spark\spark-3.5.1-bin-hadoop3` (adjust to the Spark version you downloaded), `HADOOP_HOME` as `C:\Hadoop`, `JAVA_HOME` as `C:\Program Files\Java\jre-1.8`, and `SPARK_LOCAL_IP` as `127.0.0.1`. (Once Python is installed, a quick check of these variables is sketched after this list.)
5. Add `%SPARK_HOME%\bin` and `%HADOOP_HOME%\bin` to the system `Path` environment variable.
6. Open a Command Prompt window in administrator mode and execute the `C:\Spark\spark-3.5.1-bin-hadoop3\bin\spark-shell` command.
7. If the Spark logo appears, the installation was successful.
8. Open a web browser and navigate to http://localhost:4040/. The Apache Spark shell Web UI will show up.
9. To test Spark in the shell, execute:
   - `val data = List("Test")`
   - `var t = sc.parallelize(data)` will return `t: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24`.
   - `t.collect()` will return `res0: Array[String] = Array(Test)`.
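Once a Python interpreter is available (the environment setup is covered further below), the variables configured in steps 4 and 5 can be double-checked with a short script. This is only an illustrative sanity check of my own, not part of the original instructions:

```python
import os

# Print the Spark-related variables configured in the steps above;
# a "<not set>" value means the system variable was not picked up.
for name in ("SPARK_HOME", "HADOOP_HOME", "JAVA_HOME", "SPARK_LOCAL_IP"):
    print(f"{name} = {os.environ.get(name, '<not set>')}")
```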
## Install Kafka

1. Go to the Apache Kafka downloads page and click on Binary downloads.
2. Extract the downloaded archive into a newly created empty folder `C:\Kafka`.
3. In the `config` folder:
   - In `zookeeper.properties`, change `dataDir` to `dataDir=C:/Kafka/kafka_2.13-3.7.0/zookeeper-data`.
   - In `server.properties`, change `log.dirs` to `log.dirs=C:/Kafka/kafka_2.13-3.7.0/kafka-logs`.
## Use Kafka

Open three separate Command Prompt windows, all under the `C:\Kafka\kafka_2.13-3.7.0` folder.

1. Start ZooKeeper: `.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties`.
2. Once ZooKeeper is fully up, start the Kafka server: `.\bin\windows\kafka-server-start.bat .\config\server.properties`.
3. Create a topic by executing `.\bin\windows\kafka-topics.bat --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic bitcoin-prices`. It will return a `Created topic bitcoin-prices` message (a quick check from Python is sketched below).
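As an optional sanity check (my addition, not one of the original steps), and assuming the Kafka-Python client mentioned in the introduction is installed, you can confirm the broker is reachable and the topic exists:

```python
from kafka import KafkaConsumer

# Connect to the local broker started above and list the topics it knows about.
consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print("bitcoin-prices" in consumer.topics())  # expected: True
consumer.close()
```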
## Set Up the Python Environment

1. Use Anaconda to create a new environment for this project: `conda create -n kafka-project python=3.9`.
2. After installing all the dependencies in `requirement.txt`, add an environment variable `PYSPARK_PYTHON` with your desired Python path to your system environment (an alternative in-script approach is sketched after this list).
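As an illustration (an assumption on my part, not the project's required setup), `PYSPARK_PYTHON` can also be pointed at the active interpreter from inside a script, and a tiny local job confirms that PySpark starts correctly:

```python
import os
import sys

# Equivalent in effect to the PYSPARK_PYTHON system variable set above:
# point Spark's Python workers at the interpreter of this conda environment.
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)

from pyspark.sql import SparkSession

# Minimal local smoke test, mirroring the earlier spark-shell check.
spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
print(spark.sparkContext.parallelize(["Test"]).collect())  # expected: ['Test']
spark.stop()
```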
## Run the Project

1. Run Spark (see 6. in the Install Spark section).
2. Run the Kafka server (see 2. in the Use Kafka section).
3. Execute the `producer.py` file; it will scrape and stream the Bitcoin price in real time.
4. Execute the `consumer.py` file; it will receive the data and do the data processing.

The console output will look like the following (illustrative sketches of both scripts are given after the sample output):
```
+--------------------+-----------------+
|              window|       mean_price|
+--------------------+-----------------+
|{2024-03-22 01:47...|65759.99907678064|
+--------------------+-----------------+
```
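For orientation, here is a minimal sketch of what a producer along these lines can look like. It is not the project's actual `producer.py`: the price source (CoinGecko's public endpoint), the field names `price`/`timestamp`, and the 5-second polling interval are assumptions.

```python
import json
import time
from datetime import datetime, timezone

import requests
from kafka import KafkaProducer

# Serialize each record as UTF-8 JSON so the Spark consumer can parse it.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical price source; the real producer.py may scrape a different API.
URL = "https://api.coingecko.com/api/v3/simple/price?ids=bitcoin&vs_currencies=usd"

while True:
    price = requests.get(URL, timeout=10).json()["bitcoin"]["usd"]
    record = {
        "price": float(price),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
    }
    producer.send("bitcoin-prices", record)  # topic created in the Use Kafka section
    producer.flush()
    time.sleep(5)  # assumed polling interval
```

And here is a sketch of a Structured Streaming consumer that produces windowed means like the console output above. Again, this is an illustration under assumptions (the record schema, one-minute windows, and the `spark-sql-kafka` package coordinates), not the repository's `consumer.py`.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("bitcoin-consumer")
    # The Kafka connector must match your Spark/Scala build.
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1")
    .getOrCreate()
)

# Assumed JSON layout of each Kafka message produced above.
schema = StructType([
    StructField("price", DoubleType()),
    StructField("timestamp", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "bitcoin-prices")
    .load()
)

prices = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(F.from_json("json", schema).alias("d"))
    .select(F.to_timestamp("d.timestamp").alias("event_time"), "d.price")
)

# Mean price per one-minute event-time window, printed to the console.
windowed = (
    prices.withWatermark("event_time", "2 minutes")
    .groupBy(F.window("event_time", "1 minute").alias("window"))
    .agg(F.avg("price").alias("mean_price"))
)

query = windowed.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```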