Get Python modules, Java, Scala
- Update the package index
sudo apt-get update
- Install pip3
sudo apt install python3-pip
- Install jupyter notebook
pip3 install jupyter
- Install java
sudo apt-get install default-jre
- Check java is installed
java -version
- Install scala
sudo apt-get install scala
- Check scala is installed
scala -version
- Install py4j
sudo pip3 install py4j
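The checks above can also be run in one pass from Python. This is a minimal sketch, not part of the original steps: the helper name missing_tools and the list of tool names are assumptions based on the packages installed in this section.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Command names assumed from the install steps in this section.
required = ["python3", "pip3", "jupyter", "java", "scala"]

missing = missing_tools(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All tools found.")
```

If jupyter is reported missing after a per-user pip3 install, the user-level bin directory (often ~/.local/bin) may not be on PATH.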
Get Apache Spark
- Go to https://spark.apache.org/downloads.html
- Download a Spark release (the commands below use spark-2.2.0-bin-hadoop2.7; adjust the name to match your download)
- Move the tgz file to your home folder
- Extract the archive
sudo tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz
Define Paths
export SPARK_HOME='/home/ubuntu/spark-2.2.0-bin-hadoop2.7'
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
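A quick way to confirm the exports took effect is to inspect the environment from Python. The sketch below is an illustration only: check_spark_env is a hypothetical helper, and it checks just two things, that SPARK_HOME is an absolute path and that PYTHONPATH contains $SPARK_HOME/python.

```python
import os


def check_spark_env(env):
    """Return a list of problems found in a Spark-related environment mapping."""
    problems = []
    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
        return problems
    if not os.path.isabs(spark_home):
        # A relative path only resolves from one particular working directory.
        problems.append("SPARK_HOME is not an absolute path")
    python_dir = os.path.join(spark_home, "python")
    if python_dir not in env.get("PYTHONPATH", "").split(os.pathsep):
        problems.append("PYTHONPATH does not include SPARK_HOME/python")
    return problems


# Report on the current shell's environment.
for problem in check_spark_env(os.environ):
    print("warning:", problem)
```

Run it in the same terminal where the export lines were entered. The exports only last for that session unless they are also added to a startup file such as ~/.bashrc.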
Grant Permissions to essential Spark folders
sudo chmod 777 spark-2.2.0-bin-hadoop2.7
cd spark-2.2.0-bin-hadoop2.7
sudo chmod 777 python
cd python
sudo chmod 777 pyspark
- Install findspark
pip3 install findspark
In a script or the Python shell
# tell findspark where Spark is installed (replace the path with your own Spark directory)
import findspark
findspark.init('/home/jake/spark/spark-2.2.0-bin-hadoop2.7')
import pyspark
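Once the import succeeds, a short local job can confirm that Spark actually runs. The following is a sketch under assumptions: resolve_spark_home and smoke_test are hypothetical helper names, the app name is arbitrary, and the fallback to the SPARK_HOME variable is a convenience added here, not something the steps above require.

```python
import os


def resolve_spark_home(explicit=None, env=None):
    """Return the Spark directory to pass to findspark.init().

    An explicit path wins; otherwise fall back to the SPARK_HOME variable.
    """
    env = os.environ if env is None else env
    path = explicit or env.get("SPARK_HOME")
    if not path:
        raise ValueError("pass a Spark directory or set SPARK_HOME")
    return os.path.expanduser(path)


def smoke_test(spark_home=None):
    """Start a local Spark session, count a tiny dataset, and shut down."""
    # Imported here so resolve_spark_home works even without Spark installed.
    import findspark

    findspark.init(resolve_spark_home(spark_home))
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("smoke-test").getOrCreate()
    )
    try:
        # spark.range(5) builds a single-column DataFrame with five rows.
        return spark.range(5).count()
    finally:
        spark.stop()
```

Calling smoke_test() with no argument uses SPARK_HOME; passing a path overrides it. A successful run returns the row count of the five-row test DataFrame.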