Python Machine Learning Package on a Collection of Docker Containers

By: Brian Ray [email protected]

Goal

This project's goal is to use Docker containers to set up a network of services and workbenches commonly used by Data Scientists working on Machine Learning problems. It is currently marked as experimental, and contributions are welcome. The Docker Compose file outlines the containers; they are configured to work with each other over the docpyml network you create on your Docker VM.

List of Containers:

  • docpyml-namenode: Hadoop NameNode. Keeps the directory tree of all files in the file system.
  • docpyml-datanode1: Hadoop File System data storage
  • docpyml-datanode2: Hadoop File System data storage
  • docpyml-spark-master: Apache Spark master
  • spark-worker (may be launched many times): Spark workers. Also contain the Python version matching docpyml-conda
  • docpyml-sparknotebook: Preconfigured Spark Notebook
  • docpyml-hdfsfb: HDFS FileBrowser from Cloudera Hue
  • docpyml-conda: Anaconda Python 3.5 with Jupyter Notebook, machine learning packages, and pySpark preconfigured
  • docpyml-rocker: RStudio

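Each service joins the shared docpyml network declared as external in the Compose file. The following is an illustrative sketch only, not the repository's actual docker-compose.yml; the service names follow the container list above, but consult the real file for images, ports, and volumes:

    # Hypothetical excerpt for illustration; see the repository's
    # docker-compose.yml for the actual service definitions.
    services:
      docpyml-spark-master:
        networks:
          - docpyml
      spark-worker:
        networks:
          - docpyml
    networks:
      docpyml:
        external: true

Declaring the network as external is what makes the separate `docker network create docpyml` step (below) necessary before `docker-compose up`.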
Install

Prerequisite: Docker Toolbox.

Optionally, adjust your VM settings:

    docker-machine stop
    VBoxManage modifyvm default --cpus 4
    VBoxManage modifyvm default --memory 8192
    docker-machine start
    

To start the environment:

    docker network create docpyml
    docker-compose up -d

If Docker reports that it is not running, first try:

    eval "$(docker-machine env default)"

To scale up the Spark workers (here to three):

    docker-compose scale spark-worker=3

Credits