Python Machine Learning Package on a Collection of Docker Containers

By: Brian Ray [email protected]

Goal

This project's goal is to use Docker containers to set up a network of services and workbenches commonly used by Data Scientists working on Machine Learning problems. It is currently marked as experimental, and contributions are welcome. The Docker Compose file outlines the containers; they are configured to work with each other over the docpyml network you create on your Docker VM.

List of Containers:

  • docpyml-namenode: Hadoop NameNode. Keeps the directory tree of all files in the file system.
  • docpyml-datanode1: Hadoop File System data storage
  • docpyml-datanode2: Hadoop File System data storage
  • docpyml-spark-master: Apache Spark master
  • spark-worker (may be launched many times): Spark workers. Also contain the Python version matching docpyml-conda
  • docpyml-sparknotebook: Preconfigured Spark Notebook
  • docpyml-hdfsfb: HDFS FileBrowser from Cloudera Hue
  • docpyml-conda: Anaconda Python 3.5 with Jupyter Notebook, machine learning packages, and pySpark preconfigured
  • docpyml-rocker: RStudio

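Each service joins the shared docpyml network declared as external in the Compose file. The following is an illustrative sketch only, not the repository's actual docker-compose.yml; the service names follow the container list above, but consult the real file for images, ports, and volumes:

    # Hypothetical excerpt for illustration; see the repository's
    # docker-compose.yml for the actual service definitions.
    services:
      docpyml-spark-master:
        networks:
          - docpyml
      spark-worker:
        networks:
          - docpyml
    networks:
      docpyml:
        external: true

Declaring the network as external is what makes the separate `docker network create docpyml` step (below) necessary before `docker-compose up`.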
Install

Prerequisite: Docker Toolbox.

Optionally, adjust your VM settings:

    docker-machine stop
    VBoxManage modifyvm default --cpus 4
    VBoxManage modifyvm default --memory 8192
    docker-machine start
    

To start the environment:

    docker network create docpyml
    docker-compose up -d

If Docker reports that it is not running, first try:

    eval "$(docker-machine env default)"

To scale up the Spark workers (here to three):

    docker-compose scale spark-worker=3

Credits