Name	Last commit message	Last commit date
__pycache__	load_new_params	Jun 8, 2021
image_classification	load_new_params	Jun 8, 2021
scripts	Initial commit	Aug 31, 2019
tests	Fix bug in runtime/tests/communication/README.md (msr-fiddle#26)	Dec 13, 2019
translation	Initial commit	Aug 31, 2019
README.md	Fix example command lines in README to use `--local_rank` and `--dist…	Sep 26, 2019
adam.py	Initial commit	Aug 31, 2019
communication.py	load_new_params	Jun 8, 2021
driver.py	Initial commit	Aug 31, 2019
launch.py	Initial commit	Aug 31, 2019
optimizer.py	load_new_params	Jun 8, 2021
runtime.py	load_new_params	Jun 8, 2021
runtime_utilities.py	Initial commit	Aug 31, 2019
sgd.py	Initial commit	Aug 31, 2019
threadsafe_counter.py	Initial commit	Aug 31, 2019
threadsafe_queue.py	swap out and swap in one version of weights at the need time	Jun 8, 2021

PipeDream Runtime

This directory contains the implementation of the distributed runtime that integrates model parallelism, pipelining, and data parallelism into PyTorch.

runtime.py: Contains the main StageRuntime class.

communication.py: Simple communication library that sends PyTorch tensors between a single sender and receiver.

tests: Contains a simple test harness for the send_tensor and receive_tensor functions in communication.py.

models: Contains implementations of models that can be run with the runtime.

driver_configs: Contains driver configuration files to use with driver.py.

Auto-generated model with runtime

main_with_runtime.py is a driver program for ImageNet image classification models that uses our StageRuntime and integrates with PyTorch. The runtime allows a model's layers to be split over multiple machines, and supports pipelining.
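To give a feel for why pipelining matters once layers are split into stages, here is a toy timing model. It is only a sketch under loud assumptions: every stage takes exactly one time step per minibatch, only forward passes are counted, and the function name `total_forward_steps` is hypothetical. It does not reproduce the scheduling actually implemented in runtime.py.

```python
def total_forward_steps(num_stages, num_minibatches, pipelined):
    """Count time steps until the last stage finishes the last minibatch.

    Toy model (assumption): each stage needs one time step per minibatch,
    and backward passes are ignored.
    """
    if pipelined:
        # Minibatch m enters stage s at step m + s, so the stages overlap.
        def start(m, s):
            return m + s
    else:
        # Only one minibatch is in flight at a time: minibatch m waits until
        # minibatch m - 1 has left the last stage.
        def start(m, s):
            return m * num_stages + s

    last_start = start(num_minibatches - 1, num_stages - 1)
    return last_start + 1  # steps are numbered from 0


if __name__ == "__main__":
    for pipelined in (True, False):
        steps = total_forward_steps(num_stages=2, num_minibatches=4,
                                    pipelined=pipelined)
        print(f"pipelined={pipelined}: {steps} steps")
```

In this model the pipelined case needs `num_minibatches + num_stages - 1` steps, while the non-pipelined case needs `num_minibatches * num_stages`.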

Using `driver.py`

driver.py configures containers, launches main_with_runtime.py within the containers, and logs experimental settings and output. It reads its settings from a user-provided YAML file:

python driver.py --config_file driver_configs/resnet50_single_machine.yml

All of the options described below can also be configured and launched through driver.py.

Using `StageRuntime` on single machine

To use the StageRuntime implemented in runtime.py on a single machine, use command-line arguments like the following.

python main_with_runtime.py --module models.resnet50.gpus=2 -b 128 --data_dir ../../../data/imagenet

Using `StageRuntime` with Model Parallelism

To split the generated ResNet50 model over two machines (modules 1 and 2 on machine 1; modules 3, 4, and 5 (loss) on machine 2) using the StageRuntime implemented in ../../runtime.py, use command-line arguments like the following (--rank, --master_addr, and --config_path are important).

With input pipelining,

python main_with_runtime.py --module models.resnet50.gpus=2 -b 64 --data_dir ../../../data/imagenet --rank 0 --local_rank 0 --master_addr localhost --config_path models/resnet50/gpus=2/mp_conf.json --distributed_backend gloo
python main_with_runtime.py --module models.resnet50.gpus=2 -b 64 --data_dir ../../../data/imagenet --rank 1 --local_rank 1 --master_addr localhost --config_path models/resnet50/gpus=2/mp_conf.json --distributed_backend gloo

Without input pipelining,

python main_with_runtime.py --module models.resnet50.gpus=2 -b 64 --data_dir ../../../data/imagenet --rank 0 --local_rank 0 --master_addr localhost --config_path models/resnet50/gpus=2/mp_conf.json --no_input_pipelining --distributed_backend gloo
python main_with_runtime.py --module models.resnet50.gpus=2 -b 64 --data_dir ../../../data/imagenet --rank 1 --local_rank 1 --master_addr localhost --config_path models/resnet50/gpus=2/mp_conf.json --no_input_pipelining --distributed_backend gloo

With data parallelism (and no input pipelining),

python main_with_runtime.py --module models.resnet50.gpus=2 -b 128 --data_dir ../../../data/imagenet --rank 0 --local_rank 0 --master_addr localhost --config_path models/resnet50/gpus=2/dp_conf.json --no_input_pipelining --distributed_backend nccl
python main_with_runtime.py --module models.resnet50.gpus=2 -b 128 --data_dir ../../../data/imagenet --rank 1 --local_rank 1 --master_addr localhost --config_path models/resnet50/gpus=2/dp_conf.json --no_input_pipelining --distributed_backend nccl

Note that for DP-only setups, we use the nccl backend for optimal performance.

With hybrid parallelism (model and data parallelism, and pipelining),

python main_with_runtime.py --module models.resnet50.gpus=2 -b 64 --data_dir ../../../data/imagenet --rank 0 --local_rank 0 --master_addr localhost --config_path models/resnet50/gpus=2/hybrid_conf.json --distributed_backend gloo
python main_with_runtime.py --module models.resnet50.gpus=2 -b 64 --data_dir ../../../data/imagenet --rank 1 --local_rank 1 --master_addr localhost --config_path models/resnet50/gpus=2/hybrid_conf.json --distributed_backend gloo
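The command lines in this section differ only in the rank, the batch size, the config file, the backend, and whether --no_input_pipelining is passed. The sketch below generates them from those inputs. The helper `build_rank_commands` is hypothetical and not part of this repository; it uses only flags that appear in the examples above, and it assumes, as those examples do, that --local_rank equals --rank.

```python
import shlex


def build_rank_commands(config_path, batch_size, backend, num_ranks=2,
                        input_pipelining=True,
                        module="models.resnet50.gpus=2",
                        data_dir="../../../data/imagenet",
                        master_addr="localhost"):
    """Return one main_with_runtime.py command line per rank.

    Hypothetical helper: it only assembles strings and launches nothing.
    """
    commands = []
    for rank in range(num_ranks):
        args = [
            "python", "main_with_runtime.py",
            "--module", module,
            "-b", str(batch_size),
            "--data_dir", data_dir,
            "--rank", str(rank),
            # Assumption: local rank equals global rank, as in the examples.
            "--local_rank", str(rank),
            "--master_addr", master_addr,
            "--config_path", config_path,
        ]
        if not input_pipelining:
            args.append("--no_input_pipelining")
        args += ["--distributed_backend", backend]
        # Quote each argument so the line can be pasted into a shell.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    # Data-parallel example from above: nccl backend, no input pipelining.
    for line in build_rank_commands("models/resnet50/gpus=2/dp_conf.json",
                                    batch_size=128, backend="nccl",
                                    input_pipelining=False):
        print(line)
```

Keeping the flag order identical to the examples makes the generated lines easy to diff against the README.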
