fix and enhance single node example #2

Open
wants to merge 4 commits into
base: main
4 changes: 2 additions & 2 deletions README.md
@@ -25,8 +25,8 @@ The examples in this repository are based on the [original TensorFlow Examples](

| Directory | TensorFlow script description |
| :--- | ---: |
-| [MirroredStrategy](examples/single-node/README.md) | Synchronous distributed training on multiple GPUs on one machine. |
-| [MultiWorkerMirroredStrategy](examples/multi-node/README.md) | Synchronous distributed training across multiple workers, each with potentially multiple GPUs. |
+| [MirroredStrategy](examples/single_node/README.md) | Synchronous distributed training on multiple GPUs on one machine. |
+| [MultiWorkerMirroredStrategy](examples/multi-node/README.md) | Synchronous distributed training across multiple workers, each with potentially multiple GPUs. |

#### Parameter Server
Not yet tested, please reach out to the Outerbounds team if you need help.
4 changes: 0 additions & 4 deletions examples/single-node/README.md

This file was deleted.

164 changes: 0 additions & 164 deletions examples/single-node/mnist_mirrored_strategy.py

This file was deleted.

14 changes: 14 additions & 0 deletions examples/single_node/README.md
@@ -0,0 +1,14 @@
# Introduction

The following four files show how to use TensorFlow's `MirroredStrategy` with Metaflow's `@kubernetes` decorator for distributed training across multiple GPUs on a single machine. Note that this example does not use the `@tensorflow` decorator.

1. `gpu_profile.py` contains the `@gpu_profile` decorator, which is available [here](https://github.com/outerbounds/metaflow-gpu-profile). It is used in `flow.py`.

2. `train_mnist.py` contains the main snippet for how to use the `MirroredStrategy` while training a model on the MNIST dataset.

3. `flow.py` contains a flow that runs the training code from `train_mnist.py` and uses the Docker image `tensorflow/tensorflow:2.15.0-gpu` for GPU setup.

- Run it with `python flow.py --environment=pypi run`.
- If you are on the [Outerbounds](https://outerbounds.com/) platform, you can use `fast-bakery` for faster Docker image builds: `python flow.py --environment=fast-bakery run`.

4. `reload.ipynb` showcases how to use the trained model for inference later on. Please make sure to have `tensorflow==2.15.1` installed locally to be able to run this notebook correctly.
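
For readers who have not used `MirroredStrategy` before, the following is a minimal sketch of the pattern that item 2 refers to. It is not the contents of `train_mnist.py`: the function names `global_batch_size` and `train`, the small model, and the use of `tf.keras.datasets` (rather than `tensorflow-datasets`) are assumptions made for illustration. The key points are that the model is built and compiled inside `strategy.scope()` and that the global batch is split evenly across the GPUs.

```python
def global_batch_size(per_replica_batch_size: int, num_replicas: int) -> int:
    """Return the batch size to pass to the dataset.

    MirroredStrategy splits each global batch evenly across replicas, so the
    global batch is the per-replica batch times the number of replicas.
    """
    if per_replica_batch_size <= 0 or num_replicas <= 0:
        raise ValueError("batch size and replica count must be positive")
    return per_replica_batch_size * num_replicas


def train(per_replica_batch_size: int = 64, epochs: int = 1):
    """Hypothetical training entry point; not the PR's actual `main`."""
    # Imported lazily so this module can be imported without TensorFlow installed.
    import tensorflow as tf

    # With no arguments, the strategy uses every GPU visible to TensorFlow.
    strategy = tf.distribute.MirroredStrategy()
    batch_size = global_batch_size(
        per_replica_batch_size, strategy.num_replicas_in_sync
    )

    # Assumption: Keras' bundled MNIST loader stands in for tensorflow-datasets.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None].astype("float32") / 255.0
    dataset = (
        tf.data.Dataset.from_tensor_slices((x_train, y_train))
        .shuffle(10_000)
        .batch(batch_size)
    )

    # Variables must be created inside the scope so they are mirrored per GPU.
    with strategy.scope():
        model = tf.keras.Sequential(
            [
                tf.keras.layers.Input(shape=(28, 28, 1)),
                tf.keras.layers.Conv2D(32, 3, activation="relu"),
                tf.keras.layers.Flatten(),
                tf.keras.layers.Dense(10),
            ]
        )
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )

    model.fit(dataset, epochs=epochs)
    return model
```

With `@kubernetes(gpu=2, ...)` as in `flow.py`, `strategy.num_replicas_in_sync` would be expected to be 2, so a per-replica batch of 64 gives a global batch of 128.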
23 changes: 15 additions & 8 deletions examples/single-node/flow.py → examples/single_node/flow.py
@@ -1,6 +1,5 @@
-from metaflow import FlowSpec, step, batch, conda, environment
-
-N_GPU = 2
+from metaflow import FlowSpec, step, kubernetes, environment, pypi
+from gpu_profile import gpu_profile


 class SingleNodeTensorFlow(FlowSpec):
@@ -9,19 +8,27 @@ class SingleNodeTensorFlow(FlowSpec):

     @step
     def start(self):
-        self.next(self.foo)
+        self.next(self.train)

+    @gpu_profile(interval=1)
     @environment(vars={"TF_CPP_MIN_LOG_LEVEL": "2"})
-    @batch(gpu=N_GPU, image="tensorflow/tensorflow:latest-gpu")
+    @kubernetes(gpu=2, image="registry.hub.docker.com/tensorflow/tensorflow:2.15.0-gpu")
+    @pypi(
+        packages={
+            "tensorflow-datasets": "4.9.7",
+            "matplotlib": "3.10.0",
+        }
+    )
     @step
-    def foo(self):
-        from mnist_mirrored_strategy import main
+    def train(self):
+        from train_mnist import main

         main(
-            run=self,
             local_model_dir=self.local_model_dir,
             local_tar_name=self.local_tar_name,
+            run=self,
         )
+
         self.next(self.end)

     @step
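
The diff does not show what `main` does with `local_model_dir`, `local_tar_name`, and `run`. One plausible reading, offered here only as an assumption, is that the saved model directory is packed into a tarball and attached to the run so that it can be restored later (for example in `reload.ipynb`). The sketch below illustrates that idea with the standard library only; `pack_model_dir`, `unpack_model`, and the artifact name `model_tar` are hypothetical names, not taken from the PR.

```python
import io
import os
import tarfile


def pack_model_dir(local_model_dir: str, local_tar_name: str) -> bytes:
    """Pack a saved-model directory into a gzip tarball and return its bytes."""
    # normpath strips a trailing slash so basename is never empty.
    top_level = os.path.basename(os.path.normpath(local_model_dir))
    with tarfile.open(local_tar_name, mode="w:gz") as tar:
        # arcname keeps archive paths relative to the model directory.
        tar.add(local_model_dir, arcname=top_level)
    with open(local_tar_name, "rb") as f:
        return f.read()


def unpack_model(blob: bytes, target_dir: str) -> str:
    """Extract tarball bytes into target_dir and return the model directory path."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        # The directory entry is added first, so it names the top-level folder.
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(target_dir)
    return os.path.join(target_dir, top_level)


def attach_model(run, local_model_dir: str, local_tar_name: str) -> None:
    """Store the tarball bytes on the run object.

    Assumption: `run` is the flow instance passed as `run=self`, and attributes
    set on it are persisted as artifacts. `model_tar` is a hypothetical name.
    """
    run.model_tar = pack_model_dir(local_model_dir, local_tar_name)
```

Storing the tarball as bytes keeps the artifact independent of the container's filesystem, which no longer exists once the `train` step finishes.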