|
2 | 2 |
|
3 | 3 | ## What is this repo?
|
4 | 4 |
|
5 |
| -This repository provides APIs that will allow to easily go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud. |
| 5 | +The TensorFlow Cloud repository provides APIs that let you move easily from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud. |
6 | 6 |
|
7 | 7 | ## Installation
|
8 | 8 |
|
9 |
| -### Requirements: |
| 9 | +### Requirements |
10 | 10 |
|
11 | 11 | - Python >= 3.5
|
12 | 12 | - [Set up your Google Cloud project](https://cloud.google.com/ai-platform/docs/getting-started-keras#set_up_your_project)
|
13 | 13 | - [Authenticate your GCP account](https://cloud.google.com/ai-platform/docs/getting-started-keras#authenticate_your_gcp_account)
|
14 |
| -- [nbconvert](https://nbconvert.readthedocs.io/en/latest/) - if you are using an iPython notebook |
| 14 | +- We use [Google AI Platform](https://cloud.google.com/ai-platform/) for deploying Docker images on GCP. Please make sure the AI Platform APIs are enabled on your GCP project. |
| 15 | +- Make sure `docker` is installed and running if you want to build Docker images with the local Docker process. Otherwise, [create a Cloud Storage bucket](https://cloud.google.com/ai-platform/docs/getting-started-keras#create_a_bucket) so that [Google Cloud Build](https://cloud.google.com/cloud-build) can build and publish the Docker image. |
| 16 | +- Install [nbconvert](https://nbconvert.readthedocs.io/en/latest/) if you are using a notebook file as `entry_point`, as shown in [usage guide #4](#detailed-usage-guide). |
15 | 17 |
|
| 18 | +### Install latest release |
16 | 19 |
|
17 |
| -### Install latest release: |
18 |
| - |
19 |
| -``` |
| 20 | +```console |
20 | 21 | pip install -U tensorflow-cloud
|
21 | 22 | ```
|
22 | 23 |
|
23 |
| -### Install from source: |
| 24 | +### Install from source |
24 | 25 |
|
25 |
| -``` |
| 26 | +```console |
26 | 27 | git clone https://github.com/tensorflow/cloud.git
|
27 | 28 | cd cloud
|
28 | 29 | pip install .
|
29 | 30 | ```
|
30 | 31 |
|
31 |
| -## Usage examples |
| 32 | +## High level overview |
| 33 | + |
| 34 | +The TensorFlow Cloud package provides the `run` API for training your models on GCP. Before we get into the details of the API, let's see what a simple workflow looks like using this API. |
| 35 | + |
| 36 | +1. Let's say you have Keras model training code, such as the following, saved as `mnist_example.py`. |
| 37 | + |
| 38 | +```python |
| 39 | +import tensorflow as tf |
| 40 | + |
| 41 | +(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data() |
| 42 | + |
| 43 | +mnist_train = tf.data.Dataset.from_tensor_slices((x_train, y_train)) |
| 44 | +mnist_test = tf.data.Dataset.from_tensor_slices((x_test, y_test)) |
| 45 | + |
| 46 | +BUFFER_SIZE = 10000 |
| 47 | +BATCH_SIZE = 64 |
| 48 | + |
| 49 | +def scale(image, label): |
| 50 | +    image = tf.cast(image, tf.float32) |
| 51 | +    image /= 255 |
| 52 | +    return image, label |
| 53 | + |
| 54 | +train_dataset = mnist_train.map(scale).cache().shuffle( |
| 55 | +    BUFFER_SIZE).batch(BATCH_SIZE) |
| 56 | +eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE) |
| 57 | + |
| 58 | +model = tf.keras.Sequential([ |
| 59 | +    tf.keras.layers.Flatten(input_shape=(28, 28)), |
| 60 | +    tf.keras.layers.Dense(512, activation='relu'), |
| 61 | +    tf.keras.layers.Dropout(0.2), |
| 62 | +    tf.keras.layers.Dense(10, activation='softmax') |
| 63 | +]) |
| 64 | + |
| 65 | +model.compile(loss='sparse_categorical_crossentropy', |
| 66 | +              optimizer=tf.keras.optimizers.Adam(), |
| 67 | +              metrics=['accuracy']) |
| 68 | + |
| 69 | +model.fit(train_dataset, epochs=12) |
| 70 | +``` |
| 71 | + |
| 72 | +2. After you have tested this model in your local environment for a few epochs, probably with a small dataset, you can train the model on Google Cloud by writing the following simple script, `scale_mnist.py`. |
| 73 | + |
| 74 | +```python |
| 75 | +import tensorflow_cloud as tfc |
| 76 | +tfc.run(entry_point='mnist_example.py') |
| 77 | +``` |
| 78 | + |
| 79 | +Running this script will automatically apply the TensorFlow [mirrored distribution strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy) and train your model at scale on Google Cloud Platform. Please see the [detailed usage guide](#detailed-usage-guide) section for instructions on how to use the API. |
| 80 | + |
| 81 | +3. You will see output similar to the following on your console. The information in the output can be used to track the training job status. |
| 82 | + |
| 83 | +```console |
| 84 | +usr@desktop$ python scale_mnist.py |
| 85 | +Job submitted successfully. |
| 86 | +Your job ID is: tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e |
| 87 | +Please access your job logs at the following URL: |
| 88 | +https://console.cloud.google.com/mlengine/jobs/tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e?project=prod-123 |
| 89 | +``` |
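The job ID and the logs URL in the sample output above share a simple structure, which is handy when tracking several jobs from a script. The following is a minimal sketch that rebuilds and parses a URL of that shape. The helper names are hypothetical and not part of `tensorflow_cloud`, and the URL layout is assumed only from the sample output shown above.

```python
from typing import Tuple
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: URL layout copied from the sample console output above.
CONSOLE_JOBS_URL = "https://console.cloud.google.com/mlengine/jobs"


def job_logs_url(job_id: str, project_id: str) -> str:
    """Rebuild a logs URL of the shape printed after job submission.

    Hypothetical helper for illustration; not part of tensorflow_cloud.
    """
    return f"{CONSOLE_JOBS_URL}/{job_id}?{urlencode({'project': project_id})}"


def parse_job_logs_url(url: str) -> Tuple[str, str]:
    """Recover (job_id, project_id) from a logs URL of the same shape."""
    parsed = urlparse(url)
    # The job ID is the last path segment; the project is a query parameter.
    job_id = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    project_id = parse_qs(parsed.query)["project"][0]
    return job_id, project_id


url = job_logs_url(
    "tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e", "prod-123")
print(url)
print(parse_job_logs_url(url))
```

Parsing the URL rather than the surrounding log text keeps the sketch independent of the exact wording of the console messages.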
| 90 | + |
| 91 | +## Detailed usage guide |
| 92 | + |
| 93 | +As described in the [high level overview](#high-level-overview) section, the `run` API allows you to train your models at scale on GCP. The [`run`](https://github.com/tensorflow/cloud/blob/master/tensorflow_cloud/run.py#L31) API can be used in four different ways, determined by where you call the API (terminal vs. IPython notebook) and by the value of the `entry_point` parameter. `entry_point` is the optional path to the Python script or notebook file that contains the TensorFlow Keras training code. This is the most important parameter in the API. |
| 94 | + |
| 95 | + |
| 96 | +```python |
| 97 | +run(entry_point=None, |
| 98 | +    requirements_txt=None, |
| 99 | +    distribution_strategy='auto', |
| 100 | +    docker_base_image=None, |
| 101 | +    chief_config='auto', |
| 102 | +    worker_config='auto', |
| 103 | +    worker_count=0, |
| 104 | +    entry_point_args=None, |
| 105 | +    stream_logs=False, |
| 106 | +    docker_image_bucket_name=None, |
| 107 | +    **kwargs) |
| 108 | +``` |
| 109 | + |
| 110 | +**1. Using a python file as `entry_point`.** |
| 111 | + |
| 112 | +If you have your `tf.keras` model in a python file (`mnist_example.py`), then you can write the following simple script (`scale_mnist.py`) to scale your model on GCP. |
| 113 | + |
| 114 | +```python |
| 115 | +import tensorflow_cloud as tfc |
| 116 | +tfc.run(entry_point='mnist_example.py') |
| 117 | +``` |
| 118 | + |
| 119 | +**2. Using a notebook file as `entry_point`.** |
| 120 | + |
| 121 | +If you have your `tf.keras` model in a notebook file (`mnist_example.ipynb`), then you can write the following simple script (`scale_mnist.py`) to scale your model on GCP. |
| 122 | + |
| 123 | +```python |
| 124 | +import tensorflow_cloud as tfc |
| 125 | +tfc.run(entry_point='mnist_example.ipynb') |
| 126 | +``` |
| 127 | + |
| 128 | +**3. Using `run` within a python script that contains the `tf.keras` model.** |
| 129 | + |
| 130 | +You can use the `run` API from within your python file that contains the `tf.keras` model (`mnist_scale.py`). |
| 131 | + |
| 132 | +```python |
| 133 | +import tensorflow_datasets as tfds |
| 134 | +import tensorflow as tf |
| 135 | +import tensorflow_cloud as tfc |
| 136 | + |
| 137 | +tfc.run( |
| 138 | +    entry_point=None, |
| 139 | +    distribution_strategy='auto', |
| 140 | +    requirements_txt='tests/testdata/requirements.txt', |
| 141 | +    chief_config=tfc.MachineConfig( |
| 142 | +        cpu_cores=8, |
| 143 | +        memory=30, |
| 144 | +        accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_P100, |
| 145 | +        accelerator_count=2), |
| 146 | +    worker_count=0) |
| 147 | + |
| 148 | +datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True) |
| 149 | +mnist_train, mnist_test = datasets['train'], datasets['test'] |
| 150 | + |
| 151 | +num_train_examples = info.splits['train'].num_examples |
| 152 | +num_test_examples = info.splits['test'].num_examples |
| 153 | + |
| 154 | +BUFFER_SIZE = 10000 |
| 155 | +BATCH_SIZE = 64 |
| 156 | + |
| 157 | +def scale(image, label): |
| 158 | +    image = tf.cast(image, tf.float32) |
| 159 | +    image /= 255 |
| 160 | + |
| 161 | +    return image, label |
| 162 | + |
| 163 | +train_dataset = mnist_train.map(scale).cache() |
| 164 | +train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE) |
| 165 | + |
| 166 | +model = tf.keras.Sequential([ |
| 167 | +    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=( |
| 168 | +        28, 28, 1)), |
| 169 | +    tf.keras.layers.MaxPooling2D(), |
| 170 | +    tf.keras.layers.Flatten(), |
| 171 | +    tf.keras.layers.Dense(64, activation='relu'), |
| 172 | +    tf.keras.layers.Dense(10, activation='softmax') |
| 173 | +]) |
| 174 | + |
| 175 | +model.compile(loss='sparse_categorical_crossentropy', |
| 176 | +              optimizer=tf.keras.optimizers.Adam(), |
| 177 | +              metrics=['accuracy']) |
| 178 | +model.fit(train_dataset, epochs=12) |
| 179 | +``` |
| 180 | + |
| 181 | +In this use case, `entry_point` should be `None`. The `run` API can be called anywhere in the file, and the entire file will be executed remotely. Placing the call at the end lets the script run locally once first for debugging purposes (possibly with a different number of epochs and other flags). |
| 182 | + |
| 183 | +**4. Using `run` within a notebook script that contains the `tf.keras` model.** |
| 184 | + |
| 185 | + |
| 186 | + |
| 187 | +In this use case, `entry_point` should be `None` and `docker_image_bucket_name` must be provided. |
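The four cases above depend on only two inputs: the value of `entry_point` and whether the call is made from a notebook. The following is a minimal sketch of that mapping. The helper `describe_run_mode` is hypothetical and not part of `tensorflow_cloud`, and the file-extension check is an assumption of this sketch rather than documented library behavior.

```python
from typing import Optional


def describe_run_mode(entry_point: Optional[str], in_notebook: bool) -> str:
    """Map the two inputs to one of the four usage cases listed above.

    Hypothetical helper for illustration; not part of tensorflow_cloud.
    """
    if entry_point is not None:
        # Assumption: the file type is inferred from the extension.
        if entry_point.endswith(".py"):
            return "1: python file as entry_point"
        if entry_point.endswith(".ipynb"):
            return "2: notebook file as entry_point"
        raise ValueError(f"unsupported entry_point: {entry_point!r}")
    # entry_point is None: the file that calls run holds the training code.
    if in_notebook:
        return "4: run called within a notebook"
    return "3: run called within a python script"


print(describe_run_mode("mnist_example.py", in_notebook=False))
print(describe_run_mode("mnist_example.ipynb", in_notebook=False))
print(describe_run_mode(None, in_notebook=False))
print(describe_run_mode(None, in_notebook=True))
```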
| 188 | + |
| 189 | +### What happens when you call run? |
| 190 | + |
| 191 | +The API call will encompass the following: |
| 192 | +1. Making code entities, such as a Keras script or notebook, **cloud and distribution ready**. |
| 193 | +2. Converting the distribution-ready code into a **docker container** with all the required dependencies. |
| 194 | +3. **Deploying** this container at scale and training using TensorFlow distribution strategies. |
| 195 | +4. **Streaming logs** and monitoring them on hosted TensorBoard, and managing checkpoint storage. |
| 196 | + |
| 197 | +By default, the local Docker daemon is used to build and publish Docker images to Google Container Registry. Images are published to `gcr.io/your-gcp-project-id`. If you specify `docker_image_bucket_name`, [Google Cloud Build](https://cloud.google.com/cloud-build) is used to build and publish Docker images instead. |
| 198 | + |
| 199 | +**Note:** If you are using `run` within a notebook script that contains the `tf.keras` model, `docker_image_bucket_name` must be specified. |
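The build rules described above can be summarized as a small decision function. This is a sketch of the stated rules only, not the library's implementation. The names `choose_image_builder` and `image_registry` are hypothetical, and raising `ValueError` for the notebook case is an assumption about how the requirement might be enforced.

```python
from typing import Optional


def choose_image_builder(docker_image_bucket_name: Optional[str],
                         in_notebook: bool) -> str:
    """Pick the image build path according to the rules stated above.

    Hypothetical helper for illustration; not part of tensorflow_cloud.
    """
    if docker_image_bucket_name is not None:
        # A bucket name selects Google Cloud Build.
        return "cloud_build"
    if in_notebook:
        # Assumption: a missing bucket in a notebook is reported as an error.
        raise ValueError(
            "docker_image_bucket_name must be specified when run is "
            "called within a notebook")
    # Default: build and publish with the local Docker daemon.
    return "local_docker"


def image_registry(project_id: str) -> str:
    """Return the registry path images are published to, per the text above."""
    return f"gcr.io/{project_id}"


print(choose_image_builder(None, in_notebook=False))
print(choose_image_builder("my-bucket", in_notebook=True))
print(image_registry("your-gcp-project-id"))
```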
| 200 | + |
| 201 | +We use [Google AI Platform](https://cloud.google.com/ai-platform/) for deploying Docker images on GCP. |
| 202 | + |
| 203 | +Please see the `run` API documentation for detailed information on the parameters and how you can control the above processes. |
| 204 | + |
| 205 | +## End to end examples |
| 206 | + |
| 207 | +- [Using a python file as `entry_point` (Keras fit API)](https://github.com/tensorflow/cloud/blob/master/tests/integration/call_run_on_script_with_keras_fit.py). |
| 208 | +- [Using a python file as `entry_point` (Keras custom training loop)](https://github.com/tensorflow/cloud/blob/master/tests/integration/call_run_on_script_with_keras_ctl.py). |
| 209 | +- [Using a python file as `entry_point` (Keras save and load)](https://github.com/tensorflow/cloud/blob/master/tests/integration/call_run_on_script_with_keras_save_and_load.py). |
| 210 | +- [Using a notebook file as `entry_point`](https://github.com/tensorflow/cloud/blob/master/tests/integration/call_run_on_notebook_with_keras_fit.py). |
| 211 | +- [Using `run` within a python script that contains the `tf.keras` model](https://github.com/tensorflow/cloud/blob/master/tests/integration/call_run_within_script_with_keras_fit.py). |
| 212 | +- [Using cloud build instead of local docker](https://github.com/tensorflow/cloud/blob/master/tests/integration/call_run_on_script_with_keras_fit_cloud_build.py). |
| 213 | + |
| 214 | +## Coming up |
32 | 215 |
|
33 |
| -- [Usage with `tf.keras` script that trains using `model.fit`](tests/integration/call_run_on_script_with_keras_fit.py). |
34 |
| -- [Usage with `tf.keras` script that trains using a custom training loop](tests/integration/call_run_on_script_with_keras_ctl.py). |
| 216 | +- Keras tuner support. |
| 217 | +- TPU support. |
35 | 218 |
|
36 | 219 | ## Contributing
|
37 | 220 |
|
|