Add checklist for official models. Remove file access from flag validator (fix build) (tensorflow#4492)

* Add checklist for official models. Remove file access from flag validator (causing issues with BUILD)

* spelling

* address PR comments
k-w-w authored Jun 12, 2018
1 parent 29c9f98 commit bb62f24
Showing 7 changed files with 207 additions and 91 deletions.
55 changes: 43 additions & 12 deletions official/README.md
@@ -2,34 +2,65 @@

The TensorFlow official models are a collection of example models that use TensorFlow's high-level APIs. They are intended to be well-maintained, tested, and kept up to date with the latest TensorFlow API. They should also be reasonably optimized for fast performance while still being easy to read.

These models are used as end-to-end tests, ensuring that the models maintain the same speed and performance with each new TensorFlow build.

## TensorFlow releases
The models on the master branch are **in development**, and they target the [nightly binaries](https://github.com/tensorflow/tensorflow#installation) built from the [master branch of TensorFlow](https://github.com/tensorflow/tensorflow/tree/master). We aim to keep them backwards compatible with the latest release when possible (currently TensorFlow 1.5), but we cannot always guarantee compatibility.

**Stable versions** of the official models targeting releases of TensorFlow are available as tagged branches or [downloadable releases](https://github.com/tensorflow/models/releases). Model repository version numbers match the target TensorFlow release, such that [branch r1.4.0](https://github.com/tensorflow/models/tree/r1.4.0) and [release v1.4.0](https://github.com/tensorflow/models/releases/tag/v1.4.0) are compatible with [TensorFlow v1.4.0](https://github.com/tensorflow/tensorflow/releases/tag/v1.4.0).
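For example, to pair a local clone of this repository with TensorFlow v1.4.0:

```
git checkout r1.4.0
```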

If you are on a version of TensorFlow earlier than 1.4, please [update your installation](https://www.tensorflow.org/install/).

---
+## Requirements
+Please follow the steps below before running models in this repo:
+
+1. Add the top-level ***/models*** folder to the Python path with the command:
+   ```
+   export PYTHONPATH="$PYTHONPATH:/path/to/models"
+   ```
+2. Install dependencies:
+   ```
+   pip3 install --user -r official/requirements.txt
+   ```
+   or
+   ```
+   pip install --user -r official/requirements.txt
+   ```
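As a quick sanity check (assuming both steps above succeeded and TensorFlow itself is installed), importing one of this repo's modules should work:

```
python -c "import official.utils.flags.core"
```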

-Below is a list of the models available.
-
-[boosted_trees](boosted_trees): A Gradient Boosted Trees model to classify Higgs boson processes from the HIGGS Data Set.
-
-[mnist](mnist): A basic model to classify digits from the MNIST dataset.
-
-[resnet](resnet): A deep residual network that can be used to classify both CIFAR-10 and ImageNet's dataset of 1000 classes.
-
-[wide_deep](wide_deep): A model that combines a wide model and deep network to classify census income data.
-
-More models to come!

+To make Official Models easier to use, we are planning to create a pip installable Official Models package. This is being tracked in [#917](https://github.com/tensorflow/models/issues/917).
+
+## Available models
+
+**NOTE:** Please make sure to follow the steps in the [Requirements](#requirements) section.
+
+* [boosted_trees](boosted_trees): A Gradient Boosted Trees model to classify Higgs boson processes from the HIGGS Data Set.
+* [mnist](mnist): A basic model to classify digits from the MNIST dataset.
+* [resnet](resnet): A deep residual network that can be used to classify both CIFAR-10 and ImageNet's dataset of 1000 classes.
+* [transformer](transformer): A transformer model to translate the WMT English to German dataset.
+* [wide_deep](wide_deep): A model that combines a wide model and deep network to classify census income data.
+* More models to come!

If you would like to make any fixes or improvements to the models, please [submit a pull request](https://github.com/tensorflow/models/compare).

---
-## Running the models
-
-The *Official Models* are made available as a Python module. To run the models and associated scripts, add the top-level ***/models*** folder to the Python path with the command: `export PYTHONPATH="$PYTHONPATH:/path/to/models"`
-
-To install dependencies pass `-r official/requirements.txt` to pip. (i.e. `pip3 install --user -r official/requirements.txt`)
-
-To make Official Models easier to use, we are planning to create a pip installable Official Models package. This is being tracked in [#917](https://github.com/tensorflow/models/issues/917).

+## New Models
+
+The team is actively working to add new models to the repository. Every model should follow the guidelines below, to uphold our objectives of readable, usable, and maintainable code.
+
+**General guidelines**
+* Code should be well documented and tested.
+* Runnable from a blank environment with relative ease.
+* Trainable on: single GPU/CPU (baseline), multiple GPUs, TPU
+* Compatible with Python 2 and 3 (using [six](https://pythonhosted.org/six/) when necessary)
+* Conform to the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)
+
+**Implementation guidelines**
+
+These guidelines exist so the model implementations are consistent for better readability and maintainability.
+
+* Use [common utility functions](utils)
+* Export a SavedModel at the end of training (a minimal sketch follows this list).
+* Consistent flags and flag-parsing library ([read more here](utils/flags/guidelines.md))
+* Produce benchmarks and logs ([read more here](utils/logs/guidelines.md))
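For the SavedModel guideline above, a minimal sketch with a TF 1.x Estimator could look like the following; the feature name and shape are illustrative assumptions, not a repo convention:

```python
import tensorflow as tf


def export_model(estimator, export_dir):
  # Serving signature for a single float feature; the "image" name and
  # shape below are placeholder assumptions for illustration.
  serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
      {"image": tf.placeholder(tf.float32, shape=[None, 28, 28, 1])})
  # Writes a timestamped SavedModel directory under export_dir.
  estimator.export_savedmodel(export_dir, serving_input_fn)
```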
4 changes: 2 additions & 2 deletions official/transformer/README.md
@@ -37,7 +37,7 @@ Below are the commands for running the Transformer model. See the [Detailed inst
cd /path/to/models/official/transformer
# Ensure that PYTHONPATH is correctly defined as described in
-# https://github.com/tensorflow/models/tree/master/official#running-the-models
+# https://github.com/tensorflow/models/tree/master/official#requirements
# export PYTHONPATH="$PYTHONPATH:/path/to/models"
# Export variables
@@ -94,7 +94,7 @@ big | 28.9
0. ### Environment preparation

#### Add models repo to PYTHONPATH
-Follow the instructions described in the [Running the models](https://github.com/tensorflow/models/tree/master/official#running-the-models) section to add the models folder to the python path.
+Follow the instructions described in the [Requirements](https://github.com/tensorflow/models/tree/master/official#requirements) section to add the models folder to the python path.

#### Export variables (optional)

59 changes: 39 additions & 20 deletions official/transformer/transformer_main.py
@@ -189,19 +189,18 @@ def get_train_op_and_metrics(loss, params):
        loss, tvars, colocate_gradients_with_ops=True)
    minimize_op = optimizer.apply_gradients(
        gradients, global_step=global_step, name="train")

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    train_op = tf.group(minimize_op, update_ops)

-    metrics = {"learning_rate": learning_rate}
+    train_metrics = {"learning_rate": learning_rate}

    if not params["use_tpu"]:
      # gradient norm is not included as a summary when running on TPU, as
      # it can cause instability between the TPU and the host controller.
      gradient_norm = tf.global_norm(list(zip(*gradients))[0])
-      metrics["global_norm/gradient_norm"] = gradient_norm
+      train_metrics["global_norm/gradient_norm"] = gradient_norm

-    return train_op, metrics
+    return train_op, train_metrics


def translate_and_compute_bleu(estimator, subtokenizer, bleu_source, bleu_ref):
@@ -237,6 +236,13 @@ def evaluate_and_log_bleu(estimator, bleu_source, bleu_ref, vocab_file):
  tf.logging.info("Bleu score (cased):", cased_score)
  return uncased_score, cased_score


+def _validate_file(filepath):
+  """Make sure that file exists."""
+  if not tf.gfile.Exists(filepath):
+    raise tf.errors.NotFoundError(None, None, "File %s not found." % filepath)

def run_loop(
    estimator, schedule_manager, train_hooks=None, benchmark_logger=None,
    bleu_source=None, bleu_ref=None, bleu_threshold=None, vocab_file=None):
@@ -276,7 +282,14 @@ def run_loop(
  Raises:
    ValueError: if both or none of single_iteration_train_steps and
      single_iteration_train_epochs were defined.
+    NotFoundError: if the vocab file or bleu files don't exist.
  """
+  if bleu_source:
+    _validate_file(bleu_source)
+  if bleu_ref:
+    _validate_file(bleu_ref)
+  if vocab_file:
+    _validate_file(vocab_file)

  evaluate_bleu = bleu_source is not None and bleu_ref is not None
  if evaluate_bleu and schedule_manager.use_tpu:
@@ -444,23 +457,29 @@ def _check_train_limits(flag_dict):

  @flags.multi_flags_validator(
      ["bleu_source", "bleu_ref"],
-      message="Files specified by --bleu_source and/or --bleu_ref don't exist. "
-              "Please ensure that the file paths are correct.")
+      message="Both or neither --bleu_source and --bleu_ref must be defined.")
  def _check_bleu_files(flags_dict):
    """Validate files when bleu_source and bleu_ref are defined."""
-    if flags_dict["bleu_source"] is None or flags_dict["bleu_ref"] is None:
-      return True
-    return all([
-        tf.gfile.Exists(flags_dict["bleu_source"]),
-        tf.gfile.Exists(flags_dict["bleu_ref"])])
-
-  @flags.validator("vocab_file", "File set by --vocab_file does not exist.")
-  def _check_vocab_file(vocab_file):
-    """Ensure that vocab file exists."""
-    if vocab_file:
-      return tf.gfile.Exists(vocab_file)
-
-  flags_core.require_cloud_storage(["data_dir", "model_dir"])
+    return (flags_dict["bleu_source"] is None) == (
+        flags_dict["bleu_ref"] is None)
+
+  @flags.multi_flags_validator(
+      ["bleu_source", "bleu_ref", "vocab_file"],
+      message="--vocab_file must be defined if --bleu_source and --bleu_ref "
+              "are defined.")
+  def _check_bleu_vocab_file(flags_dict):
+    if flags_dict["bleu_source"] and flags_dict["bleu_ref"]:
+      return flags_dict["vocab_file"] is not None
+    return True
+
+  @flags.multi_flags_validator(
+      ["export_dir", "vocab_file"],
+      message="--vocab_file must be defined if --export_dir is set.")
+  def _check_export_vocab_file(flags_dict):
+    if flags_dict["export_dir"]:
+      return flags_dict["vocab_file"] is not None
+    return True
+
+  flags_core.require_cloud_storage(["data_dir", "model_dir", "export_dir"])


def construct_estimator(flags_obj, params, schedule_manager):
55 changes: 0 additions & 55 deletions official/utils/flags/README.md
@@ -72,32 +72,6 @@ def _check_pal(provided_pal_flag):
Validators pass when they return True (or another truthy value); all other outcomes
(False, None, an exception) fail.

-## Common Flags
-Common flags (e.g. batch_size, model_dir, etc.) are provided by various flag definition functions,
-and channeled through `official.utils.flags.core`. For instance, to define common supervised
-learning parameters one could use the following code:
-
-```python
-from absl import app as absl_app
-from absl import flags
-
-from official.utils.flags import core as flags_core
-
-
-def define_flags():
-  flags_core.define_base()
-  flags.adopt_module_key_flags(flags_core)
-
-
-def main(_):
-  flags_obj = flags.FLAGS
-  print(flags_obj)
-
-
-if __name__ == "__main__":
-  define_flags()
-  absl_app.run(main)
-```

## Testing
To test using absl, simply declare flags in the `setUpClass` method of TensorFlow's TestCase.

@@ -121,32 +95,3 @@ class BaseTester(unittest.TestCase):
    self.assertEqual(flags.FLAGS.test_flag, "def")
```

-## Immutability
-Flag values should not be mutated. Instead, use getter functions to return
-the desired values. An example getter function is the `get_loss_scale`
-function below:
-
-```python
-# Map string to (TensorFlow dtype, default loss scale)
-DTYPE_MAP = {
-    "fp16": (tf.float16, 128),
-    "fp32": (tf.float32, 1),
-}
-
-
-def get_loss_scale(flags_obj):
-  if flags_obj.loss_scale is not None:
-    return flags_obj.loss_scale
-  return DTYPE_MAP[flags_obj.dtype][1]
-
-
-def main(_):
-  flags_obj = flags.FLAGS
-
-  # Do not mutate flags_obj
-  # if flags_obj.loss_scale is None:
-  #   flags_obj.loss_scale = DTYPE_MAP[flags_obj.dtype][1]  # Don't do this
-
-  print(get_loss_scale(flags_obj))
-  ...
-```
64 changes: 64 additions & 0 deletions official/utils/flags/guidelines.md
@@ -0,0 +1,64 @@
# Using flags in official models

1. **All common flags must be incorporated in the models.**

Common flags (e.g. batch_size, model_dir, etc.) are provided by various flag definition functions,
and channeled through `official.utils.flags.core`. For instance, to define common supervised
learning parameters one could use the following code:

```python
from absl import app as absl_app
from absl import flags

from official.utils.flags import core as flags_core


def define_flags():
  flags_core.define_base()
  flags.adopt_module_key_flags(flags_core)


def main(_):
  flags_obj = flags.FLAGS
  print(flags_obj)


if __name__ == "__main__":
  define_flags()
  absl_app.run(main)
```
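A script defined this way then accepts the common flags on the command line, for example (the script name is hypothetical, and the flags are assumed to come from `define_base`):

```
python model.py --batch_size=64 --model_dir=/tmp/model
```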
2. **Validate flag values.**

See the [Validators](#validators) section for implementation details.

Validators in the official model repo should not access the file system (such as verifying
that files exist), due to the strict ordering requirements. Value-only checks, as sketched below, are fine.
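For instance, a minimal value-only validator (the flag name here is a hypothetical example):

```python
from absl import flags

flags.DEFINE_integer("batch_size", 32, "Batch size for training.")


@flags.validator("batch_size", message="--batch_size must be positive.")
def _check_batch_size(batch_size):
  # Pure value check; no file-system access.
  return batch_size > 0
```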

3. **Flag values should not be mutated.**

Instead of mutating flag values, use getter functions to return the desired values. An example
getter function is the `get_loss_scale` function below:

```python
# Map string to (TensorFlow dtype, default loss scale)
DTYPE_MAP = {
    "fp16": (tf.float16, 128),
    "fp32": (tf.float32, 1),
}


def get_loss_scale(flags_obj):
  if flags_obj.loss_scale is not None:
    return flags_obj.loss_scale
  return DTYPE_MAP[flags_obj.dtype][1]


def main(_):
  flags_obj = flags.FLAGS

  # Do not mutate flags_obj
  # if flags_obj.loss_scale is None:
  #   flags_obj.loss_scale = DTYPE_MAP[flags_obj.dtype][1]  # Don't do this

  print(get_loss_scale(flags_obj))
  ...
```
58 changes: 58 additions & 0 deletions official/utils/logs/guidelines.md
@@ -0,0 +1,58 @@
# Logging in official models

This library adds logging functions that print or save tensor values. Official models should define all common hooks
(using the hooks helper) and a benchmark logger.

1. **Training Hooks**

Hooks are a TensorFlow mechanism for running specific actions at certain points of execution. We use them to obtain and log
tensor values during training.

hooks_helper.py provides an easy way to create common hooks; a usage sketch follows the list below. The following hooks are currently defined:
* LoggingTensorHook: Logs tensor values.
* ProfilerHook: Writes a timeline JSON file that can be loaded into chrome://tracing.
* ExamplesPerSecondHook: Logs the number of examples processed per second.
* LoggingMetricHook: Similar to LoggingTensorHook, except that the tensors are logged in a format defined by our data
analysis pipeline.
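A minimal creation sketch (assuming `get_train_hooks` accepts a list of hook names and forwards extra keyword arguments, such as `batch_size`, to each hook):

```python
from official.utils.logs import hooks_helper

# Build two of the common hooks by name; extra keyword arguments
# (here batch_size) are forwarded to the individual hooks.
train_hooks = hooks_helper.get_train_hooks(
    ["LoggingTensorHook", "ExamplesPerSecondHook"],
    batch_size=128)
```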


2. **Benchmarks**

The benchmark logger provides useful functions for logging environment information and evaluation results.
The module also contains a context manager that is used to update the status of the run.

Example usage:

```python
from absl import app as absl_app
from absl import flags

from official.utils.logs import hooks_helper
from official.utils.logs import logger


def model_main(flags_obj):
  estimator = ...

  benchmark_logger = logger.get_benchmark_logger()
  benchmark_logger.log_run_info(...)

  train_hooks = hooks_helper.get_train_hooks(...)

  for epoch in range(10):
    estimator.train(..., hooks=train_hooks)
    eval_results = estimator.evaluate(...)

    # Log a dictionary of metrics
    benchmark_logger.log_evaluation_result(eval_results)

    # Log an individual metric
    benchmark_logger.log_metric(...)


def main(_):
  with logger.benchmark_context(flags.FLAGS):
    model_main(flags.FLAGS)


if __name__ == "__main__":
  # define flags
  absl_app.run(main)
```
3 changes: 1 addition & 2 deletions official/utils/logs/hooks_helper.py
@@ -143,8 +143,7 @@ def get_logging_metric_hook(tensors_to_log=None,
      10 mins.

  Returns:
-    Returns a ProfilerHook that writes out timelines that can be loaded into
-    profiling tools like chrome://tracing.
+    Returns a LoggingMetricHook that saves tensor values in a JSON format.
  """
  if tensors_to_log is None:
    tensors_to_log = _TENSORS_TO_LOG
