Add checklist for official models. Remove file access from flag validator (fix build) (tensorflow#4492)

* Add checklist for official models. Remove file access from flag validator (causing issues with BUILD)

* spelling

* address PR comments
k-w-w authored Jun 12, 2018
1 parent 29c9f98 commit bb62f24
Showing 7 changed files with 207 additions and 91 deletions.
55 changes: 43 additions & 12 deletions official/README.md
@@ -2,34 +2,65 @@

The TensorFlow official models are a collection of example models that use TensorFlow's high-level APIs. They are intended to be well-maintained, tested, and kept up to date with the latest TensorFlow API. They should also be reasonably optimized for fast performance while still being easy to read.

These models are used as end-to-end tests, ensuring that the models maintain the same speed and performance with each new TensorFlow build.

## TensorFlow releases
The models on the master branch are **in development**, and they target the [nightly binaries](https://github.com/tensorflow/tensorflow#installation) built from the [master branch of TensorFlow](https://github.com/tensorflow/tensorflow/tree/master). We aim to keep them backwards compatible with the latest release when possible (currently TensorFlow 1.5), but we cannot always guarantee compatibility.

**Stable versions** of the official models targeting releases of TensorFlow are available as tagged branches or [downloadable releases](https://github.com/tensorflow/models/releases). Model repository version numbers match the target TensorFlow release, such that [branch r1.4.0](https://github.com/tensorflow/models/tree/r1.4.0) and [release v1.4.0](https://github.com/tensorflow/models/releases/tag/v1.4.0) are compatible with [TensorFlow v1.4.0](https://github.com/tensorflow/tensorflow/releases/tag/v1.4.0).
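For example, to pair a local clone of this repository with TensorFlow v1.4.0:

```
git checkout r1.4.0
```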

If you are on a version of TensorFlow earlier than 1.4, please [update your installation](https://www.tensorflow.org/install/).

---
+## Requirements
+Please follow the steps below before running models in this repo:
+
+1. Add the top-level ***/models*** folder to the Python path with the command:
+   ```
+   export PYTHONPATH="$PYTHONPATH:/path/to/models"
+   ```
+2. Install dependencies:
+   ```
+   pip3 install --user -r official/requirements.txt
+   ```
+   or
+   ```
+   pip install --user -r official/requirements.txt
+   ```
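As a quick sanity check (assuming both steps above succeeded and TensorFlow itself is installed), importing one of this repo's modules should work:

```
python -c "import official.utils.flags.core"
```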

-Below is a list of the models available.
-
-[boosted_trees](boosted_trees): A Gradient Boosted Trees model to classify Higgs boson processes from the HIGGS Data Set.
-
-[mnist](mnist): A basic model to classify digits from the MNIST dataset.
-
-[resnet](resnet): A deep residual network that can be used to classify both CIFAR-10 and ImageNet's dataset of 1000 classes.
-
-[wide_deep](wide_deep): A model that combines a wide model and deep network to classify census income data.
-
-More models to come!

+To make Official Models easier to use, we are planning to create a pip installable Official Models package. This is being tracked in [#917](https://github.com/tensorflow/models/issues/917).
+
+## Available models
+
+**NOTE:** Please make sure to follow the steps in the [Requirements](#requirements) section.
+
+* [boosted_trees](boosted_trees): A Gradient Boosted Trees model to classify Higgs boson processes from the HIGGS Data Set.
+* [mnist](mnist): A basic model to classify digits from the MNIST dataset.
+* [resnet](resnet): A deep residual network that can be used to classify both CIFAR-10 and ImageNet's dataset of 1000 classes.
+* [transformer](transformer): A transformer model to translate the WMT English to German dataset.
+* [wide_deep](wide_deep): A model that combines a wide model and deep network to classify census income data.
+* More models to come!

If you would like to make any fixes or improvements to the models, please [submit a pull request](https://github.com/tensorflow/models/compare).

---
-## Running the models
-
-The *Official Models* are made available as a Python module. To run the models and associated scripts, add the top-level ***/models*** folder to the Python path with the command: `export PYTHONPATH="$PYTHONPATH:/path/to/models"`
-
-To install dependencies pass `-r official/requirements.txt` to pip. (i.e. `pip3 install --user -r official/requirements.txt`)
-
-To make Official Models easier to use, we are planning to create a pip installable Official Models package. This is being tracked in [#917](https://github.com/tensorflow/models/issues/917).

+## New Models
+
+The team is actively working to add new models to the repository. Every model should follow the guidelines below, to uphold our objectives of readable, usable, and maintainable code.
+
+**General guidelines**
+* Code should be well documented and tested.
+* Runnable from a blank environment with relative ease.
+* Trainable on: single GPU/CPU (baseline), multiple GPUs, TPU
+* Compatible with Python 2 and 3 (using [six](https://pythonhosted.org/six/) when necessary)
+* Conform to the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)
+
+**Implementation guidelines**
+
+These guidelines exist so the model implementations are consistent for better readability and maintainability.
+
+* Use [common utility functions](utils)
+* Export a SavedModel at the end of training (a minimal sketch follows this list).
+* Consistent flags and flag-parsing library ([read more here](utils/flags/guidelines.md))
+* Produce benchmarks and logs ([read more here](utils/logs/guidelines.md))
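For the SavedModel guideline above, a minimal sketch with a TF 1.x Estimator could look like the following; the feature name and shape are illustrative assumptions, not a repo convention:

```python
import tensorflow as tf


def export_model(estimator, export_dir):
  # Serving signature for a single float feature; the "image" name and
  # shape below are placeholder assumptions for illustration.
  serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
      {"image": tf.placeholder(tf.float32, shape=[None, 28, 28, 1])})
  # Writes a timestamped SavedModel directory under export_dir.
  estimator.export_savedmodel(export_dir, serving_input_fn)
```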
4 changes: 2 additions & 2 deletions official/transformer/README.md
@@ -37,7 +37,7 @@ Below are the commands for running the Transformer model. See the [Detailed inst
cd /path/to/models/official/transformer
# Ensure that PYTHONPATH is correctly defined as described in
-# https://github.com/tensorflow/models/tree/master/official#running-the-models
+# https://github.com/tensorflow/models/tree/master/official#requirements
# export PYTHONPATH="$PYTHONPATH:/path/to/models"
# Export variables
@@ -94,7 +94,7 @@ big | 28.9
0. ### Environment preparation

#### Add models repo to PYTHONPATH
-Follow the instructions described in the [Running the models](https://github.com/tensorflow/models/tree/master/official#running-the-models) section to add the models folder to the python path.
+Follow the instructions described in the [Requirements](https://github.com/tensorflow/models/tree/master/official#requirements) section to add the models folder to the python path.

#### Export variables (optional)

59 changes: 39 additions & 20 deletions official/transformer/transformer_main.py
@@ -189,19 +189,18 @@ def get_train_op_and_metrics(loss, params):
        loss, tvars, colocate_gradients_with_ops=True)
    minimize_op = optimizer.apply_gradients(
        gradients, global_step=global_step, name="train")

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    train_op = tf.group(minimize_op, update_ops)

-    metrics = {"learning_rate": learning_rate}
+    train_metrics = {"learning_rate": learning_rate}

    if not params["use_tpu"]:
      # gradient norm is not included as a summary when running on TPU, as
      # it can cause instability between the TPU and the host controller.
      gradient_norm = tf.global_norm(list(zip(*gradients))[0])
-      metrics["global_norm/gradient_norm"] = gradient_norm
+      train_metrics["global_norm/gradient_norm"] = gradient_norm

-    return train_op, metrics
+    return train_op, train_metrics


def translate_and_compute_bleu(estimator, subtokenizer, bleu_source, bleu_ref):
@@ -237,6 +236,13 @@ def evaluate_and_log_bleu(estimator, bleu_source, bleu_ref, vocab_file):
  tf.logging.info("Bleu score (cased):", cased_score)
  return uncased_score, cased_score


+def _validate_file(filepath):
+  """Make sure that file exists."""
+  if not tf.gfile.Exists(filepath):
+    raise tf.errors.NotFoundError(None, None, "File %s not found." % filepath)

def run_loop(
    estimator, schedule_manager, train_hooks=None, benchmark_logger=None,
    bleu_source=None, bleu_ref=None, bleu_threshold=None, vocab_file=None):
@@ -276,7 +282,14 @@ def run_loop(
  Raises:
    ValueError: if both or none of single_iteration_train_steps and
      single_iteration_train_epochs were defined.
+    NotFoundError: if the vocab file or bleu files don't exist.
  """
+  if bleu_source:
+    _validate_file(bleu_source)
+  if bleu_ref:
+    _validate_file(bleu_ref)
+  if vocab_file:
+    _validate_file(vocab_file)

  evaluate_bleu = bleu_source is not None and bleu_ref is not None
  if evaluate_bleu and schedule_manager.use_tpu:
@@ -444,23 +457,29 @@ def _check_train_limits(flag_dict):

  @flags.multi_flags_validator(
      ["bleu_source", "bleu_ref"],
-      message="Files specified by --bleu_source and/or --bleu_ref don't exist. "
-              "Please ensure that the file paths are correct.")
+      message="Both or neither --bleu_source and --bleu_ref must be defined.")
  def _check_bleu_files(flags_dict):
    """Validate files when bleu_source and bleu_ref are defined."""
-    if flags_dict["bleu_source"] is None or flags_dict["bleu_ref"] is None:
-      return True
-    return all([
-        tf.gfile.Exists(flags_dict["bleu_source"]),
-        tf.gfile.Exists(flags_dict["bleu_ref"])])
-
-  @flags.validator("vocab_file", "File set by --vocab_file does not exist.")
-  def _check_vocab_file(vocab_file):
-    """Ensure that vocab file exists."""
-    if vocab_file:
-      return tf.gfile.Exists(vocab_file)
-
-  flags_core.require_cloud_storage(["data_dir", "model_dir"])
+    return (flags_dict["bleu_source"] is None) == (
+        flags_dict["bleu_ref"] is None)
+
+  @flags.multi_flags_validator(
+      ["bleu_source", "bleu_ref", "vocab_file"],
+      message="--vocab_file must be defined if --bleu_source and --bleu_ref "
+              "are defined.")
+  def _check_bleu_vocab_file(flags_dict):
+    if flags_dict["bleu_source"] and flags_dict["bleu_ref"]:
+      return flags_dict["vocab_file"] is not None
+    return True
+
+  @flags.multi_flags_validator(
+      ["export_dir", "vocab_file"],
+      message="--vocab_file must be defined if --export_dir is set.")
+  def _check_export_vocab_file(flags_dict):
+    if flags_dict["export_dir"]:
+      return flags_dict["vocab_file"] is not None
+    return True
+
+  flags_core.require_cloud_storage(["data_dir", "model_dir", "export_dir"])


def construct_estimator(flags_obj, params, schedule_manager):
55 changes: 0 additions & 55 deletions official/utils/flags/README.md
@@ -72,32 +72,6 @@ def _check_pal(provided_pal_flag):
Validators pass when they return True (or another truthy value); all other outcomes
(False, None, an exception) fail.

-## Common Flags
-Common flags (e.g. batch_size, model_dir, etc.) are provided by various flag definition functions,
-and channeled through `official.utils.flags.core`. For instance, to define common supervised
-learning parameters one could use the following code:
-
-```python
-from absl import app as absl_app
-from absl import flags
-
-from official.utils.flags import core as flags_core
-
-
-def define_flags():
-  flags_core.define_base()
-  flags.adopt_module_key_flags(flags_core)
-
-
-def main(_):
-  flags_obj = flags.FLAGS
-  print(flags_obj)
-
-
-if __name__ == "__main__":
-  define_flags()
-  absl_app.run(main)
-```

## Testing
To test using absl, simply declare flags in the `setUpClass` method of TensorFlow's TestCase.

@@ -121,32 +95,3 @@ class BaseTester(unittest.TestCase):
    self.assertEqual(flags.FLAGS.test_flag, "def")
```

-## Immutability
-Flag values should not be mutated. Instead, use getter functions to return
-the desired values. An example getter function is the `get_loss_scale`
-function below:
-
-```python
-# Map string to (TensorFlow dtype, default loss scale)
-DTYPE_MAP = {
-    "fp16": (tf.float16, 128),
-    "fp32": (tf.float32, 1),
-}
-
-
-def get_loss_scale(flags_obj):
-  if flags_obj.loss_scale is not None:
-    return flags_obj.loss_scale
-  return DTYPE_MAP[flags_obj.dtype][1]
-
-
-def main(_):
-  flags_obj = flags.FLAGS
-
-  # Do not mutate flags_obj
-  # if flags_obj.loss_scale is None:
-  #   flags_obj.loss_scale = DTYPE_MAP[flags_obj.dtype][1]  # Don't do this
-
-  print(get_loss_scale(flags_obj))
-  ...
-```
64 changes: 64 additions & 0 deletions official/utils/flags/guidelines.md
@@ -0,0 +1,64 @@
# Using flags in official models

1. **All common flags must be incorporated in the models.**

Common flags (e.g. batch_size, model_dir, etc.) are provided by various flag definition functions,
and channeled through `official.utils.flags.core`. For instance, to define common supervised
learning parameters one could use the following code:

```python
from absl import app as absl_app
from absl import flags

from official.utils.flags import core as flags_core


def define_flags():
  flags_core.define_base()
  flags.adopt_module_key_flags(flags_core)


def main(_):
  flags_obj = flags.FLAGS
  print(flags_obj)


if __name__ == "__main__":
  define_flags()
  absl_app.run(main)
```
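A script defined this way then accepts the common flags on the command line, for example (the script name is hypothetical, and the flags are assumed to come from `define_base`):

```
python model.py --batch_size=64 --model_dir=/tmp/model
```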
2. **Validate flag values.**

See the [Validators](#validators) section for implementation details.

Validators in the official model repo should not access the file system (such as verifying
that files exist), due to the strict ordering requirements. Value-only checks, as sketched below, are fine.
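For instance, a minimal value-only validator (the flag name here is a hypothetical example):

```python
from absl import flags

flags.DEFINE_integer("batch_size", 32, "Batch size for training.")


@flags.validator("batch_size", message="--batch_size must be positive.")
def _check_batch_size(batch_size):
  # Pure value check; no file-system access.
  return batch_size > 0
```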

3. **Flag values should not be mutated.**

Instead of mutating flag values, use getter functions to return the desired values. An example
getter function is the `get_loss_scale` function below:

```python
# Map string to (TensorFlow dtype, default loss scale)
DTYPE_MAP = {
    "fp16": (tf.float16, 128),
    "fp32": (tf.float32, 1),
}


def get_loss_scale(flags_obj):
  if flags_obj.loss_scale is not None:
    return flags_obj.loss_scale
  return DTYPE_MAP[flags_obj.dtype][1]


def main(_):
  flags_obj = flags.FLAGS

  # Do not mutate flags_obj
  # if flags_obj.loss_scale is None:
  #   flags_obj.loss_scale = DTYPE_MAP[flags_obj.dtype][1]  # Don't do this

  print(get_loss_scale(flags_obj))
  ...
```
58 changes: 58 additions & 0 deletions official/utils/logs/guidelines.md
@@ -0,0 +1,58 @@
# Logging in official models

This library adds logging functions that print or save tensor values. Official models should define all common hooks
(using the hooks helper) and a benchmark logger.

1. **Training Hooks**

Hooks are a TensorFlow mechanism for running specific actions at certain points of execution. We use them to obtain and log
tensor values during training.

hooks_helper.py provides an easy way to create common hooks; a usage sketch follows the list below. The following hooks are currently defined:
* LoggingTensorHook: Logs tensor values.
* ProfilerHook: Writes a timeline JSON file that can be loaded into chrome://tracing.
* ExamplesPerSecondHook: Logs the number of examples processed per second.
* LoggingMetricHook: Similar to LoggingTensorHook, except that the tensors are logged in a format defined by our data
analysis pipeline.
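A minimal creation sketch (assuming `get_train_hooks` accepts a list of hook names and forwards extra keyword arguments, such as `batch_size`, to each hook):

```python
from official.utils.logs import hooks_helper

# Build two of the common hooks by name; extra keyword arguments
# (here batch_size) are forwarded to the individual hooks.
train_hooks = hooks_helper.get_train_hooks(
    ["LoggingTensorHook", "ExamplesPerSecondHook"],
    batch_size=128)
```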


2. **Benchmarks**

The benchmark logger provides useful functions for logging environment information and evaluation results.
The module also contains a context manager that is used to update the status of the run.

Example usage:

```python
from absl import app as absl_app
from absl import flags

from official.utils.logs import hooks_helper
from official.utils.logs import logger


def model_main(flags_obj):
  estimator = ...

  benchmark_logger = logger.get_benchmark_logger()
  benchmark_logger.log_run_info(...)

  train_hooks = hooks_helper.get_train_hooks(...)

  for epoch in range(10):
    estimator.train(..., hooks=train_hooks)
    eval_results = estimator.evaluate(...)

    # Log a dictionary of metrics
    benchmark_logger.log_evaluation_result(eval_results)

    # Log an individual metric
    benchmark_logger.log_metric(...)


def main(_):
  with logger.benchmark_context(flags.FLAGS):
    model_main(flags.FLAGS)


if __name__ == "__main__":
  # define flags
  absl_app.run(main)
```
3 changes: 1 addition & 2 deletions official/utils/logs/hooks_helper.py
@@ -143,8 +143,7 @@ def get_logging_metric_hook(tensors_to_log=None,
      10 mins.

  Returns:
-    Returns a ProfilerHook that writes out timelines that can be loaded into
-    profiling tools like chrome://tracing.
+    Returns a LoggingMetricHook that saves tensor values in a JSON format.
  """
  if tensors_to_log is None:
    tensors_to_log = _TENSORS_TO_LOG
