# tf.Transform

**tf.Transform** is a library for doing data preprocessing with TensorFlow. It
allows users to combine various data processing frameworks (currently Apache
Beam is supported, but tf.Transform can be extended to support other
frameworks) with TensorFlow to transform data. Because tf.Transform is built on
TensorFlow, users can export a graph that re-creates the transformations they
applied to their data as a TensorFlow graph. This is important because the user
can then incorporate the exported TensorFlow graph into their serving model,
thus avoiding skew between the served model and the training data.

## tf.Transform Concepts

The most important concept in tf.Transform is the "preprocessing function": a
logical description of a transformation of a dataset. The dataset is
conceptualized as a dictionary of columns, and the preprocessing function is
defined by means of two kinds of functions:

1) A "transform": a function, defined using TensorFlow, that accepts and
returns tensors. Such a function is applied to some input columns and generates
transformed columns. Users define their own transforms by first defining a
function that operates on tensors, and then applying it to columns using the
`tf_transform.transform` function.

2) An "analyzer": a function that accepts columns and returns a "statistic". A
statistic is like a column except that it has only a single value. An example
of an analyzer is `tf_transform.min`, which computes the minimum of a column.
Currently tf.Transform provides a fixed set of analyzers.

By combining analyzers and transforms, users can create arbitrary pipelines for
transforming their data. In particular, users should define a "preprocessing
function" which accepts and returns columns.
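
The analyzer/transform split can be illustrated with a minimal, library-free
sketch. This is plain Python, not the tf.Transform API, and the function names
are hypothetical: an analyzer makes a full pass over a column to produce a
single statistic, and a transform then uses that statistic elementwise.

```python
# Minimal sketch of the analyzer/transform split (plain Python;
# not the actual tf.Transform API).

def min_analyzer(column):
    # Analyzer: full pass over the column, returning a single statistic.
    return min(column)

def scale_to_zero_min(column):
    # Preprocessing function: combines an analyzer with a transform.
    col_min = min_analyzer(column)        # statistic (one value)
    return [x - col_min for x in column]  # transform (elementwise)

dataset = {"x": [3.0, 5.0, 9.0]}
transformed = {"x_shifted": scale_to_zero_min(dataset["x"])}
print(transformed)  # {'x_shifted': [0.0, 2.0, 6.0]}
```

The key point is that the analyzer's output must be computed over the whole
column before the elementwise transform can run on any row.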

Columns are not themselves wrappers around data; rather, they are placeholders
used to construct a definition of the user's logical pipeline. To apply such a
pipeline to data, we rely on the implementation. The Apache Beam implementation
provides `PTransform`s that apply a user's preprocessing function to data. The
typical workflow of a tf.Transform user is to construct a preprocessing
function, and then incorporate it into a larger Beam pipeline, ultimately
materializing the data for training.
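
The idea of columns as placeholders can be sketched as a tiny deferred-execution
scheme. The classes and functions below are hypothetical illustrations, not the
tf.Transform implementation: a `Column` records how it is derived, and no data
flows until the pipeline is applied.

```python
# Sketch of "columns as placeholders": first build a deferred description
# of the pipeline, then materialize it against concrete data.
# Hypothetical classes; not the tf.Transform implementation.

class Column:
    def __init__(self, fn=None, parent=None):
        self.fn = fn          # function to apply, or None for a raw input
        self.parent = parent  # upstream column this one is derived from

    def apply(self, data):
        # Materialize this column against concrete data.
        if self.parent is None:
            return data
        return [self.fn(v) for v in self.parent.apply(data)]

def transform(fn, column):
    # Analogue of applying a value-in/value-out function to a column.
    return Column(fn=fn, parent=column)

raw = Column()                             # placeholder: no data attached
doubled = transform(lambda v: v * 2, raw)  # logical pipeline, still no data
print(doubled.apply([1, 2, 3]))            # [2, 4, 6]
```

In the real library this separation is what lets the same logical pipeline be
executed by different implementations, such as the Apache Beam `PTransform`s.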

## Background

While TensorFlow allows users to do arbitrary manipulations on a single
instance or batch of instances, some kinds of preprocessing require a full pass
over the dataset. Examples include normalizing an input value, computing a
vocabulary for a string input (and then mapping the string to an int with this
vocabulary), and bucketizing an input. While some of these operations can be
done with TensorFlow in a streaming manner (e.g. calculating a running mean for
normalization), in general it may be preferable or necessary to calculate them
with a full pass over the data.
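
The vocabulary example above shows why a full pass is needed: no string can be
mapped to an integer until every distinct value has been seen. A plain-Python
sketch (hypothetical helper names, not the tf.Transform API):

```python
# Sketch of a full-pass preprocessing step: compute a vocabulary over a
# string column, then map each string to its integer index.

def compute_vocabulary(column):
    # Full pass: every distinct value must be seen before any row is mapped.
    return {term: idx for idx, term in enumerate(sorted(set(column)))}

def apply_vocabulary(column, vocab):
    # Per-row mapping, possible only after the full-pass vocabulary exists.
    return [vocab[term] for term in column]

col = ["cat", "dog", "cat", "bird"]
vocab = compute_vocabulary(col)      # {'bird': 0, 'cat': 1, 'dog': 2}
print(apply_vocabulary(col, vocab))  # [1, 2, 1, 0]
```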

## Installation and Dependencies

The easiest way to install tf.Transform is with the PyPI package:

`pip install tensorflow_transform`

Currently tf.Transform requires that TensorFlow be installed, but does not have
an explicit dependency on the TensorFlow package. See the [TensorFlow
documentation](https://www.tensorflow.org/get_started/os_setup) for more
information on installing TensorFlow.

This package depends on the Google Cloud Dataflow distribution of Apache Beam,
the package used to run distributed pipelines. Apache Beam is able to run
pipelines in multiple ways, depending on the "runner" used. While Apache Beam
is an open source package, currently the only distribution on PyPI is the
Cloud Dataflow distribution, which can run Beam pipelines locally or on Google
Cloud Dataflow.

When a base package for Apache Beam (containing no runners) is available, the
tf.Transform package will depend only on this base package, and users will be
able to install their own runners. tf.Transform will attempt to be as
independent of the specific runner as possible.

## Getting Started

For instructions on using tf.Transform, see the [getting started
guide](./getting_started.md).