
Commit ceb6c15

tf-transform-team authored and elmer-garduno committed
Project import generated by Copybara (go/copybara).
PiperOrigin-RevId: 145722621 Change-Id: Idfcb69aef09195dc779dc7a0337ccfdac0149f1f
1 parent 3492bc6 commit ceb6c15

33 files changed, with 3726 additions and 5 deletions

CONTRIBUTING.md

Lines changed: 2 additions & 3 deletions
@@ -17,14 +17,13 @@ Follow either of the two links above to access the appropriate CLA and instructi
 
 ### Contributing code
 
-If you have improvements to TensorFlow Serving, send us your pull requests!
+If you have improvements to TensorFlow Transform, send us your pull requests!
 For those just getting started, Github has a [howto](https://help.github.com/articles/using-pull-requests/).
 
 If you want to contribute but you're not sure where to start, take a look at the
-[issues with the "contributions welcome" label](https://github.com/tensorflow/serving/labels/contributions%20welcome).
+[issues with the "contributions welcome" label](https://github.com/tensorflow/transform/labels/contributions%20welcome).
 These are issues that we believe are particularly well suited for outside
 contributions, often because we probably won't get to them right now. If you
 decide to start on an issue, leave a comment so that other people know that
 you're working on it. If you want to help out, but not alone, use the issue
 comment thread to coordinate.
-

LICENSE

Lines changed: 0 additions & 1 deletion
@@ -201,4 +201,3 @@ Copyright 2015 The TF.Transform Authors. All rights reserved.
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-

README

Lines changed: 0 additions & 1 deletion
This file was deleted.

README.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+# tf.Transform
+
+**tf.Transform** is a library for doing data preprocessing with TensorFlow. It
+allows users to combine various data processing frameworks (currently Apache
+Beam is supported but tf.Transform can be extended to support other frameworks),
+with TensorFlow, to transform data. Because tf.Transform is built on TensorFlow,
+it allows users to export a graph which re-creates the transformations they did
+to their data as a TensorFlow graph. This is important as the user can then
+incorporate the exported TensorFlow graph into their serving model, thus
+avoiding skew between the served model and the training data.
+
+## tf.Transform Concepts
+
+The most important concept of tf.Transform is the "preprocessing function". This
+is a logical description of a transformation of a dataset. The dataset is
+conceptualized as a dictionary of columns, and the preprocessing function is
+defined by means of two kinds of function:
+
+1) A "transform" which is a function defined using TensorFlow that accepts and
+returns tensors. Such a function is applied to some input columns and generates
+transformed columns. Users define their own transforms by first defining a
+function that operates on tensors, and then applying this to columns using the
+`tf_transform.transform` function.
+
+2) An "analyzer" which is a function that accepts columns and returns a
+"statistic". A statistic is like a column except that it only has a single
+value. An example of an analyzer is `tf_transform.min` which computes the
+minimum of a column. Currently tf.Transform provides a fixed set of analyzers.
+
+By combining analyzers and transforms, users can create arbitrary pipelines for
+transforming their data. In particular, users should define a "preprocessing
+function" which accepts and returns columns.
+
+Columns are not themselves wrappers around data, rather they are placeholders
+used to construct a definition of the user's logical pipeline. In order to apply
+such a pipeline to data, we rely on the implementation. The Apache Beam
+implementation provides `PTransform`s that apply a user's preprocessing function
+to data. The typical workflow of a tf.Transform user will be to construct a
+preprocessing function, and then incorporate this into a large Beam pipeline,
+ultimately materializing the data for training.
+
+## Background
+
+While TensorFlow allows users to do arbitrary manipulations on a single instance
+or batch of instances, some kinds of preprocessing require a full pass over the
+dataset. For example, normalizing an input value, computing a vocabulary for a
+string input (and then mapping the string to an int with this vocabulary), or
+bucketizing an input. While some of these operations can be done with TensorFlow
+in a streaming manner (e.g. calculating a running mean for normalization), in
+general it may be preferable or necessary to calculate these with a full pass
+over the data.
+
+## Installation and Dependencies
+
+The easiest way to install tf.Transform is with the PyPI package.
+
+`pip install tensorflow_transform`
+
+Currently tf.Transform requires that TensorFlow be installed but does not have
+an explicit dependency on TensorFlow as a package. See [TensorFlow
+documentation](https://www.tensorflow.org/get_started/os_setup) for more
+information on installing TensorFlow.
+
+This package depends on the Google Cloud Dataflow distribution of Apache Beam.
+Apache Beam is the package used to run distributed pipelines. Apache Beam is
+able to run pipelines in multiple ways, depending on the "runner" used. While
+Apache Beam is an open source package, currently the only distribution on PyPI
+is the Cloud Dataflow distribution. This package can run beam pipelines locally,
+or on Google Cloud Dataflow.
+
+When a base package for Apache Beam (containing no runners) is available, the
+tf.Transform package will depend only on this base package, and users will be
+able to install their own runners. tf.Transform will attempt to be as
+independent from the specific runner as possible.
+
+## Getting Started
+
+For instructions on using tf.Transform see the [getting started
+guide](./getting_started.md)
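
Reviewer note on the README concepts: the split between an "analyzer" (a full pass that yields a single-value statistic) and a "transform" (an instance-wise function) may be easier to follow with a small illustration. The sketch below is plain Python and deliberately does not use the tf.Transform API; every name in it (`analyze_min_max`, `make_scale_to_unit_range`, `preprocessing_fn`) is hypothetical and chosen only for this example.

```python
# Conceptual sketch only: plain Python, NOT the tf.Transform API.
# All function names below are hypothetical, invented for illustration.


def analyze_min_max(column):
    """'Analyzer': a full pass over a column that yields single-value statistics."""
    values = list(column)
    return min(values), max(values)


def make_scale_to_unit_range(lo, hi):
    """'Transform': an instance-wise function parameterized by the statistics."""
    span = hi - lo

    def scale(x):
        # Guard against a constant column, where the span is zero.
        return 0.0 if span == 0 else (x - lo) / span

    return scale


def preprocessing_fn(columns):
    """Accepts a dict of columns and returns a dict of transformed columns."""
    lo, hi = analyze_min_max(columns["x"])
    scale = make_scale_to_unit_range(lo, hi)
    return {"x_scaled": [scale(v) for v in columns["x"]]}


training_columns = {"x": [2.0, 4.0, 6.0]}
transformed = preprocessing_fn(training_columns)

# At serving time the same statistics are reused on a single instance,
# which is the idea behind avoiding training/serving skew.
lo, hi = analyze_min_max(training_columns["x"])
serving_scale = make_scale_to_unit_range(lo, hi)
served_value = serving_scale(5.0)
```

In this toy version the statistics are recomputed by hand for serving; the README describes the library exporting the transformation as a TensorFlow graph instead, so that the training and serving paths share one definition.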

docs/index.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# Tensorflow Transform
+
+TFTransform is a framework for transforming data with TensorFlow.
+It allows users to combine transformations defined in terms of functions that
+act on tensors, with full pass operations that analyze the entire dataset.
+By doing so we allow users to construct complex transformations involving
+vocabularies, normalization and bucketizing, while allow a user to easily
+incorporate these transformations into the serving graph.
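
Reviewer note on the "full pass" wording in docs/index.md: a vocabulary is a simple case where the whole dataset has to be seen before any single instance can be mapped. The following is a minimal plain-Python sketch under that assumption; it is not the tf.Transform API, and `compute_vocabulary`, `apply_vocabulary`, and the `default` parameter are hypothetical names used only here.

```python
# Conceptual sketch only: plain Python, NOT the tf.Transform API.
# Function and parameter names are hypothetical.
from collections import Counter


def compute_vocabulary(strings):
    """Full pass: count every token, then assign ids by descending frequency."""
    counts = Counter(strings)
    # Break frequency ties alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


def apply_vocabulary(vocabulary, token, default=-1):
    """Instance-wise step: map one string to its id, or to `default` if unseen."""
    return vocabulary.get(token, default)


vocab = compute_vocabulary(["cat", "dog", "cat", "bird"])
ids = [apply_vocabulary(vocab, t) for t in ["dog", "cat", "fish"]]
```

The out-of-vocabulary default of -1 is an arbitrary choice for the sketch; a real pipeline might reserve a dedicated id or hash bucket instead.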

getting_started.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+

setup.py

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+# Copyright 2016 Google Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Package Setup script for the tf.Transform binary.
+"""
+
+from setuptools import find_packages
+from setuptools import setup
+
+
+def get_required_install_packages():
+  required_install_packages = [
+      # TODO(elmerg) the beam dependency will come directly from the apache beam
+      # package once it is available.
+      'google-cloud-dataflow >= 0.4.4',
+  ]
+  return required_install_packages
+
+
+def get_version():
+  return '0.1'
+
+
+setup(
+    name='tensorflow-transform',
+    version=get_version(),
+    author='Google',
+    author_email='[email protected]',
+    namespace_packages=[],
+    install_requires=get_required_install_packages(),
+    packages=find_packages(),
+    include_package_data=True,
+    description='Tensorflow Transform',
+    requires=[])

tensorflow_transform/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+