Data Transform Process
Transform the table schema to be wide (i.e., one table column per feature) if the original table schema is not. We implement this with a batch processing job, such as a MaxCompute job.
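To illustrate what "wide" means here, the following is a minimal plain-Python sketch, not the MaxCompute job itself. It assumes, purely for illustration, that the original table packs features into one column as `name:value` pairs separated by commas; `flatten_rows` and the sample rows are hypothetical.

```python
def flatten_rows(rows, feature_names):
    """Turn (id, "name:value,name:value") rows into wide rows.

    The "name:value" packing is an assumed input layout, not a format
    defined by this design.
    """
    wide = []
    for row_id, kv_string in rows:
        # Parse the packed column into a {name: value} mapping.
        pairs = dict(item.split(":", 1) for item in kv_string.split(",") if item)
        # Emit one output column per feature; missing features become None.
        wide_row = {"id": row_id}
        for name in feature_names:
            wide_row[name] = pairs.get(name)
        wide.append(wide_row)
    return wide


narrow = [(1, "age:39,education:Bachelors"), (2, "age:50")]
wide = flatten_rows(narrow, ["age", "education"])
```

Each resulting row has exactly one column per feature, which is the shape the later transform steps expect.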
Q: How to describe the Table Flatten behavior using SQLFlow?
Calculate the statistics needed by the subsequent transform code generation.
Q: How do we get the statistics dictionary?
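One possible shape for such a statistics dictionary is sketched below in plain Python. The function name `compute_statistics`, the chosen statistics (min, max, mean, standard deviation, vocabulary), and the sample rows are all assumptions for illustration; in practice these values would more likely come from a batch job over the full table.

```python
import statistics


def compute_statistics(rows, numeric_columns, categorical_columns):
    """Collect per-column statistics from a list of wide rows (dicts)."""
    stats = {}
    for name in numeric_columns:
        values = [row[name] for row in rows]
        stats[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            # Population standard deviation over the sampled rows.
            "stddev": statistics.pstdev(values),
        }
    for name in categorical_columns:
        # Sorted distinct values serve as the vocabulary.
        stats[name] = {"vocabulary": sorted({row[name] for row in rows})}
    return stats


rows = [
    {"age": 30.0, "education": "Bachelors"},
    {"age": 50.0, "education": "Masters"},
    {"age": 40.0, "education": "Bachelors"},
]
stats = compute_statistics(rows, ["age"], ["education"])
```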
We can use Keras layers together with feature columns to do the data transformation. Please look at the Google Cloud sample.
Idea (Need Sample Code): dataset_fn can be auto-generated from the normalized table schema.
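As a rough sketch of this idea, the code below builds a `dataset_fn` from a schema description instead of writing it by hand. The schema format, the helper `make_dataset_fn`, and the simplified one-argument `dataset_fn` signature are assumptions for illustration and may differ from the signature the framework actually expects.

```python
import tensorflow as tf

# Hypothetical wide-table schema: (column name, dtype) in table column order.
SCHEMA = [
    ("age", tf.float32),
    ("hours_per_week", tf.float32),
    ("label", tf.int64),
]


def make_dataset_fn(schema, label_name):
    """Build a dataset_fn from the schema instead of hand-writing it."""
    names = [name for name, _ in schema]
    dtypes = [dtype for _, dtype in schema]

    def parse_row(*row):
        # Cast each positional field to the dtype declared in the schema.
        features = {n: tf.cast(v, d) for n, v, d in zip(names, row, dtypes)}
        label = features.pop(label_name)
        return features, label

    def dataset_fn(dataset):
        return dataset.map(parse_row)

    return dataset_fn


# Two sample rows, given column by column.
rows = tf.data.Dataset.from_tensor_slices(([39.0, 50.0], [40.0, 13.0], [0, 1]))
dataset_fn = make_dataset_fn(SCHEMA, "label")
first_features, first_label = next(iter(dataset_fn(rows)))
```

Because the parsing logic depends only on the schema, the same generator could be reused for any table once its schema has been normalized.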
Build the common transform function set using TensorFlow ops. It can be fed into tf.keras.layers.Lambda or the normalizer_fn of numeric_column.
As the transform function set is built upon TensorFlow op, we can ensure the consistency between training and inference.
The functions in this library can be executed and debugged in both eager mode and graph mode.
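A minimal sketch of one such transform function follows. The function `normalize` and the statistics (mean 40, standard deviation 10) are hypothetical; the sketch only shows that a function built from TensorFlow ops can be wrapped in `tf.keras.layers.Lambda` and run in both eager and graph mode.

```python
import tensorflow as tf


def normalize(x, mean, stddev):
    """Standardize a numeric tensor with precomputed statistics."""
    return (tf.cast(x, tf.float32) - mean) / stddev


# Wrap the transform in a Lambda layer; the statistics are assumed values.
age_normalizer = tf.keras.layers.Lambda(
    lambda x: normalize(x, mean=40.0, stddev=10.0)
)

sample = tf.constant([[30.0], [50.0]])

# Eager mode: the layer runs immediately and can be debugged step by step.
eager_out = age_normalizer(sample)


# Graph mode: the same function traced by tf.function.
@tf.function
def graph_normalize(x):
    return normalize(x, 40.0, 10.0)


graph_out = graph_normalize(sample)
```

The same `normalize` function could presumably also be passed, with its statistics bound, as the `normalizer_fn` of a `numeric_column`, so that training and inference share one definition.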
Key point: Express the transform functions using the COLUMN expression. How should we design the syntax in SQLFlow to express our functions elegantly?
We want to settle on a pattern for the model definition. In this way, we can generate the code according to this pattern.
Transform Layers => Feature Columns + DenseFeatures => Neural Network Structure
Transform Work: tf.keras.layers.Lambda
Multiple Column Transform: tf.keras.layers.Lambda
Feature Column: Categorical mapper
Embedding:
- Dense embedding -> tf.keras.layers.Embedding
- Sparse embedding -> Either embedding_column or tf.keras.layers.Embedding plus a Keras combiner layer works. We can use SparseEmbedding if Keras provides a native SparseEmbedding layer in the future.
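The pattern above can be sketched with a small subclassed Keras model. This is an illustration under assumptions, not the generated code: the class name `WideTableModel`, the column names, and the statistics are hypothetical, and a plain `tf.concat` stands in for the feature columns + DenseFeatures stage to keep the example short.

```python
import tensorflow as tf


class WideTableModel(tf.keras.Model):
    """Transform layers -> embedding -> dense network, in one model."""

    def __init__(self, vocab_size=10, embedding_dim=4):
        super().__init__()
        # Transform work: standardize a numeric column (assumed statistics).
        self.normalize_age = tf.keras.layers.Lambda(lambda x: (x - 40.0) / 10.0)
        # Dense embedding for an integer id column.
        self.embed_job = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.flatten = tf.keras.layers.Flatten()
        # Neural network structure.
        self.hidden = tf.keras.layers.Dense(8, activation="relu")
        self.out = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, inputs):
        age = self.normalize_age(inputs["age"])  # shape (batch, 1)
        job = self.flatten(self.embed_job(inputs["job_id"]))  # (batch, dim)
        # Stand-in for DenseFeatures: concatenate the transformed columns.
        x = tf.concat([age, job], axis=1)
        return self.out(self.hidden(x))


model = WideTableModel()
batch = {
    "age": tf.constant([[30.0], [50.0]]),
    "job_id": tf.constant([[1], [3]], dtype=tf.int64),
}
predictions = model(batch)
```

Keeping the transform layers inside the model means the exported model carries its preprocessing with it, which is what keeps training and inference consistent.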
Combine the model definition from the model zoo and the generated transform code into the complete submitter code.
Open Questions:
- Can a Lambda layer take a SparseTensor as input or produce one as output?
- How to implement apply_vocab from a vocab file using a Keras Lambda layer?
- How to combine multiple inputs into one and add an individual offset to each input at the same time using a Lambda layer?
- What's the code structure of the model definition using the subclassing approach?
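For the question about combining multiple inputs with individual offsets, one possible approach is sketched below. It is an assumption-laden illustration rather than a settled answer: `concat_with_offsets` and the vocabulary sizes are hypothetical, and each input is assumed to be a dense integer id tensor of shape (batch, 1).

```python
import tensorflow as tf


def concat_with_offsets(id_tensors, vocab_sizes):
    """Shift each id tensor into its own id range, then concatenate."""
    # The offset of input i is the sum of the vocabulary sizes before it.
    offsets = [0]
    for size in vocab_sizes[:-1]:
        offsets.append(offsets[-1] + size)
    shifted = [ids + offset for ids, offset in zip(id_tensors, offsets)]
    return tf.concat(shifted, axis=1)


# A single Lambda layer takes the list of inputs and applies the offsets.
combine = tf.keras.layers.Lambda(
    lambda tensors: concat_with_offsets(tensors, vocab_sizes=[3, 5])
)

education_ids = tf.constant([[0], [2]], dtype=tf.int64)  # ids in [0, 3)
job_ids = tf.constant([[1], [4]], dtype=tf.int64)  # ids in [0, 5)
combined = combine([education_ids, job_ids])
```

After the shift, ids from different inputs no longer collide, so the combined tensor could index one shared embedding table of size 3 + 5.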