Description
SQLFlow describes an end-to-end machine learning pipeline. Data transformation is an important part of that process.
- For per-data-instance transformations, we will use the COLUMN clause.
- For transformations that are not per data instance, we propose the TO RUN clause. Please refer to the discussion: [Discussion] How to generate time series features using tsfresh in SQLFlow #2137
Please check the following example SQL statement:
SELECT * FROM {source_table}
TO RUN {function_name}
WITH
param_a = value_a,
param_b = value_b
INTO {result_table}
{function_name} is the name of the data transformation function. It can be either a built-in SQLFlow function or a custom function provided by the user. As a first step, we will support built-in functions; TSFresh is the first one.
{source_table} is the name of the input table from which the transformation function reads its data.
{result_table} is the name of the output table to which the transformation function writes the processed result.
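To make the four parts of the statement concrete, here is a minimal sketch that splits a TO RUN statement into the SELECT query, the function name, the WITH parameters, and the result table. It is only an illustration: the regular expression and the `parse_to_run` helper are hypothetical and are not SQLFlow's actual parser, and the sketch assumes that parameter values contain no commas.

```python
import re

# Hypothetical pattern for illustration only; it is NOT SQLFlow's real grammar.
TO_RUN_PATTERN = re.compile(
    r"^\s*(?P<select>SELECT\b.+?)\s+"          # standard SELECT query
    r"TO\s+RUN\s+(?P<function>[\w.]+)"         # {function_name}
    r"(?:\s+WITH\s+(?P<params>.+?))?"          # optional WITH key = value list
    r"\s+INTO\s+(?P<result>[\w.]+)\s*;?\s*$",  # {result_table}
    re.IGNORECASE | re.DOTALL,
)


def parse_to_run(statement):
    """Split a TO RUN statement into select, function, params, result_table."""
    match = TO_RUN_PATTERN.match(statement)
    if match is None:
        raise ValueError("not a TO RUN statement")
    params = {}
    if match.group("params"):
        # Assumption: values contain no commas, so a plain split is enough.
        for item in match.group("params").split(","):
            key, _, value = item.partition("=")
            params[key.strip()] = value.strip()
    return {
        "select": match.group("select"),
        "function": match.group("function"),
        "params": params,
        "result_table": match.group("result"),
    }


example = """
SELECT * FROM source_table
TO RUN my_function
WITH
    param_a = value_a,
    param_b = value_b
INTO result_table
"""
parsed = parse_to_run(example)
print(parsed)
```

A real implementation would extend the existing SQLFlow grammar rather than use a regular expression; the sketch only shows which pieces of information the statement carries.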
The design doc: link.
Task breakdown
- Upgrade the parser to support the TO RUN statement.
- Translate TO RUN to a workflow.
- Upgrade goalisa to submit PyODPS tasks. Enable submitting ODPS SQL and PyODPS tasks on the DataWorks deployment.
- sqlflow.runner module.
- TSFresh high-level API implementation and Docker image.
- Sample TO RUN function repo.
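As a rough illustration of how a runner could dispatch a TO RUN function by name, the sketch below registers a function, reads all rows of a source table, and writes the aggregated result to a result table. Everything in it is an assumption made for the example: the in-memory `TABLES` dictionary stands in for a real database, and `to_run_function`, `run`, and `series_mean` are hypothetical names, not the sqlflow.runner API. The function deliberately aggregates across rows, which is the kind of transformation that is not per data instance.

```python
from collections import defaultdict

# Hypothetical in-memory "database": table name -> list of row dicts.
TABLES = {
    "source_table": [
        {"id": "a", "value": 1.0},
        {"id": "a", "value": 3.0},
        {"id": "b", "value": 10.0},
    ]
}

# Hypothetical registry of TO RUN functions, keyed by function name.
FUNCTIONS = {}


def to_run_function(name):
    """Register a callable under the name used after TO RUN."""
    def decorator(func):
        FUNCTIONS[name] = func
        return func
    return decorator


@to_run_function("series_mean")
def series_mean(rows, id_column, value_column):
    """Aggregate many input rows into one output row per series id."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_column]].append(row[value_column])
    return [
        {id_column: key, "mean": sum(values) / len(values)}
        for key, values in sorted(groups.items())
    ]


def run(function_name, source_table, result_table, **params):
    """Look up the function, feed it the source rows, store the result."""
    func = FUNCTIONS[function_name]
    TABLES[result_table] = func(TABLES[source_table], **params)


run("series_mean", "source_table", "result_table",
    id_column="id", value_column="value")
print(TABLES["result_table"])
```

Keeping the registry separate from the table I/O means a built-in function such as TSFresh and a user-provided function could share the same entry point, although the actual design may differ.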