# Refactory Discussion 2: the python modules/packages and the go pkgs after refactory
## Background

In #2168 we figured out the direction of migrating the SQLFlow core from Go to Python.
In #2214 and #2235, we've shown that attribute checking can be implemented in Python in a precise and concise way.
In #2225, we clarified how to reuse the existing Go DB code (especially goalisa) to avoid unnecessary rewriting.
## Plan

### Python Version

For the moment, most modules should stay compatible with both Python 2 and Python 3.
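A common way to keep a module source-compatible with both interpreters is the standard `__future__` header; this is general Python practice, not something mandated by this proposal:

```python
# Typical header for modules that must run on both Python 2 and Python 3.
from __future__ import absolute_import, division, print_function
```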
### Style

We follow the Google Style Guide.
Preparation
We have to define a command-line flag --python_refactory
and use that flag to make sure the existing code is still working before we finish the refactorying.
- A tiny modification can be implemented as:
# In some function if !flags.python_refactory { // The existing logic } else { // The new logic }
- After the refactory, we remove the flag and all the legacy code
- Big changes can be written as:

  ```go
  func generateTrainStmt(...)          // ...
  func generateTrainStmtForPython(...) // ...

  type SQLStatement struct{ /* ... */ }
  type SQLStatementForPython struct{ /* ... */ }
  ```

  or even as parallel packages such as `pkg/ir/` and `pkg/ir2/` if the change is big enough.
- After the refactoring, we remove the flag and all the legacy code, and rename the `pkg`s, `func`s and `struct`s by removing the suffix `ForPython`.
### Modules kept in Go

- Parser (no modification)
- Workflow (no modification)
- `godriver`s and `pkg/database`
  - Wrap the `godriver`s and `pkg/database` into a Python module as described in "Wrap goalisa into a python module" #2225
- IR Generator
  - The data structures of the `SQLStatement` interface, such as `TrainStmt`, have to be redefined to be concise and Python-compatible. There are several ways to do this:
    - Define a `SQLStatementForPython` struct in `pkg/ir` and wrap it into a Python module using `pygo` or `cgo`; use JSON to serialize in Go and deserialize in Python
    - (Preferred) Use a protobuf message to redefine the IR data structure (see the sketch after this list)
    - In the generated Python code, pass the fields as arguments to the Python API
  - Remove the feature derivation-related code, because the feature derivation module will be implemented in Python
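To make the preferred protobuf option concrete, here is a minimal sketch of how the Python side could deserialize the IR. The module name `sql_statement_pb2` follows the package tree later in this proposal, but the proto file itself is yet to be written and the field layout is still open:

```python
import sys

# sql_statement_pb2 would be generated by protoc from a (hypothetical)
# sql_statement.proto that redefines the IR.
from sqlflow import sql_statement_pb2


def read_statement():
    """Deserialize the SQLStatement message that the Go side writes to stdin."""
    stmt = sql_statement_pb2.SQLStatement()
    # Python 3; on Python 2 use sys.stdin.read() instead.
    stmt.ParseFromString(sys.stdin.buffer.read())
    return stmt
```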
### Modules to be moved into Python

- Feature Derivation
  - Depends on the wrapper of `godriver`s
  - Depends on `contracts`
  - Implement the basic feature derivation logic and columns in Python (`features.py` and the `columns` package), calling the `godriver` wrapper to get database table field metadata
  - Modify `ir_generator.go` to forward the column functions in a SQL statement to Python function calls. For an oversimplified example, `SELECT * FROM iris_train TO TRAIN ... COLUMN numeric_column(sepal_width) ...` will map to the Python code:

    ```python
    # In features.py
    def get_columns(column_clause):
        """column_clause is a list of str:
        ['column_func(field_name, *user_defined_args)']"""
        feature_columns = {}
        for column_str in column_clause:
            # The call to numeric_column(sepal_width) should be protected
            # by `contracts` (#2235).
            name, column = eval(column_str)
            feature_columns[name] = column
        return feature_columns


    # In columns/tf_columns.py
    def numeric_column(name, *args):
        value = get_fieldsvalue(name)
        contracts.check_requirements_for_existed(tf.numeric_column, *value)
        return tf.numeric_column(*value)
    ```
- Codegen
  - Depends on the migration of feature derivation (the `SQLStatementForPython` struct)
  - There is no codegen in the new architecture; we only need a `func` like:

    ```go
    package executor

    func execute(stmt SQLStatementForPython) {
        program := `
    import sys

    import sqlflow

    stmt = sqlflow.SQLStatementForPython()
    stmt.ParseFrom(sys.stdin)
    sqlflow.execute(stmt)
    `
        // Write the above program to a "main.py" file and create a cmd.
        cmd := exec.Command("python", "main.py")
        cmd.Stdin = stmt.Serialize()
        cmd.Run()
    }
    ```

  - Remove the `codegen` package after the refactoring
- Attribute checker and diagnostics
  - It's a direct dependency of feature derivation and the Python API
  - Partly solved by `contracts.py` in "Support contracts based argument checking" #2235 (see the sketch after this list)
- The visitor pattern and submitters
  - Depends on the Python API
  - Move all submitters to Python; double dispatch can be implemented simply in a dynamically typed language like Python. For a simplified example:

    ```python
    import importlib
    import os

    # The original SQLFLOW_submitter environment variable.
    platform = os.getenv("SQLFLOW_PLATFORM")

    # Look up, e.g., `pai.train` dynamically and call it.
    submit = getattr(importlib.import_module(platform), stmt.stmt_type)
    submit(sql=stmt.StandardSQL, column=column, model_params=model_params)
    ```
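For context, a rough sketch of the kind of guarded call that `contracts.py` enables. `check_requirements_for_existed` is the name used in the feature derivation example above, while `checked_call`, the exception type, and `SQLFlowDiagnostic` are hypothetical names for illustration only:

```python
import contracts    # from #2235
import diagnostics  # see the package layout below


def checked_call(func, *args):
    """Validate arguments against func's contract before calling it, turning
    violations into uniform diagnostics that the Go side can regex."""
    try:
        contracts.check_requirements_for_existed(func, *args)
    except ValueError as e:  # the actual exception type may differ
        raise diagnostics.SQLFlowDiagnostic(str(e))  # hypothetical error type
    return func(*args)
```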
### Other Go packages

- Remove `sqlfs`, `model` and `verifier`, because they'll have their Python counterparts (probably as functions or classes in a module)
- Remove `pkg/sql/codegen` as described above
- Move `database` into a separate repo to generate the Python module
- Move `tablewriter` into `step`
- Rename `pkg/sql` to `pkg/executor`, because the `executor.go` would spawn the Python process to execute the statement (not `runner`, because that may be mistaken for the implementation of `TO RUN`)
- Move `pipe` and `log` to a new directory named `utils`
- Supposed Go packages:

  ```text
  pkg
  |- ir
  |- utils
  |- parser
  |- proto
  |- server
  |- executor
  |- step
  |- workflow
  ```
### New Python modules that have to be implemented from scratch
- Python API
- Supposed packages and modules:

  ```text
  sqlflow
  |- platform               # files under this package must implement the same set of
  |  |                      # functions (`train`, `evaluate`, `explain`, `predict` and
  |  |                      # `solve`) with the same signatures (`train(stmt: SQLStatement)`)
  |  |- pai.py              # pai.train, pai.explain, ... submit a script to the platform
  |  |- alisa.py            # alisa.train, alisa.explain, ... submit a script to the platform
  |  |- pai_entry.tmpl      # a template Python script to be submitted to the pai/alisa platform;
  |  |                      # calls sqlflow_submitter.xgboost or sqlflow_submitter.tensorflow
  |  |- default.py          # default.train, default.explain, ... call
  |  |                      # sqlflow_submitter.xgboost or sqlflow_submitter.tensorflow
  |  |- ...
  |- columns
  |  |- column_base.py
  |  |- xgboost_columns.py
  |  |- tf_columns.py
  |  |- ...
  |- sqlflow.py             # sqlflow.execute depends on platform and features.py
  |- client.py              # a client with sqlflow.train, sqlflow.explain, forwarding a
  |                         # statement/program to sqlflow_server
  |- features.py            # depends on godriver, contracts.py and columns
  |- sql_statement_pb2.py
  |- _dbdriver.so
  |- sqlflow_submitter      # the same package we already have
  |  |- ...                 # omitted
  |- contracts.py           # type and contract checking for model classes and feature columns
  |- diagnostics.py         # generate uniform diagnostic messages to be regexed by the Go parts
  ```
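To illustrate the uniform interface required of every module under `platform`, here is a minimal hypothetical sketch of `default.py`. The delegation targets come from the tree's comments, but the dispatch condition and attribute names are assumptions:

```python
# sqlflow/platform/default.py (hypothetical sketch)
# Every platform module exposes the same five functions with the same
# signature, so sqlflow.py can dispatch to any platform uniformly.
from sqlflow_submitter import tensorflow, xgboost


def train(stmt):
    """Run training locally; stmt is a SQLStatement message."""
    # Choose the engine the statement asks for; attribute names are assumed.
    if stmt.model_type == "xgboost":
        xgboost.train(stmt)
    else:
        tensorflow.train(stmt)


def evaluate(stmt):
    pass  # analogous dispatch for EVALUATE


def explain(stmt):
    pass  # analogous dispatch for EXPLAIN


def predict(stmt):
    pass  # analogous dispatch for PREDICT


def solve(stmt):
    pass  # analogous dispatch for SOLVE
```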
### Priority

Let's sum up.

- Python
  - Wrapping `godriver`
    - Doesn't depend on other work in this list
    - Depended on by feature derivation
  - `contracts` and `diagnostics`
    - Doesn't depend on other work in this list
    - Depended on by feature derivation and `sqlflow_submitter`
  - Feature derivation (`features` and `columns`)
    - Depends on `contracts`, `diagnostics` and `godriver`
    - Depends on the definition of the pb message `SQLStatement` and the modification of `ir.go` and `ir_generator.go`
    - Depended on by all other modules in the `sqlflow` package
  - `platform` and `sqlflow.py`
    - Depend on feature derivation
    - Depended on by the removal of `pkg/sql/codegen` and the definition of `pkg/executor`
  - `sqlflow_submitter`
    - Depends on `contracts` and `diagnostics`
    - Only minimal modification
- Go
  - `ir.go` and `ir_generator.go`
    - Do anything the Python feature derivation needs
  - Removing codegen and defining `pkg/executor`
    - Depends on `platform` and `sqlflow.py`
  - Removing submitters/executors
    - Depends on `platform` and `sqlflow.py`
## Supplementary notes on the new architecture in several scenarios

### The PAI platform

- `sqlflow/platform/pai.py` copies the whole package and an `entry.py` to PAI using `odpscmd` or `alisa`
- `entry.py` calls `sqlflow.execute`, as `sqlflow/pkg/execute` does
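A minimal sketch of what `pai.py`'s submission path might look like; `odpscmd` comes from the note above, but the packaging helper and the `pai` command flags are illustrative only:

```python
# sqlflow/platform/pai.py (hypothetical sketch)
import subprocess


def _package_entry(stmt):
    """Bundle the sqlflow package plus an entry.py rendered from
    pai_entry.tmpl; returns the archive path (stub for this sketch)."""
    return "/tmp/sqlflow_entry.tar.gz"


def train(stmt):
    archive = _package_entry(stmt)
    # Submit the archive to PAI; the exact pai command flags are illustrative.
    subprocess.check_call(
        ["odpscmd", "-e", "pai -name tensorflow -Dscript='%s';" % archive])
```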
### The Workflow

- `sqlflow_server` generates the workflow as in the original architecture; the `sqlflow` Python package is used by the `step` binary

@brightcoder01 @typhoonzero Please review this design.
---

**Yancey0623** commented on May 19, 2020
After talking with @shendiaomo, we can refactor the workflow codebase:

- Move the couler codegen to the Python API so that we only need ONE code generator.
- Keep the `Run` and `Fetch` gRPC interfaces. `Run` generates an Argo workflow and returns a workflow ID via the following SQLFlow Python API, so we don't need the couler codegen here: `sqlflow.execute(engine="couler", ir=...)` would call the couler/fluid API to generate a workflow YAML and submit it to Kubernetes.
- We don't need the `step` Go binary; each step can execute a Python API call like the sketch below.
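The original comment is cut off here; a plausible sketch of such a step call, reusing the `sqlflow.execute(engine=..., ir=...)` signature shown above (the `"default"` engine value and the `stmt` variable are assumptions):

```python
import sqlflow

# Each workflow step runs one statement's IR through the Python API
# instead of shelling out to the `step` Go binary.
sqlflow.execute(engine="default", ir=stmt)
```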