
Refactory Discussion 2: the python modules/packages and the go pkgs after refactory #2287


Description

@shendiaomo (Collaborator)

Background

In #2168 we figured out the direction of migrating the SQLFlow core from golang to python.
In #2214 and #2235 we've proven that attribute checking can be implemented in python in a precise and concise way.
In #2225 we've made it clear how to reuse the existing golang DB code (especially goalisa) to avoid unnecessary rewriting.

Plan

Python Version

At the moment, we should be compatible with both python2 and python3 in most of the modules.
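
For instance (an assumption about tooling, not a decision made here), a common way to keep a module runnable under both versions is to rely on __future__ imports and the six library:

    # at the top of each module that must run under python2 and python3
    from __future__ import absolute_import, division, print_function

    import six  # helpers such as six.string_types and six.moves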

Style

We follow the Google Style Guide

Preparation

We have to define a command-line flag --python_refactory and use it to make sure the existing code keeps working until we finish the refactoring.

  • A tiny modification can be implemented as:
        // In some function
        if !flags.python_refactory {
            // the existing logic
        } else {
            // the new logic
        }
    • After the refactoring, we remove the flag and all the legacy code
  • Big changes can be written as:
        func generateTrainStmt() { /* ... */ }
        func generateTrainStmtForPython() { /* ... */ }
        // ...
        type SQLStatement struct{}
        type SQLStatementForPython struct{}
    or even
        pkg/ir/
        pkg/ir2/
    if the change is big enough.
    • After the refactoring, we remove the flag and all the legacy code, and rename the pkgs, funcs and structs by removing the ForPython suffix.

Modules kept in golang

  • Parser (no modification)
  • Workflow (no modification)
  • godrivers and pkg/database
  • IR Generator
    • The data structures implementing the SQLStatement interface, such as TrainStmt, have to be redefined to be concise and python-compatible. There are several ways to do this:
      1. Define a SQLStatementForPython struct in pkg/ir and wrap it into a python module using pygo or cgo; use json to serialize in go and deserialize in python
      2. (Preferred) Use a protobuf message to redefine the IR data structure (see the python sketch after this list)
      3. In the generated python code, pass the fields as arguments to the python API
    • Remove the feature-derivation-related code, because the feature derivation module will be reimplemented in python
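
    A minimal python-side sketch of option 2 (names such as sql_statement_pb2 are assumptions that match the package layout proposed below, not settled APIs):

      # sketch: deserialize the SQLStatement IR that the go process writes to our stdin
      import sys

      from sqlflow import sql_statement_pb2  # generated by protoc from the proposed message

      def load_stmt():
          data = getattr(sys.stdin, "buffer", sys.stdin).read()  # bytes on python3, str on python2
          stmt = sql_statement_pb2.SQLStatement()
          stmt.ParseFromString(data)
          return stmt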

Modules to be moved into python

  • Feature Derivation

    • Depends on the wrapper of godrivers
    • Depends on contracts
    • Implement the basic feature derivation logic and columns in python (features.py and columns package), calling the godriver wrapper to get database table field metadata
    • Modify ir_generator.go to forward the column functions in a SQL statement to python function calls.
      For an oversimplified example:
      select * from iris_train TO TRAIN ... COLUMN numeric_column(sepal_width) ...
      will map to the python code
      # In features.py
      def get_columns(column_clause):
          '''column_clause is a list of str: ['column_func(field_name, *user_defined_args)']'''
          feature_columns = {}
          for column_str in column_clause:
              # the call to numeric_column(sepal_width) should be protected by `contracts` (#2235)
              name, column = eval(column_str)  # each column function returns a (name, column) pair
              feature_columns[name] = column
          return feature_columns

      # In columns/tf_columns.py
      import tensorflow as tf  # plus the godriver wrapper and the contracts module

      def numeric_column(name, *args):
          value = get_fieldsvalue(name)  # read the field metadata through the godriver wrapper
          contracts.check_requirements_for_existed(tf.feature_column.numeric_column, *value)
          return name, tf.feature_column.numeric_column(*value)
      @brightcoder01 @typhoonzero Please review this design.
  • Codegen

    • Depends on the migration of feature derivation (the SQLStatementForPython struct).
    • There is no codegen in the new architecture; we only need a func like:
    package executor

    const mainPy = `
import sqlflow
import sys

stmt = sqlflow.SQLStatementForPython()
stmt.ParseFrom(sys.stdin)
sqlflow.execute(stmt)
`

    func execute(stmt SQLStatementForPython) {
        // ... write the above to a "main.py" file and create a cmd
        cmd := exec.Command("python", "main.py")
        cmd.Stdin = bytes.NewReader(stmt.Serialize()) // assuming Serialize returns the encoded bytes
        cmd.Run()
    }
    • Remove the codegen package after the refactoring
  • Attribute checker and diagnostics

  • the visitor pattern and submitters

    • Depends on Python API
    • Move all submitters to python; the double dispatching can be implemented simply in a dynamically typed language like python (an eval-free variant is sketched after this list). For a simplified example:
    platform = os.getenv("SQLFLOW_PLATFORM")   # the original SQLFLOW_submitter environment variable
    eval(f'{platform}.{stmt.stmt_type}')(sql=stmt.StandardSQL, column=column, model_params=model_params, ...)
    # The above calls `pai.train`, for example.
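
    For reference, a minimal eval-free sketch of the same dispatch, assuming the sqlflow/platform layout proposed below (module and attribute names are tentative):

      import importlib
      import os

      def dispatch(stmt, **kwargs):
          # e.g. SQLFLOW_PLATFORM=pai and stmt.stmt_type == "train" resolves to platform/pai.py's train
          platform = os.getenv("SQLFLOW_PLATFORM", "default")
          module = importlib.import_module("sqlflow.platform." + platform)
          return getattr(module, stmt.stmt_type)(**kwargs)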

Other go packages

  • Remove sqlfs, model and verifier because they'll have their python counterparts (probably as functions or classes in a module)
  • Remove pkg/sql/codegen as described above
  • Move database into a separate repo to generate the python module
  • Move tablewriter into step
  • Rename pkg/sql to pkg/executor because the executor.go would spawn the python process to execute the statement (not runner because it may be mistaken as the implementation of TO RUN)
  • Move pipe and log to a new directory named utils
  • Supposed go packages
pkg
   |- ir
   |- utils
   |- parser
   |- proto
   |- server
   |- executor
   |- step
   |- workflow

New python modules that have to be implemented from scratch

  • Python API
    • Supposed packages and modules
sqlflow
    |- platform  # files under this package must implement the same set of functions (`train`, `evaluate`, `explain`, `predict` and `solve`) with the same signatures, e.g. `train(stmt: SQLStatement)` (see the sketch below)
        |- pai.py  # pai.train, pai.explain... submit a script to the platform
        |- alisa.py  # alisa.train, alisa.explain... submit a script to the platform
        |- pai_entry.tmpl   # a template python script to be submitted to the pai/alisa platform, calls sqlflow_submitter.xgboost or sqlflow_submitter.tensorflow
        |- default.py  # default.train, default.explain... call sqlflow_submitter.xgboost or sqlflow_submitter.tensorflow
        |- ...
    |- columns
        |- column_base.py
        |- xgboost_columns.py
        |- tf_columns.py
        |- ...
    |- sqlflow.py  # sqlflow.execute depends on platform and feature.py
    |- client.py  # a client with sqlflow.train, sqlflow.explain, forwarding a statement/program to sqlflow_server 
    |- features.py  # depends on godriver and contracts.py and columns
    |- sql_statement_pb2.py
    |- _dbdriver.so
    |- sqlflow_submitter  # the same package we already have
        |- ...  # omitted
    |- contracts.py  # type and contract checking for model classes and feature columns
    |- diagnostics.py  # generate uniform diagnostic messages to be regexed by the go parts
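
As a reference for the comment on platform above, a minimal sketch (an assumption, not a finalized API) of the uniform interface every platform module would implement, e.g. default.py:

    # sqlflow/platform/default.py (sketch; bodies omitted)
    def train(stmt):
        '''stmt is the parsed SQLStatement.'''
        pass  # call sqlflow_submitter.tensorflow or sqlflow_submitter.xgboost here

    def evaluate(stmt):
        pass

    def explain(stmt):
        pass

    def predict(stmt):
        pass

    def solve(stmt):
        pass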

Priority

Let's sum up.

  1. Python
    • Wrapping godriver
      • Doesn't depend on other work in this list
      • Depended on by feature derivation
    • contracts and diagnostics
      • Don't depend on other work in this list
      • Depended on by feature derivation and sqlflow_submitter
    • feature derivation (features and columns)
      • Depends on contracts, diagnostics and godriver
      • Depends on the definition of the pb message SQLStatement and the modification of ir.go and ir_generator.go
      • Depended on by all other modules in the sqlflow package
    • platforms and sqlflow.py
      • Depend on feature derivation
      • Depended on by the removal of pkg/sql/codegen and the definition of pkg/executor
    • sqlflow_submitter
      • Depends on contracts and diagnostics
      • Only minimal modifications are needed
  2. Golang
    • ir.go and ir_generator.go
      • Do anything the python feature derivation needs
    • Removing codegen and defining pkg/executor
      • Depends on the completion of platform and sqlflow.py
    • Removing submitter/executors
      • Depends on the completion of platform and sqlflow.py
  3. Supplementary notes on how the new architecture handles several scenarios

    The PAI platform

    1. sqlflow/platform/pai.py copies the whole sqlflow package and an entry.py to PAI using odpscmd or alisa
    2. entry.py calls sqlflow.execute, just as sqlflow/pkg/executor does.

    The Workflow

    • sqlflow_server generates the workflow as in the original architecture; the sqlflow python package is used by the step binary.

    Activity

    The title was changed from "Refactory Discussion 1: code generator and Python API" to "Refactory Discussion 2: the python modules/packages and the go pkgs after refactory" on May 18, 2020.

    Yancey0623 (Collaborator) commented on May 19, 2020

    After talking with @shendiaomo, we can refactor the workflow codebase:

    1. Move couler codegen to Python API so that we only need ONE code generator.

      1. Keep the Run and Fetch gRPC interfaces.
      2. Run generates an Argo workflow and returns a workflowID via the following SQLFlow Python API call; here we don't need the couler codegen.
         sqlflow.execute(engine="couler", ir=[{parsed result of sql1}, {parsed result of sql2}...])
      3. sqlflow.execute(engine="couler", ir=...) would call couler/fluid API to generate a workflow YAML, and submit it to Kubernetes.
    2. We don't need the step go binary file; each step can execute a Python API call like the following (a fuller sketch of sqlflow.execute appears below):

      sqlflow.execute(engine="pai", ir={parsed result of sql})
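
    To make the proposal concrete, here is a minimal sketch of how sqlflow.execute could branch on the engine argument; this is an assumption about the eventual API, and _submit_workflow / _run_statement are hypothetical placeholders:

      # sqlflow/sqlflow.py (sketch)
      import importlib

      def _submit_workflow(ir):
          pass  # call the couler/fluid API, submit the YAML to Kubernetes, return a workflow ID

      def _run_statement(engine, stmt):
          module = importlib.import_module("sqlflow.platform." + engine)
          getattr(module, stmt.stmt_type)(stmt)  # e.g. pai.train(stmt)

      def execute(engine, ir):
          '''engine: "couler" builds and submits a whole workflow;
          other values ("pai", "alisa", "default", ...) run the given statement(s) directly.'''
          if engine == "couler":
              return _submit_workflow(ir)
          for stmt in (ir if isinstance(ir, list) else [ir]):
              _run_statement(engine, stmt)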