
Refactory Discussion 2: the python modules/packages and the go pkgs after refactory #2287


Description

@shendiaomo (Collaborator)

Background

In #2168 we figured out the direction of migrating the SQLFlow core from golang to python.
In #2214 and #2235 we've proven that attribute checking can be implemented in python in a precise and concise way.
In #2225 we've made it clear how to reuse the existing golang DB code (especially goalisa) to avoid unnecessary rewriting.

Plan

Python Version

At the moment, we should be compatible with both python2 and python3 in most of the modules.
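
For instance (an assumption about tooling, not a decision made here), a common way to keep a module runnable under both versions is to rely on __future__ imports and the six library:

    # at the top of each module that must run under python2 and python3
    from __future__ import absolute_import, division, print_function

    import six  # helpers such as six.string_types and six.moves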

Style

We follow the Google Style Guide

Preparation

We have to define a command-line flag --python_refactory and use it to make sure the existing code keeps working until we finish the refactoring.

  • A tiny modification can be implemented as:
        // In some function
        if !flags.python_refactory {
            // the existing logic
        } else {
            // the new logic
        }
    • After the refactoring, we remove the flag and all the legacy code
  • Big changes can be written as:
        func generateTrainStmt() { /* ... */ }
        func generateTrainStmtForPython() { /* ... */ }
        // ...
        type SQLStatement struct{}
        type SQLStatementForPython struct{}
    or even
        pkg/ir/
        pkg/ir2/
    if the change is big enough.
    • After the refactoring, we remove the flag and all the legacy code, and rename the pkgs, funcs and structs by removing the ForPython suffix.

Modules kept in golang

  • Parser (no modification)
  • Workflow (no modification)
  • godrivers and pkg/database
  • IR Generator
    • The data structures implementing the SQLStatement interface, such as TrainStmt, have to be redefined to be concise and python-compatible. There are several ways to do this:
      1. Define a SQLStatementForPython struct in pkg/ir and wrap it into a python module using pygo or cgo; use json to serialize in go and deserialize in python
      2. (Preferred) Use a protobuf message to redefine the IR data structure (see the python sketch after this list)
      3. In the generated python code, pass the fields as arguments to the python API
    • Remove the feature-derivation-related code, because the feature derivation module will be reimplemented in python
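
    A minimal python-side sketch of option 2 (names such as sql_statement_pb2 are assumptions that match the package layout proposed below, not settled APIs):

      # sketch: deserialize the SQLStatement IR that the go process writes to our stdin
      import sys

      from sqlflow import sql_statement_pb2  # generated by protoc from the proposed message

      def load_stmt():
          data = getattr(sys.stdin, "buffer", sys.stdin).read()  # bytes on python3, str on python2
          stmt = sql_statement_pb2.SQLStatement()
          stmt.ParseFromString(data)
          return stmt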

Modules to be moved into python

  • Feature Derivation

    • Depends on the wrapper of godrivers
    • Depends on contracts
    • Implement the basic feature derivation logic and columns in python (features.py and columns package), calling the godriver wrapper to get database table field metadata
    • Modify ir_generator.go to forward the column functions in a SQL statement to python function calls.
      For an oversimplified example:
      select * from iris_train TO TRAIN ... COLUMN numeric_column(sepal_width) ...
      will map to the python code
      # In features.py
      def get_columns(column_clause):
          '''column_clause is a list of str: ['column_func(field_name, *user_defined_args)']'''
          feature_columns = {}
          for column_str in column_clause:
              # the call to numeric_column(sepal_width) should be protected by `contracts` (#2235)
              name, column = eval(column_str)  # each column function returns a (name, column) pair
              feature_columns[name] = column
          return feature_columns

      # In columns/tf_columns.py
      import tensorflow as tf  # plus the godriver wrapper and the contracts module

      def numeric_column(name, *args):
          value = get_fieldsvalue(name)  # read the field metadata through the godriver wrapper
          contracts.check_requirements_for_existed(tf.feature_column.numeric_column, *value)
          return name, tf.feature_column.numeric_column(*value)
      @brightcoder01 @typhoonzero Please review this design.
  • Codegen

    • Depends on the migration of feature derivation (the SQLStatementForPython struct).
    • There is no codegen in the new architecture; we only need a func like:
    package executor

    const mainPy = `
import sqlflow
import sys

stmt = sqlflow.SQLStatementForPython()
stmt.ParseFrom(sys.stdin)
sqlflow.execute(stmt)
`

    func execute(stmt SQLStatementForPython) {
        // ... write the above to a "main.py" file and create a cmd
        cmd := exec.Command("python", "main.py")
        cmd.Stdin = bytes.NewReader(stmt.Serialize()) // assuming Serialize returns the encoded bytes
        cmd.Run()
    }
    • Remove the codegen package after the refactoring
  • Attribute checker and diagnostics

  • the visitor pattern and submitters

    • Depends on Python API
    • Move all submitters to python; the double dispatching can be implemented simply in a dynamically typed language like python (an eval-free variant is sketched after this list). For a simplified example:
    platform = os.getenv("SQLFLOW_PLATFORM")   # the original SQLFLOW_submitter environment variable
    eval(f'{platform}.{stmt.stmt_type}')(sql=stmt.StandardSQL, column=column, model_params=model_params, ...)
    # The above calls `pai.train`, for example.
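
    For reference, a minimal eval-free sketch of the same dispatch, assuming the sqlflow/platform layout proposed below (module and attribute names are tentative):

      import importlib
      import os

      def dispatch(stmt, **kwargs):
          # e.g. SQLFLOW_PLATFORM=pai and stmt.stmt_type == "train" resolves to platform/pai.py's train
          platform = os.getenv("SQLFLOW_PLATFORM", "default")
          module = importlib.import_module("sqlflow.platform." + platform)
          return getattr(module, stmt.stmt_type)(**kwargs)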

Other go packages

  • Remove sqlfs, model and verifier because they'll have their python counterparts (probably as functions or classes in a module)
  • Remove pkg/sql/codegen as described above
  • Move database into a separate repo to generate the python module
  • Move tablewriter into step
  • Rename pkg/sql to pkg/executor because the executor.go would spawn the python process to execute the statement (not runner because it may be mistaken as the implementation of TO RUN)
  • Move pipe and log to a new directory named utils
  • Supposed go packages
pkg
   |- ir
   |- utils
   |- parser
   |- proto
   |- server
   |- executor
   |- step
   |- workflow

New python modules that have to be implemented from scratch

  • Python API
    • Supposed packages and modules
sqlflow
    |- platform  # files under this package must implement the same set of functions (`train`, `evaluate`, `explain`, `predict` and `solve`) with the same signatures, e.g. `train(stmt: SQLStatement)` (see the sketch below)
        |- pai.py  # pai.train, pai.explain... submit a script to the platform
        |- alisa.py  # alisa.train, alisa.explain... submit a script to the platform
        |- pai_entry.tmpl   # a template python script to be submitted to the pai/alisa platform, calls sqlflow_submitter.xgboost or sqlflow_submitter.tensorflow
        |- default.py  # default.train, default.explain... call sqlflow_submitter.xgboost or sqlflow_submitter.tensorflow
        |- ...
    |- columns
        |- column_base.py
        |- xgboost_columns.py
        |- tf_columns.py
        |- ...
    |- sqlflow.py  # sqlflow.execute depends on platform and feature.py
    |- client.py  # a client with sqlflow.train, sqlflow.explain, forwarding a statement/program to sqlflow_server 
    |- features.py  # depends on godriver and contracts.py and columns
    |- sql_statement_pb2.py
    |- _dbdriver.so
    |- sqlflow_submitter  # the same package we already have
        |- ...  # omitted
    |- contracts.py  # type and contract checking for model classes and feature columns
    |- diagnostics.py  # generate uniform diagnostic messages to be regexed by the go parts
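
As a reference for the comment on platform above, a minimal sketch (an assumption, not a finalized API) of the uniform interface every platform module would implement, e.g. default.py:

    # sqlflow/platform/default.py (sketch; bodies omitted)
    def train(stmt):
        '''stmt is the parsed SQLStatement.'''
        pass  # call sqlflow_submitter.tensorflow or sqlflow_submitter.xgboost here

    def evaluate(stmt):
        pass

    def explain(stmt):
        pass

    def predict(stmt):
        pass

    def solve(stmt):
        pass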

Priority

Let's sum up.

  1. Python
    • Wrapping godriver
      • Doesn't depend on other work in this list
      • Depended on by feature derivation
    • contracts and diagnostics
      • Don't depend on other work in this list
      • Depended on by feature derivation and sqlflow_submitter
    • feature derivation (features and columns)
      • Depends on contracts, diagnostics and godriver
      • Depends on the definition of the pb message SQLStatement and the modification of ir.go and ir_generator.go
      • Depended on by all other modules in the sqlflow package
    • platforms and sqlflow.py
      • Depend on feature derivation
      • Depended on by the removal of pkg/sql/codegen and the definition of pkg/executor
    • sqlflow_submitter
      • Depends on contracts and diagnostics
      • Only minimal modifications are needed
  2. Golang
    • ir.go and ir_generator.go
      • Do anything the python feature derivation needs
    • Removing codegen and defining pkg/executor
      • Depends on the completion of platform and sqlflow.py
    • Removing submitter/executors
      • Depends on the completion of platform and sqlflow.py
  3. Supplementary notes on how the new architecture handles several scenarios

    The PAI platform

    1. sqlflow/platform/pai.py copies the whole sqlflow package and an entry.py to PAI using odpscmd or alisa
    2. entry.py calls sqlflow.execute, just as sqlflow/pkg/executor does.

    The Workflow

    • sqlflow_server generates the workflow as in the original architecture; the sqlflow python package is used by the step binary.

    Activity

    The title was changed from "Refactory Discussion 1: code generator and Python API" to "Refactory Discussion 2: the python modules/packages and the go pkgs after refactory" on May 18, 2020.

    Yancey0623 (Collaborator) commented on May 19, 2020

    After talking with @shendiaomo, we can refactor the workflow codebase:

    1. Move couler codegen to Python API so that we only need ONE code generator.

      1. Keep the Run and Fetch gRPC interfaces.
      2. Run generates an Argo workflow and returns a workflowID via the following SQLFlow Python API call; here we don't need the couler codegen.
         sqlflow.execute(engine="couler", ir=[{parsed result of sql1}, {parsed result of sql2}...])
      3. sqlflow.execute(engine="couler", ir=...) would call couler/fluid API to generate a workflow YAML, and submit it to Kubernetes.
    2. We don't need the step go binary file; each step can execute a Python API call like the following (a fuller sketch of sqlflow.execute appears below):

      sqlflow.execute(engine="pai", ir={parsed result of sql})
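
    To make the proposal concrete, here is a minimal sketch of how sqlflow.execute could branch on the engine argument; this is an assumption about the eventual API, and _submit_workflow / _run_statement are hypothetical placeholders:

      # sqlflow/sqlflow.py (sketch)
      import importlib

      def _submit_workflow(ir):
          pass  # call the couler/fluid API, submit the YAML to Kubernetes, return a workflow ID

      def _run_statement(engine, stmt):
          module = importlib.import_module("sqlflow.platform." + engine)
          getattr(module, stmt.stmt_type)(stmt)  # e.g. pai.train(stmt)

      def execute(engine, ir):
          '''engine: "couler" builds and submits a whole workflow;
          other values ("pai", "alisa", "default", ...) run the given statement(s) directly.'''
          if engine == "couler":
              return _submit_workflow(ir)
          for stmt in (ir if isinstance(ir, list) else [ir]):
              _run_statement(engine, stmt)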