
Distributed data computing in TO RUN clause.  #2297

Open
@brightcoder01


In a SQLFlow TO RUN clause, a Python function is called to do the data processing or computation, for example using TSFresh to extract features from time series data.

  • If the data is small, the Python function can be executed in a single process.
  • If the data is large, we need to execute the Python computation in a distributed way.
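The two cases above can be sketched as a small dispatcher. This is only an illustration, not SQLFlow code: `run_computation`, `extract_features`, and `ROW_THRESHOLD` are hypothetical names, the feature function is a toy stand-in for TSFresh, and a standard-library `Executor` stands in for a real distributed backend such as Dask or Mars.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List, Optional

# Hypothetical cut-off between "small" and "large" data (total number of values).
ROW_THRESHOLD = 1000


def extract_features(series: List[float]) -> Dict[str, float]:
    """Toy stand-in for TSFresh: compute a few summary features of one series."""
    return {"mean": mean(series), "min": min(series), "max": max(series)}


def run_computation(
    groups: Dict[str, List[float]],
    func: Callable[[List[float]], Dict[str, float]],
    executor: Optional[Executor] = None,
    threshold: int = ROW_THRESHOLD,
) -> Dict[str, Dict[str, float]]:
    """Apply func to every series, in-process or through an executor."""
    total = sum(len(values) for values in groups.values())
    if executor is None or total <= threshold:
        # Small data (or no backend available): run in the current process.
        return {key: func(values) for key, values in groups.items()}
    # Large data: hand each series to the executor. A Dask or Mars client
    # would take this role in a real deployment (assumption).
    keys = list(groups)
    results = executor.map(func, (groups[key] for key in keys))
    return dict(zip(keys, results))


if __name__ == "__main__":
    data = {"sensor_a": [1.0, 2.0, 3.0], "sensor_b": [4.0, 6.0]}
    print(run_computation(data, extract_features))
    with ThreadPoolExecutor(max_workers=2) as pool:
        # threshold=0 forces the executor path for demonstration.
        print(run_computation(data, extract_features, executor=pool, threshold=0))
```

Passing the executor in as a parameter keeps the TO RUN function itself unaware of which backend is used, so the choice between Dask and Mars would not need to leak into user code.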

There are two options for large-scale data computing with Python:

  • Dask
    MaxCompute Support: ×
    Kubernetes Support: √ link
    TSFresh integration: Dask is the officially supported distributed computing backend for TSFresh. link

  • Mars
    MaxCompute Support: √
    Kubernetes Support: √ link. When verifying with Minikube, downloading the image is too slow; as a first step, we can build the image locally on a Mac. link
    TSFresh integration: no ready-made solution; it needs development from the Mars team. issue

Comparison between Dask and Mars: issue
