The SQLFlow TO RUN clause calls a Python function to do data processing or computing, for example using TSFresh to extract features from time series data.
- If the data is small, the Python function can be executed in a single process.
- If the data is large-scale, we need to execute the Python computation in a distributed way.
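The two cases above can be sketched with the standard library alone. This is a minimal illustration, not SQLFlow code: it uses neither TSFresh, Dask, nor Mars, and the names `series_features`, `extract_features`, and `map_fn` are hypothetical. It assumes the input is a list of (series id, value) rows already ordered by time. The point is that the per-series computation stays the same and only the map step is swapped.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def series_features(values):
    """Compute a few summary features for one time series (hypothetical helper)."""
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def extract_features(rows, map_fn=map):
    """Group (series_id, value) rows by id and compute features per series.

    map_fn is the only part that differs between the two cases: the
    built-in map runs everything in the calling process, while a parallel
    map such as Executor.map spreads the per-series work over workers.
    """
    grouped = defaultdict(list)
    for series_id, value in rows:
        grouped[series_id].append(value)
    ids = list(grouped)
    results = map_fn(series_features, [grouped[i] for i in ids])
    return dict(zip(ids, results))


rows = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 2.0), ("b", 5.0)]

# Small data: plain map in a single process.
small = extract_features(rows)

# Larger data: same function with a parallel map. A local thread pool
# stands in here for a real cluster scheduler.
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = extract_features(rows, map_fn=pool.map)
```

Partitioning by series id keeps each task independent, which is the property a distributed backend needs in order to schedule the work across machines.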
There are two options for large-scale data computing with Python:
- Dask
  - MaxCompute support: ×
  - Kubernetes support: √ link
  - TSFresh integration: Dask is the officially supported distributed computing backend for TSFresh. link
- Mars
  - MaxCompute support: √
  - Kubernetes support: √ link. When verifying with Minikube, downloading the image is too slow; as a first step, we can build the image on our Mac. link
  - TSFresh integration: no pre-made solution; this needs development from the Mars team. issue

A comparison between Dask and Mars: issue
- Ray?