The root discussion issue for Data analysis and transform is here.
## Transformation Requirements
XGBoost models only accept numerical values as input.

For numerical input, we don't need transformations such as normalization/standardization.

For categorical input, we need to convert it into numerical values before feeding it to the model. The typical transformations are: convert to a one-hot vector, convert to an integer id by looking up a vocabulary, and hash to a bucket id. We can use the ONEHOT, VOCABULARIZE, and HASH transform functions in the SQLFlow COLUMN clause.
```sql
SELECT cont_1, cont_2, cont_3, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN xgboost
COLUMN VOCABULARIZE(cat_1), HASH(cat_2), ONEHOT(cat_3)
INTO my_model
```
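The three transform functions can be sketched in plain Python. This is only an illustration of the intended semantics; the function names `onehot`, `vocabularize`, and `hash_bucket` and the MD5-based hashing are assumptions, not SQLFlow's actual implementation.

```python
import hashlib

def onehot(value, categories):
    # Map a category to a one-hot vector over a fixed category list.
    return [1 if value == c else 0 for c in categories]

def vocabularize(value, vocabulary):
    # Map a string to an integer id via a vocabulary lookup; 0 means unknown.
    return vocabulary.get(value, 0)

def hash_bucket(value, num_buckets):
    # Map a string to a bucket id with a stable (non-random) hash.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

vocab = {"apple": 1, "banana": 2}
print(onehot("banana", ["apple", "banana", "pear"]))  # [0, 1, 0]
print(vocabularize("banana", vocab))                  # 2
print(hash_bucket("banana", 10))                      # a stable bucket id in [0, 10)
```

Note that `hash_bucket` trades possible collisions for not needing a vocabulary scan, which is why it suits high-cardinality columns.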
Let's take the VOCABULARIZE function as an example. The workflow contains two steps:

- Do data analysis on the string-typed column to get the vocabulary.
- Do the transformation of mapping string to id at the training stage.
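The two steps above can be sketched as follows. This is a minimal illustration; the sample values and the convention of reserving id 0 for unknown strings are assumptions, not a fixed SQLFlow design.

```python
# Step 1: data analysis -- scan the string column and build the vocabulary.
rows = ["beijing", "shanghai", "beijing", "shenzhen"]
vocabulary = {}
for value in rows:
    if value not in vocabulary:
        vocabulary[value] = len(vocabulary) + 1  # reserve 0 for unknown strings

# Step 2: training-stage transform -- map each string to its integer id.
ids = [vocabulary.get(v, 0) for v in rows]

print(vocabulary)  # {'beijing': 1, 'shanghai': 2, 'shenzhen': 3}
print(ids)         # [1, 2, 1, 3]
```

In a real deployment, step 1 would typically run as a SQL aggregation (e.g. `SELECT DISTINCT`) over the source table rather than an in-memory loop.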
## Discussion Points
- How do we execute the string-to-id mapping transformation?
  - Do this transformation while generating the DMatrix from the source table. For example: load the source data as a numpy array -> do the transformation on the numpy array -> build a DMatrix from the preprocessed numpy array -> feed the DMatrix to XGBoost for training. Please check the official document.
  - Do this transformation in each iteration of XGBoost training. Does XGBoost have this flexibility? To be investigated.
  - Build a sklearn pipeline that contains both the transformation and XGBoost training. sample video.
- How do we save the transform logic together with the model?
  - Export both the model and the transform logic together to PMML. We can try to use a sklearn pipeline to compose the preprocessing logic and the XGBoost model, and then use sklearn2pmml to export the transformation and the model to PMML.
- Distributed data preprocessing and XGBoost training?
  - [Need Investigation] Can we use a sklearn pipeline for XGBoost distributed training?
  - ✗ We can investigate Spark Pipeline and XGBoost on Spark. (Heavy dependency)