The root discussion issue for Data analysis and transform is here.
## Transformation Requirements
XGBoost models only accept numerical values as input.

For numerical input, we don't need transformations such as normalization/standardization.

For categorical input, we need to convert it into numerical values before feeding it to the model. The typical transformations are: convert to a one-hot vector, convert to an integer id by looking up a vocabulary, and hash to a bucket id. We can use the ONEHOT, VOCABULARIZE, and HASH transform functions in the SQLFlow COLUMN clause.
```sql
SELECT cont_1, cont_2, cont_3, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN xgboost
COLUMN VOCABULARIZE(cat_1), HASH(cat_2), ONEHOT(cat_3)
INTO my_model
```
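The three transform functions can be sketched in plain Python. This is only an illustration of the intended semantics; the function names `onehot`, `vocabularize`, and `hash_bucket` and the MD5-based hashing are assumptions, not SQLFlow's actual implementation.

```python
import hashlib

def onehot(value, categories):
    # Map a category to a one-hot vector over a fixed category list.
    return [1 if value == c else 0 for c in categories]

def vocabularize(value, vocabulary):
    # Map a string to an integer id via a vocabulary lookup; 0 means unknown.
    return vocabulary.get(value, 0)

def hash_bucket(value, num_buckets):
    # Map a string to a bucket id with a stable (non-random) hash.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

vocab = {"apple": 1, "banana": 2}
print(onehot("banana", ["apple", "banana", "pear"]))  # [0, 1, 0]
print(vocabularize("banana", vocab))                  # 2
print(hash_bucket("banana", 10))                      # a stable bucket id in [0, 10)
```

Note that `hash_bucket` trades possible collisions for not needing a vocabulary scan, which is why it suits high-cardinality columns.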
Let's take the VOCABULARIZE function as an example. The workflow contains two steps:

- Do data analysis on the string-typed column to get the vocabulary.
- Do the transformation of mapping string to id at the training stage.
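The two steps above can be sketched as follows. This is a minimal illustration; the sample values and the convention of reserving id 0 for unknown strings are assumptions, not a fixed SQLFlow design.

```python
# Step 1: data analysis -- scan the string column and build the vocabulary.
rows = ["beijing", "shanghai", "beijing", "shenzhen"]
vocabulary = {}
for value in rows:
    if value not in vocabulary:
        vocabulary[value] = len(vocabulary) + 1  # reserve 0 for unknown strings

# Step 2: training-stage transform -- map each string to its integer id.
ids = [vocabulary.get(v, 0) for v in rows]

print(vocabulary)  # {'beijing': 1, 'shanghai': 2, 'shenzhen': 3}
print(ids)         # [1, 2, 1, 3]
```

In a real deployment, step 1 would typically run as a SQL aggregation (e.g. `SELECT DISTINCT`) over the source table rather than an in-memory loop.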
## Discussion Points
- How do we execute the string-to-id mapping transformation?
  - Do this transformation while generating the DMatrix from the source table. For example: load the source data as a numpy array -> do the transformation on the numpy array -> build a DMatrix from the preprocessed numpy array -> feed the DMatrix to XGBoost for training. Please check the official document.
  - Do this transformation in each iteration of XGBoost training. Does XGBoost have this flexibility? To be investigated.
  - Build a sklearn pipeline that contains both the transformation and XGBoost training. sample video.
- How do we save the transform logic together with the model?
  - Export both the model and the transform logic together to PMML. We can try to use a sklearn pipeline to compose the preprocessing logic and the XGBoost model, and then use sklearn2pmml to export the transformation and the model to PMML.
- Distributed data preprocessing and XGBoost training?
  - [Need Investigation] Can we use a sklearn pipeline for XGBoost distributed training?
  - ✗ We can investigate Spark Pipeline and XGBoost on Spark. (Heavy dependency)