COLUMN clause for XGBoost #2190

Open
@brightcoder01

The root discussion issue for Data analysis and transform is here.

Transformation Requirements
XGBoost models only accept numerical values as input.
For numerical input, we don't need transformations such as normalization or standardization (see the linked issue).
For categorical input, we need to convert it to numerical values before feeding it to the model. The typical transformations are converting to a one-hot vector, converting to an integer id by vocabulary lookup, and hashing. We can use the ONEHOT, VOCABULARIZE, and HASH transform functions in the SQLFlow COLUMN clause.

SELECT cont_1, cont_2, cont_3, cat_1, cat_2, cat_3
FROM source_table
TO TRAIN xgboost
COLUMN VOCABULARIZE(cat_1), HASH(cat_2), ONEHOT(cat_3)
INTO my_model
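
To make the HASH and ONEHOT cases concrete, here is a minimal Python sketch of what such transforms might compute. The helper names `hash_bucket` and `one_hot` and the bucket count are hypothetical, and the exact semantics of the SQLFlow functions are an assumption here, not something this issue has settled.

```python
import hashlib


def hash_bucket(value, num_buckets):
    """Map a string to a stable integer id in [0, num_buckets).

    Hypothetical stand-in for HASH(column); the bucket count is an assumption.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def one_hot(index, depth):
    """Turn an integer id into a one-hot vector of length `depth`.

    Hypothetical stand-in for ONEHOT(column) applied to an id column.
    """
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec


# Toy values standing in for cat_2 and cat_3 in the statement above.
cat_2_ids = [hash_bucket(v, num_buckets=8) for v in ["chrome", "firefox", "chrome"]]
cat_3_vec = one_hot(1, depth=3)
```

The sketch uses `hashlib` rather than Python's built-in `hash()` because string hashes from `hash()` are salted per process by default, so the ids would not be stable between training and prediction.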

Let's take the VOCABULARIZE function as an example. The workflow contains two steps:

  1. Analyze the string-typed column to get the vocabulary.
  2. Map each string to its id at the training stage.
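
The two steps above can be sketched in plain Python as follows. The function names, the sorted id assignment, and the out-of-vocabulary id are all assumptions for illustration, not the SQLFlow implementation.

```python
import json


def build_vocabulary(values):
    """Step 1 (analysis): collect the distinct strings and assign ids.

    Ids follow sorted order so the result is deterministic (an assumption).
    """
    return {token: idx for idx, token in enumerate(sorted(set(values)))}


def vocabularize(values, vocab):
    """Step 2 (transform): map each string to its id.

    Strings missing from the vocabulary get the id len(vocab) (an assumption).
    """
    oov_id = len(vocab)
    return [vocab.get(v, oov_id) for v in values]


# Toy column standing in for cat_1.
cat_1 = ["beijing", "hangzhou", "beijing", "shanghai"]
vocab = build_vocabulary(cat_1)
train_ids = vocabularize(cat_1, vocab)

# The vocabulary has to travel with the model so that prediction
# applies the same mapping; JSON is one simple way to persist it.
serialized = json.dumps(vocab, sort_keys=True)
predict_ids = vocabularize(["shanghai", "shenzhen"], json.loads(serialized))
```

Keeping the analysis result as a plain dictionary makes it easy to store next to the model, which is relevant to the question below about saving the transform logic.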

Discussion Points

  1. How do we execute the string-to-id mapping transformation?
  • Do the transformation while generating the DMatrix from the source table.
    For example: load the source data as a numpy array -> transform the numpy array -> build a DMatrix from the preprocessed array -> feed the DMatrix to XGBoost for training. Please check the official document.
  • Do the transformation in each iteration of XGBoost training.
    Does XGBoost have this flexibility? To be investigated.
  • Build a sklearn pipeline that contains both the transformation and the XGBoost training. See the sample video.
  2. How do we save the transform logic together with the model?
  • Export the model and the transform logic together to PMML.
    We can try to use a sklearn pipeline to compose the preprocessing logic and the XGBoost model, and then use sklearn2pmml to export both to PMML.
  3. How do we do distributed data preprocessing and XGBoost training?
  • [Need Investigation] Can we use a sklearn pipeline for XGBoost distributed training?
  • × We can investigate Spark Pipeline and XGBoost on Spark. (Heavy dependency)
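
As a rough illustration of the first option under the first discussion point (numpy array -> transformation -> DMatrix), the sketch below preprocesses a toy table and hands it to XGBoost. The helper `build_feature_matrix`, the toy rows, the vocabulary, and the training parameters are all hypothetical; the XGBoost import is guarded so the preprocessing part can be read and run on its own.

```python
import numpy as np


def build_feature_matrix(rows, vocab, cat_index):
    """Replace the string column at `cat_index` with vocabulary ids.

    Returns a float32 array that can be passed to xgboost.DMatrix.
    Strings missing from `vocab` get the id len(vocab) (an assumption).
    """
    out = np.empty((len(rows), len(rows[0])), dtype=np.float32)
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if j == cat_index:
                out[i, j] = vocab.get(value, len(vocab))
            else:
                out[i, j] = float(value)
    return out


# Toy rows standing in for (cont_1, cont_2, cat_1) from source_table.
rows = [(0.5, 1.2, "red"), (0.1, 3.4, "blue"), (0.9, 0.7, "red"), (0.3, 2.2, "green")]
labels = np.array([1, 0, 1, 0], dtype=np.float32)
vocab = {"blue": 0, "green": 1, "red": 2}  # result of the analysis step

features = build_feature_matrix(rows, vocab, cat_index=2)

try:
    import xgboost as xgb
except ImportError:  # keep the preprocessing sketch usable without XGBoost
    xgb = None

if xgb is not None:
    # Build the DMatrix from the preprocessed array and train on it.
    dtrain = xgb.DMatrix(features, label=labels)
    params = {"objective": "binary:logistic", "max_depth": 2}
    booster = xgb.train(params, dtrain, num_boost_round=5)
```

With this approach the transformation runs once, before training, so nothing inside the XGBoost training loop needs to change. The same vocabulary would have to be applied again at prediction time.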
