Description
Data preprocessing usually involves more than general statistical operations; it also includes the processing steps listed below (see the figure below). Some of these steps are easy to implement in SQL, while others are more difficult. It is important to embed a data preprocessing mechanism into SQLFlow.
The following is one possible data preprocessing workflow.
- Load data: The original data is divided into a training set and a test set. The index column and target column of each set are reserved, and the remaining columns are used as feature columns (x_trainset) for preprocessing.
- Missing and outlier detection:
  - Delete rows: remove duplicate rows.
  - Delete columns: drop a column when its missing data exceeds threshold a. Numeric values are set to null if they fall outside threshold range b.
- Data type identification (mostly implementable with feature columns). There are some processing details; for example, in a categorical field, if the frequency of a value is below a threshold (such as 5%), that value is grouped into an "other" category.
- Null value filling:
  - For numeric variables, null values are usually filled in.
  - Numeric null value filling: rows are grouped by category, where a category is a combination of feature values, such as the combination of "gender" and "car brand". A null value is filled with a value randomly selected from the corresponding group. For example, for a male whose car brand is Volkswagen, the fill value is drawn at random from the male/Volkswagen group.
- Feature scaling and one-hot encoding of categorical data
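
The cleaning steps above can be sketched in pandas as follows. This is only an illustration under assumptions, not a proposed SQLFlow API: the function names `clean_features` and `fill_numeric_by_group`, the parameter names, the default thresholds, and the order of the steps (out-of-range values are nulled before columns are dropped) are all hypothetical. The sketch also assumes the DataFrame has a unique index.

```python
import numpy as np
import pandas as pd


def clean_features(x, missing_threshold=0.5, value_ranges=None, rare_threshold=0.05):
    """Hypothetical helper: de-duplicate rows, null out-of-range values,
    drop sparse columns, and merge rare categories into "other"."""
    value_ranges = value_ranges or {}
    # Delete rows: remove exact duplicate rows.
    x = x.drop_duplicates().copy()
    # Set numeric values outside their allowed range (threshold range b) to null.
    for col, (low, high) in value_ranges.items():
        x[col] = x[col].where(x[col].between(low, high))
    # Delete columns whose fraction of missing values exceeds threshold a.
    missing_fraction = x.isna().mean()
    x = x.loc[:, missing_fraction <= missing_threshold].copy()
    # Merge category values rarer than the threshold into "other".
    for col in x.select_dtypes(exclude="number").columns:
        freq = x[col].value_counts(normalize=True)
        rare = freq[freq < rare_threshold].index
        x[col] = x[col].where(~x[col].isin(rare), "other")
    return x


def fill_numeric_by_group(x, target_col, group_cols, seed=0):
    """Hypothetical helper: fill nulls in target_col with a value drawn at
    random from the observed values of the same feature-combination group."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    for _, group in x.groupby(group_cols):
        observed = group[target_col].dropna().to_numpy()
        missing = group.index[group[target_col].isna()]
        # Groups with no observed value are left unfilled.
        if len(observed) > 0 and len(missing) > 0:
            x.loc[missing, target_col] = rng.choice(observed, size=len(missing))
    return x
```

For example, `fill_numeric_by_group(x_trainset, "price", ["gender", "car_brand"])` would fill a missing `price` for a male Volkswagen owner with a price observed for another male Volkswagen owner; the column names here are placeholders.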
Note:
The test set must be preprocessed in the same way as the training set, using what was determined on the training set. For example, operations such as deleting columns must match those applied to the training set.
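
One way to keep the two sets aligned is to record the preprocessing state on the training set and replay it on the test set. The sketch below does this for feature scaling and one-hot encoding with pandas. The names `fit_preprocessor` and `apply_preprocessor` and the dictionary layout are assumptions made for illustration, and standardization (zero mean, unit variance) is just one possible choice of scaling.

```python
import pandas as pd


def fit_preprocessor(x_train, numeric_cols, categorical_cols):
    """Hypothetical helper: record what the test set must reuse."""
    return {
        # Columns that survived cleaning on the training set.
        "columns": list(x_train.columns),
        "mean": x_train[numeric_cols].mean(),
        # Guard against constant columns so scaling never divides by zero.
        "std": x_train[numeric_cols].std(ddof=0).replace(0, 1.0),
        "categories": {
            col: sorted(x_train[col].dropna().unique()) for col in categorical_cols
        },
    }


def apply_preprocessor(x, state):
    """Hypothetical helper: apply the recorded state to any split."""
    # Keep exactly the training columns, in the training order.
    x = x[state["columns"]].copy()
    # Scale with the training mean and standard deviation.
    for col in state["mean"].index:
        x[col] = (x[col] - state["mean"][col]) / state["std"][col]
    # Fix the category set so one-hot columns match across splits;
    # values unseen in training become null and encode as all zeros.
    for col, cats in state["categories"].items():
        x[col] = pd.Categorical(x[col], categories=cats)
    return pd.get_dummies(x, columns=list(state["categories"]), dtype=float)
```

Calling `apply_preprocessor` on both the training and test features with the same `state` yields frames with identical columns, even when the test set contains extra columns or category values that never appeared in training.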