Data preprocessing mechanism in SQLFlow. #1264

Open
@Echo9573

Data preprocessing usually includes not only general statistical operations but also the processing steps listed below (see the figure). Some of these steps are easy to implement in SQL, while others are more difficult, so it is important to embed a data preprocessing mechanism into SQLFlow.
The following is one way of doing data preprocessing.
[Figure: one way of data preprocessing]

  1. Load data for preprocessing: The original data is split into a training set and a test set. The index column and target column of each set are reserved, and the remaining columns are used as feature columns (x_trainset) for data preprocessing.
  2. Missing and outlier detection:
    1. Delete rows (remove duplicate data).
    2. Delete columns (a column is dropped if its share of missing data exceeds threshold a; a numeric value is set to null if it falls outside threshold range b).
  3. Data type identification (mostly implementable with feature columns). There are some processing details; for example, in a categorical field, if a value's share is below a certain threshold (such as 5%), it is merged into an "other" category.
  4. Null value filling:
    1. For numeric variables, null values are usually filled in.
    2. Group-based filling of numeric nulls: rows are grouped by category, where a category is a combination of feature values, such as the combination of "gender" and "car brand". A null value is filled with a value randomly selected from the corresponding group. For example, for a male who drives a Volkswagen, the value is drawn at random from the rows where gender is male and car brand is Volkswagen.
  5. Feature scaling and one-hot encoding of categorical data
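
A few of the steps above can be sketched with pandas. This is only an illustration under stated assumptions, not a proposed SQLFlow implementation: the function names, the column names in the usage note ("gender", "car_brand", "mileage"), and the default thresholds are all hypothetical, and thresholds a and b from the list are left to the caller.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_ratio):
    """Step 2: drop columns whose share of missing values exceeds the threshold."""
    missing_ratio = df.isna().mean()
    keep = missing_ratio[missing_ratio <= max_missing_ratio].index
    return df[keep]


def merge_rare_categories(series, min_share=0.05, other_label="other"):
    """Step 3: replace category values whose share is below min_share."""
    shares = series.value_counts(normalize=True)  # nulls are not counted
    rare = shares[shares < min_share].index
    return series.where(~series.isin(rare), other_label)


def fill_by_group_sample(df, target, group_cols, seed=0):
    """Step 4: fill nulls in `target` with a random observed value from the same group."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    col = out.columns.get_loc(target)
    # .indices maps each group key to the integer positions of its rows.
    for positions in out.groupby(group_cols).indices.values():
        values = out.iloc[positions, col]
        missing = values.isna().to_numpy()
        observed = values.dropna().to_numpy()
        # Groups with no observed value are left unfilled.
        if missing.any() and len(observed) > 0:
            out.iloc[positions[missing], col] = rng.choice(
                observed, size=int(missing.sum())
            )
    return out
```

For example, `fill_by_group_sample(df, "mileage", ["gender", "car_brand"])` would fill a missing mileage for a male Volkswagen driver with the mileage of another randomly chosen male Volkswagen driver. Groups with no observed value stay null in this sketch and would need a fallback rule.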

Note:
The test set must be processed with the same preprocessing that was derived from the training set. For example, operations such as deleting columns must be aligned with the training set, so that both sets end up with the same feature columns.
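
One way to keep the two sets aligned is to learn every preprocessing parameter from the training set and then replay it on the test set. The sketch below assumes pandas and uses hypothetical function names; the 0.5 missing-data threshold and the choice of standardization as the scaling method are assumptions for illustration only.

```python
import pandas as pd


def fit_preprocessing(x_train, max_missing_ratio=0.5):
    """Learn preprocessing parameters from the training features only."""
    missing_ratio = x_train.isna().mean()
    kept = [c for c in x_train.columns if missing_ratio[c] <= max_missing_ratio]
    numeric = x_train[kept].select_dtypes(include="number").columns
    return {
        "columns": kept,  # columns that survived deletion
        "mean": x_train[numeric].mean(),
        # Replace a zero standard deviation with 1 to avoid dividing by zero.
        "std": x_train[numeric].std(ddof=0).replace(0, 1.0),
    }


def apply_preprocessing(x, params):
    """Apply the training-set parameters to any feature table (train or test)."""
    # reindex keeps exactly the training columns, in the training order.
    out = x.reindex(columns=params["columns"]).copy()
    numeric = params["mean"].index
    out[numeric] = (out[numeric] - params["mean"]) / params["std"]
    return out
```

Calling `apply_preprocessing(x_test, fit_preprocessing(x_train))` drops from the test set any column that was dropped from the training set, even if that column has no missing values in the test set.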

Labels

DataScience (The issue about the application in data science), DiDi (The issue publisher is from Didi)