[FAQ]Should I use mambular.preprocessing.Preprocessor and how? #234

YuanfengZhang · 2025-03-02T08:51:45Z

Context
There are currently no examples in the documentation showing how to use mambular.preprocessing.Preprocessor.
Apparently the Preprocessor is applied in the fit() function.

Describe the task you are trying to achieve.
Manually set the preprocessing method for each column.

Describe the solution you'd like
A minimal example.

AnFreTh · 2025-03-02T09:07:37Z

Thanks for raising this. Below are some examples, but we will add better documentation in the next release. Please leave this issue open until then:

Simple example

Generally, the preprocessor follows the conventions of the sklearn preprocessing modules, i.e. it provides the methods fit and fit_transform, with a few minor exceptions.

from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data # features only, as a pd.DataFrame so column names are kept (.frame would also include the target column)
y = california_housing.target

from mambular.preprocessing import Preprocessor
prepro = Preprocessor(numerical_preprocessing="ple", n_bins=16, task="regression")

preprocessed_data = prepro.fit_transform(X, y) # pass y here for target aware encodings

Note that, unlike the output of a standard sklearn preprocessor, preprocessed_data is a dictionary whose keys have the form "d_type" + "column_name", for example "num_Latitude".
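To get an overview of such a dictionary, it can help to group its keys by their type prefix. The snippet below is only a sketch: it runs on a hand-written stand-in dictionary rather than real Preprocessor output, the helper name group_keys_by_prefix is hypothetical, and the "cat" prefix and the column "some_column" are assumptions (only the "num_" prefix appears in this thread).

```python
from collections import defaultdict

def group_keys_by_prefix(preprocessed):
    """Group dictionary keys by the text before the first underscore."""
    groups = defaultdict(list)
    for key in preprocessed:
        # Split only once so column names containing underscores stay intact.
        prefix, _, column = key.partition("_")
        groups[prefix].append(column)
    return dict(groups)

# Stand-in for the Preprocessor output (hypothetical keys and values).
example = {
    "num_Latitude": [[0.1]],
    "num_Longitude": [[0.2]],
    "cat_some_column": [[1]],  # "cat" prefix is an assumption
}
grouped = group_keys_by_prefix(example)
print(grouped)
```

With real output, the same helper could be called on preprocessed_data directly.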

Example with individually preprocessed columns

prepro = Preprocessor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude":"one-hot", "Latitude":"ple"}, 
    n_bins=16, 
    task="regression",
    )

preprocessed_data = prepro.fit_transform(X, y) # pass y here for target aware encodings

assert preprocessed_data["num_Latitude"].shape == (X.shape[0], 16) # assert that Latitude was preprocessed using PLE

Note that column names are currently case-sensitive, so make sure the keys in feature_preprocessing match the DataFrame column names exactly.
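Because a key with the wrong case would not match any column, it may be worth checking the keys before building the preprocessor. The following is a minimal sketch in plain Python; check_feature_names is a hypothetical helper, not part of mambular, and the column list is a shortened stand-in for list(X.columns).

```python
def check_feature_names(feature_preprocessing, columns):
    """Raise ValueError if a key is not an exact (case-sensitive) column name."""
    known = set(columns)
    lowered = {c.lower(): c for c in columns}
    problems = []
    for name in feature_preprocessing:
        if name in known:
            continue
        # Offer a suggestion when only the letter case differs.
        suggestion = lowered.get(name.lower())
        hint = f" (did you mean {suggestion!r}?)" if suggestion else ""
        problems.append(f"{name!r}{hint}")
    if problems:
        raise ValueError("Unknown column(s): " + ", ".join(problems))

columns = ["MedInc", "HouseAge", "Latitude", "Longitude"]  # stand-in for list(X.columns)

# Exact names pass silently.
check_feature_names({"Longitude": "one-hot", "Latitude": "ple"}, columns)

# A lower-case key is reported together with the likely intended name.
try:
    check_feature_names({"longitude": "one-hot"}, columns)
except ValueError as err:
    message = str(err)
    print(message)
```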

Get information

During PLE, the number of bins actually used can be smaller than n_bins when the decision tree finds fewer split points, so it can be useful to inspect the resulting shapes and the preprocessing steps that were chosen. To get the information that is displayed when calling model.fit(), you can run the following:

prepro.get_feature_info()
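To see why the encoded width depends on the number of bins found, here is a sketch of a textbook piecewise linear encoding in plain Python. It is a simplified, clipped variant written for illustration; ple_encode and the bin edges are assumptions, and mambular's own implementation may differ in detail.

```python
def ple_encode(x, edges):
    """Piecewise linear encoding of a scalar x for sorted bin edges.

    With T + 1 edges there are T bins, so the result has T entries.
    """
    encoded = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if x >= hi:
            encoded.append(1.0)  # x lies beyond this bin
        elif x <= lo:
            encoded.append(0.0)  # x has not reached this bin
        else:
            encoded.append((x - lo) / (hi - lo))  # x falls inside this bin
    return encoded

edges = [0.0, 1.0, 2.0, 4.0]  # 4 edges give 3 bins (illustrative values)
result = ple_encode(3.0, edges)
print(result)  # [1.0, 1.0, 0.5]
```

If the binning step returns fewer edges than requested, the encoded vector is correspondingly shorter, which is why checking the reported shapes is worthwhile.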

If you have other suggestions/ideas for improvements, feel free to comment/raise another issue.

AnFreTh · 2025-03-02T09:47:09Z

And to clarify: if you fit a model, you do not need to call the preprocessor manually; it is handled inside the build_model() functionality. The arguments from the examples above can be passed when initializing a model, e.g.:

from mambular.models import MambularRegressor

model = MambularRegressor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude":"one-hot", "Latitude":"ple"}, 
    n_bins=16
)

model.fit(X, y)

Here the preprocessing is applied automatically and there is no need to call the preprocessor explicitly.

YuanfengZhang added the question Further information is requested label Mar 2, 2025
