[FAQ]Should I use mambular.preprocessing.Preprocessor and how? #234

Open
YuanfengZhang opened this issue Mar 2, 2025 · 2 comments
Labels
question Further information is requested

Comments

@YuanfengZhang

Context
There are currently no examples in the documentation showing how to use mambular.preprocessing.Preprocessor.
Apparently the Preprocessor is applied inside the fit() function.

Describe the task you are trying to achieve.
Manually set the preprocessing method for each column.

Describe the solution you'd like
A minimal example.

@YuanfengZhang YuanfengZhang added the question Further information is requested label Mar 2, 2025
@AnFreTh
Collaborator

AnFreTh commented Mar 2, 2025

Thanks for raising this. Below are some examples, but we will add better documentation in the next release. Please leave this issue open until then:

Simple example

Generally, the preprocessor follows the conventions of the sklearn preprocessing modules, i.e. it provides the methods fit and fit_transform, with a few minor exceptions.

from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data # Features only (.frame would also include the target); a pd.DataFrame keeps column names
y = california_housing.target

from mambular.preprocessing import Preprocessor
prepro = Preprocessor(numerical_preprocessing="ple", n_bins=16, task="regression")

preprocessed_data = prepro.fit_transform(X, y) # pass y here for target aware encodings

Note that preprocessed_data, unlike the output of a standard sklearn preprocessor, is a dictionary. Each key is built as "d_type" + "column_name", for example "num_Latitude".
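
To get an overview of such a dictionary, it can help to group the keys by their prefix. The following is only a sketch: group_keys is a hypothetical helper, not part of mambular, and it assumes that prefix and column name are joined by an underscore, as in "num_Latitude". The stand-in dictionary uses placeholder lists so the snippet runs without mambular installed, and the "cat_ocean" key is made up for illustration.

```python
from collections import defaultdict


def group_keys(preprocessed):
    """Group keys of the form '<dtype>_<column>' by their dtype prefix.

    Hypothetical helper, not part of mambular.
    """
    groups = defaultdict(list)
    for key in preprocessed:
        # Split on the first underscore only, so column names that
        # themselves contain underscores stay intact.
        dtype, _, column = key.partition("_")
        groups[dtype].append(column)
    return dict(groups)


# Stand-in for the output of prepro.fit_transform(X, y); values are placeholders.
example = {"num_MedInc": [[0.1]], "num_Latitude": [[1.0, 0.3]], "cat_ocean": [[1]]}
print(group_keys(example))
```

In practice you would pass preprocessed_data instead of the stand-in dictionary.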

Example with individually preprocessed columns

prepro = Preprocessor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude":"one-hot", "Latitude":"ple"}, 
    n_bins=16, 
    task="regression",
    )

preprocessed_data = prepro.fit_transform(X, y) # pass y here for target aware encodings

assert preprocessed_data["num_Latitude"].shape == (X.shape[0], 16) # assert that Latitude was preprocessed using PLE

Note that in the current version the column names are case-sensitive, so make sure to pass them exactly as they appear in the DataFrame.
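
One way to catch a wrongly cased name early is to check the keys of feature_preprocessing against the DataFrame columns before building the preprocessor. This is a minimal sketch: check_feature_names is a hypothetical helper, not part of mambular, and how the library itself reacts to an unknown name is not covered here.

```python
def check_feature_names(feature_preprocessing, columns):
    """Raise a KeyError if a configured column name is not in `columns`.

    Hypothetical helper, not part of mambular. The comparison is exact,
    so 'longitude' does not match 'Longitude'.
    """
    columns = list(columns)
    known = set(columns)
    # Lower-cased lookup, used only to suggest the likely intended name.
    lowered = {str(c).lower(): c for c in columns}
    for name in feature_preprocessing:
        if name in known:
            continue
        hint = lowered.get(str(name).lower())
        suffix = f" (did you mean {hint!r}?)" if hint is not None else ""
        raise KeyError(f"unknown column {name!r}{suffix}")


columns = ["MedInc", "Latitude", "Longitude"]  # in practice: X.columns
check_feature_names({"Longitude": "one-hot", "Latitude": "ple"}, columns)  # passes silently
```

Calling it with {"longitude": "one-hot"} raises a KeyError that suggests "Longitude".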

Get information

With PLE, the actual number of bins can be smaller than n_bins when the decision tree finds fewer bins, so it can be useful to inspect the resulting shapes and the preprocessing steps that were chosen. To get the information that is displayed when calling model.fit(), you can run the following:

prepro.get_feature_info()
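
To illustrate why the number of bins determines the output width, here is a rough sketch of the general idea behind piecewise linear encoding. It is not mambular's implementation: ple_encode is a hypothetical function, the bin edges are hand-picked rather than found by a decision tree, and values outside the edge range are simply clipped.

```python
def ple_encode(x, edges):
    """Piecewise linear encoding of a scalar, given sorted bin edges.

    Returns one value per bin (len(edges) - 1 values): 1.0 for bins that lie
    entirely below x, 0.0 for bins entirely above x, and the fractional
    position inside the bin that contains x. Out-of-range values are clipped.
    """
    encoded = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        frac = (x - lo) / (hi - lo)
        encoded.append(min(1.0, max(0.0, frac)))
    return encoded


edges = [0.0, 1.0, 2.0, 4.0]   # hand-picked edges defining 3 bins
print(ple_encode(3.0, edges))  # [1.0, 1.0, 0.5]
```

Each value is encoded into as many entries as there are bins, so a feature for which fewer bins are found ends up with fewer columns than n_bins.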

If you have other suggestions/ideas for improvements, feel free to comment/raise another issue.

@AnFreTh
Collaborator

AnFreTh commented Mar 2, 2025

And to clarify: if you fit any model, you do not need to call the preprocessor manually; it is handled inside the build_model() functionality. The arguments from the examples above can be passed when initializing a model, e.g.:

from mambular.models import MambularRegressor

model = MambularRegressor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude":"one-hot", "Latitude":"ple"}, 
    n_bins=16
)

model.fit(X, y)

Here the preprocessing is applied automatically and there is no need to call the preprocessor explicitly.
