[FAQ]Should I use mambular.preprocessing.Preprocessor and how? #234
Comments
Thanks for raising this. Below are some examples, but we will add better documentation in the next release. Please leave this issue open until then.

Simple example

Generally, the preprocessor follows the sklearn preprocessing modules, i.e. it exposes the same kind of methods, such as fit_transform:

from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
X = california_housing.data  # features only (.frame would also contain the target); pass a pd.DataFrame for column names
y = california_housing.target
from mambular.preprocessing import Preprocessor
prepro = Preprocessor(numerical_preprocessing="ple", n_bins=16, task="regression")
preprocessed_data = prepro.fit_transform(X, y)  # pass y here for target-aware encodings

Note that preprocessed_data, unlike the output of a standard sklearn preprocessor, is a dictionary whose keys are "d_type" + "column_name", e.g. "num_Latitude".

Example with individually preprocessed columns

prepro = Preprocessor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude": "one-hot", "Latitude": "ple"},
    n_bins=16,
    task="regression",
)
preprocessed_data = prepro.fit_transform(X, y)  # pass y here for target-aware encodings
# check that Latitude was preprocessed using PLE
assert preprocessed_data["num_Latitude"].shape == (X.shape[0], 16)

Note that, in the current form, column names are case-sensitive, so make sure to pass them exactly as they appear in the data.

Get information

With "ple", the number of bins actually used can be smaller than n_bins when the decision tree finds fewer bins, so it can be useful to inspect the resulting shapes and the chosen preprocessing steps. To get the information that is displayed when calling model.fit(), you can run:

prepro.get_feature_info()

If you have other suggestions or ideas for improvements, feel free to comment or raise another issue.
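To make the link between the number of bins and the encoded width concrete, here is a rough sketch, assuming "ple" refers to piecewise linear encoding. It is not mambular's implementation: ple_encode is a hypothetical helper, and the bin edges are made-up values rather than edges found by a decision tree.

```python
def ple_encode(value, edges):
    """Piecewise linear encoding of one value, given sorted bin edges.

    Illustrative only; not mambular's implementation.
    """
    encoded = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Fraction of this bin that lies below `value`, clipped to [0, 1]:
        # 0 if the value is below the bin, 1 if it is above, linear in between.
        fraction = (value - left) / (right - left)
        encoded.append(min(1.0, max(0.0, fraction)))
    return encoded


# Hypothetical edges, chosen only for illustration: 4 edges give 3 bins.
edges = [32.0, 34.0, 38.0, 42.0]

row = ple_encode(36.0, edges)
print(row)       # [1.0, 0.5, 0.0]
print(len(row))  # 3, i.e. one column per bin
```

The encoded width is len(edges) - 1, so if fewer edges are found than requested, the second dimension of the encoded column is smaller than n_bins.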
And to clarify: if you fit any model, you do not need to call the preprocessor manually; it is handled inside the build_model() functionality. The arguments from the examples above can be used when initializing a model, i.e.:

from mambular.models import MambularRegressor
model = MambularRegressor(
    numerical_preprocessing="minmax",
    feature_preprocessing={"Longitude": "one-hot", "Latitude": "ple"},
    n_bins=16,
)
model.fit(X, y)

Here the preprocessing is applied automatically and there is no need to explicitly call the preprocessor.
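Since column names are case-sensitive in the current form, it may help to check the keys of feature_preprocessing before building a model. A minimal sketch: check_feature_preprocessing is a hypothetical helper and not part of mambular, and the column list is written out by hand so the snippet runs on its own (with a DataFrame you would pass X.columns instead).

```python
def check_feature_preprocessing(columns, feature_preprocessing):
    """Raise KeyError if a key is not an exact (case-sensitive) column name.

    Hypothetical helper for illustration; not part of mambular.
    """
    known = set(columns)
    # Map lowercased names back to the real ones to suggest a fix for case typos.
    by_lowercase = {name.lower(): name for name in columns}
    for key in feature_preprocessing:
        if key in known:
            continue
        hint = by_lowercase.get(key.lower())
        suffix = f" Did you mean {hint!r}?" if hint else ""
        raise KeyError(f"Unknown column {key!r}.{suffix}")


# Feature names of the California housing dataset.
columns = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]

# Passes silently: both keys match the column names exactly.
check_feature_preprocessing(columns, {"Longitude": "one-hot", "Latitude": "ple"})

# Raises KeyError and suggests 'Latitude', because the case does not match.
try:
    check_feature_preprocessing(columns, {"latitude": "ple"})
except KeyError as err:
    print(err)
```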
Context
There are currently no examples in the documentation on how to use mambular.preprocessing.Preprocessor.
Apparently the Preprocessor is applied in the fit() function.
Describe the task you are trying to achieve.
Manually set the method to preprocess for each column.
Describe the solution you'd like
A minimal example.