105 changes: 60 additions & 45 deletions python_scripts/parameter_tuning_grid_search.py
@@ -116,77 +116,92 @@
# %% [markdown]
# ## Tuning using a grid-search
#
# In the previous exercise (M3.01) we used two nested `for` loops (one for each
# hyperparameter) to test different combinations over a fixed grid of
# hyperparameter values. In each iteration of the loop, we used
# `cross_val_score` to compute the mean score (averaged across the
# cross-validation splits), and compared those mean scores to select the best
# combination. `GridSearchCV` is a scikit-learn class that implements a very
# similar logic with less repetitive code. The suffix `CV` refers to the
# cross-validation it runs internally (instead of the `cross_val_score` calls
# we hard-coded). As a reminder, a minimal sketch of such a manual search could
# look as follows (for illustration only; it assumes `model`, `data_train` and
# `target_train` are defined as earlier in this notebook, and the exact names
# used in M3.01 may differ):
#
# The `GridSearchCV` estimator takes a `param_grid` parameter which defines all
# hyperparameters and their associated values. The grid-search is in charge of
# creating all possible combinations and testing them.
#
# The number of combinations is equal to the product of the number of values to
# explore for each parameter. Thus, adding new parameters with their associated
# values to be explored rapidly becomes computationally expensive. Because of
# that, here we only explore the combination of the learning rate and the
# maximum number of leaf nodes, for a total of 4 x 3 = 12 combinations.

# %%
# %%time
from sklearn.model_selection import GridSearchCV

param_grid = {
    "classifier__learning_rate": (0.01, 0.1, 1, 10),  # 4 possible values
    "classifier__max_leaf_nodes": (3, 10, 30),  # 3 possible values
}  # 12 unique combinations
model_grid_search = GridSearchCV(model, param_grid=param_grid, n_jobs=2, cv=2)
model_grid_search.fit(data_train, target_train)
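
# %% [markdown]
# As a side check, scikit-learn's `ParameterGrid` helper can enumerate the
# combinations such a grid defines. This is only to verify the 4 x 3 = 12
# count; `GridSearchCV` builds the combinations internally for us:

# %%
from sklearn.model_selection import ParameterGrid

print(f"Number of combinations: {len(ParameterGrid(param_grid))}")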

# %% [markdown]
# You can access the best combination of hyperparameters found by the grid
# search using the `best_params_` attribute.

# %%
print(f"The best set of parameters is: {model_grid_search.best_params_}")

# %% [markdown]
# Once the grid-search is fitted, it can be used as any other estimator, i.e.
# it has `predict` and `score` methods. Internally, it uses the model with the
# best parameters found during `fit`.
#
# Let's get the predictions for the first 5 samples using the estimator with
# the best parameters:

# %%
model_grid_search.predict(data_test.iloc[0:5])
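
# %% [markdown]
# The fitted grid-search also delegates `predict_proba` to the underlying best
# model whenever that model supports it; assuming the pipeline's final step is
# a classifier exposing `predict_proba`, such as the histogram gradient
# boosting classifier used earlier in this notebook, we can get class
# probabilities for the same samples:

# %%
model_grid_search.predict_proba(data_test.iloc[0:5])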

# %% [markdown]
# Finally, we check the accuracy of our model using the test set.

# %%
print(f"The best set of parameters is: {model_grid_search.best_params_}")
accuracy = model_grid_search.score(data_test, target_test)
print(
f"The test accuracy score of the grid-search pipeline is: {accuracy:.2f}"
)

# %% [markdown]
# ```{warning}
# Be aware that the evaluation should normally be performed through
# cross-validation by providing `model_grid_search` as a model to the
# `cross_validate` function.
#
# Here, we used a single train-test split to evaluate `model_grid_search`. In a
# future notebook we will go into more detail about nested cross-validation,
# when cross-validation is used both for hyperparameter tuning and model
# evaluation.
# ```

# %% [markdown]
# The accuracy and the best parameters of the grid-search pipeline are similar
# to the ones we found in the previous exercise, where we searched the best
# parameters "by hand" through a double `for` loop.
#
# As a sketch of the nested evaluation mentioned in the warning above (for
# illustration only: it assumes `data` and `target` hold the full dataset
# defined earlier in this notebook, and it refits the grid-search on each outer
# cross-validation split, making it noticeably more expensive):
#
# ## The need for a validation set
#
# In the previous section, the selection of the best hyperparameters was done
# using the train set, coming from the initial train-test split. Then, we
# evaluated the generalization performance of our tuned model on the left-out
# test set. This can be shown schematically as follows:
#
# ![Cross-validation tuning
# diagram](../figures/cross_validation_train_test_diagram.png)
#
# ```{note}
# This figure shows the particular case of the **K-fold** cross-validation
# strategy using `n_splits=5` to further split the train set coming from a
# train-test split. For each cross-validation split, the procedure trains a
# model on all the red samples and evaluates the score of a given set of
# hyperparameters on the green samples. The best combination of hyperparameters
# `best_params` is selected based on those intermediate scores.
#
# Then a final model is refitted using `best_params` on the concatenation of the
# red and green samples and evaluated on the blue samples.
#
# The green samples are sometimes referred to as the **validation set** to
# differentiate them from the final test set in blue.
# ```
#
# In addition, we can inspect all results, which are stored in the
# `cv_results_` attribute of the grid-search. We filter some specific columns
# from these