Replies: 2 comments 4 replies
@hvarfner should have some answers / thoughts on this :)
@TobyBoyne Sorry for the delayed response, I had to make sure of this myself. Regarding the 100 categories: this would indeed be treated as a 101-dimensional problem, as you suggested. I share your concerns about the fairness of this approach. While we have periodically revisited how we handle categorical variables, we have continued to use one-hot encoding up to this point. What inductive biases do we introduce when we model categoricals this way? A random change in a continuous variable will have a much larger effect than a change in a categorical variable.
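To make the dimensionality concrete, here is a rough illustration (just a sketch of one-hot encoding, not the actual encoding code used internally):

```python
# One categorical with 100 levels is one-hot encoded into 100 columns, so
# together with a single continuous variable the model sees a 101-dim input.
import torch

n, n_categories = 8, 100
x_cont = torch.rand(n, 1)                    # 1 continuous dimension
levels = torch.randint(n_categories, (n,))   # category labels 0..99
x_cat = torch.nn.functional.one_hot(levels, num_classes=n_categories)
X = torch.cat([x_cont, x_cat.to(x_cont.dtype)], dim=-1)
print(X.shape)  # torch.Size([8, 101]) -- the "101-dimensional problem"
```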
Hi all,
Last year, there were a few changes to the default kernel for purely continuous spaces (#2451) following the brilliant high-dim vanilla BO paper [hvarfner2024vanilla]. I was curious whether any of the ideas from those changes extend to the default kernel for mixed spaces (in `MixedSingleTaskGP`), and whether there has been any investigation into whether such changes would actually impact BO performance? (A minimal usage sketch of the model I mean follows, just for reference.)
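For context, this is the kind of construction I am talking about (toy data, and assuming the current `cat_dims` constructor argument):

```python
# Minimal sketch: a MixedSingleTaskGP on 3 continuous dims plus 1 categorical
# (integer-coded in the last column and flagged via cat_dims).
import torch
from botorch.models import MixedSingleTaskGP

train_X = torch.rand(20, 4, dtype=torch.double)
train_X[:, -1] = torch.randint(0, 5, (20,)).double()  # 5 categories
train_Y = torch.randn(20, 1, dtype=torch.double)

model = MixedSingleTaskGP(train_X, train_Y, cat_dims=[3])
```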
Specifically, here are three changes to the default kernel that follow on from some of the ideas in the literature:
**Learned output scale**

Should the `ScaleKernel` that wraps the `CategoricalKernel` be removed? The `SingleTaskGP` continuous kernel now by default does not have a learnable outputscale. I imagine the learned outputscale on categorical inputs would lead to the same shrinkage effect observed in the paper? I think this outputscale, at the very least, could be removed. The other `ScaleKernel`s at least provide some relative weighting between the different kernels, so I can see how they may be necessary for expressivity. (A sketch of the two variants follows.)
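To be concrete about what I mean, a hand-built sketch of the categorical part with and without the outputscale (illustrative only, not the actual `MixedSingleTaskGP` internals, with a hypothetical `cat_dims`):

```python
from gpytorch.kernels import ScaleKernel
from botorch.models.kernels.categorical import CategoricalKernel

cat_dims = [3]

# Roughly the current default: the categorical kernel gets a learnable outputscale.
cat_with_outputscale = ScaleKernel(
    CategoricalKernel(ard_num_dims=len(cat_dims), active_dims=cat_dims)
)

# The variant I am asking about: drop the ScaleKernel, mirroring the change
# made to the purely continuous SingleTaskGP default.
cat_without_outputscale = CategoricalKernel(
    ard_num_dims=len(cat_dims), active_dims=cat_dims
)
```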
**Lengthscale prior**

Should the lengthscale have a dimension-scaled prior? The idea that higher-dimensional problems should use longer lengthscale priors to avoid over-inflating complexity should extend to categorical features as well, as far as I understand. Currently, the `MixedSingleTaskGP` API has no way to provide priors for the lengthscales of the categorical kernels. (A rough sketch of the kind of prior I mean follows.)
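Something along these lines is what I have in mind. The loc/scale values below are my reading of the current continuous default (sqrt(2) + log(d)/2 and sqrt(3)) and may not be exact, and whether d should count all input dimensions or only the categorical ones is itself part of the question:

```python
import math

from gpytorch.priors import LogNormalPrior
from botorch.models.kernels.categorical import CategoricalKernel

cat_dims = [3, 4]
d_total = 5  # hypothetical total input dimension (continuous + categorical)

# Dimension-scaled prior: longer lengthscales as the dimension grows.
dim_scaled_prior = LogNormalPrior(
    loc=math.sqrt(2) + 0.5 * math.log(d_total),
    scale=math.sqrt(3),
)
cat_kernel = CategoricalKernel(
    ard_num_dims=len(cat_dims),
    active_dims=cat_dims,
    lengthscale_prior=dim_scaled_prior,
)
```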
**Learned relative weights of sum/product kernel**

As far as I can tell, the first paper that proposed the kernel structure `A * (cat + cont) + B * (cat * cont)` was [ru2020cocabo]. In their experimental results, they seemed to find that simply setting A = B = 0.5 worked better than learning the trade-off between A and B; that choice also reduces the number of parameters to be learned. I was curious whether there are any experimental results suggesting that learning A and B leads to stronger performance? (A hand-built sketch of both variants follows.)
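Roughly what I mean, built from GPyTorch primitives rather than the actual `MixedSingleTaskGP` construction, and sharing the same base kernels across both terms for simplicity:

```python
from gpytorch.kernels import MaternKernel, ScaleKernel
from botorch.models.kernels.categorical import CategoricalKernel

cont_dims, cat_dims = [0, 1, 2], [3]

k_cont = MaternKernel(nu=2.5, ard_num_dims=len(cont_dims), active_dims=cont_dims)
k_cat = CategoricalKernel(ard_num_dims=len(cat_dims), active_dims=cat_dims)

# Learned trade-off: A and B become outputscale hyperparameters.
k_learned = ScaleKernel(k_cat + k_cont) + ScaleKernel(k_cat * k_cont)

# Fixed A = B = 0.5, as in the CoCaBO experiments: set the outputscales
# and freeze them so they are not fitted.
k_fixed = ScaleKernel(k_cat + k_cont) + ScaleKernel(k_cat * k_cont)
for term in k_fixed.kernels:
    term.outputscale = 0.5
    term.raw_outputscale.requires_grad_(False)
```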
I don't have any results, nor have I experienced any problems in particular. I would be interested to hear others' thoughts, and whether anyone has tested this for themselves!
[hvarfner2024vanilla] Hvarfner et al., "Vanilla Bayesian Optimization Performs Great in High Dimensions," ICML 2024. https://proceedings.mlr.press/v235/hvarfner24a.html
[ru2020cocabo] Ru et al., "Bayesian Optimisation over Multiple Continuous and Categorical Inputs," ICML 2020. https://proceedings.mlr.press/v119/ru20a.html