Replies: 2 comments 4 replies
@hvarfner should have some answers / thoughts on this :)
@TobyBoyne Sorry for the delayed response, I had to make sure of this myself. Regarding the 100 categories: this would indeed be treated as a 101-dimensional problem, as you suggested. I share your concerns about the fairness of this approach. While we have periodically revisited how we handle categorical variables, we have continued to use one-hot encoding up to this point. What inductive biases do we introduce when we model categoricals this way? A random change in a continuous variable will have a much larger effect than a change in a categorical variable.
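To make the dimensionality concrete, here is a rough illustration (just a sketch of one-hot encoding, not the actual encoding code used internally):

```python
# One categorical with 100 levels is one-hot encoded into 100 columns, so
# together with a single continuous variable the model sees a 101-dim input.
import torch

n, n_categories = 8, 100
x_cont = torch.rand(n, 1)                    # 1 continuous dimension
levels = torch.randint(n_categories, (n,))   # category labels 0..99
x_cat = torch.nn.functional.one_hot(levels, num_classes=n_categories)
X = torch.cat([x_cont, x_cat.to(x_cont.dtype)], dim=-1)
print(X.shape)  # torch.Size([8, 101]) -- the "101-dimensional problem"
```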
Hi all,
Last year, there were a few changes to the default kernel for purely continuous spaces (#2451) following the brilliant high-dim vanilla BO paper [hvarfner2024vanilla]. I was curious whether any of the ideas from those changes extend to the default kernel for mixed spaces (in `MixedSingleTaskGP`), and whether there has been any investigation into whether such changes would actually impact BO performance? (A minimal usage sketch of the model I mean follows, just for reference.)
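For context, this is the kind of construction I am talking about (toy data, and assuming the current `cat_dims` constructor argument):

```python
# Minimal sketch: a MixedSingleTaskGP on 3 continuous dims plus 1 categorical
# (integer-coded in the last column and flagged via cat_dims).
import torch
from botorch.models import MixedSingleTaskGP

train_X = torch.rand(20, 4, dtype=torch.double)
train_X[:, -1] = torch.randint(0, 5, (20,)).double()  # 5 categories
train_Y = torch.randn(20, 1, dtype=torch.double)

model = MixedSingleTaskGP(train_X, train_Y, cat_dims=[3])
```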
Specifically, here are three changes to the default kernel that follow on from some of the ideas in the literature:
**Learned output scale**

Should the `ScaleKernel` that wraps the `CategoricalKernel` be removed? The `SingleTaskGP` continuous kernel now by default does not have a learnable outputscale. I imagine the learned outputscale on categorical inputs would lead to the same shrinkage effect observed in the paper? I think this outputscale, at the very least, could be removed. The other `ScaleKernel`s at least provide some relative weighting between the different kernels, so I can see how they may be necessary for expressivity. (A sketch of the two variants follows.)
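To be concrete about what I mean, a hand-built sketch of the categorical part with and without the outputscale (illustrative only, not the actual `MixedSingleTaskGP` internals, with a hypothetical `cat_dims`):

```python
from gpytorch.kernels import ScaleKernel
from botorch.models.kernels.categorical import CategoricalKernel

cat_dims = [3]

# Roughly the current default: the categorical kernel gets a learnable outputscale.
cat_with_outputscale = ScaleKernel(
    CategoricalKernel(ard_num_dims=len(cat_dims), active_dims=cat_dims)
)

# The variant I am asking about: drop the ScaleKernel, mirroring the change
# made to the purely continuous SingleTaskGP default.
cat_without_outputscale = CategoricalKernel(
    ard_num_dims=len(cat_dims), active_dims=cat_dims
)
```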
**Lengthscale prior**

Should the lengthscale have a dimension-scaled prior? The idea that higher-dimensional problems should use longer lengthscale priors to avoid over-inflating complexity should extend to categorical features as well, as far as I understand. Currently, the `MixedSingleTaskGP` API has no way to provide priors for the lengthscales of the categorical kernels. (A rough sketch of the kind of prior I mean follows.)
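Something along these lines is what I have in mind. The loc/scale values below are my reading of the current continuous default (sqrt(2) + log(d)/2 and sqrt(3)) and may not be exact, and whether d should count all input dimensions or only the categorical ones is itself part of the question:

```python
import math

from gpytorch.priors import LogNormalPrior
from botorch.models.kernels.categorical import CategoricalKernel

cat_dims = [3, 4]
d_total = 5  # hypothetical total input dimension (continuous + categorical)

# Dimension-scaled prior: longer lengthscales as the dimension grows.
dim_scaled_prior = LogNormalPrior(
    loc=math.sqrt(2) + 0.5 * math.log(d_total),
    scale=math.sqrt(3),
)
cat_kernel = CategoricalKernel(
    ard_num_dims=len(cat_dims),
    active_dims=cat_dims,
    lengthscale_prior=dim_scaled_prior,
)
```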
**Learned relative weights of sum/product kernel**

As far as I can tell, the first paper that proposed the kernel structure `A * (cat + cont) + B * (cat * cont)` was [ru2020cocabo]. In their experimental results, they seemed to find that simply setting A = B = 0.5 worked better than learning the trade-off between A and B; that choice also reduces the number of parameters to be learned. I was curious whether there are any experimental results suggesting that learning A and B leads to stronger performance? (A hand-built sketch of both variants follows.)
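Roughly what I mean, built from GPyTorch primitives rather than the actual `MixedSingleTaskGP` construction, and sharing the same base kernels across both terms for simplicity:

```python
from gpytorch.kernels import MaternKernel, ScaleKernel
from botorch.models.kernels.categorical import CategoricalKernel

cont_dims, cat_dims = [0, 1, 2], [3]

k_cont = MaternKernel(nu=2.5, ard_num_dims=len(cont_dims), active_dims=cont_dims)
k_cat = CategoricalKernel(ard_num_dims=len(cat_dims), active_dims=cat_dims)

# Learned trade-off: A and B become outputscale hyperparameters.
k_learned = ScaleKernel(k_cat + k_cont) + ScaleKernel(k_cat * k_cont)

# Fixed A = B = 0.5, as in the CoCaBO experiments: set the outputscales
# and freeze them so they are not fitted.
k_fixed = ScaleKernel(k_cat + k_cont) + ScaleKernel(k_cat * k_cont)
for term in k_fixed.kernels:
    term.outputscale = 0.5
    term.raw_outputscale.requires_grad_(False)
```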
I don't have any results, nor have I experienced any problems in particular. I would be interested to hear others' thoughts, and whether anyone has tested this for themselves!
[hvarfner2024vanilla] Hvarfner et al., "Vanilla Bayesian Optimization Performs Great in High Dimensions," ICML 2024. https://proceedings.mlr.press/v235/hvarfner24a.html
[ru2020cocabo] Ru et al., "Bayesian Optimisation over Multiple Continuous and Categorical Inputs," ICML 2020. https://proceedings.mlr.press/v119/ru20a.html