
Embedding Quality Difference #1161

Open
erenozcelik opened this issue Nov 18, 2024 · 4 comments

@erenozcelik

Hello @timsainb,

When using Parametric UMAP for supervised tasks, the quality of the embeddings is significantly worse than that of the embeddings produced by standard UMAP. This difference is consistent across multiple datasets and configurations. What could be the reason, and can it be improved?

erenozcelik changed the title Embeddin → Embedding Quality Difference on Nov 18, 2024
@timsainb
Collaborator

If you provide a specific example and a Colab link reproducing it, I can take a look. As it stands, this issue is too vaguely described.

@erenozcelik
Author

erenozcelik commented Dec 5, 2024

Hi,

Here is the Colab link comparing Parametric UMAP and standard UMAP for supervised FMNIST.
Open in Colab

@erenozcelik
Author

Hello @timsainb,

Were you able to look at it?

@timsainb
Collaborator

timsainb commented Feb 1, 2025

Thanks for providing the Colab notebook. Note that you are plotting the results on the training data here, not the held-out test data. This distinction is very important when you consider the difference between Parametric UMAP and UMAP.

Supervised non-parametric UMAP performs an embedding by balancing your distance metric in data space (e.g. Euclidean distance) against distance in categorical (label) space. If you were to set the balance to 100% categorical distance, you would get perfect separation between classes, but it wouldn't practically tell you anything about your data. Parametric UMAP can't do that, because the embedding is parametrically related to the input data through a neural network. Imagine you sampled data as two classes from the same Gaussian distribution: since both classes come from the same distribution, even a supervised neural network won't allow you to separate the classes.
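
To make that last point concrete, here is a minimal sketch, assuming the umap-learn API: `UMAP(target_weight=...)` controls the data-vs.-label balance in supervised non-parametric UMAP (1.0 weights entirely on the labels), and `ParametricUMAP` (which requires TensorFlow) is assumed to accept labels in `fit(X, y)` the same way `UMAP` does. The two "classes" below are drawn from one and the same Gaussian, so the labels carry no information recoverable from the data:

```python
# Minimal sketch (not from the original notebook): two "classes"
# sampled from the same Gaussian, so no function of X can separate them.
import numpy as np
from umap import UMAP
from umap.parametric_umap import ParametricUMAP  # requires tensorflow

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 10)).astype("float32")
y = rng.integers(0, 2, size=2000)  # labels independent of the data
X_train, X_test = X[:1500], X[1500:]
y_train = y[:1500]

# Supervised non-parametric UMAP: target_weight=1.0 puts all the weight
# on categorical (label) distance, so the *training* embedding shows
# perfect class separation even though the labels are pure noise.
nonparam = UMAP(target_weight=1.0).fit(X_train, y_train)
emb_train = nonparam.embedding_

# Held-out points are transformed without labels; the separation vanishes.
emb_test = nonparam.transform(X_test)

# Parametric UMAP maps X to the embedding through a neural network, so
# it cannot separate these classes on training *or* test data.
param = ParametricUMAP().fit(X_train, y_train)
emb_param_test = param.transform(X_test)
```

So the fair comparison in the notebook is between the two models' `transform(X_test)` outputs, where the non-parametric model no longer benefits from having seen the labels.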
