for the same set of data, the centroids vary for new run #1

Open
krishnakumar85 opened this issue Sep 25, 2013 · 8 comments

Comments

@krishnakumar85

For each new run of node-kmeans on the same set of data, the clusters and centroids vary. Is there any way to make the results stable, for example by starting from a constant seed?
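As an illustration of what "a constant seed" could look like: `Math.random` cannot be seeded in JavaScript, so a reproducible start needs its own generator. The sketch below is standalone and does not use the node-kmeans API (the thread does not say whether the library exposes a hook for custom initialization). `mulberry32` is a commonly used small PRNG, and `pickInitialCentroids` is a hypothetical helper name chosen for this example; it assumes `k` is no larger than the number of points.

```javascript
// mulberry32: a small seeded 32-bit PRNG returning floats in [0, 1).
// Not part of node-kmeans; included only to make the start reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: choose k distinct data points as initial centroids
// with a partial Fisher-Yates shuffle driven by the seeded generator.
function pickInitialCentroids(points, k, seed) {
  const rand = mulberry32(seed);
  const idx = points.map((_, i) => i);
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rand() * (idx.length - i));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, k).map((i) => points[i].slice());
}

// Made-up RGB samples, for illustration only.
const pixels = [[255, 0, 0], [250, 5, 5], [0, 255, 0], [0, 0, 255], [5, 5, 250]];
const first = pickInitialCentroids(pixels, 2, 42);
const second = pickInitialCentroids(pixels, 2, 42);
console.log(JSON.stringify(first) === JSON.stringify(second)); // true: same seed, same start
```

A fixed seed makes runs repeatable, but it does not make the result better: the algorithm can still settle in whichever local minimum that particular start leads to.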

@listonb

listonb commented Nov 7, 2014

I'm also seeing this problem. It appears to generate new centroids on every run over identical data.

@Philmod
Owner

Philmod commented Nov 7, 2014

Are there a lot of local minima in your data set?

@listonb

listonb commented Nov 7, 2014

Yes. This is pixel RGB color data from an image.

@Philmod
Owner

Philmod commented Nov 7, 2014

Yes, that's linked to your problem.

Finding the global minimum of the k-means problem is NP-hard in general.
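For reference, the objective usually meant by "the k-means problem" can be written as below. The symbols are notation chosen for this note (clusters $C_j$ with means $\mu_j$), not something taken from the library.

```latex
J(C_1,\dots,C_k) \;=\; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x
```

The standard iterative (Lloyd-style) procedure never increases $J$, but it only guarantees a local minimum that depends on the starting centroids, so runs from different random starts can end at different centroids.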

@listonb

listonb commented Nov 7, 2014

Any easy fix?

@Philmod
Owner

Philmod commented Nov 7, 2014

This is on the todo list.

I think it can be solved in a couple of ways:

  • replicates: trying many random starting points and merging
  • adding some randomness

I'm happy if you create a Pull Request with a solution.

Thanks,
Philmod

@listonb

listonb commented Nov 7, 2014

Appreciate the time. I'll try to look into it after next week if I have some time!

@Morikko

Morikko commented Aug 30, 2018

One of the solutions used in sklearn relies on the inertia:

  1. Run k-means many times with different initializations
  2. For each result, compute the inertia
  3. Keep the result with the lowest inertia

Note about inertia (from sklearn): Sum of squared distances of samples to their closest cluster center.
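The steps above can be sketched as follows. This is a minimal standalone illustration, not the node-kmeans implementation and not a proposed patch: `kmeansOnce`, `kmeansBestOf`, and the `restarts` and `seed` options are hypothetical names chosen for this example, and the seeded generator (`mulberry32`) is there only so that the restarts are reproducible.

```javascript
// Small seeded PRNG (mulberry32), returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Squared Euclidean distance between two vectors of equal length.
const sqDist = (a, b) => a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0);

// Index of the centroid closest to point p (ties go to the lowest index).
function nearest(p, centroids) {
  let best = 0;
  for (let c = 1; c < centroids.length; c++) {
    if (sqDist(p, centroids[c]) < sqDist(p, centroids[best])) best = c;
  }
  return best;
}

// One k-means run (Lloyd iterations) starting from k randomly chosen data points.
function kmeansOnce(points, k, rand, maxIter = 100) {
  const idx = points.map((_, i) => i);
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rand() * (idx.length - i));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  let centroids = idx.slice(0, k).map((i) => points[i].slice());
  let labels = [];
  for (let iter = 0; iter < maxIter; iter++) {
    const next = points.map((p) => nearest(p, centroids));
    const changed = next.some((l, i) => l !== labels[i]);
    labels = next;
    if (!changed) break; // assignments are stable: converged
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      if (members.length === 0) return old; // leave an empty cluster's centroid in place
      return old.map((_, d) => members.reduce((s, p) => s + p[d], 0) / members.length);
    });
  }
  labels = points.map((p) => nearest(p, centroids));
  // Inertia: sum of squared distances of samples to their closest centroid.
  const inertia = points.reduce((s, p, i) => s + sqDist(p, centroids[labels[i]]), 0);
  return { centroids, labels, inertia };
}

// Steps 1-3: run several times, compute each inertia, keep the lowest.
function kmeansBestOf(points, k, { restarts = 10, seed = 1 } = {}) {
  const rand = mulberry32(seed);
  let best = null;
  for (let r = 0; r < restarts; r++) {
    const run = kmeansOnce(points, k, rand);
    if (best === null || run.inertia < best.inertia) best = run;
  }
  return best;
}

// Made-up RGB samples, for illustration only.
const samples = [[255, 0, 0], [250, 10, 5], [0, 255, 0], [5, 250, 10], [0, 0, 255], [10, 5, 250]];
const result = kmeansBestOf(samples, 3, { restarts: 20, seed: 42 });
console.log(result.centroids, result.inertia);
```

With a fixed `seed` the selected result is the same on every run, and more restarts can only lower (never raise) the inertia of the result that is kept. It is still not guaranteed to be the global minimum.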

@Philmod I can do a PR
