KMeans cluster analysis is non-deterministic when using KMeansYinyang initialization, even with fixed MLContext seed #6375

@mikegoatly

Description

System Information (please complete the following information):

  • OS & Version: Windows 10
  • ML.NET Version: ML.NET v1.7.1 (also tested with 2.0.0-preview.22313.1)
  • .NET Version: .NET 6.0

Describe the bug
When creating a KMeans cluster prediction engine for a training data set that does not change, the predicted cluster ids
are not consistent, even when the seed is specified for the MLContext.

To Reproduce
For this fixed data set (generated randomly once, then reused unchanged for every run):

using Microsoft.ML;
using Microsoft.ML.Data;

public class ModelData
{
    public float Value1 { get; set; }
    public float Value2 { get; set; }
}

public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedClusterId;

    [ColumnName("Score")]
    public float[] Distances = null!;

    [ColumnName("Features")]
    public float[] Features = null!;
}

// 60 points: Value1 is an integer in [0, 2000), Value2 is an integer in [0, 7).
var data = Enumerable.Range(0, 60)
    .Select(x => new ModelData { Value1 = Random.Shared.Next(0, 2000), Value2 = Random.Shared.Next(0, 7) })
    .ToList();

And this function to create a new instance of the prediction engine:

const string FeaturesColumnName = "Features";
const int ClusterCount = 4;

public PredictionEngine<ModelData, ClusterPrediction> CreateModel(IEnumerable<ModelData> data)
{
    var mlContext = new MLContext(seed: 0);

    var dataView = mlContext.Data.LoadFromEnumerable(data);

    IEstimator<ITransformer> pipeline = mlContext.Transforms
        .Concatenate(FeaturesColumnName, new[] { nameof(ModelData.Value1), nameof(ModelData.Value2) })
        .Append(mlContext.Clustering.Trainers.KMeans(FeaturesColumnName, numberOfClusters: ClusterCount));

    var model = pipeline.Fit(dataView);

    return mlContext.Model.CreatePredictionEngine<ModelData, ClusterPrediction>(model);
}

We should be able to create the same prediction engine, producing the same results, many times. The following loop creates the engine ten times, predicts the cluster id for every point in the data set, and prints the number of items that end up in each cluster:

using System.Linq;

for (var i = 0; i < 10; i++)
{
    var engine = CreateModel(data);

    var clusterCounts = data.Select(d => engine.Predict(d).PredictedClusterId).ToLookup(x => (int)x);

    Console.WriteLine(string.Join(" ", Enumerable.Range(1, ClusterCount).Select(x => $"Cluster {x}: {clusterCounts[x].Count()} items")));
}

This outputs:

Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 13 items Cluster 2: 20 items Cluster 3: 12 items Cluster 4: 15 items
Cluster 1: 15 items Cluster 2: 15 items Cluster 3: 17 items Cluster 4: 13 items
Cluster 1: 23 items Cluster 2: 22 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 22 items Cluster 2: 23 items Cluster 3: 8 items Cluster 4: 7 items
Cluster 1: 20 items Cluster 2: 13 items Cluster 3: 15 items Cluster 4: 12 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items
Cluster 1: 20 items Cluster 2: 15 items Cluster 3: 12 items Cluster 4: 13 items

Expected behavior
I would expect that each time the model is trained from an MLContext with a fixed seed, the cluster counts would be identical, with the same data points assigned to each cluster.

Screenshots, Code, Sample Projects
I've attached a .NET Interactive notebook (zipped) for ease of reproduction.

Activity

ghost added the untriaged label (New issue has not been triaged) on Oct 13, 2022
Title changed from "KMeans cluster analysis is non-deterministic, even with fixed MLContext seed" to "KMeans cluster analysis is non-deterministic when using KMeansYinyang initialization, even with fixed MLContext seed" on Nov 17, 2022
mikegoatly (Author) commented on Nov 17, 2022

Further investigation has shown that if I use the KMeansPlusPlus initialization algorithm then the clustering becomes deterministic, so this looks like it's a bug in the KMeansYinyang initialization algorithm.
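
For anyone wanting to try the same workaround, here is a minimal sketch of selecting the initialization algorithm through the trainer options instead of the simple KMeans overload. The column names and cluster count are taken from the repro above; this is an illustration of the switch described in this comment, not a confirmed fix, and the Yinyang initialization may still need investigating.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Same fixed seed as in the original repro.
var mlContext = new MLContext(seed: 0);

// Explicitly request KMeansPlusPlus initialization rather than relying on
// the trainer's default initialization algorithm.
var options = new KMeansTrainer.Options
{
    FeatureColumnName = "Features",
    NumberOfClusters = 4,
    InitializationAlgorithm = KMeansTrainer.InitializationAlgorithm.KMeansPlusPlus
};

// Build the same pipeline shape as CreateModel, but with the options-based trainer.
var pipeline = mlContext.Transforms
    .Concatenate("Features", "Value1", "Value2")
    .Append(mlContext.Clustering.Trainers.KMeans(options));
```

Swapping this pipeline into CreateModel and re-running the loop from the description is how the behaviour reported here can be checked against your own data.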

Added to the ML.NET 3.0 milestone on Nov 28, 2022
ghost removed the untriaged label (New issue has not been triaged) on Nov 28, 2022
michaelgsharp (Contributor) commented on Nov 28, 2022

Thanks for finding this for us! We will take a look and see what we can figure out.

superichmann commented on Mar 27, 2025

#7429 STILL HAPPENS.

Metadata

  • Issue: #6375 in dotnet/machinelearning
  • Assignees: No one assigned
  • Labels: bug (Something isn't working), help wanted (Extra attention is needed)
  • Development: No branches or pull requests
  • Participants: @mikegoatly, @ericstj, @michaelgsharp, @superichmann