Commit 9e6a174

Finalized coresets and tests
Author: Andrey Oskin
1 parent 9af947f, commit 9e6a174

18 files changed, +273 -33 lines changed

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 name = "ParallelKMeans"
 uuid = "42b8e9d4-006b-409a-8472-7f34b3fb58af"
 authors = ["Bernard Brenyah", "Andrey Oskin"]
-version = "0.1.5"
+version = "0.1.6"
 
 [deps]
 Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"

docs/src/index.md

Lines changed: 5 additions & 2 deletions
@@ -72,15 +72,16 @@ git checkout experimental
 - [X] Implementation of [Hamerly implementation](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
 - [X] Interface for inclusion in Alan Turing Institute's [MLJModels](https://github.com/alan-turing-institute/MLJModels.jl#who-is-this-repo-for).
 - [X] Full Implementation of Triangle inequality based on [Elkan - 2003 Using the Triangle Inequality to Accelerate K-Means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
-- [X] Implementation of [Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf)
+- [X] Implementation of [Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf).
+- [X] Implementation of [Coresets](http://proceedings.mlr.press/v51/lucic16-supp.pdf).
 - [ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
+- [X] Support for weighted K-means.
 - [ ] Support for other distance metrics supported by [Distances.jl](https://github.com/JuliaStats/Distances.jl#supported-distances).
 - [ ] Support of MLJ Random generation hyperparameter.
 - [ ] Native support for tabular data inputs outside of MLJModels' interface.
 - [ ] Refactoring and finalization of API design.
 - [ ] GPU support.
 - [ ] Distributed calculations support.
-- [ ] Implementation of other K-Means algorithm variants based on recent literature.
 - [ ] Optimization of code base.
 - [ ] Improved Documentation
 - [ ] More benchmark tests.
@@ -123,6 +124,7 @@ r.converged # whether the procedure converged
 - [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster) - Hamerly is good for moderate number of clusters (< 50?) and moderate dimensions (<100?).
 - [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - Recommended for high dimensional data.
 - [Yinyang()](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ding15.pdf) - Recommended for large dimensions and/or large number of clusters.
+- [Coreset()](http://proceedings.mlr.press/v51/lucic16-supp.pdf) - Recommended for very fast clustering of very large datasets, when extreme accuracy is not important.
 - [Geometric()](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf) - (Coming soon)
 - [MiniBatch()](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf) - (Coming soon)
 
@@ -204,6 +206,7 @@ ________________________________________________________________________________
 - 0.1.3 Faster & optimized execution.
 - 0.1.4 Bug fixes.
 - 0.1.5 Added `Yinyang` algorithm.
+- 0.1.6 Added support for weighted k-means; Added `Coreset` algorithm; improved support for different types of the design matrix.
 
 ## Contributing

src/ParallelKMeans.jl

Lines changed: 1 addition & 1 deletion
@@ -17,6 +17,6 @@ include("mlj_interface.jl")
 include("coreset.jl")
 
 export kmeans
-export Lloyd, Hamerly, Elkan, Yinyang, Coreset
+export Lloyd, Hamerly, Elkan, Yinyang, 阴阳, Coreset
 
 end # module

src/coreset.jl

Lines changed: 18 additions & 3 deletions
@@ -3,22 +3,37 @@
 
 Coreset algorithm implementation, based on "Lucic, Mario & Bachem,
 Olivier & Krause, Andreas. (2015). Strong Coresets for Hard and Soft Bregman
-Clustering with Applications to Exponential Family Mixtures. "
+Clustering with Applications to Exponential Family Mixtures."
+
+`Coreset` supports the following arguments:
+- `m`: default 100, subsample size
+- `alg`: default `Lloyd()`, algorithm used to cluster the sample
 
 It can be used directly in `kmeans` function
 
 ```julia
 X = rand(30, 100_000) # 100_000 random points in 30 dimensions
 
-kmeans(Coreset(), X, 3) # 3 clusters, Coreset algorithm
+# 3 clusters, Coreset algorithm with the default Lloyd algorithm and a subsample of size 100
+kmeans(Coreset(), X, 3)
+
+# 3 clusters, Coreset algorithm with the Hamerly algorithm and a subsample of size 500
+kmeans(Coreset(m = 500, alg = Hamerly()), X, 3)
+kmeans(Coreset(500, Hamerly()), X, 3)
+
+# alternatively, the short form can be used to set only the subsample size or only the algorithm
+kmeans(Coreset(500), X, 3) # sample of size 500, Lloyd clustering algorithm
+kmeans(Coreset(Hamerly()), X, 3) # sample of size 100, Hamerly clustering algorithm
 ```
 """
 struct Coreset{T <: AbstractKMeansAlg} <: AbstractKMeansAlg
 m::Int
 alg::T
 end
 
-Coreset() = Coreset(100, Lloyd())
+Coreset(; m = 100, alg = Lloyd()) = Coreset(m, alg)
+Coreset(m::Int) = Coreset(m, Lloyd())
+Coreset(alg::AbstractKMeansAlg) = Coreset(100, alg)
 
 function kmeans!(alg::Coreset, containers, X, k, weights;
 n_threads = Threads.nthreads(),
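
Review note: the three `Coreset` constructors added in this hunk combine a keyword form with two single-argument positional forms that are told apart by argument type. The following is a minimal, self-contained sketch of that pattern. `AbstractAlg`, `PlainAlg`, `FastAlg` and `Sampler` are hypothetical stand-ins used only for illustration, not the ParallelKMeans types.

```julia
# Hypothetical stand-ins; these are NOT the ParallelKMeans types.
abstract type AbstractAlg end
struct PlainAlg <: AbstractAlg end
struct FastAlg <: AbstractAlg end

struct Sampler{T <: AbstractAlg} <: AbstractAlg
    m::Int      # subsample size
    alg::T      # algorithm applied to the subsample
end

# Keyword form; called without arguments it also acts as the zero-argument constructor.
Sampler(; m = 100, alg = PlainAlg()) = Sampler(m, alg)
# Short positional forms, selected by the type of the single argument.
Sampler(m::Int) = Sampler(m, PlainAlg())
Sampler(alg::AbstractAlg) = Sampler(100, alg)

s_default = Sampler()                        # m = 100, PlainAlg
s_keyword = Sampler(m = 500, alg = FastAlg())
s_full    = Sampler(500, FastAlg())          # same value as s_keyword
s_size    = Sampler(500)                     # only the size is overridden
s_alg     = Sampler(FastAlg())               # only the algorithm is overridden
```

Because the keyword method has no positional arguments, it replaces the former zero-argument constructor without introducing a method ambiguity with the two single-argument forms.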

src/elkan.jl

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ function kmeans!(alg::Elkan, containers, X, k, weights;
 k_init = "k-means++", max_iters = 300,
 tol = eltype(X)(1e-6), verbose = false, init = nothing)
 nrow, ncol = size(X)
-centroids = init == nothing ? smart_init(X, k, n_threads, init=k_init).centroids : deepcopy(init)
+centroids = init == nothing ? smart_init(X, k, n_threads, weights, init=k_init).centroids : deepcopy(init)
 
 update_containers(alg, containers, centroids, n_threads)
 @parallelize n_threads ncol chunk_initialize(alg, containers, centroids, X, weights)

src/hamerly.jl

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ function kmeans!(alg::Hamerly, containers, X, k, weights;
 k_init = "k-means++", max_iters = 300,
 tol = eltype(X)(1e-6), verbose = false, init = nothing)
 nrow, ncol = size(X)
-centroids = init == nothing ? smart_init(X, k, n_threads, init=k_init).centroids : deepcopy(init)
+centroids = init == nothing ? smart_init(X, k, n_threads, weights, init=k_init).centroids : deepcopy(init)
 
 @parallelize n_threads ncol chunk_initialize(alg, containers, centroids, X, weights)

src/lloyd.jl

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ function kmeans!(alg::Lloyd, containers, X, k, weights;
 k_init = "k-means++", max_iters = 300,
 tol = eltype(X)(1e-6), verbose = false, init = nothing)
 nrow, ncol = size(X)
-centroids = isnothing(init) ? smart_init(X, k, n_threads, init=k_init).centroids : deepcopy(init)
+centroids = isnothing(init) ? smart_init(X, k, n_threads, weights, init=k_init).centroids : deepcopy(init)
 
 T = eltype(X)
 converged = false

src/seeding.jl

Lines changed: 9 additions & 7 deletions
@@ -10,16 +10,18 @@ end
 
 
 """
-chunk_colwise!(target, x, y, r)
+chunk_colwise(target, x, y, i, weights, r, idx)
 
-Utility function for calculation of the `colwise!(target, x, y, n_threads)` function.
+Utility function for the calculation of the weighted distance between points `x` and
+centroid vector `y[:, i]`.
 UnitRange argument `r` selects the subarray of the original design matrix `x` that is going
 to be processed.
 """
-function chunk_colwise(target, x, y, i, r, idx)
+function chunk_colwise(target, x, y, i, weights, r, idx)
 T = eltype(x)
 @inbounds for j in r
 dist = distance(x, y, j, i)
+dist = isnothing(weights) ? dist : weights[j] * dist
 target[j] = dist < target[j] ? dist : target[j]
 end
 end
@@ -35,7 +37,7 @@ of centroids from X used if any other string is attempted.
 
 A named tuple representing centroids and indices respectively is returned.
 """
-function smart_init(X, k, n_threads = Threads.nthreads();
+function smart_init(X, k, n_threads = Threads.nthreads(), weights = nothing;
 init = "k-means++")
 
 nrow, ncol = size(X)
@@ -50,7 +52,7 @@ function smart_init(X, k, n_threads = Threads.nthreads();
 # TODO relax constraints on distances, may be should
 # define `X` as X::AbstractArray{T} where {T <: Number}
 # and use this T for all calculations.
-rand_idx = rand(1:ncol)
+rand_idx = isnothing(weights) ? rand(1:ncol) : wsample(1:ncol, weights)
 rand_indices[1] = rand_idx
 @inbounds for j in axes(X, 1)
 centroids[j, 1] = X[j, rand_idx]
@@ -61,7 +63,7 @@ function smart_init(X, k, n_threads = Threads.nthreads();
 distances = fill(T(Inf), ncol)
 
 # compute distances from the first centroid chosen to all the other data points
-@parallelize n_threads ncol chunk_colwise(distances, X, centroids, 1)
+@parallelize n_threads ncol chunk_colwise(distances, X, centroids, 1, weights)
 distances[rand_idx] = zero(T)
 
 for i = 2:k
@@ -77,7 +79,7 @@ function smart_init(X, k, n_threads = Threads.nthreads();
 i == k && break
 
 # compute distances from the centroids to all data points
-@parallelize n_threads ncol chunk_colwise(distances, X, centroids, i)
+@parallelize n_threads ncol chunk_colwise(distances, X, centroids, i, weights)
 
 distances[r_idx] = zero(T)
 end
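
Review note: the seeding changes above scale each point's distance by its weight and draw the first centroid with probability proportional to the weights. The sketch below is a single-threaded, self-contained approximation of that idea for illustration only. `sqdist`, `weighted_index` and `seed_sketch` are hypothetical names, the squared Euclidean distance is an assumption, and the hand-written sampler stands in for `wsample`; it is not the package's implementation.

```julia
using Random

# Assumed metric: squared Euclidean distance between column j of x and column i of y.
sqdist(x, y, j, i) = sum(abs2, view(x, :, j) .- view(y, :, i))

# Draw an index with probability proportional to w.
# Assumes w is non-negative and has at least one positive entry.
function weighted_index(rng, w)
    t = rand(rng) * sum(w)
    acc = zero(t)
    for (j, wj) in enumerate(w)
        acc += wj
        acc > t && return j
    end
    return findlast(>(0), w)   # guard against floating-point round-off
end

function seed_sketch(rng, X, k, weights = nothing)
    nrow, ncol = size(X)
    T = eltype(X)
    w = isnothing(weights) ? ones(T, ncol) : weights
    indices = Vector{Int}(undef, k)
    centroids = Matrix{T}(undef, nrow, k)
    distances = fill(T(Inf), ncol)

    # first centroid: sampled proportionally to the point weights
    indices[1] = weighted_index(rng, w)
    centroids[:, 1] .= view(X, :, indices[1])

    for i in 1:k
        # keep, per point, the smallest weighted distance to the centroids chosen so far
        for j in 1:ncol
            distances[j] = min(distances[j], w[j] * sqdist(X, centroids, j, i))
        end
        distances[indices[i]] = zero(T)
        i == k && break
        # next centroid: sampled proportionally to the weighted distances
        indices[i + 1] = weighted_index(rng, distances)
        centroids[:, i + 1] .= view(X, :, indices[i + 1])
    end
    return (centroids = centroids, indices = indices)
end

X = [0.0 0.0 10.0 10.0; 0.0 1.0 0.0 1.0]
result = seed_sketch(MersenneTwister(1), X, 3, [1.0, 2.0, 1.0, 0.5])
```

Points that were already chosen keep a distance of zero, so in this sketch they cannot be drawn again while any other point has a positive weighted distance.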

src/yinyang.jl

Lines changed: 25 additions & 8 deletions
@@ -8,34 +8,51 @@ Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015"
 Generally it outperforms `Hamerly` algorithm and has roughly the same time as `Elkan`
 algorithm with much lower memory consumption.
 
+
+`Yinyang` supports the following arguments:
+`auto`: `Bool`, indicates whether to perform automated or manual grouping
+`group_size`: `Int`, estimation of average number of clusters per group. Lower numbers
+correspond to higher calculation speed and higher memory consumption and vice versa.
+
 It can be used directly in `kmeans` function
 
 ```julia
 X = rand(30, 100_000) # 100_000 random points in 30 dimensions
 
-kmeans(Yinyang(), X, 3) # 3 clusters, Yinyang algorithm
-```
+# 3 clusters, Yinyang algorithm, with the default group_size of 7
+kmeans(Yinyang(), X, 3)
 
-`Yinyang` supports following arguments:
-`auto`: `Bool`, indicates whether to perform automated or manual grouping
-`group_size`: `Int`, estimation of average number of clusters per group. Lower numbers
-corresponds to higher calculation speed and higher memory consumption and vice versa.
+# The following are equivalent
+# 3 clusters, Yinyang algorithm with a group_size of 10
+kmeans(Yinyang(group_size = 10), X, 3)
+kmeans(Yinyang(10), X, 3)
+
+# One group with the size of the number of points
+kmeans(Yinyang(auto = false), X, 3)
+kmeans(Yinyang(false), X, 3)
+
+# The Chinese name can also be used
+kmeans(阴阳(), X, 3)
+```
 """
 struct Yinyang <: AbstractKMeansAlg
 auto::Bool
 group_size::Int
 end
 
-Yinyang() = Yinyang(true, 7)
 Yinyang(auto::Bool) = Yinyang(auto, 7)
 Yinyang(group_size::Int) = Yinyang(true, group_size)
+Yinyang(; group_size = 7, auto = true) = Yinyang(auto, group_size)
+阴阳(auto::Bool) = Yinyang(auto, 7)
+阴阳(group_size::Int) = Yinyang(true, group_size)
+阴阳(; group_size = 7, auto = true) = Yinyang(auto, group_size)
 
 function kmeans!(alg::Yinyang, containers, X, k, weights;
 n_threads = Threads.nthreads(),
 k_init = "k-means++", max_iters = 300,
 tol = 1e-6, verbose = false, init = nothing)
 nrow, ncol = size(X)
-centroids = init == nothing ? smart_init(X, k, n_threads, init=k_init).centroids : deepcopy(init)
+centroids = init == nothing ? smart_init(X, k, n_threads, weights, init=k_init).centroids : deepcopy(init)
 
 # create initial groups of centers, step 1 in original paper
 initialize(alg, containers, centroids, n_threads)
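
Review note: the single-argument `Yinyang` constructors rely on `Bool` and `Int` being distinct concrete types, so `Yinyang(false)` and `Yinyang(10)` reach different methods even though `Bool <: Integer`. The sketch below illustrates this with a hypothetical `Grouping` struct. It also shows one possible alternative to writing three forwarding methods for the Unicode name: a constant alias, which shares every constructor automatically. The alias is an illustration of a design option, not what this commit does.

```julia
# Hypothetical stand-in for illustration; NOT the ParallelKMeans type.
struct Grouping
    auto::Bool
    group_size::Int
end

Grouping(auto::Bool) = Grouping(auto, 7)                  # chosen for true / false
Grouping(group_size::Int) = Grouping(true, group_size)    # chosen for integer literals
Grouping(; group_size = 7, auto = true) = Grouping(auto, group_size)

# Alternative design (an assumption, not the commit's approach):
# a constant alias makes the Unicode name refer to the very same type.
const 分组 = Grouping

g_bool  = Grouping(false)     # auto = false, group_size = 7
g_int   = Grouping(10)        # auto = true,  group_size = 10
g_alias = 分组(group_size = 10)
```

With the alias, any constructor added to `Grouping` later is available under the Unicode name as well, whereas explicit forwarding methods need to be kept in sync by hand.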

test/test01_distance.jl

Lines changed: 12 additions & 7 deletions
@@ -1,22 +1,27 @@
 module TestDistance
-using ParallelKMeans: colwise!
+using ParallelKMeans: chunk_colwise, @parallelize
 using Test
 
 @testset "naive singlethread colwise" begin
 X = [1.0 3.0 4.0; 2.0 5.0 6.0]
-y = [1.0, 2.0]
-r = Vector{Float64}(undef, 3)
+y = permutedims([1.0, 2.0]')
+ncol = size(X, 2)
+r = fill(Inf, ncol)
+n_threads = 1
 
-colwise!(r, X, y, 1)
+@parallelize n_threads ncol chunk_colwise(r, X, y, 1, nothing)
 @test all(r .≈ [0.0, 13.0, 25.0])
 end
 
 @testset "multithread colwise" begin
 X = [1.0 3.0 4.0; 2.0 5.0 6.0]
-y = [1.0, 2.0]
-r = Vector{Float64}(undef, 3)
+y = permutedims([1.0, 2.0]')
+ncol = size(X, 2)
+r = fill(Inf, ncol)
+n_threads = 2
+
+@parallelize n_threads ncol chunk_colwise(r, X, y, 1, nothing)
 
-colwise!(r, X, y, 2)
 @test all(r .≈ [0.0, 13.0, 25.0])
 end
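
Review note: both tests above pass `nothing` as the weights, so the weighted branch of the distance update is not exercised here. A weighted case could be checked along the following lines. This is a self-contained sketch: `min_dist_sketch!` is a hypothetical single-threaded stand-in for the package's chunked update, and it assumes the squared Euclidean distance, which is consistent with the expected values 13.0 and 25.0 used in the tests.

```julia
using Test

# Hypothetical stand-in: smallest (optionally weighted) squared distance from every
# column of x to column i of y, folded into `target`.
function min_dist_sketch!(target, x, y, i, weights)
    for j in axes(x, 2)
        dist = sum(abs2, view(x, :, j) .- view(y, :, i))
        dist = isnothing(weights) ? dist : weights[j] * dist
        target[j] = min(target[j], dist)
    end
    return target
end

X = [1.0 3.0 4.0; 2.0 5.0 6.0]
y = reshape([1.0, 2.0], 2, 1)   # a single centroid stored as a 2x1 matrix

@testset "weighted colwise sketch" begin
    r = fill(Inf, size(X, 2))
    min_dist_sketch!(r, X, y, 1, nothing)
    @test r ≈ [0.0, 13.0, 25.0]

    # each distance is multiplied by the weight of its point
    rw = fill(Inf, size(X, 2))
    min_dist_sketch!(rw, X, y, 1, [1.0, 2.0, 0.5])
    @test rw ≈ [0.0, 26.0, 12.5]
end
```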
