
fix #37312 fast extrema computation on sparse arrays #37429


Merged
merged 1 commit into JuliaLang:master on Sep 26, 2020

Conversation

cfauchereau
Contributor

@cfauchereau cfauchereau commented Sep 6, 2020

Fix #37312
Exploit sparsity to speed up extrema computation. It iterates over non-zeros values and then account for the zero value if necessary.
It is consistently faster than the generic algorithm.
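The approach described above can be sketched as follows; `sparse_extrema` is a hypothetical helper name for illustration, not the PR's actual code.

```julia
using SparseArrays

# Sketch: compute the extrema over the stored entries only, then account
# for the implicit zeros when the array is not completely filled.
function sparse_extrema(A::Union{SparseVector,SparseMatrixCSC})
    isempty(A) && throw(ArgumentError("extrema over an empty array is not allowed"))
    nz = nonzeros(A)
    z = zero(eltype(A))
    if isempty(nz)                 # all entries are implicit zeros
        return (z, z)
    end
    vmin, vmax = extrema(nz)       # only iterates the stored values
    if nnz(A) < length(A)          # some implicit zeros exist
        vmin, vmax = min(vmin, z), max(hi_or(vmax, z)...)
    end
    return (vmin, vmax)
end

# small helper so the update above reads as one expression
hi_or(a, b) = (a, b) .|> identity |> t -> (max(t...),)
```

Actually, the zero-accounting step is clearer written directly:

```julia
using SparseArrays

function sparse_extrema2(A::Union{SparseVector,SparseMatrixCSC})
    isempty(A) && throw(ArgumentError("extrema over an empty array is not allowed"))
    nz = nonzeros(A)
    z = zero(eltype(A))
    isempty(nz) && return (z, z)
    vmin, vmax = extrema(nz)
    if nnz(A) < length(A)
        vmin = min(vmin, z)
        vmax = max(vmax, z)
    end
    return (vmin, vmax)
end
```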

@jebej
Contributor

jebej commented Sep 6, 2020

Maybe merge the two methods?

@cfauchereau
Contributor Author

Maybe merge the two methods?

I agree. I merged them into higherorderfns.jl, which is meant for this kind of function, if I understand correctly.
minimum and maximum are still ten times faster on large arrays, so there is room for improvement.

@cfauchereau
Contributor Author

Actually, the performance issue remains with Arrays, so it is extrema in Base that is slow. This is out of the scope of this pull request.
The reason may be that minimum and maximum use mapreduce but extrema doesn't.
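The mapreduce-based formulation alluded to here can be sketched as a single pass; `mapreduce_extrema` is a hypothetical name, not Base's actual implementation.

```julia
# Sketch: expressing extrema as one mapreduce pass, in the spirit of how
# minimum/maximum are implemented in Base.
mapreduce_extrema(f, itr) =
    mapreduce(x -> (y = f(x); (y, y)),                       # map each value to (lo, hi)
              (a, b) -> (min(a[1], b[1]), max(a[2], b[2])),  # merge pairs
              itr)
```

For example, `mapreduce_extrema(abs, [-5, 2, 3])` returns `(2, 5)`.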

@dkarrasch
Member

Yes, the generic case is tracked by #31442 and there is some promising work in #36265.

Member

@dkarrasch dkarrasch left a comment

LGTM!

@dkarrasch
Member

I wonder whether we should do the same for minimum and maximum. I just quickly benchmarked the following:

julia> using SparseArrays, BenchmarkTools

julia> x = sprandn(1000, 1, 0.2);

julia> @btime minimum($x);
  617.713 ns (5 allocations: 144 bytes)

julia> x = sparse(vec(x));

julia> @btime minimum($x);
  505.575 ns (0 allocations: 0 bytes)


# (13) extrema methods optimized for sparse vectors/matrices.
function extrema(f, A::SparseVecOrMat)
    if iszero(length(A))
Member

In the spirit of

function maximum(x::AbstractSparseVector{T}) where T<:Real
    n = length(x)
    n > 0 || throw(ArgumentError("maximum over empty array is not allowed."))
    m = nnz(x)
    (m == 0 ? zero(T) :
     m == n ? maximum(nonzeros(x)) :
     max(zero(T), maximum(nonzeros(x))))::T
end

you may want to introduce an M = length(A) and reuse it below. As I said, I wonder whether we should move those here and extend them to SparseVecOrMat. Hopefully this doesn't introduce method ambiguities, but that can be done in a separate PR.

Contributor Author

Done.
Concerning minimum and maximum, they definitely need to be moved and applied to SparseVecOrMat. There are no optimizations for sparse matrices in either of these functions for now.

@@ -1154,4 +1155,23 @@ map(f::Tf, A::SparseOrStructuredMatrix, Bs::Vararg{SparseOrStructuredMatrix,N})
map!(f::Tf, C::AbstractSparseMatrixCSC, A::SparseOrStructuredMatrix, Bs::Vararg{SparseOrStructuredMatrix,N}) where {Tf,N} =
(_checksameshape(C, A, Bs...); _noshapecheck_map!(f, C, _sparsifystructured(A), map(_sparsifystructured, Bs)...))


# (13) extrema methods optimized for sparse vectors/matrices.
function extrema(f, A::SparseVecOrMat)
Member

This doesn't handle dims or dims=:, so I believe those will still call the slower generic code. We might want to at least handle dims=:, and use invoke to call the AbstractArray method for general dims.

Member

I think it's worse because dims is a keyword argument, so this will error now if you pass a dims argument. This will probably need to be renamed to a hidden function _extrema and extrema needs to check for a dims argument and fall back to the slow code if dims is not :. An alternative of course would be to handle the dims argument here directly.
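The pattern being discussed here — a keyword-argument entry point that checks `dims` itself, takes a fast path for `dims = :`, and reuses the generic AbstractArray code otherwise via `invoke` — could look roughly like the following. `generic_extrema` and its fast path are hypothetical stand-ins, not the PR's actual code.

```julia
using SparseArrays

# Generic entry point: keyword arguments don't participate in dispatch,
# so the sparse method inspects `dims` itself.
generic_extrema(A::AbstractArray; dims=:) = extrema(A; dims=dims)

function generic_extrema(A::SparseMatrixCSC; dims=:)
    if dims === Colon()
        # fast path: look only at stored values, then at the implicit zero
        nz = nonzeros(A)
        z = zero(eltype(A))
        lo, hi = isempty(nz) ? (z, z) : extrema(nz)
        if nnz(A) < length(A)
            lo, hi = min(lo, z), max(hi, z)
        end
        return (lo, hi)
    else
        # fall back to the generic AbstractArray method for other dims
        return invoke(generic_extrema, Tuple{AbstractArray}, A; dims=dims)
    end
end
```

This keeps a single public entry point while the `dims = :` case never touches the generic iteration.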

Contributor Author

It doesn't throw errors; it only uses the slow generic code.
There are only a few cases:
extrema(f, A::SparseVector, dims=1), extrema(f, A::AbstractSparseMatrix, dims=[1,2]) and extrema(f, A::SparseVecOrMat, ::Colon),
which are equivalent to extrema(f, A::SparseVecOrMat).
We still need to implement extrema(f, A::SparseVecOrMat, dims=1) and extrema(f, A::SparseVecOrMat, dims=2).
In my opinion, there should be two separate functions because the optimized algorithms are quite different.
I believe this covers every case.

Member

@simeonschaub simeonschaub Sep 9, 2020

The problem is that dims is not a positional argument and keyword args don't participate in dispatch. The correct function signature should probably be Base._extrema_dims(f, A::SparseVecOrMat, ::Colon), so this only gets called for the dims=: case (which is also the default).

Contributor Author

I implemented @JeffBezanson's idea of using invoke. Actually, reimplementing the functions that handle dims would greatly improve speed, but it is a bit tricky to keep them consistent with the Base implementation.

@codecov-commenter

Codecov Report

Merging #37429 into master will increase coverage by 0.49%.
The diff coverage is 72.36%.


@@            Coverage Diff             @@
##           master   #37429      +/-   ##
==========================================
+ Coverage   87.05%   87.55%   +0.49%     
==========================================
  Files         357      351       -6     
  Lines       72041    71010    -1031     
==========================================
- Hits        62718    62174     -544     
+ Misses       9323     8836     -487     
Impacted Files Coverage Δ
base/Base.jl 50.98% <ø> (ø)
base/expr.jl 89.33% <69.56%> (-4.55%) ⬇️
stdlib/Printf/src/Printf.jl 69.89% <69.89%> (-17.28%) ⬇️
base/reflection.jl 88.88% <86.95%> (-0.10%) ⬇️
base/compiler/ssair/show.jl 83.40% <88.88%> (+0.14%) ⬆️
base/compiler/optimize.jl 75.00% <95.23%> (+0.92%) ⬆️
base/abstractarray.jl 88.34% <100.00%> (ø)
base/compiler/tfuncs.jl 86.64% <100.00%> (+0.10%) ⬆️
base/logging.jl 89.78% <100.00%> (+0.53%) ⬆️
base/show.jl 89.14% <100.00%> (ø)
... and 21 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bf886b5...ae6f998. Read the comment docs.

vmin, vmax
end

extrema(A::SparseVecOrMat) = extrema(identity, A)
Member

I believe this fallback is already defined in Base.

Contributor Author

The more specific definition is
extrema(A::AbstractArray; dims = :) = _extrema_dims(identity, A, dims)
in base/multidimensional.jl line 1599.
So it seems this is needed.
I benchmarked extrema with:

julia> S = sprand(10000, 10000, 0.00001);
julia> @btime extrema(S)
  3.328 μs (1 allocation: 32 bytes)
(0.0, 0.9993973695872556)

and without this line:

julia> @btime extrema(S)
  643.939 ms (1 allocation: 32 bytes)
(0.0, 0.9993973695872556)

@simeonschaub
Member

Why do you define separate methods for extrema_ for SparseVector and AbstractSparseMatrixCSC? I personally think just overloading Base._extrema_dims would be a much cleaner solution, as I would generally consider using invoke to be a bit of an antipattern, but I will leave the final judgement here to @JeffBezanson and @dkarrasch.

Comment on lines 691 to 708
n = 10
A = sprand(n, n, 0.2)
B = Array(A)
x = sprand(n, 0.2)
y = Array(x)
f(x) = x^3
@test extrema(A) == extrema(B)
@test extrema(x) == extrema(y)
@test extrema(f, A) == extrema(f, B)
@test extrema(f, x) == extrema(f, y)
@test extrema(spzeros(n, n)) == (0.0, 0.0)
@test extrema(spzeros(n)) == (0.0, 0.0)
@test_throws ArgumentError extrema(spzeros(0, 0))
@test_throws ArgumentError extrema(spzeros(0))
@test extrema(sparse(ones(n, n))) == (1.0, 1.0)
@test extrema(sparse(ones(n))) == (1.0, 1.0)
Member

Can you add tests with dims=: and dims=2 to test that the fallback still works correctly?

Contributor Author

I added some tests. It should cover most cases.

@dkarrasch
Member

I think I agree with @simeonschaub. If the generic extrema function (with the dims keyword) is the single entry point which calls Base._extrema_dims, then we can specialize the latter for SparseVecOrMat first, and depending on vector vs matrix and the dims argument, call

  • (for dims = : [or dims = (1,2) in the matrix case and dims = 1 in the vector case]) another "helper function" whose body is the fast, main routine added here,
  • (for dims = 1 and matrices) call extrema on column views (which should be not difficult to write and fast)
  • (for dims = 2) the invoke thing.

I wonder if it's acceptable for sparse matrices(!) and dims=2 to form the transpose and then compute extrema(...; dims=1). It can be a trap if the matrix isn't really sparse due to memory allocation, but it may be much faster for truly sparse matrices... But I guess allocating potentially much memory goes against the very concept of a reduction method.
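The dims = 1 case for matrices mentioned in the first two bullets can be done column by column over the CSC structure, accounting for implicit zeros per column. A hedged sketch, with `colwise_extrema` as a hypothetical name:

```julia
using SparseArrays

# Sketch: per-column extrema over a CSC matrix. Each column's stored
# entries are reached via nzrange; a column with fewer stored entries
# than rows also contains implicit zeros.
function colwise_extrema(A::SparseMatrixCSC{Tv}) where Tv
    m, n = size(A)
    out = Matrix{Tuple{Tv,Tv}}(undef, 1, n)
    vals = nonzeros(A)
    z = zero(Tv)
    for j in 1:n
        rng = nzrange(A, j)
        if isempty(rng)
            out[1, j] = (z, z)          # column is all implicit zeros
        else
            lo, hi = extrema(view(vals, rng))
            if length(rng) < m          # implicit zeros in this column
                lo, hi = min(lo, z), max(hi, z)
            end
            out[1, j] = (lo, hi)
        end
    end
    return out
end
```

This never densifies the matrix, unlike the transpose idea for dims = 2.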

@cfauchereau cfauchereau marked this pull request as draft September 15, 2020 18:21
@cfauchereau
Contributor Author

Why do you define separate methods for extrema_ for SparseVector and AbstractSparseMatrixCSC? I personally think just overloading Base._extrema_dims would be a much cleaner solution, as I would generally consider using invoke to be a bit of an antipattern, but I will leave the final judgement here to @JeffBezanson and @dkarrasch.

It is indeed cleaner to extend '_extrema_dims'. In the same spirit, I renamed 'extrema' so it extends '_extrema_itr'. The new tests pass and it works well in manual experiments. However, some other tests are failing. It seems to be an ambiguity with '_extrema_dims'; I haven't figured out why yet.

invoke(_extrema_dims, Tuple{Any, AbstractArray, Any}, f, A, dims)
end

_extrema_dims(f, A::SparseVecOrMat, ::Colon) = _extrema_itr(f, A)
Member

I'm not sure, but maybe splitting this into two helps with the ambiguity?

Suggested change
_extrema_dims(f, A::SparseVecOrMat, ::Colon) = _extrema_itr(f, A)
_extrema_dims(f, A::AbstractSparseVector, ::Colon) = _extrema_itr(f, A)
_extrema_dims(f, A::AbstractSparseMatrix, ::Colon) = _extrema_itr(f, A)

I believe you could generalize the method on line 1174 towards AbstractSparseVector... well, or restrict the matrix methods to SparseMatrixCSC. I'm not sure what the API for the abstract ones is.

Contributor Author

Unfortunately, it doesn't work.
The method on line 1174 can't be generalized because nnz is not defined for AbstractSparseVector. Almost every function in SparseArrays is defined for SparseVector and AbstractSparseMatrixCSC, actually.

Member

No problem, that's fine. But does splitting the methods into SparseVector and AbstractSparseMatrix help? I think there might be an ambiguity because the method here is more abstract in the second argument but more concrete in the third (but I forgot compared to which other method).
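The kind of ambiguity being described — one method more abstract in the second argument but more concrete in the third — can be reproduced with a hypothetical toy example:

```julia
# Neither method is more specific overall: the first wins on the third
# argument, the second wins on the second argument, so a call matching
# both is ambiguous and Julia raises a MethodError.
g(f, A::AbstractArray, dims::Colon) = "colon method"
g(f, A::AbstractMatrix, dims)       = "matrix method"

# g(identity, zeros(2, 2), :)  # MethodError: ambiguous
```

Splitting the sparse methods by argument type, as suggested above, is one way to make one candidate strictly more specific.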

@dkarrasch
Member

This is certainly "ready for review", if not for merging. 😉

@cfauchereau cfauchereau marked this pull request as ready for review September 20, 2020 17:12
@cfauchereau
Contributor Author

It seems that the FreeBSD test is failing in test/stress.jl. I can't see how it is related to this pull request.

@dkarrasch dkarrasch merged commit 93bbe08 into JuliaLang:master Sep 26, 2020
@dkarrasch
Member

Thank you @Clement-F for this great work.

@cfauchereau cfauchereau deleted the sparse_extrema branch September 26, 2020 15:13
Successfully merging this pull request may close these issues.

extrema on SparseArray doesn't exploit sparsity
6 participants