Refactor get_degrees() to make it faster #732

alyst · 2025-03-28T00:18:08Z

Profiling indicated that in my use case (symbolic multiplication of several large sparse matrices takes almost 1h) a lot of time is spent in get_degrees().

So this PR is my attempt to make it faster and use less memory:

some trivial fixes to avoid broadcasting
extract operator handling into separate routines (get_degrees(op, expr)) instead of if op == + .. elseif op == * end.
That increases memory footprint by 10%, but also makes it faster by 10% (I guess type inference is improved).
avoid calling sorted_arguments() within get_degrees(), as that itself calls get_degrees(), so get_degrees() gets called multiple times for the same expression.
Instead, get_degrees() uses arguments() and sorts the list of degrees itself.
Cache the get_degrees(getindex, expr): that increases the speed and 2x reduces memory footprint (in the test example below)

On the test case I observe 3x speed-up and 3x reduction in memory footprint.

The benchmark script

using SymbolicUtils, BenchmarkTools

@syms x y[1:3] z[1:2, 1:2]
# Create a complex polynomial expression
y1 = term(getindex, y, 1, type=Number)
y2 = term(getindex, y, 2, type=Number)
y3 = term(getindex, y, 3, type=Number)
z11 = term(getindex, z, 1, 1, type=Number)
z12 = term(getindex, z, 1, 2, type=Number)
z23 = term(getindex, z, 2, 3, type=Number)
large_poly = SymbolicUtils.expand((x^2 + 2y1 + 3z12 + y2*z23 + x*y1*z12 - x^2*z12 + x*z11 + y3 + y2 + z23 + 1)^8);

b = @benchmark SymbolicUtils.get_degrees($large_poly) samples=50 evals=1

Results on a master (Julia 1.11.4):

BenchmarkTools.Trial: 16 samples with 1 evaluation per sample.
 Range (min … max):  298.827 ms … 359.108 ms  ┊ GC (min … max): 0.00% … 12.83%
 Time  (median):     311.555 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   319.948 ms ±  19.761 ms  ┊ GC (mean ± σ):  4.18% ±  5.27%

      █                                   ▁                      
  ▆▆▁▆█▁▁▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▆▁▁▁▆▁█▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  299 ms           Histogram: frequency by time          359 ms <

 Memory estimate: 116.91 MiB, allocs estimate: 3390531.

BenchmarkTools.Trial: 16 samples with 1 evaluation per sample.
 Range (min … max):  299.432 ms … 360.751 ms  ┊ GC (min … max): 0.00% … 13.12%
 Time  (median):     318.398 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   320.586 ms ±  19.074 ms  ┊ GC (mean ± σ):  4.72% ±  5.29%

  ▃   ▃█                                                         
  █▁▁▇██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▇▇▇▁▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  299 ms           Histogram: frequency by time          361 ms <

 Memory estimate: 116.91 MiB, allocs estimate: 3390531.

Results for this PR:

BenchmarkTools.Trial: 43 samples with 1 evaluation per sample.
 Range (min … max):  107.692 ms … 156.781 ms  ┊ GC (min … max): 0.00% … 22.67%
 Time  (median):     109.903 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   116.491 ms ±  11.735 ms  ┊ GC (mean ± σ):  3.87% ±  6.74%

  █▁                                                             
  ██▃▁▁▃▁▃▃▁▄▁▃▁▁▁▆▄▁▄▃▁▁▃▁▁▁▁▁▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃ ▁
  108 ms           Histogram: frequency by time          157 ms <

 Memory estimate: 43.92 MiB, allocs estimate: 1448421.

BenchmarkTools.Trial: 43 samples with 1 evaluation per sample.
 Range (min … max):  107.790 ms … 146.127 ms  ┊ GC (min … max): 0.00% … 23.08%
 Time  (median):     111.730 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   116.886 ms ±  10.678 ms  ┊ GC (mean ± σ):  3.84% ±  6.82%

  █                                                              
  ██▆▃▁▁▃▁▃▁▃▁▁▃▃▁▁▁▁▁▃▄▃▇▁▆▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▃▁▁▃ ▁
  108 ms           Histogram: frequency by time          146 ms <

 Memory estimate: 43.92 MiB, allocs estimate: 1448421.

github-actions · 2025-03-28T00:24:59Z

Benchmark Results

	master	`f76e4d0`...	master/f76e4d0eb5f1a3...
overhead/acrule/a+2	0.988 ± 0.3 μs	0.979 ± 0.32 μs	1.01
overhead/acrule/a+2+b	1.14 ± 0.33 μs	0.975 ± 0.33 μs	1.17
overhead/acrule/a+b	0.267 ± 0.027 μs	0.264 ± 0.0082 μs	1.01
overhead/acrule/noop:Int	1.55 ± 0.01 ns	1.56 ± 0.01 ns	0.994
overhead/acrule/noop:Sym	27.4 ± 6.1 ns	27.5 ± 6.1 ns	0.998
overhead/rule/noop:Int	0.0355 ± 0.025 μs	0.0352 ± 0.024 μs	1.01
overhead/rule/noop:Sym	0.0329 ± 0.024 μs	0.0334 ± 0.024 μs	0.987
overhead/rule/noop:Term	0.0333 ± 0.025 μs	0.0335 ± 0.025 μs	0.995
overhead/ruleset/noop:Int	0.119 ± 0.0032 μs	0.118 ± 0.0093 μs	1.02
overhead/ruleset/noop:Sym	0.129 ± 0.0044 μs	0.13 ± 0.01 μs	0.991
overhead/ruleset/noop:Term	4.33 ± 0.57 μs	4.52 ± 0.86 μs	0.959
overhead/simplify/noop:Int	0.242 ± 0.031 μs	0.213 ± 0.034 μs	1.14
overhead/simplify/noop:Sym	0.239 ± 0.031 μs	0.215 ± 0.034 μs	1.11
overhead/simplify/noop:Term	0.0372 ± 0.0027 ms	0.0377 ± 0.0048 ms	0.986
overhead/simplify/randterm (+, *):serial	0.0944 ± 0.0034 s	0.0914 ± 0.0041 s	1.03
overhead/simplify/randterm (+, *):thread	0.0563 ± 0.0066 s	0.0571 ± 0.0058 s	0.986
overhead/simplify/randterm (/, *):serial	0.204 ± 0.023 ms	0.208 ± 0.026 ms	0.982
overhead/simplify/randterm (/, *):thread	0.237 ± 0.022 ms	0.237 ± 0.026 ms	0.998
overhead/substitute/a	0.102 ± 0.013 ms	0.101 ± 0.012 ms	1.01
overhead/substitute/a,b	0.0882 ± 0.011 ms	0.0891 ± 0.01 ms	0.99
overhead/substitute/a,b,c	20.1 ± 2.3 μs	19.5 ± 2.3 μs	1.03
polyform/easy_iszero	0.039 ± 0.0025 ms	0.0422 ± 0.0064 ms	0.923
polyform/isone	3.1 ± 0.01 ns	2.79 ± 0.01 ns	1.11
polyform/iszero	1.33 ± 0.033 ms	1.34 ± 0.034 ms	0.995
polyform/simplify_fractions	1.81 ± 0.052 ms	1.81 ± 0.044 ms	1
time_to_load	1 ± 0.0081 s	1.02 ± 0.011 s	0.988

Benchmark Plots

A plot of the benchmark results have been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

AayushSabharwal · 2025-03-28T11:41:49Z

Thanks for implementing this, the performance improvement is great. I have a couple concerns:

It seems to have different behavior. The Symbolics reference test is failing
Could _get_degrees be written so it doesn't dispatch on op? Since BasicSymbolic is type-unstable (out of necessity) it's always going to face runtime dispatch this way, and would be better off as an ifelse chain.

alyst · 2025-03-28T16:58:14Z

It seems to have different behavior. The Symbolics reference test is failing

Yes, I think that's because I have added support for /, so now it puts the negative powers first.
I believe before fractions were going the "zzzz" + hash() path.
I can fix it, but probably that's the right moment to clarify what is the expected sorting for the terms with negative powers?
Should the terms be sorted by abs(deg)?

Could _get_degrees be written so it doesn't dispatch on op? Since BasicSymbolic is type-unstable (out of necessity) it's always going to face runtime dispatch this way, and would be better off as an ifelse chain.

I can do that, but, as I wrote in the description, the dispatching version is 10% faster.
I guess that's because type inference for local variables that are reused by the if branches works better when the code is in separate functions.
So maybe if the variable names in different op branches don't overlap, it will be faster -- but if that's the case, it is some fragile code with obscure requirements to be fast (plus it is less readable).

ignore degrees (a[2], b[2]) if syms (a[1], b[1]) are not equal

* replace if op = ... with get_degrees(op, expr) for extendability * avoid sorted_arguments() within get_degrees() * reduce allocations by iterating of degs in efficient way * support /: const/expr and expr/const

keep vectors of degrees to optimize memory usage

For long polynomials get_degree(term) causes most allocations and time. Cashing get_degree(term) to speeds it up and reduces memory footprint.

alyst · 2025-04-02T00:03:22Z

I've changed the sorting of negative degree:

for lexlt(): terms are sorted by descending abs(deg), if equal, positive degree is first
for monomial_lt(): terms are sorted by ascending abs(deg), if equal, positive degree is first

That should fix the Symbolics.jl integration tests (I've also added the failing expression to SymbolicUtils.jl unit tests).
The failures in ModelingToolkit.jl don't seem to be directly related to this PR. Probably some spurious numerical method failures.

Alexey Stukalov added 6 commits April 1, 2025 16:08

fix lexlt

f9bbf44

ignore degrees (a[2], b[2]) if syms (a[1], b[1]) are not equal

get_degrees(): avoid broadcasting

24e1e55

get_degrees(): extend docstring

d29d180

get_degrees(): refactor

2718054

* replace if op = ... with get_degrees(op, expr) for extendability * avoid sorted_arguments() within get_degrees() * reduce allocations by iterating of degs in efficient way * support /: const/expr and expr/const

get_degrees(): convert to tuple at the very end

929eeeb

keep vectors of degrees to optimize memory usage

get_degees(): cache term syms

bcec2ee

For long polynomials get_degree(term) causes most allocations and time. Cashing get_degree(term) to speeds it up and reduces memory footprint.

alyst force-pushed the refactor_get_degrees branch from 2c9922a to bcec2ee Compare April 1, 2025 23:09

monomial/lex_lt: sort by abs degree, positive first

f76e4d0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor get_degrees() to make it faster #732

Refactor get_degrees() to make it faster #732

alyst commented Mar 28, 2025

github-actions bot commented Mar 28, 2025 •

edited

Loading

AayushSabharwal commented Mar 28, 2025

alyst commented Mar 28, 2025

alyst commented Apr 2, 2025

Refactor get_degrees() to make it faster #732

Are you sure you want to change the base?

Refactor get_degrees() to make it faster #732

Conversation

alyst commented Mar 28, 2025

github-actions bot commented Mar 28, 2025 • edited Loading

Benchmark Results

Benchmark Plots

AayushSabharwal commented Mar 28, 2025

alyst commented Mar 28, 2025

alyst commented Apr 2, 2025

github-actions bot commented Mar 28, 2025 •

edited

Loading