Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor get_degrees() to make it faster #732

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

alyst
Copy link

@alyst alyst commented Mar 28, 2025

Profiling indicated that in my use case (symbolic multiplication of several large sparse matrices takes almost 1h) a lot of time is spent in get_degrees().

So this PR is my attempt to make it faster and use less memory:

  • some trivial fixes to avoid broadcasting
  • extract operator handling into separate routines (get_degrees(op, expr)) instead of if op == + .. elseif op == * end.
    That increases memory footprint by 10%, but also makes it faster by 10% (I guess type inference is improved).
  • avoid calling sorted_arguments() within get_degrees(), as that itself calls get_degrees(), so get_degrees() gets called multiple times for the same expression.
    Instead, get_degrees() uses arguments() and sorts the list of degrees itself.
  • Cache the get_degrees(getindex, expr): that increases the speed and 2x reduces memory footprint (in the test example below)

On the test case I observe 3x speed-up and 3x reduction in memory footprint.

The benchmark script

using SymbolicUtils, BenchmarkTools

@syms x y[1:3] z[1:2, 1:2]
# Create a complex polynomial expression
y1 = term(getindex, y, 1, type=Number)
y2 = term(getindex, y, 2, type=Number)
y3 = term(getindex, y, 3, type=Number)
z11 = term(getindex, z, 1, 1, type=Number)
z12 = term(getindex, z, 1, 2, type=Number)
z23 = term(getindex, z, 2, 3, type=Number)
large_poly = SymbolicUtils.expand((x^2 + 2y1 + 3z12 + y2*z23 + x*y1*z12 - x^2*z12 + x*z11 + y3 + y2 + z23 + 1)^8);

b = @benchmark SymbolicUtils.get_degrees($large_poly) samples=50 evals=1

Results on a master (Julia 1.11.4):

BenchmarkTools.Trial: 16 samples with 1 evaluation per sample.
 Range (min … max):  298.827 ms … 359.108 ms  ┊ GC (min … max): 0.00% … 12.83%
 Time  (median):     311.555 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   319.948 ms ±  19.761 ms  ┊ GC (mean ± σ):  4.18% ±  5.27%

      █                                   ▁                      
  ▆▆▁▆█▁▁▆▁▁▁▁▁▁▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▆▁▆▁▁▁▆▁█▁▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ ▁
  299 ms           Histogram: frequency by time          359 ms <

 Memory estimate: 116.91 MiB, allocs estimate: 3390531.

BenchmarkTools.Trial: 16 samples with 1 evaluation per sample.
 Range (min … max):  299.432 ms … 360.751 ms  ┊ GC (min … max): 0.00% … 13.12%
 Time  (median):     318.398 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   320.586 ms ±  19.074 ms  ┊ GC (mean ± σ):  4.72% ±  5.29%

  ▃   ▃█                                                         
  █▁▁▇██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▇▇▇▁▇▇▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  299 ms           Histogram: frequency by time          361 ms <

 Memory estimate: 116.91 MiB, allocs estimate: 3390531.

Results for this PR:

BenchmarkTools.Trial: 43 samples with 1 evaluation per sample.
 Range (min … max):  107.692 ms … 156.781 ms  ┊ GC (min … max): 0.00% … 22.67%
 Time  (median):     109.903 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   116.491 ms ±  11.735 ms  ┊ GC (mean ± σ):  3.87% ±  6.74%

  █▁                                                             
  ██▃▁▁▃▁▃▃▁▄▁▃▁▁▁▆▄▁▄▃▁▁▃▁▁▁▁▁▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃ ▁
  108 ms           Histogram: frequency by time          157 ms <

 Memory estimate: 43.92 MiB, allocs estimate: 1448421.

BenchmarkTools.Trial: 43 samples with 1 evaluation per sample.
 Range (min … max):  107.790 ms … 146.127 ms  ┊ GC (min … max): 0.00% … 23.08%
 Time  (median):     111.730 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   116.886 ms ±  10.678 ms  ┊ GC (mean ± σ):  3.84% ±  6.82%

  █                                                              
  ██▆▃▁▁▃▁▃▁▃▁▁▃▃▁▁▁▁▁▃▄▃▇▁▆▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▃▁▁▃ ▁
  108 ms           Histogram: frequency by time          146 ms <

 Memory estimate: 43.92 MiB, allocs estimate: 1448421.

Copy link
Contributor

github-actions bot commented Mar 28, 2025

Benchmark Results

master f76e4d0... master/f76e4d0eb5f1a3...
overhead/acrule/a+2 0.988 ± 0.3 μs 0.979 ± 0.32 μs 1.01
overhead/acrule/a+2+b 1.14 ± 0.33 μs 0.975 ± 0.33 μs 1.17
overhead/acrule/a+b 0.267 ± 0.027 μs 0.264 ± 0.0082 μs 1.01
overhead/acrule/noop:Int 1.55 ± 0.01 ns 1.56 ± 0.01 ns 0.994
overhead/acrule/noop:Sym 27.4 ± 6.1 ns 27.5 ± 6.1 ns 0.998
overhead/rule/noop:Int 0.0355 ± 0.025 μs 0.0352 ± 0.024 μs 1.01
overhead/rule/noop:Sym 0.0329 ± 0.024 μs 0.0334 ± 0.024 μs 0.987
overhead/rule/noop:Term 0.0333 ± 0.025 μs 0.0335 ± 0.025 μs 0.995
overhead/ruleset/noop:Int 0.119 ± 0.0032 μs 0.118 ± 0.0093 μs 1.02
overhead/ruleset/noop:Sym 0.129 ± 0.0044 μs 0.13 ± 0.01 μs 0.991
overhead/ruleset/noop:Term 4.33 ± 0.57 μs 4.52 ± 0.86 μs 0.959
overhead/simplify/noop:Int 0.242 ± 0.031 μs 0.213 ± 0.034 μs 1.14
overhead/simplify/noop:Sym 0.239 ± 0.031 μs 0.215 ± 0.034 μs 1.11
overhead/simplify/noop:Term 0.0372 ± 0.0027 ms 0.0377 ± 0.0048 ms 0.986
overhead/simplify/randterm (+, *):serial 0.0944 ± 0.0034 s 0.0914 ± 0.0041 s 1.03
overhead/simplify/randterm (+, *):thread 0.0563 ± 0.0066 s 0.0571 ± 0.0058 s 0.986
overhead/simplify/randterm (/, *):serial 0.204 ± 0.023 ms 0.208 ± 0.026 ms 0.982
overhead/simplify/randterm (/, *):thread 0.237 ± 0.022 ms 0.237 ± 0.026 ms 0.998
overhead/substitute/a 0.102 ± 0.013 ms 0.101 ± 0.012 ms 1.01
overhead/substitute/a,b 0.0882 ± 0.011 ms 0.0891 ± 0.01 ms 0.99
overhead/substitute/a,b,c 20.1 ± 2.3 μs 19.5 ± 2.3 μs 1.03
polyform/easy_iszero 0.039 ± 0.0025 ms 0.0422 ± 0.0064 ms 0.923
polyform/isone 3.1 ± 0.01 ns 2.79 ± 0.01 ns 1.11
polyform/iszero 1.33 ± 0.033 ms 1.34 ± 0.034 ms 0.995
polyform/simplify_fractions 1.81 ± 0.052 ms 1.81 ± 0.044 ms 1
time_to_load 1 ± 0.0081 s 1.02 ± 0.011 s 0.988

Benchmark Plots

A plot of the benchmark results have been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

@AayushSabharwal
Copy link
Member

Thanks for implementing this, the performance improvement is great. I have a couple concerns:

  • It seems to have different behavior. The Symbolics reference test is failing
  • Could _get_degrees be written so it doesn't dispatch on op? Since BasicSymbolic is type-unstable (out of necessity) it's always going to face runtime dispatch this way, and would be better off as an ifelse chain.

@alyst
Copy link
Author

alyst commented Mar 28, 2025

It seems to have different behavior. The Symbolics reference test is failing

Yes, I think that's because I have added support for /, so now it puts the negative powers first.
I believe before fractions were going the "zzzz" + hash() path.
I can fix it, but probably that's the right moment to clarify what is the expected sorting for the terms with negative powers?
Should the terms be sorted by abs(deg)?

Could _get_degrees be written so it doesn't dispatch on op? Since BasicSymbolic is type-unstable (out of necessity) it's always going to face runtime dispatch this way, and would be better off as an ifelse chain.

I can do that, but, as I wrote in the description, the dispatching version is 10% faster.
I guess that's because type inference for local variables that are reused by the if branches works better when the code is in separate functions.
So maybe if the variable names in different op branches don't overlap, it will be faster -- but if that's the case, it is some fragile code with obscure requirements to be fast (plus it is less readable).

Alexey Stukalov added 6 commits April 1, 2025 16:08
ignore degrees (a[2], b[2]) if syms (a[1], b[1]) are not equal
* replace if op = ... with get_degrees(op, expr) for extendability
* avoid sorted_arguments() within get_degrees()
* reduce allocations by iterating of degs in efficient way
* support /: const/expr and expr/const
keep vectors of degrees to optimize memory usage
For long polynomials get_degree(term) causes most allocations and time.
Cashing get_degree(term) to speeds it up and reduces memory footprint.
@alyst alyst force-pushed the refactor_get_degrees branch from 2c9922a to bcec2ee Compare April 1, 2025 23:09
@alyst
Copy link
Author

alyst commented Apr 2, 2025

I've changed the sorting of negative degree:

  • for lexlt(): terms are sorted by descending abs(deg), if equal, positive degree is first
  • for monomial_lt(): terms are sorted by ascending abs(deg), if equal, positive degree is first

That should fix the Symbolics.jl integration tests (I've also added the failing expression to SymbolicUtils.jl unit tests).
The failures in ModelingToolkit.jl don't seem to be directly related to this PR. Probably some spurious numerical method failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants