Optimize matmuls involving block diagonal matrices #1493
base: main
Conversation
Pull Request Overview
This PR introduces an optimization that rewrites matrix multiplications involving block diagonal matrices into separate smaller multiplications and concatenations, yielding significant performance gains. It also adds tests to verify the rewrite and benchmarks to measure its impact.
- Implement the `local_block_diag_dot_to_dot_block_diag` rewrite in `math.py`
- Import and wire up the necessary primitives (`split`, `join`, `BlockDiagonal`)
- Add unit tests and benchmarks in `test_math.py` to validate correctness and performance
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
| --- | --- |
| pytensor/tensor/rewriting/math.py | Added the `local_block_diag_dot_to_dot_block_diag` rewrite and required imports (`split`, `join`, `BlockDiagonal`) |
| tests/tensor/rewriting/test_math.py | Added tests (`test_local_block_diag_dot_to_dot_block_diag`) and benchmarks (`test_block_diag_dot_to_dot_concat_benchmark`) |
Comments suppressed due to low confidence (1)
pytensor/tensor/rewriting/math.py:191
- The name `Blockwise` is referenced but not imported, which will raise a `NameError` if the first condition is false. Add `from pytensor.tensor.slinalg import Blockwise` (or the correct module) at the top of the file.

      or isinstance(x.owner.op, Blockwise)
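For reference, in recent PyTensor versions `Blockwise` appears to live in `pytensor.tensor.blockwise` rather than `slinalg`, so the missing import would presumably be something like:

```python
# Assumed import path; double-check against the PyTensor version this PR targets.
from pytensor.tensor.blockwise import Blockwise
```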
Codecov Report

Attention: Patch coverage is 85.71%.

❌ Your patch check has failed because the patch coverage (85.71%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@ Coverage Diff @@
##             main    #1493   +/-   ##
=======================================
  Coverage        ?   81.98%
=======================================
  Files           ?      231
  Lines           ?    52231
  Branches        ?     9196
=======================================
  Hits            ?    42822
  Misses          ?     7098
  Partials        ?     2311
Looks great! Some minor optimization questions
# non-block diagonal, and return a new block diagonal
if check_for_block_diag(x) and not check_for_block_diag(y):
    components = x.owner.inputs
    y_splits = split(
Isn't this and the join along the 0th axis assuming a `Blockwise` `BlockDiagonal` without batch dims?
Also not sure why you look for `Dot` but not `Blockwise` of `_matrix_matrix_matmul`. It doesn't always get rewritten as a `Dot`.
I guess because you don't look for batch dot you can only have a `BlockDiagonal` without batch dims. That's fine, but maybe a bit implicit. You can also wait for the useless `Blockwise` `BlockDiagonal` to be rewritten as `BlockDiagonal` and only track that.
More importantly, because you track a regular dot and not the matmul, you may have a vector * matrix or matrix * vector product. Does the rewrite handle these correctly?
If this is what's implied, it's not on purpose. I'll modify it to account for blockwise dot.

Is there no canonical dot form we rewrite to in an intermediate step to make reasoning about graphs easier? It seems nuts to have to look for a bunch of different ops: `_matrix_matrix_matmul` or `_matrix_vec_matmul` or `dot22` or whatever.
I'm simplifying every blockwise dot as a blockwise 2x2 dot (i.e. matmul) in #1471.

The `dot22` and `dot22scalar` Ops are from the BLAS pipeline and I've been hesitant to touch them. As first steps I would like to move them after specialize and to get rid of `dot22scalar` (it should just be gemm). That BLAS stuff should also work with blockwise, but it currently doesn't.

Anyway, if you target blockwise and the core 2x2 dot in this PR, that should be the most robust going forward, even if it misses some cases now. I suggest you explicitly exclude the vector-matrix dots for now.
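A minimal sketch of what tracking both the core `Dot` and a `Blockwise`-wrapped matmul could look like (illustrative only, not the PR's actual code; the rewriter name and the early-exit checks here are hypothetical):

```python
from pytensor.graph.rewriting.basic import node_rewriter
from pytensor.tensor.blockwise import Blockwise
from pytensor.tensor.math import Dot


@node_rewriter([Dot, Blockwise])
def local_block_diag_dot_sketch(fgraph, node):
    # Accept a core Dot node, or a Blockwise whose core op is Dot (i.e. a matmul).
    op = node.op
    if isinstance(op, Blockwise) and not isinstance(op.core_op, Dot):
        return None
    x, y = node.inputs
    # Explicitly exclude vector-matrix / matrix-vector products, as suggested above.
    if x.type.ndim < 2 or y.type.ndim < 2:
        return None
    ...  # rest of the rewrite: split the other operand, do per-block dots, join
```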
skipping matrix-vector is a bit of a bummer
Description
This PR adds a rewrite to optimize matrix multiplication involving block diagonal matrices. When we have a matrix `X = BlockDiag(A, B)` and compute `Z = X @ Y`, there is no interaction between terms in the `A` part and the `B` part of `X`. So the dot can instead be computed as `row_stack(A @ Y[:A.shape[1]], B @ Y[A.shape[1]:])` (or, in the general case, `Y` can be split into `n` pieces with appropriate shapes, and we do `row_stack([diag_component @ y_split for diag_component, y_split in zip(BlockDiag.inputs, split(Y, *args))])`). In the case where the block diagonal matrix is right-multiplying, you instead `col_stack` and slice on `axis=1`.

Anyway, it's a lot faster to do this, because matmuls scale really badly in the dimension of the input, so doing two smaller operations is preferred. Here are the benchmarks: small has `n=10`, medium has `n=100`, large has `n=1000`. In all cases the rewrite shows at least a 2x speedup.
Related Issue

block_diag(a, b) @ c #1044

Checklist
Type of change
📚 Documentation preview 📚: https://pytensor--1493.org.readthedocs.build/en/1493/