Add a pass that fuses matmul and transpose operations #567
base: main
Conversation
For the transpose-matmul-transpose pattern in Llama2, this pass performs fusion and vectorization during dialect lowering. At the operator level, the fused version is 1.84x faster than the unfused one; at the model level, no measurable speedup is visible.
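As a rough illustration, a minimal skeleton of the kind of rewrite pattern described here might look as follows. The op names and matching order are assumptions reconstructed from the review snippets below, not the actual implementation:

```cpp
// Hypothetical sketch: match transpose -> matmul -> transpose and fuse.
struct TransposeMatMulTransposeFusion
    : public OpRewritePattern<tosa::MatMulOp> {
  using OpRewritePattern<tosa::MatMulOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(tosa::MatMulOp op,
                                PatternRewriter &rewriter) const override {
    // The B operand must be produced by a transpose.
    auto transposeBOp = op->getOperand(1).getDefiningOp<tosa::TransposeOp>();
    if (!transposeBOp)
      return failure();
    // The matmul result must feed exactly one transpose.
    Value c = op->getOpResult(0);
    if (!c.hasOneUse())
      return failure();
    auto transposeCOp = dyn_cast<tosa::TransposeOp>(*c.getUsers().begin());
    if (!transposeCOp)
      return failure();
    // ... replace the three ops with a single fused, vectorized loop ...
    return success();
  }
};
```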
```cpp
Value B = op->getOperand(1);
Value C = op->getOpResult(0);

tosa::ReshapeOp reshapeBOp = B.getDefiningOp<tosa::ReshapeOp>();
```
Maybe you can use `auto` here, e.g. `auto reshapeBOp = B.getDefiningOp<tosa::ReshapeOp>();`; the template argument already tells us the type of the op.
linuxlonelyeagle left a comment:
A brief review.
```cpp
if (!transposeBOp) {
  return failure();
}
Value::user_iterator reshapeCUserIt = C.getUsers().begin();
```
A `C.getUsers().empty()` check is good here before dereferencing the first user.
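A minimal sketch of that guard, assuming the pattern goes on to inspect C's first user:

```cpp
// Bail out early when C has no users at all; only then is it safe to
// dereference the first user iterator.
if (C.getUsers().empty())
  return failure();
Operation *reshapeCUser = *C.getUsers().begin();
```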
```cpp
ShapedType newBType =
    cast<ShapedType>(transposeBOp.getOperand(0).getType());
ShapedType newCType =
    cast<ShapedType>(transposeCOp->getOpResult(0).getType());
```
Can you use `transposeCOp->getResult(0).getType()`?
Or `transposeCOp.getType()`.
```cpp
Value vlStep = rewriter.create<arith::ConstantIndexOp>(loc, vecSize);
Value zero = rewriter.create<arith::ConstantOp>(
    loc, rewriter.getZeroAttr(elementType));
const AffineExpr d0 = rewriter.getAffineDimExpr(0);
```
Don't use `const` here.
```cpp
// Create pass through vector.
Value passThroughVec = rewriter.create<SplatOp>(loc, vectorTy, zero);
Value newA = rewriter.create<bufferization::ToMemrefOp>(
```
Is it possible to avoid using the bufferization dialect here? This is just a fusion pattern.
```cpp
Value aCol = rewriter.create<memref::DimOp>(loc, newA, c2);
Value bCol = rewriter.create<memref::DimOp>(loc, newB, c3);

Value upperBoundTmp = rewriter.create<arith::SubIOp>(loc, bCol, vlStep);
```
For the sub and add we can use `affine.apply`, rather than creating separate `arith` add and sub operations.
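A hedged sketch of that suggestion for the upper-bound computation, reusing the variable names from the snippet above:

```cpp
// Fold `bCol - vlStep` into an affine map applied to bCol, instead of
// materializing an explicit arith.subi.
AffineExpr d0 = rewriter.getAffineDimExpr(0);
AffineMap subMap =
    AffineMap::get(/*dimCount=*/1, /*symbolCount=*/0, d0 - vecSize);
Value upperBoundTmp =
    rewriter.create<affine::AffineApplyOp>(loc, subMap, ValueRange{bCol});
```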
```cpp
// loopBody->addArguments(types, locs);
Block &loopBody = parOp.getRegion().front();
rewriter.setInsertionPointToStart(&loopBody);
Value ivs0 = loopBody.getArguments()[0];
```
You can get this via the loop op's induction-variable accessor, e.g. `iv = parOp.getInductionVars()[0]`, instead of indexing the block arguments directly.
```cpp
    newC,
    ValueRange{c0, ivs1, ivs0, iv});
Value idx =
    nestedBuilder.create<arith::AddIOp>(nestedLoc, iv, vlStep);
```
Use `affine.apply` for this add as well.
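Sketched the same way as the subtraction above:

```cpp
// Compute `iv + vecSize` with affine.apply instead of arith.addi.
AffineExpr d0 = nestedBuilder.getAffineDimExpr(0);
AffineMap addMap =
    AffineMap::get(/*dimCount=*/1, /*symbolCount=*/0, d0 + vecSize);
Value idx = nestedBuilder.create<affine::AffineApplyOp>(nestedLoc, addMap,
                                                        ValueRange{iv});
```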
Thank you for your feedback. While considering how to avoid the bufferization dialect, I discovered that this pass can be moved from the TOSA level and completed at the Linalg level instead. I will resubmit all modifications after completing this change.
After applying this pass, the measured runtime of the transpose-matmul-transpose pattern dropped from 0.00528216 to 0.00191212.