[examples] Add performance testing for linalg.matmul operator using three passes #614
1. Overview
- `linalg.matmul` ops appearing in `build/examples/BuddyDeepSeekR1/subgraph0_prefill.mlir` and `build/examples/BuddyDeepSeekR1/subgraph0_decode.mlir`.
- `linalg.matmul` across different dimensions.

2. Experiment Environment
Hardware Configuration
Software Environment
- `examples/BuddyNext/next-linalg-matmul-vec-perf`
- `make next-linalg-matmul-vec-perf-run`

3. Statistics of linalg.matmul op sizes
The following table summarizes the sizes and occurrences of `linalg.matmul` ops found in `subgraph0_prefill.mlir` and `subgraph0_decode.mlir`.

`subgraph0_prefill.mlir`

`subgraph0_decode.mlir`

4. Comparison of Different Passes
In `examples/BuddyNext/next-linalg-matmul-perf.mlir`, the 10 different `linalg.matmul` op sizes were tested with three optimization passes:

- `-matmul-parallel-vectorization-optimize`
- `-matmul-vectorization`
- `-matmul-vectorization-blis`

The execution times were recorded for each case.
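
Because the ten sizes differ widely in the amount of work, raw seconds are hard to compare across rows. As a minimal sketch, assuming each size is a triple of matmul dimensions and each measurement is in seconds, a timing can be converted to GFLOP/s using the standard 2·M·N·K operation count. The helper names `matmul_flops` and `matmul_gflops` and the 1.0-second timing are hypothetical and are not measurements from this PR.

```python
def matmul_flops(dims):
    """Floating-point operations for one matmul: one multiply and one add per (m, n, k)."""
    a, b, c = dims
    # The product is the same regardless of how the three dimensions are ordered.
    return 2 * a * b * c


def matmul_gflops(dims, seconds):
    """Throughput in GFLOP/s for a matmul of the given size that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return matmul_flops(dims) / seconds / 1e9


# Hypothetical timing, for illustration only (not a result from this PR).
size = (1024, 1536, 8960)
print(f"{size}: {matmul_gflops(size, 1.0):.2f} GFLOP/s")
```

Normalizing this way makes it easier to tell whether a slow case is slow in absolute terms only or also relative to its operation count.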
Multi-thread Comparison
The table below shows the execution time (in seconds) of each `linalg.matmul` op under different matrix vectorization passes with multi-threading enabled.

Single-thread Comparison
In the multi-thread version, the passes include parallelization optimizations. To isolate the effect of each matmul vectorization pass, the parallelization-related passes were removed for a single-thread comparison.

Test Screenshots
Due to the large number of screenshots, only one case is shown below: `-matmul-parallel-vectorization-optimize` at size `[1024,1536,8960]`, where a long execution time was observed. Although all results were obtained using the same pass, the timing varies significantly. The upper row shows the expected result.
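
The variation described above can be quantified before deciding which measurement is representative. The following is a minimal sketch, assuming repeated timings of the same case are available as a list of seconds. The helper name `summarize_runs`, the 1.5x threshold, and the sample values are all hypothetical and are not taken from the screenshots.

```python
import statistics


def summarize_runs(times, threshold=1.5):
    """Return median, min, max, and the runs slower than threshold * median."""
    if not times:
        raise ValueError("need at least one timing")
    median = statistics.median(times)
    # A run counts as an outlier when it is much slower than the typical run.
    outliers = [t for t in times if t > threshold * median]
    return {
        "median": median,
        "min": min(times),
        "max": max(times),
        "outliers": outliers,
    }


# Hypothetical timings in seconds, for illustration only.
runs = [0.9, 1.0, 1.1, 4.2, 1.0]
summary = summarize_runs(runs)
print(summary)
```

Reporting the median alongside the spread keeps a single slow run from dominating the comparison between passes.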
5. Conclusion
- `linalg.matmul` op size under multi-threaded parallel conditions.
- The `matmul-parallel-vectorization-optimize` pass shows an abnormally long execution time at size `[1024,1536,8960]`.