Describe the bug
When I add operator fusion passes such as FlashAttention or RMSNorm fusion to the E2E pass pipeline, end-to-end performance drops instead of improving.
Before applying the norm opt, the E2E performance is:
After applying the norm opt, the E2E performance is:
To Reproduce
This issue can be reproduced with #596 or #553.