Support Call::dynamic_shuffle for LUT in SVE2 #9087

Open
stevesuzuki-arm wants to merge 2 commits into halide:main from stevesuzuki-arm:pr-dynamic_shuffle
@stevesuzuki-arm
Contributor

This PR enables the OptimizeShuffles pass in CodeGen_ARM for SVE2.

  • Detects gather loads whose index range is bounded by a known value, e.g. a lookup table (LUT).
  • Transforms each such load into a contiguous load plus Call::dynamic_shuffle, which codegen lowers to a TBL instruction.

This is especially useful when vectorizing with long vectors in SME2 streaming mode, where the general form of gather load is unsupported.

OptimizeShuffles is modified so that it can be shared between targets (for now, Hexagon and ARM SVE2).
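
To illustrate the transformation described above, here is a minimal scalar sketch in plain C++. It is not the code in this PR and does not use the Halide API: the function names (`gather_load`, `dynamic_shuffle_model`, `lut_via_shuffle`) and the explicit `min_index`/`max_index` parameters are hypothetical, standing in for the bounds the pass would derive. The shuffle model assumes TBL-like behaviour in which an out-of-range index produces zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference form: a gather load, i.e. one independent memory access per lane.
std::vector<uint8_t> gather_load(const std::vector<uint8_t> &lut,
                                 const std::vector<uint8_t> &indices) {
    std::vector<uint8_t> out(indices.size());
    for (size_t i = 0; i < indices.size(); i++) {
        out[i] = lut[indices[i]];
    }
    return out;
}

// Scalar model of a table-lookup shuffle (hypothetical helper).
// Assumption: like a TBL instruction, an out-of-range index yields 0.
std::vector<uint8_t> dynamic_shuffle_model(const std::vector<uint8_t> &table,
                                           const std::vector<uint8_t> &indices) {
    std::vector<uint8_t> out(indices.size());
    for (size_t i = 0; i < indices.size(); i++) {
        out[i] = indices[i] < table.size() ? table[indices[i]] : 0;
    }
    return out;
}

// Rewritten form: one contiguous load of lut[min_index..max_index], then a
// shuffle using indices rebased to start at zero.
// Precondition: min_index <= indices[i] <= max_index < lut.size() for all i.
std::vector<uint8_t> lut_via_shuffle(const std::vector<uint8_t> &lut,
                                     const std::vector<uint8_t> &indices,
                                     uint8_t min_index, uint8_t max_index) {
    // Contiguous load covering only the index range that is actually used.
    std::vector<uint8_t> table(lut.begin() + min_index,
                               lut.begin() + max_index + 1);
    std::vector<uint8_t> rebased(indices.size());
    for (size_t i = 0; i < indices.size(); i++) {
        rebased[i] = static_cast<uint8_t>(indices[i] - min_index);
    }
    return dynamic_shuffle_model(table, rebased);
}
```

When every index lies within the stated bounds, `lut_via_shuffle` returns the same values as `gather_load`. In this sketch the contiguous load spans `max_index - min_index + 1` elements; a real vector load may read a full vector's worth of data, so its extent can differ.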

Checklist

  • Tests added or updated (not required for docs, CI config, or typo fixes)
  • Documentation updated (if public API changed)
  • Python bindings updated (if public API changed)
  • Benchmarks are included here if the change is intended to affect performance.
  • Commits include AI attribution where applicable (see Code of Conduct)

@abadams
Member

abadams commented Mar 31, 2026

Does this happen only for allocations we've made ourselves, or also for input buffers? I see that it pads our own allocations, but I don't see that it skips this for input buffers. I think Hexagon might play fast and loose when it comes to reading out of bounds on input buffers by up to a vector. I fear that in general this will fault if it crosses a page boundary, though.
