
Conversation

@LuFinch
Contributor

@LuFinch LuFinch commented Sep 11, 2025

This is a draft PR to enable the SYCL-TLA build in torch-xpu-ops so that we can test SYCL-TLA kernels' accuracy and performance in PyTorch once the SDPA/GEMM kernels are ready.

After discussion with Eikan, we decided to put the build logic in torch-xpu-ops while keeping the kernel source code in PyTorch in-tree. Please put your SYCL-TLA kernel source code in PyTorch and add its path to ATen_XPU_SYCLTLA_SRCS in torch-xpu-ops/src/ATen/CMakeLists.txt.
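
For illustration, a minimal sketch of how the kernel sources could be wired into src/ATen/CMakeLists.txt; the flash_attn path mirrors the CMakeLists.txt change shown later in this thread, while the exact way the glob result feeds ATen_XPU_SYCLTLA_SRCS is just a sketch, not the final code:

# Sketch: glob the SYCL-TLA kernel sources living in the PyTorch tree
# and hand them to the torch-xpu-ops build as ATen_XPU_SYCLTLA_SRCS.
file(GLOB xpu_sycltla "${TORCH_ROOT}/aten/src/ATen/native/transformers/xpu/flash_attn/sycltla/*.cpp")
list(APPEND ATen_XPU_SYCLTLA_SRCS ${xpu_sycltla})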

Since SYCL-TLA needs different compilation options than the normal SYCL kernels in torch-xpu-ops, I turned the logic in cmake/BuildFlags.cmake into a macro so that the common compilation options can be reused.
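
A rough sketch of the macro idea (the macro name, flags, and variables below are placeholders, not the actual BuildFlags.cmake content):

# Sketch: keep the shared SYCL compile options in one macro so both the regular
# SYCL kernels and the SYCL-TLA kernels start from the same list, then let the
# SYCL-TLA target append whatever extra options it needs.
macro(sycl_common_compile_options out_var)
  set(${out_var} -fsycl)  # placeholder for the real common option list
endmacro()

sycl_common_compile_options(SYCL_KERNEL_OPTIONS)
sycl_common_compile_options(SYCLTLA_KERNEL_OPTIONS)
list(APPEND SYCLTLA_KERNEL_OPTIONS -DSYCLTLA_PLACEHOLDER_OPTION)  # SYCL-TLA-specific extras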

Since there is no settled plan yet for how to import the sycl-tla repo, I git clone the main branch in CMake for debugging convenience. We can pin a commit after sycl-tla has its first release tag.
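
As a sketch of what that temporary clone could look like (using FetchContent and this repo URL are assumptions on my part):

# Sketch: fetch the sycl-tla main branch at configure time; once sycl-tla has a
# release tag, GIT_TAG can be pinned to that tag or a specific commit instead.
include(FetchContent)
FetchContent_Declare(
  sycl-tla
  GIT_REPOSITORY https://github.com/intel/sycl-tla.git
  GIT_TAG        main
)
FetchContent_MakeAvailable(sycl-tla)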

This depends on g++ being upgraded to GCC 13; otherwise the sycltla kernels won't build.

@LuFinch LuFinch force-pushed the lfq/cutlass branch 6 times, most recently from aa10375 to e9800af on October 20, 2025 07:51
@LuFinch LuFinch changed the title [Cutlass] Enable Cutlass with host compiler [SYCL-TLA] Enable SYCL-TLA build with host compiler Oct 20, 2025
@LuFinch LuFinch marked this pull request as ready for review October 21, 2025 02:33


@LuFinch LuFinch requested a review from EikanWang October 21, 2025 02:34
@LuFinch LuFinch changed the title [SYCL-TLA] Enable SYCL-TLA build with host compiler [SYCL-TLA] Enable SYCL-TLA build Oct 21, 2025
file(GLOB xpu_native_cpp "native/xpu/*.cpp" "native/sparse/*.cpp" "native/sparse/xpu/*.cpp" "native/nested/*.cpp" "native/nested/xpu/*.cpp" "native/transformers/*.cpp" "native/quantized/*.cpp")
file(GLOB xpu_native_cpp "native/xpu/*.cpp" "native/sparse/*.cpp" "native/sparse/xpu/*.cpp" "native/nested/*.cpp" "native/nested/xpu/*.cpp" "native/transformers/*.cpp" "native/quantized/*.cpp" ${TORCH_ROOT}/aten/src/ATen/native/transformers/xpu/flash_attn/*.cpp)
file(GLOB xpu_sycl "native/xpu/sycl/*.cpp" "native/sparse/xpu/sycl/*.cpp" "native/nested/xpu/sycl/*.cpp" "native/transformers/sycl/*.cpp" "native/quantized/sycl/*.cpp")
file(GLOB xpu_sycltla "${TORCH_ROOT}/aten/src/ATen/native/transformers/xpu/flash_attn/sycltla/*.cpp")
Contributor

I can't find the folder under ${TORCH_ROOT}/aten/src/ATen/native/transformers/xpu in https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/native/transformers. Is there any dependency?

Contributor Author

@LuFinch LuFinch Oct 21, 2025

The PR containing ${TORCH_ROOT}/aten/src/ATen/native/transformers/xpu depends on this build-enablement PR. I will create a new PR to add the sycl-tla flash attention kernel to PyTorch after this PR is merged.

Contributor

It’s a bit unusual that we are building some files from PyTorch and some from torch-xpu-ops into a single .so.
Could we decouple them? For example, we could keep the implementations in torch-xpu-ops and provide a header file for PyTorch. PyTorch would then only use the APIs exposed in the header file.

Contributor Author

@LuFinch LuFinch Oct 22, 2025

For example, we could keep the implementations in torch-xpu-ops and provide a header file for PyTorch. PyTorch would then only use the APIs exposed in the header file.

=> I previously implemented the sycltla SDPA that way: put the kernel in torch-xpu-ops, expose a header file, and call it from PyTorch. However, the code is hard to maintain after upstreaming. For example, we would need to prepare both a PyTorch PR and a torch-xpu-ops PR whenever we want to change the APIs after the first upstream. The torch-xpu-ops PR can't be built in CI because the PyTorch PR hasn't been merged yet, and the PyTorch PR can't be built in CI because the torch-xpu-ops PR hasn't been merged yet.

It is more reasonable to put all the code either in PyTorch only or in torch-xpu-ops only. Since SDPA_overrideable is already registered in PyTorch, I think putting the sycltla kernel in PyTorch in-tree is more convenient.

Contributor

It doesn’t seem reasonable that a file located in PyTorch is being built into a third-party library. This design doesn’t make much sense to me.

@LuFinch
Contributor Author

LuFinch commented Oct 22, 2025

@fengyuan14 @EikanWang Could you help review and give some comments?
