
Add ahead-of-time compilation support to cuda builder #186


Open

trigpolynom opened this issue Apr 2, 2025 · 8 comments

@trigpolynom
Contributor

To my understanding, cuda_builder only supports JIT compilation. Supporting AOT compilation would benefit user adoption and performance for larger kernels. I'm not sure whether anything beyond changes to cuda_builder would be necessary. Thanks @jorge-ortega for helping me flesh out this idea a little more.

@jorge-ortega
Collaborator

jorge-ortega commented Apr 3, 2025

This post from Nvidia provides more context. Today, CudaBuilder uses the nvvm backend to compile crates to PTX. The host then loads the PTX and JITs it through the driver API. Either the backend or CudaBuilder could pass the generated PTX to ptxas to create AOT-compiled cubins that the driver could load directly.
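A minimal sketch of what that extra step might look like in a build script, assuming cuda_builder's existing `CudaBuilder` API and that `ptxas` (shipped with the CUDA toolkit) is on PATH; the paths and the `sm_86` architecture are placeholders:

```rust
// build.rs — minimal sketch, assuming cuda_builder's existing API and
// that `ptxas` is on PATH. Paths and sm_86 are placeholders.
use std::process::Command;

fn main() {
    // Today's step: compile the GPU crate to PTX via the nvvm backend.
    cuda_builder::CudaBuilder::new("kernels")
        .copy_to("target/kernels.ptx")
        .build()
        .expect("failed to build PTX");

    // Proposed AOT step: ptxas compiles the PTX to a cubin so the
    // driver can load it directly, with no JIT at runtime.
    let status = Command::new("ptxas")
        .args(["-arch=sm_86", "-o", "target/kernels.cubin", "target/kernels.ptx"])
        .status()
        .expect("failed to run ptxas");
    assert!(status.success(), "ptxas failed");
}
```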

@LegNeato
Contributor

LegNeato commented Apr 4, 2025

I think it would be more idiomatic to treat this as a different target, or as a feature of the target like target-cpu=native. Thoughts?

@adamcavendish
Contributor

> I think it would be more idiomatic to treat this as a different target, or as a feature of the target like target-cpu=native. Thoughts?

I personally think this is just part of a pipeline: Rust → PTX → fatbin. Maybe we should support a more complete pipeline and let the user decide how far along it the builder should go?
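One hypothetical shape for that option (purely illustrative; none of this exists in cuda_builder today):

```rust
// Hypothetical sketch only — no such type exists in cuda_builder today.
// The idea: let the user pick how far down the pipeline the builder goes.
pub enum BuildStage {
    /// Rust -> PTX via the nvvm backend (today's behavior; driver JITs at load).
    Ptx,
    /// PTX -> cubin via ptxas, AOT-compiled for one architecture.
    Cubin { arch: String },
    /// Cubins (plus an optional PTX fallback) bundled into a fatbin, as nvcc does.
    Fatbin { archs: Vec<String>, embed_ptx: bool },
}
```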

@jorge-ortega
Collaborator

I agree that this is more of a build-pipeline option. Our current pipeline is disjointed, and users have to glue the PTX into their host binaries themselves (sketched below). We should aim to get fatbins embedded into the final host binary to match what nvcc does. That's different from what's being asked here, but this issue gets us a step towards it.
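For reference, the manual glue today looks roughly like this (a sketch assuming the `cust` crate; the PTX path and kernel name are placeholders):

```rust
// Sketch of the glue users write today, assuming the `cust` crate and
// a PTX file previously emitted by cuda_builder at this placeholder path.
use cust::module::Module;

// The PTX text gets embedded into the host binary by hand...
static PTX: &str = include_str!("../target/kernels.ptx");

fn main() -> Result<(), cust::error::CudaError> {
    let _ctx = cust::quick_init()?;
    // ...and the driver JIT-compiles it here, at module-load time.
    let module = Module::from_ptx(PTX, &[])?;
    let _kernel = module.get_function("my_kernel")?; // placeholder kernel name
    Ok(())
}
```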

@LegNeato
Contributor

LegNeato commented Apr 4, 2025

Sure, but I'm thinking about future integration in rustc... I actually think this maps pretty closely to crate_type!
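To illustrate the analogy (entirely hypothetical; no such crate types exist today):

```rust
// Hypothetical sketch — none of these crate types exist. The analogy:
// just as a host crate picks "rlib" vs "cdylib", a GPU crate could
// declare what artifact it ships.
#![crate_type = "ptx"]   // hypothetical: ship PTX, driver JITs at load (today's model)
// #![crate_type = "cubin"]  // hypothetical: AOT-compile for specific architectures
// #![crate_type = "fatbin"] // hypothetical: bundle cubins plus a PTX fallback
```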

@LegNeato
Contributor

LegNeato commented Apr 8, 2025

Possibly useful techniques: https://github.com/calebzulawski/multiversion

@adamcavendish
Contributor

> Possibly useful techniques: https://github.com/calebzulawski/multiversion

For large language model optimizations, many kernels are written specifically for one NVIDIA card, with host code selecting among them based on the user's input and the card in use. multiversion is probably useful but not flexible enough on its own. In any case, people can fall back to an explicit match to take back control and flexibility.
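That explicit match could look something like this (a sketch assuming cust's device-attribute API; the embedded cubin paths are placeholders):

```rust
// Sketch of the "explicit match" dispatch, assuming the `cust` crate
// and pre-built cubins embedded at these placeholder paths.
use cust::device::{Device, DeviceAttribute};

static SM_80: &[u8] = include_bytes!("../target/kernels_sm80.cubin");
static SM_86: &[u8] = include_bytes!("../target/kernels_sm86.cubin");
static FALLBACK_PTX: &str = include_str!("../target/kernels.ptx");

fn pick_image(device: &Device) -> Result<&'static [u8], cust::error::CudaError> {
    let major = device.get_attribute(DeviceAttribute::ComputeCapabilityMajor)?;
    let minor = device.get_attribute(DeviceAttribute::ComputeCapabilityMinor)?;
    Ok(match (major, minor) {
        (8, 0) => SM_80,
        (8, 6) => SM_86,
        // Default case: hand back PTX and let the driver JIT it.
        _ => FALLBACK_PTX.as_bytes(),
    })
}
```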

@LegNeato
Contributor

LegNeato commented Apr 9, 2025

Right, that is why I said "techniques" rather than saying it is useful on its own 😁.

I think it is most idiomatic to use crate_type for JIT vs AOT and target features for device-specific features (including target-gpu=native), similar to what rustc uses for CPUs (target-cpu) and what rust-gpu uses for capabilities and extensions (https://github.com/Rust-GPU/rust-gpu/blob/698f10ac14b7c952394ac5620004e4e973308902/crates/spirv-std/src/arch.rs#L151).
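In kernel code that might look like the following (hypothetical; the nvvm backend does not expose such target features today, and `sm_90` as a feature name is made up for illustration):

```rust
// Hypothetical sketch — mirrors how rust-gpu surfaces SPIR-V
// capabilities as target features; nothing here exists for CUDA today.
#[cfg(target_feature = "sm_90")]
fn reduce(x: f32) -> f32 {
    // Would use an sm_90-only instruction here.
    x * 2.0
}

#[cfg(not(target_feature = "sm_90"))]
fn reduce(x: f32) -> f32 {
    // Portable fallback for older architectures.
    x + x
}
```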
