
Add ahead-of-time compilation support to cuda builder #186


Open

trigpolynom opened this issue Apr 2, 2025 · 8 comments

@trigpolynom
Contributor

To my understanding, cuda_builder only supports JIT compilation. Supporting AOT compilation would benefit user adoption and performance for larger kernels. I'm not sure whether anything beyond changes to cuda_builder would be necessary. Thanks @jorge-ortega for helping me flesh out this idea a little more.

@jorge-ortega
Collaborator

jorge-ortega commented Apr 3, 2025

This post from Nvidia provides more context. Today, CudaBuilder uses the nvvm backend to compile crates to PTX. The host then loads the PTX and JITs it through the driver API. Either the backend or CudaBuilder could pass the generated PTX to ptxas to create AOT-compiled cubins that the driver could load directly.
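A minimal sketch of what that extra step might look like in a build script, assuming cuda_builder's existing `CudaBuilder` API and that `ptxas` (shipped with the CUDA toolkit) is on PATH; the paths and the `sm_86` architecture are placeholders:

```rust
// build.rs — minimal sketch, assuming cuda_builder's existing API and
// that `ptxas` is on PATH. Paths and sm_86 are placeholders.
use std::process::Command;

fn main() {
    // Today's step: compile the GPU crate to PTX via the nvvm backend.
    cuda_builder::CudaBuilder::new("kernels")
        .copy_to("target/kernels.ptx")
        .build()
        .expect("failed to build PTX");

    // Proposed AOT step: ptxas compiles the PTX to a cubin so the
    // driver can load it directly, with no JIT at runtime.
    let status = Command::new("ptxas")
        .args(["-arch=sm_86", "-o", "target/kernels.cubin", "target/kernels.ptx"])
        .status()
        .expect("failed to run ptxas");
    assert!(status.success(), "ptxas failed");
}
```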

@LegNeato
Contributor

LegNeato commented Apr 4, 2025

I think it would be more idiomatic to treat this as a different target, or as a feature of the target like target-cpu=native. Thoughts?

@adamcavendish
Contributor

> I think it would be more idiomatic to treat this as a different target, or as a feature of the target like target-cpu=native. Thoughts?

I personally think this is just part of a pipeline: Rust → PTX → fatbin. Maybe we should support a more complete pipeline and let the user decide how far along it the builder should go?
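One hypothetical shape for that option (purely illustrative; none of this exists in cuda_builder today):

```rust
// Hypothetical sketch only — no such type exists in cuda_builder today.
// The idea: let the user pick how far down the pipeline the builder goes.
pub enum BuildStage {
    /// Rust -> PTX via the nvvm backend (today's behavior; driver JITs at load).
    Ptx,
    /// PTX -> cubin via ptxas, AOT-compiled for one architecture.
    Cubin { arch: String },
    /// Cubins (plus an optional PTX fallback) bundled into a fatbin, as nvcc does.
    Fatbin { archs: Vec<String>, embed_ptx: bool },
}
```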

@jorge-ortega
Collaborator

I agree that this is more of a build-pipeline option. Our current pipeline is disjointed, and users have to glue the PTX into their host binaries themselves (sketched below). We should aim to get fatbins embedded into the final host binary to match what nvcc does. That's different from what's being asked here, but this issue gets us a step towards it.
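For reference, the manual glue today looks roughly like this (a sketch assuming the `cust` crate; the PTX path and kernel name are placeholders):

```rust
// Sketch of the glue users write today, assuming the `cust` crate and
// a PTX file previously emitted by cuda_builder at this placeholder path.
use cust::module::Module;

// The PTX text gets embedded into the host binary by hand...
static PTX: &str = include_str!("../target/kernels.ptx");

fn main() -> Result<(), cust::error::CudaError> {
    let _ctx = cust::quick_init()?;
    // ...and the driver JIT-compiles it here, at module-load time.
    let module = Module::from_ptx(PTX, &[])?;
    let _kernel = module.get_function("my_kernel")?; // placeholder kernel name
    Ok(())
}
```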

@LegNeato
Contributor

LegNeato commented Apr 4, 2025

Sure, but I'm thinking about future integration in rustc... I actually think this maps pretty closely to crate_type!
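To illustrate the analogy (entirely hypothetical; no such crate types exist today):

```rust
// Hypothetical sketch — none of these crate types exist. The analogy:
// just as a host crate picks "rlib" vs "cdylib", a GPU crate could
// declare what artifact it ships.
#![crate_type = "ptx"]   // hypothetical: ship PTX, driver JITs at load (today's model)
// #![crate_type = "cubin"]  // hypothetical: AOT-compile for specific architectures
// #![crate_type = "fatbin"] // hypothetical: bundle cubins plus a PTX fallback
```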

@LegNeato
Contributor

LegNeato commented Apr 8, 2025

Possibly useful techniques: https://github.com/calebzulawski/multiversion

@adamcavendish
Contributor

> Possibly useful techniques: https://github.com/calebzulawski/multiversion

For large language model optimizations, many kernels are written specifically for one NVIDIA card, with host code selecting among them based on the user's input and the card in use. multiversion is probably useful but not flexible enough on its own. In any case, people can fall back to an explicit match to take back control and flexibility.
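That explicit match could look something like this (a sketch assuming cust's device-attribute API; the embedded cubin paths are placeholders):

```rust
// Sketch of the "explicit match" dispatch, assuming the `cust` crate
// and pre-built cubins embedded at these placeholder paths.
use cust::device::{Device, DeviceAttribute};

static SM_80: &[u8] = include_bytes!("../target/kernels_sm80.cubin");
static SM_86: &[u8] = include_bytes!("../target/kernels_sm86.cubin");
static FALLBACK_PTX: &str = include_str!("../target/kernels.ptx");

fn pick_image(device: &Device) -> Result<&'static [u8], cust::error::CudaError> {
    let major = device.get_attribute(DeviceAttribute::ComputeCapabilityMajor)?;
    let minor = device.get_attribute(DeviceAttribute::ComputeCapabilityMinor)?;
    Ok(match (major, minor) {
        (8, 0) => SM_80,
        (8, 6) => SM_86,
        // Default case: hand back PTX and let the driver JIT it.
        _ => FALLBACK_PTX.as_bytes(),
    })
}
```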

@LegNeato
Contributor

LegNeato commented Apr 9, 2025

Right, that is why I said "techniques" rather than saying it is useful on its own 😁.

I think it is most idiomatic to use crate_type for JIT vs AOT and target features for device-specific features (including target-gpu=native), similar to what rustc uses for CPUs (target-cpu) and what rust-gpu uses for capabilities and extensions (https://github.com/Rust-GPU/rust-gpu/blob/698f10ac14b7c952394ac5620004e4e973308902/crates/spirv-std/src/arch.rs#L151).
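In kernel code that might look like the following (hypothetical; the nvvm backend does not expose such target features today, and `sm_90` as a feature name is made up for illustration):

```rust
// Hypothetical sketch — mirrors how rust-gpu surfaces SPIR-V
// capabilities as target features; nothing here exists for CUDA today.
#[cfg(target_feature = "sm_90")]
fn reduce(x: f32) -> f32 {
    // Would use an sm_90-only instruction here.
    x * 2.0
}

#[cfg(not(target_feature = "sm_90"))]
fn reduce(x: f32) -> f32 {
    // Portable fallback for older architectures.
    x + x
}
```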
