Skip to content

Conversation

jeemzz147
Copy link

Make sure to read the contributing guidelines before submitting a PR

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jul 31, 2025
@jeemzz147
Copy link
Author

Part of #14909

@jeemzz147 jeemzz147 marked this pull request as ready for review July 31, 2025 08:35
@am17an am17an requested a review from JohannesGaessler July 31, 2025 08:36
@jeemzz147
Copy link
Author

Hi, @JohannesGaessler Could you please review the changes when you have a chance? Thank you in advance!

Copy link
Collaborator

@JohannesGaessler JohannesGaessler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my current goals is to consolidate and deduplicate the code around copying data in the CUDA backend. As such I think rather than adding new kernels here it would be better to re-use the existing code. If the operation is not inlace you can use cudaMemsetAsync to set dst with the contents of src0. Afterwards you can use ggml_cpy_flt_cuda in cpy.cu to do the copy. That kernel does not have an argument for the offset but it's not needed as you can simply apply the offset in host code.

@jeemzz147
Copy link
Author

jeemzz147 commented Aug 1, 2025

@JohannesGaessler

I have already used cudaMemcpyAsync to copy the contents from src0 to dst.

  • Regarding the implementation of ggml_cpy_flt_cuda, the functionality of the set kernel and ggml_cpy_flt_cuda may be inconsistent.

    • ggml_cpy_flt_cuda

    • set_f32_cuda

    • The element size of src1 may be less than in dst.

    • The set kernel doesn’t rely on any implicit shape in dst itself;

    • The dst of the set may be a slice, and when writing from src1 to dst, so it may not be aware of dst->ne. Instead, dst->nb1, nb2, and nb3 are provided by the parameters.

    • Could it be that the data of src1 within dst are not contiguous?

If I’m wrong, please correct me. Thanks for your help!

@jeemzz147
Copy link
Author

Hi @am17an, thanks again for your previous review. Since @JohannesGaessler hasn’t had a chance to respond for a few weeks, would it be possible to ask another maintainer or contributor to review this as well? I’d really appreciate any additional feedback to help move this forward.

@JohannesGaessler
Copy link
Collaborator

Sorry, I forgot about this PR. The code in cpy.cuh can do the same as the one in this PR (and more), you just need to call it with the right parameters.

The element size of src1 may be less than in dst.

Set both types to float.

The set kernel doesn’t rely on any implicit shape in dst itself;

Use the same shape twice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants