cuda : implement bf16 cpy ops and enable bf16 cont #14763

CISC · 2025-07-18T20:55:22Z

Implemented missing BF16 CPY ops and enabled CONT op for BF16.

Tests before

  CONT(type=bf16,ne=[2,1,1,1]): not supported [CUDA0] 
  CONT(type=bf16,ne=[2,1,3,5]): not supported [CUDA0] 
  CONT(type=bf16,ne=[2,3,5,7]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): not supported [CUDA0] 
[...]
  CPY(type_src=f16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=f16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0] 
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): not supported [CUDA0] 
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): not supported [CUDA0]

Tests after

  CONT(type=bf16,ne=[2,1,1,1]): OK
  CONT(type=bf16,ne=[2,1,3,5]): OK
  CONT(type=bf16,ne=[2,3,5,7]): OK
[...]
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[1,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[2,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[3,2,3,4],permute_src=[0,3,1,2],permute_dst=[0,2,1,3]): OK
[...]
  CPY(type_src=f16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=f16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=bf16,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK
[...]
  CPY(type_src=bf16,type_dst=f32,ne=[256,4,4,4],permute_src=[0,0,0,0],permute_dst=[0,0,0,0]): OK
  CPY(type_src=bf16,type_dst=f32,ne=[256,2,3,4],permute_src=[0,2,1,3],permute_dst=[0,0,0,0]): OK

Also fixed a cut'n'paste error for F16->F16 in ggml_cuda_cpy_fn and deduplicated all copy functions.
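
For context on what the `permute_src` cases above exercise: a permuted tensor is a non-contiguous view, so CPY/CONT has to gather elements through strides instead of copying one flat block. The following is a minimal host-side C++ sketch of that idea, not the CUDA kernel from this PR. `view4d`, `contiguous`, `permute` and `make_contiguous` are hypothetical names, strides are counted in elements rather than bytes, and the axis convention (source dimension `i` becomes dimension `axis[i]`) is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 4-D view: ne[i] is the element count and stride[i] the element
// stride along dimension i (dimension 0 varies fastest when contiguous).
struct view4d {
    std::array<int64_t, 4> ne;
    std::array<int64_t, 4> stride;
};

// Strides of a densely packed tensor with the given shape.
inline view4d contiguous(const std::array<int64_t, 4> & ne) {
    view4d v;
    v.ne     = ne;
    v.stride = { 1, ne[0], ne[0]*ne[1], ne[0]*ne[1]*ne[2] };
    return v;
}

// Assumed convention: source dimension i becomes dimension axis[i] of the result.
// Only the shape/stride metadata changes; the data is not moved.
inline view4d permute(const view4d & v, const std::array<int, 4> & axis) {
    view4d r{};
    for (int i = 0; i < 4; ++i) {
        r.ne[axis[i]]     = v.ne[i];
        r.stride[axis[i]] = v.stride[i];
    }
    return r;
}

// Gather the elements of a (possibly non-contiguous) view into a dense buffer.
template <typename T>
std::vector<T> make_contiguous(const std::vector<T> & data, const view4d & v) {
    std::vector<T> out;
    out.reserve(static_cast<size_t>(v.ne[0]*v.ne[1]*v.ne[2]*v.ne[3]));
    for (int64_t i3 = 0; i3 < v.ne[3]; ++i3) {
        for (int64_t i2 = 0; i2 < v.ne[2]; ++i2) {
            for (int64_t i1 = 0; i1 < v.ne[1]; ++i1) {
                for (int64_t i0 = 0; i0 < v.ne[0]; ++i0) {
                    const int64_t idx = i0*v.stride[0] + i1*v.stride[1]
                                      + i2*v.stride[2] + i3*v.stride[3];
                    out.push_back(data[static_cast<size_t>(idx)]);
                }
            }
        }
    }
    return out;
}
```

With a 2x3 source holding `0..5`, swapping the first two dimensions and calling `make_contiguous` yields the transposed order `0, 2, 4, 1, 3, 5`, while the unpermuted view reproduces the input unchanged.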

JohannesGaessler

Generally speaking I am not a fan of how the float conversions are being done currently. I think the code could be deduplicated significantly by unconditionally casting half, nv_bfloat16, and float to float and then simply using that float value to set the destination. I would appreciate it if you were to do this in this PR, otherwise I'll keep it as one of the tasks to hand out when people ask me for a good first issue to work on.
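
A rough illustration of the suggestion above, written as plain host-side C++ so it can be compiled without a GPU. It is a sketch under stated assumptions, not the code that was merged: `bf16_t` is a hypothetical stand-in for `nv_bfloat16`, and `to_float`, `from_float`, `bf16_from_float` and `cpy_1` are made-up names. The point is only that once every source type widens to `float` and every destination type narrows from `float`, a single templated copy covers all (src, dst) pairs.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for nv_bfloat16: the upper 16 bits of an IEEE-754 float.
struct bf16_t { uint16_t bits; };

// Widen to float. For bf16 this places the 16 stored bits in the upper half.
inline float to_float(float v) { return v; }
inline float to_float(bf16_t v) {
    const uint32_t u = static_cast<uint32_t>(v.bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Narrow float to bf16 with round-to-nearest-even.
// Assumption: NaN inputs are not handled specially in this sketch.
inline bf16_t bf16_from_float(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);
    return bf16_t{ static_cast<uint16_t>(u >> 16) };
}

// Narrow from float to the destination type.
template <typename dst_t> dst_t from_float(float f);
template <> inline float  from_float<float>(float f)  { return f; }
template <> inline bf16_t from_float<bf16_t>(float f) { return bf16_from_float(f); }

// One generic element copy instead of one function per (src, dst) pair.
template <typename src_t, typename dst_t>
inline void cpy_1(const src_t * src, dst_t * dst) {
    *dst = from_float<dst_t>(to_float(*src));
}
```

In a CUDA version the two conversion helpers would presumably be replaced by the toolkit's own conversions for `half` and `nv_bfloat16`; same-type copies pay for a round trip through `float`, which is lossless for these types.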

ggml/src/ggml-cuda/cpy-utils.cuh

ggml/src/ggml-cuda/cpy.cu

ggml/src/ggml-cuda/ggml-cuda.cu

* origin/master: (49 commits) ci : correct label refactor->refactoring (ggml-org#14832) CUDA: fix quantized KV cache + multiple sequences (ggml-org#14822) tests : add non-cont K,V FA tests memory : handle saving/loading null layers in recurrent memory (ggml-org#14675) ggml: fix loongarch quantize_row_q8_1 error (ggml-org#14827) CANN: weight format to NZ for Ascend310P3 (ggml-org#14407) CUDA: add fused rms norm (ggml-org#14800) ggml : model card yaml tab->2xspace (ggml-org#14819) vulkan: fix rms_norm_mul to handle broadcasting dim0 (ggml-org#14817) llama : add model type detection for rwkv7 7B&14B (ggml-org#14816) imatrix: add option to display importance score statistics for a given imatrix file (ggml-org#12718) Mtmd: add a way to select device for vision encoder (ggml-org#14236) cuda : implement bf16 cpy ops and enable bf16 cont (ggml-org#14763) opencl: remove unreachable `return` (ggml-org#14806) server : allow setting `--reverse-prompt` arg (ggml-org#14799) cuda: remove linking to cublasLt (ggml-org#14790) opencl: fix `im2col` when `KW!=KH` (ggml-org#14803) opencl: add conv2d kernel (ggml-org#14403) sycl: Fix im2col (ggml-org#14797) kleidiai: add support for get_rows (ggml-org#14676) ...

* implement bf16 cpy ops and enable bf16 cont
* deduplicate copy functions
* deduplicate checks

implement bf16 cpy ops and enable bf16 cont

4162ffe

CISC requested a review from JohannesGaessler July 18, 2025 20:55

github-actions bot added the labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) Jul 18, 2025

JohannesGaessler reviewed Jul 21, 2025

ggml/src/ggml-cuda/cpy-utils.cuh (resolved)

deduplicate copy functions

1860cf9

CISC requested a review from JohannesGaessler July 21, 2025 14:49

JohannesGaessler reviewed Jul 21, 2025

ggml/src/ggml-cuda/cpy-utils.cuh (outdated, resolved)

ggml/src/ggml-cuda/cpy-utils.cuh (outdated, resolved)

ggml/src/ggml-cuda/cpy.cu (outdated, resolved)

further deduplication

9cbb916

CISC requested a review from JohannesGaessler July 21, 2025 16:02

JohannesGaessler reviewed Jul 21, 2025

ggml/src/ggml-cuda/ggml-cuda.cu (outdated, resolved)

CISC added 2 commits July 21, 2025 23:03

deduplicate checks

c039424

ws--

c058460

CISC requested a review from JohannesGaessler July 21, 2025 21:07

rename helper function

45148e6

JohannesGaessler approved these changes Jul 22, 2025

CISC merged commit e28c0b8 into master Jul 22, 2025
47 checks passed

CISC deleted the cisc/cuda-bf16-cpy-cont branch July 22, 2025 10:33

taronaeo pushed a commit to taronaeo/llama.cpp-s390x that referenced this pull request Jul 25, 2025

cuda : implement bf16 cpy ops and enable bf16 cont (ggml-org#14763)

4c94f27

* implement bf16 cpy ops and enable bf16 cont
* deduplicate copy functions
* deduplicate checks
