
Conv2D: Add CPU version #14320

Closed
wants to merge 4 commits into from

Conversation

am17an
Collaborator

@am17an am17an commented Jun 21, 2025

Adding as a draft because at the moment it isn't always faster than doing im2col, though in some cases it is. Looking to optimize this solution, as it's currently completely unoptimized, but it might be useful for #14316.

| Input Size | Kernel Config | IM2COL (ms) | SIMD (ms) | Speedup |
|---|---|---|---|---|
| 8x8x3 | 3x3x3→16 s1 p0 | 0.300 | 0.013 | 23.08x SIMD |
| 8x8x3 | 3x3x3→16 s1 p1 | 0.020 | 0.017 | 1.18x SIMD |
| 16x16x8 | 5x5x8→32 s2 p2 | 0.066 | 0.070 | 1.06x IM2COL |
| 32x32x64 | 1x1x64→128 s1 p0 | 0.930 | 6.485 | 6.97x IM2COL |
| 16x16x16 | 3x3x16→32 s1 p1 | 0.359 | 0.485 | 1.35x IM2COL |
| 64x64x3 | 3x3x3→32 s1 p1 | 1.387 | 2.757 | 1.99x IM2COL |
| 128x128x16 | 3x3x16→32 s1 p1 | 9.760 | 73.721 | 7.55x IM2COL |
| 128x128x32 | 3x3x32→64 s1 p1 | 20.337 | 187.484 | 9.22x IM2COL |
| 64x64x64 | 3x3x64→128 s1 p1 | 11.420 | 235.696 | 20.64x IM2COL |
| 224x224x3 | 3x3x3→32 s1 p1 | 14.899 | 25.178 | 1.69x IM2COL |
| 224x224x3 | 7x7x3→64 s2 p3 | 10.947 | 69.425 | 6.34x IM2COL |
| 512x512x3 | 3x3x3→16 s1 p1 | 46.892 | 53.811 | 1.15x IM2COL |
| 512x512x3 | 3x3x3→16 s2 p1 | 13.348 | 17.387 | 1.30x IM2COL |
| 56x56x64 | 1x1x64→128 s1 p0 | 2.848 | 5.834 | 2.05x IM2COL |
| 28x28x128 | 1x1x128→256 s1 p0 | 1.460 | 4.830 | 3.31x IM2COL |
| 14x14x256 | 1x1x256→512 s1 p0 | 0.897 | 5.093 | 5.68x IM2COL |
| 256x256x8 | 3x3x8→8 s1 p1 | 17.228 | 7.223 | 2.39x SIMD |
| 512x512x4 | 3x3x4→4 s1 p1 | 36.965 | 10.013 | 3.69x SIMD |

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jun 21, 2025
@etasnadi
Contributor

etasnadi commented Jun 22, 2025

Check memory usage too. Naive im2col can use tons of memory (though maybe not the CPU version?), so even if your code is slower it's worth adding such an in-place version, especially for training conv layers, where memory size counts a lot.

Isn't vec_dot_f16/f32 faster than omp for computing the inner products?

@Acly
Collaborator

Acly commented Jun 24, 2025

Another option is to use im2col+mm in a tiled fashion: in a loop, compute im2col into a temporary buffer for a fixed batch of output patches, then call mul_mat to compute part of the result. This has the advantage of using a fixed amount of memory (e.g. 16 MB) while still being able to piggy-back on all the investment going into gemm kernels (like LLAMAFILE).

I'm sure the direct approach can be faster in theory, but it might be a tall order to get there. Not that I want to discourage you — I could be totally wrong, and it's great to have options :)

I've been using the tiled method on convolution heavy models for a while, but only implemented it for contiguous-channels layout so far. Will try to add a regular version and do some comparisons.

@am17an
Collaborator Author

am17an commented Jun 24, 2025

Yes, I agree it's quite difficult to match im2col + gemm performance without heavy optimisations, which leads to code that is not really maintainable. The tiled approach is interesting; if you have a contiguous-channels implementation lying around, I can try to implement the regular version and get some numbers.

@Acly
Collaborator

Acly commented Jun 24, 2025

The implementation is currently here. It needs to make sure to allocate some scratch space in ggml_graph_plan (at most GGML_IM2COL_WORK_SIZE). Most of it is just im2col code; the existing code can probably be copied/adapted for the regular version.

Contiguous-channels has some advantages: a 1x1-kernel conv2d is a direct mul_mat, and the result doesn't need to be permuted.

@github-actions github-actions bot added the testing label (everything test related) Jun 26, 2025
@am17an am17an closed this Jun 26, 2025
@am17an am17an mentioned this pull request Jun 26, 2025