
Conv2D: Add CPU version #14320

Closed
wants to merge 4 commits into from

Conversation

am17an
Collaborator

@am17an am17an commented Jun 21, 2025

Adding as a draft because at the moment it isn't always faster than doing im2col, though in some cases it is. Looking to optimize this solution, as it's currently completely unoptimized, but it might be useful for #14316.

| Input Size | Kernel Config | IM2COL (ms) | SIMD (ms) | Speedup |
|---|---|---|---|---|
| 8x8x3 | 3x3x3→16 s1 p0 | 0.300 | 0.013 | 23.08x SIMD |
| 8x8x3 | 3x3x3→16 s1 p1 | 0.020 | 0.017 | 1.18x SIMD |
| 16x16x8 | 5x5x8→32 s2 p2 | 0.066 | 0.070 | 1.06x IM2COL |
| 32x32x64 | 1x1x64→128 s1 p0 | 0.930 | 6.485 | 6.97x IM2COL |
| 16x16x16 | 3x3x16→32 s1 p1 | 0.359 | 0.485 | 1.35x IM2COL |
| 64x64x3 | 3x3x3→32 s1 p1 | 1.387 | 2.757 | 1.99x IM2COL |
| 128x128x16 | 3x3x16→32 s1 p1 | 9.760 | 73.721 | 7.55x IM2COL |
| 128x128x32 | 3x3x32→64 s1 p1 | 20.337 | 187.484 | 9.22x IM2COL |
| 64x64x64 | 3x3x64→128 s1 p1 | 11.420 | 235.696 | 20.64x IM2COL |
| 224x224x3 | 3x3x3→32 s1 p1 | 14.899 | 25.178 | 1.69x IM2COL |
| 224x224x3 | 7x7x3→64 s2 p3 | 10.947 | 69.425 | 6.34x IM2COL |
| 512x512x3 | 3x3x3→16 s1 p1 | 46.892 | 53.811 | 1.15x IM2COL |
| 512x512x3 | 3x3x3→16 s2 p1 | 13.348 | 17.387 | 1.30x IM2COL |
| 56x56x64 | 1x1x64→128 s1 p0 | 2.848 | 5.834 | 2.05x IM2COL |
| 28x28x128 | 1x1x128→256 s1 p0 | 1.460 | 4.830 | 3.31x IM2COL |
| 14x14x256 | 1x1x256→512 s1 p0 | 0.897 | 5.093 | 5.68x IM2COL |
| 256x256x8 | 3x3x8→8 s1 p1 | 17.228 | 7.223 | 2.39x SIMD |
| 512x512x4 | 3x3x4→4 s1 p1 | 36.965 | 10.013 | 3.69x SIMD |

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jun 21, 2025
@etasnadi
Contributor

etasnadi commented Jun 22, 2025

Check memory usage too. Naive im2col can use tons of memory (though maybe not the CPU version?), so even if your code is slower it's worth adding such an in-place version, especially for training conv layers, where memory size counts a lot.

Isn't vec_dot_f16/f32 faster than omp for computing the inner products?

@Acly
Collaborator

Acly commented Jun 24, 2025

Another option is to use im2col+mm in a tiled fashion: in a loop, compute im2col into a temporary buffer for a fixed batch of output patches, then call mul_mat to compute part of the result. This has the advantage of using a fixed amount of memory (e.g. 16 MB) while still being able to piggy-back on all the investment going into gemm kernels (like LLAMAFILE).

I'm sure the direct approach can be faster in theory, but it might be a tall order to get there. Not that I want to discourage you — I could be totally wrong, and it's great to have options :)

I've been using the tiled method on convolution heavy models for a while, but only implemented it for contiguous-channels layout so far. Will try to add a regular version and do some comparisons.

@am17an
Collaborator Author

am17an commented Jun 24, 2025

Yes, I agree it's quite difficult to match im2col + gemm performance without heavy optimisations, which leads to code that is not really maintainable. The tiled approach is interesting; if you have a contiguous-channels implementation lying around, I can try to implement the regular version and get some numbers.

@Acly
Collaborator

Acly commented Jun 24, 2025

The implementation is currently here. It needs to make sure to allocate some scratch space in ggml_graph_plan (at most GGML_IM2COL_WORK_SIZE). Most of it is just im2col code; the existing code can probably be copied/adapted for the regular version.

Contiguous-channels has some advantages: a 1x1-kernel conv2d is a direct mul_mat, and the result doesn't need to be permuted.

@github-actions github-actions bot added the testing label (everything test related) Jun 26, 2025
@am17an am17an closed this Jun 26, 2025
@am17an am17an mentioned this pull request Jun 26, 2025