Skip to content

small prime FFT based on ulong #2107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 100 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
100 commits
Select commit Hold shift + click to select a range
0bf0127
add profile for powmod
vneiger Sep 16, 2024
c363adb
Merge branch 'main' into introduce_nmod_fft
vneiger Oct 19, 2024
4a887b4
add .h file
vneiger Oct 19, 2024
aee38b3
fix ifndef
vneiger Oct 19, 2024
17faaea
context and init code
vneiger Oct 19, 2024
e738256
add profile
vneiger Oct 19, 2024
dcaede7
fix include
vneiger Oct 19, 2024
cd50787
improve profile init
vneiger Oct 19, 2024
afa5ddc
rename ctx init
vneiger Oct 20, 2024
fd24de2
testing init
vneiger Oct 20, 2024
211ab75
fix explanations and complete test for init
vneiger Oct 20, 2024
6368823
remove printf
vneiger Oct 20, 2024
9eeedd6
forgot to add main
vneiger Oct 22, 2024
3fa7944
dft, test passes
vneiger Oct 23, 2024
ff33533
add profile
vneiger Oct 23, 2024
f4520c9
clean things a bit
vneiger Oct 24, 2024
e10c29c
introducing dft32 base case
vneiger Oct 24, 2024
7b605a6
dft32 base case
vneiger Oct 24, 2024
1f236d8
cleaning things
vneiger Oct 24, 2024
9bf18c7
testing from length 1
vneiger Oct 24, 2024
fb88c54
fix
vneiger Oct 24, 2024
f6cc96c
remove useless function argument
vneiger Oct 24, 2024
a675b68
vaguely faster with added lazy14 layer
vneiger Oct 24, 2024
28b3276
clean explanations
vneiger Oct 24, 2024
b71649d
finalize lazy14 version
vneiger Oct 24, 2024
8cd392c
small fixes
vneiger Oct 24, 2024
9fa9020
tentative fix for flint_bits == 32
vneiger Oct 24, 2024
ccd3f71
dft8 is now a macro, code generation was too unpredictable
vneiger Oct 24, 2024
f0587e5
putting more args slightly slows down for large lengths...
vneiger Oct 24, 2024
4cf7343
macro for dft16 helps, let's see for dft32
vneiger Oct 24, 2024
b96e0a5
dft32 macroified does help a bit
vneiger Oct 24, 2024
82a6c85
mod4 currently unused
vneiger Oct 24, 2024
0cae527
some notes / todos
vneiger Oct 25, 2024
4818198
test with prime close to announced limitw
vneiger Oct 25, 2024
c413b63
limit is 62 bits
vneiger Oct 25, 2024
174bb2d
fix assert and try struct for args... for circumventing strange behav…
vneiger Oct 25, 2024
c7f6dff
fft args and introducing iw
vneiger Oct 27, 2024
5b84260
dft 16 macro pointers
vneiger Oct 27, 2024
1fb2479
dft 16 macro pointers
vneiger Oct 27, 2024
4d8071f
dft8 simplify a bit
vneiger Oct 27, 2024
6e751ea
clean
vneiger Oct 27, 2024
192a575
dft16 simplify a bit
vneiger Oct 27, 2024
375a84d
dft16 4 4 now ok as well
vneiger Oct 27, 2024
c9b6575
dft32 expanded as well; this is ugly but will help for versions with …
vneiger Oct 27, 2024
c24e7b6
starting precomp of tab_inverse(w)
vneiger Oct 27, 2024
e0ace33
add tab_iw in fitdepth
vneiger Oct 27, 2024
b376f7e
unroll a bit is faster
vneiger Oct 27, 2024
0e0df84
notes about init
vneiger Oct 27, 2024
40539de
wip: use multipoint eval in test
vneiger Oct 27, 2024
8c8b08b
use multipoint eval in test
vneiger Oct 27, 2024
b1fb674
idft_t
vneiger Oct 27, 2024
dcf2ae7
idft_t, not tested yet
vneiger Oct 27, 2024
9d8845d
minor changes
vneiger Oct 27, 2024
a872720
progress
vneiger Oct 27, 2024
c29a5d1
add files
vneiger Oct 27, 2024
66523a7
idft test passes
vneiger Oct 27, 2024
453c068
idft test passes
vneiger Oct 27, 2024
82d5cbf
idft in progress
vneiger Oct 27, 2024
07a7dbe
fix name
vneiger Oct 27, 2024
f0fed07
idft in progress
vneiger Oct 28, 2024
f79ba6f
idft in progress
vneiger Oct 28, 2024
c41bcd4
idft in progress
vneiger Oct 28, 2024
f09bca0
idft in progress
vneiger Oct 28, 2024
38a5aba
idft in progress
vneiger Oct 28, 2024
ece93dc
minor fix
vneiger Oct 28, 2024
9ef50f1
a bit lazier
vneiger Oct 28, 2024
3b5669d
idft becoming good, remains some fine tuning to do
vneiger Oct 28, 2024
614dacb
a bit lazier
vneiger Oct 28, 2024
9567b15
fix name
vneiger Oct 28, 2024
42f6fe6
Merge branch 'flintlib:main' into introduce_nmod_fft
vneiger Nov 9, 2024
c438c43
minor comment
vneiger Nov 9, 2024
eaf6d5c
Merge branch 'introduce_nmod_fft' of github.com:vneiger/flint into in…
vneiger Nov 9, 2024
4bb083a
remove unnecessary includes
vneiger Nov 9, 2024
99284b5
add todo about tests idft any node small depths
vneiger Nov 9, 2024
b67916a
Merge branch 'flintlib:main' into introduce_nmod_fft
vneiger Jan 30, 2025
56dde27
Merge branch 'main' into introduce_nmod_fft
vneiger Mar 10, 2025
5010d9c
small typo in bound
vneiger Mar 16, 2025
d7f0c02
Merge branch 'main' into introduce_nmod_fft
vneiger Mar 31, 2025
fb03745
Merge branch 'main' into introduce_nmod_fft
vneiger Apr 11, 2025
c5926e1
this todo is done (precomputationwith inverse roots)
vneiger Apr 11, 2025
6a3dc5c
misc notes and comments
vneiger Apr 13, 2025
ae76510
add dft_lazy44 to timed functions; add notes about attempts
vneiger Apr 13, 2025
422d02e
rename dft_node0->dft and dft->dft_node ; reorganize todos
vneiger Apr 13, 2025
5f080b5
some enhancements in documentation
vneiger Apr 14, 2025
f096ea4
some more reorganization
vneiger Apr 15, 2025
a686fdd
review + doc + clean macros
vneiger Apr 16, 2025
4d6659f
clean dft.c
vneiger Apr 16, 2025
cdf9bd9
some investigations into nmod_poly using n_fft
vneiger Apr 18, 2025
869e1d0
Merge branch 'main' into introduce_nmod_fft
vneiger Apr 18, 2025
2aa0f72
new base case and some cleaning
vneiger May 2, 2025
d3d4e6e
reorganize and make sure return is [0..n)
vneiger May 2, 2025
a2b2a8f
remove unused include
vneiger May 2, 2025
4d481eb
idft32 and some more cleaning
vneiger May 2, 2025
6d893e6
documentation + clean test t-dft
vneiger May 5, 2025
2c66633
clean test t-idft
vneiger May 5, 2025
59f0e30
test dft_t goes fine
vneiger May 5, 2025
f610be4
reduce parameters a bit to make tests fast enough
vneiger May 5, 2025
3e651f5
test idft_t goes fine
vneiger May 5, 2025
3b906a3
clean profile p-dft
vneiger May 5, 2025
3743e5b
Merge branch 'main' into introduce_nmod_fft
vneiger May 5, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion Makefile.in
Original file line number Diff line number Diff line change
Expand Up @@ -193,7 +193,8 @@ HEADER_DIRS := \
fmpz_mod_mpoly_factor fmpq_mpoly_factor \
fq_nmod_mpoly_factor fq_zech_mpoly_factor \
\
fft @FFT_SMALL@ fmpz_poly_q fmpz_lll \
fft n_fft @FFT_SMALL@ \
fmpz_poly_q fmpz_lll \
n_poly arith qsieve aprcl \
\
nf nf_elem qfb \
Expand Down
236 changes: 236 additions & 0 deletions src/n_fft.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,236 @@
/*
Copyright (C) 2024 Vincent Neiger

This file is part of FLINT.

FLINT is free software: you can redistribute it and/or modify it under
the terms of the GNU Lesser General Public License (LGPL) as published
by the Free Software Foundation; either version 3 of the License, or
(at your option) any later version. See <https://www.gnu.org/licenses/>.
*/

#ifndef N_FFT_H
#define N_FFT_H

#include "flint.h"

#define N_FFT_CTX_DEFAULT_DEPTH 12

#ifdef __cplusplus
extern "C" {
#endif

/**
* TODO[short term] for perf vs simplifying code:
* - bench intermediate functions to make sure there is nothing surprising
* - check if the dft32 and idft32 macros actually help or not (remove them if not)
* - check if having these p_hi,p_lo tmp in macro args is useful or they can be removed
*
* TODO[long term] large depth can lead to heavy memory usage
* --> provide precomputation-free functions
* TODO[long term] on zen4 (likely on other cpus as well) ctx_init becomes
* slower at some point, losing a factor 4 or more, probably due to caching;
* what is annoying is that the depth where it becomes slower is significantly
* smaller (~13-14) when tab_iw has been incorporated compared to without
* tab_iw (it was depth ~20-21); see if this can be understood, and maybe play
* with vectorization for those simple functions
*/


/*-------------------------------------------------*/
/* STRUCTURES FOR FFT CONTEXT / FUNCTION ARGUMENTS */
/*-------------------------------------------------*/


/** n_fft context:
* - basic parameters
* - precomputed powers of the primitive root of unity and its inverse
* - precomputed inverses of 2**k
*
* Requirements (not checked upon init):
* - mod is an odd prime < 2**(FLINT_BITS-2)
* - max_depth must be >= 3 (so, 8 must divide mod - 1)
* Total memory cost of precomputations for arrays tab_{w,iw,w2,inv2}:
* at most 2 * (2*FLINT_BITS + 2**depth) ulong's
*
* For more details about the content of tab_{w,iw,w2,inv2}, see comments below
**/
typedef struct
{
ulong mod; // modulus, odd prime
ulong max_depth; // maximum supported depth (w has order 2**max_depth)
ulong cofactor; // prime is 1 + cofactor * 2**max_depth
ulong depth; // depth supported by current precomputation
nn_ptr tab_w; // precomputed powers of w
nn_ptr tab_iw; // precomputed powers of 1/w
ulong tab_w2[2*FLINT_BITS]; // precomputed powers w**(2**k)
ulong tab_inv2[2*FLINT_BITS]; // precomputed inverses of 2**k
} n_fft_ctx_struct;
typedef n_fft_ctx_struct n_fft_ctx_t[1];


/** n_fft arguments:
* - modulus mod
* - its double 2*mod (storing helps for speed)
* - precomputed powers of w
* To be used as an argument in FFT functions. In some parts, providing this
* instead of the whole context increased performance. Also, this facilitate
* using the same function with both tab_w and tab_iw (by forming an fft_args
* with Fargs->tab_w = F->tab_iw.
**/
typedef struct
{
ulong mod; // modulus, odd prime
ulong mod2; // 2*mod
nn_srcptr tab_w; // tabulated powers of w, see below
} n_fft_args_struct;
typedef n_fft_args_struct n_fft_args_t[1];


/** tab_w2:
* - length 2*FLINT_BITS, with undefined entries at index 2*(max_depth-1) and beyond
* - contains powers w**d for d a power of 2, and corresponding
* precomputations for modular multiplication:
* -- for 0 <= k < max_depth-1, tab_w2[2*k] = w**(2**(max_depth-2-k))
* and tab_w2[2*k+1] = floor(tab_w2[2*k] * 2**FLINT_BITS / mod)
* -- for 2*(max_depth-1) <= k < 2*FLINT_BITS, tab_w2[k] is undefined
*
* --> one can retrieve w as tab_w2[2 * (max_depth-2)]
* --> the first elements are tab_w2 = [I, I_pr, J, J_pr, ...]
* where I is a square root of -1 and J is a square root of I
*/

/** tab_w:
* - length 2**depth
* - contains 2**(depth-1) first powers of w in (max_depth-1)-bit reversed order,
* and corresponding precomputations for modular multiplication:
* -- for 0 <= k < 2**(depth-1), tab_w[2*k] = w**(br[k])
* and tab_w[2*k+1] = floor(tab_w[2*k] * 2**FLINT_BITS / mod)
* where br = [0, 2**(max_depth-2), 2**(max_depth-3), 3 * 2**(max_depth-3), ...]
* is the bit reversal permutation of length 2**(max_depth-1)
* (https://en.wikipedia.org/wiki/Bit-reversal_permutation)
*
* In particular the first elements are
* tab_w = [1, 1_pr, I, I_pr, J, J_pr, IJ, IJ_pr, ...]
* where I is a square root of -1, J is a square root of I, and IJ = I*J. Note
* that powers of w beyond 2**(max_depth-1), for example -1, -I, -J, etc. are
* not stored.
**/

/** tab_iw: same as tab_w but for the primitive root 1/w */

/** tab_inv2:
* - length 2*FLINT_BITS, with undefined entries at index 2*max_depth and beyond
* - contains the modular inverses of 2**k, and corresponding
* precomputations for modular multiplication:
* -- for 0 <= k < max_depth, tab_inv2[2*k] = the inverse of 2**(k+1)
* modulo mod, and tab_inv2[2*k+1] = floor(tab_inv2[2*k] * 2**FLINT_BITS / mod)
* -- for 2*max_depth <= k < 2*FLINT_BITS, tab_inv2[k] is undefined
*
* Recall F->mod == 1 + cofactor * 2**max_depth, so
* 1 == F->mod - cofactor * 2**(max_depth - k) * 2**k
* --> the inverse of 2**k in (0, F->mod) is
* F->mod - cofactor * 2**(max_depth - k),
* we do not really need to store it, but we want the precomputations as well
*/


/*------------------------------------------*/
/* PRECOMPUTATIONS / CONTEXT INITIALIZATION */
/*------------------------------------------*/

/** Note for init functions, when depth is provided:
* - if it is < 3, it is pretended that it is 3
* - it it is more than F->max_depth (the maximum possible with the given
* prime), it is reduced to F->max_depth
* After calling init, precomputations support DFTs of length up to 2**depth
*/

/* initialize with given root and given depth */
void n_fft_ctx_init2_root(n_fft_ctx_t F, ulong w, ulong max_depth, ulong cofactor, ulong depth, ulong mod);

/* find primitive root, initialize with given depth */
void n_fft_ctx_init2(n_fft_ctx_t F, ulong depth, ulong p);

/* same, with default depth */
FLINT_FORCE_INLINE
void n_fft_ctx_init_root(n_fft_ctx_t F, ulong w, ulong max_depth, ulong cofactor, ulong p)
{ n_fft_ctx_init2_root(F, w, max_depth, cofactor, N_FFT_CTX_DEFAULT_DEPTH, p); }

FLINT_FORCE_INLINE
void n_fft_ctx_init(n_fft_ctx_t F, ulong p)
{ n_fft_ctx_init2(F, N_FFT_CTX_DEFAULT_DEPTH, p); }

/* grows F->depth and precomputations to support DFTs of depth up to depth */
void n_fft_ctx_fit_depth(n_fft_ctx_t F, ulong depth);

void n_fft_ctx_clear(n_fft_ctx_t F);

FLINT_FORCE_INLINE
void n_fft_set_args(n_fft_args_t F, ulong mod, nn_srcptr tab_w)
{
F->mod = mod;
F->mod2 = 2*mod;
F->tab_w = tab_w;
}

/*-----------------------------*/
/* DFT / IDFT / DFT_t / IDFT_t */
/*-----------------------------*/

/** forward and inverse transforms, and their transposes:
* - length is a power of 2, len == 2**depth
* - requirement of all functions (not checked): depth <= F.depth
* - the comments below describe algorithms that modify the input array p in
* place: in these comments p stands for the input p, whereas q stands
* for the array p after running the algorithm
* - below in comments we write w[k] for 0 <= k < len/2, defined as
* w[2*k] == F->tab_w[2*k]
* w[2*k+1] == - F->tab_w[2*k]
* - hence the list w[k] for 0 <= k < len gives the len roots of the
* polynomial x**len - 1, which are all powers of the chosen len-th
* primitive root of unity, with exponents listed in bit reversed order
* - the matrix of DFT of length len is the len x len matrix
* DFT_{w,len} = [ w[i]**j ]_{0 <= i, j < len}
*/

/** dft: discrete Fourier transform (q = DFT_{w,len} * p)
* In-place transform p = [p[j] for 0 <= j < len], seen as a polynomial p(x) of
* degree < len, into its evaluations
* q == [p(w[i]) for 0 <= i < len],
* where p(w[i]) = sum(p[j] * w[i]**j for 0 <= j < len)
*/

/** idft: inverse discrete Fourier transform (q = DFT_{w,len}^{-1} * p)
* In-place transform p = [p[i] for 0 <= i < len] into the list of coefficients
* q = [q[j] for 0 <= j < len] of the unique polynomial q(x) of degree < len
* such that p[i] == q(w[i]) for 0 <= i < len
*/

/** dft_t: transposed discrete Fourier transform (q = p * DFT_{w,len})
* In-place transform p = [p[i] for 0 <= i < len] into the list of weighted
* power sums
* q == [PowerSum(p, w**j) for 0 <= j < len]
* where PowerSum(p, w**j) == sum(p[i] * w[i]**j for 0 <= i < len)
*/

/** idft_t: transposed inverse discrete Fourier transform (q = p * DFT_{w,len}^{-1})
* In-place transform p = [p[j] for 0 <= j < len] into the coefficients q =
* [q[i] for 0 <= i < len] which appear in the partial fraction decomposition
* p(x) = sum_{0 <= i < len} q[i] / (1 - w[i] * x) + O(x**len)
* where p(x) is the power series p(x) = sum_{0 <= j < len} p[j] x**j + O(x**len)
*/

void n_fft_dft(nn_ptr p, ulong depth, n_fft_ctx_t F);

void n_fft_idft(nn_ptr p, ulong depth, n_fft_ctx_t F);

void n_fft_dft_t(nn_ptr p, ulong depth, n_fft_ctx_t F);

void n_fft_idft_t(nn_ptr p, ulong depth, n_fft_ctx_t F);

#ifdef __cplusplus
}
#endif

#endif /* N_FFT_H */
Loading