Readme and examples updated
AdhocMan committed Jan 7, 2021
1 parent a1959bd commit 220506b
Showing 7 changed files with 416 additions and 282 deletions.
54 changes: 38 additions & 16 deletions README.md
@@ -26,7 +26,12 @@ To allow for pre-allocation and reuse of memory, the design is based on two classes:
- **Grid**: Provides memory for transforms up to a given size.
- **Transform**: Created with information on sparse input data and is associated with a *Grid*. Maximum size is limited by *Grid* dimensions. Internal reference counting to *Grid* objects guarantees a valid state until *Transform* object destruction.

The user provides memory for storing sparse frequency domain data, while a *Transform* provides memory for space domain data. This implies that executing a *Transform* will overwrite the space domain data of all other *Transforms* associated with the same *Grid*.
A transform can be computed in-place or out-of-place. Additionally, an internally allocated work buffer can optionally be used for input / output of space domain data.

### New Features in v1.0
- Support for externally allocated memory for space domain data including in-place and out-of-place transforms
- Optional asynchronous computation when using GPUs
- Simplified / direct transform handle creation if no resource reuse through grid handles is required

## Documentation
Documentation can be found [here](https://spfft.readthedocs.io/en/latest/).
@@ -88,21 +93,21 @@ int main(int argc, char** argv) {
// Use default OpenMP value
const int numThreads = -1;

// Use all elements in this example.
const int numFrequencyElements = dimX * dimY * dimZ;

// Slice length in space domain. Equivalent to dimZ for non-distributed case.
const int localZLength = dimZ;

// Interleaved complex numbers
std::vector<double> frequencyElements;
frequencyElements.reserve(2 * numFrequencyElements);

// Indices of frequency elements
std::vector<int> indices;
indices.reserve(dimX * dimY * dimZ * 3);

// Initialize frequency domain values and indices
double initValue = 0.0;
for (int xIndex = 0; xIndex < dimX; ++xIndex) {
for (int yIndex = 0; yIndex < dimY; ++yIndex) {
@@ -126,31 +131,48 @@ int main(int argc, char** argv) {
std::cout << frequencyElements[2 * i] << ", " << frequencyElements[2 * i + 1] << std::endl;
}

// Create local Grid. For distributed computations, a MPI Communicator has to be provided
spfft::Grid grid(dimX, dimY, dimZ, dimX * dimY, SPFFT_PU_HOST, numThreads);

// Create transform.
// Note: A transform handle can be created without a grid if no resource sharing is desired.
spfft::Transform transform =
grid.create_transform(SPFFT_PU_HOST, SPFFT_TRANS_C2C, dimX, dimY, dimZ, localZLength,
numFrequencyElements, SPFFT_INDEX_TRIPLETS, indices.data());

///////////////////////////////////////////////////
// Option A: Reuse internal buffer for space domain
///////////////////////////////////////////////////

// Transform backward
transform.backward(frequencyElements.data(), SPFFT_PU_HOST);

// Get pointer to the buffer with space domain data. The pointer is guaranteed to be castable
// to a valid std::complex pointer. Using the internal work buffer as input / output can help
// reduce memory usage.
double* spaceDomainPtr = transform.space_domain_data(SPFFT_PU_HOST);

std::cout << std::endl << "After backward transform:" << std::endl;
for (int i = 0; i < transform.local_slice_size(); ++i) {
std::cout << spaceDomainPtr[2 * i] << ", " << spaceDomainPtr[2 * i + 1] << std::endl;
}
/////////////////////////////////////////////////
// Option B: Use external buffer for space domain
/////////////////////////////////////////////////

std::vector<double> spaceDomainVec(2 * transform.local_slice_size());

// Transform backward
transform.backward(frequencyElements.data(), spaceDomainVec.data());

// Transform forward
transform.forward(spaceDomainVec.data(), frequencyElements.data(), SPFFT_NO_SCALING);

// Note: In-place transforms are also supported by passing the same pointer for input and output.

std::cout << std::endl << "After forward transform (without normalization):" << std::endl;
for (int i = 0; i < numFrequencyElements; ++i) {
std::cout << frequencyElements[2 * i] << ", " << frequencyElements[2 * i + 1] << std::endl;
}
8 changes: 6 additions & 2 deletions docs/source/details.rst
@@ -13,7 +13,6 @@ Transform Definition
- :math:`\omega_{N}^{k,n} = e^{2\pi i \frac{k n}{N}}`: *Backward* transform from frequency domain to space domain



Complex Number Format
---------------------
SpFFT always assumes an interleaved format in double or single precision. The alignment of memory provided for space domain data is guaranteed to fulfill the requirements for std::complex (for C++11), C complex types and GPU complex types of CUDA or ROCm.
@@ -90,9 +89,14 @@ The execution of transforms is thread-safe if

GPU
---
| Saving transfer time between host and GPU is key to good performance for execution with GPUs. Ideally, both input and output is located on GPU memory. If host memory pointers are provided as input or output, it is beneficial to use pinned memory through the CUDA or ROCm API.
| If available, GPU-aware MPI can be utilized to save on the otherwise required transfers between host and GPU in preparation for the MPI exchange. This can greatly improve performance and is enabled by compiling the library with the CMake option SPFFT_GPU_DIRECT set to ON.
.. note:: Additional environment variables may have to be set for some MPI implementations to allow GPUDirect usage.
.. note:: The execution of a transform is synchronized with the default stream.

Multi-GPU
---------
Multi-GPU support is not available for individual transform operations, but each Grid / Transform can be associated with a different GPU. At creation time, the current GPU id is stored internally and used for all later operations. By using either the asynchronous execution mode or the multi-transform functionality, multiple GPUs can therefore be used at the same time.
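As a sketch of the association described above, assuming a CUDA-enabled build of SpFFT (the `Grid` constructor follows the README example; the two-device setup and function name are hypothetical):

```cpp
#include <spfft/spfft.hpp>
#include <cuda_runtime.h>

// Sketch: each Grid captures the current GPU id at construction time, so
// setting the device before creation binds that grid to the chosen GPU.
void createGridsOnTwoGpus(int dimX, int dimY, int dimZ, int numThreads) {
  cudaSetDevice(0);  // this grid is bound to GPU 0
  spfft::Grid gridOnGpu0(dimX, dimY, dimZ, dimX * dimY, SPFFT_PU_GPU, numThreads);

  cudaSetDevice(1);  // this grid is bound to GPU 1
  spfft::Grid gridOnGpu1(dimX, dimY, dimZ, dimX * dimY, SPFFT_PU_GPU, numThreads);

  // Transforms created from each grid execute on that grid's GPU; combined
  // with asynchronous execution or multi-transform calls, both GPUs can
  // work concurrently.
}
```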
