Readme and examples updated
AdhocMan committed Jan 7, 2021
1 parent a1959bd commit 220506b
Showing 7 changed files with 416 additions and 282 deletions.
54 changes: 38 additions & 16 deletions README.md
@@ -26,7 +26,12 @@ To allow for pre-allocation and reuse of memory, the design is based on two classes:
- **Grid**: Provides memory for transforms up to a given size.
- **Transform**: Created with information on sparse input data and is associated with a *Grid*. Maximum size is limited by *Grid* dimensions. Internal reference counting to *Grid* objects guarantees a valid state until *Transform* object destruction.

The user provides memory for storing sparse frequency domain data, while a *Transform* provides memory for space domain data. This implies that executing a *Transform* will overwrite the space domain data of all other *Transforms* associated with the same *Grid*.
A transform can be computed in-place or out-of-place. Additionally, an internally allocated work buffer can optionally be used for input / output of space domain data.

### New Features in v1.0
- Support for externally allocated memory for space domain data including in-place and out-of-place transforms
- Optional asynchronous computation when using GPUs
- Simplified / direct transform handle creation if no resource reuse through grid handles is required

## Documentation
Documentation can be found [here](https://spfft.readthedocs.io/en/latest/).
@@ -88,21 +93,21 @@ int main(int argc, char** argv) {
// Use default OpenMP value
const int numThreads = -1;

// Use all elements in this example.
const int numFrequencyElements = dimX * dimY * dimZ;

// Slice length in space domain. Equivalent to dimZ for non-distributed case.
const int localZLength = dimZ;

// Interleaved complex numbers
std::vector<double> frequencyElements;
frequencyElements.reserve(2 * numFrequencyElements);

// Indices of frequency elements
std::vector<int> indices;
indices.reserve(dimX * dimY * dimZ * 3);

// Initialize frequency domain values and indices
double initValue = 0.0;
for (int xIndex = 0; xIndex < dimX; ++xIndex) {
for (int yIndex = 0; yIndex < dimY; ++yIndex) {
@@ -126,31 +131,48 @@ int main(int argc, char** argv) {
std::cout << frequencyElements[2 * i] << ", " << frequencyElements[2 * i + 1] << std::endl;
}

// Create local Grid. For distributed computations, a MPI Communicator has to be provided
spfft::Grid grid(dimX, dimY, dimZ, dimX * dimY, SPFFT_PU_HOST, numThreads);

// Create transform.
// Note: A transform handle can be created without a grid if no resource sharing is desired.
spfft::Transform transform =
grid.create_transform(SPFFT_PU_HOST, SPFFT_TRANS_C2C, dimX, dimY, dimZ, localZLength,
numFrequencyElements, SPFFT_INDEX_TRIPLETS, indices.data());

///////////////////////////////////////////////////
// Option A: Reuse internal buffer for space domain
///////////////////////////////////////////////////

// Transform backward
transform.backward(frequencyElements.data(), SPFFT_PU_HOST);

// Get pointer to the buffer with space domain data. The pointer is guaranteed to be castable
// to a valid std::complex pointer. Using the internal work buffer as input / output can help
// reduce memory usage.
double* spaceDomainPtr = transform.space_domain_data(SPFFT_PU_HOST);

std::cout << std::endl << "After backward transform:" << std::endl;
for (int i = 0; i < transform.local_slice_size(); ++i) {
std::cout << spaceDomainPtr[2 * i] << ", " << spaceDomainPtr[2 * i + 1] << std::endl;
}
/////////////////////////////////////////////////
// Option B: Use external buffer for space domain
/////////////////////////////////////////////////

std::vector<double> spaceDomainVec(2 * transform.local_slice_size());

// Transform backward
transform.backward(frequencyElements.data(), spaceDomainVec.data());

// Transform forward
transform.forward(spaceDomainVec.data(), frequencyElements.data(), SPFFT_NO_SCALING);

// Note: In-place transforms are also supported by passing the same pointer for input and output.

std::cout << std::endl << "After forward transform (without normalization):" << std::endl;
for (int i = 0; i < numFrequencyElements; ++i) {
std::cout << frequencyElements[2 * i] << ", " << frequencyElements[2 * i + 1] << std::endl;
}
8 changes: 6 additions & 2 deletions docs/source/details.rst
@@ -13,7 +13,6 @@ Transform Definition
- :math:`\omega_{N}^{k,n} = e^{2\pi i \frac{k n}{N}}`: *Backward* transform from frequency domain to space domain



Complex Number Format
---------------------
SpFFT always assumes an interleaved format in double or single precision. The alignment of memory provided for space domain data is guaranteed to fulfill the requirements for std::complex (for C++11), C complex types and GPU complex types of CUDA or ROCm.
@@ -90,9 +89,14 @@ The execution of transforms is thread-safe if

GPU
---
| Saving transfer time between host and GPU is key to good performance for execution with GPUs. Ideally, both input and output is located on GPU memory. If host memory pointers are provided as input or output, it is beneficial to use pinned memory through the CUDA or ROCm API.
| If available, GPU-aware MPI can be utilized to save on the otherwise required transfers between host and GPU in preparation for the MPI exchange. This can greatly improve performance and is enabled by compiling the library with the CMake option SPFFT_GPU_DIRECT set to ON.
.. note:: Additional environment variables may have to be set for some MPI implementations to allow GPUDirect usage.
.. note:: The execution of a transform is synchronized with the default stream.

Multi-GPU
---------
Multi-GPU support is not available for individual transform operations, but each Grid / Transform can be associated with a different GPU. At creation time, the current GPU id is stored internally and used for all later operations. By using either the asynchronous execution mode or the multi-transform functionality, multiple GPUs can therefore be used at the same time.
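As a sketch of the association described above, assuming a CUDA-enabled build of SpFFT (the `Grid` constructor follows the README example; the two-device setup and function name are hypothetical):

```cpp
#include <spfft/spfft.hpp>
#include <cuda_runtime.h>

// Sketch: each Grid captures the current GPU id at construction time, so
// setting the device before creation binds that grid to the chosen GPU.
void createGridsOnTwoGpus(int dimX, int dimY, int dimZ, int numThreads) {
  cudaSetDevice(0);  // this grid is bound to GPU 0
  spfft::Grid gridOnGpu0(dimX, dimY, dimZ, dimX * dimY, SPFFT_PU_GPU, numThreads);

  cudaSetDevice(1);  // this grid is bound to GPU 1
  spfft::Grid gridOnGpu1(dimX, dimY, dimZ, dimX * dimY, SPFFT_PU_GPU, numThreads);

  // Transforms created from each grid execute on that grid's GPU; combined
  // with asynchronous execution or multi-transform calls, both GPUs can
  // work concurrently.
}
```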
