Docs preview for PR #2660.

cuda-quantum-bot · cuda-quantum-bot · commit 51018ba1ace4 · 2025-02-26T02:00:12.000Z
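
Reviewer note: the updated description of ``CUDAQ_FUSION_MAX_QUBITS`` and ``CUDAQ_MGPU_FUSE`` in this diff amounts to a small lookup table keyed by compute capability and precision. The sketch below restates that table in Python for a quick sanity check of the new wording. The function and dictionary names are hypothetical and are not part of CUDA-Q. It also assumes that only the exact CC values named in the text (8.0, 9.0, 10.0) get special defaults. Whether minor revisions such as 8.6 share the 8.0 default is not stated in the diff.

```python
# Illustrative lookup only: these names are hypothetical and not a CUDA-Q API.
# The values are copied from the updated docs text in this diff.
DEFAULT_FUSION_QUBITS = {
    "fp32": {"8.0": 4, "9.0": 5, "10.0": 5},
    "fp64": {"8.0": 5, "9.0": 6, "10.0": 4},
}
FALLBACK_FUSION_QUBITS = 4  # "all other CC", both precision modes


def default_fusion_max_qubits(compute_capability: str, precision: str) -> int:
    """Return the documented default fusion size for a CC string such as "9.0"."""
    per_precision = DEFAULT_FUSION_QUBITS[precision.lower()]
    # Assumption: any CC not listed exactly falls back to the generic default.
    return per_precision.get(compute_capability, FALLBACK_FUSION_QUBITS)


print(default_fusion_max_qubits("9.0", "FP64"))
print(default_fusion_max_qubits("7.5", "FP32"))
```

Setting either environment variable explicitly overrides these defaults, so the table only matters when the variable is left unset.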
diff --git a/pr-2660/_sources/using/backends/sims/svsims.rst.txt b/pr-2660/_sources/using/backends/sims/svsims.rst.txt
@@ -107,7 +107,7 @@ setting the target. It is worth drawing attention to gate fusion, a powerful too
     - Description
   * - ``CUDAQ_FUSION_MAX_QUBITS``
     - positive integer
-    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC). Specifically, the default is 5 for CC 8, 6 for CC 9, and 4 for CC 10. All others will use a default value of `4`.
+    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are `4`, `5`, and `5` for `FP32`. For `FP64` the corresponding defaults are `5`, `6`, and `4`. For all other CC, the default is `4` for both precision modes.
   * - ``CUDAQ_FUSION_DIAGONAL_GATE_MAX_QUBITS``
     - integer greater than or equal to -1
     - The max number of qubits used for diagonal gate fusion. The default value is `-1`, in which case the fusion size is adjusted automatically for better performance. If set to 0, gate fusion for diagonal gates is disabled.
@@ -232,7 +232,7 @@ prior to setting the target.
     - The qubit count threshold where state vector distribution is activated. Below this threshold, simulation is performed as independent (non-distributed) tasks across all MPI processes for optimal performance. Default is 25. 
   * - ``CUDAQ_MGPU_FUSE``
     - positive integer
-    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC). Specifically, the default is 5 for CC 8, 6 for CC 9, and 4 for CC 10. All others will use a default value of `6` if there are more than one MPI processes or `4` otherwise.
+    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are `4`, `5`, and `5` for `FP32`. For `FP64` the corresponding defaults are `5`, `6`, and `4`. For all other CC, the default is `4` for both precision modes.
   * - ``CUDAQ_MGPU_P2P_DEVICE_BITS``
     - positive integer
     - Specify the number of GPUs that can communicate by using GPUDirect P2P. Default value is 0 (P2P communication is disabled).
diff --git a/pr-2660/applications/python/deutschs_algorithm.html b/pr-2660/applications/python/deutschs_algorithm.html
@@ -816,7 +816,7 @@ <h2>XOR <span class="math notranslate nohighlight">\(\oplus\)</span><a class="he
 </section>
 <section id="Quantum-oracles">
 <h2>Quantum oracles<a class="headerlink" href="#Quantum-oracles" title="Permalink to this heading">¶</a></h2>
-<p><img alt="c269f65917f043258df3cd656b7c3ef6" class="no-scaled-link" src="../../_images/oracle.png" style="width: 300px; height: 150px;" /></p>
+<p><img alt="cd459d1f95934fdea70a8623142b1b46" class="no-scaled-link" src="../../_images/oracle.png" style="width: 300px; height: 150px;" /></p>
 <p>Suppose we have <span class="math notranslate nohighlight">\(f(x): \{0,1\} \longrightarrow \{0,1\}\)</span>. We can compute this function on a quantum computer using oracles, which we treat as black-box functions that produce the output through an appropriate sequence of logic gates.</p>
 <p>Above you see an oracle represented as <span class="math notranslate nohighlight">\(U_f\)</span> which allows us to transform the state <span class="math notranslate nohighlight">\(\ket{x}\ket{y}\)</span> into:</p>
 <div class="math notranslate nohighlight">
@@ -864,7 +864,7 @@ <h2>Quantum parallelism<a class="headerlink" href="#Quantum-parallelism" title="
 <h2>Deutsch’s Algorithm:<a class="headerlink" href="#Deutsch's-Algorithm:" title="Permalink to this heading">¶</a></h2>
 <p>Our aim is to determine whether <span class="math notranslate nohighlight">\(f: \{0,1\} \longrightarrow \{0,1\}\)</span> is a constant or a balanced function. If it is constant, <span class="math notranslate nohighlight">\(f(0) = f(1)\)</span>; if it is balanced, <span class="math notranslate nohighlight">\(f(0) \neq f(1)\)</span>.</p>
 <p>We step through the circuit diagram below and follow the math after the application of each gate.</p>
-<p><img alt="706e0cff78994efa904661fd25febe5c" class="no-scaled-link" src="../../_images/deutsch.png" style="width: 500px; height: 210px;" /></p>
+<p><img alt="1399566aaf23404e84b6657cedc13fa3" class="no-scaled-link" src="../../_images/deutsch.png" style="width: 500px; height: 210px;" /></p>
 <div class="math notranslate nohighlight">
 \[\ket{\psi_0}  =  \ket{01}
 \tag{1}\]</div>
diff --git a/pr-2660/examples/python/performance_optimizations.html b/pr-2660/examples/python/performance_optimizations.html
@@ -744,9 +744,9 @@ <h1>Optimizing Performance<a class="headerlink" href="#Optimizing-Performance" t
 <section id="Gate-Fusion">
 <h2>Gate Fusion<a class="headerlink" href="#Gate-Fusion" title="Permalink to this heading">¶</a></h2>
 <p>Gate fusion is an optimization technique where consecutive gates are combined into a single gate operation to improve the efficiency of the simulation (see the figure below). By targeting the <code class="docutils literal notranslate"><span class="pre">nvidia-mgpu</span></code> backend and setting the <code class="docutils literal notranslate"><span class="pre">CUDAQ_MGPU_FUSE</span></code> environment variable, you can select the degree of fusion that takes place. A full command-line example would look like <code class="docutils literal notranslate"><span class="pre">CUDAQ_MGPU_FUSE=4</span> <span class="pre">python</span> <span class="pre">c2h2VQE.py</span> <span class="pre">--target</span> <span class="pre">nvidia</span> <span class="pre">--target-option</span> <span class="pre">fp64,mgpu</span></code>.</p>
-<p><img alt="6ee9c13954c146d2985e006149a91e47" src="../../_images/gate-fuse.png" /></p>
+<p><img alt="e1669ee09c184782b012d8cb72e7782d" src="../../_images/gate-fuse.png" /></p>
 <p>The benefit of gate fusion is system-dependent, but it can have a large influence on simulation performance. See the example below for a 24-qubit VQE experiment where changing the fusion level resulted in significant performance boosts.</p>
-<p><img alt="ce7ed8128dd141d0a0b1ad7118abeb1e" src="../../_images/gatefusion.png" /></p>
+<p><img alt="780fa1179ff7454abb9771bd0da1d7e2" src="../../_images/gatefusion.png" /></p>
 </section>
 </section>
 
diff --git a/pr-2660/searchindex.js b/pr-2660/searchindex.js
diff --git a/pr-2660/sphinx/using/backends/sims/svsims.rst b/pr-2660/sphinx/using/backends/sims/svsims.rst
@@ -107,7 +107,7 @@ setting the target. It is worth drawing attention to gate fusion, a powerful too
     - Description
   * - ``CUDAQ_FUSION_MAX_QUBITS``
     - positive integer
-    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC). Specifically, the default is 5 for CC 8, 6 for CC 9, and 4 for CC 10. All others will use a default value of `4`.
+    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are `4`, `5`, and `5` for `FP32`. For `FP64` the corresponding defaults are `5`, `6`, and `4`. For all other CC, the default is `4` for both precision modes.
   * - ``CUDAQ_FUSION_DIAGONAL_GATE_MAX_QUBITS``
     - integer greater than or equal to -1
     - The max number of qubits used for diagonal gate fusion. The default value is `-1`, in which case the fusion size is adjusted automatically for better performance. If set to 0, gate fusion for diagonal gates is disabled.
@@ -232,7 +232,7 @@ prior to setting the target.
     - The qubit count threshold where state vector distribution is activated. Below this threshold, simulation is performed as independent (non-distributed) tasks across all MPI processes for optimal performance. Default is 25. 
   * - ``CUDAQ_MGPU_FUSE``
     - positive integer
-    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC). Specifically, the default is 5 for CC 8, 6 for CC 9, and 4 for CC 10. All others will use a default value of `6` if there are more than one MPI processes or `4` otherwise.
+    - The max number of qubits used for gate fusion. The default value depends on `GPU Compute Capability <https://developer.nvidia.com/cuda-gpus>`__ (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are `4`, `5`, and `5` for `FP32`. For `FP64` the corresponding defaults are `5`, `6`, and `4`. For all other CC, the default is `4` for both precision modes.
   * - ``CUDAQ_MGPU_P2P_DEVICE_BITS``
     - positive integer
     - Specify the number of GPUs that can communicate by using GPUDirect P2P. Default value is 0 (P2P communication is disabled).
diff --git a/pr-2660/using/backends/sims/svsims.html b/pr-2660/using/backends/sims/svsims.html
@@ -820,7 +820,7 @@ <h2>Single-GPU<a class="headerlink" href="#single-gpu" title="Permalink to this
 </tr>
 <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">CUDAQ_FUSION_MAX_QUBITS</span></code></p></td>
 <td><p>positive integer</p></td>
-<td><p>The max number of qubits used for gate fusion. The default value depends on <a class="reference external" href="https://developer.nvidia.com/cuda-gpus">GPU Compute Capability</a> (CC). Specifically, the default is 5 for CC 8, 6 for CC 9, and 4 for CC 10. All others will use a default value of <code class="code docutils literal notranslate"><span class="pre">4</span></code>.</p></td>
+<td><p>The max number of qubits used for gate fusion. The default value depends on <a class="reference external" href="https://developer.nvidia.com/cuda-gpus">GPU Compute Capability</a> (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are <code class="code docutils literal notranslate"><span class="pre">4</span></code>, <code class="code docutils literal notranslate"><span class="pre">5</span></code>, and <code class="code docutils literal notranslate"><span class="pre">5</span></code> for <code class="code docutils literal notranslate"><span class="pre">FP32</span></code>. For <code class="code docutils literal notranslate"><span class="pre">FP64</span></code> the corresponding defaults are <code class="code docutils literal notranslate"><span class="pre">5</span></code>, <code class="code docutils literal notranslate"><span class="pre">6</span></code>, and <code class="code docutils literal notranslate"><span class="pre">4</span></code>. For all other CC, the default is <code class="code docutils literal notranslate"><span class="pre">4</span></code> for both precision modes.</p></td>
 </tr>
 <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">CUDAQ_FUSION_DIAGONAL_GATE_MAX_QUBITS</span></code></p></td>
 <td><p>integer greater than or equal to -1</p></td>
@@ -938,7 +938,7 @@ <h2>Multi-node multi-GPU<a class="headerlink" href="#multi-node-multi-gpu" title
 </tr>
 <tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">CUDAQ_MGPU_FUSE</span></code></p></td>
 <td><p>positive integer</p></td>
-<td><p>The max number of qubits used for gate fusion. The default value depends on <a class="reference external" href="https://developer.nvidia.com/cuda-gpus">GPU Compute Capability</a> (CC). Specifically, the default is 5 for CC 8, 6 for CC 9, and 4 for CC 10. All others will use a default value of <code class="code docutils literal notranslate"><span class="pre">6</span></code> if there are more than one MPI processes or <code class="code docutils literal notranslate"><span class="pre">4</span></code> otherwise.</p></td>
+<td><p>The max number of qubits used for gate fusion. The default value depends on <a class="reference external" href="https://developer.nvidia.com/cuda-gpus">GPU Compute Capability</a> (CC) and the floating point precision selected for the simulator. Specifically, for CC 8.0, 9.0, and 10.0 the defaults are <code class="code docutils literal notranslate"><span class="pre">4</span></code>, <code class="code docutils literal notranslate"><span class="pre">5</span></code>, and <code class="code docutils literal notranslate"><span class="pre">5</span></code> for <code class="code docutils literal notranslate"><span class="pre">FP32</span></code>. For <code class="code docutils literal notranslate"><span class="pre">FP64</span></code> the corresponding defaults are <code class="code docutils literal notranslate"><span class="pre">5</span></code>, <code class="code docutils literal notranslate"><span class="pre">6</span></code>, and <code class="code docutils literal notranslate"><span class="pre">4</span></code>. For all other CC, the default is <code class="code docutils literal notranslate"><span class="pre">4</span></code> for both precision modes.</p></td>
 </tr>
 <tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">CUDAQ_MGPU_P2P_DEVICE_BITS</span></code></p></td>
 <td><p>positive integer</p></td>