NVIDIA
diff --git a/pr-2441/CMakeLists.txt b/pr-2441/CMakeLists.txt
+1-1
diff --git a/pr-2441/_sources/api/default_ops.rst.txt b/pr-2441/_sources/api/default_ops.rst.txt
+2-2
diff --git a/pr-2441/_sources/api/languages/python_api.rst.txt b/pr-2441/_sources/api/languages/python_api.rst.txt
+2
diff --git a/pr-2441/_sources/examples/python/executing_photonic_kernels.ipynb.txt b/pr-2441/_sources/examples/python/executing_photonic_kernels.ipynb.txt
+171
diff --git a/pr-2441/_sources/using/backends/dynamics.rst.txt b/pr-2441/_sources/using/backends/dynamics.rst.txt
+44-1
@@ -1,5 +1,5 @@
 # ============================================================================ #
-# Copyright (c) 2022 - 2024 NVIDIA Corporation & Affiliates.                   #
+# Copyright (c) 2022 - 2025 NVIDIA Corporation & Affiliates.                   #
 # All rights reserved.                                                         #
 #                                                                              #
 # This source code and the accompanying materials are made available under     #
 
@@ -650,7 +650,7 @@ defined by the qudit level that represents the qumode. If it is applied to a qum
 where the number of photons is already at the maximum value, the operation has no
 effect.
 
-:math:`U|0\rangle → |1\rangle, U|1\rangle → |2\rangle, U|2\rangle → |3\rangle, \cdots, U|d\rangle → |d\rangle`
+:math:`C|0\rangle → |1\rangle, C|1\rangle → |2\rangle, C|2\rangle → |3\rangle, \cdots, C|d\rangle → |d\rangle`
 where :math:`d` is the qudit level.
 
 .. tab:: Python
@@ -674,7 +674,7 @@ This operation reduces the number of photons in a qumode up to a minimum value o
 0 representing the vacuum state. If it is applied to a qumode where the number of
 photons is already at the minimum value 0, the operation has no effect.
 
-:math:`U|0\rangle → |0\rangle, U|1\rangle → |0\rangle, U|2\rangle → |1\rangle, \cdots, U|d\rangle → |d-1\rangle`
+:math:`A|0\rangle → |0\rangle, A|1\rangle → |0\rangle, A|2\rangle → |1\rangle, \cdots, A|d\rangle → |d-1\rangle`
 where :math:`d` is the qudit level.
 
 .. tab:: Python
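
The saturating behaviour of the two maps above can be sketched on photon-number labels alone. This is a minimal illustration and not the CUDA-Q implementation: the function names and the ``max_photons`` parameter are hypothetical, and ``max_photons`` is assumed to play the role of :math:`d` in the formulas, that is, the largest photon number the qumode can hold.

```python
def create(photons: int, max_photons: int) -> int:
    """Photon number after a create operation (hypothetical helper).

    Adds one photon, but leaves the count unchanged once it has
    reached max_photons.
    """
    return min(photons + 1, max_photons)


def annihilate(photons: int) -> int:
    """Photon number after an annihilate operation (hypothetical helper).

    Removes one photon, but leaves the vacuum state 0 unchanged.
    """
    return max(photons - 1, 0)


# Apply both maps to every photon number from 0 to d = 3.
d = 3
print([create(n, d) for n in range(d + 1)])
print([annihilate(n) for n in range(d + 1)])
```

The sketch only tracks basis-state labels; it says nothing about amplitudes or about how a simulator represents the operators.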
 
@@ -157,6 +157,8 @@ Data Types
 .. autoclass:: cudaq.operator.cudm_state.CuDensityMatState
     :members:
 
+.. autoclass:: cudaq.operator.helpers.InitialState
+
 .. autofunction:: cudaq.operator.cudm_state.to_cupy_array
 
 .. autoclass:: cudaq::SampleResult
 
@@ -0,0 +1,171 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Executing Quantum Photonic Circuits \n",
+    "\n",
+    "In CUDA-Q, there are two ways to execute quantum photonic kernels: \n",
+    "\n",
+    "1. `sample`: yields measurement counts \n",
+    "2. `get_state`: yields the quantum statevector of the computation \n",
+    "\n",
+    "## Sample\n",
+    "\n",
+    "Quantum states collapse upon measurement and hence need to be sampled many times to gather statistics. The CUDA-Q `sample` call enables this: \n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import cudaq\n",
+    "import numpy as np\n",
+    "\n",
+    "qumode_count = 2\n",
+    "\n",
+    "# Define the simulation target.\n",
+    "cudaq.set_target(\"orca-photonics\")\n",
+    "\n",
+    "# Define a quantum kernel function.\n",
+    "\n",
+    "\n",
+    "@cudaq.kernel\n",
+    "def kernel(qumode_count: int):\n",
+    "    level = qumode_count + 1\n",
+    "    qumodes = [qudit(level) for _ in range(qumode_count)]\n",
+    "\n",
+    "    # Apply the create gate to the qumodes.\n",
+    "    for i in range(qumode_count):\n",
+    "        create(qumodes[i])  # |00⟩ -> |11⟩\n",
+    "\n",
+    "    # Apply the beam_splitter gate to the qumodes.\n",
+    "    beam_splitter(qumodes[0], qumodes[1], np.pi / 6)\n",
+    "\n",
+    "    # Measure all qumodes.\n",
+    "    mz(qumodes)\n",
+    "\n",
+    "\n",
+    "result = cudaq.sample(kernel, qumode_count, shots_count=1000)\n",
+    "\n",
+    "print(result)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Get state\n",
+    "\n",
+    "The `get_state` function gives us access to the quantum statevector of the computation."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import cudaq\n",
+    "import numpy as np\n",
+    "\n",
+    "qumode_count = 2\n",
+    "\n",
+    "# Define the simulation target.\n",
+    "cudaq.set_target(\"orca-photonics\")\n",
+    "\n",
+    "# Define a quantum kernel function.\n",
+    "\n",
+    "\n",
+    "@cudaq.kernel\n",
+    "def kernel(qumode_count: int):\n",
+    "    level = qumode_count + 1\n",
+    "    qumodes = [qudit(level) for _ in range(qumode_count)]\n",
+    "\n",
+    "    # Apply the create gate to the qumodes.\n",
+    "    for i in range(qumode_count):\n",
+    "        create(qumodes[i])  # |00⟩ -> |11⟩\n",
+    "\n",
+    "    # Apply the beam_splitter gate to the qumodes.\n",
+    "    beam_splitter(qumodes[0], qumodes[1], np.pi / 6)\n",
+    "\n",
+    "    # Optionally measure some or all of the qumodes.\n",
+    "    # mz(qumodes)\n",
+    "\n",
+    "\n",
+    "# Compute the statevector of the kernel\n",
+    "result = cudaq.get_state(kernel, qumode_count)\n",
+    "\n",
+    "print(np.array(result))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The statevector generated by the `get_state` command follows little-endian convention for associating numbers with their digit string representations, which places the least significant digit on the right. That is, for the example of a 2-qumode system of level 3 (in which possible states are 0, 1, and 2), we have the following translation between integers and digit string:\n",
+    "$$\\begin{matrix} \n",
+    "\\text{Integer} & \\text{digit string representation}\\\\\n",
+    "& \\text{least significant digit on right}\\\\\n",
+    "0 = \\textcolor{blue}{0}*3^1 + \\textcolor{red}{0}*3^0 & \\textcolor{blue}{0}\\textcolor{red}{0} \\\\\n",
+    "1 = \\textcolor{blue}{0}*3^1 + \\textcolor{red}{1}*3^0 & \\textcolor{blue}{0}\\textcolor{red}{1}\\\\\n",
+    "2 = \\textcolor{blue}{0}*3^1 + \\textcolor{red}{2}*3^0 & \\textcolor{blue}{0}\\textcolor{red}{2}\\\\\n",
+    "3 = \\textcolor{blue}{1}*3^1 + \\textcolor{red}{0}*3^0 & \\textcolor{blue}{1}\\textcolor{red}{0} \\\\\n",
+    "4 = \\textcolor{blue}{1}*3^1 + \\textcolor{red}{1}*3^0 & \\textcolor{blue}{1}\\textcolor{red}{1} \\\\\n",
+    "5 = \\textcolor{blue}{1}*3^1 + \\textcolor{red}{2}*3^0 & \\textcolor{blue}{1}\\textcolor{red}{2} \\\\\n",
+    "6 = \\textcolor{blue}{2}*3^1 + \\textcolor{red}{0}*3^0 & \\textcolor{blue}{2}\\textcolor{red}{0} \\\\\n",
+    "7 = \\textcolor{blue}{2}*3^1 + \\textcolor{red}{1}*3^0 & \\textcolor{blue}{2}\\textcolor{red}{1} \\\\\n",
+    "8 = \\textcolor{blue}{2}*3^1 + \\textcolor{red}{2}*3^0 & \\textcolor{blue}{2}\\textcolor{red}{2} \n",
+    "\\end{matrix}\n",
+    "$$\n"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Parallelization Techniques\n",
+    "\n",
+    "Executing the quantum photonic kernel is the most computationally intensive task, so each execution function (`sample` and `get_state`) can be parallelized given access to multiple quantum processing units (multi-QPU). We emulate each QPU with a CPU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(cudaq.__version__)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
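
The index-to-digit-string table in the notebook can be reproduced with a short conversion routine. The sketch below is illustrative only: both function names are hypothetical and are not part of the CUDA-Q API. It assumes every qumode has the same level and that the rightmost digit is the least significant one, as the table states.

```python
def index_to_digits(index: int, qumode_count: int, level: int) -> str:
    """Convert a statevector index into its digit string (hypothetical helper).

    The rightmost digit is the least significant one, so it is produced
    first and the collected digits are reversed at the end.
    """
    digits = []
    for _ in range(qumode_count):
        index, digit = divmod(index, level)
        digits.append(str(digit))
    return "".join(reversed(digits))


def digits_to_index(digits: str, level: int) -> int:
    """Convert a digit string back into a statevector index (hypothetical helper)."""
    index = 0
    for char in digits:
        index = index * level + int(char)
    return index


# Reproduce the table for a 2-qumode system of level 3.
for i in range(3**2):
    print(i, index_to_digits(i, qumode_count=2, level=3))
```

With such a helper, the amplitude at position `i` of the array printed from `get_state` can be labelled with `index_to_digits(i, qumode_count, level)`, provided the assumptions above hold for the target in use.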
@@ -84,6 +84,8 @@ For example, we can plot the Pauli expectation value for the above simulation as
 In particular, for each time step, `evolve` captures an array of expectation values, one for each  
 observable. Hence, we convert them into sequences for plotting purposes.
 
+Examples that illustrate how to use the ``dynamics`` target are available 
+in the `CUDA-Q repository <https://github.com/NVIDIA/cuda-quantum/tree/main/docs/sphinx/examples/python/dynamics>`__. 
 
 Operator
 +++++++++++
@@ -272,4 +274,45 @@ backend target.
     If the output is a '`None`' string, it indicates that your Torch installation does not support CUDA.
     In this case, you need to install a CUDA-enabled Torch package via other mechanisms, e.g., building Torch from source or
     using their Docker images.
- 
+
+Multi-GPU Multi-Node Execution
++++++++++++++++++++++++++++++++
+
+.. _cudensitymat_mgmn:
+
+The CUDA-Q ``dynamics`` target supports parallel execution on multiple GPUs. 
+To enable parallel execution, the application must initialize MPI as follows.
+
+
+.. tab:: Python
+
+  .. literalinclude:: ../../snippets/python/using/backends/dynamics.py
+        :language: python
+        :start-after: [Begin MPI]
+        :end-before: [End MPI]
+
+  .. code:: bash 
+
+        mpiexec -np <N> python3 program.py 
+  
+  where ``N`` is the number of processes.
+
+
+Initializing the MPI execution environment (via `cudaq.mpi.initialize()`) in the application code and
+launching the application with an MPI launcher activates the multi-node multi-GPU feature of the ``dynamics`` target.
+Specifically, the target detects the number of processes (GPUs) and distributes the computation across all available GPUs.
+
+
+.. note::
+    The number of MPI processes must be a power of 2, with one GPU per process.
+
+.. note::
+    Not all integrators are capable of handling distributed state. Errors will be raised if parallel execution is activated 
+    but the selected integrator does not support distributed state. 
+
+.. warning:: 
+    As of cuQuantum version 24.11, there are a couple of `known limitations <https://docs.nvidia.com/cuda/cuquantum/24.11.0/cudensitymat/index.html>`__ for parallel execution:
+
+    - Computing the expectation value of a mixed quantum state is not supported. Thus, `collapse_operators` are not supported if expectation calculation is required.
+
+    - Some combinations of quantum states and quantum many-body operators are not supported. Errors will be raised in those cases.
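
Because the process count must be a power of 2, a launch script may want to validate the value passed to ``mpiexec -np`` before starting a long simulation. The following is a minimal sketch of such a check. The helper names are hypothetical and not part of CUDA-Q, and the process count is assumed to be supplied by the caller (for example, read from the MPI runtime).

```python
def is_power_of_two(n: int) -> bool:
    """Return True if n is a positive power of 2 (1, 2, 4, 8, ...)."""
    # A power of 2 has exactly one bit set, so clearing the lowest
    # set bit with n & (n - 1) must leave zero.
    return n > 0 and (n & (n - 1)) == 0


def validate_process_count(process_count: int) -> None:
    """Raise ValueError if the process count is not a power of 2 (hypothetical helper)."""
    if not is_power_of_two(process_count):
        raise ValueError(
            f"Expected a power-of-2 number of MPI processes, got {process_count}."
        )


# Show which candidate process counts would pass the check.
for candidate in (1, 2, 3, 4, 6, 8):
    print(candidate, is_power_of_two(candidate))
```

Failing early in this way only guards the process count; the integrator and operator limitations described in the notes above are still reported by the target itself at run time.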