docs/src/tutorials/sharding.md
## Basics

Sharding is one mechanism supported within Reactant that tries to make it easy to program
for multiple devices (including [multiple nodes](@ref distributed)).

```@example sharding_tutorial
using Reactant
Reactant.devices()
```

Sharding provides Reactant users a
[PGAS (parallel-global address space)](https://en.wikipedia.org/wiki/Partitioned_global_address_space)
programming model. Let's understand what this means through example.

Suppose we have a function that takes a large input array and computes sin for all elements
of the array.

```@example sharding_tutorial
function big_sin(data)
    data .= sin.(data)
    return nothing
end

N = 1600
x = Reactant.to_rarray(reshape(collect(Float32, 1:N), 40, 40))

compiled_big_sin = @compile big_sin(x)

compiled_big_sin(x)
```

This successfully allocates the array `x` on one device, and executes it on the same device.
However, suppose we want to execute this computation on multiple devices. Perhaps this is
because the size of our inputs (`N`) is too large to fit on a single device. Or
alternatively the function we execute is computationally expensive and we want to leverage
the computing power of multiple devices.

Unlike more explicit communication libraries like MPI, the sharding model used by Reactant
aims to let you execute a program on multiple devices without significant modifications to
the single-device program. In particular, you do not need to write explicit communication
calls (e.g. `MPI.Send` or `MPI.Recv`). Instead you write your program as if it executes on a
very large single node and Reactant will automatically determine how to subdivide the data,
computation, and required communication.

When using sharding, the one thing you need to change about your code is how arrays are
allocated. In particular, you need to specify how the array is partitioned amongst available
devices. For example, suppose you are on a machine with 4 GPUs. In the example above, we
computed `sin` for all elements of a 40x40 grid. One partitioning we could select is to have
it partitioned along the first axis, such that each GPU has a slice of 10x40 elements. We
could accomplish this as follows. No change is required to the original function. However,
the compiled function is specific to the sharding so we need to compile a new version for
our sharded array.
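
As a rough sketch of one way this allocation could be written (illustrative only; it assumes
at least 4 devices, the `Sharding.NamedSharding` partition spec, and the `sharding` keyword
of `Reactant.to_rarray`, and it builds on `N` and `big_sin` defined above):

```julia
# Sketch: shard the 40x40 array along its first axis across 4 devices, so each
# device holds a 10x40 slice.
mesh = Sharding.Mesh(reshape(Reactant.devices()[1:4], 4, 1), (:x, :y))

x_sharded = Reactant.to_rarray(
    reshape(collect(Float32, 1:N), 40, 40);
    sharding=Sharding.NamedSharding(mesh, (:x, nothing)),  # dim 1 along :x, dim 2 replicated
)

# The function itself is unchanged; we only compile a new version for the sharded input.
compiled_big_sin_sharded = @compile big_sin(x_sharded)
compiled_big_sin_sharded(x_sharded)
```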

Alternatively, we can partition the data in a different form. In particular, we could
subdivide the data on both axes. As a result each GPU would have a slice of 20x20 elements.
Again no change is required to the original function, but we would change the allocation as
follows:
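
A similarly hedged sketch, again assuming the `Sharding.NamedSharding` API:

```julia
# Sketch: a 2x2 mesh, partitioning both axes so each device holds a 20x20 slice.
mesh2d = Sharding.Mesh(reshape(Reactant.devices()[1:4], 2, 2), (:x, :y))

x_sharded_2d = Reactant.to_rarray(
    reshape(collect(Float32, 1:N), 40, 40);
    sharding=Sharding.NamedSharding(mesh2d, (:x, :y)),  # dim 1 along :x, dim 2 along :y
)
```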

Sharding in Reactant requires you to specify how the data is sharded across devices on a
mesh. We start by specifying the mesh [`Sharding.Mesh`](@ref), which is a collection of the
devices reshaped into an N-D grid. Additionally, we can specify names for each axis of the
mesh, which are then referenced when specifying how the data is sharded.

1. `Sharding.Mesh(reshape(Reactant.devices()[1:4], 2, 2), (:x, :y))`: Creates a 2D grid of 4
   devices arranged in a 2x2 grid. The first axis is named `:x` and the second axis is named
   `:y`.
2. `Sharding.Mesh(reshape(Reactant.devices()[1:4], 4, 1), (:x, :y))`: Creates a 2D grid of 4
   devices arranged in a 4x1 grid. The first axis is named `:x` and the second axis is
   named `:y`.

Given the mesh, we will specify how the data is sharded across the devices.

<!-- TODO describe how arrays are the "global data arrays, even though data is itself only stored on relevant device and computation is performed only devices with the required data (effectively showing under the hood how execution occurs) -->
<!-- TODO make a simple conway's game of life, or heat equation using sharding simulation example to show how a ``typical MPI'' simulation can be written using sharding. -->
## Simple 1-Dimensional Heat Equation

So far we chose a function which was perfectly parallelizable (e.g. each element of the
array only accesses its own data). Let's consider a more realistic example where an updated
element requires data from its neighbors. In the distributed case, this requires
communicating the data along the boundaries.

In particular, let's implement a one-dimensional
[heat equation](https://en.wikipedia.org/wiki/Heat_equation) simulation. In this code you
initialize the temperature of all points of the simulation and over time the code will
simulate how the heat is transferred across space. In particular, points of high temperature
will transfer energy to points of lower temperature.

As an example, here is a visualization of a 2-dimensional heat equation:

<!-- TODO we should animate the above -- and even more ideally have one we generate ourselves. -->

To keep things simple, let's implement a 1-dimensional heat equation here. We start off with
an array for the temperature at each point, and will compute the next version of the
temperatures according to the equation
`x[i, t] = 0.5 * x[i, t-1] + 0.25 * x[i-1, t-1] + 0.25 * x[i+1, t-1]`.
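
As a plain single-device sketch of this update rule (illustrative only; `heat_step` is a
hypothetical helper and the two boundary points are simply left untouched):

```julia
# Sketch: one explicit time step of the 1-D heat equation on an ordinary Julia array.
function heat_step(x::AbstractVector)
    x_new = copy(x)
    for i in 2:(length(x) - 1)                 # interior points only
        x_new[i] = 0.5 * x[i] + 0.25 * x[i - 1] + 0.25 * x[i + 1]
    end
    return x_new
end

temps = zeros(Float32, 100)
temps[45:55] .= 1.0f0                          # a hot region in the middle
temps = heat_step(temps)
```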

Let's consider how this can be implemented with explicit MPI communication. Each node will
contain a subset of the total data. For example, if we simulate with 100 points, and have 4
devices, each device will contain 25 data points. We're going to allocate some extra room at
each end of the buffer to store the "halo", or the data at the boundary. Each time step
that we take will first copy in the data from its neighbors into the halo via an explicit
MPI send and recv call. We'll then compute the updated data for our slice of the data.
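
For illustration, such a halo exchange might look roughly like the following sketch using
MPI.jl (`exchange_halo!` is a hypothetical helper, not code used elsewhere in this tutorial;
the keyword-based `MPI.Isend`/`MPI.Recv!` calls assume a recent MPI.jl release):

```julia
using MPI

# Sketch: `local_x` holds 25 interior points (indices 2:26) plus one halo cell at each
# end (indices 1 and 27). Each rank swaps boundary values with its neighbors.
# Assumes MPI.Init() has already been called.
function exchange_halo!(local_x, comm)
    rank, nranks = MPI.Comm_rank(comm), MPI.Comm_size(comm)
    left  = rank == 0          ? MPI.PROC_NULL : rank - 1
    right = rank == nranks - 1 ? MPI.PROC_NULL : rank + 1

    # Post non-blocking sends of our boundary values to each neighbor...
    reqs = [MPI.Isend(local_x[2:2],   comm; dest=left,  tag=0),
            MPI.Isend(local_x[26:26], comm; dest=right, tag=1)]
    # ...and receive the neighbors' boundary values into our halo cells.
    MPI.Recv!(view(local_x, 27:27), comm; source=right, tag=0)
    MPI.Recv!(view(local_x, 1:1),   comm; source=left,  tag=1)
    MPI.Waitall(reqs)
    return local_x
end
```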

With sharding, things are a bit simpler. We can write the code as if we only had one
device. No explicit send or recv calls are necessary as they will be added automatically by
Reactant when it deduces they are needed. In fact, Reactant will attempt to optimize the
placement of the communications to minimize total runtime. While Reactant tries to do a
good job (which could be faster than an initial implementation -- especially for complex
codebases), an expert may be able to find a better placement of the communication.

The only difference for the sharded code again occurs during allocation. Here we explicitly
specify that we want to subdivide the initial grid of 100 amongst all devices. Analogously,
if we had 4 devices to work with, each device would have 25 elements in its local storage.
From the user's standpoint, however, all arrays give access to the entire dataset.

::: code-group
)

function simulate(data, time_steps)
    @trace for i in 1:time_steps
        one_dim_heat_equation_time_step_sharded!(data)
    end
end
:::

One nice feature of Reactant's handling of multiple devices is that you don't need to
specify how the data is transferred. All devices from all nodes are available for use by
Reactant, and given the topology of the devices, Reactant will automatically determine the
right type of communication primitive to use to send data between the relevant nodes. For
example, between GPUs on the same host Reactant may use the faster `cudaMemcpy` to copy data
directly between devices, whereas for GPUs on different nodes it will send the data through
the network via NCCL, and for CPUs on different nodes it can use MPI.

The fact that you don't need to specify how the communication occurs means code written with
Reactant can be run on a different topology without modification (e.g. moving from a single
multi-GPU host to GPUs spread across several nodes).

## Devices

You can query the available devices that Reactant can access as follows using
[`Reactant.devices`](@ref).
You can inspect the type of the device, as well as its properties.
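
For example (a minimal sketch building on `Reactant.devices()`; the exact device types and
properties you see depend on your backend):

```julia
devices = Reactant.devices()   # all devices visible to Reactant
dev = first(devices)

typeof(dev)                    # the concrete device type (CPU, GPU, TPU, ...)
string(dev)                    # a human-readable description of the device
```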
## Generating Distributed Data by Concatenating Local-Worker Data