Commit 9b1205d

Merge pull request #11 from jwshi21/sched-docs
add scheduling documentation
2 parents 1bcf55e + 4dafd94 commit 9b1205d

File tree: 2 files changed, +220 −0 lines changed

documentation/docs/scheduling.md

+219
@@ -0,0 +1,219 @@
The scheduling language enables users to specify and compose transformations to further optimize the code generated by taco.

Consider the following SpMV computation and associated generated code, which we will transform below:
```c++
Format csr({Dense,Sparse});
Tensor<double> A("A", {512, 64}, csr);
Tensor<double> x("x", {64}, {Dense});
Tensor<double> y("y", {512}, {Dense});

IndexVar i("i"), j("j");
Access matrix = A(i, j);
y(i) = matrix * x(j);
IndexStmt stmt = y.getAssignment().concretize();
```
```c
for (int32_t i = 0; i < A1_dimension; i++) {
  for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
    int32_t j = A2_crd[jA];
    y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
  }
}
```

# Pos

The `pos(i, ipos, access)` transformation takes an index variable `i` that iterates over the coordinate space of `access` and replaces it with a derived index variable `ipos` that iterates over the same range, but with respect to the position space.

Since the `pos` transformation is not valid for dense level formats, the following would result in an error for the SpMV example:
```c++
stmt = stmt.pos(i, IndexVar("ipos"), matrix);
```

We could instead have:
```c++
stmt = stmt.pos(j, IndexVar("jpos"), matrix);
```
```c
for (int32_t i = 0; i < A1_dimension; i++) {
  for (int32_t jposA = A2_pos[i]; jposA < A2_pos[(i + 1)]; jposA++) {
    if (jposA < A2_pos[i] || jposA >= A2_pos[(i + 1)])
      continue;

    int32_t j = A2_crd[jposA];
    y_vals[i] = y_vals[i] + A_vals[jposA] * x_vals[j];
  }
}
```

# Fuse

The `fuse(i, j, f)` transformation takes two index variables `i` and `j`, where `j` is directly nested under `i`, and collapses them into a fused index variable `f` that iterates over the product of the coordinates of `i` and `j`.

`fuse` helps facilitate other transformations, such as iterating over the position space of several index variables, as in this SpMV example:
```c++
IndexVar f("f");
stmt = stmt.fuse(i, j, f);
stmt = stmt.pos(f, IndexVar("fpos"), matrix);
```
```c
for (int32_t fposA = 0; fposA < A2_pos[A1_dimension]; fposA++) {
  if (fposA >= A2_pos[A1_dimension])
    continue;

  int32_t f = A2_crd[fposA];
  while (fposA == A2_pos[(i_pos + 1)]) {
    i_pos++;
    i = i_pos;
  }
  y_vals[i] = y_vals[i] + A_vals[fposA] * x_vals[f];
}
```

# Split

The `split(i, i0, i1, splitFactor)` transformation splits (strip-mines) an index variable `i` into two nested index variables `i0` and `i1`. The size of the inner index variable `i1` is then held constant at `splitFactor`, which must be a positive integer.

For the SpMV example, we could have:
```c++
stmt = stmt.split(i, IndexVar("i0"), IndexVar("i1"), 16);
```
```c
for (int32_t i0 = 0; i0 < ((A1_dimension + 15) / 16); i0++) {
  for (int32_t i1 = 0; i1 < 16; i1++) {
    int32_t i = i0 * 16 + i1;
    if (i >= A1_dimension)
      continue;

    for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
      int32_t j = A2_crd[jA];
      y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
    }
  }
}
```

<!-- (not yet implemented) -->
<!-- # Divide

The `divide(i, i0, i1, divideFactor)` transformation divides an index variable `i` into two nested index variables `i0` and `i1`. The size of the outer index variable `i0` is then held constant at `divideFactor`, which must be a positive integer. -->

# Precompute

The `precompute(expr, i, iw, workspace)` transformation, which is described in more detail [here](http://tensor-compiler.org/taco-workspaces.pdf), leverages scratchpad memories and reorders computations to increase locality.

Given a subexpression `expr` to precompute, an index variable `i` to precompute over, and an index variable `iw` (which may be the same as or different from `i`) to precompute with, the precomputed results are stored in the tensor variable `workspace`.

For the SpMV example, if `rhs` is the right-hand side of the original statement, we could have:
```c++
TensorVar workspace("workspace", Type(Float64, {Dimension(64)}), taco::dense);
stmt = stmt.precompute(rhs, j, j, workspace);
```
```c
for (int32_t i = 0; i < A1_dimension; i++) {
  double* restrict workspace = 0;
  workspace = (double*)malloc(sizeof(double) * 64);
  for (int32_t pworkspace = 0; pworkspace < 64; pworkspace++) {
    workspace[pworkspace] = 0.0;
  }
  for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
    int32_t j = A2_crd[jA];
    workspace[j] = A_vals[jA] * x_vals[j];
  }
  for (int32_t j = 0; j < 64; j++) {
    y_vals[i] = y_vals[i] + workspace[j];
  }
  free(workspace);
}
```

# Reorder

The `reorder(vars)` transformation takes a new ordering for a set of index variables in the expression that are directly nested in the iteration order.

For the SpMV example, we could have:
```c++
stmt = stmt.reorder({j, i});
```
```c
for (int32_t jA = A2_pos[iA]; jA < A2_pos[(iA + 1)]; jA++) {
  int32_t j = A2_crd[jA];
  for (int32_t i = 0; i < A1_dimension; i++) {
    y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
  }
}
```

# Bound

The `bound(i, ibound, bound, bound_type)` transformation replaces an index variable `i` with an index variable `ibound` that obeys a compile-time constraint on its iteration space, incorporating knowledge about the size or structured sparsity pattern of the corresponding input. The meaning of `bound` depends on the `bound_type`.

For the SpMV example, we could have:
```c++
stmt = stmt.bound(i, IndexVar("ibound"), 100, BoundType::MaxExact);
```
```c
for (int32_t ibound = 0; ibound < 100; ibound++) {
  for (int32_t jA = A2_pos[ibound]; jA < A2_pos[(ibound + 1)]; jA++) {
    int32_t j = A2_crd[jA];
    y_vals[ibound] = y_vals[ibound] + A_vals[jA] * x_vals[j];
  }
}
```

# Unroll

The `unroll(i, unrollFactor)` transformation unrolls the loop corresponding to an index variable `i` by `unrollFactor` iterations, where `unrollFactor` is a positive integer.

For the SpMV example, we could have:
```c++
IndexVar i0("i0"), i1("i1");
stmt = stmt.split(i, i0, i1, 32);
stmt = stmt.unroll(i0, 4);
```
```c
if ((((A1_dimension + 31) / 32) * 32 + 32) + (((A1_dimension + 31) / 32) * 32 + 32) >= A1_dimension) {
  for (int32_t i0 = 0; i0 < ((A1_dimension + 31) / 32); i0++) {
    for (int32_t i1 = 0; i1 < 32; i1++) {
      int32_t i = i0 * 32 + i1;
      if (i >= A1_dimension)
        continue;

      for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
        int32_t j = A2_crd[jA];
        y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
      }
    }
  }
}
else {
  #pragma unroll 4
  for (int32_t i0 = 0; i0 < ((A1_dimension + 31) / 32); i0++) {
    for (int32_t i1 = 0; i1 < 32; i1++) {
      int32_t i = i0 * 32 + i1;
      for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
        int32_t j = A2_crd[jA];
        y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
      }
    }
  }
}
```

# Parallelize

The `parallelize(i, parallel_unit, output_race_strategy)` transformation tags an index variable `i` for parallel execution on hardware of type `parallel_unit`. Data races are handled according to the given `output_race_strategy`. Since the other transformations expect serial code, `parallelize` must come last in a series of transformations.

For the SpMV example, we could have:
```c++
stmt = stmt.parallelize(i, ParallelUnit::CPUThread, OutputRaceStrategy::NoRaces);
```
```c
#pragma omp parallel for schedule(runtime)
for (int32_t i = 0; i < A1_dimension; i++) {
  for (int32_t jA = A2_pos[i]; jA < A2_pos[(i + 1)]; jA++) {
    int32_t j = A2_crd[jA];
    y_vals[i] = y_vals[i] + A_vals[jA] * x_vals[j];
  }
}
```

documentation/mkdocs.yml

+1

@@ -15,6 +15,7 @@ pages:
   - C++ Library:
     - 'Defining Tensors': 'tensors.md'
     - 'Computing on Tensors': 'computations.md'
+    - 'Providing a Schedule': 'scheduling.md'
   - Python Library:
     - 'Tutorial': 'tutorial.md'
     - 'Defining Tensors': 'pytensors.md'