Commit 181528e
Fixing the performance issue/bug in copying large vectors when using --at-most-two-errors-per-detector flag (#45)
### Fixing the performance issue (and an implicit bug)
This PR fixes the performance issue of costly `std::vector` copy
operations performed when the `--at-most-two-errors-per-detector` flag
is enabled. As discussed in #27, the `next_next_blocked_errs =
next_blocked_errs` line used to consume a significant share of decoding
time: each time a new search state was explored/constructed, it made a
local copy of the blocked errors for the current state, so that changes
made to the current state would not affect states explored later. In
#27, I realized that these local copies of `std::vector` data structures
only had to be made when the `--at-most-two-errors-per-detector` flag
was enabled, because only then did _Tesseract_ make additional changes
to the vector of blocked errors that had to be reverted before exploring
the next search state. In this PR, I address the flag-enabled case as
well: copying an entire large vector for every new search state, only to
revert a few changes, is highly inefficient. Instead, I store a special
value `2` (rather than `true`/`1`) for the errors that are blocked due
to the `--at-most-two-errors-per-detector` flag and must be
unblocked/reverted in the next search state. This is possible to
implement now because boolean elements are stored as integers, rather
than bits that can only be `0` or `1`.
This change/optimization can be evaluated along several dimensions:
1. Before I started working on _Tesseract_, the
`--at-most-two-errors-per-detector` flag frequently triggered copy
operations of `std::vector<bool>`. As discussed in #25, this container
improves memory efficiency by packing boolean elements into individual
bits, but I replaced it with `std::vector<char>`, which drastically
improved speed: _Tesseract_ accesses these elements frequently, and the
bit-wise operations behind `std::vector<bool>` induced significant
overhead.
2. After the optimization in #25, we were copying `std::vector<char>`,
which required more time, as the elements are larger.
3. Finally, after the optimization in #34, we were copying
`std::vector<DetectorCostTuple>`, where each element is 8 bytes,
requiring even more time.
Since we were not using the `--at-most-two-errors-per-detector` flag
(it affects the accuracy of the decoder), I had focused my optimizations
on the flag-disabled path. However, the _Tesseract_ algorithm still
contained code that frequently copied large vectors when the flag was
enabled, only to revert a few changes. In this PR, I fix this
performance issue with a smarter strategy: storing a special value `2`
for the errors that need to be unblocked/reverted in the following
iteration of exploring a search state.
Below are graphs evaluating the performance improvement achieved by
removing these unnecessary copy operations when the
`--at-most-two-errors-per-detector` flag is enabled. Note that the
removed copies are of `std::vector<DetectorCostTuple>`, as this was the
last data representation I implemented before this PR. I also noticed
that this PR improves the accuracy of the decoder when the flag is used,
because the old code had a bug. The code below:
```cpp
for (int d : edets[ei]) {
  next_detectors[d] = !next_detectors[d];
  int fired = next_detectors[d] ? 1 : -1;
  next_num_detectors += fired;
  for (int oei : d2e[d]) {
    next_detector_cost_tuples[oei].detectors_count += fired;
  }
  if (!next_detectors[d] && config.at_most_two_errors_per_detector) {
    for (size_t oei : d2e[d]) {
      next_next_detector_cost_tuples[oei].error_blocked = true;
    }
  }
}
```
contains the critical loop with the bug. This loop updates the number of
fired detectors for each error in `next_detector_cost_tuples`, but
blocks errors in `next_next_detector_cost_tuples`. When `get_detcost` is
then called, `next_next_detector_cost_tuples` is passed as the argument.
This inconsistency occurred only when the
`--at-most-two-errors-per-detector` flag was enabled: in that case,
`next_next_detector_cost_tuples` was constructed and passed to
`get_detcost`, while the updates to the fired-detector counts were made
on `next_detector_cost_tuples`. This explains the additional low
confidence results, as well as the 3 errors in a surface code benchmark
(r=11, p=0.002, 500 shots).
In this PR, the buggy loop is replaced with:
```cpp
for (int d : edets[ei]) {
  next_detectors[d] = !next_detectors[d];
  int fired = next_detectors[d] ? 1 : -1;
  next_num_detectors += fired;
  for (int oei : d2e[d]) {
    next_detector_cost_tuples[oei].detectors_count += fired;
  }
  if (!next_detectors[d] && config.at_most_two_errors_per_detector) {
    for (size_t oei : d2e[d]) {
      next_detector_cost_tuples[oei].error_blocked =
          next_detector_cost_tuples[oei].error_blocked == 1 ? 1 : 2;
    }
  }
}
```
As explained earlier, this PR entirely removes
`next_next_detector_cost_tuples` and replaces it with the sentinel-based
strategy of reverting changes made to blocked errors. Removing this
array also fixes the bug: the updates to the fired-detector counts and
to the blocked errors are now always performed on the same
`next_detector_cost_tuples` array.
**Note: I evaluated the change in accuracy primarily by comparing the
number of low confidence results. I also measured the number of errors
when executing shots/simulations, but for the benchmarks below there
were no errors either before or after this PR. The only exception was
the surface code benchmark (r=11, p=0.002, 500 shots): for this
benchmark, I encountered 3 errors (out of 500 shots) before this PR and
0 after it.**
<img width="1778" height="870" alt="Screenshot 2025-07-18 7 40 19 PM"
src="https://github.com/user-attachments/assets/a1f54d20-ee00-43bf-8c28-f29aaf487d80"
/>
<img width="1778" height="876" alt="Screenshot 2025-07-18 7 40 39 PM"
src="https://github.com/user-attachments/assets/87279c31-abae-4860-9bab-749eb02631ee"
/>
<img width="1778" height="877" alt="Screenshot 2025-07-18 7 41 01 PM"
src="https://github.com/user-attachments/assets/07f820fa-c2de-4029-b5c4-52c0e035dc71"
/>
<img width="1778" height="875" alt="Screenshot 2025-07-18 7 41 21 PM"
src="https://github.com/user-attachments/assets/8a4275ed-a787-4733-9e2c-8c609406edbc"
/>
### Analyzing the impact of this flag on the performance and accuracy
**Now that the flag is fixed and optimized (on par with the version that
does not use the flag), we can analyze its impact on the performance and
accuracy of the decoder.**
I first analyzed the performance and accuracy impact of this flag using
the same benchmarks I used to test the performance/bug fix I implemented
in this PR. I noticed that for these benchmarks, the flag provides
somewhat better accuracy, but lower performance. Below are graphs that
compare the accuracy and performance with and without using the
`--at-most-two-errors-per-detector` flag.
<img width="1778" height="985" alt="Screenshot 2025-07-18 7 41 52 PM"
src="https://github.com/user-attachments/assets/d4e4f5f3-75e8-46e1-9fdf-51537316d652"
/>
<img width="1778" height="874" alt="Screenshot 2025-07-18 7 42 10 PM"
src="https://github.com/user-attachments/assets/b31b28cd-d9a9-4a2a-bae2-88f78a16a5e0"
/>
<img width="1778" height="986" alt="Screenshot 2025-07-18 7 42 27 PM"
src="https://github.com/user-attachments/assets/c6ea54d6-0738-44b3-924f-4632291e41e1"
/>
<img width="1778" height="883" alt="Screenshot 2025-07-18 7 42 44 PM"
src="https://github.com/user-attachments/assets/e5f64154-da58-406d-8e47-2be2a0004a37"
/>
### More data on the performance and accuracy impact of the flag
I performed additional experiments/benchmarks to collect more
comprehensive data on the impact of this flag on the performance and
accuracy of the _Tesseract_ decoder. Below are plots for various groups
of codes. They confirm that for most benchmarks the flag provides
somewhat better accuracy, but lower performance.
<img width="1763" height="980" alt="Screenshot 2025-07-18 1 50 52 PM"
src="https://github.com/user-attachments/assets/3ac277dd-4e9d-418c-956d-dc331ef12019"
/>
<img width="1763" height="981" alt="Screenshot 2025-07-18 1 52 49 PM"
src="https://github.com/user-attachments/assets/9c7e50ef-7bb2-4805-8e8c-d1df4152cc10"
/>
<img width="1762" height="981" alt="Screenshot 2025-07-18 1 55 53 PM"
src="https://github.com/user-attachments/assets/1803cccf-4f25-4b9a-bb2a-3818412f60de"
/>
<img width="1762" height="980" alt="Screenshot 2025-07-18 1 57 34 PM"
src="https://github.com/user-attachments/assets/b5645353-8168-4b39-9473-4c3ed425083c"
/>
<img width="1748" height="981" alt="Screenshot 2025-07-18 2 02 48 PM"
src="https://github.com/user-attachments/assets/1084b196-365a-4e3b-a65d-bacd19929760"
/>
<img width="1748" height="981" alt="Screenshot 2025-07-18 2 04 45 PM"
src="https://github.com/user-attachments/assets/c6cbcf78-26ab-48e2-a9a3-2ff1faf3c5dc"
/>
<img width="1756" height="989" alt="Screenshot 2025-07-18 2 08 19 PM"
src="https://github.com/user-attachments/assets/e5f98227-b5e0-4eba-885f-571908d183a0"
/>
<img width="1746" height="988" alt="Screenshot 2025-07-18 3 15 07 PM"
src="https://github.com/user-attachments/assets/81c6133e-e34d-403d-85fc-320042311120"
/>
**The results show that the decoding slowdown caused by this flag ranges
from around 0% to over 40%; in very rare cases, the flag provides a
(very small) performance improvement. The accuracy improvement ranges
from 0% to over 30%, indicating that this flag can significantly improve
accuracy.**
### Major contributions of the PR:
- Removes the performance degradation that my earlier optimizations
introduced for configurations using this flag (those optimizations
targeted configurations without the flag, where they significantly
improved decoding time)
- Completely removes the inefficient/redundant `std::vector` copy
operations propagated by the former `next_next_blocked_errs =
next_blocked_errs` line (mentioned in PR #27)
- Fixes the performance issue/bug that existed when using the
`--at-most-two-errors-per-detector` flag, where large vectors were
frequently copied in each decoding iteration only to revert a few
changes (note that this issue escalated with the changes in data
representation required by previous optimization strategies)
- Extensive experiments/benchmarks performed to evaluate the impact of
the performance issue/bug fix
- Extensive experiments/benchmarks performed to evaluate the impact of
the flag itself on the performance and accuracy of the decoder
### Does it provide better performance on any benchmark now?
I also ran a benchmark our team looked at during the last meeting, where
using the `--at-most-two-errors-per-detector` flag appeared to provide
better performance. Specifically, I ran:
`bazel build src:all && time ./bazel-bin/src/tesseract --pqlimit 200000
--beam 5 --num-det-orders 20 --sample-num-shots 20 --det-order-seed
13267562 --circuit
testdata/colorcodes/r\=9\,d\=9\,p\=0.002\,noise\=si1000\,c\=superdense_color_code_X\,q\=121\,gates\=cz.stim
--sample-seed 717347 --threads 1 --print-stats`
with and without the `--at-most-two-errors-per-detector` flag. The
execution time was 69.01 seconds without the flag and 74.23 seconds with
it. There were no errors or low confidence results in either run. I
believe the benchmark we looked at during the last meeting used an
installation of _Tesseract_ that predates my optimization from #34. If
so, this shows that my optimizations have a higher impact when the flag
is disabled, and that the performance improvement they provide outweighs
this flag's initial speedup.
**Conclusion: I am very confident that the current version of the
_Tesseract_ algorithm is faster without this flag, due to the
optimizations I implemented in the `get_detcost` function. When the
`--at-most-two-errors-per-detector` flag is enabled, more errors are
blocked, preventing them from influencing detectors' costs, and
therefore the `get_detcost` function itself. I invested a lot of time
accelerating `get_detcost`, so the other speedups this flag initially
achieved no longer outweigh the impact of #34.**
PR #47 contains the code/scripts I used to benchmark and compare color,
surface, and bicycle codes with and without using the
`--at-most-two-errors-per-detector` flag.
---------
Signed-off-by: Dragana Grbic <[email protected]>
Co-authored-by: noajshu <[email protected]>
Co-authored-by: LaLeh <[email protected]>