rustdoc-search: search backend with partitioned suffix tree #144476

notriddle · 2025-07-26T00:43:20Z

Before:

After:

Summary

Rewrites the rustdoc search engine to use an indexed data structure, factored out as a crate called stringdex, that lets it perform modified-Levenshtein distance calculations, substring matches, and prefix matches with a reasonably efficient and, more importantly, incremental algorithm.

Motivation

Fixes #131156

While the windows-rs crate is definitely the worst offender, I've noticed performance problems with the compiler crates as well. It makes no sense for rustdoc-search to have poor performance: it's basically a spell checker, and those have been usable since the '90s.

Stringdex is particularly designed to quickly return exact matches, to always report those matches first, and to never, ever place new matches on top of old ones. It also tries to yield to the event loop occasionally as it runs. This way, you can click the exactly-matched result before the rest of the search finishes running.
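The ordering and yielding behavior described above can be illustrated with a small sketch. This is not stringdex's API: `rankedMatches` and `showIncrementally` are hypothetical names, a linear scan stands in for the real index, and the batch size is an arbitrary assumption. It only shows the contract: exact matches first, later results appended below earlier ones, and the event loop given a turn between batches.

```javascript
// Hypothetical sketch, not stringdex's API: a linear scan stands in for the index.
// Results come out in tiers (exact, then prefix, then substring), and a name
// that was already emitted is never emitted again.
function* rankedMatches(names, query) {
    const seen = new Set();
    const tiers = [
        name => name === query,
        name => name.startsWith(query),
        name => name.includes(query),
    ];
    for (const matches of tiers) {
        for (const name of names) {
            if (!seen.has(name) && matches(name)) {
                seen.add(name);
                yield name;
            }
        }
    }
}

// Appends results in batches, yielding to the event loop between batches so
// that a click on an early result can be handled before the search finishes.
async function showIncrementally(names, query, onResult, batchSize = 50) {
    let sinceYield = 0;
    for (const name of rankedMatches(names, query)) {
        onResult(name); // append only: earlier results keep their position
        if (++sinceYield >= batchSize) {
            sinceYield = 0;
            await new Promise(resolve => setTimeout(resolve, 0));
        }
    }
}
```

Because results are only ever appended, the exact match stays at the top of the list no matter how long the rest of the search runs.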

Explanation

A longer description of how name search works can be found in stringdex's HACKING.md file.

Type search is done by performing a name search on each element, then performing bitmap operations to narrow down a list of potential matches, then performing type unification on each match.
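As a rough illustration of the bitmap narrowing step, here is a minimal sketch. The `postings` table, its contents, and `candidateIds` are made up for the example, and a plain `Map` lookup stands in for the per-element name search. The final type unification pass over the surviving candidates is left out.

```javascript
// Hypothetical data: bit i is set when function number i mentions that type.
const postings = new Map([
    ["vec",    0b0111n],
    ["usize",  0b0110n],
    ["option", 0b1100n],
]);

// Intersects the bitmaps for every type named in the query and returns the
// IDs of the functions that survive. Unification would then run on these only.
function candidateIds(typeNames) {
    let bits = null;
    for (const name of typeNames) {
        const b = postings.get(name) ?? 0n;
        bits = bits === null ? b : bits & b;
    }
    const ids = [];
    for (let i = 0, rest = bits ?? 0n; rest !== 0n; i++, rest >>= 1n) {
        if (rest & 1n) ids.push(i);
    }
    return ids;
}
```

With this table, `candidateIds(["vec", "usize"])` narrows four functions down to two before any per-function type checking has to happen.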

Drawbacks

It's rather complex, and takes up more disk space than the current flat list of strings.

Rationale and alternatives

Instead of a suffix tree, I could've used a different approximate matching data structure. I didn't do that because I wanted to keep the current behavior (for example, a straightforward trigram index won't match oepn like the current system does).
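To make the oepn example concrete, the sketch below compares the two approaches. The exact "modified Levenshtein" variant that rustdoc uses is not spelled out in this thread, so the optimal string alignment distance (Levenshtein plus adjacent transposition) is used here as an assumption. The function names are illustrative only.

```javascript
// All three-character windows of a string.
function trigrams(s) {
    const out = new Set();
    for (let i = 0; i + 3 <= s.length; i++) out.add(s.slice(i, i + 3));
    return out;
}

// Number of trigrams the two strings have in common.
function sharedTrigrams(a, b) {
    const tb = trigrams(b);
    return [...trigrams(a)].filter(t => tb.has(t)).length;
}

// Optimal string alignment distance: insert, delete, substitute, and
// swapping two adjacent characters each cost 1.
function osaDistance(a, b) {
    const d = [];
    for (let i = 0; i <= a.length; i++) {
        d.push(new Array(b.length + 1).fill(0));
        d[i][0] = i;
    }
    for (let j = 0; j <= b.length; j++) d[0][j] = j;
    for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
            if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
            }
        }
    }
    return d[a.length][b.length];
}
```

Here `sharedTrigrams("oepn", "open")` is 0, since the trigrams are oep/epn versus ope/pen, so a plain trigram lookup never considers the candidate. `osaDistance("oepn", "open")` is 1, which is why an edit-distance search still finds it.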

Prior art

Sherlodoc is based on a similar concept, but they:

use edit distance over a suffix tree for type-based search, instead of the binary matching that's implemented here
use substring matching for name-based search, but not fuzzy name matching
actually implement body text search, which is a potential-future feature, but not implemented in this PR

Future possibilities

Faster type-driven search

This is the most noticeable problem when I test it. Type-driven search feels slow, because it has to buffer all of the results before it can show any of them. We don't have a fancy trick algorithm like the backlog heap in prefix search, and we should.

To fix this, we might need to ditch pure bitmaps in favor of some other IR data structure to store the inverted index. It's better now than it used to be, at least in terms of memory management, but we need a better incremental search algo here.

Low-level optimization in stringdex

There are half a dozen low-level optimizations that I still need to explore. I haven't done them yet, because I've been working on bug fixes and rebasing on rustdoc's side, and a more solid and diverse test suite for stringdex itself.

Stringdex decides whether to bundle two nodes into the same file based on size. To figure out a node's size, I have to run compression on it. This is probably slower than it needs to be.
Stack compression is limited to the same 256-slot sliding windows as backref compression, and it doesn't have to be. (stack and backref compression are used to optimize the representation of the edge pointer from a parent node to its child; backref uses one byte, while stack is entirely implicit)
The JS-side decoder is pretty naive. It performs unnecessary hash table lookups when decoding compressed nodes, and retains a list of hashes that it doesn't need. It needs to calculate the hashes in order to construct the merkle tree correctly, but it doesn't need to keep them.
Data compression happens at the end, while emitting the node. This means it's not being counted when deciding on how to bundle, which is pretty dumb.

Improved recall in type-driven search

Right now, type-driven search performs very strict matching. It's very precise, but misses a lot of things people would want.

What I'm not sure about is whether to focus more on edit-distance-based approaches, or to focus on type-theoretical approaches. Both give avenues for improvement, but edit distance is going to be faster while type checking is going to be more precise.

For example, a type-theoretical improvement would fix Iterator<T>, (T -> U) -> Iterator<U> to give Iterator::map, because it would recognize that the Map struct implements the Iterator trait. I don't know of any clean way to get this result to work without implementing significant type checking logic in search.js, and an edit-distance-based "dirty" approach would likely give a bunch of other results on top of this one.

Full-text search

Once you've got this fuzzy dictionary matching to work, the logical next step is to implement some kind of information retrieval-based approach to phrase matching.

Like applying edit distance to types, phrase search gets you significantly better recall, but with a few major drawbacks:

You have to pick between index bloat and the use of stopwords. Stopwords are bad because they might actually be important (try searching "if let" in mdBook if you're feeling brave), but without them, you spend a lot of space on text that doesn't matter.
Example code also tends to have a lot of irrelevant stuff in it. As with stopwords, we'd have to pick between potentially confusing omissions and index bloat.

Neither of these problems is a deal-breaker, but they're worth keeping in mind.
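The stopword trade-off can be seen in a toy inverted index. This is only a sketch: `buildIndex`, the sample documents, and the stopword list are all made up for illustration, and it indexes single words rather than doing real phrase matching.

```javascript
// Builds a word -> set-of-document-IDs index, skipping any stopwords.
function buildIndex(docs, stopwords = new Set()) {
    const index = new Map();
    docs.forEach((text, docId) => {
        for (const word of text.toLowerCase().split(/\W+/)) {
            if (word === "" || stopwords.has(word)) continue;
            if (!index.has(word)) index.set(word, new Set());
            index.get(word).add(docId);
        }
    });
    return index;
}

// Hypothetical sample data.
const docs = ["Use if let to match one pattern", "Let the parser run"];
const full = buildIndex(docs);
const trimmed = buildIndex(docs, new Set(["if", "let", "the", "to"]));
```

The trimmed index has fewer entries than the full one, but it can no longer answer a query for "if let" at all, which is exactly the kind of term a Rust user might search for.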

GuillaumeGomez · 2025-08-08T14:02:53Z

I didn't look at the code yet, just testing the output. As I already said, I love the new GUI, so big 👍 from me on it.

I noted a few things while testing the new output (not blockers, but definitely improvements for potential future PRs):

A visual indicator that more results are loading would be great.
The empty parens ( ) on the tab titles feel a bit weird. I would simply not display the number until we're done searching, or add another visual indicator that it's not done yet (a rotating spinner?).
The cog icon for the settings menu got updated. Any reason for that?

Now for the feeling about the search itself: it is MUCH better. No more blocking UI or anything, just working. It's definitely great.

package.json

tests/rustdoc-gui/toggle-docs-mobile-failure.png

GuillaumeGomez · 2025-08-08T14:06:25Z

For the review, another thing that's not great is that the UI and search index changes are in the same commit (the first one). Not sure if that's intended or not, but it makes the review quite awful to go through. ^^'

tests/rustdoc-gui/notable-trait-failure.png

GuillaumeGomez · 2025-08-08T14:14:32Z

src/librustdoc/html/static/js/search.js

@@ -830,7 +880,7 @@ function createQueryElement(query, parserState, name, generics, isInGenerics) {
 */
 function makePrimitiveElement(name, extra) {
    return Object.assign({
-        name,
+        name: name,


Not needed. In JS, if the variable and the field have the same name, you can just write the variable name, and it'll create a key with the variable's name and the variable's value.

Good catch.
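For reference, a minimal sketch of the shorthand being discussed. The value "u32" is just a placeholder.

```javascript
const name = "u32";

// Both object literals produce the same shape: a `name` key holding "u32".
const longhand = { name: name };
const shorthand = { name };
```

The two forms are equivalent, so the `name: name` spelling in the diff adds nothing over the original `name,`.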

src/librustdoc/html/static/js/search.js

GuillaumeGomez · 2025-08-08T15:38:57Z

I went through the code and didn't see anything standing out, but I feel like I missed most of it and definitely can't picture the whole thing.

I think the PR is fine, minus the various comments I made. Tests are there to enforce it, and the demo works super nicely.

notriddle · 2025-08-08T15:43:48Z

Trying to address some of the questions asked in the review:

A visual indicator that more results are loading would be great.

I agree. The empty parens were too subtle. I'll try to get a better indicator drawn up within about a week.

The cog gear for the setting menu got updated, a reason for that?

I want all four icons to have roughly equal visual weight. In the old toolbar, the cog looked darker than everything else. It's very noticeable when you blur it.

It's particularly galling, because that Settings button is not the most important icon in the toolbar, so it should not be the most prominent.

GuillaumeGomez · 2025-08-08T15:45:38Z

Makes sense. 👍

Please ping me once the changes have been done for the next review round.

notriddle · 2025-08-08T18:47:44Z

output.mp4

notriddle removed the A-testsuite, T-infra, and A-CI labels · Jul 26, 2025

notriddle force-pushed the notriddle/stringdex branch from 3f21c90 to a3603c7 · July 28, 2025 16:44

rustbot added the A-CI, A-testsuite, and T-infra labels · Jul 28, 2025

notriddle removed the A-testsuite, T-infra, and A-CI labels · Jul 28, 2025

notriddle force-pushed the notriddle/stringdex branch from a3603c7 to 278838f · July 28, 2025 17:09

rustbot added the A-CI, A-testsuite, and T-infra labels · Jul 28, 2025

notriddle force-pushed the notriddle/stringdex branch from 278838f to 96b5862 · July 28, 2025 17:13

notriddle force-pushed the notriddle/stringdex branch from 96b5862 to 799c605 · July 28, 2025 17:31

notriddle force-pushed the notriddle/stringdex branch from 799c605 to 43cb8d0 · July 28, 2025 17:44

notriddle force-pushed the notriddle/stringdex branch from 43cb8d0 to 73790db · July 28, 2025 18:43

rustbot added the A-CI, A-run-make, A-testsuite, and T-infra labels · Aug 7, 2025

notriddle removed the A-testsuite, T-infra, A-CI, and A-run-make labels · Aug 7, 2025

rustbot added the A-CI, A-run-make, A-testsuite, and T-infra labels · Aug 8, 2025

notriddle force-pushed the notriddle/stringdex branch 2 times, most recently from 0a59952 to c4b80e9 · August 8, 2025 16:49

Fix nitpicks and a random perf problem that I found

5c17945

notriddle force-pushed the notriddle/stringdex branch from c4b80e9 to 5c17945 · August 8, 2025 17:05

Loading throbber

4ab3c31
