-
Notifications
You must be signed in to change notification settings - Fork 13.6k
rustdoc-search: search backend with partitioned suffix tree #144476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This comment has been minimized.
This comment has been minimized.
3f21c90
to
a3603c7
Compare
This comment has been minimized.
This comment has been minimized.
a3603c7
to
278838f
Compare
278838f
to
96b5862
Compare
This comment has been minimized.
This comment has been minimized.
96b5862
to
799c605
Compare
This comment has been minimized.
This comment has been minimized.
799c605
to
43cb8d0
Compare
This comment has been minimized.
This comment has been minimized.
43cb8d0
to
73790db
Compare
I didn't look at the code yet, just testing the output. As I already said, I love the new GUI, so big 👍 from me on it. I noted a few things while testing the new output (not blockers, but definitely improvements for potential future PRs):
Now for the feeling about the search itself: it is MUCH better. No more blocking UI or anything, just working. It's definitely great. |
For the review, another thing that's not great is that UI and search index are the same commit (the first one). Not sure if intended or not, but it makes the review quite awful to go through. ^^' |
@@ -830,7 +880,7 @@ function createQueryElement(query, parserState, name, generics, isInGenerics) { | |||
*/ | |||
function makePrimitiveElement(name, extra) { | |||
return Object.assign({ | |||
name, | |||
name: name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed, in js, if variable and field have the same name, you can just write the variable name and it'll create a key with the variable name, taking the variable value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
I went through the code and didn't see anything standing out but I feel like I missed most of it and definitely can't picture the whole thing. I think the PR is fine, minus the various comments I made, tests are there to enforce it and the demo works super nicely. |
Makes sense. 👍 Please ping me once the changes have been done for next review round. |
This comment has been minimized.
This comment has been minimized.
0a59952
to
c4b80e9
Compare
This comment has been minimized.
This comment has been minimized.
c4b80e9
to
5c17945
Compare
output.mp4 |
Before:
After:
Summary
Rewrites the rustdoc search engine to use an indexed data structure, factored out as a crate called stringdex, that allows it to perform modified-levenshtein distance calculations, substring matches, and prefix matches in a reasonably efficient, and, more importantly, incremental algorithm.
Motivation
Fixes #131156
While the windows-rs crate is definitely the worst offender, I've noticed performance problems with the compiler crates as well. It makes no sense for rustdoc-search to have poor performance: it's basically a spell checker, and those have been usable since the 90's.
Stringdex is particularly designed to quickly return exact matches, to always report those matches first, and to never, ever place new matches on top of old ones. It also tries to yield to the event loop occasionally as it runs. This way, you can click the exactly-matched result before the rest of the search finishes running.
Explanation
A longer description of how name search works can be found in stringdex's HACKING.md file.
Type search is done by performing a name search on each element, then performing bitmap operations to narrow down a list of potential matches, then performing type unification on each match.
Drawbacks
It's rather complex, and takes up more disk space than the current flat list of strings.
Rationale and alternatives
Instead of a suffix tree, I could've used a different approximate matching data structure. I didn't do that because I wanted to keep the current behavior (for example, a straightforward trigram index won't match oepn like the current system does).
Prior art
Sherlodoc is based on a similar concept, but they:
Future possibilities
Faster type-driven search
This is the most noticeable problem, when I test it. Type-driven search feels slow, because it has to buffer all of the results before it can show any of them. We don't have a fancy trick algorithm like the backlog heap in prefix search, and we should,
To fix this, we might need to ditch pure bitmaps in favor of some other IR data structure to store the inverted index. It's better now than it used to be, at least in terms of memory management, but we need a better incremental search algo here.
Low-level optimization in stringdex
There are half a dozen low-level optimizations that I still need to explore. I haven't done them yet, because I've been working on bug fixes and rebasing on rustdoc's side, and a more solid and diverse test suite for stringdex itself.
Improved recall in type-driven search
Right now, type-driven search performs very strict matching. It's very precise, but misses a lot of things people would want.
What I'm not sure about is whether to focus more on edit-distance-based approaches, or to focus on type-theoretical approaches. Both gives avenues to improve, but edit distance is going to be faster while type checking is going to be more precise.
For example, a type theoretical improvement would fix
Iterator<T>, (T -> U) -> Iterator<U>
to giveIterator::map
, because it would recognize that the Map struct implements the Iterator trait. I don't know of any clean way to get this result to work without implementing significant type checking logic in search.js, and an edit-distance-based "dirty" approach would likely give a bunch of other results on top of this one.Full-text search
Once you've got this fuzzy dictionary matching to work, the logical next step is to implement some kind of information retrieval-based approach to phrase matching.
Like applying edit distance to types, phrase search gets you significantly better recall, but with a few major drawbacks:
Neither of these problems are deal-breakers, but they're worth keeping in mind.