Improve search results #716

axlewin · 2025-08-14T12:56:15Z

Improves the behaviour of the search endpoint, particularly for finding book questions on sci.

Some users use the sitewide search to search for book questions by section number (e.g. "Pre-Uni Maths for Sciences A1.2", or more commonly just "A1" or "A1.2"), as contained in the question's subtitle. This does not currently return the correct question(s). The aim of these changes is primarily to make searches of this type work.

This PR adjusts some weights and reduces how generously we match on words in the content, raising the relative priority of matches on fields such as subtitles. There are now separate high-priority searches for exact whole-string matches on id/title/subtitle, in addition to the existing fuzzy matches on the tokenised search string. The trade-off is slightly worse matching when searching for words or phrases in a page's content.
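As a rough illustration of the two tiers described above, here is a plain-Java scoring sketch. It is not the code in this PR and does not call Elasticsearch: the `Doc` class, the boost constants and the whitespace tokeniser are assumptions made for the example, and exact token equality stands in for the real fuzzy matching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class TieredSearchSketch {
    /** Hypothetical stand-in for a content object; field names follow the PR text. */
    public static final class Doc {
        public final String id, title, subtitle, content;
        public Doc(String id, String title, String subtitle, String content) {
            this.id = id; this.title = title; this.subtitle = subtitle; this.content = content;
        }
    }

    // Illustrative boosts only; the real weights are not stated in the PR description.
    static final double EXACT_BOOST = 10.0;
    static final double FIELD_TOKEN_BOOST = 3.0;
    static final double CONTENT_TOKEN_BOOST = 1.0;

    static String norm(String s) {
        return s == null ? "" : s.trim().toLowerCase(Locale.ROOT);
    }

    static List<String> tokens(String s) {
        String n = norm(s);
        if (n.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(n.split("\\s+"));
    }

    public static double score(Doc d, String query) {
        String q = norm(query);
        if (q.isEmpty()) {
            return 0.0;
        }
        double score = 0.0;
        // Tier 1: the whole search string equals the whole id, title or subtitle.
        for (String field : Arrays.asList(d.id, d.title, d.subtitle)) {
            if (norm(field).equals(q)) {
                score += EXACT_BOOST;
            }
        }
        // Tier 2: per-token matches, with title/subtitle outweighing page content.
        for (String t : tokens(q)) {
            if (tokens(d.title).contains(t) || tokens(d.subtitle).contains(t)) {
                score += FIELD_TOKEN_BOOST;
            }
            if (tokens(d.content).contains(t)) {
                score += CONTENT_TOKEN_BOOST;
            }
        }
        return score;
    }

    /** Returns the documents with a positive score, highest score first. */
    public static List<Doc> search(List<Doc> docs, String query) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (score(d, query) > 0) {
                hits.add(d);
            }
        }
        hits.sort(Comparator.comparingDouble((Doc d) -> score(d, query)).reversed());
        return hits;
    }
}
```

With these assumed weights, a query such as "A1.2" earns more from a subtitle token than from a mention in the page content, and a query equal to the full subtitle earns the exact-match boost on top.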

As part of the above, subtitle is now included in the list of raw fields for Elasticsearch. This solves a problem affecting questions from the Pre-Uni Physics book, where near-matches on titles are prioritised over exact matches on subtitles (try "Essential Pre-Uni Physics E1.1" before and after running ETL). This works well but changes how content is indexed; since it is only needed for results from this specific book, it may make sense to revert this change.
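To show why a raw copy of the subtitle helps, the sketch below contrasts token overlap on an analysed field with a whole-value comparison on a raw field. It is a simplification under stated assumptions: whitespace splitting stands in for the real analyser, the case-insensitive comparison is a choice made for the example, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class RawFieldSketch {
    /** Simplified stand-in for an analysed field: lower-cased whitespace tokens. */
    public static Set<String> analysed(String value) {
        return new HashSet<>(Arrays.asList(value.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    /** Number of distinct query tokens that also occur in the analysed field. */
    public static int tokenOverlap(String fieldValue, String query) {
        Set<String> shared = analysed(query);
        shared.retainAll(analysed(fieldValue));
        return shared.size();
    }

    /** Stand-in for a raw field: the whole value is compared as a single term. */
    public static boolean rawMatch(String fieldValue, String query) {
        return fieldValue.trim().equalsIgnoreCase(query.trim());
    }
}
```

For the query "Essential Pre-Uni Physics E1.1", a title of "Essential Pre-Uni Physics" shares three of the four tokens, so token overlap alone makes it a strong near-match. Only the raw comparison gives an unambiguous signal that a subtitle of "Essential Pre-Uni Physics E1.1" is the exact string searched for.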

Switching the QF searches to use fuzzy instead of substring matches would improve matching for book questions on the QF, but since the substring match method was introduced specifically for the QF I haven't changed it.
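The difference between the two matching styles can be sketched as follows. This is an illustration only, not the QF implementation: substring matching is modelled with `String.contains`, fuzzy matching with a per-token Levenshtein distance, and the `maxEdits` parameter and all names are assumptions.

```java
import java.util.Locale;

public class MatchModesSketch {
    /** Substring match: the query must occur verbatim (ignoring case) in the field. */
    public static boolean substringMatch(String field, String query) {
        return field.toLowerCase(Locale.ROOT).contains(query.toLowerCase(Locale.ROOT));
    }

    /** Levenshtein distance between two strings, using two rolling rows. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Fuzzy match: every query token is within maxEdits of some field token. */
    public static boolean fuzzyMatch(String field, String query, int maxEdits) {
        String[] fieldTokens = field.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (String q : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            boolean found = false;
            for (String f : fieldTokens) {
                if (editDistance(f, q) <= maxEdits) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```

In this model a misspelt query such as "Physcs E1.1" fails the substring test against "Essential Pre-Uni Physics E1.1" but passes the fuzzy test with one edit allowed. Conversely, the partial query "A1" is a substring of "Pre-Uni Maths for Sciences A1.2" but is not within one edit of any of its tokens, so neither method strictly dominates the other.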

A side-effect of these changes is that searching by full URL now finds the correct content object, although this isn't a common use case.

Searches that rely on stemming (e.g. "Lagrange" to match "Lagrangian points") still don't work, but this is not a regression: they don't work in the current implementation either.

For Ada, these changes should have minimal impact since Ada's search already works well. Most Ada searches are for topics rather than specific questions, so we should make sure that keyword matching on searchable content still works as well as it did before. Similarly, non-(book-)question pages on sci should still match reliably.

There is a test script which can be run against a local API to check some test cases for the behaviour described above.
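The kind of check such a script performs might look like the sketch below. It is not the script from this PR: `SearchExpectationSketch`, `failures`, and the idea of passing the search call in as a `Function` are all hypothetical, and the content ids in the usage example are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SearchExpectationSketch {
    /**
     * Returns the queries whose expected content id is missing from the
     * first topN result ids returned by searchFn.
     */
    public static List<String> failures(Map<String, String> expectedIdByQuery,
                                        Function<String, List<String>> searchFn,
                                        int topN) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, String> e : expectedIdByQuery.entrySet()) {
            List<String> ids = searchFn.apply(e.getKey());
            List<String> top = ids.subList(0, Math.min(topN, ids.size()));
            if (!top.contains(e.getValue())) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }
}
```

In use, `searchFn` would wrap a request to the local search endpoint and return result ids in ranked order; here it is left abstract so the check can also be exercised with a stub.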

codecov · 2025-08-14T12:59:28Z

Codecov Report

❌ Patch coverage is 92.10526% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.61%. Comparing base (442b150) to head (2891e46).
⚠️ Report is 4 commits behind head on master.

Files with missing lines	Patch %	Lines
...am/cl/dtg/segue/dao/content/GitContentManager.java	95.65%	0 Missing and 1 partial ⚠️
.../ac/cam/cl/dtg/segue/etl/ElasticSearchIndexer.java	0.00%	1 Missing ⚠️
...tg/segue/search/IsaacSearchInstructionBuilder.java	92.85%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #716      +/-   ##
==========================================
+ Coverage   36.53%   36.61%   +0.08%     
==========================================
  Files         536      536              
  Lines       23689    23716      +27     
  Branches     2857     2861       +4     
==========================================
+ Hits         8655     8684      +29     
+ Misses      14175    14170       -5     
- Partials      859      862       +3


axlewin added 3 commits August 13, 2025 12:15

Don't use wildcard instructions for searchable content

45904fb

Require stricter matching for searching by id

25c0314

Restore slight boost to searchable content

e210166

axlewin added 6 commits August 15, 2025 17:26

Prioritise exact matches on id/title/subtitle

b810ce6

Prioritise exact matches for question finder search

93ce11e

Add null check for QF search string

06180ad

Change access modifier

2b3da1c

Fix indentation

39780ad

Fix indentation

2891e46

axlewin marked this pull request as ready for review August 19, 2025 15:34
