@axlewin axlewin commented Aug 14, 2025

Improves the behaviour of the search endpoint, particularly for finding book questions on sci.

Some users use the sitewide search to search for book questions by section number (e.g. "Pre-Uni Maths for Sciences A1.2", or more commonly just "A1" or "A1.2"), as contained in the question's subtitle. This does not currently return the correct question(s). The aim of these changes is primarily to make searches of this type work.

This PR adjusts some weights and reduces how generously we match on words in the content, raising the relative priority of matches on fields such as subtitles. There are now separate high-priority searches for exact whole-string matches on id/title/subtitle, in addition to the existing fuzzy matches on the tokenised search string. The trade-off is slightly worse matching when searching for words or phrases in a page's content.
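
As a rough illustration of the ranking described above, here is a minimal, self-contained sketch in plain Java. It is not the code in `IsaacSearchInstructionBuilder`: the class name, the weights, the whitespace tokeniser and the edit-distance threshold of 1 are all assumptions made for the example. It only shows why an exact whole-string match on id/title/subtitle should outrank loose token matches in the content.

```java
import java.util.Locale;

public class SearchPrioritySketch {
    // Hypothetical weights, chosen only for illustration.
    static final int EXACT_FIELD_WEIGHT = 100;
    static final int FUZZY_FIELD_WEIGHT = 5;
    static final int FUZZY_CONTENT_WEIGHT = 1;

    // Classic dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[b.length()];
    }

    // Simplified tokeniser: lower-case and split on whitespace (an assumption).
    static String[] tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Counts query tokens that have at least one field token within edit distance 1.
    static int fuzzyTokenMatches(String field, String query) {
        int matches = 0;
        for (String q : tokens(query)) {
            for (String f : tokens(field)) {
                if (levenshtein(q, f) <= 1) {
                    matches++;
                    break;
                }
            }
        }
        return matches;
    }

    // Exact match on the whole, untokenised string (case-insensitive).
    static boolean exactWholeString(String field, String query) {
        return field.trim().equalsIgnoreCase(query.trim());
    }

    static int score(String id, String title, String subtitle, String content, String query) {
        int score = 0;
        for (String field : new String[] {id, title, subtitle}) {
            if (exactWholeString(field, query)) {
                score += EXACT_FIELD_WEIGHT;
            }
            score += FUZZY_FIELD_WEIGHT * fuzzyTokenMatches(field, query);
        }
        score += FUZZY_CONTENT_WEIGHT * fuzzyTokenMatches(content, query);
        return score;
    }
}
```

With these made-up weights, a query of "A1.2" scores 105 for a question whose subtitle is exactly "A1.2" and 1 for a page that only mentions "A1.2" in its content.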

As part of the above, subtitle is now included in the list of raw fields for Elasticsearch. This solves a problem affecting questions from the Pre-Uni Physics book, where near-matches on titles are prioritised over exact matches on subtitles (try "Essential Pre-Uni Physics E1.1" before & after running ETL). The fix works well but changes how content is indexed; since it is only needed for results from this one book, it may make sense to revert it.
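
For context, Elasticsearch can index the same value twice through a multi-field mapping: once as analysed `text` and once as an un-analysed `keyword` sub-field, which is what makes exact whole-string matching possible. The sketch below builds such a mapping as plain Java maps. The sub-field name `raw`, the class name and the helper names are assumptions for illustration and may not match what `ElasticSearchIndexer` actually does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RawFieldMappingSketch {
    // Mapping for one analysed text field with an un-analysed keyword sub-field.
    // The sub-field name "raw" is an assumption for this example.
    static Map<String, Object> textFieldWithRaw() {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("type", "keyword");

        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("raw", raw);

        Map<String, Object> field = new LinkedHashMap<>();
        field.put("type", "text");
        field.put("fields", fields);
        return field;
    }

    // Builds the "properties" part of an index mapping for the given field names.
    static Map<String, Object> properties(String... fieldNames) {
        Map<String, Object> properties = new LinkedHashMap<>();
        for (String name : fieldNames) {
            properties.put(name, textFieldWithRaw());
        }
        return properties;
    }
}
```

Calling `RawFieldMappingSketch.properties("id", "title", "subtitle")` yields a structure in which an exact query could target `subtitle.raw` while fuzzy queries keep using the analysed `subtitle` field.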

Switching the QF searches to use fuzzy instead of substring matches would improve matching for book questions on the QF, but since the substring match method was introduced specifically for the QF I haven't changed it.

A side-effect of these changes is that searching by full URL now finds the correct content object, although this isn't a common use case.

Searches that rely on stemming (e.g. "Lagrange" to match "Lagrangian points") still don't work, but they don't in the current implementation either.
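
To make the stemming point concrete: a stemmer reduces related word forms to a shared root before matching, so "Lagrange" and "Lagrangian" could be indexed under the same term. The toy suffix-stripper below is only an illustration. The suffix list is hypothetical, it is not a real stemming algorithm, and nothing like it is part of this PR.

```java
import java.util.Locale;

public class ToyStemmerSketch {
    // Hypothetical suffix list, ordered so longer suffixes are tried first.
    static final String[] SUFFIXES = {"ian", "ing", "es", "e", "s"};
    // Never strip a suffix if fewer than this many characters would remain.
    static final int MIN_STEM_LENGTH = 3;

    // Lower-cases the word and strips the first matching suffix, if any.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (lower.endsWith(suffix) && lower.length() - suffix.length() >= MIN_STEM_LENGTH) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }

    // Two words match under stemming when their stems are identical.
    static boolean sameStem(String a, String b) {
        return stem(a).equals(stem(b));
    }
}
```

Here `sameStem("Lagrange", "Lagrangian")` is true because both reduce to `lagrang`, whereas a plain case-insensitive comparison of the two words fails.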

For Ada, these changes should have minimal impact since Ada's search already works well. Most Ada searches are for topics rather than specific questions, so we should make sure that keyword matching on searchable content still works as well as it did before. Similarly, non-(book-)question pages on sci should still match reliably.

There is a test script that runs a set of test cases against a local API to assert the above.
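
The checking logic of a script like this could look roughly like the sketch below. It is not the script from this PR: the class and method names are hypothetical, and the call to the local API is abstracted behind a `Function` so the example stays self-contained. Each test case maps a search string to the id expected among the first `topN` results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SearchRegressionSketch {
    // Returns the queries whose expected id is missing from the first topN results.
    // "search" stands in for a call to the local search endpoint (an assumption).
    static List<String> failingQueries(Map<String, String> expectedIdByQuery,
                                       Function<String, List<String>> search,
                                       int topN) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, String> testCase : expectedIdByQuery.entrySet()) {
            List<String> ids = search.apply(testCase.getKey());
            List<String> top = ids.subList(0, Math.min(topN, ids.size()));
            if (!top.contains(testCase.getValue())) {
                failures.add(testCase.getKey());
            }
        }
        return failures;
    }
}
```

An empty result means every case passed; a non-empty list names the searches that regressed, which makes it easy to compare behaviour before and after a weighting change.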

codecov bot commented Aug 14, 2025

Codecov Report

❌ Patch coverage is 92.10526% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 36.61%. Comparing base (442b150) to head (2891e46).
⚠️ Report is 4 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...am/cl/dtg/segue/dao/content/GitContentManager.java | 95.65% | 0 Missing and 1 partial ⚠️ |
| .../ac/cam/cl/dtg/segue/etl/ElasticSearchIndexer.java | 0.00% | 1 Missing ⚠️ |
| ...tg/segue/search/IsaacSearchInstructionBuilder.java | 92.85% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #716      +/-   ##
==========================================
+ Coverage   36.53%   36.61%   +0.08%     
==========================================
  Files         536      536              
  Lines       23689    23716      +27     
  Branches     2857     2861       +4     
==========================================
+ Hits         8655     8684      +29     
+ Misses      14175    14170       -5     
- Partials      859      862       +3     

@axlewin axlewin marked this pull request as ready for review August 19, 2025 15:34