Introduce lateInteractionScore Function #18727

mingshl · 2025-07-11T02:53:34Z

Description

lateInteractionScore Function Summary

The lateInteractionScore function is a specialized vector similarity function for Painless scripting in OpenSearch that enables ColBERT-style retrieval and reranking.

Sample interface

    GET my_test_index/_search
    {
      "query": {
        "script_score": {
          "query": {
            "match_all": {}
          },
          "script": {
            "source": "lateInteractionScore(params.query_vector, 'my_vector', params._source)",
            "params": {
              "query_vector": [
                [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
                [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
              ]
            }
          }
        }
      },
      "size": 10
    }

Function Signature

double lateInteractionScore(List<List<Double>> queryVectors, String docFieldName, Map<String, Object> doc)

Description

This function calculates the maximum similarity between query vectors and document vectors using dot product. For each query vector, it finds the document vector with the highest dot product similarity and sums these maximum scores to produce a final similarity score.

Key Features

Supports multi-vector representations (arrays of vectors) for both queries and documents
Handles dimension mismatches gracefully by returning 0 for incompatible vectors
Returns 0 for null or empty vectors
Optimized for ColBERT-style late interaction between query and document vectors
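
To make the scoring rule concrete, here is a minimal standalone sketch of the computation described above: for each query vector, take the maximum dot product over the document vectors, then sum those maxima. It is not the code from this PR. The class name `LateInteractionSketch` and method `maxSimScore` are hypothetical, it takes the document vectors directly rather than reading them from `_source`, and it assumes that a pair of vectors with mismatched dimensions is skipped, which is one possible reading of "returning 0 for incompatible vectors".

```java
import java.util.List;

public class LateInteractionSketch {

    /** Plain dot product; the caller guarantees both lists have the same length. */
    public static double dot(List<Double> a, List<Double> b) {
        double sum = 0.0;
        for (int i = 0; i < a.size(); i++) {
            sum += a.get(i) * b.get(i);
        }
        return sum;
    }

    /**
     * Sums, over all query vectors, the maximum dot product against any document vector.
     * Assumptions (not confirmed by the PR): null or empty input yields 0, and a
     * query/document pair with mismatched dimensions is skipped.
     */
    public static double maxSimScore(List<List<Double>> queryVectors, List<List<Double>> docVectors) {
        if (queryVectors == null || docVectors == null || queryVectors.isEmpty() || docVectors.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (List<Double> q : queryVectors) {
            double best = Double.NEGATIVE_INFINITY;
            for (List<Double> d : docVectors) {
                if (q == null || d == null || q.isEmpty() || q.size() != d.size()) {
                    continue; // incompatible pair contributes nothing
                }
                best = Math.max(best, dot(q, d));
            }
            // Only add a maximum if at least one compatible document vector was found.
            if (best != Double.NEGATIVE_INFINITY) {
                total += best;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Double>> query = List.of(List.of(1.0, 0.0), List.of(0.0, 1.0));
        List<List<Double>> doc = List.of(List.of(0.5, 0.5), List.of(0.0, 2.0));
        // First query vector: max(0.5, 0.0) = 0.5; second: max(0.5, 2.0) = 2.0.
        System.out.println(maxSimScore(query, doc)); // prints 2.5
    }
}
```

Note that a sum of dot products can be negative for some inputs, so a real implementation would also need to decide how to map the result to a valid OpenSearch score.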

Please see the YAML test for sample usage of this new function (named maxSimDotProduct in an earlier revision of this PR).

Related Issues

Addresses the rerank requirements from #18091.

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2025-07-11T03:02:46Z

❌ Gradle check result for 454ff67: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-07-11T03:52:39Z

❌ Gradle check result for 10d0ee6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

navneet1v · 2025-07-11T04:32:24Z

@mingshl why are we using float arrays as vectors when there is a specific vector data type in OpenSearch provided via the k-NN plugin? Should we not enable this via the k-NN plugin?

vigyasharma · 2025-07-11T21:36:24Z

I have similar questions. This doesn't seem to use the underlying Lucene implementation for LateInteractionField or the mechanisms for reranking (rescorer or RescoreTopNQuery).

Some concerns off the top of my head would be:

1. Painless scripting may be disabled on some production deployments (at least it used to be some time back). Would they not be able to use this change?
2. Similarly, if production setups are not replicating _source fields, will they miss out on this functionality? The Lucene implementation uses BinaryDocValues.
3. We would lose out on upstream Lucene improvements and changes that get built on top of this feature. This essentially becomes a fork, with OpenSearch having its own implementation for late interaction models.
   a. For example, if Lucene leverages the same field to also index these values into the ANN graph, would OpenSearch need its own implementation for the same?

That being said, I'm not super familiar with painless scripting, and maybe this is part-1 of subsequently planned changes? @mingshl It would help if you could elaborate on the rationale behind doing it this way.

navneet1v · 2025-07-14T04:16:02Z

+1 on this.

mingshl · 2025-07-15T17:10:55Z

@vigyasharma @navneet1v

1. Using Painless scripts is a known concern in some production deployments, but it works in others. This feature would be helpful for production environments that already use script scores.
2. Right, with this approach we do need to use the _source field.
3. We will certainly adopt the new Lucene field, which is very exciting, when we upgrade to the new Lucene version. But it won't be available through OpenSearch soon; it's unlikely to be adopted this year. This rerank score function can serve as an interim solution until the new field is available.

mingshl · 2025-07-15T17:15:27Z

@navneet1v we could also use a knn field, but since the field is only used in rerank and not in search, it seems unnecessary to define it as a knn field in the index mapping.

navneet1v · 2025-07-15T18:24:54Z

I respectfully disagree on this point. If we do it this way, it will block future extension of the feature when we want to build late interaction on the k-NN field.

> 2. right, using this approach, we do need to use the _source field

If you look at script-score-based search in k-NN, it doesn't need _source. So if we use the k-NN field, the ranking can work with BDV (BinaryDocValues) or KNNVectorValues.

Another thing: looking at how you are calculating the score, it's not optimal since it is not using SIMD. The k-NN plugin already provides different scoring functions like L2, L1, Cosine, etc., which are all SIMD optimized. Hence this implementation will be slow. Also, converting the final vector scores to OpenSearch scores needs additional calculations, which I don't see in the code.

mingshl · 2025-07-15T22:35:41Z

The lateInteractionScore function can also support a nested knn field. I remember that when I tried to write the unit test, I could not write it in core because the knn field depends on the knn plugin, but I will test it offline.

@navneet1v that's a good callout on SIMD. I will try to implement a SIMD function to accelerate it, thanks!

github-actions · 2025-07-16T01:03:42Z

✅ Gradle check result for 6148e19: SUCCESS

codecov · 2025-07-16T01:04:02Z

Codecov Report

❌ Patch coverage is 72.22222% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.80%. Comparing base (f1825fd) to head (6148e19).
⚠️ Report is 157 commits behind head on main.

Files with missing lines	Patch %	Lines
...ch/painless/functions/PainlessVectorFunctions.java	70.58%	5 Missing and 5 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #18727      +/-   ##
============================================
- Coverage     72.83%   72.80%   -0.03%     
- Complexity    68511    68554      +43     
============================================
  Files          5572     5573       +1     
  Lines        314750   314786      +36     
  Branches      45680    45688       +8     
============================================
- Hits         229242   229177      -65     
- Misses        66898    67008     +110     
+ Partials      18610    18601       -9

navneet1v · 2025-07-18T07:42:17Z

@mingshl I am not sure I got my point across earlier, so let me give it one more try.

1. We should move the late interaction Painless script from working on arrays to working only on vector fields (the k-NN field), and it should live in the k-NN plugin. The k-NN vector field is more optimized for storing vectors (using LuceneKNNVectorFormat) and supports other data types too, like byte, fp16, binary, etc.
2. The k-NN plugin already supports different space_type values (L2, IP, Cosine, L1, etc.); reimplementing them here would be duplicate work.
3. The k-NN plugin also converts the distances between vectors to OpenSearch-compatible scores. That would again be duplicated here, adding maintenance and leading to divergence in score computation.
4. When we integrate with the Lucene late interaction feature (Support for Re-Ranking Queries using Late Interaction Model Multi-Vectors, apache/lucene#14729, developed by @vigyasharma), it will come via the k-NN plugin, which will be an enhancement of the feature in the current PR.
5. All these space types already support SIMD optimization, and the k-NN plugin keeps improving these functions, for example by adding support for SIMD-based distance computations on FP16 vectors.

I don't currently see any reason why this Painless script should work on array fields and be part of core.

mingshl · 2025-07-18T20:30:52Z

The main goal driving this script function is that we don't yet have the new field type, the multi-vector field type, in the knn plugin, and we don't want to duplicate the effort of @vigyasharma's change in Lucene. While we wait for the Lucene upgrade in core, we would like to develop an interim solution.

I also love the idea from @vigyasharma that we can develop a new query type, called a ColBERT query or late interaction query, that would do the rerank work for a given multi-vector. Until the new multi-vector type is ready, we can use the script score function under the hood, and later smoothly adopt the new field type in the new query. This way, we keep the API and interface the same for users.

If the best place to host the new query type is the knn plugin, I can move the PR there too.

vigyasharma · 2025-07-18T22:33:24Z

Thanks @mingshl .

Right! I understand the desire to add some support early, rather than wait for Lucene 10.3 release. So designing it such that user interface remains the same, and we can "unfork" to Lucene's implementation under the hood will ensure users don't have to make changes in their applications. Creating an OpenSearch field and query type, that can later be changed to invoke Lucene's LateInteractionField and query could be a viable approach (happy to hear other ideas).

It's true that users will have to reindex their data to migrate it from the field we create now, to the field Lucene is using. But upgrade and reindex are better understood operations that most production systems build mechanisms for (even if they don't upgrade right after the release is out).

opensearch-trigger-bot · 2025-08-19T15:22:50Z

This PR is stalled because it has been open for 30 days with no activity.

vigyasharma · 2025-10-09T18:17:35Z

Proposal for rescoring with late interaction models using Lucene's LateInteractionField – opensearch-project/k-NN#2934

mingshl requested review from a team, Bukhtawar, CEHENKLE, Rishikesh1159, VachaShah, anasalkouz, andrross, ashking94, dbwiddis, gbbafna, jed326, kotwanikunal, mch2, msfroh, owaiskazi19, reta, sachinpkale, saratvemulapalli, shwetathareja and sohami as code owners July 11, 2025 02:53

mingshl mentioned this pull request Jul 11, 2025

[Feature Request] Support for multi-stage retriever and re-ranker in OpenSearch to use late interaction embedding models like ColBert, ColPali etc. #18091

Open

mingshl changed the title ~~Introduce maxSimDotProduct Function~~ Introduce lateInteractionScore Function Jul 11, 2025

mingshl force-pushed the main-max-sim-rerank branch from 454ff67 to 302ad4c Compare July 11, 2025 03:39

introduce lateInterationScore

10d0ee6

Signed-off-by: Mingshi Liu <[email protected]>

mingshl force-pushed the main-max-sim-rerank branch from 302ad4c to 10d0ee6 Compare July 11, 2025 03:41

apply SIMD

6148e19

Signed-off-by: Mingshi Liu <[email protected]>

mingshl mentioned this pull request Jul 18, 2025

[Feature Request] Upgrade to Lucene 10.3 (forward-tracking) #18638

Closed

mingshl mentioned this pull request Jul 20, 2025

[FEATURE][RFC] Introduce MultiVector Field Type For Late-interaction Score opensearch-project/k-NN#2706

Open

opensearch-trigger-bot bot added the stalled Issues that have stalled label Aug 19, 2025

mingshl mentioned this pull request Sep 29, 2025

Add lateInteractionFunction opensearch-project/k-NN#2909

Merged

5 tasks

opensearch-trigger-bot bot removed the stalled Issues that have stalled label Oct 11, 2025
