@mingshl
Contributor

@mingshl mingshl commented Jul 11, 2025

Description

lateInteractionScore Function Summary

The lateInteractionScore function is a specialized vector similarity function for Painless scripting in OpenSearch that enables ColBERT-style retrieval and reranking.

Sample interface

    GET my_test_index/_search
    {
      "query": {
        "script_score": {
          "query": {
            "match_all": {}
          },
          "script": {
            "source": "lateInteractionScore(params.query_vector, 'my_vector', params._source)",
            "params": {
              "query_vector": [
                [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
                [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
              ]
            }
          }
        }
      },
      "size": 10
    }

Function Signature

double lateInteractionScore(List<List<Double>> queryVectors, String docFieldName, Map<String, Object> doc)

Description

This function calculates the maximum similarity between query vectors and document vectors using dot product. For each query vector, it finds the document vector with the highest dot product similarity and sums these maximum scores to produce a final similarity score.
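
The computation described above can be sketched in plain Java. This is a minimal illustration, not the code from this PR: the class name `LateInteractionSketch` is hypothetical, the document vectors are passed in directly instead of being looked up by field name in `_source`, and the choice to return 0 for the whole score on any dimension mismatch is an assumption about what "returning 0 for incompatible vectors" means.

```java
import java.util.List;

/** Illustrative sketch only; names and edge-case behavior are assumptions. */
final class LateInteractionSketch {

    /** Dot product of two vectors of equal length. */
    static double dot(List<Double> a, List<Double> b) {
        double sum = 0.0;
        for (int i = 0; i < a.size(); i++) {
            sum += a.get(i) * b.get(i);
        }
        return sum;
    }

    /**
     * For each query vector, take the maximum dot product over all document
     * vectors, then sum those maxima.
     */
    static double lateInteractionScore(List<List<Double>> queryVectors, List<List<Double>> docVectors) {
        // Assumed behavior: null or empty input scores 0.
        if (queryVectors == null || docVectors == null || queryVectors.isEmpty() || docVectors.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (List<Double> q : queryVectors) {
            double best = Double.NEGATIVE_INFINITY;
            for (List<Double> d : docVectors) {
                // Assumed behavior: any dimension mismatch makes the whole score 0.
                if (q == null || d == null || q.size() != d.size()) {
                    return 0.0;
                }
                best = Math.max(best, dot(q, d));
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Double>> query = List.of(List.of(1.0, 0.0), List.of(0.0, 1.0));
        List<List<Double>> doc = List.of(List.of(0.5, 0.5), List.of(1.0, 0.0));
        // Maxima are 1.0 and 0.5, so this prints 1.5.
        System.out.println(lateInteractionScore(query, doc));
    }
}
```

Note that the cost is proportional to (number of query vectors) x (number of document vectors) x (dimension) per document, which is why this is intended for reranking rather than first-stage retrieval.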

Key Features

  • Supports multi-vector representations (arrays of vectors) for both queries and documents
  • Handles dimension mismatches gracefully by returning 0 for incompatible vectors
  • Returns 0 for null or empty vectors
  • Optimized for ColBERT-style late interaction between query and document vectors

Please see the YAML test for sample usage of the new lateInteractionScore function.

Related Issues

Addresses the rerank requirements from #18091.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

❌ Gradle check result for 454ff67: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@mingshl mingshl changed the title Introduce maxSimDotProduct Function Introduce lateInteractionScore Function Jul 11, 2025
@mingshl mingshl force-pushed the main-max-sim-rerank branch from 454ff67 to 302ad4c Compare July 11, 2025 03:39
@mingshl mingshl force-pushed the main-max-sim-rerank branch from 302ad4c to 10d0ee6 Compare July 11, 2025 03:41
@github-actions
Contributor

❌ Gradle check result for 10d0ee6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@navneet1v
Contributor

@mingshl Why are we using float arrays as vectors when OpenSearch already has a dedicated vector data type provided by the k-NN plugin? Should we not enable this via the k-NN plugin?

@vigyasharma
Contributor

> Why are we using float arrays as vectors when OpenSearch already has a dedicated vector data type provided by the k-NN plugin? Should we not enable this via the k-NN plugin?

I have similar questions. This doesn't seem to use the underlying Lucene implementation for LateInteractionField or the mechanisms for reranking (rescorer or RescoreTopNQuery).

Some concerns off the top of my head would be:

  1. Painless scripting may be disabled on some production deployments (at least it used to be some time back). Would they not be able to use this change?
  2. Similarly, if production setups are not replicating _source fields, will they miss out on this functionality? The Lucene implementation uses BinaryDocValues.
  3. We would lose out on upstream Lucene improvements and changes that get built on top of this feature. It essentially becomes a fork, with OpenSearch having its own implementation for late interaction models.
    a. For example, if Lucene leverages the same field to also index these values into the ANN graph, would OpenSearch need its own implementation for that as well?

That being said, I'm not super familiar with Painless scripting, and maybe this is part 1 of subsequently planned changes? @mingshl It would help if you could elaborate on the rationale behind doing it this way.

@navneet1v
Contributor

+1 on this.

@mingshl
Contributor Author

mingshl commented Jul 15, 2025

@vigyasharma @navneet1v

  1. Using Painless scripts is a known concern in some production deployments, but it works in others. This feature would help production environments that already use script scores.
  2. Right, with this approach we do need the _source field.
  3. We will certainly adopt the new Lucene field, which is very exciting, when we upgrade to the new Lucene version. But it won't be available through OpenSearch soon; adoption is unlikely this year. This rerank score function can serve as an interim solution until the new field is available.

@mingshl
Contributor Author

mingshl commented Jul 15, 2025

> @mingshl Why are we using float arrays as vectors when OpenSearch already has a dedicated vector data type provided by the k-NN plugin? Should we not enable this via the k-NN plugin?

@navneet1v We could also use a k-NN field, but since the field is used only for reranking and not for search, it does not seem necessary to define it as a k-NN field in the index mapping.

@navneet1v
Contributor

navneet1v commented Jul 15, 2025

> @navneet1v We could also use a k-NN field, but since the field is used only for reranking and not for search, it does not seem necessary to define it as a k-NN field in the index mapping.

I respectfully disagree on this point. Doing it this way will block future extension of the feature when we want to build late interaction on the k-NN field.

> 2. Right, with this approach we do need the _source field.

Script-score-based search in k-NN doesn't need _source. So if we use the k-NN field, the ranking can work with BDV (BinaryDocValues) or KNNVectorValues.

Also, the way you are calculating the score is not optimal, since it does not use SIMD. The k-NN plugin already provides scoring functions such as L2, L1, and cosine, all of which are SIMD-optimized. Hence this implementation will be slow. In addition, converting the final vector scores to OpenSearch scores needs additional calculations, which I don't see in the code.
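
To illustrate the score-conversion point: a raw sum of maximum dot products can be negative, while script_score, as far as I know, requires non-negative scores. The following is a minimal sketch of one order-preserving mapping. The class and method names are hypothetical, and the formula follows, to the best of my knowledge, the scaling Lucene applies for maximum inner product; it is not taken from this PR's code or from the k-NN plugin source.

```java
/** Illustrative sketch only; not the conversion used by this PR or by the k-NN plugin. */
final class ScoreScalingSketch {

    /**
     * Maps a raw similarity (which may be negative) to a strictly positive score
     * while preserving ordering: negatives land in (0, 1), non-negatives in [1, inf).
     */
    static double toNonNegativeScore(double raw) {
        return raw < 0 ? 1.0 / (1.0 - raw) : raw + 1.0;
    }

    public static void main(String[] args) {
        System.out.println(toNonNegativeScore(1.5));  // prints 2.5
        System.out.println(toNonNegativeScore(-1.0)); // prints 0.5
    }
}
```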

@mingshl
Contributor Author

mingshl commented Jul 15, 2025

The lateInteractionScore function can also support a nested k-NN field. When I tried to write the unit test, I found that because the k-NN field depends on the k-NN plugin, I cannot write the test in core, but I will test it offline.

@navneet1v That's a good callout on SIMD. I will try to implement a SIMD function to accelerate it. Thanks!

Signed-off-by: Mingshi Liu <[email protected]>
@github-actions
Contributor

✅ Gradle check result for 6148e19: SUCCESS

@codecov

codecov bot commented Jul 16, 2025

Codecov Report

❌ Patch coverage is 72.22222% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.80%. Comparing base (f1825fd) to head (6148e19).
⚠️ Report is 157 commits behind head on main.

Files with missing lines Patch % Lines
...ch/painless/functions/PainlessVectorFunctions.java 70.58% 5 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18727      +/-   ##
============================================
- Coverage     72.83%   72.80%   -0.03%     
- Complexity    68511    68554      +43     
============================================
  Files          5572     5573       +1     
  Lines        314750   314786      +36     
  Branches      45680    45688       +8     
============================================
- Hits         229242   229177      -65     
- Misses        66898    67008     +110     
+ Partials      18610    18601       -9     

@navneet1v
Contributor

navneet1v commented Jul 18, 2025

@mingshl I am not sure I got my point across earlier, so let me give it one more try.

  1. We should move the late interaction Painless script from working on arrays to working only on vector fields (the k-NN field), and do that in the k-NN plugin. The k-NN vector field is better optimized for storing vectors (using LuceneKNNVectorFormat) and also supports other data types such as byte, fp16, and binary.
  2. The k-NN plugin already supports different space_type values (L2, IP, Cosine, L1, etc.), so reimplementing them here is duplicate work.
  3. The k-NN plugin also converts distances between vectors to OpenSearch-compatible scores. That would again be duplicated here, adding maintenance and leading to divergence in score computation.
  4. When we integrate the Lucene late interaction feature (Support for Re-Ranking Queries using Late Interaction Model Multi-Vectors, apache/lucene#14729, developed by @vigyasharma), it will come via the k-NN plugin, as an enhancement of the feature in the current PR.
  5. All these space types already support SIMD optimization, and the k-NN plugin keeps improving these functions, for example by adding support for SIMD-based distance computations on FP16 vectors.

I currently don't see any reason why this Painless script should work on an array field and be part of core.

@mingshl
Contributor Author

mingshl commented Jul 18, 2025

The main motivation for this script function is that we don't yet have the new field type (the multi-vector field type) in the k-NN plugin, and we don't want to duplicate the effort of @vigyasharma's change in Lucene. While we wait for the Lucene upgrade in core, we would like to develop an interim solution.

I also love the idea from @vigyasharma that we can develop a new query type, called a ColBERT query or late interaction query, that does the rerank work for a given multi-vector. Until the new multi-vector type is ready, the backend can use the script score function, and later we can smoothly adopt the new field type in the new query. This way, the API and interface stay the same for users.

If the best place to host the new query type is k-NN, I can move the PR to the k-NN plugin.

@vigyasharma
Contributor

Thanks @mingshl.

Right! I understand the desire to add some support early rather than wait for the Lucene 10.3 release. Designing it so that the user interface remains the same, and we can "unfork" to Lucene's implementation under the hood, will ensure users don't have to make changes in their applications. Creating an OpenSearch field and query type that can later be changed to invoke Lucene's LateInteractionField and query could be a viable approach (happy to hear other ideas).

It's true that users will have to reindex their data to migrate it from the field we create now, to the field Lucene is using. But upgrade and reindex are better understood operations that most production systems build mechanisms for (even if they don't upgrade right after the release is out).

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Aug 19, 2025
@vigyasharma
Contributor

Proposal for rescoring with late interaction models using Lucene's LateInteractionField: opensearch-project/k-NN#2934

@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Oct 11, 2025