Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.Net: WIP on LINQ-based criteria filtering #10273

Draft
wants to merge 8 commits into
base: feature-vector-data-preb1
Choose a base branch
from

Conversation

roji
Copy link
Member

@roji roji commented Jan 23, 2025

This is some draft work on LINQ-based criteria filtering (#10156). Basic equality, AND and OR are implemented for Qdrant, and tested via a specification test suite that should be usable more or less as-is across all other providers (#10194). There's a lot more work here and many possible improvements - am submitting in early draft stage to gather early feedback.

To see what this looks like, see these tests.

Note that this adds new projects outside of the existing "IntegrationTests" project; it's difficult to work when all provider tests are in the same project (any experimental breaking change in the abstraction must be propagated everywhere), and in any case, I think we want to end up in a place where each provider has its own separate test project.

/cc @westey-m

@markwallace-microsoft markwallace-microsoft added .NET Issue or Pull requests regarding .NET code kernel Issues or pull requests impacting the core kernel memory labels Jan 23, 2025
@github-actions github-actions bot changed the title WIP on LINQ-based criteria filtering .Net: WIP on LINQ-based criteria filtering Jan 23, 2025
var right = this.Visit(andAlso.Right);

// Qdrant doesn't allow arbitrary nesting of logical operators, only one MUST list (AND), one SHOULD list (OR), and one MUST_NOT list (AND NOT).
// We can combine MUST and MUST_NOT; but we can only combine SHOULD if it's the *only* thing on the one side (no MUST/MUST_NOT), and there's no SHOULD on the other (only MUST/MUST_NOT).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am not 100% sure about Qdrant's capabilities here, should confirm.

@@ -20,6 +20,6 @@ public interface IVectorizableTextSearch<TRecord>
/// <returns>The records found by the vector search, including their result scores.</returns>
Task<VectorSearchResults<TRecord>> VectorizableTextSearchAsync(
string searchText,
VectorSearchOptions? options = default,
VectorSearchOptions<TRecord>? options = default,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making VectorSearchOptions generic over TRecord isn't trivial... Modern C# users can use target-typed new, and in that case this doesn't interfere in the method calling. But people not doing target-typed new would have to explicitly specify the full type every time (even if they don't use filtering at all), which maybe isn't ideal. The alternative would be to move the LINQ filter to an optional parameter directly on the method (no strong feelings from my side here).

dotnet/SK-dotnet.sln Outdated Show resolved Hide resolved
where TKey : notnull
{
[Fact]
public virtual async Task Equality_with_int()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can keep PascalCase as naming convention for test methods like in other tests in a codebase - that would be great :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mainly anticipating all these moving out to the Microsoft.Extensions repo, where IIRC the convention is like this. But I can rename to match SK conventions and change later if you prefer.

@westey-m
Copy link
Contributor

@roji, a good scenario to also include in early exploration would be tag based operations, e.g. where there is an array of strings (tags) in a field in the database, and we want to check if one or more out of tags we provide match those.
We have received consistent feedback that this is a popular scenario for vector search filtering.

e.Int2 == a.Int2);
}

[Fact]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the issues I have been facing with abstract base class tests is that a few of the DBs don't perform well on the Build server for integration tests, so we typically run them manually. Mostly issues with failed startup and flakiness.
I generalized some work done for Pinecone that allows us to have attribute based test disabling, which allows us to have an attribute on the inheriting class that disables all tests in the inheriting and base class.
Check out VectorStoreFact, PineconeApiKeySetConditionAttribute and DisableVectorStoreTestsAttribute for how this works.
Passing a Skip reason to Fact doesn't work because it requires a static reason, and it would just disable the tests for all inheritors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I remember this... I'll take a look.

But just off the top of my head, in CI we could run tests only for those providers where there's a good container-baesd option (like Qdrant), and not run them for the others (splitting the test projects by provider could also help with that a bit). Locally, developers can simply explicitly choose what it is they want to run - though that would mean that running all tests in the solution would produce failures. If that's not acceptable, a conditional [Fact] attribute like what you mention above could be good, based on some environment variable or whatever.

In any case, my hope is that this could be orthogonal to the tests themselves being general across providers, as opposed to written per-provider... But we'll see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite like the idea of the test conditionality being based on whether the required secrets have been configured like in PineconeApiKeySetConditionAttribute, since it removes the need to change build settings in addition to setting the build secrets. Locally, I prefer skipped to failure, since failures need to be investigated, while the meaning of something being skipped is clear without further investigation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, will do that.

@roji
Copy link
Member Author

roji commented Jan 23, 2025

a good scenario to also include in early exploration would be tag based operations, e.g. where there is an array of strings (tags) in a field in the database, and we want to check if one or more out of tags we provide match those.
We have received consistent feedback that this is a popular scenario for vector search filtering.

@westey-m thanks - will do that, it should be easily representible in LINQ by doing b => b.Tags.Contains("foo"). Maybe let's talk tomorrow about how that's represented e.g. in Qdrant (just because I'm currently focusing on it).

@roji roji changed the base branch from main to feature-vector-data-preb1 January 24, 2025 18:18
* Also inline captured variables.
* Also improve tests to check for none/all results.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kernel Issues or pull requests impacting the core kernel memory .NET Issue or Pull requests regarding .NET code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants