Add SearchIndex and VectorSearchIndex #264

Open
wants to merge 102 commits into
base: main
from

Conversation

WaVEV
Collaborator

@WaVEV WaVEV commented Mar 3, 2025

Search Indexes and Vector search indexes

This PR introduces new index classes to encapsulate the definitions and details of Atlas indexes.

@WaVEV WaVEV requested review from timgraham and Jibola March 3, 2025 02:45
@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 03629ae to de3d245 Compare March 8, 2025 03:50
@WaVEV WaVEV force-pushed the create-atlas-indexes branch 2 times, most recently from 1bf4717 to 7dc04ab Compare March 20, 2025 21:49
@WaVEV WaVEV marked this pull request as ready for review March 20, 2025 23:05
@timgraham timgraham changed the title Create atlas indexes Add SearchIndex and VectorSearchIndex Mar 23, 2025
@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 60d49de to 2865e13 Compare March 25, 2025 03:05
@WaVEV WaVEV force-pushed the create-atlas-indexes branch 2 times, most recently from 9fdc143 to 15e3450 Compare March 31, 2025 03:05
@WaVEV WaVEV force-pushed the create-atlas-indexes branch 3 times, most recently from b06db74 to e69da64 Compare April 9, 2025 06:00
Collaborator

@timgraham timgraham left a comment

Let's add a check similar to https://github.com/django/django/blob/1429e722f265a4f4229b5f7eaa6a6df3161c342a/django/db/models/constraints.py#L150-L167 and make sure that schema editor ignores search indexes if not supported.
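A minimal sketch of the kind of check and schema-editor guard requested above. It is not the PR's implementation: `CheckWarning`, `Features`, `Connection`, `SearchIndexSketch`, `add_index`, the `supports_atlas_search` flag name, and the check id are all stand-ins invented for illustration so the example runs without Django installed.

```python
from dataclasses import dataclass, field


@dataclass
class CheckWarning:
    """Stand-in for django.core.checks.Warning (not the real class)."""
    msg: str
    id: str = ""


@dataclass
class Features:
    # Hypothetical feature flag; the real name in the backend may differ.
    supports_atlas_search: bool = False


@dataclass
class Connection:
    display_name: str = "MongoDB"
    features: Features = field(default_factory=Features)


class SearchIndexSketch:
    """Illustrative index class, not the PR's SearchIndex."""

    def __init__(self, name):
        self.name = name

    def check(self, model_label, connection):
        """Return a list of warnings instead of failing at migration time."""
        if connection.features.supports_atlas_search:
            return []
        return [
            CheckWarning(
                f"{connection.display_name} does not support search indexes; "
                f"{self.name!r} on {model_label} will be ignored.",
                id="sketch.W001",  # hypothetical check id
            )
        ]


def add_index(index, connection, created):
    """Schema-editor-style helper: skip creation when unsupported."""
    if not connection.features.supports_atlas_search:
        return False
    created.append(index.name)
    return True
```

The point of the split is that the check only reports the problem, while the schema-editor helper is what actually skips the index, so migrations still run on a non-Atlas database.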

@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 92caf14 to 61b1c05 Compare April 12, 2025 16:19
@WaVEV WaVEV force-pushed the create-atlas-indexes branch from 00b0323 to 08654ec Compare April 15, 2025 02:17
Comment on lines 296 to 297
# Drop the index if it exists, particularly if it may not have been created previously
# due to lack of Atlas search support, but now the database supports it.
Collaborator

Is it really possible/likely that a database can go from non-Atlas to Atlas?

Collaborator Author

Maybe @Jibola could answer better, but IMO it is very reasonable: an application adds some AI support and doesn't want to run two separate backends. The opposite, going from Atlas to non-Atlas, isn't very reasonable. I decided to handle both.

Collaborator

How do you migrate from non-Atlas to Atlas? Dump the db and load it in a new server? I guess it's possible for this migrations scenario to happen, but it doesn't seem very likely. I'd expect the main place where search indexes might be ignored is if they were in a third-party app, so basically the scenario would be: user installs third-party app with ignored search index, user migrates to atlas, third-party app removes atlas index. (Do you have another scenario in mind?)

Collaborator Author

🤔 Yes, I think you are right. But at least the if statement must remain as it is. Or maybe we could remove the existence check, because if an index was skipped because the backend doesn't support Atlas, then there is nothing to drop.

@timgraham timgraham force-pushed the create-atlas-indexes branch from 21c9047 to 0d6f719 Compare April 24, 2025 18:28
Comment on lines 21 to 56
name: Django Test Suite
runs-on: ubuntu-latest
steps:
  - name: Checkout django-mongodb-backend
    uses: actions/checkout@v4
    with:
      persist-credentials: false
  - name: install django-mongodb-backend
    run: |
      pip3 install --upgrade pip
      pip3 install -e .
  - name: Checkout Django
    uses: actions/checkout@v4
    with:
      repository: 'mongodb-forks/django'
      ref: 'mongodb-5.1.x'
      path: 'django_repo'
      persist-credentials: false
  - name: Install system packages for Django's Python test dependencies
    run: |
      sudo apt-get update
      sudo apt-get install libmemcached-dev
  - name: Install Django and its Python test dependencies
    run: |
      cd django_repo/tests/
      pip3 install -e ..
      pip3 install -r requirements/py3.txt
  - name: Copy the test settings file
    run: cp .github/workflows/mongodb_settings.py django_repo/tests/
  - name: Copy the test runner file
    run: cp .github/workflows/runtests.py django_repo/tests/runtests_.py
  - name: Start local Atlas
    working-directory: .
    run: bash .github/workflows/start_local_atlas.sh mongodb/mongodb-atlas-local:7
  - name: Run tests
    run: python3 django_repo/tests/runtests_.py

Check warning

Code scanning / CodeQL

Workflow does not contain permissions Medium test

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
Collaborator

@timgraham timgraham left a comment

I think we're missing schema tests with similarities, both as a string and as a list.
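For context, a hedged sketch of what accepting similarities both as a string and as a list could look like, and what such tests would need to cover. `expand_similarities` and `VALID_SIMILARITIES` are hypothetical names, not taken from the PR; the three similarity values are the ones Atlas Vector Search accepts.

```python
# Similarity functions accepted by Atlas Vector Search index definitions.
VALID_SIMILARITIES = frozenset(("cosine", "dotProduct", "euclidean"))


def expand_similarities(similarities, vector_field_count):
    """Return one similarity per vector field (hypothetical helper).

    A single string is applied to every vector field; a list or tuple
    must supply exactly one entry per vector field.
    """
    if isinstance(similarities, str):
        expanded = [similarities] * vector_field_count
    else:
        expanded = list(similarities)
    if len(expanded) != vector_field_count:
        raise ValueError(
            f"Expected {vector_field_count} similarities, got {len(expanded)}."
        )
    for similarity in expanded:
        if similarity not in VALID_SIMILARITIES:
            raise ValueError(f"Unknown similarity: {similarity!r}")
    return expanded
```

A schema test for each form would then create the index once with a string and once with a list and compare the resulting index definitions.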

Comment on lines +21 to +23
Some fields such as :class:`~django.db.models.DecimalField` aren't
supported. See the :ref:`Atlas documentation <atlas:bson-data-chart>` for a
complete list of unsupported data types.
Collaborator

Is this accurate? It might be useful to have a check, but maybe we're spending too much time on this. 😉

Collaborator Author

Yes, we are. Most of the time wasted was my fault.
And yes, it is accurate.

VALID_FIELD_TYPES = frozenset(("boolean", "date", "number", "objectId", "string", "uuid"))
_error_id_prefix = "django_mongodb_backend.indexes.VectorSearchIndex"

def __init__(self, *, fields=(), similarities="cosine", name=None):
Collaborator

How did you decide that cosine should be the default?

Collaborator Author

To be honest, any time I worked with embeddings, this similarity was the first one I tried, except when the data was normalized (in that case, cosine and dot product give the same result, and dot product is faster). The L2 norm is also used, but less often than cosine in semantic search.
To keep the index simple, I decided to make cosine the default.
I have no strong preference about making similarities a required parameter instead.
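The claim about normalized data can be checked with a small sketch in plain Python. The helper names are illustrative only; nothing here comes from the PR.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity: the dot product divided by both lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

# For unit vectors the denominator in cosine() is 1, so both measures agree,
# and the dot product skips the two norm computations.
assert math.isclose(cosine(u, v), dot(u, v))
```

For vectors that are not normalized, the two measures differ, which is why the choice of similarity matters when embeddings are stored as produced.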


The index should reference at least one vector field: an :class:`.ArrayField`
with a :attr:`~.ArrayField.base_field` of :class:`~django.db.models.FloatField`
or :class:`~django.db.models.IntegerField`. It cannot reference an
Collaborator

Does ArrayField(IntegerField) store data in the correct format for this index? binData(int8)? https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#about-the-similarity-functions

Collaborator Author

It is stored as an array, which isn't the best data structure for this kind of data, but it works. I remember having this conversation with Jib and James.

@timgraham timgraham force-pushed the create-atlas-indexes branch from fd52b43 to c59297c Compare April 26, 2025 01:07
4 participants