[Store][Postgres] allow store initialization with utilized distance #197

Open: 1 commit to be merged into base: main

DZunke
Contributor

@DZunke DZunke commented Jul 24, 2025

| Q | A |
| --- | --- |
| Bug fix? | no |
| New feature? | yes |
| Docs? | no |
| Issues | #195 |
| License | MIT |

According to the pgvector documentation, multiple distance calculations are supported. The store currently implements only the L2 distance, using the <-> operator. Supporting the other distance variants would be useful here, since most of the discussion seems to revolve around cosine distance.
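For reference, pgvector exposes each distance function as its own SQL operator. A minimal sketch of how a distance option could map to those operators follows; the enum name `DistanceOperator`, its case names, and the `operator()` method are hypothetical and not taken from this PR.

```php
<?php

declare(strict_types=1);

// Hypothetical enum for illustration only; names are assumptions, not the PR's API.
enum DistanceOperator: string
{
    case L2 = 'l2';
    case Cosine = 'cosine';
    case InnerProduct = 'inner_product';
    case L1 = 'l1';

    // Returns the pgvector SQL operator that computes this distance.
    public function operator(): string
    {
        return match ($this) {
            self::L2 => '<->',           // Euclidean (L2) distance
            self::Cosine => '<=>',       // cosine distance
            self::InnerProduct => '<#>', // negative inner product
            self::L1 => '<+>',           // taxicab (L1) distance, newer pgvector releases
        };
    }
}

echo DistanceOperator::Cosine->operator(), "\n";
```

Keeping the operator behind an enum means the value interpolated into the SQL string can only ever be one of a fixed set of known operators.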

@lyrixx
Member

lyrixx commented Jul 24, 2025

Thanks for this PR.

I'm using Store::fromDbal(), so the comparison should be added there too. And also in fromPdo().

To test, I changed the default in the constructor and double-checked that it is configured correctly (current operator: <=>).

diff --git a/src/store/src/Bridge/Postgres/Store.php b/src/store/src/Bridge/Postgres/Store.php
index 07bff22..28d4ba5 100644
--- a/src/store/src/Bridge/Postgres/Store.php
+++ b/src/store/src/Bridge/Postgres/Store.php
@@ -34,7 +34,7 @@ final readonly class Store implements VectorStoreInterface, InitializableStoreIn
         private \PDO $connection,
         private string $tableName,
         private string $vectorFieldName = 'embedding',
-        private Distance $distance = Distance::L2,
+        private Distance $distance = Distance::Cosine,
     ) {
     }

I have crawled https://jolicode.com and https://www.premieroctet.com, indexed all their content, and run the following code:

$rows = $connection->executeQuery("select * from {$_SERVER['PLATFORM']}")->fetchAllAssociative();
foreach ($rows as $row) {
    $metadata = json_decode($row['metadata'], true, 512, \JSON_THROW_ON_ERROR);
    $vector = new Vector(json_decode($row['embedding'], true));
    $documents = $store->query($vector, [], 0.000001); //Hack to not get "current row"

    if (!$documents) {
        continue;
    }

    echo "Current document: {$metadata['url']}\n";
    echo "Found " . count($documents) . " similar documents:\n";
    foreach ($documents as $i => $document) {
        echo "- {$document->metadata['url']} (score: {$document->score})\n";
        // break;
    }
    // die; // uncomment to stop after the first document
    echo "\n";
}

So:

  1. We still sort by score ASC
  2. But the lowest score seems to be the best again
Current document: https://jolicode.com/blog/tag/zellij
Found 5 similar documents:
- https://jolicode.com/blog/tag/zellij (score: 0)
- https://jolicode.com/blog/tag/tmux (score: 0.068143753338624)
- https://jolicode.com/blog/tag/agence (score: 0.12764826831817)
- https://jolicode.com/blog/tag/js (score: 0.13115907744191)
- https://jolicode.com/blog/tag/sysadmin (score: 0.13248805321064)

Current document: https://jolicode.com/blog/tag/encodage
Found 5 similar documents:
- https://jolicode.com/blog/tag/encodage (score: 0)
- https://jolicode.com/blog/tag/utf8 (score: 0.045148642848127)
- https://jolicode.com/qui-sommes-nous/equipe/marion-hurteau (score: 0.07487888379814)
- https://jolicode.com/blog/ce-que-vous-devez-savoir-sur-les-chaa-r-nes-de-caracta-res (score: 0.07784386727315)
- https://jolicode.com/blog/tag/qualite (score: 0.083560306575813)

@chr-hertel added the "Store" (Issues & PRs about the AI Store component) and "Status: Needs Work" labels Jul 24, 2025
@DZunke DZunke force-pushed the configurable-postgres-distance branch from e966024 to b4e30f5 Compare July 25, 2025 09:53
$uuid = Uuid::v4();
$vectorData = [0.1, 0.2, 0.3];
$minScore = 0.8;
$pdo = $this->createMock(\PDO::class);
Member

I don't really know the testing strategy on this project yet. But this kind of test tests nothing. All the important things are mocked.

IMHO, it would be much better to use a real instance of pgvector.

Contributor Author

Thanks. Yeah, the tests only check that the query is built correctly. Nevertheless, I have to remove the tests because there was a merge of tests. They are still unit tests, not functional tests. I would keep them for now, just to check that the correct comparison method is used.

@DZunke
Contributor Author

DZunke commented Jul 25, 2025

> Thanks for this PR.
>
> I'm using Store::fromDbal(), so the comparison should be added there too. And also in fromPdo().
>
> To test, I changed the default in the constructor and double-checked that it is configured correctly (current operator: <=>).
> ...

Thanks, @lyrixx! I've already added it to the named constructors.

Your results generally look totally fine to me. The cosine distance, which is being used here, returns a value between 0 and 2, where 0 means the vectors point in the same direction (most similar) and 2 means they point in opposite directions (most dissimilar). So sorting ASC puts the most similar document first, while sorting DESC would put the most dissimilar document first.
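To make the 0 to 2 range concrete, here is a small worked sketch that computes cosine distance as 1 minus cosine similarity in plain PHP. The function name `cosineDistance` is made up for this illustration; in the store itself the value is computed by pgvector, not by PHP code.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative helper (not part of the store): cosine distance = 1 - cosine similarity.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineDistance(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    if (0.0 === $normA || 0.0 === $normB) {
        throw new InvalidArgumentException('Cosine distance is undefined for zero vectors.');
    }

    return 1.0 - $dot / (sqrt($normA) * sqrt($normB));
}

printf("%.1f\n", cosineDistance([1.0, 0.0], [2.0, 0.0]));  // same direction: 0.0
printf("%.1f\n", cosineDistance([1.0, 0.0], [0.0, 3.0]));  // orthogonal: 1.0
printf("%.1f\n", cosineDistance([1.0, 0.0], [-1.0, 0.0])); // opposite direction: 2.0
```

Note that the vector length does not matter, only the direction, which is why `[1, 0]` and `[2, 0]` have a distance of 0.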

In the query, the filtering could be problematic. The term minScore is not the best wording here, especially in combination with the >= comparator. What’s currently labeled as minScore should actually be a maxScore, at least for cosine search 🙈

The same filtering problem also seems to apply to the L2 distance, since it ranges from 0 to infinity, where 0 is the best match. It seems the minScore wording comes from the MongoDB implementation; at least that seems to be where it started, and where it is correct.
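As a rough sketch of what an upper-bound filter could look like for distance-based operators: the helper below builds a query string in which the threshold limits the distance from above and results are sorted ascending. The function name `buildDistanceQuery`, the `id` and `metadata` columns, and the placeholder names are assumptions for illustration and are not the store's actual query.

```php
<?php

declare(strict_types=1);

// Hypothetical helper; table, column, and placeholder names are assumptions.
// $table and $vectorField are interpolated as-is and must be trusted identifiers.
function buildDistanceQuery(string $table, string $vectorField, string $operator, bool $withMaxDistance): string
{
    // The operator ends up in the SQL string, so only known pgvector operators are accepted.
    if (!in_array($operator, ['<->', '<=>', '<#>', '<+>'], true)) {
        throw new InvalidArgumentException(sprintf('Unsupported operator "%s".', $operator));
    }

    // Inner query computes the distance once and exposes it as "score".
    $inner = sprintf(
        'SELECT id, metadata, (%s %s :embedding) AS score FROM %s',
        $vectorField,
        $operator,
        $table
    );

    // A smaller distance is a better match, so the threshold is an upper bound.
    $where = $withMaxDistance ? ' WHERE score <= :maxDistance' : '';

    return sprintf('SELECT * FROM (%s) AS scored%s ORDER BY score ASC LIMIT :limit', $inner, $where);
}

echo buildDistanceQuery('documents', 'embedding', '<=>', true), "\n";
```

The subquery is used here only so that each named placeholder appears once; whether that shape is appropriate for a real table and index would need to be checked against the actual setup.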

@DZunke DZunke force-pushed the configurable-postgres-distance branch from b4e30f5 to 8c7735a Compare July 25, 2025 11:12
@lyrixx
Copy link
Member

lyrixx commented Jul 25, 2025

Thank you very much for the explanation. Very clear.

And I agree with you about minScore. Maybe the name could be "threshold". It's kind of generic, haha.

Labels
Status: Needs Work Store Issues & PRs about the AI Store component
3 participants