
RDoc-3254 Embeddings Generation via Tasks #1998

Merged (34 commits) on Mar 27, 2025

Conversation

Danielle9897
Member

@Danielle9897 marked this pull request as draft on March 17, 2025 08:56
@Danielle9897 marked this pull request as ready for review on March 19, 2025 15:16

When indexing your own data (textual or numerical) that was not generated by these tasks,
use the `CreateVector()` method in your index definition.
An example is available in [Indexing raw tex](../../ai-integration/vector-search/vector-search-using-static-index#indexing-raw-text).

typo

Member Author

done

| **lines** | `string[]` | An array of text lines to split into chunks. |
| **htmlText** | `string` | A string containing HTML content to process. |
| **maxTokensPerLine** | `number` | The maximum tokens allowed per line or chunk. |
| **maxTokensPerChunk** | `number` | The maximum tokens per chunk (used in `html.strip`) |

We'll probably end up with a single parameter instead of separate maxTokensPerLine and maxTokensPerChunk. I'll post an update once it's changed.

Member

@arekpalinski left a comment

@Danielle9897 Very impressive work! Descriptions are clear and detailed. Everything is well organized.

* The generated identifier will be: _"my-connection-string-to-google-ai"_

Allowed characters: only lowercase letters (a-z), numbers (0-9), and hyphens (-).
See how this identifier is used in the [embeddings cache collection](../../ai-integration/generating-embeddings/embedding-collections#the-embeddings-cache-collection).
Member

Maybe also worth linking other possible places where the task identifier is used:

  • indexing - LoadDocument("FieldName", "my-connection-string-to-google-ai")
  • dynamic queries - from Employees where vector.search(..., ai.task("my-connection-string-to-google-ai", ...)

Member Author

Specifically, in this section where we explain the connection string identifier,
I wouldn't put links to these methods since they use the "task identifier" and not the "connection string identifier"

Member

You're right. Please ignore :)
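
As an illustration of the allowed-characters rule quoted in this thread (lowercase letters, digits, and hyphens), here is a minimal Python sketch. The function name `is_valid_identifier` is hypothetical and is not part of any RavenDB API. The check only mirrors the rule as worded in the doc snippet, and rejecting an empty string is an assumption; the server's actual validation may differ.

```python
import re

# Assumed rule, copied from the quoted doc text:
# only lowercase letters (a-z), numbers (0-9), and hyphens (-).
_IDENTIFIER_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the whole string uses only the allowed characters.

    Hypothetical helper for illustration; an empty string is treated
    as invalid, which is an assumption rather than documented behavior.
    """
    return _IDENTIFIER_PATTERN.fullmatch(identifier) is not None
```

With this sketch, the identifier from the snippet, `my-connection-string-to-google-ai`, passes, while a name containing spaces or uppercase letters does not.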

4. **Source properties & their hash**:
This section contains properties from the source document that were converted into embeddings.
Each property includes a hash derived from its content:
`<property-name>: [<hash-created-from-conent>, ...]`
Member

conent -> content

Member Author

done

* Each attachment contains a single embedding.

* The attachment name follows this format:
`<task-identifier>_<property-name>_<hash-created-from-property-content>`
Member

hash-created-from-property-content -> hash-created-from-chunked-property-content maybe?

Member Author

@Danielle9897 Mar 24, 2025

modified to: hash-of-text-chunk-from-property-content

* This applies both when generating embeddings for source document content and when performing a vector search that requires an embedding for the search term.

* To find a matching embedding, RavenDB:
1. **Generates a hash** from the text content that requires embedding.
Member

I'm not sure but maybe we should somehow mention here it's about chunked content?

Member Author

yes, added
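
To make the hash-based lookup discussed in this thread more concrete, here is a rough Python sketch of a cache keyed by the hash of a text chunk. Everything in it is an assumption for illustration: the quoted snippet does not say which hash algorithm is used (SHA-256 is a stand-in), the cache is a plain dictionary rather than a RavenDB collection, and `embed` is a hypothetical callable standing in for a call to an embeddings provider.

```python
import hashlib
from typing import Callable, Dict, List


def chunk_hash(chunk: str) -> str:
    """Hash a text chunk. SHA-256 is an assumption, not the documented algorithm."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def get_or_create_embedding(
    chunk: str,
    cache: Dict[str, List[float]],
    embed: Callable[[str], List[float]],
) -> List[float]:
    """Return the cached embedding for a chunk, generating it only on a miss.

    `cache` maps chunk hashes to embeddings; `embed` is a hypothetical
    provider call that is invoked only when the hash is not found.
    """
    key = chunk_hash(chunk)
    if key not in cache:
        cache[key] = embed(chunk)
    return cache[key]
```

The point of the pattern is that identical chunk text, whether it comes from a source document or from a search term, maps to the same key and so triggers at most one provider call.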


Allowed characters: only lowercase letters (a-z), numbers (0-9), and hyphens (-).
See how this identifier is used in the [Embeddings collection](../../ai-integration/generating-embeddings/embedding-collections#the-embeddings-collection)
documents that reference the generated embeddings.
Member

Same as above, maybe also reference indexing and querying, since those are places where a user needs to reference the embeddings generation task via its identifier.

Member Author

I've added links here, as you suggested.

* The text is sent to the providers in batches. The batch size is configurable, see the
[Ai.Embeddings.Generation.Task.MaxBatchSize](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxbatchsize) configuration key.
* A failed embeddings generation task will retry after the duration set in the
[Ai.Embeddings.Generation.Task.RetryDelayInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrydelayinsec) configuration key.
Member

The above might get changed during the refactoring work that is WIP
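
For context, the batching behavior described in the quoted snippet can be pictured with a short Python sketch. The helper name `split_into_batches` and the batch size used in the usage note are hypothetical; in RavenDB the size comes from the `Ai.Embeddings.Generation.Task.MaxBatchSize` configuration key, and, as noted above, this behavior may change during the refactoring.

```python
from typing import Iterator, List, Sequence


def split_into_batches(texts: Sequence[str], max_batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches holding at most `max_batch_size` texts each.

    Hypothetical helper: it only illustrates the idea of sending text
    to a provider in batches, not RavenDB's actual implementation.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(texts), max_batch_size):
        yield list(texts[start:start + max_batch_size])
```

For example, five texts with a batch size of 2 would be sent as three batches of sizes 2, 2, and 1.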

* [Ai.Embeddings.Generation.Task.MaxBatchSize](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxbatchsize)
* [Ai.Embeddings.Generation.Task.MaxFallbackTimeInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxfallbacktimeinsec)
* [Ai.Embeddings.Generation.Task.RetryDelayInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrydelayinsec)
* [Ai.Embeddings.Generation.Task.RetryStrategy](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrystrategy)
Member

Those might get changed during the current refactoring work

@arekpalinski requested a review from ArieSLV on March 21, 2025 07:36

---

* In this page:

Suggested change:
- * In this page:
+ * On this page:

Member Author

@Danielle9897 Mar 24, 2025

Right, we should consistently use either:

  • On this page, or
  • In this article

IMHO - we should go with "In this article" (similar to Microsoft’s docs)
and that fix should be applied across ALL articles in the documentation.

For now, I’ll apply this change to all articles in the "AI Integration" section.

* This article explains how to define a connection string to the [Azure OpenAI Service](https://azure.microsoft.com/en-us/products/ai-services/openai-service),
enabling RavenDB to seamlessly integrate its [embeddings generation tasks](../../ai-integration/generating-embeddings/overview) with your Azure environment.

* In this page:

Suggested change:
- * In this page:
+ * On this page:

This model is embedded within RavenDB, enabling RavenDB to seamlessly handle its
[embeddings generation tasks](../../ai-integration/generating-embeddings/overview) without requiring an external AI service.

* In this page:

Suggested change:
- * In this page:
+ * On this page:


* This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.

* In this page:

Suggested change:
- * In this page:
+ * On this page:

* This article explains how to define a connection string to [Google AI](https://ai.google.dev/gemini-api/docs/embeddings),
enabling RavenDB to seamlessly integrate its [embeddings generation tasks](../../ai-integration/generating-embeddings/overview) with Google's AI services.

* This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.

Suggested change:
- * This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.
+ * This configuration supports **Google AI provider** only and is not compatible with Vertex AI.

Member Author

done

public class WhatAreOperations {

private interface IFoo<TResult, TEntity> {
/*

Should this commented-out block remain here?

Member Author

The Java articles will be revised/updated in a separate issue

Comment on lines +86 to +87
GetClientConfigurationOperation.Result result
= store.maintenance().send(new GetClientConfigurationOperation());

Suggested change:
- GetClientConfigurationOperation.Result result
-     = store.maintenance().send(new GetClientConfigurationOperation());
+ GetClientConfigurationOperation.Result result =
+     store.maintenance().send(new GetClientConfigurationOperation());


//region wait_kill_syntax
await waitForCompletion();
await kill()

Suggested change:
- await kill()
+ await kill();

Just for consistency with the rest of the examples :)

Member Author

done

/**
* Wait for operation completion.
*
* It throws TimoutException if $duration is set and operation execution time elapses duration interval.

Suggested change:
- * It throws TimoutException if $duration is set and operation execution time elapses duration interval.
+ * It throws TimeoutException if $duration is set and operation execution time elapses duration interval.

And do we need this commented-out block?

Member Author

done,
and - yes (it shows in the article)

delete_by_query_op = DeleteByQueryOperation("from Products where Discontinued = true")

# Execute the operation
# Send returns an 'Operation' object that can be 'killed'

Suggested change:
- # Send returns an 'Operation' object that can be 'killed'
+ # The send_async method returns an 'Operation' object that can be 'killed'

Member Author

done

ArieSLV commented Mar 23, 2025

Wooow! Absolutely massive amount of work. It took me 4.5 hours to review — writing all this must’ve been a Herculean task. Huge respect!!!!

@Danielle9897 force-pushed the RDoc-3254-embeddingsGeneration branch from 710596a to 1395741 on March 24, 2025 06:27

* The server creates the following dedicated collections,
which contain documents that reference the embedding attachments:
* **Embeddings Collection**
Member

This implies that there will be only one Embeddings Collection; however, one is created per collection for which the user defined a task.

Member Author

fixed

Comment on lines 70 to 101
* For example:
In this [task definition](../../ai-integration/generating-embeddings/embeddings-generation-task#configuring-an-embeddings-generation-task---from-the-studio),
an embeddings generation task is defined on the `Categories` collection.
This creates the `Categories/embeddings` collection, where a document will look as follows:

![The embeddings document](images/embeddings-collection-1.png)

1. **Collection name**
The unique name of the embeddings collection: `Categories/embeddings`.
2. **Document ID**
Each document ID in this collection follows the format: `<source-document-name>/embeddings`
3. **Task identifier**
The identifier of the task that generated the embeddings for the listed properties.
4. **Source properties & their hash**:
This section contains properties from the source document whose content was converted into embeddings.
Each property contains an array of hashes derived from text chunks created from the property's content:
`<property-name>: [<hash-created-from-text-chunk-1>, <hash-created-from-text-chunk-2>, ...]`
5. **Attachment flag**
Indicates that the document includes attachments, which store the embeddings.
The next image shows the embedding attachments in the document's properties pane.

![The embeddings document - attachments](images/embeddings-collection-2.png)

* Each attachment contains a single embedding.

* The attachment name follows this format:
`<task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>`

* If the embeddings were [Quantized](../../ai-integration/vector-search/vector-search-using-dynamic-query#what-is-quantization) by the task,
the format includes the quantization type:
`<task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>_<quantization-type>`

Member

This needs to be adjusted soon.

Member Author

fixed
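
As a reading aid for the attachment-name format quoted in this hunk, here is a small Python sketch that assembles a name from its parts. The function name `embedding_attachment_name` and the sample values in the usage note are hypothetical, and the sketch follows the format only as it appears in the quoted snippet, which the review above says may be adjusted.

```python
from typing import Optional


def embedding_attachment_name(
    task_identifier: str,
    property_name: str,
    chunk_hash: str,
    quantization_type: Optional[str] = None,
) -> str:
    """Join the parts with underscores, following the quoted format:

    <task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>
    with an optional trailing _<quantization-type>.

    Hypothetical helper for illustration only; the hash is passed in
    because the snippet does not specify how it is computed.
    """
    parts = [task_identifier, property_name, chunk_hash]
    if quantization_type is not None:
        parts.append(quantization_type)
    return "_".join(parts)
```

For instance, a made-up task identifier `my-task`, property `Name`, and hash `abc123` would give `my-task_Name_abc123`, or `my-task_Name_abc123_Int8` if a (placeholder) quantization type `Int8` were supplied.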

Once the expiration date is reached, the document is automatically deleted (provided that [document expiration](../../studio/database/settings/document-expiration) is enabled).

* When a source document (from which embeddings were generated) is modified,
RavenDB extends the expiration date of the relevant document in this cache if the remaining time is less than one-third of the original duration.
Member

It has been changed. It is now half of the original duration.

Member Author

fixed
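
The expiration rule discussed in this thread can be expressed as a short Python sketch. The function name `should_extend_expiration` and its parameters are hypothetical. The default threshold of one half follows the reviewer's comment above (the quoted doc text still said one-third), and the sketch covers only the decision to extend, since the thread does not state what the new expiration date becomes.

```python
from datetime import datetime, timedelta


def should_extend_expiration(
    now: datetime,
    expires_at: datetime,
    original_duration: timedelta,
    threshold: float = 0.5,
) -> bool:
    """Return True when the remaining time is below the threshold fraction.

    Hypothetical helper: `threshold` defaults to one half of the original
    duration, per the review comment; pass 1/3 to model the earlier wording.
    """
    remaining = expires_at - now
    return remaining < original_duration * threshold
```

With an original duration of 90 days, a cache document that has 30 days left would be extended under the one-half rule (30 is less than 45), while one with 60 days left would not.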

…t code +

Update images +
Small fixes to flow charts +
Fix to configuration +
Updated the expiration policy
@Danielle9897 force-pushed the RDoc-3254-embeddingsGeneration branch from 512ac71 to 6c3028b on March 26, 2025 21:38
@ppekrol merged commit d411598 into ravendb:master on Mar 27, 2025
1 of 2 checks passed

6 participants