RDoc-3254 Embeddings Generation via Tasks #1998

Danielle9897 · 2025-03-17T08:56:12Z

Related issue:
https://issues.hibernatingrhinos.com/issue/RDoc-3254/AI-Integration-Document-Embeddings-Generation-via-Tasks


Lwiel · 2025-03-19T15:58:19Z

...entation.Pages/ai-integration/vector-search/vector-search-using-static-index.dotnet.markdown

+
+    When indexing your own data (textual or numerical) that was not generated by these tasks,  
+    use the `CreateVector()` method in your index definition.  
+    An example is available in [Indexing raw tex](../../ai-integration/vector-search/vector-search-using-static-index#indexing-raw-text).  


Lwiel · 2025-03-19T16:54:37Z

...tation.Pages/ai-integration/generating-embeddings/embeddings-generation-task.dotnet.markdown

+| **lines**             | `string[]` | An array of text lines to split into chunks.        |
+| **htmlText**          | `string`   | A string containing HTML content to process.        |
+| **maxTokensPerLine**  | `number`   | The maximum tokens allowed per line or chunk.       |
+| **maxTokensPerChunk** | `number`   | The maximum tokens per chunk (used in `html.strip`) |


We'll probably end up with a single parameter instead of separate maxTokensPerLine and maxTokensPerChunk parameters. I'll post an update once that changes.

arekpalinski

@Danielle9897 Very impressive work! Descriptions are clear and detailed. Everything is well organized.

arekpalinski · 2025-03-21T07:05:15Z

...n.Documentation.Pages/ai-integration/connection-strings/connection-strings-overview.markdown

+       * The generated identifier will be: _"my-connection-string-to-google-ai"_
+
+     Allowed characters: only lowercase letters (a-z), numbers (0-9), and hyphens (-).  
+     See how this identifier is used in the [embeddings cache collection](../../ai-integration/generating-embeddings/embedding-collections#the-embeddings-cache-collection).


Maybe also worth linking other possible places where the task identifier is used:

indexing - LoadDocument("FieldName", "my-connection-string-to-google-ai")

dynamic queries - from Employees where vector.search(..., ai.task("my-connection-string-to-google-ai", ...)

Specifically, in this section where we explain the connection string identifier,
I wouldn't put links to these methods since they use the "task identifier" and not the "connection string identifier"

You're right. Please ignore :)
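
For reference, the character rule quoted in the diff above can be checked with a short sketch. This is only an illustration of the rule as stated in the article text; `is_valid_identifier` is a hypothetical helper, not part of the RavenDB client API, and the article does not say whether extra constraints (length, leading or trailing hyphens) apply.

```python
import re

# Rule quoted in the diff: only lowercase letters (a-z), numbers (0-9), and hyphens (-).
# Assumption: no further constraints are modeled here (length, hyphen placement).
_IDENTIFIER_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the identifier uses only the allowed characters.

    Hypothetical helper for illustration; not a RavenDB API.
    """
    return _IDENTIFIER_PATTERN.fullmatch(identifier) is not None


if __name__ == "__main__":
    # The identifier shown in the article passes the check.
    print(is_valid_identifier("my-connection-string-to-google-ai"))
    # Uppercase letters and spaces are rejected.
    print(is_valid_identifier("My connection string"))
```

The same rule is quoted for the task identifier in the embeddings-generation-task article, so the check would apply there as well.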

arekpalinski · 2025-03-21T07:17:31Z

...aven.Documentation.Pages/ai-integration/generating-embeddings/embedding-collections.markdown

+  4. **Source properties & their hash**:  
+     This section contains properties from the source document that were converted into embeddings.  
+     Each property includes a hash derived from its content:  
+     `<property-name>: [<hash-created-from-conent>, ...]`


conent -> content

arekpalinski · 2025-03-21T07:17:45Z

...aven.Documentation.Pages/ai-integration/generating-embeddings/embedding-collections.markdown

+  * Each attachment contains a singe embedding.
+
+  * The attachment name follows this format:  
+    `<task-identifier>_<property-name>_<hash-created-from-property-content>`


hash-created-from-property-content -> hash-created-from-chunked-property-content maybe?

modified to: hash-of-text-chunk-from-property-content

arekpalinski · 2025-03-21T07:19:24Z

...aven.Documentation.Pages/ai-integration/generating-embeddings/embedding-collections.markdown

+* This applies both when generating embeddings for source document content and when performing a vector search that requires an embedding for the search term.  
+
+* To find a matching embedding, RavenDB:  
+   1. **Generates a hash** from the text content that requires embedding.  


I'm not sure but maybe we should somehow mention here it's about chunked content?

arekpalinski · 2025-03-21T07:21:57Z

...tation.Pages/ai-integration/generating-embeddings/embeddings-generation-task.dotnet.markdown

+
+        Allowed characters: only lowercase letters (a-z), numbers (0-9), and hyphens (-).  
+        See how this identifier is used in the [Embeddings collection](../../ai-integration/generating-embeddings/embedding-collections#the-embeddings-collection)
+        documents that reference the generated embeddings.  


Same as above, maybe also reference indexing and querying, since those are places where a user needs to reference the embeddings generation task via its identifier.

I've added links here, as you suggested.

arekpalinski · 2025-03-21T07:27:46Z

...ntation/7.0/Raven.Documentation.Pages/ai-integration/generating-embeddings/overview.markdown

+  * The text is sent to the providers in batches. The batch size is configurable, see the  
+    [Ai.Embeddings.Generation.Task.MaxBatchSize](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxbatchsize) configuration key.  
+  * A failed embeddings generation task will retry after the duration set in the 
+    [Ai.Embeddings.Generation.Task.RetryDelayInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrydelayinsec) configuration key.


The above might get changed during the refactoring work that is WIP

arekpalinski · 2025-03-21T07:32:18Z

...ion/7.0/Raven.Documentation.Pages/server/configuration/ai-integration-configuration.markdown

+   * [Ai.Embeddings.Generation.Task.MaxBatchSize](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxbatchsize)  
+   * [Ai.Embeddings.Generation.Task.MaxFallbackTimeInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxfallbacktimeinsec)  
+   * [Ai.Embeddings.Generation.Task.RetryDelayInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrydelayinsec)  
+   * [Ai.Embeddings.Generation.Task.RetryStrategy](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrystrategy)


Those might get changed during the current refactoring work

ArieSLV · 2025-03-23T08:41:33Z

Documentation/7.0/Raven.Documentation.Pages/ai-integration/ai-tasks-list-view.markdown

+
+---
+
+* In this page:


Suggested change

* In this page:

* On this page:

Right, we should consistently use either:

On this page, or

In this article

IMHO - we should go with "In this article" (similar to Microsoft’s docs)
and that fix should be applied across ALL articles in the documentation.

For now, I’ll apply this change to all articles in the "AI Integration" section.

ArieSLV · 2025-03-23T08:45:25Z

....0/Raven.Documentation.Pages/ai-integration/connection-strings/azure-open-ai.dotnet.markdown

+* This article explains how to define a connection string to the [Azure OpenAI Service](https://azure.microsoft.com/en-us/products/ai-services/openai-service),  
+  enabling RavenDB to seamlessly integrate its [embeddings generation tasks](../../ai-integration/generating-embeddings/overview) with your Azure environment.
+
+* In this page:


Suggested change

* In this page:

* On this page:

ArieSLV · 2025-03-23T08:49:58Z

...ion/7.0/Raven.Documentation.Pages/ai-integration/connection-strings/embedded.dotnet.markdown

+  This model is embedded within RavenDB, enabling RavenDB to seamlessly handle its  
+  [embeddings generation tasks](../../ai-integration/generating-embeddings/overview) without requiring an external AI service.
+
+* In this page:


Suggested change

* In this page:

* On this page:

ArieSLV · 2025-03-23T08:52:16Z

...on/7.0/Raven.Documentation.Pages/ai-integration/connection-strings/google-ai.dotnet.markdown

+
+* This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.
+
+* In this page:


Suggested change

* In this page:

* On this page:

ArieSLV · 2025-03-23T08:52:55Z

...on/7.0/Raven.Documentation.Pages/ai-integration/connection-strings/google-ai.dotnet.markdown

+* This article explains how to define a connection string to [Google AI](https://ai.google.dev/gemini-api/docs/embeddings),  
+  enabling RavenDB to seamlessly integrate its [embeddings generation tasks](../../ai-integration/generating-embeddings/overview) with Google's AI services.
+
+* This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.


Suggested change

* This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.

* This configuration supports **Google AI provider** only and is not compatible with Vertex AI.

ArieSLV · 2025-03-23T13:04:57Z

...ation/7.0/Samples/java/src/test/java/net/ravendb/ClientApi/Operations/WhatAreOperations.java

+public class WhatAreOperations {
+
+    private interface IFoo<TResult, TEntity> {
+        /*


Should this commented-out block remain here?

The Java articles will be revised/updated in a separate issue

ArieSLV · 2025-03-23T13:07:38Z

...ation/7.0/Samples/java/src/test/java/net/ravendb/ClientApi/Operations/WhatAreOperations.java

+            GetClientConfigurationOperation.Result result
+                = store.maintenance().send(new GetClientConfigurationOperation());


Suggested change

GetClientConfigurationOperation.Result result

= store.maintenance().send(new GetClientConfigurationOperation());

GetClientConfigurationOperation.Result result =

store.maintenance().send(new GetClientConfigurationOperation());

ArieSLV · 2025-03-23T13:10:21Z

Documentation/7.0/Samples/nodejs/client-api/operations/whatAreOperations.js

+
+    //region wait_kill_syntax
+    await waitForCompletion();
+    await kill()


Suggested change

await kill()

await kill();

Just for consistency with the rest of the examples :)

ArieSLV · 2025-03-23T13:12:07Z

Documentation/7.0/Samples/php/ClientApi/Operations/WhatAreOperations.php

+    /**
+     * Wait for operation completion.
+     *
+     * It throws TimoutException if $duration is set and operation execution time elapses duration interval.


Suggested change

* It throws TimoutException if $duration is set and operation execution time elapses duration interval.

* It throws TimeoutException if $duration is set and operation execution time elapses duration interval.

And do we need this commented-out block?

done,
and - yes (it shows in the article)

ArieSLV · 2025-03-23T13:15:31Z

Documentation/7.0/Samples/python/ClientApi/Operations/WhatAreOperations.py

+            delete_by_query_op = DeleteByQueryOperation("from Products where Discontinued = true")
+
+            # Execute the operation
+            # Send returns an 'Operation' object that can be 'killed'


Suggested change

# Send returns an 'Operation' object that can be 'killed'

# The send_async method returns an 'Operation' object that can be 'killed'

ArieSLV · 2025-03-23T13:28:32Z

Wooow! Absolutely massive amount of work. It took me 4.5 hours to review — writing all this must’ve been a Herculean task. Huge respect!!!!


maciejaszyk · 2025-03-25T08:49:34Z

...aven.Documentation.Pages/ai-integration/generating-embeddings/embedding-collections.markdown

+
+* The server creates the following dedicated collections,  
+  which contain documents that reference the embedding attachments:  
+  * **Embeddings Collection**


This implies that there will be only one Embeddings Collection; in fact, one is created per source collection for which the user defined a task.

maciejaszyk · 2025-03-25T12:37:09Z

...aven.Documentation.Pages/ai-integration/generating-embeddings/embedding-collections.markdown

+* For example:  
+  In this [task definition](../../ai-integration/generating-embeddings/embeddings-generation-task#configuring-an-embeddings-generation-task---from-the-studio),
+  an embeddings generation task is defined on the `Categories` collection.  
+  This creates the `Categories/embeddings` collection, where a document will look as follows:
+
+    ![The embeddings document](images/embeddings-collection-1.png)
+
+  1. **Collection name**   
+     The unique name of the embeddings collection: `Categories/embeddings`.
+  2. **Document ID**  
+     Each document ID in this collection follows the format: `<source-document-name>/embeddings`  
+  3. **Task identifier**  
+     The identifier of the task that generated the embeddings for the listed properties.  
+  4. **Source properties & their hash**:  
+     This section contains properties from the source document whose content was converted into embeddings.  
+     Each property contains an array of hashes derived from text chunks created from the property's content:  
+     `<property-name>: [<hash-created-from-text-chunk-1>, <hash-created-from-text-chunk-2>, ...]`
+  5. **Attachment flag**  
+     Indicates that the document includes attachments, which store the embeddings.  
+     The next image shows the embedding attachments in the document's properties pane.
+
+    ![The embeddings document - attachments](images/embeddings-collection-2.png)
+
+  * Each attachment contains a single embedding.
+
+  * The attachment name follows this format:  
+    `<task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>`
+
+  * If the embeddings were [Quantized](../../ai-integration/vector-search/vector-search-using-dynamic-query#what-is-quantization) by the task,
+    the format includes the quantization type:  
+    `<task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>_<quantization-type>`
+


This needs to be adjusted soon.
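
As a point of reference for that adjustment, the attachment naming format in the hunk above can be sketched as follows. This only mirrors the format strings quoted in the diff at this revision, which the comment above says is about to change; `embedding_attachment_name` is a hypothetical helper, and the hash and quantization-type values used in the demo are placeholders, since the thread does not specify the hash algorithm or the exact type names.

```python
from typing import Optional


def embedding_attachment_name(
    task_identifier: str,
    property_name: str,
    chunk_hash: str,
    quantization_type: Optional[str] = None,
) -> str:
    """Build an attachment name in the format quoted in the diff.

    Format: <task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>
    with an optional trailing _<quantization-type>.
    Hypothetical helper for illustration; not a RavenDB API.
    """
    parts = [task_identifier, property_name, chunk_hash]
    if quantization_type is not None:
        # The quantization type is appended only when the task quantized the embeddings.
        parts.append(quantization_type)
    return "_".join(parts)


if __name__ == "__main__":
    # Placeholder values; real hashes and type names come from the server.
    print(embedding_attachment_name("my-task", "Name", "abc123"))
    print(embedding_attachment_name("my-task", "Name", "abc123", "Int8"))
```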

maciejaszyk · 2025-03-25T12:42:21Z

...aven.Documentation.Pages/ai-integration/generating-embeddings/embedding-collections.markdown

+  Once the expiration date is reached, the document is automatically deleted (provided that [document expiration](../../studio/database/settings/document-expiration) is enabled).
+
+* When a source document (from which embeddings were generated) is modified,
+  RavenDB extends the expiration date of the relevant document in this cache if the remaining time is less than one-third of the original duration.


It has been changed. It is now half of the original duration.
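
A minimal sketch of the updated rule, assuming it reads "extend when the remaining time is less than half of the original duration". `should_extend_expiration` is a hypothetical name, and the sketch does not model what the new expiration date becomes, since that is not stated in this thread.

```python
from datetime import timedelta


def should_extend_expiration(
    remaining: timedelta,
    original: timedelta,
    threshold: float = 0.5,
) -> bool:
    """Return True if a cache document's expiration should be extended.

    Assumption: the rule is a strict "less than" comparison against
    threshold * original, with threshold = 0.5 per the comment above
    (the article text under review still said one-third).
    """
    return remaining < original * threshold


if __name__ == "__main__":
    original = timedelta(days=30)
    # 10 days left out of 30 is below half, so the expiration would be extended.
    print(should_extend_expiration(timedelta(days=10), original))
    # 20 days left is above half, so nothing changes.
    print(should_extend_expiration(timedelta(days=20), original))
```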


Danielle9897 added 10 commits March 10, 2025 18:43

RDoc-3254 The AI Configuration article

94c4f70

RDoc-3254 The connection strings articles

821fca1

RDoc-3254 The connection strings articles - fixes

8121a7a

RDoc-3254 AI Tasks - list view

20ac9a9

RDoc-3254 Embeddings generation - Overview

d37d609

RDoc-3254 Fix links

172805e

RDoc-3254 Fix the configuration article

de07adf

RDoc-3254 The embeddings collections article + The embeddings generat…

8d27c43

…ion task article

RDoc-3254 Add embeddings generation task from the Client API

a58cd1e

RDoc-3254 Add the AddEmbeddingsGenerationOperation (C# only)

4c39991

Danielle9897 marked this pull request as draft March 17, 2025 08:56

Danielle9897 added 8 commits March 18, 2025 10:10

RDoc-3254 Explain chunking methods + Add syntax

ac7f0b3

RDoc-3254 fix syntax

aee2e67

RDoc-3254 fix flow charts

8671bef

RDoc-3254 Dynamic query example

9e3e50d

RDoc-3254 Add info to the "RavenDB as a vector DB" article

3ceb499

RDoc-3254 Static index & query example

0556f53

RDoc-3254 fix the overview

8ee51aa

RDoc-3254 fix section: Configure the vector field in the Studio

a0cb9c8

Danielle9897 marked this pull request as ready for review March 19, 2025 15:16

Danielle9897 requested review from maciejaszyk, Lwiel and arekpalinski March 19, 2025 15:16

Lwiel reviewed Mar 19, 2025


Danielle9897 added 5 commits March 20, 2025 11:48

RDoc-3254 small fixes

0ec3474

RDoc-3254 Add the option to use Name: this.Name in the script

d9ea52d

RDoc-3254 small fixes

ceba35c

RDoc-3254 improve text

46d8c9a

RDoc-3254 Enhance the process & the cache lookup flow

036cae0

arekpalinski approved these changes Mar 21, 2025


arekpalinski requested a review from ArieSLV March 21, 2025 07:36

ArieSLV reviewed Mar 23, 2025


RDoc-3254 Fix flow charts

1395741

Danielle9897 force-pushed the RDoc-3254-embeddingsGeneration branch from 710596a to 1395741 Compare March 24, 2025 06:27

Danielle9897 added 6 commits March 24, 2025 09:41

RDoc-3254 In this page => In this article

36af81b

RDoc-3254 Fixed all of @lev comments except one (about other provider…

62ef950

…s that have an OpenAI-compatible API....)

RDoc-3254 fix flow chart

de3bfb8

RDoc-3254 fix file 'embedding-collections' based on @Arek's comments

033ba15

RDoc-3254 fix other review comments

853b4a8

RDoc-3254 fix the configuration to match the latest code

355700b

maciejaszyk reviewed Mar 25, 2025


RDoc-3254 Update the embedding-collections article to match the lates…

6c3028b

…t code + Update images + Small fixes to flow charts + Fix to configuration + Updated the expiration policy

Danielle9897 force-pushed the RDoc-3254-embeddingsGeneration branch from 512ac71 to 6c3028b Compare March 26, 2025 21:38

Danielle9897 added 3 commits March 27, 2025 11:48

RDoc-3254 Add the default chunking method used in the script

ba68ff7

RDoc-3254 Apply the new naming convention for collections and documen…

023190e

…t IDs

RDoc-3280 Update information about similarity threshold in vector search

65e21d5

ppekrol merged commit d411598 into ravendb:master Mar 27, 2025
1 of 2 checks passed


