RDoc-3254 Embeddings Generation via Tasks #1998
When indexing your own data (textual or numerical) that was not generated by these tasks,
use the `CreateVector()` method in your index definition.
An example is available in [Indexing raw tex](../../ai-integration/vector-search/vector-search-using-static-index#indexing-raw-text).
typo
done
| **lines** | `string[]` | An array of text lines to split into chunks. |
| **htmlText** | `string` | A string containing HTML content to process. |
| **maxTokensPerLine** | `number` | The maximum tokens allowed per line or chunk. |
| **maxTokensPerChunk** | `number` | The maximum tokens per chunk (used in `html.strip`) |
We'll probably have only one parameter instead of separate `maxTokensPerLine` and `maxTokensPerChunk` parameters. I'll post an update once it's changed.
@Danielle9897 Very impressive work! Descriptions are clear and detailed. Everything is well organized.
* The generated identifier will be: _"my-connection-string-to-google-ai"_

Allowed characters: only lowercase letters (a-z), numbers (0-9), and hyphens (-).
See how this identifier is used in the [embeddings cache collection](../../ai-integration/generating-embeddings/embedding-collections#the-embeddings-cache-collection).
Maybe also worth linking other possible places where the task identifier is used:
- indexing: `LoadDocument("FieldName", "my-connection-string-to-google-ai")`
- dynamic queries: `from Employees where vector.search(..., ai.task("my-connection-string-to-google-ai", ...)`
Specifically, in this section where we explain the connection string identifier,
I wouldn't put links to these methods since they use the "task identifier" and not the "connection string identifier"
You're right. Please ignore :)
4. **Source properties & their hash**:
This section contains properties from the source document that were converted into embeddings.
Each property includes a hash derived from its content:
`<property-name>: [<hash-created-from-conent>, ...]`
conent -> content
done
* Each attachment contains a single embedding.

* The attachment name follows this format:
`<task-identifier>_<property-name>_<hash-created-from-property-content>`
`hash-created-from-property-content` -> `hash-created-from-chunked-property-content`, maybe?
modified to: hash-of-text-chunk-from-property-content
* This applies both when generating embeddings for source document content and when performing a vector search that requires an embedding for the search term.

* To find a matching embedding, RavenDB:
1. **Generates a hash** from the text content that requires embedding.
I'm not sure but maybe we should somehow mention here it's about chunked content?
yes, added
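The hash-based lookup discussed in this thread can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the SHA-256 hash, the in-memory dictionary standing in for the cache, and the names `chunk_hash`, `get_or_create_embedding`, and `fake_embed`. The snippet under review does not say which hash function or storage RavenDB actually uses.

```python
import hashlib
from typing import Callable, Dict, List


def chunk_hash(text_chunk: str) -> str:
    # Assumption: SHA-256 over the UTF-8 bytes of the chunk.
    # The PR does not state which hash function is really used.
    return hashlib.sha256(text_chunk.encode("utf-8")).hexdigest()


def get_or_create_embedding(
    text_chunk: str,
    cache: Dict[str, List[float]],
    embed: Callable[[str], List[float]],
) -> List[float]:
    # Look the chunk up by its hash; call the provider only on a miss.
    key = chunk_hash(text_chunk)
    if key not in cache:
        cache[key] = embed(text_chunk)
    return cache[key]


# Hypothetical stand-in for an embeddings provider; records every call.
calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    calls.append(text)
    return [float(len(text))]


cache: Dict[str, List[float]] = {}
first = get_or_create_embedding("hello world", cache, fake_embed)
second = get_or_create_embedding("hello world", cache, fake_embed)
```

In this sketch the second call is served from the dictionary, so `fake_embed` runs only once for identical chunk text.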

Allowed characters: only lowercase letters (a-z), numbers (0-9), and hyphens (-).
See how this identifier is used in the [Embeddings collection](../../ai-integration/generating-embeddings/embedding-collections#the-embeddings-collection)
documents that reference the generated embeddings.
Same as above, maybe also reference indexing and querying, since those are places where a user needs to reference the embeddings generation task via its identifier.
I've added links here, as you suggested.
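The allowed-characters rule quoted in the two identifier snippets (lowercase letters, digits, and hyphens) could be checked with a short Python sketch like the one below. The function name `is_valid_identifier` is hypothetical, and the sketch assumes the character set is the only constraint; any further rules RavenDB may apply (length limits, leading or trailing hyphens) are not covered by the quoted text.

```python
import re

# Only lowercase letters (a-z), numbers (0-9), and hyphens (-), per the quoted rule.
IDENTIFIER_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_identifier(identifier: str) -> bool:
    # fullmatch requires the whole string to consist of allowed characters.
    return IDENTIFIER_PATTERN.fullmatch(identifier) is not None


ok = is_valid_identifier("my-connection-string-to-google-ai")
bad = is_valid_identifier("My Connection String")
```

Here `ok` is true for the identifier shown in the snippet, while `bad` is false because of the uppercase letters and spaces.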
* The text is sent to the providers in batches. The batch size is configurable, see the
[Ai.Embeddings.Generation.Task.MaxBatchSize](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxbatchsize) configuration key.
* A failed embeddings generation task will retry after the duration set in the
[Ai.Embeddings.Generation.Task.RetryDelayInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrydelayinsec) configuration key.
The above might get changed during the refactoring work that is WIP
* [Ai.Embeddings.Generation.Task.MaxBatchSize](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxbatchsize)
* [Ai.Embeddings.Generation.Task.MaxFallbackTimeInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.maxfallbacktimeinsec)
* [Ai.Embeddings.Generation.Task.RetryDelayInSec](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrydelayinsec)
* [Ai.Embeddings.Generation.Task.RetryStrategy](../../server/configuration/ai-integration-configuration#ai.embeddings.generation.task.retrystrategy)
Those might get changed during the current refactoring work
---

* In this page:
- * In this page:
+ * On this page:
Right, we should consistently use either:
- On this page, or
- In this article
IMHO - we should go with "In this article" (similar to Microsoft’s docs)
and that fix should be applied across ALL articles in the documentation.
For now, I’ll apply this change to all articles in the "AI Integration" section.
* This article explains how to define a connection string to the [Azure OpenAI Service](https://azure.microsoft.com/en-us/products/ai-services/openai-service),
enabling RavenDB to seamlessly integrate its [embeddings generation tasks](../../ai-integration/generating-embeddings/overview) with your Azure environment.

* In this page:
- * In this page:
+ * On this page:
This model is embedded within RavenDB, enabling RavenDB to seamlessly handle its
[embeddings generation tasks](../../ai-integration/generating-embeddings/overview) without requiring an external AI service.

* In this page:
- * In this page:
+ * On this page:

* This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.

* In this page:
- * In this page:
+ * On this page:
* This article explains how to define a connection string to [Google AI](https://ai.google.dev/gemini-api/docs/embeddings),
enabling RavenDB to seamlessly integrate its [embeddings generation tasks](../../ai-integration/generating-embeddings/overview) with Google's AI services.

* This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.
- * This configuration supports **Google AI embeddings** only and is Not compatible with Vertex AI.
+ * This configuration supports **Google AI provider** only and is not compatible with Vertex AI.
done
public class WhatAreOperations {

    private interface IFoo<TResult, TEntity> {
        /*
Should this commented-out block remain here?
The Java articles will be revised/updated in a separate issue
GetClientConfigurationOperation.Result result
    = store.maintenance().send(new GetClientConfigurationOperation());
- GetClientConfigurationOperation.Result result
-     = store.maintenance().send(new GetClientConfigurationOperation());
+ GetClientConfigurationOperation.Result result =
+     store.maintenance().send(new GetClientConfigurationOperation());

//region wait_kill_syntax
await waitForCompletion();
await kill()
- await kill()
+ await kill();
Just for consistency with the rest of the examples :)
done
/**
 * Wait for operation completion.
 *
 * It throws TimoutException if $duration is set and operation execution time elapses duration interval.
-  * It throws TimoutException if $duration is set and operation execution time elapses duration interval.
+  * It throws TimeoutException if $duration is set and operation execution time elapses duration interval.
And do we need this commented-out block?
done, and yes (it shows in the article)
delete_by_query_op = DeleteByQueryOperation("from Products where Discontinued = true")

# Execute the operation
# Send returns an 'Operation' object that can be 'killed'
- # Send returns an 'Operation' object that can be 'killed'
+ # The send_async method returns an 'Operation' object that can be 'killed'
done
Wooow! Absolutely massive amount of work. It took me 4.5 hours to review — writing all this must’ve been a Herculean task. Huge respect!!!! |
710596a to 1395741
…s that have an OpenAI-compatible API....)

* The server creates the following dedicated collections,
which contain documents that reference the embedding attachments:
* **Embeddings Collection**
This indicates that there will be only one `Embeddings Collection`; however, there is one per collection for which the user created a task.
fixed
* For example:
In this [task definition](../../ai-integration/generating-embeddings/embeddings-generation-task#configuring-an-embeddings-generation-task---from-the-studio),
an embeddings generation task is defined on the `Categories` collection.
This creates the `Categories/embeddings` collection, where a document will look as follows:



1. **Collection name**
The unique name of the embeddings collection: `Categories/embeddings`.
2. **Document ID**
Each document ID in this collection follows the format: `<source-document-name>/embeddings`
3. **Task identifier**
The identifier of the task that generated the embeddings for the listed properties.
4. **Source properties & their hash**:
This section contains properties from the source document whose content was converted into embeddings.
Each property contains an array of hashes derived from text chunks created from the property's content:
`<property-name>: [<hash-created-from-text-chunk-1>, <hash-created-from-text-chunk-2>, ...]`
5. **Attachment flag**
Indicates that the document includes attachments, which store the embeddings.
The next image shows the embedding attachments in the document's properties pane.



* Each attachment contains a single embedding.

* The attachment name follows this format:
`<task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>`

* If the embeddings were [Quantized](../../ai-integration/vector-search/vector-search-using-dynamic-query#what-is-quantization) by the task,
the format includes the quantization type:
`<task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>_<quantization-type>`
This needs to be adjusted soon.
fixed
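For readers following this thread, the attachment naming format quoted in the snippet above can be sketched in Python as follows. The function `attachment_name` and all sample values (`my-task`, `Name`, `abc123`, `Int8`) are hypothetical, and the format itself is the one shown at review time, which the reviewer notes may still be adjusted.

```python
from typing import Optional


def attachment_name(
    task_identifier: str,
    property_name: str,
    chunk_hash: str,
    quantization_type: Optional[str] = None,
) -> str:
    # <task-identifier>_<property-name>_<hash-of-text-chunk-from-property-content>
    parts = [task_identifier, property_name, chunk_hash]
    # The quantization type is appended only when the task quantized the embeddings.
    if quantization_type is not None:
        parts.append(quantization_type)
    return "_".join(parts)


plain = attachment_name("my-task", "Name", "abc123")
quantized = attachment_name("my-task", "Name", "abc123", "Int8")
```

With these placeholder inputs, `plain` is `my-task_Name_abc123` and `quantized` is `my-task_Name_abc123_Int8`.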
Once the expiration date is reached, the document is automatically deleted (provided that [document expiration](../../studio/database/settings/document-expiration) is enabled).

* When a source document (from which embeddings were generated) is modified,
RavenDB extends the expiration date of the relevant document in this cache if the remaining time is less than one-third of the original duration.
It has been changed. It is now half of the original duration.
fixed
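The expiration rule discussed in this thread, using the reviewer's corrected threshold of half the original duration, can be illustrated with a small Python sketch. The function name `maybe_extend_expiration` is hypothetical, and the new expiration value (`now + original_duration`) is an assumption, since the thread says only that the date is extended, not by how much.

```python
from datetime import datetime, timedelta


def maybe_extend_expiration(
    expires_at: datetime,
    now: datetime,
    original_duration: timedelta,
    threshold_fraction: float = 0.5,  # one half, per the review comment
) -> datetime:
    remaining = expires_at - now
    # Extend only when less than the threshold share of the original duration is left.
    if remaining < original_duration * threshold_fraction:
        # Assumption: the new expiration is a full duration counted from now.
        return now + original_duration
    return expires_at


now = datetime(2025, 1, 1)
duration = timedelta(days=90)
extended = maybe_extend_expiration(now + timedelta(days=30), now, duration)
unchanged = maybe_extend_expiration(now + timedelta(days=60), now, duration)
```

In this example 30 remaining days is below the 45-day threshold, so the date is pushed out, while 60 remaining days leaves it untouched.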
…t code + Update images + Small fixes to flow charts + Fix to configuration + Updated the expiration policy
512ac71 to 6c3028b
Related issue:
https://issues.hibernatingrhinos.com/issue/RDoc-3254/AI-Integration-Document-Embeddings-Generation-via-Tasks