
Commit d528e8d

atroyn and csvoss authored

Add Chroma (#232)

* First draft of ChromaDataStore
* Add embeddings to the invocation of collection.add
* Resolve uncertainty over embedding_function by using outer embeddings
* Add CHROMA_IN_MEMORY config variable
* Update poetry.lock
* Fix import error
* Fix default collection name to pass validation
* Add empty scaffolding for integration tests
* Fix error: add should return document ids
* Add first pass at integration tests using same fixtures as from Zilliz integration tests
* Fix created_at handling and pass test test_query_filter
* Fix more tests
* Fix host/port initialization
* Fix NotEnoughElementsException
* Fix filter handling for source
* Fix deletion tests
* Change defaults, upsert method
* Clients, metadata handling
* Update tests
* Docstrings
* Update README
* Updated tests and docs
* Updated poetry.lock
* Updated tests
* Cleanup after tests
* Fix docs path
* Remove embeddings return
* Reference

Co-authored-by: Chelsea Voss <[email protected]>

1 parent 51579b6 commit d528e8d

File tree

7 files changed: +2425 −597 lines changed

README.md

Lines changed: 16 additions & 7 deletions

@@ -35,6 +35,7 @@ This README provides detailed information on how to set up, develop, and deploy
 - [Setup](#setup)
 - [General Environment Variables](#general-environment-variables)
 - [Choosing a Vector Database](#choosing-a-vector-database)
+- [Chroma](#chroma)
 - [Pinecone](#pinecone)
 - [Weaviate](#weaviate)
 - [Zilliz](#zilliz)

@@ -73,7 +74,7 @@ Follow these steps to quickly set up and run the ChatGPT Retrieval Plugin:
 export OPENAI_API_KEY=<your_openai_api_key>

 # Optional environment variables used when running Azure OpenAI
 export OPENAI_API_BASE=https://<AzureOpenAIName>.openai.azure.com/
 export OPENAI_API_TYPE=azure
 export OPENAI_EMBEDDINGMODEL_DEPLOYMENTID=<Name of text-embedding-ada-002 model deployment>
 export OPENAI_METADATA_EXTRACTIONMODEL_DEPLOYMENTID=<Name of deployment of model for metadata>

@@ -239,27 +240,29 @@ The API requires the following environment variables to work:
 | Name | Required | Description |
 | --- | --- | --- |
-| `DATASTORE` | Yes | This specifies the vector database provider you want to use to store and query embeddings. You can choose from `pinecone`, `weaviate`, `zilliz`, `milvus`, `qdrant`, or `redis`. |
+| `DATASTORE` | Yes | This specifies the vector database provider you want to use to store and query embeddings. You can choose from `chroma`, `pinecone`, `weaviate`, `zilliz`, `milvus`, `qdrant`, or `redis`. |
 | `BEARER_TOKEN` | Yes | This is a secret token that you need to authenticate your requests to the API. You can generate one using any tool or method you prefer, such as [jwt.io](https://jwt.io/). |
 | `OPENAI_API_KEY` | Yes | This is your OpenAI API key that you need to generate embeddings using the `text-embedding-ada-002` model. You can get an API key by creating an account on [OpenAI](https://openai.com/). |

 ### Using the plugin with Azure OpenAI

 Azure OpenAI uses URLs that are specific to your resource and references models not by model name but by deployment id. As a result, you need to set additional environment variables for this case.

 In addition to the OPENAI_API_BASE (your specific URL) and OPENAI_API_TYPE (azure), you should also set OPENAI_EMBEDDINGMODEL_DEPLOYMENTID, which specifies the model to use for getting embeddings on upsert and query. For this, we recommend deploying the text-embedding-ada-002 model and using the deployment name here.

 If you wish to use the data preparation scripts, you will also need to set OPENAI_METADATA_EXTRACTIONMODEL_DEPLOYMENTID, used for metadata extraction, and
 OPENAI_COMPLETIONMODEL_DEPLOYMENTID, used for PII handling.

 ### Choosing a Vector Database

 The plugin supports several vector database providers, each with different features, performance, and pricing. Depending on which one you choose, you will need to use a different Dockerfile and set different environment variables. The following sections provide brief introductions to each vector database provider.

 For more detailed instructions on setting up and using each vector database provider, please refer to the respective documentation in the `/docs/providers/<datastore_name>/setup.md` file ([folders here](/docs/providers)).

+#### Chroma
+
+[Chroma](https://trychroma.com) is an AI-native open-source embedding database designed to make getting started as easy as possible. Chroma runs in-memory, or in a client-server setup. It supports metadata and keyword filtering out of the box. For detailed instructions, refer to [`/docs/providers/chroma/setup.md`](/docs/providers/chroma/setup.md).
+
 #### Pinecone

 [Pinecone](https://www.pinecone.io) is a managed vector database designed for speed, scale, and rapid deployment to production. It supports hybrid search and is currently the only datastore to natively support SPLADE sparse vectors. For detailed setup instructions, refer to [`/docs/providers/pinecone/setup.md`](/docs/providers/pinecone/setup.md).

@@ -367,7 +370,13 @@ Before deploying your app, you might want to remove unused dependencies from your
 Once you have deployed your app, consider uploading an initial batch of documents using one of [these scripts](/scripts) or by calling the `/upsert` endpoint.

-Here are detailed deployment instructions for various platforms:
+- **Chroma:** Remove `pinecone-client`, `weaviate-client`, `pymilvus`, `qdrant-client`, and `redis`.
+- **Pinecone:** Remove `chromadb`, `weaviate-client`, `pymilvus`, `qdrant-client`, and `redis`.
+- **Weaviate:** Remove `chromadb`, `pinecone-client`, `pymilvus`, `qdrant-client`, and `redis`.
+- **Zilliz:** Remove `chromadb`, `pinecone-client`, `weaviate-client`, `qdrant-client`, and `redis`.
+- **Milvus:** Remove `chromadb`, `pinecone-client`, `weaviate-client`, `qdrant-client`, and `redis`.
+- **Qdrant:** Remove `chromadb`, `pinecone-client`, `weaviate-client`, `pymilvus`, and `redis`.
+- **Redis:** Remove `chromadb`, `pinecone-client`, `weaviate-client`, `pymilvus`, and `qdrant-client`.

 - [Deploying to Fly.io](/docs/deployment/flyio.md)
 - [Deploying to Heroku](/docs/deployment/heroku.md)

datastore/factory.py

Lines changed: 4 additions & 0 deletions

@@ -7,6 +7,10 @@ async def get_datastore() -> DataStore:
     assert datastore is not None

     match datastore:
+        case "chroma":
+            from datastore.providers.chroma_datastore import ChromaDataStore
+
+            return ChromaDataStore()
         case "llama":
             from datastore.providers.llama_datastore import LlamaDataStore
             return LlamaDataStore()
datastore/providers/chroma_datastore.py

Lines changed: 249 additions & 0 deletions

"""
Chroma datastore support for the ChatGPT retrieval plugin.

Consult the Chroma docs and GitHub repo for more information:
- https://docs.trychroma.com/usage-guide?lang=py
- https://github.com/chroma-core/chroma
- https://www.trychroma.com/
"""

import os
from datetime import datetime
from typing import Dict, List, Optional

import chromadb

from datastore.datastore import DataStore
from models.models import (
    Document,
    DocumentChunk,
    DocumentChunkMetadata,
    DocumentChunkWithScore,
    DocumentMetadataFilter,
    QueryResult,
    QueryWithEmbedding,
    Source,
)
from services.chunks import get_document_chunks

# Environment variables are strings, so compare against "True" explicitly;
# a plain bool() cast would treat the string "False" as truthy.
CHROMA_IN_MEMORY = os.environ.get("CHROMA_IN_MEMORY", "True") == "True"
CHROMA_PERSISTENCE_DIR = os.environ.get("CHROMA_PERSISTENCE_DIR", "openai")
CHROMA_HOST = os.environ.get("CHROMA_HOST", "http://127.0.0.1")
CHROMA_PORT = os.environ.get("CHROMA_PORT", "8000")
CHROMA_COLLECTION = os.environ.get("CHROMA_COLLECTION", "openaiembeddings")


class ChromaDataStore(DataStore):
    def __init__(
        self,
        in_memory: bool = CHROMA_IN_MEMORY,
        persistence_dir: Optional[str] = CHROMA_PERSISTENCE_DIR,
        collection_name: str = CHROMA_COLLECTION,
        host: str = CHROMA_HOST,
        port: str = CHROMA_PORT,
        client: Optional[chromadb.Client] = None,
    ):
        if client:
            self._client = client
        else:
            if in_memory:
                settings = (
                    chromadb.config.Settings(
                        chroma_db_impl="duckdb+parquet",
                        persist_directory=persistence_dir,
                    )
                    if persistence_dir
                    else chromadb.config.Settings()
                )

                self._client = chromadb.Client(settings=settings)
            else:
                self._client = chromadb.Client(
                    settings=chromadb.config.Settings(
                        chroma_api_impl="rest",
                        chroma_server_host=host,
                        chroma_server_http_port=port,
                    )
                )
        self._collection = self._client.get_or_create_collection(
            name=collection_name,
            embedding_function=None,
        )

    async def upsert(
        self, documents: List[Document], chunk_token_size: Optional[int] = None
    ) -> List[str]:
        """
        Takes in a list of documents and inserts them into the database. If an id already exists, the document is updated.
        Returns a list of document ids.
        """

        chunks = get_document_chunks(documents, chunk_token_size)

        # Chroma has a true upsert, so we don't need to delete first
        return await self._upsert(chunks)

    async def _upsert(self, chunks: Dict[str, List[DocumentChunk]]) -> List[str]:
        """
        Takes in a dict mapping document ids to lists of document chunks and inserts them into the database.
        Returns a list of document ids.
        """

        self._collection.upsert(
            ids=[chunk.id for chunk_list in chunks.values() for chunk in chunk_list],
            embeddings=[
                chunk.embedding
                for chunk_list in chunks.values()
                for chunk in chunk_list
            ],
            documents=[
                chunk.text for chunk_list in chunks.values() for chunk in chunk_list
            ],
            metadatas=[
                self._process_metadata_for_storage(chunk.metadata)
                for chunk_list in chunks.values()
                for chunk in chunk_list
            ],
        )
        return list(chunks.keys())

    def _where_from_query_filter(self, query_filter: DocumentMetadataFilter) -> Dict:
        output = {
            k: v
            for (k, v) in query_filter.dict().items()
            if v is not None and k != "start_date" and k != "end_date" and k != "source"
        }
        if query_filter.source:
            output["source"] = query_filter.source.value
        if query_filter.start_date and query_filter.end_date:
            output["$and"] = [
                {
                    "created_at": {
                        "$gte": int(
                            datetime.fromisoformat(query_filter.start_date).timestamp()
                        )
                    }
                },
                {
                    "created_at": {
                        "$lte": int(
                            datetime.fromisoformat(query_filter.end_date).timestamp()
                        )
                    }
                },
            ]
        elif query_filter.start_date:
            output["created_at"] = {
                "$gte": int(datetime.fromisoformat(query_filter.start_date).timestamp())
            }
        elif query_filter.end_date:
            output["created_at"] = {
                "$lte": int(datetime.fromisoformat(query_filter.end_date).timestamp())
            }

        return output

    def _process_metadata_for_storage(self, metadata: DocumentChunkMetadata) -> Dict:
        stored_metadata = {}
        if metadata.source:
            stored_metadata["source"] = metadata.source.value
        if metadata.source_id:
            stored_metadata["source_id"] = metadata.source_id
        if metadata.url:
            stored_metadata["url"] = metadata.url
        if metadata.created_at:
            stored_metadata["created_at"] = int(
                datetime.fromisoformat(metadata.created_at).timestamp()
            )
        if metadata.author:
            stored_metadata["author"] = metadata.author
        if metadata.document_id:
            stored_metadata["document_id"] = metadata.document_id

        return stored_metadata

    def _process_metadata_from_storage(self, metadata: Dict) -> DocumentChunkMetadata:
        return DocumentChunkMetadata(
            source=Source(metadata["source"]) if "source" in metadata else None,
            source_id=metadata.get("source_id", None),
            url=metadata.get("url", None),
            created_at=datetime.fromtimestamp(metadata["created_at"]).isoformat()
            if "created_at" in metadata
            else None,
            author=metadata.get("author", None),
            document_id=metadata.get("document_id", None),
        )

    async def _query(self, queries: List[QueryWithEmbedding]) -> List[QueryResult]:
        """
        Takes in a list of queries with embeddings and filters and returns a list of query results with matching document chunks and scores.
        """
        results = [
            self._collection.query(
                query_embeddings=[query.embedding],
                include=["documents", "distances", "metadatas"],  # embeddings
                n_results=min(query.top_k, self._collection.count()),
                where=(
                    self._where_from_query_filter(query.filter) if query.filter else {}
                ),
            )
            for query in queries
        ]

        output = []
        for query, result in zip(queries, results):
            inner_results = []
            (ids,) = result["ids"]
            # (embeddings,) = result["embeddings"]
            (documents,) = result["documents"]
            (metadatas,) = result["metadatas"]
            (distances,) = result["distances"]
            for id_, text, metadata, distance in zip(
                ids,
                documents,
                metadatas,
                distances,  # embeddings (https://github.com/openai/chatgpt-retrieval-plugin/pull/59#discussion_r1154985153)
            ):
                inner_results.append(
                    DocumentChunkWithScore(
                        id=id_,
                        text=text,
                        metadata=self._process_metadata_from_storage(metadata),
                        # embedding=embedding,
                        score=distance,
                    )
                )
            output.append(QueryResult(query=query.query, results=inner_results))

        return output

    async def delete(
        self,
        ids: Optional[List[str]] = None,
        filter: Optional[DocumentMetadataFilter] = None,
        delete_all: Optional[bool] = None,
    ) -> bool:
        """
        Removes vectors by ids, filter, or everything in the datastore.
        Multiple parameters can be used at once.
        Returns whether the operation was successful.
        """
        if delete_all:
            self._collection.delete()
            return True

        if ids and len(ids) > 0:
            if len(ids) > 1:
                where_clause = {"$or": [{"document_id": id_} for id_ in ids]}
            else:
                (id_,) = ids
                where_clause = {"document_id": id_}

            if filter:
                where_clause = {
                    "$and": [self._where_from_query_filter(filter), where_clause]
                }
        elif filter:
            where_clause = self._where_from_query_filter(filter)

        self._collection.delete(where=where_clause)
        return True
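The date-range handling in `_where_from_query_filter` above can be sketched as a standalone function, decoupled from the plugin's `DocumentMetadataFilter` model (the function name and plain-string parameters here are illustrative, not part of the plugin's API). ISO 8601 dates become integer Unix timestamps, and a range on the same `created_at` key is wrapped in an explicit `$and`:

```python
from datetime import datetime
from typing import Optional


def where_from_filter(
    source: Optional[str] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> dict:
    """Build a Chroma-style `where` clause from a source and an ISO date range."""
    output: dict = {}
    if source:
        output["source"] = source
    if start_date and end_date:
        # Two conditions on the same key need an explicit $and wrapper.
        output["$and"] = [
            {"created_at": {"$gte": int(datetime.fromisoformat(start_date).timestamp())}},
            {"created_at": {"$lte": int(datetime.fromisoformat(end_date).timestamp())}},
        ]
    elif start_date:
        output["created_at"] = {"$gte": int(datetime.fromisoformat(start_date).timestamp())}
    elif end_date:
        output["created_at"] = {"$lte": int(datetime.fromisoformat(end_date).timestamp())}
    return output
```

For example, `where_from_filter(source="file", start_date="2023-01-01T00:00:00", end_date="2023-02-01T00:00:00")` yields a clause with a `source` equality match plus an `$and` of two `created_at` bounds.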

docs/providers/chroma/setup.md

Lines changed: 29 additions & 0 deletions

[Chroma](https://trychroma.com) is an AI-native open-source embedding database designed to make it easy to work with embeddings. Chroma runs in-memory, or in a client-server setup.

Install Chroma by running `pip install chromadb`. Once installed, the core API consists of four essential commands: creating collections; adding embeddings, documents, and metadata; and querying embeddings to find similar documents. Get started with Chroma by visiting the [Getting Started](https://docs.trychroma.com) page on their documentation website, or explore the open-source code on their [GitHub repository](https://github.com/chroma-core/chroma).

**Chroma Environment Variables**

To set up Chroma and start using it as your vector database provider, you need to define some environment variables to connect to your Chroma instance.

**Chroma Datastore Environment Variables**

Chroma runs _in-memory_ by default, with local persistence. It can also run in [self-hosted](https://docs.trychroma.com/usage-guide#running-chroma-in-clientserver-mode) client-server mode, with a fully managed hosted version coming soon.

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| `DATASTORE` | Yes | Datastore name. Set this to `chroma` | |
| `BEARER_TOKEN` | Yes | Your secret token for authenticating requests to the API | |
| `OPENAI_API_KEY` | Yes | Your OpenAI API key for generating embeddings | |
| `CHROMA_COLLECTION` | Optional | Your chosen Chroma collection name to store your embeddings | `openaiembeddings` |
| `CHROMA_IN_MEMORY` | Optional | If set to `True`, ignore `CHROMA_HOST` and `CHROMA_PORT` and just use an in-memory Chroma instance | `True` |
| `CHROMA_PERSISTENCE_DIR` | Optional | If set, and `CHROMA_IN_MEMORY` is set, persist to and load from this directory | `openai` |

To run Chroma in self-hosted client-server mode, set the following variables:

| Name | Required | Description | Default |
| --- | --- | --- | --- |
| `CHROMA_HOST` | Optional | Your Chroma instance host address (see notes below) | `http://127.0.0.1` |
| `CHROMA_PORT` | Optional | Your Chroma port number | `8000` |

> For **self-hosted instances**, if your instance is not at 127.0.0.1:8000, set `CHROMA_HOST` and `CHROMA_PORT` accordingly. For example: `CHROMA_HOST=http://localhost/` and `CHROMA_PORT=8080`.