Adding lists to the metadata #227
Comments
@Everminds hello! Is this information that you store outside of chroma as well? If so, I have another idea for a solution here. |
We can save it outside though it would be less convenient |
@jeffchuber any updates on this one? |
I would vote for this. It would be very useful if it supported lists directly, so we wouldn't need a third tool to retrieve all the vectors and compare again. It would be helpful for scenarios where we have a doc describing a thing but with different versions, models, etc. |
Thanks @jeffchuber , I guess not just IDs, widening it to metadata would be great! To clarify it:
I understand this would mean a lot of effort, but see below for how it helps: An example scenario: Similarly, you may change the "web page" to "products" of an online shopping site, we normally filter things with many options like price, category, shipping preference, seller, etc. We want to get a similar result by the product detail(content), and we also want to filter it using things we are familiar with so that we can make it more efficient. |
@8rV1n chroma has this :) though we currently do a bad job communicating it chroma/chromadb/test/test_api.py Line 1050 in a563700
look inside that test folder and you will see examples of all of these. The where filter in |
Thanks @jeffchuber! Any idea for using metadata like this? (adding, and querying)
collection.add(
    documents=["Alice meets rabbits...", "doc2", "doc3", ...],
    metadatas=[{"charactor_roles": ['Alice', 'rabbits']}, {"charactor_roles": ['Steve Jobs', 'Tim Cook']}, {"charactor_roles": []}, ...],
    ids=["id1", "id2", "id3", ...]
)
It seems I can do this for the metadata when creating the collection:
client.create_collection(
    "my_collection",
    metadata={"foo": ["bar", "bar2"]}
) |
+1, this would be incredibly useful for not needing a secondary datastore just to be able to attach lists to documents |
Happy to take a stab at this. If I'm understanding correctly, this would mean extending the Metadata type:
-Metadata = Mapping[str, Union[str, int, float]]
+Metadata = Mapping[str, Union[str, int, float, List[Union[str, int, float]]]]
so that lists can be added as a value in metadata:
collection.add(ids=['test'], documents=['test'], metadatas=[{ 'list': [1, 2, 3] }])
The biggest source of uphill work, I think, would be adding support for lists to the Where filter operators.
EDIT: Should we re-use existing operators and make them work for lists?
collection.get(where={ "list": { "$eq": 2 } })
or create new operators for lists?
collection.get(where={"list": { "$contains": 2 } }) |
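Widening the type also means the metadata validators have to accept lists. A minimal sketch of what such a check might look like follows; `validate_metadata` is a hypothetical helper, not Chroma's actual validator, and the requirement that a list holds only one element type is an assumption made here for illustration (the proposed type alias itself would allow mixed lists).

```python
from typing import List, Mapping, Union

Scalar = Union[str, int, float]
# Widened alias from the proposal: a value is a scalar or a list of scalars.
Metadata = Mapping[str, Union[Scalar, List[Scalar]]]


def validate_metadata(metadata: Metadata) -> Metadata:
    """Hypothetical validator: accept scalars and homogeneous lists of scalars."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise ValueError(f"Expected metadata key to be a str, got {key!r}")
        if isinstance(value, list):
            # Compare exact element types so that a list stays homogeneous
            # (assumption: mixed lists are rejected in this sketch).
            kinds = {type(item) for item in value}
            if not kinds <= {str, int, float}:
                raise ValueError(f"Unsupported list element in {key!r}: {value!r}")
            if len(kinds) > 1:
                raise ValueError(f"Mixed element types in {key!r}: {value!r}")
        elif not isinstance(value, (str, int, float)):
            raise ValueError(f"Unsupported metadata value for {key!r}: {value!r}")
    return metadata


validate_metadata({"list": [1, 2, 3], "title": "test"})  # passes
```

An empty list passes this check, which matches the `{"charactor_roles": []}` case raised earlier in the thread.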
@Russell-Pollari yes that is correct! Where operator support is definitely the biggest lift here. I think |
IMO I think it would be better UX to have But I'm definitely pattern matching to MongoDB's query operators here. This is how they do it: I managed to get a working prototype for filtering arrays with
# Shortcut for $eq
if type(value) == str:
    result.append(
        f""" (
            json_extract_string(metadata, '$.{key}') = '{value}'
            OR
            json_contains(json_extract(metadata, '$.{key}'), '\"{value}\"')
        )
        """
    )
if type(value) == int:
    result.append(
        f""" (
            CASE WHEN json_type(json_extract(metadata, '$.{key}')) = 'ARRAY'
            THEN
                list_has(CAST(json_extract(metadata, '$.{key}') AS INT[]), {value})
            ELSE
                CAST(json_extract(metadata, '$.{key}') AS INT) = {value}
            END
        )
        """
    )
if type(value) == float:
    result.append(
        f""" (
            CASE WHEN json_type(json_extract(metadata, '$.{key}')) = 'ARRAY'
            THEN
                list_has(CAST(json_extract(metadata, '$.{key}') AS DOUBLE[]), {value})
            ELSE
                CAST(json_extract(metadata, '$.{key}') AS DOUBLE) = {value}
            END
        )
        """
    ) |
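The prototype above gives `$eq` MongoDB-like semantics: a filter matches when the stored value equals the operand, or when the stored value is a list that contains it. A small in-memory sketch of that rule, independent of DuckDB, is shown below; `matches_eq` is a hypothetical helper and not part of Chroma.

```python
from typing import Any, Dict, List


def matches_eq(metadata: Dict[str, Any], key: str, value: Any) -> bool:
    """Return True if metadata[key] equals value, or is a list containing it."""
    if key not in metadata:
        return False
    stored = metadata[key]
    if isinstance(stored, list):
        # List case: any element equal to the operand is a match.
        return value in stored
    # Scalar case: plain equality.
    return stored == value


records: List[Dict[str, Any]] = [
    {"id": "id1", "tags": ["iot", "business"]},
    {"id": "id2", "tags": "iot"},
    {"id": "id3", "tags": ["finance"]},
]
hits = [r["id"] for r in records if matches_eq(r, "tags", "iot")]
print(hits)  # ['id1', 'id2']
```

Unlike the SQL prototype, this sketch uses Python equality, so `1` and `1.0` compare equal; the SQL version casts by operand type instead.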
@Russell-Pollari indexing against how mongo does it is definitely a good idea! @HammadB what do you think? |
Threw up a PR, let me know what you think! If my solution works for y'all, happy to also update the JS client and the docs |
@Russell-Pollari thanks! will take a look today :) |
Hey, I'm also interested in using this functionality, I have documents with a bunch of possible tags as metadata, for example
If I could use the $contains operator I could filter for specific tags. Right now I'm trying to turn all the tags into binary values, but I think that's breaking chroma somehow |
:( can you share more about what is breaking? this should work. are they |
Hey, I wasn't sure it could handle booleans or ints, so I ended up turning them into strings '0'/'1'. The error I got was from clickhouse (I'm using it with a chroma server). I think it was related to the size of the query being too big, as I also have a cloud server where I got a 413 error. |
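For readers trying the same workaround, here is a minimal sketch of the one-key-per-tag encoding using integer flags and the existing `$eq` / `$and` filter syntax. The helper names and the `tag_` prefix are hypothetical, and only tags that are present get a key, which is an assumption made here to keep the metadata small.

```python
from typing import Any, Dict, Iterable, List


def tags_to_flags(tags: Iterable[str], prefix: str = "tag_") -> Dict[str, int]:
    """Flatten a tag list into one integer metadata key per tag that is present."""
    return {f"{prefix}{tag}": 1 for tag in tags}


def where_all_tags(tags: List[str], prefix: str = "tag_") -> Dict[str, Any]:
    """Build a where filter that requires every tag in `tags` to be set."""
    if not tags:
        raise ValueError("at least one tag is required")
    clauses = [{f"{prefix}{tag}": {"$eq": 1}} for tag in tags]
    # A single condition needs no logical operator around it.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


metadata = tags_to_flags(["iot", "business"])
where = where_all_tags(["iot", "business"])
print(metadata)
print(where)
```

The resulting `metadata` dict would be passed in `metadatas=[...]` to `collection.add`, and `where` to `collection.get(where=...)`.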
@tyatabe gotcha. there was a |
Exploring the new SQLite implementation. My naive approach would look something like this, having tables for
 def _insert_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> None:
     """Insert or update each metadata row for a single embedding record"""
-    t = Table("embedding_metadata")
+    t, str_list, int_list, float_list = Tables(
+        "embedding_metadata",
+        "embedding_metadata_string",
+        "embedding_metadata_int",
+        "embedding_metadata_float",
+    )
     q = (
         self._db.querybuilder()
         .into(t)
         .columns(t.id, t.key, t.string_value, t.int_value, t.float_value)
     )
     for key, value in metadata.items():
+        if isinstance(value, list):
+            if isinstance(value[0], str):
+                for val in value:
+                    q_str = (
+                        self._db.querybuilder()
+                        .into(str_list)
+                        .columns(str_list.metadata_id, str_list.value)
+                        .insert(ParameterValue(id), ParameterValue(val))
+                    )
+            if isinstance(value[0], int):
+                for val in value:
+                    q_int = (
+                        self._db.querybuilder()
+                        .into(int_list)
+                        .columns(int_list.metadata_id, int_list.value)
+                        .insert(ParameterValue(id), ParameterValue(val))
+                    )
+            if isinstance(value[0], float):
+                for val in value:
+                    q_float = (
+                        self._db.querybuilder()
+                        .into(float_list)
+                        .columns(float_list.metadata_id, float_list.value)
+                        .insert(ParameterValue(id), ParameterValue(val))
+                    )
         if isinstance(value, str):
             ...
             q = q.insert(
                 ParameterValue(id),
Does this make sense? @jeffchuber @HammadB |
Update: got a hacky prototype for (branched off of #781 for my working dir)
Migration for new table:
CREATE TABLE embedding_metadata_ints (
    id INTEGER REFERENCES embeddings(id),
    key TEXT REFERENCES embedding_metadata(key),
    int_value INTEGER NOT NULL
);
Inserting metadata with list
def _insert_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> None:
    """Insert or update each metadata row for a single embedding record"""
    (
        t,
        int_t,
    ) = Tables(
        "embedding_metadata",
        "embedding_metadata_ints",
    )
    q = (
        self._db.querybuilder()
        .into(t)
        .columns(t.id, t.key, t.string_value, t.int_value, t.float_value)
    )
    for key, value in metadata.items():
        if isinstance(value, list):
            q = q.insert(
                ParameterValue(id),
                ParameterValue(key),
                None,
                None,
                None,
            )
            if isinstance(value[0], int):
                q_int = (
                    self._db.querybuilder()
                    .into(int_t)
                    .columns(int_t.id, int_t.key, int_t.int_value)
                )
                for val in value:
                    q_int = q_int.insert(
                        ParameterValue(id), ParameterValue(key), ParameterValue(val)
                    )
                sql, params = get_sql(q_int)
                sql = sql.replace("INSERT", "INSERT OR REPLACE")
                if sql:
                    cur.execute(sql, params)
        if isinstance(value, str):
            ...
Querying for list of ints (
def get_metadata
    ....
    embeddings_t, metadata_t, fulltext_t, int_t = Tables(
        "embeddings",
        "embedding_metadata",
        "embedding_fulltext",
        "embedding_metadata_ints",
    )
    q = (
        (
            self._db.querybuilder()
            .from_(embeddings_t)
            .left_join(metadata_t)
            .on(embeddings_t.id == metadata_t.id)
            .outer_join(int_t)
            .on((metadata_t.key == int_t.key) & (metadata_t.id == int_t.id))
        )
        .select(
            embeddings_t.id,
            embeddings_t.embedding_id,
            embeddings_t.seq_id,
            metadata_t.key,
            metadata_t.string_value,
            metadata_t.int_value,
            metadata_t.float_value,
            int_t.int_value,
        )
constructing metadata object with list of ints
def _record(self, rows: Sequence[Tuple[Any, ...]]) -> MetadataEmbeddingRecord:
    """Given a list of DB rows with the same ID, construct a
    MetadataEmbeddingRecord"""
    _, embedding_id, seq_id = rows[0][:3]
    metadata = {}
    for row in rows:
        key, string_value, int_value, float_value, int_elem = row[3:]
        if string_value is not None:
            metadata[key] = string_value
        elif int_value is not None:
            metadata[key] = int_value
        elif float_value is not None:
            metadata[key] = float_value
        elif int_elem is not None:
            int_list = metadata.get(key, [])
            int_list.append(int_elem)
            metadata[key] = int_list
Also requires updating the relevant types/validators to allow for lists |
Converging on a solution
Initially, I created tables for each allowed list type (int, str, float). It was working but was getting messy. Ended up using another table with the same schema as
CREATE TABLE embedding_metadata_lists (
    id INTEGER REFERENCES embeddings(id),
    key TEXT REFERENCES embedding_metadata(key),
    string_value TEXT,
    float_value REAL,
    int_value INTEGER
);
@override
def get_metadata(
    self,
    where: Optional[Where] = None,
    where_document: Optional[WhereDocument] = None,
    ids: Optional[Sequence[str]] = None,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> Sequence[MetadataEmbeddingRecord]:
    """Query for embedding metadata."""
    embeddings_t, metadata_t, fulltext_t, metadata_list_t = Tables(
        "embeddings",
        "embedding_metadata",
        "embedding_fulltext",
        "embedding_metadata_lists",
    )
    q = (
        (
            self._db.querybuilder()
            .from_(embeddings_t)
            .left_join(metadata_t)
            .on(embeddings_t.id == metadata_t.id)
            .left_outer_join(metadata_list_t)
            .on(
                (metadata_t.key == metadata_list_t.key)
                & (embeddings_t.id == metadata_list_t.id)
            )
        )
        .select(
            embeddings_t.id,
            embeddings_t.embedding_id,
            embeddings_t.seq_id,
            metadata_t.key,
            metadata_t.string_value,
            metadata_t.int_value,
            metadata_t.float_value,
            metadata_list_t.string_value,
            metadata_list_t.int_value,
            metadata_list_t.float_value,
        )
    ...
If this approach makes sense, can you assign this issue to me, @jeffchuber? I just about have a shippable PR with tests (old and new) passing. |
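To make the side-table idea concrete, here is a self-contained sketch using Python's built-in `sqlite3` module rather than Chroma's query builder. It reuses the `embedding_metadata_lists` column layout from the comment above, but everything else is an assumption for illustration: the `embeddings` table is reduced to two columns, the foreign key on `key` is dropped, and `insert_list`, `ids_containing`, and `load_lists` are hypothetical helpers.

```python
import sqlite3
from collections import defaultdict
from typing import Any, Dict, List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE embeddings (id INTEGER PRIMARY KEY, embedding_id TEXT);
    CREATE TABLE embedding_metadata_lists (
        id INTEGER REFERENCES embeddings(id),
        key TEXT,
        string_value TEXT,
        float_value REAL,
        int_value INTEGER
    );
    """
)


def insert_list(conn: sqlite3.Connection, row_id: int, key: str, values: List[Any]) -> None:
    """Store each list element as its own row, in the column matching its type."""
    rows = [
        (
            row_id,
            key,
            v if isinstance(v, str) else None,
            v if isinstance(v, float) else None,
            v if isinstance(v, int) else None,
        )
        for v in values
    ]
    conn.executemany("INSERT INTO embedding_metadata_lists VALUES (?, ?, ?, ?, ?)", rows)


def ids_containing(conn: sqlite3.Connection, key: str, value: Any) -> List[str]:
    """Return embedding ids whose list under `key` contains `value`."""
    column = {str: "string_value", float: "float_value", int: "int_value"}[type(value)]
    # `column` comes from the fixed mapping above; key and value are bound parameters.
    sql = (
        "SELECT DISTINCT e.embedding_id FROM embeddings e "
        "JOIN embedding_metadata_lists l ON l.id = e.id "
        f"WHERE l.key = ? AND l.{column} = ? ORDER BY e.embedding_id"
    )
    return [row[0] for row in conn.execute(sql, (key, value))]


def load_lists(conn: sqlite3.Connection, row_id: int) -> Dict[str, List[Any]]:
    """Rebuild the list-valued metadata for one embedding row."""
    out: Dict[str, List[Any]] = defaultdict(list)
    cur = conn.execute(
        "SELECT key, string_value, float_value, int_value "
        "FROM embedding_metadata_lists WHERE id = ? ORDER BY rowid",
        (row_id,),
    )
    for key, s, f, i in cur:
        # Exactly one of the three value columns is non-NULL per row.
        out[key].append(next(v for v in (s, f, i) if v is not None))
    return dict(out)


conn.executemany("INSERT INTO embeddings VALUES (?, ?)", [(1, "id1"), (2, "id2")])
insert_list(conn, 1, "tags", ["iot", "business"])
insert_list(conn, 2, "tags", ["iot"])
insert_list(conn, 2, "versions", [1, 2, 3])
print(ids_containing(conn, "tags", "business"))  # ['id1']
print(load_lists(conn, 2))  # {'tags': ['iot'], 'versions': [1, 2, 3]}
```

Keeping one row per list element means a membership filter becomes an ordinary equality predicate on the side table, which is the property the single-table design above relies on.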
Hi @Russell-Pollari , My use case is the following: I would like to pre-filter the data getting only the items that for example have "iot" and "business" as tags. Using the already present syntax (using-logical-operators) it could be something like this:
The same applies to the $or operator. |
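As a rough illustration of the request, the sketch below evaluates `$and` / `$or` filters combined with a list-membership operator against in-memory metadata. The `$contains` operator on metadata lists is only a proposal in this thread, and `matches` is a hypothetical evaluator, not Chroma's implementation.

```python
from typing import Any, Dict, List


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Evaluate a where filter with $and, $or and a hypothetical $contains."""
    for key, condition in where.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        else:
            # Leaf condition: the stored value must be a list holding the operand.
            stored = metadata.get(key)
            wanted = condition["$contains"]
            ok = isinstance(stored, list) and wanted in stored
        if not ok:
            return False
    return True


items: List[Dict[str, Any]] = [
    {"id": "a", "tags": ["iot", "business"]},
    {"id": "b", "tags": ["iot"]},
    {"id": "c", "tags": ["business", "finance"]},
]
both = {"$and": [{"tags": {"$contains": "iot"}}, {"tags": {"$contains": "business"}}]}
either = {"$or": [{"tags": {"$contains": "iot"}}, {"tags": {"$contains": "finance"}}]}
print([m["id"] for m in items if matches(m, both)])    # ['a']
print([m["id"] for m in items if matches(m, either)])  # ['a', 'b', 'c']
```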
@Buckler89 That's the intended use case for this feature! Supporting lists in embedding metadata, and allowing users to filter based on those lists. I have a working local branch implementing this. I'll likely push a PR this week once the Chroma team merges their big SQLite refactor. |
needs to integrate fairly tightly with the need to create custom indices... |
Dear all, this issue came back in python 0.4.20. @jeffchuber
where
The error is:
|
Is this still on the roadmap? I'm trying to add a collection of "keywords" for each article I am storing and this seems like it'd be needed for that (I could also be architecting this wrong myself...) |
Is this still a work in progress? |
What is the status on this ? |
We have several issues asking for this feature. Tracking in #3415 |
Hi,
We need to save lists in the metadata (for example, when saving a Slack message, we want the metadata to hold all the users mentioned in the message).
We also want search to be able to filter on this field to check whether some value is in the list (e.g. find all Slack messages in which a specific user was mentioned).
It would be great to have support for this.
Thanks!