Adding lists to the metadata #227

Everminds · 2023-03-23T12:18:46Z

Hi,
We need to store lists in the metadata. For example, when saving a Slack message we want the metadata to include all the users mentioned in the message.
We also want search to be able to filter on this field by checking whether a value is in the list (e.g. find all Slack messages in which a specific user was mentioned).
It would be great to have support for this.
Thanks!

jeffchuber · 2023-03-24T05:12:12Z

@Everminds hello! Is this information that you store outside of chroma as well? If so, I have another idea for a solution here.

mangate · 2023-03-24T05:20:30Z

We can save it outside, though that would be less convenient.

Everminds · 2023-04-02T09:03:51Z

@jeffchuber any updates on this one?

8rV1n · 2023-05-19T06:26:12Z

I would vote for this. It would be very useful if lists were supported directly, so that we wouldn't need a third tool to retrieve all the vectors and compare them again.

It would help in scenarios where a doc describes one thing that comes in different versions, models, etc.

jeffchuber · 2023-05-19T06:36:37Z

@8rV1n you want to be able to pass an allowlist of ids to query, right?

that is underway :) #384

8rV1n · 2023-05-19T08:32:19Z

Thanks @jeffchuber , I guess not just IDs, widening it to metadata would be great!

To clarify it:

Currently, metadata can only be set up as a Dict of scalar values; I would expect list values to be supported as well.
With a list-typed metadata value, operators such as $contains or $range (AND/OR could achieve the same) should be available for metadata.

I understand this would mean a lot of effort, but see below for how it helps:

An example scenario:
Say I have a web page that updates rapidly, e.g. weekly. The IDs could just be randomly generated UUIDs, but each record has a label giving the week number.
If lists were supported, we could narrow the range with a filter like weeks 20-50.

Similarly, replace "web page" with "products" on an online shopping site: we normally filter on many options like price, category, shipping preference, seller, etc. We want similar results based on the product detail (content), and we also want to filter using fields we are familiar with so that the search is more efficient.

jeffchuber · 2023-05-19T17:09:59Z

@8rV1n chroma has this :) though we currently do a bad job communicating it

chroma/chromadb/test/test_api.py

Line 1050 in a563700

{"$and": [{"int_value": {"$eq": 3}}, {"string_value": {"$eq": "three"}}]},

look inside that test folder and you will see examples of all of these. The where filter in get will work with query as well
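
As a rough illustration of the "weeks 20-50" case raised above, a range can be composed from the same `$and` form shown in the test line. This is only a sketch: the helper name `week_range_filter` and the `week` key are hypothetical, and it assumes the comparison operators `$gte` and `$lte` are available alongside `$eq` in the where filter.

```python
from typing import Any, Dict


def week_range_filter(key: str, first: int, last: int) -> Dict[str, Any]:
    """Build a where filter matching records whose `key` lies in [first, last]."""
    if first > last:
        raise ValueError("first must not exceed last")
    # Same shape as the $and/$eq example above, with range operators instead.
    return {"$and": [{key: {"$gte": first}}, {key: {"$lte": last}}]}


# Hypothetical usage, assuming `collection` is a Chroma collection whose
# records carry an integer "week" metadata value:
# results = collection.get(where=week_range_filter("week", 20, 50))
```

This only covers a scalar value per record; it does not address storing a list under one key.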

8rV1n · 2023-05-20T03:33:04Z

Thanks @jeffchuber!

Any idea for using metadata like this? (adding, and querying)

collection.add(
    documents=["Alice meets rabbits...", "doc2", "doc3", ...],
    metadatas=[{"charactor_roles": ['Alice', 'rabbits']}, {"charactor_roles": ['Steve Jobs', 'Tim Cook']}, {"charactor_roles": []}, ...],
    ids=["id1", "id2", "id3", ...]
)

It seems I can do this for the metadata when creating the collection:

client.create_collection(
    "my_collection", 
    metadata={"foo": ["bar", "bar2"]}
)

pbarker · 2023-06-22T19:45:02Z

+1, this would be incredibly useful: we wouldn't need a secondary datastore just to be able to attach lists to documents.

Russell-Pollari · 2023-06-30T17:07:06Z

Happy to take a stab at this.

If I'm understanding correctly, this would mean adding List as an allowed value in Metadata

-Metadata = Mapping[str, Union[str, int, float]]
+Metadata = Mapping[str, Union[str, int, float, List[Union[str, int, float]]]]

So that lists can be added as a value in metadata:

collection.add(ids=['test'], documents=['test'], metadatas=[{ 'list': [1, 2, 3] }])

The biggest source of uphill work, I think, would be adding support for Lists to the Where filter operators

EDIT: Should we re-use existing operators and make them work for lists?
e.g.

collection.get(where={ "list": {  "$eq": 2 } })

or create new operators for lists?
e.g.

collection.get(where={"list": { "$contains": 2 } })

jeffchuber · 2023-06-30T20:30:03Z

@Russell-Pollari yes that is correct!

Where operator support is definitely the biggest lift here.

I think $in and $notin (or the better named version of those) is probably the minimal case...

Russell-Pollari · 2023-06-30T20:59:21Z

@jeffchuber

IMO $in and $nin imply that I should supply an array to filter against. They would be useful operators for all types.

I think it would be better UX to have $eq and $ne also work with lists (effectively as $contains or $notContains when appropriate)

But I'm definitely pattern matching to MongoDB's query operators here. This is how they do it:

I managed to get a working prototype for filtering arrays with $eq in DuckDB:

            # Shortcut for $eq
            if type(value) == str:
                result.append(
                    f""" (
                        json_extract_string(metadata, '$.{key}') = '{value}'
                        OR
                        json_contains(json_extract(metadata, '$.{key}'), '\"{value}\"')
                    )
                    """
                )
            if type(value) == int:
                result.append(
                    f""" (
                        CASE WHEN json_type(json_extract(metadata, '$.{key}')) = 'ARRAY'
                        THEN
                        list_has(CAST(json_extract(metadata, '$.{key}') AS INT[]), {value})
                        ELSE
                        CAST(json_extract(metadata, '$.{key}') AS INT) = {value}
                        END
                    )
                    """
                )
            if type(value) == float:
                result.append(
                    f""" (
                        CASE WHEN json_type(json_extract(metadata, '$.{key}')) = 'ARRAY'
                        THEN
                        list_has(CAST(json_extract(metadata, '$.{key}') AS DOUBLE[]), {value})
                        ELSE
                        CAST(json_extract(metadata, '$.{key}') AS DOUBLE) = {value}
                        END
                    )
                    """
                )
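
For readers who find the SQL hard to follow, the matching rule the prototype above encodes can be stated in plain Python. This is a sketch of the proposed semantics only, not Chroma's implementation; the names `eq_matches` and `ne_matches` are made up for illustration.

```python
from typing import List, Union

Scalar = Union[str, int, float]
Stored = Union[Scalar, List[Scalar]]


def eq_matches(stored: Stored, value: Scalar) -> bool:
    """MongoDB-style $eq: true if the stored value equals `value`,
    or if the stored value is a list that contains `value`."""
    if isinstance(stored, list):
        return value in stored
    return stored == value


def ne_matches(stored: Stored, value: Scalar) -> bool:
    """$ne as the negation of $eq, so on a list it means 'does not contain'."""
    return not eq_matches(stored, value)
```

Note that a string is never treated as a container here: `eq_matches("ab", "a")` is false, while `eq_matches(["ab", "a"], "a")` is true.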

jeffchuber · 2023-06-30T21:01:20Z

@Russell-Pollari indexing against how mongo does it is definitely a good idea!

@HammadB what do you think?

Russell-Pollari · 2023-07-03T13:31:15Z

Threw up a PR, let me know what you think!

If my solution works for y'all, happy to also update the JS client and the docs

jeffchuber · 2023-07-04T13:31:12Z

@Russell-Pollari thanks! will take a look today :)

tyatabe · 2023-07-06T09:55:33Z

Hey, I'm also interested in using this functionality, I have documents with a bunch of possible tags as metadata, for example

Document(page_content='lorem impsum ...',
metadata={
'id': '5f874c6591bc3f9a540c3722',
'title': 'hello world',
'tags': 'tag1, tag2, tag3, etc'
}
)

If I could use the $contains operator I could filter for specific tags. Right now I'm trying turning all the tags into binary values, but I think that's breaking chroma somehow

jeffchuber · 2023-07-06T15:54:54Z

but I think that's breaking chroma somehow

:( can you share more about what is breaking? this should work. are they true/false or 1/0?

tyatabe · 2023-07-07T10:43:53Z

Hey, I wasn't sure it could handle booleans or ints, so I ended up turning them into strings '0'/'1'. The error I got was from clickhouse (I'm using it with a chroma server). I think it was related to the size of the query being too big, as I also have a cloud server where I got a 413 error.
I ended up looping over the documents and that solved the issue, so I'm guessing that having so many metadata fields makes the documents too big to be handled by clickhouse? (not really sure how it all works though)

jeffchuber · 2023-07-07T14:04:14Z

@tyatabe gotcha. there was a max_query_size issue people had run into with clickhouse. We are removing clickhouse now and that should fix up this sort of sharp edge.
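
The one-flag-per-tag workaround discussed above could look roughly like the following. It is a sketch under assumptions: the `tag:` key prefix and all helper names are hypothetical, flags are stored as the int 1 (ints are an allowed metadata value type), and the batching helper mirrors the "looping over the documents" fix rather than any documented size limit.

```python
from typing import Any, Dict, Iterator, List


def split_tags(raw: str) -> List[str]:
    """Split a comma-separated tag string into clean tag names."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def tags_to_flags(tags: List[str], prefix: str = "tag:") -> Dict[str, int]:
    """Expand a list of tags into one scalar metadata entry per tag."""
    return {f"{prefix}{tag}": 1 for tag in tags}


def has_tag(tag: str, prefix: str = "tag:") -> Dict[str, Any]:
    """Where filter selecting records that carry the flag for `tag`."""
    return {f"{prefix}{tag}": {"$eq": 1}}


def batched(items: List[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive chunks so each add call stays small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

The cost of this encoding is that the number of metadata keys grows with the number of distinct tags, which is the size problem described above.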

Russell-Pollari · 2023-07-12T01:40:01Z

Exploring the new SQLite implementation.

My naive approach would look something like this, with separate tables for int, str, and float:

     def _insert_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> None:
         """Insert or update each metadata row for a single embedding record"""
-        t = Table("embedding_metadata")
+        t, str_list, int_list, float_list = Tables(
+            "embedding_metadata",
+            "embedding_metadata_string",
+            "embedding_metadata_int",
+            "embedding_metadata_float",
+        )
         q = (
             self._db.querybuilder()
             .into(t)
             .columns(t.id, t.key, t.string_value, t.int_value, t.float_value)
         )
         for key, value in metadata.items():
+            if isinstance(value, list):
+                if isinstance(value[0], str):
+                    for val in value:
+                        q_str = (
+                            self._db.querybuilder()
+                            .into(str_list)
+                            .columns(str_list.metadata_id, str_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
+                if isinstance(value[0], int):
+                    for val in value:
+                        q_int = (
+                            self._db.querybuilder()
+                            .into(int_list)
+                            .columns(int_list.metadata_id, int_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
+                if isinstance(value[0], float):
+                    for val in value:
+                        q_float = (
+                            self._db.querybuilder()
+                            .into(float_list)
+                            .columns(float_list.metadata_id, float_list.value)
+                            .insert(ParameterValue(id), ParameterValue(val))
+                        )
             if isinstance(value, str):
                ...
                 q = q.insert(
                     ParameterValue(id),

Does this make sense? @jeffchuber @HammadB

Russell-Pollari · 2023-07-13T15:54:43Z

Update: got a hacky prototype for list[int]. Should be straightforward to generalize to other types

(branched off of #781 for my working dir)

Migration for new table:

CREATE TABLE embedding_metadata_ints (
    id INTEGER REFERENCES embeddings(id),
    key TEXT REFERENCES embedding_metadata(key),
    int_value INTEGER NOT NULL
);

Inserting metadata with a list (chromadb/segment/impl/metadata/sqlite.py)

    def _insert_metadata(self, cur: Cursor, id: int, metadata: UpdateMetadata) -> None:
        """Insert or update each metadata row for a single embedding record"""
        (
            t,
            int_t,
        ) = Tables(
            "embedding_metadata",
            "embedding_metadata_ints",
        )
        q = (
            self._db.querybuilder()
            .into(t)
            .columns(t.id, t.key, t.string_value, t.int_value, t.float_value)
        )
        for key, value in metadata.items():
            if isinstance(value, list):
                q = q.insert(
                    ParameterValue(id),
                    ParameterValue(key),
                    None,
                    None,
                    None,
                )
                if isinstance(value[0], int):
                    q_int = (
                        self._db.querybuilder()
                        .into(int_t)
                        .columns(int_t.id, int_t.key, int_t.int_value)
                    )
                    for val in value:
                        q_int = q_int.insert(
                            ParameterValue(id), ParameterValue(key), ParameterValue(val)
                        )
                    sql, params = get_sql(q_int)
                    sql = sql.replace("INSERT", "INSERT OR REPLACE")
                    if sql:
                        cur.execute(sql, params)

            if isinstance(value, str):
             ...

Querying for list of ints (SqliteMetadataSegment.get_metadata)

    def get_metadata
....
        embeddings_t, metadata_t, fulltext_t, int_t = Tables(
            "embeddings",
            "embedding_metadata",
            "embedding_fulltext",
            "embedding_metadata_ints",
        )

        q = (
            (
                self._db.querybuilder()
                .from_(embeddings_t)
                .left_join(metadata_t)
                .on(embeddings_t.id == metadata_t.id)
                .outer_join(int_t)
                .on((metadata_t.key == int_t.key) & (metadata_t.id == int_t.id))
            )
            .select(
                embeddings_t.id,
                embeddings_t.embedding_id,
                embeddings_t.seq_id,
                metadata_t.key,
                metadata_t.string_value,
                metadata_t.int_value,
                metadata_t.float_value,
                int_t.int_value,
            )

Constructing the metadata object with a list of ints

    def _record(self, rows: Sequence[Tuple[Any, ...]]) -> MetadataEmbeddingRecord:
        """Given a list of DB rows with the same ID, construct a
        MetadataEmbeddingRecord"""
        _, embedding_id, seq_id = rows[0][:3]
        metadata = {}
        for row in rows:
            key, string_value, int_value, float_value, int_elem = row[3:]
            if string_value is not None:
                metadata[key] = string_value
            elif int_value is not None:
                metadata[key] = int_value
            elif float_value is not None:
                metadata[key] = float_value
            elif int_elem is not None:
                int_list = metadata.get(key, [])
                int_list.append(int_elem)
                metadata[key] = int_list

Also requires updating the relevant types/validators to allow for lists
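
As a sketch of what the validator change might involve: the function below accepts scalars plus homogeneous lists of scalars. The name `validate_metadata_with_lists` is hypothetical and this is not Chroma's validator. Rejecting empty lists is an assumption made here because the prototype above infers the element type from `value[0]`, which would fail on `[]`.

```python
from typing import Any, Mapping

# Scalar types from the Metadata alias discussed earlier in the thread.
SCALAR_TYPES = (str, int, float)


def validate_metadata_with_lists(metadata: Mapping[str, Any]) -> Mapping[str, Any]:
    """Raise ValueError unless every value is a scalar or a homogeneous list."""
    for key, value in metadata.items():
        if not isinstance(key, str):
            raise ValueError(f"Expected metadata key to be a str, got {key!r}")
        if isinstance(value, list):
            if not value:
                raise ValueError(f"Expected list for {key!r} to be non-empty")
            first_type = type(value[0])
            if first_type not in SCALAR_TYPES:
                raise ValueError(f"Unsupported list element type for {key!r}")
            # Mixed lists are rejected so each list maps to one value column.
            if any(type(item) is not first_type for item in value):
                raise ValueError(f"Expected list for {key!r} to have one type")
        elif type(value) not in SCALAR_TYPES:
            raise ValueError(f"Unsupported metadata value for {key!r}: {value!r}")
    return metadata
```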

Russell-Pollari · 2023-07-14T01:32:05Z

Converging on a solution

Initially, I created tables for each allowed list type (int, str, float). It was working but was getting messy.

Ended up using another table with the same schema as embedding_metadata, which let me reuse a lot of existing functions

CREATE TABLE embedding_metadata_lists (
    id INTEGER REFERENCES embeddings(id),
    key TEXT REFERENCES embedding_metadata(key),
    string_value TEXT,
    float_value REAL,
    int_value INTEGER
);

    @override
    def get_metadata(
        self,
        where: Optional[Where] = None,
        where_document: Optional[WhereDocument] = None,
        ids: Optional[Sequence[str]] = None,
        limit: Optional[int] = None,
        offset: Optional[int] = None,
    ) -> Sequence[MetadataEmbeddingRecord]:
        """Query for embedding metadata."""

        embeddings_t, metadata_t, fulltext_t, metadata_list_t = Tables(
            "embeddings",
            "embedding_metadata",
            "embedding_fulltext",
            "embedding_metadata_lists",
        )

        q = (
            (
                self._db.querybuilder()
                .from_(embeddings_t)
                .left_join(metadata_t)
                .on(embeddings_t.id == metadata_t.id)
                .left_outer_join(metadata_list_t)
                .on(
                    (metadata_t.key == metadata_list_t.key)
                    & (embeddings_t.id == metadata_list_t.id)
                )
            )
            .select(
                embeddings_t.id,
                embeddings_t.embedding_id,
                embeddings_t.seq_id,
                metadata_t.key,
                metadata_t.string_value,
                metadata_t.int_value,
                metadata_t.float_value,
                metadata_list_t.string_value,
                metadata_list_t.int_value,
                metadata_list_t.float_value,
            )
            ...

If this approach makes sense, can you assign this issue to me, @jeffchuber? I just about have a shippable PR with tests (old and new) passing.
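
To make the side-table idea concrete, here is a self-contained sketch using Python's built-in `sqlite3`. It is not the PR's code: the `embeddings` table is cut down to two columns, the foreign key to `embedding_metadata(key)` is dropped for brevity, and the helper names and the Slack-style sample data are made up for illustration.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE embeddings (id INTEGER PRIMARY KEY, embedding_id TEXT NOT NULL);
    CREATE TABLE embedding_metadata_lists (
        id INTEGER REFERENCES embeddings(id),
        key TEXT NOT NULL,
        string_value TEXT,
        float_value REAL,
        int_value INTEGER
    );
    """
)


def add_string_list(row_id: int, embedding_id: str, key: str, values: List[str]) -> None:
    """Store one record and one side-table row per list element."""
    conn.execute(
        "INSERT INTO embeddings (id, embedding_id) VALUES (?, ?)",
        (row_id, embedding_id),
    )
    conn.executemany(
        "INSERT INTO embedding_metadata_lists (id, key, string_value) VALUES (?, ?, ?)",
        [(row_id, key, value) for value in values],
    )


def ids_containing(key: str, value: str) -> List[str]:
    """Return embedding ids whose list under `key` contains `value`."""
    rows = conn.execute(
        """
        SELECT e.embedding_id FROM embeddings AS e
        WHERE EXISTS (
            SELECT 1 FROM embedding_metadata_lists AS l
            WHERE l.id = e.id AND l.key = ? AND l.string_value = ?
        )
        ORDER BY e.id
        """,
        (key, value),
    ).fetchall()
    return [row[0] for row in rows]


add_string_list(1, "msg1", "mentions", ["alice", "bob"])
add_string_list(2, "msg2", "mentions", ["bob"])
add_string_list(3, "msg3", "mentions", [])
```

With this layout a membership check is an indexable equality lookup on the side table, and `EXISTS` avoids duplicating a record when several elements match.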

Buckler89 · 2023-07-15T18:44:01Z

Hi @Russell-Pollari,
can you explain how those changes will impact the usage of chroma from a user's point of view?

My use case is the following:
Each item in the database is tagged using the appropriate key (in my case it's "tags"). I would like to pre-filter the query results based also on the tags. Let's say we have 3 documents:
the first has tags = [iot, business, machine]
the second has tags = [iot, business, support]
the third has tags = [iot]

I would like to pre-filter the data getting only the items that for example have "iot" and "business" as tags.

Using the already present syntax (using-logical-operators) it could be something like this:

where={
     "$and": [
         {
             "tags": {
                 "$contains": "iot"
             }
         },
         {
             "tags": {
                 "$contains": "business"
             }
         }
     ]
}

The same applies to the $or operator.
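
The intended behaviour of that filter can be pinned down with a small in-memory matcher. This is a sketch of the proposed semantics only: `$contains` on metadata lists is the operator being requested in this thread, not an existing one, and the function names are hypothetical.

```python
from typing import Any, Mapping


def field_matches(stored: Any, condition: Mapping[str, Any]) -> bool:
    """Check one field against every operator in its condition."""
    for op, operand in condition.items():
        if op == "$contains":
            if not (isinstance(stored, list) and operand in stored):
                return False
        elif op == "$eq":
            if stored != operand:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
    return True


def matches(metadata: Mapping[str, Any], where: Mapping[str, Any]) -> bool:
    """Evaluate a where filter with $and / $or against one metadata dict."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif not field_matches(metadata.get(key), condition):
            return False
    return True


docs = [
    {"tags": ["iot", "business", "machine"]},
    {"tags": ["iot", "business", "support"]},
    {"tags": ["iot"]},
]
where = {"$and": [{"tags": {"$contains": "iot"}}, {"tags": {"$contains": "business"}}]}
selected = [doc for doc in docs if matches(doc, where)]
```

Here `selected` holds the first and second documents; the third is dropped because its list lacks "business".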

Russell-Pollari · 2023-07-17T12:20:08Z

@Buckler89 That's the intended use case for this feature: supporting lists in embedding metadata and allowing users to filter on those lists. I have a working local branch implementing this.

I'll likely push a PR this week once the Chroma team merges their big SQLite refactor.

jeffchuber · 2023-09-13T21:47:35Z

needs to integrate fairly tightly with the need to create custom indices...

PeterTF656 · 2023-12-20T07:38:42Z

Dear all, this issue came back in python 0.4.20. @jeffchuber

collection.add(
    documents=[x["metadata"]["summary"] for x in data],
    embeddings=embeds_2.embeddings,
    metadatas=[x['metadata'] for x in data],
     ids=[x['uid'] for x in data]
)

where data is a list of objects, each shaped like this:

{
        "uid": string,
        "field1": string,
        "field2": string[],
        "metadata": {
            "field1": string[],
            "field2": number[],
            "field4": string,
        }
    },

The error is:

ValueError                                Traceback (most recent call last)
Cell In[107], line 1
----> 1 collection.add(
      2     documents=[x["metadata"]["summary"] for x in data],
      3     embeddings=embeds_2.embeddings,
      4     metadatas=[x['metadata'] for x in data],
      5      ids=[x['uid'] for x in data]
      6 )

File d:\dev2.0\deep-processing\.venv\Lib\site-packages\chromadb\api\models\Collection.py:146, in Collection.add(self, ids, embeddings, metadatas, documents, images, uris)
    104 def add(
    105     self,
    106     ids: OneOrMany[ID],
   (...)
    116     uris: Optional[OneOrMany[URI]] = None,
    117 ) -> None:
    118     """Add embeddings to the data store.
    119     Args:
    120         ids: The ids of the embeddings you wish to add
   (...)
    136 
    137     """
    139     (
    140         ids,
...
    277             f"Expected metadata value to be a str, int, float or bool, got {value} which is a {type(value)}"
    278         )
    279 return metadata

ValueError: Expected metadata value to be a str, int, float or bool, got ['901123200'] which is a <class 'list'>
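
Until lists are accepted, one way to get past this validation error is to serialize list values to strings before calling `collection.add`. The sketch below uses only the standard library; `encode_lists` and `decode_lists` are hypothetical helper names, and the trade-off is that the encoded value is an opaque string, so individual elements cannot be matched by equality filters.

```python
import json
from typing import Any, Dict, Iterable, Mapping


def encode_lists(metadata: Mapping[str, Any]) -> Dict[str, Any]:
    """Replace each list value with its JSON string so only scalars remain."""
    return {
        key: json.dumps(value) if isinstance(value, list) else value
        for key, value in metadata.items()
    }


def decode_lists(metadata: Mapping[str, Any], list_keys: Iterable[str]) -> Dict[str, Any]:
    """Restore the lists for the keys known to have been encoded."""
    keys = set(list_keys)
    return {
        key: json.loads(value) if key in keys else value
        for key, value in metadata.items()
    }


# Hypothetical usage with the call from the report above:
# collection.add(..., metadatas=[encode_lists(x["metadata"]) for x in data], ...)
```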

ivanol55 · 2024-02-22T09:07:48Z

Is this still on the roadmap? I'm trying to add a collection of "keywords" for each article I am storing, and this seems like it'd be needed for that (I could also be architecting this wrong myself...)

aswin024 · 2024-09-25T09:13:52Z

Is this still a work in progress?

owquresh · 2024-09-30T04:13:10Z

What is the status of this?

itaismith · 2025-01-06T20:39:57Z

We have several issues asking for this feature. Tracking in #3415

jeffchuber added the enhancement New feature or request label Mar 29, 2023

jeffchuber added the good first issue Good for newcomers label Jun 26, 2023

Russell-Pollari mentioned this issue Jul 3, 2023

support lists in metadatas #754

Closed

Russell-Pollari mentioned this issue Jul 18, 2023

Add lists to embedding metadata #840

Closed

jeffchuber removed the good first issue Good for newcomers label Jul 20, 2023

jeffchuber added the to-discuss label Sep 5, 2023

jeffchuber mentioned this issue Sep 6, 2023

[Feature Request]: Allow list or dict in validate_metadata #809

Closed

jeffchuber added the needs-cip label Sep 13, 2023

PeterTF656 mentioned this issue Dec 20, 2023

ValueError: Expected metadata value to be a str, int, float or bool, got ["somestring"] which is a <class 'list'> #1552

Closed

epinzur mentioned this issue Dec 16, 2024

[Feature Request]: $contains for metadata or allow type list for metadata and have $in for lists as filter option #3153

Closed

itaismith mentioned this issue Jan 6, 2025

[Feature Request]: Support lists in metadata #3415

Open

itaismith closed this as completed Jan 6, 2025
