Customized PGVector Component does not respect ids as unique #5612

Open
drdrewAQ opened this issue Jan 9, 2025 · 0 comments
Labels
bug Something isn't working


drdrewAQ commented Jan 9, 2025

Bug Description

I expect ids passed to PGVector.from_documents to result in unique database entries. Instead, running the same pipeline twice leaves more than one row with the same custom_id.

Reproduction

  1. Ingest some document, pipe it to text splitter.
  2. Customize text splitter to include an ID (in my sample code, we're tracking neighboring text chunks).
  3. Connect split-text Data to PGVector vectorstore (with any embedding model) and customize from_documents call to include ids.
  4. Run pipeline twice.
  5. Check Postgres for duplicates: select count(*) from langchain_pg_embedding where custom_id = '<some-id>';

Simplified modification to "Split Text" Component

    def split_text(self) -> list[Data]:
        separator = unescape_string(self.separator)

        documents = [_input.to_lc_document() for _input in self.data_inputs if isinstance(_input, Data)]

        splitter = CharacterTextSplitter(
            chunk_overlap=self.chunk_overlap,
            chunk_size=self.chunk_size,
            separator=separator,
        )
        docs = splitter.split_documents(documents)
        for i, doc in enumerate(docs):
            doc.metadata["id"] = i
            # Track neighboring chunks; None marks the first and last chunk.
            doc.metadata["prev_chunk"] = i - 1 if i > 0 else None
            doc.metadata["next_chunk"] = i + 1 if i < len(docs) - 1 else None

        data = self._docs_to_data(docs)
        self.status = data
        return data
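
For anyone reproducing this outside Langflow, the id and neighbor bookkeeping can be sketched on plain dictionaries. This is only an illustration, not the Split Text component's code: `link_chunks` and the `source` prefix are hypothetical names, and the ids are emitted as strings on the assumption that the `custom_id` column stores text.

```python
from typing import Optional


def link_chunks(texts: list[str], source: str = "doc") -> list[dict]:
    """Build per-chunk metadata with a stable id and neighbor pointers.

    `source` is a hypothetical prefix that keeps ids identical across runs
    and distinct between documents.
    """
    last = len(texts) - 1
    metadata = []
    for i, _ in enumerate(texts):
        # The first chunk has no predecessor and the last has no successor.
        prev_id: Optional[str] = f"{source}-{i - 1}" if i > 0 else None
        next_id: Optional[str] = f"{source}-{i + 1}" if i < last else None
        metadata.append(
            {"id": f"{source}-{i}", "prev_chunk": prev_id, "next_chunk": next_id}
        )
    return metadata


chunks = link_chunks(["first", "second", "third"], source="report")
```

Because the ids depend only on the source name and chunk position, a second run over the same input produces the same ids, which is the condition under which the duplicate rows show up.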

Modification to PGVector Component:

            pgvector = PGVector.from_documents(
                embedding=self.embedding,
                documents=documents,
                collection_name=self.collection_name,
                connection_string=connection_string_parsed,
                ids=[doc.metadata['id'] for doc in documents]
            )

Expected behavior

Following the above steps, select count(*) from langchain_pg_embedding where custom_id = '<some-id>'; should return 1.
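
The same check can be run for every id at once rather than one id at a time. The sketch below uses an in-memory SQLite table as a stand-in for Postgres, purely so it runs without a database. Only the table name `langchain_pg_embedding` and the column `custom_id` come from this report; the `document` column and the sample rows are made up for illustration.

```python
import sqlite3

# Stand-in for the Postgres table; the real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE langchain_pg_embedding (custom_id TEXT, document TEXT)")

# Simulate running the pipeline twice with the same ids.
rows = [("0", "chunk a"), ("1", "chunk b")]
for _ in range(2):
    conn.executemany("INSERT INTO langchain_pg_embedding VALUES (?, ?)", rows)

# List every custom_id that appears more than once.
DUPLICATES_SQL = """
    SELECT custom_id, COUNT(*) AS n
    FROM langchain_pg_embedding
    GROUP BY custom_id
    HAVING COUNT(*) > 1
    ORDER BY custom_id
"""
duplicates = conn.execute(DUPLICATES_SQL).fetchall()
print(duplicates)
```

With the behavior described in this issue the query returns one row per duplicated id; if ids were respected as unique it would return nothing.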

I do not know if this is a byproduct of using the (deprecated) langchain_community.vectorstores version of PGVector, but I do not have the option to import the recommended PGVector from langchain_postgres in Langflow. Reference: documentation indicating uniqueness of provided ids.

Who can help?

No response

Operating System

Docker

Langflow Version

1.1.1

Python Version

None

Screenshot

No response

Flow File

No response

@drdrewAQ drdrewAQ added the bug Something isn't working label Jan 9, 2025