[Bug]: Embedding function input validation #3566

Open
tazarov opened this issue Jan 25, 2025 · 0 comments · May be fixed by #3567
Labels
bug Something isn't working by-chroma

Comments

@tazarov
Contributor

tazarov commented Jan 25, 2025

What happened?

Many of our embedding functions do not validate their input strictly enough, which leads to misleading and confusing results.

Consider the following:

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2
import numpy as np
ef = ONNXMiniLM_L6_V2()


embeddings1 = ef("hello world")  # a bare string, not a list of documents

assert isinstance(embeddings1, list)
assert all(isinstance(e, np.ndarray) for e in embeddings1)
embeddings2 = ef(["hello world"])  # the same text wrapped in a list
assert isinstance(embeddings2, list)
assert all(isinstance(e, np.ndarray) for e in embeddings2)
tolerance = 1e-5
assert np.allclose(embeddings1[0], embeddings2[0], atol=tolerance)  # fails

In the above, all but the last assertion pass:

AssertionError                            Traceback (most recent call last)
Cell In[1], line 14
     12 assert all(isinstance(e, np.ndarray) for e in embeddings2)
     13 tolerance = 1e-5
---> 14 assert np.allclose(embeddings1[0], embeddings2[0], atol=tolerance)

AssertionError:

What the example demonstrates is that whether the EF is given a single string or a list of strings, it happily accepts the input and generates seemingly valid embeddings, even though the two results do not match.
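One plausible cause, stated here as an assumption rather than a confirmed diagnosis: Python strings are themselves iterable, so any code that loops over its input without checking the type treats "hello world" as eleven one-character documents. The sketch below uses a hypothetical `toy_embed` function in place of a real EF, purely to show that effect:

```python
from typing import Iterable, List


def toy_embed(documents: Iterable[str]) -> List[List[float]]:
    """Hypothetical stand-in for an EF: one fake vector per item it iterates over."""
    # A real EF would run a model here; this toy only uses the item's
    # length and the code point of its first character.
    return [[float(len(doc)), float(ord(doc[0]))] for doc in documents]


from_string = toy_embed("hello world")   # iterates over 11 characters
from_list = toy_embed(["hello world"])   # iterates over 1 document

print(len(from_string), len(from_list))   # 11 1
print(from_string[0] == from_list[0])     # False
```

If an EF iterates over its input this way, a bare string yields one embedding per character, and the first of them is the embedding of "h" rather than of the whole sentence. That would be consistent with the failing assertion above.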

This behavior is not limited to our default EF. The following EFs, which I have tested so far, behave the same way:

  • OpenAIEmbeddingFunction
  • SentenceTransformerEmbeddingFunction
  • GoogleGenerativeAiEmbeddingFunction

Contrary to the above:

  • Ollama - works fine with single string and list of strings as this is supported by the API
  • HuggingFace Inference API - same as Ollama (although this is not in the official docs, experiments show that the API supports both types of input)
  • Cohere - properly validates the input and fails with UnprocessableEntityError: status_code: 422, body: {'message': 'invalid type: parameter texts is of type string but should be of type []Object. For proper usage, please refer to https://docs.cohere.com/v1/reference/embed'}
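The two sound behaviors in the list above (rejecting a bare string, as Cohere does, or treating it as a single document, as Ollama does) can be sketched in a few lines of client-side validation. This is only an illustration for text input: `normalize_documents` and its `strict` flag are hypothetical names, not part of Chroma's API, and the fix in the linked PR may look quite different.

```python
from typing import List, Sequence, Union


def normalize_documents(
    documents: Union[str, Sequence[str]], strict: bool = True
) -> List[str]:
    """Hypothetical helper: return a list of strings or raise on invalid input."""
    if isinstance(documents, str):
        if strict:
            # Strict mode: refuse a bare string instead of iterating its characters.
            raise ValueError("expected a list of strings, got a single string")
        return [documents]  # lenient mode: treat the string as one document
    if not isinstance(documents, (list, tuple)):
        raise ValueError(f"expected a list of strings, got {type(documents).__name__}")
    for i, doc in enumerate(documents):
        if not isinstance(doc, str):
            raise ValueError(f"item {i} is {type(doc).__name__}, expected str")
    return list(documents)


print(normalize_documents(["hello world"]))
print(normalize_documents("hello world", strict=False))
try:
    normalize_documents("hello world")
except ValueError as err:
    print(err)
```

Either mode removes the ambiguity: with such a check in front of the model call, `ef("hello world")` would either fail loudly or produce the same embedding as `ef(["hello world"])`.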

Versions

Chroma 0.4.x-0.6.x

Relevant log output

@tazarov tazarov added the bug Something isn't working label Jan 25, 2025
@tazarov tazarov linked a pull request Jan 25, 2025 that will close this issue