
Refactor/optimize embedding module #207

Open
mehmetcanay wants to merge 8 commits into main from refactor/optimize-embedding-module

Conversation

@mehmetcanay (Member)

No description provided.

@mehmetcanay mehmetcanay self-assigned this Mar 30, 2026
@mehmetcanay mehmetcanay added the refactor label (Improve existing code structure without changing behavior) Mar 30, 2026
@codecov-commenter commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 93.10345% with 6 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| datastew/embedding/base.py | 84.61% | 6 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| datastew/embedding/hugging_face.py | 87.50% <100.00%> (+2.88%) ⬆️ |
| datastew/embedding/ollama.py | 76.92% <100.00%> (+76.92%) ⬆️ |
| datastew/embedding/openai.py | 88.88% <100.00%> (+88.88%) ⬆️ |
| datastew/embedding/vectorizer.py | 100.00% <100.00%> (+33.33%) ⬆️ |
| datastew/embedding/base.py | 90.90% <84.61%> (-0.76%) ⬇️ |

@mehmetcanay mehmetcanay requested a review from tiadams March 31, 2026 08:35
Comment thread: datastew/embedding/base.py (Outdated)
return cached

embedding = self._generate_embedding(text)
self._add_to_cache(text, embedding)
Member

if self.cache?

Member Author

Somehow that led to a bug, which is why I changed it to `if self._cache is not None`.
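
A likely mechanism for that bug (an assumption here, since the failing case is not shown in the thread): an empty dict is falsy in Python, so a truthiness check bypasses a cache that exists but has not been filled yet, while the identity check does not.

```python
# Sketch: why `if self.cache:` can misfire for a dict-backed cache.
cache: dict = {}          # cache exists but is empty

print(bool(cache))        # False -> `if cache:` would skip the cache entirely
print(cache is not None)  # True  -> `if cache is not None:` uses it as intended
```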

Member

This block still triggers, though, if the param cache=False.

If the cache is not being utilized, I don't see any point in filling it.
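
A minimal sketch of the guard being requested, reusing the names visible in the snippets above (`_get_from_cache` is an assumed helper; only `_add_to_cache` appears in the diff):

```python
def get_embedding(self, text: str) -> list[float]:
    # Read from the cache only when caching is enabled (cache=True).
    if self._cache is not None:
        cached = self._get_from_cache(text)
        if cached is not None:
            return cached

    embedding = self._generate_embedding(text)

    # Likewise, skip the write when cache=False: no point filling an unused cache.
    if self._cache is not None:
        self._add_to_cache(text, embedding)
    return embedding
```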

self._cache = None
self._cache_lock = None

@abstractmethod
Member

why drop abstract?

Member Author

The adapters only differ in how they call and interact with their APIs. Previously, common steps like converting the output into lists were also handled by each child adapter. I implemented a get_embedding method in the base class that covers all the common aspects of the adapters. This new method calls a protected abstract method, _generate_embedding, which each child adapter implements according to its API. Thus, the adapters only need to implement a very small function.
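
This is the template method pattern. A minimal sketch of the structure described (the base class name, the validation, and the list conversion are illustrative, not the PR's exact code):

```python
from abc import ABC, abstractmethod
from typing import Sequence


class EmbeddingModel(ABC):  # placeholder name for the class in base.py
    def get_embedding(self, text: str) -> list[float]:
        """Shared path for all adapters: validation and output conversion."""
        if not text:
            raise ValueError("text must be non-empty")
        raw = self._generate_embedding(text)
        return list(raw)  # common conversion handled once, in the base class

    @abstractmethod
    def _generate_embedding(self, text: str) -> Sequence[float]:
        """API-specific call; each child adapter implements only this."""


class DummyAdapter(EmbeddingModel):
    def _generate_embedding(self, text: str) -> Sequence[float]:
        return (0.0, 0.0, 0.0)  # stand-in for a real API call
```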

pass

def add_to_cache(self, text: str, embedding: Sequence[float]):
@abstractmethod
Member

Having an abstract "private" method here is an anti-pattern; abstract methods are meant to be visible and to be implemented.

Member Author

This is a protected method though, as it only has a single leading underscore.

Member

Still, abstract methods should generally be public by design; they are meant to signal that something needs an implementation.

Acts as a unified interface for local Hugging Face models, OpenAI API models, and locally hosted Ollama models.
"""

_MODEL_REGISTRY = {
Member

I would not restrict this to these models. A better way would be to check whether the model exists (via the HF API) and throw an exception on initialization if it does not.

Member Author

That makes sense as well; I just wanted to type them so the user gets autocomplete in their IDE.

Member Author

Also, we cannot check against the HF API, because not all models there use sentence-transformers, and the current adapter sometimes has issues with such models.
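
One possible way to reconcile the two goals (a sketch, not the PR's code): type the parameter as a union of a Literal of the registry keys and plain str. Some IDEs still offer the literal values as completions for such unions, while arbitrary model names remain accepted; whether autocomplete actually triggers depends on the IDE. The model names below are illustrative.

```python
from typing import Literal, Union

# Known-good sentence-transformers models get IDE completions via Literal...
KnownModel = Literal[
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
]

# ...while the union with plain str keeps any other model name valid.
ModelName = Union[KnownModel, str]


def load_model(model_name: ModelName) -> str:
    return model_name
```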

"""Reset global caches and initialize a fresh DummyAdapter before each test."""
_GLOBAL_CACHES.clear()
_GLOBAL_LOCKS.clear()
self.adapter = DummyAdapter(model_name="dummy-model", cache=True)
Member

Set up like this, it does not cover cases where cache=False.
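
A sketch of a setup that exercises both configurations, reusing the names visible in the fixture above (DummyAdapter, _GLOBAL_CACHES, _GLOBAL_LOCKS, and the _cache attribute come from the snippets; the module imports are omitted, as in the original test file):

```python
import unittest


class TestEmbeddingCacheModes(unittest.TestCase):
    def setUp(self):
        """Reset global caches before each test; build adapters per test."""
        _GLOBAL_CACHES.clear()
        _GLOBAL_LOCKS.clear()

    def test_cache_enabled_reuses_result(self):
        adapter = DummyAdapter(model_name="dummy-model", cache=True)
        first = adapter.get_embedding("hello")
        self.assertEqual(adapter.get_embedding("hello"), first)

    def test_cache_disabled_keeps_cache_unset(self):
        adapter = DummyAdapter(model_name="dummy-model", cache=False)
        adapter.get_embedding("hello")
        self.assertIsNone(adapter._cache)
```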

"""Reset the class-level model cache and initialize a mocked adapter."""
HuggingFaceAdapter._model_cache.clear()
self.mock_model_instance = mock_transformer.return_value
self.adapter = HuggingFaceAdapter(cache=False)
Member

Does not cover cache=True

def setUp(self, mock_client):
"""Initialize the adapter with a mocked Ollama client."""
self.mock_client_instance = mock_client.return_value
self.adapter = OllamaAdapter(cache=False)
Member

Does not cover cache=True

def setUp(self, mock_openai):
"""Initialize the adapter with a mocked OpenAI client."""
self.mock_client = mock_openai.return_value
self.adapter = GPT4Adapter(api_key="test-key", cache=False)
Member

Does not cover cache=True
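
For the mocked adapters, a cache=True case could assert that repeated calls hit the API exactly once. A sketch for the OpenAI adapter (the patch target and the embeddings.create response shape are assumptions based on the mocked-client setup above; the GPT4Adapter import is omitted, as in the original test file):

```python
import unittest
from unittest.mock import MagicMock, patch


class TestGPT4AdapterCaching(unittest.TestCase):
    @patch("datastew.embedding.openai.OpenAI")  # assumed patch target
    def test_repeated_calls_use_cache(self, mock_openai):
        mock_client = mock_openai.return_value
        # Assumed response shape: response.data[0].embedding -> list of floats.
        item = MagicMock()
        item.embedding = [0.0, 0.0, 0.0]
        mock_client.embeddings.create.return_value.data = [item]

        adapter = GPT4Adapter(api_key="test-key", cache=True)
        adapter.get_embedding("hello")
        adapter.get_embedding("hello")

        # With caching enabled, the underlying client is called only once.
        self.assertEqual(mock_client.embeddings.create.call_count, 1)
```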


Labels

refactor Improve existing code structure without changing behavior.


3 participants