
Refactor/optimize embedding module #207

Open
mehmetcanay wants to merge 8 commits into main from refactor/optimize-embedding-module

Conversation

@mehmetcanay (Member)

No description provided.

@mehmetcanay mehmetcanay self-assigned this Mar 30, 2026
@mehmetcanay mehmetcanay added the refactor label (Improve existing code structure without changing behavior) Mar 30, 2026
@codecov-commenter commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 93.10345% with 6 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| datastew/embedding/base.py | 84.61% | 6 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| datastew/embedding/hugging_face.py | 87.50% <100.00%> (+2.88%) ⬆️ |
| datastew/embedding/ollama.py | 76.92% <100.00%> (+76.92%) ⬆️ |
| datastew/embedding/openai.py | 88.88% <100.00%> (+88.88%) ⬆️ |
| datastew/embedding/vectorizer.py | 100.00% <100.00%> (+33.33%) ⬆️ |
| datastew/embedding/base.py | 90.90% <84.61%> (-0.76%) ⬇️ |

@mehmetcanay mehmetcanay requested a review from tiadams March 31, 2026 08:35
Comment thread: datastew/embedding/base.py (Outdated)
return cached

embedding = self._generate_embedding(text)
self._add_to_cache(text, embedding)
Member

if self.cache?

Member Author

Somehow that led to a bug, which is why I changed it to `if self._cache is not None`.
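
A likely mechanism for that bug (an assumption here, since the failing case is not shown in the thread): an empty dict is falsy in Python, so a truthiness check bypasses a cache that exists but has not been filled yet, while the identity check does not.

```python
# Sketch: why `if self.cache:` can misfire for a dict-backed cache.
cache: dict = {}          # cache exists but is empty

print(bool(cache))        # False -> `if cache:` would skip the cache entirely
print(cache is not None)  # True  -> `if cache is not None:` uses it as intended
```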

Member

This block still triggers, though, if the param cache=False.

If the cache is not being utilized, I don't see any point in filling it.
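
A minimal sketch of the guard being requested, reusing the names visible in the snippets above (`_get_from_cache` is an assumed helper; only `_add_to_cache` appears in the diff):

```python
def get_embedding(self, text: str) -> list[float]:
    # Read from the cache only when caching is enabled (cache=True).
    if self._cache is not None:
        cached = self._get_from_cache(text)
        if cached is not None:
            return cached

    embedding = self._generate_embedding(text)

    # Likewise, skip the write when cache=False: no point filling an unused cache.
    if self._cache is not None:
        self._add_to_cache(text, embedding)
    return embedding
```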

self._cache = None
self._cache_lock = None

@abstractmethod
Member

why drop abstract?

Member Author

The adapters only differ in how they call and interact with their APIs. Previously, common steps like converting the output into lists were also handled by each child adapter. I implemented a get_embedding method in the base class that covers all the common aspects of the adapters. This new method calls a protected abstract method, _generate_embedding, which each child adapter implements according to its API. Thus, the adapters only need to implement a very small function.
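
This is the template method pattern. A minimal sketch of the structure described (the base class name, the validation, and the list conversion are illustrative, not the PR's exact code):

```python
from abc import ABC, abstractmethod
from typing import Sequence


class EmbeddingModel(ABC):  # placeholder name for the class in base.py
    def get_embedding(self, text: str) -> list[float]:
        """Shared path for all adapters: validation and output conversion."""
        if not text:
            raise ValueError("text must be non-empty")
        raw = self._generate_embedding(text)
        return list(raw)  # common conversion handled once, in the base class

    @abstractmethod
    def _generate_embedding(self, text: str) -> Sequence[float]:
        """API-specific call; each child adapter implements only this."""


class DummyAdapter(EmbeddingModel):
    def _generate_embedding(self, text: str) -> Sequence[float]:
        return (0.0, 0.0, 0.0)  # stand-in for a real API call
```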

pass

def add_to_cache(self, text: str, embedding: Sequence[float]):
@abstractmethod
Member

Having an abstract "private" method here is an anti-pattern; abstract methods are meant to be visible and to be implemented.

Member Author

This is a protected method though, as it only has a single leading underscore.

Member

Still, abstract methods should generally be public by design; they are meant to signal that something needs an implementation.

Acts as a unified interface for local Hugging Face models, OpenAI API models, and locally hosted Ollama models.
"""

_MODEL_REGISTRY = {
Member

I would not restrict this to these models. A better way would be to check whether the model exists (via the HF API) and throw an exception on initialization if it does not.

Member Author

That makes sense as well; I just wanted to type them so the user gets autocomplete in their IDE.

Member Author

Also, we cannot check against the HF API, because not all models there use sentence-transformers, and the current adapter sometimes has issues with such models.
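
One possible way to reconcile the two goals (a sketch, not the PR's code): type the parameter as a union of a Literal of the registry keys and plain str. Some IDEs still offer the literal values as completions for such unions, while arbitrary model names remain accepted; whether autocomplete actually triggers depends on the IDE. The model names below are illustrative.

```python
from typing import Literal, Union

# Known-good sentence-transformers models get IDE completions via Literal...
KnownModel = Literal[
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
]

# ...while the union with plain str keeps any other model name valid.
ModelName = Union[KnownModel, str]


def load_model(model_name: ModelName) -> str:
    return model_name
```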

"""Reset global caches and initialize a fresh DummyAdapter before each test."""
_GLOBAL_CACHES.clear()
_GLOBAL_LOCKS.clear()
self.adapter = DummyAdapter(model_name="dummy-model", cache=True)
Member

Set up like this, it does not cover cases where cache=False.
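
A sketch of a setup that exercises both configurations, reusing the names visible in the fixture above (DummyAdapter, _GLOBAL_CACHES, _GLOBAL_LOCKS, and the _cache attribute come from the snippets; the module imports are omitted, as in the original test file):

```python
import unittest


class TestEmbeddingCacheModes(unittest.TestCase):
    def setUp(self):
        """Reset global caches before each test; build adapters per test."""
        _GLOBAL_CACHES.clear()
        _GLOBAL_LOCKS.clear()

    def test_cache_enabled_reuses_result(self):
        adapter = DummyAdapter(model_name="dummy-model", cache=True)
        first = adapter.get_embedding("hello")
        self.assertEqual(adapter.get_embedding("hello"), first)

    def test_cache_disabled_keeps_cache_unset(self):
        adapter = DummyAdapter(model_name="dummy-model", cache=False)
        adapter.get_embedding("hello")
        self.assertIsNone(adapter._cache)
```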

"""Reset the class-level model cache and initialize a mocked adapter."""
HuggingFaceAdapter._model_cache.clear()
self.mock_model_instance = mock_transformer.return_value
self.adapter = HuggingFaceAdapter(cache=False)
Member

Does not cover cache=True

def setUp(self, mock_client):
"""Initialize the adapter with a mocked Ollama client."""
self.mock_client_instance = mock_client.return_value
self.adapter = OllamaAdapter(cache=False)
Member

Does not cover cache=True

def setUp(self, mock_openai):
"""Initialize the adapter with a mocked OpenAI client."""
self.mock_client = mock_openai.return_value
self.adapter = GPT4Adapter(api_key="test-key", cache=False)
Member

Does not cover cache=True
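
For the mocked adapters, a cache=True case could assert that repeated calls hit the API exactly once. A sketch for the OpenAI adapter (the patch target and the embeddings.create response shape are assumptions based on the mocked-client setup above; the GPT4Adapter import is omitted, as in the original test file):

```python
import unittest
from unittest.mock import MagicMock, patch


class TestGPT4AdapterCaching(unittest.TestCase):
    @patch("datastew.embedding.openai.OpenAI")  # assumed patch target
    def test_repeated_calls_use_cache(self, mock_openai):
        mock_client = mock_openai.return_value
        # Assumed response shape: response.data[0].embedding -> list of floats.
        item = MagicMock()
        item.embedding = [0.0, 0.0, 0.0]
        mock_client.embeddings.create.return_value.data = [item]

        adapter = GPT4Adapter(api_key="test-key", cache=True)
        adapter.get_embedding("hello")
        adapter.get_embedding("hello")

        # With caching enabled, the underlying client is called only once.
        self.assertEqual(mock_client.embeddings.create.call_count, 1)
```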


Labels

refactor Improve existing code structure without changing behavior.


3 participants