collections: added task to compute num of records #1853

alejandromumo · 2024-10-21T14:10:22Z

added task to compute number of records for all the collections
added service methods to read collections (many, all)
added tests

closes #1852

ntarocco

Reviewed with @ptamarit

ntarocco · 2024-10-24T12:15:33Z

invenio_rdm_records/collections/api.py

+        if not ids_:
+            return []


Very minor comment: why check this here instead of delegating it to the caller?
The caller could avoid calling the method if no ids are given.
Otherwise, it feels like we should check the values of all params, including depth.

ntarocco · 2024-10-24T12:32:43Z

invenio_rdm_records/collections/service.py

+        else:
+            collection = collection_or_id
+
+        params.update({"collection_id": collection.id})


I am puzzled by fetching the collection query inside the community service here.
It feels like a specific case, and if in the future we need to add another extra_filter there, we will copy the pattern and add an extra if/else there.
Why not fetch the collection query here and inject it into the search as an extra filter?
It also feels like we are fetching the collection object multiple times.

I think this is a really good point and I had the same feeling that it could go in just one place. Plus, collections are not bound to communities at their core, so the search should be possible without going through the community records service.

For this change, however, we need a new endpoint inside collections to expose the collection service search method.

We discussed IRL and we were on the same page: the collection's records search is better fitted in the collection service. Therefore, I added a new commit with the required changes.

In order not to block the development, both commits of this PR are working so we can split them if needed.

ntarocco · 2024-10-24T12:37:23Z

invenio_rdm_records/collections/models.py

+        return cls.query.order_by(cls.path, cls.order)
+
+    def update(
+        self, /, slug=None, title=None, search_query=None, order=None, num_records=None


What is the / param? Did you mean _?

I modified the method now based on your suggestions above.

Anyway, I wanted to use keyword-only parameters (*) and not positional-only. The idea was that anyone calling collection.update has to be explicit about what they want to update, to avoid unexpected errors coming from e.g. swapping the order unintentionally or drilling objects.

~~Since I now refactored the update code to have a set of allowed parameters, it accomplishes the same and this was removed.~~ The validation is now done in the schema, so it's protected already.
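To illustrate the distinction discussed in this thread: in a Python signature, `/` marks the parameters before it as positional-only, while a bare `*` marks the parameters after it as keyword-only. So `def update(self, /, slug=None, ...)` only makes `self` positional-only and still accepts `update("x")`. The sketch below uses a hypothetical stand-in class (not the real model) to show the keyword-only variant the author had in mind.

```python
class Collection:
    """Hypothetical stand-in for the collection model, for illustration only."""

    def __init__(self, slug, title, order=0):
        self.slug = slug
        self.title = title
        self.order = order

    # The bare `*` makes every parameter after it keyword-only.
    def update(self, *, slug=None, title=None, order=None):
        for name, value in (("slug", slug), ("title", title), ("order", order)):
            if value is not None:
                setattr(self, name, value)
        return self


c = Collection("physics", "Physics")
c.update(title="High Energy Physics")  # explicit keyword: accepted

try:
    c.update("hep")  # positional argument: rejected with TypeError
except TypeError:
    rejected = True
else:
    rejected = False
```

With this signature, callers cannot accidentally swap `slug` and `title` by position.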

kpsherva · 2024-10-24T13:23:36Z

invenio_rdm_records/collections/schema.py

@@ -14,7 +14,8 @@ class CollectionSchema(Schema):

    slug = fields.Str()
    title = fields.Str()
-    depth = fields.Int()
+    depth = fields.Int(dump_only=True)


Won't this prevent us from creating collections with a desired depth via the service in the future?

No, the depth of a collection tells you how deep it is in the tree and is a read-only field. It is computed based on the path when you create a collection. The path is the field that matters here since collections are based on the materialized path pattern.

Initially, we added depth as a field to improve the performance of read queries.

The service has two methods to create collections:

create: creates a new "root" collection (i.e. one with no parent) inside a CollectionTree. The path is always "empty" since it's a root (path=','). I added a docstring to be more explicit about it.

add: adds a collection as a child of another one. The path will be computed based on the parent.
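A minimal sketch of why depth can stay read-only under the materialized path pattern. It assumes the path is a comma-delimited list of ancestor ids such as `,1,5,` (only the root value `,` is stated in this thread; the rest of the format and both helper names are hypothetical).

```python
def depth_from_path(path: str) -> int:
    """Count the ancestor ids encoded in a materialized path such as ',1,5,'."""
    return len([part for part in path.split(",") if part])


def child_path(parent_path: str, parent_id: int) -> str:
    """Path of a child: the parent's path followed by the parent's id."""
    return f"{parent_path}{parent_id},"


root_path = ","                    # a root collection has no ancestors
level1 = child_path(root_path, 1)  # child of the root collection with id 1
level2 = child_path(level1, 5)     # child of the collection with id 5
```

Because depth is derived entirely from the path, accepting it as input would only create a second, possibly inconsistent, source of truth.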

kpsherva · 2024-10-24T13:24:33Z

invenio_rdm_records/collections/schema.py

    num_records = fields.Int()
+    search_query = fields.Str(load_only=True)


The search query will be stripped out when dumping. Don't we want it visible in the UI in the future?
Not sure, what was the reason to introduce this change?

Yes, it's stripped out for now because it doesn't have a lot of value by itself. The "real" query is computed based on the parents' queries too, so for now it's only required to be updated.

If we need it in the UI in the future, first we must decide what we want to show (e.g. the collection individual query or the "final" query)
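To make the difference between the individual query and the "final" query concrete, here is a rough sketch. It assumes the ancestors' queries are combined with `AND`; the thread does not state the operator, and the function name and field names are hypothetical.

```python
def effective_query(ancestor_queries, own_query):
    """Combine ancestor queries with the collection's own query.

    Assumption: queries are ANDed together, each wrapped in parentheses.
    """
    parts = [q for q in [*ancestor_queries, own_query] if q]
    return " AND ".join(f"({q})" for q in parts)


final = effective_query(
    ["metadata.subjects:physics"],
    "metadata.publication_date:[2020 TO *]",
)
```

Under that assumption, showing only `search_query` in the UI would hide the constraints inherited from parent collections.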

kpsherva · 2024-10-24T13:26:41Z

invenio_rdm_records/collections/service.py

+            identity, res, self.collection_schema, None, self.links_item_tpl
+        )
+
+    def read_all(self, identity, depth=2):


shouldn't the name be search to align with the usual service method naming convention?
any difference between read_all and read_many? wouldn't it be easier to merge them and search based on presence of ids_?
Same with underlying api methods .resolve_many and resolve_all

Naming it search

I don't agree, I would expect search to search in the search engine. Plus read (and read_many, read_all) are also widely used in other services.

Moreover, collections_service.search could also be ambiguous. Does it mean search collections or search records inside the collections? In my opinion the latter makes more sense, so we added collections_service.search_records to be explicit about it.

Difference between read_many and read_all

Read many is a specific query SELECT ... FROM ... WHERE id in (1,2,3) while read all means SELECT ... FROM. So I split it because:

1. I want it to be very explicit that they do different things.
2. To avoid ids_=[] having any implicit meaning, e.g. would it mean fetch all the collections OR fetch 0 collections?

Underlying resolve methods

These were renamed to be consistent with the service methods.
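The read_many / read_all distinction described above can be sketched with plain SQL. This is an illustration using the standard-library `sqlite3` module and a hypothetical `collection` table, not the actual SQLAlchemy models.

```python
import sqlite3


def read_all(conn):
    """SELECT ... FROM: no filter, returns every collection."""
    return conn.execute("SELECT id, slug FROM collection ORDER BY id").fetchall()


def read_many(conn, ids):
    """SELECT ... FROM ... WHERE id IN (...): only the requested ids."""
    if not ids:
        # An empty list explicitly means "no collections", never "all".
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, slug FROM collection WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(sql, list(ids)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collection (id INTEGER PRIMARY KEY, slug TEXT)")
conn.executemany(
    "INSERT INTO collection (id, slug) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)
```

Keeping the two functions separate means `ids=[]` has one unambiguous meaning.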

kpsherva · 2024-10-24T13:30:20Z

invenio_rdm_records/collections/tasks.py

+    for citem in res:
+        try:
+            collection = citem._collection
+            res = collections_service.search_records(system_identity, collection)


does it make sense to take advantage of the OS count query instead?

I am not sure whether it's faster or not (I assume it should be, though that needs testing), but is it supported by the other services' methods? Searching for records inside a collection ultimately means using the CommunityRecords or RecordsService search.

kpsherva · 2024-10-24T13:32:12Z

invenio_rdm_records/collections/service.py

+        if isinstance(collection_or_id, int):
+            collection = self.collection_cls.resolve(id_=collection_or_id)
+        else:
+            collection = collection_or_id


just for me to understand: in what circumstances do we pass collection, not the id? I think most of the services only accept id, not the obj

This was added to avoid resolving collections twice. I.e. this is useful when chaining service method calls. For the celery task implemented here, I need to read all the collections and then search each one of them.

The flow of having only an id and resolving the entity inside works well when in a request context and when all the logic is contained within one service method.

An alternative would be having a service method that does all of that (e.g. search_all_collections), but it's not as granular and reusable as I would expect from a service method.
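A small sketch of the "collection or id" pattern described above, showing how passing an already-loaded object avoids a second lookup when chaining calls. `FakeStore` and `resolve` are hypothetical stand-ins, not the real service or API classes.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    id: int
    search_query: str


class FakeStore:
    """Stand-in for the persistence layer that counts single-item lookups."""

    def __init__(self, rows):
        self._rows = {c.id: c for c in rows}
        self.lookups = 0

    def read(self, id_):
        self.lookups += 1
        return self._rows[id_]

    def read_all(self):
        return list(self._rows.values())


def resolve(store, collection_or_id):
    """Resolve by id only when an id was given; pass objects through."""
    if isinstance(collection_or_id, int):
        return store.read(id_=collection_or_id)
    return collection_or_id


store = FakeStore([Collection(1, "a:1"), Collection(2, "b:2")])
# Task-like flow: read everything once, then reuse the loaded objects.
queries = [resolve(store, c).search_query for c in store.read_all()]
```

In the task-like flow no per-collection lookup happens, whereas an id-only signature would trigger one lookup per collection.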

* added task to compute number of records for all the collections
* added "collection_id" parameter to record search
* added service methods to read collections (many, all)
* added tests
* collections: refactor 'resolve' to 'read'
* collections: rename 'search_records' method
* collections: update read method signature

jrcastro2

Peer reviewed with @0einstein0. LGTU! 🚀
We left just a minor question.

jrcastro2 · 2024-10-29T13:53:55Z

invenio_rdm_records/collections/services/service.py

@@ -131,3 +165,47 @@ def read_logo(self, identity, slug):
        if _exists:
            return url_for("static", filename=logo_path)
        raise LogoNotFoundError()
+
+    def read_many(self, identity, ids_, depth=2):


Question: unless we are missing something, this service method doesn't seem to be used at all (unless there are plans to use it, I would remove it).

Good point.

IMO it's useful to have it alongside read_all to make a clear distinction since the underlying queries are distinct and have performance differences.

Therefore, I prefer to keep it there from early on.

ntarocco

Reviewed with ❤️ by Nico and Carlin.

We think that you could merge, and eventually improve/fix things in a subsequent PR.

ntarocco · 2024-10-30T15:51:22Z

invenio_rdm_records/collections/services/service.py

+    def update(self, identity, collection_or_id, data=None, uow=None):
+        """Update a collection."""
+        if isinstance(collection_or_id, int):
+            collection = self.collection_cls.read(id_=collection_or_id)


Minor: instead of hardcoding the collection class (line 34), we could have added it to the ServiceConfig, as normally done, so it could be more easily overridden.

Created an issue: #1865
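One way the suggestion could look, as a rough sketch only: the class is read from the config object instead of being hardcoded in the service. All class names here are hypothetical and do not reflect the eventual implementation tracked in the issue.

```python
class Collection:
    """Default API class (placeholder)."""


class CollectionsServiceConfig:
    # Overridable class attribute instead of a hardcoded reference.
    collection_cls = Collection


class CollectionsService:
    def __init__(self, config):
        self.config = config

    @property
    def collection_cls(self):
        return self.config.collection_cls


class MyCollection(Collection):
    """Instance-specific subclass."""


class MyConfig(CollectionsServiceConfig):
    collection_cls = MyCollection


service = CollectionsService(MyConfig)
```

An instance can then swap the class by subclassing the config, without touching the service code.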

ntarocco · 2024-10-30T15:52:50Z

invenio_rdm_records/collections/api.py

+    def update(self, **kwargs):
+        """Update the collection."""
+        if "search_query" in kwargs:
+            Collection.validate_query(kwargs["search_query"])


What about using self here, even if it is a classmethod, instead of hardcoding?

Suggested change

-            Collection.validate_query(kwargs["search_query"])
+            self.validate_query(self, kwargs["search_query"])

Otherwise, it is probably better to define validate_query as a static method, given that I can't override it.

created an issue: #1867
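The point about overriding can be shown with a toy example, assuming `validate_query` is a classmethod that takes only the query (the classes below are hypothetical). Calling it through the instance dispatches to a subclass override, while the hardcoded class name does not. Note that under this assumption the call is `self.validate_query(query)`; passing `self` explicitly would be an extra positional argument.

```python
class Collection:
    @classmethod
    def validate_query(cls, query):
        return f"{cls.__name__}:{query}"

    def update_hardcoded(self, **kwargs):
        # Always runs Collection's validation, even for subclasses.
        return Collection.validate_query(kwargs["search_query"])

    def update_via_self(self, **kwargs):
        # Looked up on type(self), so subclass overrides are respected.
        return self.validate_query(kwargs["search_query"])


class StrictCollection(Collection):
    @classmethod
    def validate_query(cls, query):
        if not query:
            raise ValueError("empty query")
        return super().validate_query(query)


strict = StrictCollection()
bypassed = strict.update_hardcoded(search_query="")  # override is skipped
```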

ntarocco · 2024-10-30T16:00:17Z

invenio_rdm_records/collections/services/service.py

+            identity, res, self.collection_schema, None, self.links_item_tpl
+        )
+
+    def search_collection_records(self, identity, collection_or_id, params=None):


Suggested change

-    def search_collection_records(self, identity, collection_or_id, params=None):
+    def search_records(self, identity, collection_or_id, params=None):

That was the original name, I modified it after reviews :(

ntarocco · 2024-10-30T16:03:25Z

invenio_rdm_records/collections/tasks.py

+    for citem in res:
+        try:
+            collection = citem._collection
+            res = collections_service.search_collection_records(


Given that we only need the total number of results, and that this task runs quite often for ALL collections, I would suggest using the count API of OpenSearch rather than a classic search, as it is more performant.
You might want to keep the search_records method in the service and add a second one, count_records.

Created an issue: #1866
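A hedged sketch of what a count-based helper might look like. It assumes a client object exposing `count(index=..., body=...)` that returns a dict with a `count` key, which is the shape of the OpenSearch `_count` API response. The `count_records` function, the index name, and the stub client are illustrative only; the real service would go through the existing search layer.

```python
def count_records(client, index, query_string):
    """Return the number of matching documents without fetching any hits."""
    body = {"query": {"query_string": {"query": query_string}}}
    response = client.count(index=index, body=body)
    return response["count"]


class StubClient:
    """Stand-in that mimics the shape of a `_count` response."""

    def __init__(self, count):
        self._count = count
        self.calls = []

    def count(self, index=None, body=None):
        self.calls.append((index, body))
        return {"count": self._count}


client = StubClient(42)
total = count_records(client, "records-index", "metadata.title:physics")
```

Unlike a search, a count request does not score, sort, or serialize hits, which is why it tends to be cheaper for a task that only needs totals.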

ntarocco · 2024-10-30T16:05:30Z

Before merging, please fix the tests.

* Added resource for collections
* Moved records search from community records to collections service

alejandromumo force-pushed the add_collections_tasks branch from db5d814 to 0a6df3e Compare October 25, 2024 11:04

alejandromumo force-pushed the add_collections_tasks branch 2 times, most recently from 29697f3 to 6645c1f Compare October 25, 2024 16:15

jrcastro2 approved these changes Oct 29, 2024

ntarocco approved these changes Oct 30, 2024

collections: move records search into service

27eeac6

* Added resource for collections
* Moved records search from community records to collections service

alejandromumo force-pushed the add_collections_tasks branch from 6645c1f to 27eeac6 Compare October 31, 2024 09:15

This was referenced Oct 31, 2024

collections: make collection class configurable #1865

Open

collections: compute number of records using "count" API instead of "search" #1866

Open

collections: validate query passing the instance instead of hardcoding the class #1867

Open

alejandromumo merged commit d2ccb60 into inveniosoftware:master Oct 31, 2024
4 checks passed
