Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

collections: added task to compute num of records #1853

Merged

Conversation

alejandromumo
Copy link
Member

@alejandromumo alejandromumo commented Oct 21, 2024

  • added task to compute number of records for all the collections
  • added service methods to read collections (many, all)
  • added tests

closes #1852

Copy link
Contributor

@ntarocco ntarocco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with @ptamarit

Comment on lines 94 to 95
if not ids_:
return []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor comment: why checking this year, and not delegating it to the caller?
The caller could avoid calling the method if no ids given.
Otherwise, it feels like that we should check the values of all params, including depth.

else:
collection = collection_or_id

params.update({"collection_id": collection.id})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am puzzled by fetching the collection query inside the community service here.
It feels like a specific case, and if in the future when need to add another extra_filter there, we will copy the pattern and add an extra if/else there.
Why not fetching here the collection query, and injecting it in the search as an extra filter?
It also feels like we are fetching multiple times the collection object around.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a really good point and I had the same feeling that it could go just in one place. Plus, collections are not bound to communities at its core so the search should be possible without going through the community records service.

For this change, however, we need a new endpoint inside collections to expose the collection service search method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed IRL and we were on the same page: the collection's records search is better fitted in the collection service. Therefore, I added a new commit with the required changes.

In order not to block the development, both commits of this PR are working so we can split them if needed.

return cls.query.order_by(cls.path, cls.order)

def update(
self, /, slug=None, title=None, search_query=None, order=None, num_records=None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the / param? Did you mean _?

Copy link
Member Author

@alejandromumo alejandromumo Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the method now based on your suggestions above.

Anyway, I wanted to use keyword-only parameters (*) and not positional-only. The idea was that if someone uses collection.update they need to be explicit of what they want to update, to avoid unexpected errors coming from e.g. swapping the order unintentionally or drilling objects.

Since I now refactored the update code to have a set of allowed parameters, it accomplishes the same and this was removed. The validation is now done in the schema, so it's protected already.

@@ -14,7 +14,8 @@ class CollectionSchema(Schema):

slug = fields.Str()
title = fields.Str()
depth = fields.Int()
depth = fields.Int(dump_only=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't this prevent us from creating collections with desired depth via service in the future ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the depth of a collection tells you how deep it is in the tree and is a read-only field. It is computed based on the path when you create a collection. The path is the field that matters here since collections are based on the materialized path pattern.

Initially, we added depth as a field to improve the performance of read queries.

The service has two methods to create collections:

  • create: creates a new "root" collection (e.g. no parent) inside a CollectionTree. The path is always "empty" since it's a root (path=','). I added a docstring to be more explicit about it.
  • add: adds a collection as a child of another one. The path will be computed based on the parent.

num_records = fields.Int()
search_query = fields.Str(load_only=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search query will be stripped out when dumping, don't we want it visible in the UI in the future?
Not sure, what was the reason to instroduce this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's stripped out for now because it doesn't have a lot of value by itself. The "real" query is computed based on the parents' queries too, so for now it's only required to be updated.

If we need it in the UI in the future, first we must decide what we want to show (e.g. the collection individual query or the "final" query)

identity, res, self.collection_schema, None, self.links_item_tpl
)

def read_all(self, identity, depth=2):
Copy link
Contributor

@kpsherva kpsherva Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the name be search to align with the usual service method naming convention?
any difference between read_all and read_many? wouldn't it be easier to merge them and search based on presence of ids_?
Same with underlying api methods .resolve_many and resolve_all

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming it search

I don't agree, I would expect search to search in the search engine. Plus read (and read_many, read_all) are also widely used in other services.

Moreover, collections_service.search could also be ambiguous. Does it mean search collections or search records inside the collections? In my opinion the latter makes more sense, so we added collections_service.search_records to be explicit about it.


Difference between read_many and read_all

Read many is a specific query SELECT ... FROM ... WHERE id in (1,2,3) while read all means SELECT ... FROM. So I split it because:

1 - I want it to be very explicit they do different things
2- To avoid ids_=[] having any implicit meaning. e.g. would it mean fetch all the collections OR fetch 0 collections?


Underling resolve methods

These were renamed to be consistent with the service methods.

for citem in res:
try:
collection = citem._collection
res = collections_service.search_records(system_identity, collection)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it make sense to take advantage of the OS count query instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether it's faster or not (I assume it should be, but to be tested), but is it supported by the other services methods? Searching for records inside a collection ultimately means using the CommunityRecords or RecordsService search.

Comment on lines 133 to 139
if isinstance(collection_or_id, int):
collection = self.collection_cls.resolve(id_=collection_or_id)
else:
collection = collection_or_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for me to understand: in what circumstances do we pass collection, not the id? I think most of the services only accept id, not the obj

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added to avoid resolving collections twice. I.e. this is useful when chaining service method calls. For the celery task implemented here, I need to read all the collections and then search each one of them.

The flow of having only an id and resolving the entity inside works well when in a request context and when all the logic is contained within one service method.

An alternative would be having a service method that does all of that (e.g. search_all_collections), but it's not as granular and reusable as I would expect from a service method.

* added task to compute number of records for all the collections
* added "collection_id" parameter to record search
* added service methods to read collections (many, all)
* added tests
* collections: refactor 'resolve' to 'read'
* collections: rename 'search_records' method
* collections: update read method signature
@alejandromumo alejandromumo force-pushed the add_collections_tasks branch 2 times, most recently from 29697f3 to 6645c1f Compare October 25, 2024 16:15
Copy link
Contributor

@jrcastro2 jrcastro2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Peer reviewed with @0einstein0 LGTU! 🚀
We just left a just a minor question

@@ -131,3 +165,47 @@ def read_logo(self, identity, slug):
if _exists:
return url_for("static", filename=logo_path)
raise LogoNotFoundError()

def read_many(self, identity, ids_, depth=2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Querstion: Unless we are missing something this service method doesn't seem to be used at all (unless there is plans to use it I would remove it)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

IMO it's useful to have it alongside read_all to make a clear distinction since the underlying queries are distinct and have performance differences.

Therefore I prefer it stay there from early on.

Copy link
Contributor

@ntarocco ntarocco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with ❤️ by Nico and Carlin.

We think that you could merge, and eventually improve/fix things in a subsequent PR.

def update(self, identity, collection_or_id, data=None, uow=None):
"""Update a collection."""
if isinstance(collection_or_id, int):
collection = self.collection_cls.read(id_=collection_or_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: instead of hardcoding the collection class (line 34), we could have added it to the ServiceConfig, as normally done, so it could be more easily overridden.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue: #1865

def update(self, **kwargs):
"""Update the collection."""
if "search_query" in kwargs:
Collection.validate_query(kwargs["search_query"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about using self here, even if it is a classmethod, instead of hardcoding?

Suggested change
Collection.validate_query(kwargs["search_query"])
self.validate_query(self, kwargs["search_query"])

Otherwise, it is probably better to define validate_query as a static method, given that I can't override it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue: #1867

identity, res, self.collection_schema, None, self.links_item_tpl
)

def search_collection_records(self, identity, collection_or_id, params=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def search_collection_records(self, identity, collection_or_id, params=None):
def search_records(self, identity, collection_or_id, params=None):

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the original name, I modified it after reviews :(

for citem in res:
try:
collection = citem._collection
res = collections_service.search_collection_records(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we only need to get the total n. of results, and that this task runs quite often for ALL collections, I would suggest using the count APIs of OpenSearch rather than a classic search, as it is more performant.
You might want to keep the search_records method in the service, and add a second one count_records.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created an issue: #1866

@ntarocco
Copy link
Contributor

Before merging, please fix the tests.

* Added resource for collections
* Moved records search from community records to collections service
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

collections: update number of records regularly
4 participants