-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
collections: added task to compute num of records #1853
collections: added task to compute num of records #1853
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed with @ptamarit
if not ids_: | ||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor comment: why checking this year, and not delegating it to the caller?
The caller could avoid calling the method if no ids
given.
Otherwise, it feels like that we should check the values of all params, including depth
.
else: | ||
collection = collection_or_id | ||
|
||
params.update({"collection_id": collection.id}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am puzzled by fetching the collection query inside the community service here.
It feels like a specific case, and if in the future when need to add another extra_filter
there, we will copy the pattern and add an extra if/else there.
Why not fetching here the collection query, and injecting it in the search
as an extra filter?
It also feels like we are fetching multiple times the collection object around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a really good point and I had the same feeling that it could go just in one place. Plus, collections are not bound to communities at its core so the search should be possible without going through the community records service.
For this change, however, we need a new endpoint inside collections to expose the collection service search method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed IRL and we were on the same page: the collection's records search is better fitted in the collection service. Therefore, I added a new commit with the required changes.
In order not to block the development, both commits of this PR are working so we can split them if needed.
return cls.query.order_by(cls.path, cls.order) | ||
|
||
def update( | ||
self, /, slug=None, title=None, search_query=None, order=None, num_records=None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the /
param? Did you mean _
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified the method now based on your suggestions above.
Anyway, I wanted to use keyword-only parameters (*
) and not positional-only
. The idea was that if someone uses collection.update
they need to be explicit of what they want to update, to avoid unexpected errors coming from e.g. swapping the order unintentionally or drilling objects.
Since I now refactored the The validation is now done in the schema, so it's protected already.update
code to have a set
of allowed parameters, it accomplishes the same and this was removed.
@@ -14,7 +14,8 @@ class CollectionSchema(Schema): | |||
|
|||
slug = fields.Str() | |||
title = fields.Str() | |||
depth = fields.Int() | |||
depth = fields.Int(dump_only=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
won't this prevent us from creating collections with desired depth via service in the future ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the depth
of a collection tells you how deep it is in the tree and is a read-only
field. It is computed based on the path
when you create a collection. The path
is the field that matters here since collections are based on the materialized path pattern.
Initially, we added depth
as a field to improve the performance of read queries.
The service has two methods to create collections:
create
: creates a new "root" collection (e.g. no parent) inside aCollectionTree
. The path is always "empty" since it's a root (path=','
). I added a docstring to be more explicit about it.add
: adds a collection as a child of another one. The path will be computed based on the parent.
num_records = fields.Int() | ||
search_query = fields.Str(load_only=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
search query will be stripped out when dumping, don't we want it visible in the UI in the future?
Not sure, what was the reason to instroduce this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's stripped out for now because it doesn't have a lot of value by itself. The "real" query is computed based on the parents' queries too, so for now it's only required to be updated.
If we need it in the UI in the future, first we must decide what we want to show (e.g. the collection individual query or the "final" query)
identity, res, self.collection_schema, None, self.links_item_tpl | ||
) | ||
|
||
def read_all(self, identity, depth=2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the name be search
to align with the usual service method naming convention?
any difference between read_all and read_many? wouldn't it be easier to merge them and search based on presence of ids_?
Same with underlying api methods .resolve_many
and resolve_all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming it search
I don't agree, I would expect search
to search in the search engine. Plus read
(and read_many
, read_all
) are also widely used in other services.
Moreover, collections_service.search
could also be ambiguous. Does it mean search collections or search records inside the collections? In my opinion the latter makes more sense, so we added collections_service.search_records
to be explicit about it.
Difference between read_many
and read_all
Read many is a specific query SELECT ... FROM ... WHERE id in (1,2,3)
while read all means SELECT ... FROM
. So I split it because:
1 - I want it to be very explicit they do different things
2- To avoid ids_=[]
having any implicit meaning. e.g. would it mean fetch all the collections OR fetch 0 collections?
Underling resolve
methods
These were renamed to be consistent with the service methods.
for citem in res: | ||
try: | ||
collection = citem._collection | ||
res = collections_service.search_records(system_identity, collection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it make sense to take advantage of the OS count query instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether it's faster or not (I assume it should be, but to be tested), but is it supported by the other services methods? Searching for records inside a collection ultimately means using the CommunityRecords
or RecordsService
search.
if isinstance(collection_or_id, int): | ||
collection = self.collection_cls.resolve(id_=collection_or_id) | ||
else: | ||
collection = collection_or_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just for me to understand: in what circumstances do we pass collection, not the id? I think most of the services only accept id, not the obj
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added to avoid resolving collections twice. I.e. this is useful when chaining service method calls. For the celery task implemented here, I need to read all the collections and then search each one of them.
The flow of having only an id
and resolving the entity inside works well when in a request context and when all the logic is contained within one service method.
An alternative would be having a service method that does all of that (e.g. search_all_collections
), but it's not as granular and reusable as I would expect from a service method.
db5d814
to
0a6df3e
Compare
* added task to compute number of records for all the collections * added "collection_id" parameter to record search * added service methods to read collections (many, all) * added tests * collections: refactor 'resolve' to 'read' * collections: rename 'search_records' method * collections: update read method signature
29697f3
to
6645c1f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Peer reviewed with @0einstein0 LGTU! 🚀
We just left a just a minor question
@@ -131,3 +165,47 @@ def read_logo(self, identity, slug): | |||
if _exists: | |||
return url_for("static", filename=logo_path) | |||
raise LogoNotFoundError() | |||
|
|||
def read_many(self, identity, ids_, depth=2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Querstion: Unless we are missing something this service method doesn't seem to be used at all (unless there is plans to use it I would remove it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
IMO it's useful to have it alongside read_all
to make a clear distinction since the underlying queries are distinct and have performance differences.
Therefore I prefer it stay there from early on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed with ❤️ by Nico and Carlin.
We think that you could merge, and eventually improve/fix things in a subsequent PR.
def update(self, identity, collection_or_id, data=None, uow=None): | ||
"""Update a collection.""" | ||
if isinstance(collection_or_id, int): | ||
collection = self.collection_cls.read(id_=collection_or_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: instead of hardcoding the collection class (line 34), we could have added it to the ServiceConfig, as normally done, so it could be more easily overridden.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue: #1865
def update(self, **kwargs): | ||
"""Update the collection.""" | ||
if "search_query" in kwargs: | ||
Collection.validate_query(kwargs["search_query"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about using self
here, even if it is a classmethod, instead of hardcoding?
Collection.validate_query(kwargs["search_query"]) | |
self.validate_query(self, kwargs["search_query"]) |
Otherwise, it is probably better to define validate_query
as a static method, given that I can't override it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created an issue: #1867
identity, res, self.collection_schema, None, self.links_item_tpl | ||
) | ||
|
||
def search_collection_records(self, identity, collection_or_id, params=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def search_collection_records(self, identity, collection_or_id, params=None): | |
def search_records(self, identity, collection_or_id, params=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was the original name, I modified it after reviews :(
for citem in res: | ||
try: | ||
collection = citem._collection | ||
res = collections_service.search_collection_records( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we only need to get the total n. of results, and that this task runs quite often for ALL collections, I would suggest using the count
APIs of OpenSearch rather than a classic search, as it is more performant.
You might want to keep the search_records
method in the service, and add a second one count_records
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue: #1866
Before merging, please fix the tests. |
* Added resource for collections * Moved records search from community records to collections service
6645c1f
to
27eeac6
Compare
closes #1852