Skip to content

feat: allow aggregates inside spread one-to-many and many-to-many embedded resources #4002

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

laurenceisla
Copy link
Member

@laurenceisla laurenceisla commented Apr 9, 2025

WIP: Completes #3041 and continues what was discussed in this comment chain #3640 (comment)

There I commented that:

I'm not yet convinced for or against the group_by proposal. I think adding explicit operations helps in understanding what is "expected" or not. But I still don't want to lose the simplicity of not needing a separate group_by when aggregating.

After trying a tentative implementation and checking many cases, using group_by gets quite tedious to maintain in the code or create the queries for the users. Mostly when we have many nested relationships (pending examples).

The proposal in this PR is to kinda mix both concepts, that is having an expected result even if they're "useless" but avoiding to use a separate group_by. To achieve this we would need to separate the concept of "spreading" and "hoisting". Doing a "spread" means we're removing the surrounding object (or array in to-many relationships) but we won't be hoisting the aggregate to the top. In other words, what we get before using the spread is the same as what we get after, but without the surrounding object/array.

That's why we would need a separate operator to hoist the aggregates (and the fields to group by) to the top, something like ^. For example, in the following request, the cost.sum() won't be hoisted automatically, but it will remove the surrounding process_costs:

curl 'localhost:3000/processes?select=category_id,...process_costs(cost.sum())'
[{"category_id":1,"sum":150.00},
 {"category_id":2,"sum":200.00},
 {"category_id":1,"sum":180.00},
 {"category_id":1,"sum":70.00},
 {"category_id":2,"sum":40.00},
 {"category_id":2,"sum":70.00},
 {"category_id":2,"sum":null},
 {"category_id":2,"sum":40.00}]

We would now need to use the hoist operator to move the aggregate to the top:

curl 'localhost:3000/processes?select=category_id,^process_costs(cost.sum())'
[
  {"category_id":2,"sum":350.00},
  {"category_id":1,"sum":400.00}
]

With this we would always expect that the GROUP BY will always be done inside the relationship that contains it (even if it useless), and the only way of changing this is to explicitly hoist it to the top. There is also potential in allowing to hoist the aggregates in to-many spreads.

Also when trying different queries I keep seeing/expecting the to-many aggregates as single responses (what was implemented before this convo), so I'm not against that either.

I still vote for this. The "single object" result is what made me doubt in using ... to hoist or doing an automatic hoisting altogether. I think there is some confusion at simple glance if we did something like:

curl 'localhost:3000/factories?...processes(...process_costs(cost.sum()))

Would we want the aggregate cost.sum() to go to the top (i.e. for all the factories?), or keep it at the level of processes and just remove the processes array wrapper. Using the hoist operator would clarify what we'd want to get.

breaking:
* Spreading using `...` no longer hoists the aggregates to the top relationships.
* The new operator `^` is now used to hoist the aggregate to the top, e.g. `/table?select=^rel(cost.sum())`
@laurenceisla laurenceisla changed the title feat: allow spreading one-to-many and many-to-many embedded resources feat: allow aggregates inside spread one-to-many and many-to-many embedded resources Apr 9, 2025
@wolfgangwalther
Copy link
Member

To achieve this we would need to separate the concept of "spreading" and "hoisting".

While this reads nice at first, I wonder: Is there any real use-case that benefits from it? Aka is there any scenario, where both "use spreading only" and "use spreading + hoisting" give sensible results?

@laurenceisla
Copy link
Member Author

laurenceisla commented Apr 12, 2025

Is there any real use-case that benefits from it?

Mostly the case of to-many spreading/hoisting that I mentioned at the end. It's mostly to avoid confusion when using any of those operations.

is there any scenario, where both "use spreading only" and "use spreading + hoisting" give sensible results?

In that example, if we want to hoist the sum() to the top (to factories) we'd use:

curl 'localhost:3000/factories?^processes(^process_costs(cost.sum()))
[{"sum": 1000}]

If we want to keep the sum at the processes level, but if we wanted to remove the surrounding object, we'd do:

curl 'localhost:3000/factories?...processes(^process_costs(cost.sum()))
[
  {"sum": 200},
  {"sum": 100},
  "..."
]

In this case the sensible results that we were implementing (e.g. hoisting the aggregate to the top on to-one spreads) won't happen anymore, they'd need to be explicitly stated by the user by using the specific operator (e.g. using ^ instead of those to-one spreads). We'd sacrifice those sensible queries for a more rigorous way to get the data in order to avoid possible confusion.

I may be misunderstanding the question though, lmk.

Edit: My example needed a hoist in process_costs.

@wolfgangwalther
Copy link
Member

I may be misunderstanding the question though, lmk.

I guess my question is, whether the second example, without hoisting, is "sensible". I.e. would anybody use this?

I see the syntax from your example and the expected outcome - but is it something "real"? Do the returned values make sense, would somebody actually use this? Certainly not in the contrived example with only a single "sum" column, but multiple rows of it, because it lacks context.

I'm skeptical that, in the absence of explicit GROUP BY anywhere, the resulting numbers can actually be made sense of. I'm expecting that, when you return those multiple aggregation results, you will always end up returning some additional columns as an identifier of each of the rows, so that you can tell what the aggregation means. But once you have that identifier... the result with hoisting will be the same, because the identifier will be part of the GROUP BY.

So... I don't see it, yet. Imho this adds another way to shoot yourself in the foot with our query syntax, with no sensible output.

@laurenceisla
Copy link
Member Author

I guess my question is, whether the second example, without hoisting, is "sensible". I.e. would anybody use this?

It would be sensible, for to-many relationships at least. My example was wrong though, the spread was meant to be used in the processes table and the hoist the cost.sum() to that embedding (hence why it's not sensible for to-one rels).

But once you have that identifier... the result with hoisting will be the same, because the identifier will be part of the GROUP BY.

In the case of to-many relationships they wouldn't be the same. I think a better word for "hoisting" would be "flattening" since we're selecting the columns in the top resource before grouping them. For example, say we want to get the sum of processes costs grouped by category_id for each factory, and then return it in a "spread" format.

curl 'localhost:3000/factories?...processes(category_id,^process_costs(cost.sum()))
[
  {"category_id": [1, 2], "sum": [200, 100]},   // Factory with id: 1
  {"category_id": [1, 2], "sum": [300, 200]},   // Factory with id: 2
  ...
]

No flattening done here, we return the response in arrays for each factory (my original example doesn't show it in arrays because we return a single sum object without column to group by).

But once we hoist it all the way to the top factories table, then we flatten it and GROUP BY every selected column in the way. Our example now would be like:

curl 'localhost:3000/factories?^processes(category_id,^process_costs(cost.sum()))
[
  {"category_id": 1, "sum":  600},
  {"category_id": 2, "sum":  400}
]

I'm skeptical that, in the absence of explicit GROUP BY anywhere, the resulting numbers can actually be made sense of.

Yes, that's fair. The user would need to check the hoisting or flattening at every single level to see if the GROUP BY is applied to that embedding with the aggregate or the subsequent top one. The pro is that it avoids the duplicity caused by a separate group_by .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants