-
Notifications
You must be signed in to change notification settings - Fork 25.4k
ES|QL: Fix Fork field reference tracking #131723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ES|QL: Fix Fork field reference tracking #131723
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, can you please:
- link to the issue
- change the PR title - this is less about adding tests and more about improving field resolution when FORK is used
- add the proper PR labels
- add some description to the PR
@@ -828,6 +823,32 @@ static PreAnalysisResult fieldNames(LogicalPlan parsed, Set<String> enrichPolicy | |||
return result.withFieldNames(IndexResolver.ALL_FIELDS); | |||
} | |||
|
|||
Holder<Boolean> projectAfterFork = new Holder<>(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these look like the changes from #128193 - these are not going to work.
please check what happens later in the fieldNames
method. For example here where we remove field references - this has the potential to work incorrectly for FORK:
elasticsearch/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/EsqlSession.java
Lines 909 to 929 in 3ecfed1
if (canRemoveAliases[0]) { | |
// remove any already discovered UnresolvedAttributes that are in fact aliases defined later down in the tree | |
// for example "from test | eval x = salary | stats max = max(x) by gender" | |
// remove the UnresolvedAttribute "x", since that is an Alias defined in "eval" | |
// also remove other down-the-tree references to the extracted fields from "grok" and "dissect" | |
AttributeSet planRefs = p.references(); | |
Set<String> fieldNames = planRefs.names(); | |
p.forEachExpressionDown(NamedExpression.class, ne -> { | |
if ((ne instanceof Alias || ne instanceof ReferenceAttribute) == false) { | |
return; | |
} | |
// do not remove the UnresolvedAttribute that has the same name as its alias, ie "rename id AS id" | |
// or the UnresolvedAttributes that are used in Functions that have aliases "STATS id = MAX(id)" | |
if (fieldNames.contains(ne.name())) { | |
return; | |
} | |
referencesBuilder.removeIf( | |
attr -> matchByName(attr, ne.name(), keepRefs.contains(attr) || dropWildcardRefs.contains(attr)) | |
); | |
}); | |
} |
As an example, I tested the following queries locally:
FROM employees
| FORK
( STATS x = VALUES(last_name))
( EVAL last_name = first_name | STATS y = VALUES(last_name))
FROM employees
| FORK
( EVAL last_name = first_name | STATS y = VALUES(last_name))
( STATS x = VALUES(last_name))
They are pretty much the same - just the order of the FORK branches is different.
The first one breaks with:
line 4:26: Unknown column [last_name], did you mean [first_name]?
the second one works.
the first one breaks because most likely we remove the reference to last_name
as part of the logic from the snippet I shared earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing those. For the future, please record such failing cases in the original github issue. This will be helpful for folks working on it to understand the context better.
@@ -2118,17 +2118,41 @@ public void testForkFieldsWithStatsInOneBranch() { | |||
} | |||
|
|||
public void testForkFieldsWithEnrichAndLookupJoins() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some more tests?
it would be great if you can test this case and see that the following 2 queries have the same references:
FROM test
| FORK
( STATS x = VALUES(last_name))
( EVAL last_name = first_name | STATS y = VALUES(last_name))
FROM test
| FORK
( EVAL last_name = first_name | STATS y = VALUES(last_name))
( STATS x = VALUES(last_name))
Then I would also add some more tests where the FORK branches themselves use KEEP and DROP (with wildcard field patterns too). When I added these tests the FORK branches did not support commands like KEEP and DROP, but now they do.
If you can add other tests, like FORK branches that use ENRICH for example, that would be great - thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying! Again, for future reference, the original issue is a great place to record those particulars.
One way we could do this is to call the This does a top down traversal. Let's say we have this query:
FORK is a n-ary plan and it will have a two separate plans as children When we do I propose that when we see a Once we have seen a |
Hi @svilen-mihaylov-elastic, I've created a changelog YAML for you. |
Hi @svilen-mihaylov-elastic, I've updated the changelog YAML for you. |
@elasticsearchmachine test this |
parsed.forEachDown(p -> {// go over each plan top-down | ||
if (p instanceof RegexExtract re) { // for Grok and Dissect | ||
var processingLambda = new Holder<Function<LogicalPlan, Boolean>>(); | ||
processingLambda.set((LogicalPlan p) -> {// go over each plan top-down |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lambda is essentially recursive here. I can factor into a separate function if you prefer. Left it this way to minimize the diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the change in Node
is necessary. An explanation in the review comment.
I didn't look at the code in FieldNameUtils
yet (that requires a more significant time investment and I plan to do it later), but it would help build confidence in it with more tests. I suggested an optional approach to "automatically" create more tests for fork
in FieldNameUtilsTests
. Thanks
@@ -65,13 +65,28 @@ public List<T> children() { | |||
} | |||
|
|||
@SuppressWarnings("unchecked") | |||
public void forEachDown(Consumer<? super T> action) { | |||
action.accept((T) this); | |||
public boolean forEachDownMayReturnEarly(Function<? super T, Boolean> action) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Node
class is fundamental to the QL code in ESQL and it very rarely (if at all) changes.
I don't think this change here is really necessary to warrant for a Node
change. Having this early "exit" from tree traversal outside Node
is also helpful for whoever writes code that traverses the tree, to be aware of the need of early "exit" and explicitly use it. There are multiple examples in ESQL code for this, one of them is here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing out the example. I was aware such examples. Two issues:
- As you mention, the example you linked to is not a true exit. If the tree is particularly deep you will still need to traverse it, and incur stack overhead at the very least. By contrast this introduces a true early exit.
- I think the change here is fine because the regular forEachDown() function below (line 84) relies on it and by virtue of this it is begin exercised by virtually all ESQL tests in existence. It is a fairly standard way to extend functionality which I've used successfully in the past. If you have specific concerns about additional testing, or change of the API, I'd be happy to address one by one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use the pattern that we already use in ES|QL.
I agree that the Node class is fundamental and if this is the only place where we need to use something like a forEachDownMayReturnEarly
it might not be worth adding it.
If there's an opportunity for us to add something like a forEachDownMayReturnEarly
, because we see this pattern in multiple places, we can have that as a separate conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is the only place where we need to use something like a
forEachDownMayReturnEarly
it might not be worth adding it.
It is clearly not the case, there is at least one other example as Andrei pointed out, and probably many others, where it will be useful to limit the traversal, from a performance point of view.
Again, if there is specific amount of testing for this, or an API change, lets discuss this, and not reject blindly based on existing usage (and its limitations). I'm looking for constructive feedback here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, if there is specific amount of testing for this, or an API change, lets discuss this, and not reject blindly based on existing usage (and its limitations). I'm looking for constructive feedback here.
I know the change doesn't seem much and seems very trivial to evaluate and review but there is some experience and some paranoia involved with this class and the edge cases are many and not easy to spot. And I do also understand your point of view.
I need more time to look at this in depth, there is a lot on my plate this week and I won't have time to properly review it until early next week.
If you cannot wait until then, then please go ahead and ask for a review from @ioanatia (she's the author of fork
and knows all there is to know about this command) only and merge when the PR is ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've looked in depth at the code change proposal and these below are my arguments for not approving this change as is now:
- this is a harder to grasp code (the method in
Node
and the needed changes inresolveFieldNames
) for whoever is looking at the code as a first-timer with the goal to understand it. I've tried to refactor it to see if there is a better structure that makes it more easily readable (isolating thetrue
branch of theforEachDownMayReturnEarly
only tofork
needs), but I couldn't find one. - the entire bulk of code here (which existed before the need for
forEachDownMayReturnEarly
method) suddently needs to be aware of the early exit fromforEachDown
even though it doesn't need to know this, onlyfork
code must be aware of the early exit logic. This is the main reason why a different exit logic, which is isolated tofork
as much as possible, is better. The logic inresolveFieldNames
is complex enough and many other people who looked at it and contributed to it mentioned the code is complex enough and code comments are essential. Let's aim to make it less complex (with code comments for the next person looking at the code) or at least keep it at the same complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responded in a new thread here: https://github.com/elastic/elasticsearch/pull/131723/files#r2252384903 since this one has become outdated.
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/session/FieldNameUtilsTests.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous evaluation still stands. Technically, the solution is correct; there are other aspects that can be improved/re-evaluated. Please, see comments.
Kudos for the number of tests, I looked only at few of them and tested some other different queries and things seem to work. I will review all the tests after an update to the PR following my review.
@@ -110,19 +106,42 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso | |||
// lookup indices where we request "*" because we may require all their fields | |||
Set<String> wildcardJoinIndices = new java.util.HashSet<>(); | |||
|
|||
boolean[] canRemoveAliases = new boolean[] { true }; | |||
var canRemoveAliases = new Holder<>(true); | |||
var needsAllFields = new Holder<>(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need this. There is an already existent variable for this purpose. projectAll
referencesBuilder.set(AttributeSet.builder()); | ||
var return_result = child.forEachDownMayReturnEarly(processingLambda.get()); | ||
// No nested Forks for now... | ||
assert return_result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in FieldNameUtils is complex enough and covers many edge cases, good comments make it more easily approachable by others.
The assert
needs a message (something like FORKs within FORKs is not supported. This is already checked here ....., this assert here is for good measure
) and, also, a better comment, please; or, if the message of the assert
is meaningful enough, the comment can be changed to just double checking we don't run into a fork within a fork
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this comment pending the change of API, please take another look.
var processingLambda = new Holder<Function<LogicalPlan, Boolean>>(); | ||
processingLambda.set((LogicalPlan p) -> {// go over each plan top-down | ||
if (p instanceof Fork fork) { | ||
// Early return from forEachDown. We will iterate over the children manually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment here didn't help me much in grasping the code in an easier way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated it, take another look.
@@ -65,13 +65,28 @@ public List<T> children() { | |||
} | |||
|
|||
@SuppressWarnings("unchecked") | |||
public void forEachDown(Consumer<? super T> action) { | |||
action.accept((T) this); | |||
public boolean forEachDownMayReturnEarly(Function<? super T, Boolean> action) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've looked in depth at the code change proposal and these below are my arguments for not approving this change as is now:
- this is a harder to grasp code (the method in
Node
and the needed changes inresolveFieldNames
) for whoever is looking at the code as a first-timer with the goal to understand it. I've tried to refactor it to see if there is a better structure that makes it more easily readable (isolating thetrue
branch of theforEachDownMayReturnEarly
only tofork
needs), but I couldn't find one. - the entire bulk of code here (which existed before the need for
forEachDownMayReturnEarly
method) suddently needs to be aware of the early exit fromforEachDown
even though it doesn't need to know this, onlyfork
code must be aware of the early exit logic. This is the main reason why a different exit logic, which is isolated tofork
as much as possible, is better. The logic inresolveFieldNames
is complex enough and many other people who looked at it and contributed to it mentioned the code is complex enough and code comments are essential. Let's aim to make it less complex (with code comments for the next person looking at the code) or at least keep it at the same complexity.
var return_result = child.forEachDownMayReturnEarly(processingLambda.get()); | ||
// No nested Forks for now... | ||
assert return_result; | ||
if (referencesBuilder.get().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the logic behind this check? Why would an empty referencesBuidler
here would terminate early (and mark the query as needing *
all fields)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See below on line 232:
if (fieldNames.isEmpty() && enrichPolicyMatchFields.isEmpty())
This has the comment
// there cannot be an empty list of fields, we'll ask the simplest and lightest one instead: _index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that's correct. The logic at line 232 is about queries like this from test | eval x = 123 | keep x
. Meaning, the query actually doesn't need any fields from _field_caps
. The logic for fork
says completely the opposite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that's correct. The logic at line 232 is about queries like this
from test | eval x = 123 | keep x
. Meaning, the query actually doesn't need any fields from_field_caps
. The logic forfork
says completely the opposite.
I agree that the logic seems pretty confusing, but at the same time I believe the statement you make regarding the fields is only partially true. Consider the following query with match.
FROM employees
| FORK
( STATS x = count(*))
( WHERE emp_no == "2" )
Consider the "stats" stage which looks like this
Eval[[fork1[KEYWORD] AS _fork#3]]
\_Aggregate[[],[?count[*] AS x#2]]
\_UnresolvedRelation[employees]
It has no referenced columns (p.references() is empty). Thus by the logic of the comment I mentioned above we should retain all columns. This information needs to be propagated back to the caller (the Fork). Thus if any of the Fork's children need all the fields, the we can break out early and just retain all fields.
I can try to make this a bit clear in the following way. The lambda can return a pair <bool, ColumnSet>. The bool can indicate if we need to retain all columns. How does this sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not about the lambda here, but about the logic of the field name resolution for fork
.
I have tested this query
FROM employees | fork (eval x = 1 | keep x) (eval y = 2 | keep y) (eval z = 3 | keep z)
and with the logic as projectAll.set(true);
if referencesBuilder
is empty we ask *
from ES even though we don't need all the fields, actually we don't need any field except _index
as I mentioned previously.
I've tested the same query with no projectAll.set(true)
in the referencesBuilder
empty branch and we will ask for _index
only and the query returns the same columns and rows as the previous test.
Thank you for bringing up the stats
query as well, that's a good example.
stats x = count(*)
is similar with eval y = ... | keep y
in that it "resets" the columns list to something that's not found in the ES indices (aka a FieldAttribute
) and this branch of fork
doesn't need anything from ES. Actually, running from employees | stats x = count(*)
will not bypass the field resolution and shortcut it to *
(all fields), but run into the _index
branch. I don't see why fork
should be have differently here; after all, fork
runs a bunch of queries that are missing the from
part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I have added the query you have in your comment as a test, and it returns *. I'm not following if there is another change your are proposing or you simply have a question/observation about the existing (pre-fork) code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. Actually, running from employees | stats x = count(*) will not bypass the field resolution and shortcut it > to * (all fields), but run into the _index branch. I don't see why fork should be have differently here; after > all, fork runs a bunch of queries that are missing the from part.
I'm not following this part. I'm not sure I understand how fork behaves differently here... If we end up with empty set for resolved fields, we will convert to _index, no? In that regard this behavior is the same as it should be, but maybe I'm missing something in your explanation.
public void forEachDown(Consumer<? super T> action) { | ||
action.accept((T) this); | ||
void forEachDownMayReturnEarly(BiConsumer<? super T, Holder<Boolean>> action, Holder<Boolean> breakEarly) { | ||
action.accept((T) this, breakEarly); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quoting the previous comment from @astefan from here #131723 (comment) here since it became outdated.
Ok, I've looked in depth at the code change proposal and these below are my arguments for not approving this change as is now:
this is a harder to grasp code (the method in Node and the needed changes in resolveFieldNames) for >whoever is looking at the code as a first-timer with the goal to understand it. I've tried to refactor it to >see if there is a better structure that makes it more easily readable (isolating the true branch of the >forEachDownMayReturnEarly only to fork needs), but I couldn't find one.
the entire bulk of code here (which existed >before the need for forEachDownMayReturnEarly method) suddently needs to be aware of the early exit >from forEachDown even though it doesn't need to know this, only fork code must be aware of the early >exit logic. This is the main reason why a different exit logic, which is isolated to fork as much as possible, >is better. The logic in resolveFieldNames is complex enough and many other people who looked at it and >contributed to it mentioned the code is complex enough and code comments are essential. Let's aim to >make it less complex (with code comments for the next person looking at the code) or at least keep it at >the same complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback. To my understanding you are unhappy with the API I originally had which had a lambda returning true/false to indicate early exit. As you pointed out, existing code needs to be aware of the correct default value (true) to return if it does not with to return the traverse early. This is fair criticism.
I've made a small tweak below. Essentially instead of returning true/false, I pass a boolean holder. This holder needs to be updated to true only if the lambda needs the traverse to end early, by default no action is needed. Thus only the fork code is aware of the need to exit early and the non-fork code stays as is.
I think this should satisfy the criteria you mentioned. I'd be happy to iterate further as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imho, this PR is an improvement for FORK fields resolution but it's not the definitive solution. I am all in for having this cautiously improved 👍 . And thank you for keep iterating and providing constructive answers to my PR reviews.
I've added some suggestions. Please see below.
The most important one is about Node
: do not change the existent forEachDown
method. Leave it as it previously was and create new methods to accommodate for forEachDownMayReturnEarly
. Imo, forEachDownMayReturnEarly
might even disappear if the definitive solution to fork
field resolution is provided.
x-pack/plugin/esql-core/src/main/java/org/elasticsearch/xpack/esql/core/tree/Node.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/FieldNameUtils.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/FieldNameUtils.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/session/FieldNameUtils.java
Outdated
Show resolved
Hide resolved
var return_result = child.forEachDownMayReturnEarly(processingLambda.get()); | ||
// No nested Forks for now... | ||
assert return_result; | ||
if (referencesBuilder.get().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not about the lambda here, but about the logic of the field name resolution for fork
.
I have tested this query
FROM employees | fork (eval x = 1 | keep x) (eval y = 2 | keep y) (eval z = 3 | keep z)
and with the logic as projectAll.set(true);
if referencesBuilder
is empty we ask *
from ES even though we don't need all the fields, actually we don't need any field except _index
as I mentioned previously.
I've tested the same query with no projectAll.set(true)
in the referencesBuilder
empty branch and we will ask for _index
only and the query returns the same columns and rows as the previous test.
Thank you for bringing up the stats
query as well, that's a good example.
stats x = count(*)
is similar with eval y = ... | keep y
in that it "resets" the columns list to something that's not found in the ES indices (aka a FieldAttribute
) and this branch of fork
doesn't need anything from ES. Actually, running from employees | stats x = count(*)
will not bypass the field resolution and shortcut it to *
(all fields), but run into the _index
branch. I don't see why fork
should be have differently here; after all, fork
runs a bunch of queries that are missing the from
part.
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/session/FieldNameUtilsTests.java
Show resolved
Hide resolved
x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/session/FieldNameUtilsTests.java
Show resolved
Hide resolved
…session/FieldNameUtils.java Co-authored-by: Andrei Stefan <[email protected]>
…session/FieldNameUtils.java Co-authored-by: Andrei Stefan <[email protected]>
…session/FieldNameUtils.java Co-authored-by: Andrei Stefan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @svilen-mihaylov-elastic , I'm chiming in because I noticed the changes to Node.java
.
I agree with @astefan : I appreciate wanting to reduce code duplication, but I'd really rather leave the implementation of forEachDown
alone, please, and just add a new method forEachDownMayReturnEarly
separately. We know that walking the plan/expression tree can be expensive in some cases, so this method actually affects performance.
If we really want to change forEachDown
, we'd have to monitor the nightly benchmarks for regressions (and carefully consider, first, if a regression is even covered by benchmarks) - and I'd seek out a review from @idegtiarenko then, who has worked on and improved the performance from walking large query plans in the past.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @svilen-mihaylov-elastic , the changes to Node.java
LGTM!
Addresses #127208