-
Notifications
You must be signed in to change notification settings - Fork 25.4k
ES|QL: Add TBUCKET function #131449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ES|QL: Add TBUCKET function #131449
Conversation
Introduce the function TBUCKET(<time interval>) which applies grouping on the @timestamp field, truncating its value to the specified granularity: TBUCKET(1h) is equivalent to BUCKET(1 hour, @timestamp) TBUCKET(7d) is equivalent to BUCKET(7 days, @timestamp) Closes elastic#131068
Introduce the function TBUCKET(<time interval>) which applies grouping on the @timestamp field, truncating its value to the specified granularity: TBUCKET(1h) is equivalent to BUCKET(1 hour, @timestamp) TBUCKET(7d) is equivalent to BUCKET(7 days, @timestamp) Closes elastic#131068
...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/TBucket.java
Outdated
Show resolved
Hide resolved
...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/TBucket.java
Outdated
Show resolved
Hide resolved
...in/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/grouping/TBucket.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/qa/testFixtures/src/main/resources/tbucket.csv-spec
Outdated
Show resolved
Hide resolved
Replace evaluation by a surrogate. Closes elastic#131068
Replace evaluation by a surrogate. Closes elastic#131068
Fix tests Closes elastic#131068
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, I'm only chiming in with regard to the proposed change of the optimizer rules; didn't consider the rest of the PR.
I'd like to look into how we can avoid another copy of SubstituteSurrogateExpressions
as additional rules add complexity to the optimizer and are difficult to refactor later.
@@ -142,6 +142,7 @@ protected static Batch<LogicalPlan> substitutions() { | |||
new ReplaceAggregateAggExpressionWithEval(), | |||
// lastly replace surrogate functions | |||
new SubstituteSurrogateAggregations(), | |||
new SubstituteSurrogateExpressions(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heya, if possible, I'd like to avoid adding another copy of SubstituteSurrogateExpressions
just for TBucket
. It very much looks like SubstituteSurrogateAggregations
should be dealing with this.
I think SubstituteSurrogateAggregations
may currently not substitute the surrogate in the grouping because groupings work a little differently from other aggregates. Can we investigate if this can be amended before adding a new rule to the substitution batch?
Scratch that, I'm looking at the optimizer sequence right now and will get back with a suggestion that does not not make sense, I hope.
- Remove SubstituteSurrogateExpressions rule from LogicalPlanOptimizer - Add TBucket translation to TranslateTimeSeriesAggregate
@@ -225,6 +226,12 @@ LogicalPlan translate(TimeSeriesAggregate aggregate) { | |||
throw new IllegalArgumentException("expected at most one time bucket"); | |||
} | |||
timeBucketRef.set(e); | |||
} else if (child instanceof TBucket tbucket && tbucket.field().equals(timestamp.get())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alex-spies , @fang-xing-esql I've removed the duplicating rule from LogicalPlanOptimizer in favor of this piece of code. Please, take a look. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks more localized now, thank you.
You could attempt the substitution before checking if the substitution result is a Bucket
, but the current version should work, too.
@@ -225,6 +226,12 @@ LogicalPlan translate(TimeSeriesAggregate aggregate) { | |||
throw new IllegalArgumentException("expected at most one time bucket"); | |||
} | |||
timeBucketRef.set(e); | |||
} else if (child instanceof TBucket tbucket && tbucket.field().equals(timestamp.get())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks more localized now, thank you.
You could attempt the substitution before checking if the substitution result is a Bucket
, but the current version should work, too.
public void testImplicitFieldNames() { | ||
assertFieldNames(""" | ||
FROM sample_data | ||
| STATS x = 1 year + TBUCKET(1 day) BY b1d = TBUCKET(1 day)""", Set.of("@timestamp", "@timestamp.*")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, ideally let's add a bunch of other queries like this.
Interesting examples use e.g. KEEP @timestamp
before the STATS
, or a KEEP @*
or KEEP *stamp*
.
Is it valid to have another STATS
later if @timestampsurvives? Like
STATS ... BY TBUCKET(1 day), @timestamp | WHERE ... | STATS BY TBUCKET(1 hour)`?
Also, what happens if theres an eval FROM sample_data | EVAL @timestamp = "2024-01-01"::date | STATS ... BY TBUCKET(2 days)
?
For these tests to be robust, we want to be as creative as possible :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey Alex, I added all the tests that you mentioned to both FieldNameUtilsTests and CsvTests. Thanks!
@@ -166,6 +171,13 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso | |||
// METRICS aggs generally rely on @timestamp without the user having to mention it. | |||
referencesBuilder.get().add(new UnresolvedAttribute(ur.source(), MetadataAttribute.TIMESTAMP_FIELD)); | |||
} | |||
|
|||
p.forEachExpression(UnresolvedFunction.class, uf -> { | |||
if (FUNCTIONS_REQUIRING_TIMESTAMP.contains(uf.name().toLowerCase(Locale.ROOT))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks correct to me, but I'd like to solicit a review by @astefan just for this specific part of the PR as this is delicate when done wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am looking now at this. Thank you for the ping @alex-spies
- Add more tests for corner cases
- Fix IT by adding SORT
@@ -14,6 +14,6 @@ | |||
public class GroupingWritables { | |||
|
|||
public static List<NamedWriteableRegistry.Entry> getNamedWriteables() { | |||
return List.of(Bucket.ENTRY, Categorize.ENTRY); | |||
return List.of(Bucket.ENTRY, Categorize.ENTRY, TBucket.ENTRY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed if tbucket
is on coordinator node only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not needed. Removed it from here and replaced code in the methods writeTo and getWriteableName to throw exceptions similarly to ToIp. Thank you!
# Conflicts: # x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
private TBucket(StreamInput in) throws IOException { | ||
this(Source.readFrom((PlanStreamInput) in), in.readNamedWriteable(Expression.class), in.readNamedWriteable(Expression.class)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed if serialization is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, sorry it's been more complicated than expected.
@@ -0,0 +1,343 @@ | |||
// TBUCKET-specific tests | |||
|
|||
tbucketByTenSecondsDuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future reference, it's possible to include spaces in the test names for CSV tests
|
||
FROM sample_data | ||
| KEEP @timestamp, event_duration, message | ||
| EVAL t = @timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what happens if an eval actually changes the timestamp, something like | EVAL @timestamp = @timestamp + 3 hours
. Does TBUCKET
pick up the original or modified timestamp value?
This could be tested in a follow up PR, doesn't have to block this from merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be good to also include a test or two for TBUCKET
in an eval; something like | EVAL key = TBUCKET(1 hour) | STATS minimum = MIN(whatever) BY key
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @leontyevdv ! I added one last comment related to serialization of tbucket
, the rest LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering why do we consider such a scenario an acceptable one:
from test | stats max(emp_no) by tbucket(1 hour)`
generating an error like Unknown column [@timestamp]
.
Do we have another case in ESQL where the user is supposed to know that @timestamp
is a field that must be present somewhere even though the user didn't actually type in the query @timestamp
?
Imho, this would be an acceptable use case if the query would be TS test | stats ...... by tbucket(1 hour)
meaning the user is aware that by using TS
source command, it is expected to be in the area of "timeseries" indices and queries and some things (like the @timestamp
field presence) are somewhat expected to happen.
Also, if I run this query from employees | stats min(salary) by tbucket(birth_date)
I get back an error message Unknown column [@timestamp]
.
It is ok when running ..... by tbucket(birth_date,1month)
to get back ql_illegal_argument_exception expects exactly one argument
but, as an user who is exploring tbucket
, if I remove 1month
and keep birth_date
(which is a date
field) to get back something that has nothing to do with my query, it is unexpected.
@martijnvg @kkrik-es @dnhatn thoughts?
@astefan we expect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. My earlier comment is unrelated to what, technically, the PR is doing. Please, regard that comment as an observation and something to discuss post-merge, if my observation is valid.
Thanks Andrei, makes sense. I think there may be a pattern here, let's see how we can better accommodate this paradigm in the language. |
Thank you all for your feedback, folks! I will definitely address all the suggestions for improvements in the following PR since this one is getting harder to maintain and to follow because of its size and the number of discussions. |
Introduce the function TBUCKET() which applies grouping on the @timestamp field, truncating its value to the specified granularity:
TBUCKET(1h) is equivalent to BUCKET(1 hour, @timestamp) TBUCKET(7d) is equivalent to BUCKET(7 days, @timestamp)
Closes #131068