leontyevdv
Contributor

Introduce the function TBUCKET() which applies grouping on the @timestamp field, truncating its value to the specified granularity:

TBUCKET(1h) is equivalent to BUCKET(1 hour, @timestamp)
TBUCKET(7d) is equivalent to BUCKET(7 days, @timestamp)

Closes #131068
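
To make "truncating its value to the specified granularity" concrete, here is a minimal Java sketch of fixed-interval truncation. It is not the Elasticsearch implementation: the class and method names are hypothetical, and it assumes buckets are counted from the Unix epoch in UTC, which the description above does not state.

```java
import java.time.Duration;
import java.time.Instant;

final class TimeBucketSketch {
    /** Floors ts to the start of its fixed-size bucket, counting buckets from the Unix epoch (assumption). */
    static Instant truncate(Instant ts, Duration interval) {
        long size = interval.toMillis();
        if (size <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long millis = ts.toEpochMilli();
        // floorMod keeps pre-1970 timestamps rounding down instead of toward zero.
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, size));
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2025-07-22T10:47:13Z");
        // Analogous in spirit to TBUCKET(1h): prints 2025-07-22T10:00:00Z
        System.out.println(truncate(ts, Duration.ofHours(1)));
        // Analogous in spirit to TBUCKET(7d), under the epoch-alignment assumption.
        System.out.println(truncate(ts, Duration.ofDays(7)));
    }
}
```

All timestamps that truncate to the same value fall into the same group, which is what makes the truncated value usable as a grouping key.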

Introduce the function TBUCKET(<time interval>) which applies grouping
on the @timestamp field, truncating its value to the specified
granularity:

TBUCKET(1h) is equivalent to BUCKET(1 hour, @timestamp)
TBUCKET(7d) is equivalent to BUCKET(7 days, @timestamp)

Closes elastic#131068
Contributor

github-actions bot commented Jul 22, 2025

Replace evaluation by a surrogate.

Closes elastic#131068
@elasticsearchmachine elasticsearchmachine added Team:Analytics Team:StorageEngine labels Jul 24, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Contributor

@alex-spies alex-spies left a comment

Heya, I'm only chiming in with regard to the proposed change of the optimizer rules; didn't consider the rest of the PR.

I'd like to look into how we can avoid another copy of SubstituteSurrogateExpressions as additional rules add complexity to the optimizer and are difficult to refactor later.

@@ -142,6 +142,7 @@ protected static Batch<LogicalPlan> substitutions() {
new ReplaceAggregateAggExpressionWithEval(),
// lastly replace surrogate functions
new SubstituteSurrogateAggregations(),
new SubstituteSurrogateExpressions(),

Contributor

@alex-spies alex-spies Aug 21, 2025

Heya, if possible, I'd like to avoid adding another copy of SubstituteSurrogateExpressions just for TBucket. It very much looks like SubstituteSurrogateAggregations should be dealing with this.

I think SubstituteSurrogateAggregations may currently not substitute the surrogate in the grouping because groupings work a little differently from other aggregates. Can we investigate if this can be amended before adding a new rule to the substitution batch?

Scratch that, I'm looking at the optimizer sequence right now and will get back with a suggestion that makes sense, I hope.

- Remove SubstituteSurrogateExpressions rule from LogicalPlanOptimizer
- Add TBucket translation to TranslateTimeSeriesAggregate
@@ -225,6 +226,12 @@ LogicalPlan translate(TimeSeriesAggregate aggregate) {
throw new IllegalArgumentException("expected at most one time bucket");
}
timeBucketRef.set(e);
} else if (child instanceof TBucket tbucket && tbucket.field().equals(timestamp.get())) {

Contributor Author

@alex-spies, @fang-xing-esql I've removed the duplicate rule from LogicalPlanOptimizer in favor of this piece of code. Please take a look. Thank you!

Contributor

Looks more localized now, thank you.

You could attempt the substitution before checking if the substitution result is a Bucket, but the current version should work, too.
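
The "substitute first, then check" ordering suggested here can be sketched as follows. This is an illustration only: Expr, Field, Bucket, TBucket and timeBucketOrNull are simplified, hypothetical stand-ins, not the real ES|QL classes, and the surrogate is assumed to be a Bucket over @timestamp as the PR description says.

```java
import java.time.Duration;

final class SurrogateSketch {
    interface Expr {}

    record Field(String name) implements Expr {}

    record Bucket(Expr field, Duration interval) implements Expr {}

    /** Hypothetical stand-in for TBucket: it carries only the interval, the field is implied. */
    record TBucket(Duration interval) implements Expr {
        Expr surrogate() {
            return new Bucket(new Field("@timestamp"), interval);
        }
    }

    /** Substitutes first, then inspects the result, so Bucket and TBucket share one code path. */
    static Bucket timeBucketOrNull(Expr child, Field timestamp) {
        Expr resolved = child instanceof TBucket t ? t.surrogate() : child;
        if (resolved instanceof Bucket b && b.field().equals(timestamp)) {
            return b;
        }
        return null; // not a bucket over the timestamp field
    }
}
```

With this ordering a separate TBucket branch is not needed; whether that is simpler than the merged version depends on details of TranslateTimeSeriesAggregate that are not visible in this thread.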

@leontyevdv leontyevdv requested a review from alex-spies August 22, 2025 14:36

public void testImplicitFieldNames() {
assertFieldNames("""
FROM sample_data
| STATS x = 1 year + TBUCKET(1 day) BY b1d = TBUCKET(1 day)""", Set.of("@timestamp", "@timestamp.*"));

Contributor

Hey, ideally let's add a bunch of other queries like this.
Interesting examples use e.g. KEEP @timestamp before the STATS, or a KEEP @* or KEEP *stamp*.

Is it valid to have another STATS later if @timestamp survives? Like STATS ... BY TBUCKET(1 day), @timestamp | WHERE ... | STATS BY TBUCKET(1 hour)?

Also, what happens if there's an eval, e.g. FROM sample_data | EVAL @timestamp = "2024-01-01"::date | STATS ... BY TBUCKET(2 days)?

For these tests to be robust, we want to be as creative as possible :)
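
The KEEP variants mentioned above are interesting because each one retains @timestamp in a different way. A rough Java sketch of that check follows, assuming `*` simply matches any run of characters; the helper names are hypothetical and this is not the ES|QL resolver.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class KeepPatternSketch {
    /** Returns true if name matches pattern, where '*' matches any run of characters (assumption). */
    static boolean matches(String pattern, String name) {
        // Quote the literal pieces and join them with ".*" so only '*' acts as a wildcard.
        String regex = Arrays.stream(pattern.split("\\*", -1))
            .map(Pattern::quote)
            .collect(Collectors.joining(".*"));
        return Pattern.matches(regex, name);
    }

    /** Returns true if at least one KEEP pattern retains the given field. */
    static boolean survives(List<String> keepPatterns, String field) {
        return keepPatterns.stream().anyMatch(p -> matches(p, field));
    }
}
```

Under that assumption, `@timestamp`, `@*` and `*stamp*` all keep the field, while a list such as `event_duration, message` drops it, which is the case where an implicit dependency on @timestamp becomes visible.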

Contributor Author

Hey Alex, I added all the tests that you mentioned to both FieldNameUtilsTests and CsvTests. Thanks!

@@ -166,6 +171,13 @@ public static PreAnalysisResult resolveFieldNames(LogicalPlan parsed, EnrichReso
// METRICS aggs generally rely on @timestamp without the user having to mention it.
referencesBuilder.get().add(new UnresolvedAttribute(ur.source(), MetadataAttribute.TIMESTAMP_FIELD));
}

p.forEachExpression(UnresolvedFunction.class, uf -> {
if (FUNCTIONS_REQUIRING_TIMESTAMP.contains(uf.name().toLowerCase(Locale.ROOT))) {

Contributor

This change looks correct to me, but I'd like to solicit a review by @astefan just for this specific part of the PR as this is delicate when done wrong.

Contributor

I am looking now at this. Thank you for the ping @alex-spies
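
For readers following the hunk under review, its idea can be sketched in isolation: if a query uses a function that implicitly depends on @timestamp, that field is added to the set of names to resolve. The sketch below is a simplification with hypothetical names; the contents of FUNCTIONS_REQUIRING_TIMESTAMP are an assumption (the thread does not show them), and the `@timestamp.*` entry mirrors the expectation in testImplicitFieldNames.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ImplicitTimestampSketch {
    // Assumption: the real set's contents are not shown in the thread; "tbucket" is a guess.
    static final Set<String> FUNCTIONS_REQUIRING_TIMESTAMP = Set.of("tbucket");

    /** Returns the explicit field names plus @timestamp if any function implies it. */
    static Set<String> fieldNames(Set<String> explicitFields, List<String> functionNames) {
        Set<String> result = new LinkedHashSet<>(explicitFields);
        for (String fn : functionNames) {
            // Function names are matched case-insensitively, as in the hunk above.
            if (FUNCTIONS_REQUIRING_TIMESTAMP.contains(fn.toLowerCase(Locale.ROOT))) {
                result.add("@timestamp");
                result.add("@timestamp.*");
            }
        }
        return result;
    }
}
```

The delicate part is the opposite failure mode: if the implicit field is not requested, the query fails later with an unknown-column error even though the user never typed @timestamp.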

@leontyevdv leontyevdv requested a review from alex-spies August 25, 2025 12:09
- Fix IT by adding SORT
@@ -14,6 +14,6 @@
public class GroupingWritables {

public static List<NamedWriteableRegistry.Entry> getNamedWriteables() {
-        return List.of(Bucket.ENTRY, Categorize.ENTRY);
+        return List.of(Bucket.ENTRY, Categorize.ENTRY, TBucket.ENTRY);

Member

Is this needed if tbucket is on coordinator node only?

Contributor Author

It's not needed. Removed it from here and replaced code in the methods writeTo and getWriteableName to throw exceptions similarly to ToIp. Thank you!
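
A minimal sketch of the pattern described in this reply: an expression that only ever exists on the coordinator refuses to be serialized instead of registering a writeable entry. The NamedWriteable interface below is a simplified, hypothetical stand-in for the Elasticsearch one, and the exception type and message are assumptions rather than a copy of ToIp.

```java
import java.io.DataOutput;
import java.io.IOException;

final class CoordinatorOnlySketch {
    /** Hypothetical, simplified stand-in for a named-writeable contract. */
    interface NamedWriteable {
        String getWriteableName();

        void writeTo(DataOutput out) throws IOException;
    }

    /** A function that is replaced by its surrogate before any plan is sent to data nodes. */
    static final class CoordinatorOnlyFunction implements NamedWriteable {
        @Override
        public String getWriteableName() {
            // Failing loudly makes an accidental serialization attempt easy to spot.
            throw new UnsupportedOperationException("not serialized");
        }

        @Override
        public void writeTo(DataOutput out) {
            throw new UnsupportedOperationException("not serialized");
        }
    }
}
```

The trade-off is that any code path that does try to ship the expression to another node fails at runtime, so the substitution has to happen before serialization.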

# Conflicts:
#	x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/analysis/VerifierTests.java
Comment on lines 80 to 82
private TBucket(StreamInput in) throws IOException {
this(Source.readFrom((PlanStreamInput) in), in.readNamedWriteable(Expression.class), in.readNamedWriteable(Expression.class));
}

Member

@fang-xing-esql fang-xing-esql Aug 26, 2025

This can be removed if serialization is not needed.

Member

@not-napoleon not-napoleon left a comment

Great work, sorry it's been more complicated than expected.

@@ -0,0 +1,343 @@
// TBUCKET-specific tests

tbucketByTenSecondsDuration

Member

For future reference, it's possible to include spaces in the test names for CSV tests


FROM sample_data
| KEEP @timestamp, event_duration, message
| EVAL t = @timestamp

Member

I'm curious what happens if an eval actually changes the timestamp, something like | EVAL @timestamp = @timestamp + 3 hours. Does TBUCKET pick up the original or modified timestamp value?

This could be tested in a follow up PR, doesn't have to block this from merging.

Member

I think it would be good to also include a test or two for TBUCKET in an eval; something like | EVAL key = TBUCKET(1 hour) | STATS minimum = MIN(whatever) BY key

Member

@fang-xing-esql fang-xing-esql left a comment

Thank you @leontyevdv ! I added one last comment related to serialization of tbucket, the rest LGTM.

Contributor

@astefan astefan left a comment

I am wondering why we consider a scenario like this acceptable:

from test | stats max(emp_no) by tbucket(1 hour)

generating an error like Unknown column [@timestamp].
Do we have another case in ESQL where the user is supposed to know that @timestamp is a field that must be present somewhere even though the user didn't actually type in the query @timestamp?

Imho, this would be an acceptable use case if the query were TS test | stats ...... by tbucket(1 hour), meaning the user is aware that, by using the TS source command, they are in the area of "timeseries" indices and queries, where some things (like the presence of the @timestamp field) are somewhat expected.


Also, if I run the query from employees | stats min(salary) by tbucket(birth_date), I get back the error message Unknown column [@timestamp].

It is ok for ..... by tbucket(birth_date,1month) to return ql_illegal_argument_exception expects exactly one argument, but as a user who is exploring tbucket, if I remove 1month and keep birth_date (which is a date field), getting back something that has nothing to do with my query is unexpected.

@martijnvg @kkrik-es @dnhatn thoughts?

@kkrik-es
Contributor

@astefan we expect tbucket to apply to all data streams that implicitly define @timestamp. This is not limited to metrics, should be applicable to logs and more. I'm not sure if we have such a precedent in ESQL, but we should try to simplify the syntax for such heavily used applications imho.

@astefan astefan self-requested a review August 27, 2025 08:33
Contributor

@astefan astefan left a comment

LGTM. My earlier comment is unrelated to what, technically, the PR is doing. Please, regard that comment as an observation and something to discuss post-merge, if my observation is valid.

@kkrik-es
Contributor

Please, regard that comment as an observation and something to discuss post-merge, if my observation is valid.

Thanks Andrei, makes sense. I think there may be a pattern here, let's see how we can better accommodate this paradigm in the language.

@leontyevdv
Contributor Author

Thank you all for your feedback, folks! I will definitely address all the suggestions for improvements in the following PR since this one is getting harder to maintain and to follow because of its size and the number of discussions.

@leontyevdv leontyevdv merged commit f2b364c into elastic:main Aug 27, 2025
33 checks passed
Labels
:Analytics/ES|QL AKA ESQL >enhancement :StorageEngine/ES|QL Timeseries / metrics / logsdb capabilities in ES|QL :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:StorageEngine v9.2.0
Development

Successfully merging this pull request may close these issues.

ES|QL: Add TBUCKET function
9 participants