Common subexpression elimination hides aggregations in physical explain output #19684

@pepijnve

Describe the bug

When common subexpression elimination deduplicates aggregations, it can generate aliases of the form __common_expr_<n> for the common expression. The logical plan explain output prints these as <original expr> AS __common_expr_<n>. The physical plan explain output, however, prints only __common_expr_<n>, so the actual expression behind the alias is no longer visible. This makes the explain output hard to interpret.
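To make the aliasing concrete, here is a toy model of the deduplication step. It is a sketch only, not DataFusion's implementation: expressions are plain strings, the function name `dedup_aggregates` is hypothetical, and group-by columns are left out.

```rust
use std::collections::HashMap;

/// Toy model (NOT DataFusion's code): each aggregate is (expression text, user alias).
/// Returns the deduplicated aggregate list and the projection placed on top of it.
fn dedup_aggregates(aggs: &[(&str, &str)]) -> (Vec<String>, Vec<String>) {
    // Count how often each expression text occurs.
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for (expr, _) in aggs {
        *counts.entry(*expr).or_insert(0) += 1;
    }

    let mut common: HashMap<&str, String> = HashMap::new();
    let mut aggregate = Vec::new();
    let mut projection = Vec::new();
    for (expr, alias) in aggs {
        if counts[*expr] > 1 {
            // Repeated expression: compute it once under a generated alias.
            let next = common.len() + 1;
            let name = common.entry(*expr).or_insert_with(|| {
                let name = format!("__common_expr_{next}");
                aggregate.push(format!("{expr} AS {name}"));
                name
            });
            // The outer projection restores the user-facing alias.
            projection.push(format!("{name} AS {alias}"));
        } else {
            // Unique expression: keep it as written.
            aggregate.push(format!("{expr} AS {alias}"));
            projection.push(alias.to_string());
        }
    }
    (aggregate, projection)
}
```

Feeding it `sum(column1)` under the aliases `agg` and `ord` yields a single aggregate entry and two projection entries, the same shape as the optimized logical plan in this report.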

To Reproduce

Here's an example logical plan constructed using the DataFrame API. The problematic line in the physical plan is:

AggregateExec: mode=Partial, gby=[idx@1 as idx], aggr=[__common_expr_1]

Logical plan
============
Projection: idx, agg, ord
  Aggregate: groupBy=[[idx]], aggr=[[sum(column1) AS agg, sum(column1) AS ord]]
    Projection: column1, column2, CASE WHEN column2 <= Int64(0) THEN Int64(0) WHEN column2 <= Int64(200) THEN Int64(1) WHEN column2 <= Int64(314) THEN Int64(3) ELSE Int64(4) END AS idx
      Values: (Int64(1), Int64(100)), (Int64(2), Int64(200)), (Int64(3), Int64(314))

Optimized logical plan
======================
Projection: idx, __common_expr_1 AS agg, __common_expr_1 AS ord
  Aggregate: groupBy=[[idx]], aggr=[[sum(column1) AS __common_expr_1]]
    Projection: column1, CASE WHEN column2 <= Int64(0) THEN Int64(0) WHEN column2 <= Int64(200) THEN Int64(1) WHEN column2 <= Int64(314) THEN Int64(3) ELSE Int64(4) END AS idx
      Values: (Int64(1), Int64(100)), (Int64(2), Int64(200)), (Int64(3), Int64(314))

Physical plan
=============
ProjectionExec: expr=[idx@0 as idx, __common_expr_1@1 as agg, __common_expr_1@1 as ord]
  AggregateExec: mode=FinalPartitioned, gby=[idx@0 as idx], aggr=[__common_expr_1]
    RepartitionExec: partitioning=Hash([idx@0], 10), input_partitions=1
      AggregateExec: mode=Partial, gby=[idx@1 as idx], aggr=[__common_expr_1]
        ProjectionExec: expr=[column1@0 as column1, CASE WHEN column2@1 <= 0 THEN 0 WHEN column2@1 <= 200 THEN 1 WHEN column2@1 <= 314 THEN 3 ELSE 4 END as idx]
          DataSourceExec: partitions=1, partition_sizes=[1]

Expected behavior

Rather than

AggregateExec: mode=Partial, gby=[idx@1 as idx], aggr=[__common_expr_1]

the explain output should show

AggregateExec: mode=Partial, gby=[idx@1 as idx], aggr=[sum(column1@0) as __common_expr_1]

similar to how the group-by expressions are printed.
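The requested format could be sketched as below. This is a minimal illustration and not DataFusion's actual display code: `NamedExpr`, `fmt_named`, and `fmt_aggregate_exec` are hypothetical names, and printing only the name when it equals the expression text is an assumption of this sketch.

```rust
/// Hypothetical stand-in for a physical expression together with its output name.
struct NamedExpr {
    expr: String, // e.g. "sum(column1@0)"
    name: String, // e.g. "__common_expr_1"
}

/// Print "<expr> as <name>", or just the name when both are identical
/// (assumption of this sketch).
fn fmt_named(e: &NamedExpr) -> String {
    if e.expr == e.name {
        e.name.clone()
    } else {
        format!("{} as {}", e.expr, e.name)
    }
}

/// Render one AggregateExec line, treating group-by and aggregate
/// expressions the same way.
fn fmt_aggregate_exec(mode: &str, gby: &[NamedExpr], aggr: &[NamedExpr]) -> String {
    let join = |items: &[NamedExpr]| items.iter().map(fmt_named).collect::<Vec<_>>().join(", ");
    format!("AggregateExec: mode={mode}, gby=[{}], aggr=[{}]", join(gby), join(aggr))
}
```

With `idx@1` named `idx` as the group-by entry and `sum(column1@0)` named `__common_expr_1` as the aggregate entry, this produces the expected line shown above.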

Additional context

No response

Labels: bug (Something isn't working)