Skip to content

Conversation

@geooo109
Copy link
Collaborator

@geooo109 geooo109 commented Oct 20, 2025

Fixes #6137

We should avoid simplifying cases like SELECT 1 AND x or SELECT 0 or x.
So, this PR makes this check tighter by checking if the LHS or RHS is Boolean type.

@georgesittas
Copy link
Collaborator

Discussed offline– we should check whether all engines evaluate connectors into boolean values, instead of treating them as control flow. If SELECT <always true value> AND other is always boolean, we can't just simplify it to SELECT other, because the semantics change. Unfortunately, this is what we do today, which is incorrect.

@geooo109 geooo109 changed the title fix(optimizer)!: simplify AND with static BOOLEAN type fix(optimizer)!: simplify AND/OR with static BOOLEAN type Oct 21, 2025
@geooo109 geooo109 force-pushed the geooo109/simplify_and branch 2 times, most recently from 306b884 to 66ffdd7 Compare October 22, 2025 12:55
@georgesittas georgesittas marked this pull request as draft October 23, 2025 09:49
@georgesittas
Copy link
Collaborator

We discovered that post-simplifying we lose some types. That is a bug and we decided to address it in this PR instead of postponing, because that will reduce the PR's complexity as a side-effect.

@georgesittas
Copy link
Collaborator

@barakalon @tobymao curious what your thoughts are on this.

The problem: today we infer types & then run other rules that transform the AST, resulting in building new, untyped, expressions. This is a bug on its own, and it also affects type-sensitive rules. E.g., simplify now checks the boolean type to apply certain transformations but inbetween types are lost and so it cannot always work.

My worry is that fixing each call site introduces a new pattern where we have to be alert for rules after annotate_types and ensure new AST nodes are always typed. OTOH, not sure how else we can unblock these rules in simplify that run up to a fix point...

@geooo109 geooo109 force-pushed the geooo109/simplify_and branch from e5d9ed2 to 052bdcc Compare October 29, 2025 13:37
@geooo109 geooo109 marked this pull request as ready for review October 29, 2025 15:06
@geooo109
Copy link
Collaborator Author

/benchmark

@github-actions
Copy link
Contributor

Benchmark Results

Legend:

  • 🟢🟢 = 2x+ faster
  • 🟢 = 1.1x - 2x faster
  • ⚪ = No significant change (< 1.1x)
  • 🔴 = 1.1x - 2x slower
  • 🔴🔴 = 2x+ slower

Parsing Benchmark

Benchmark bench_parse_main bench_parse_pr
parse_sqlglot_tpch 5.51 ms 5.61 ms: ⚪ 1.02x slower
parse_sqlglot_short 346 us 348 us: ⚪ 1.01x slower
parse_sqlglot_long 5.28 ms 5.31 ms: ⚪ 1.01x slower
parse_sqlglotrs_tpch 4.03 ms 4.08 ms: ⚪ 1.01x slower
parse_sqlglotrs_short 284 us 288 us: ⚪ 1.01x slower
parse_sqlglotrs_long 3.87 ms 3.99 ms: ⚪ 1.03x slower
parse_sqlglotrs_crazy 11.5 ms 11.9 ms: ⚪ 1.03x slower
Geometric mean (ref) 1.02x slower

Benchmark hidden because not significant (1): parse_sqlglot_crazy


Optimization Benchmark

Benchmark bench_optimize_main bench_optimize_pr
optimize_tpch 778 ms 788 ms: ⚪ 1.01x slower
optimize_condition_100 96.3 ms 97.0 ms: ⚪ 1.01x slower
Geometric mean (ref) 1.00x slower

Benchmark hidden because not significant (2): optimize_condition_10, optimize_condition_1000

@geooo109
Copy link
Collaborator Author

/benchmark

@github-actions
Copy link
Contributor

Benchmark Results

Legend:

  • 🟢🟢 = 2x+ faster
  • 🟢 = 1.1x - 2x faster
  • ⚪ = No significant change (< 1.1x)
  • 🔴 = 1.1x - 2x slower
  • 🔴🔴 = 2x+ slower

Parsing Benchmark

Benchmark bench_parse_main bench_parse_pr
parse_sqlglot_long 5.32 ms 5.30 ms: ⚪ 1.00x faster
parse_sqlglotrs_long 3.88 ms 3.91 ms: ⚪ 1.01x slower
Geometric mean (ref) 1.00x slower

Benchmark hidden because not significant (6): parse_sqlglot_tpch, parse_sqlglot_short, parse_sqlglot_crazy, parse_sqlglotrs_tpch, parse_sqlglotrs_short, parse_sqlglotrs_crazy


Optimization Benchmark

Benchmark bench_optimize_main bench_optimize_pr
optimize_condition_100 96.7 ms 96.0 ms: ⚪ 1.01x faster
Geometric mean (ref) 1.00x faster

Benchmark hidden because not significant (3): optimize_tpch, optimize_condition_10, optimize_condition_1000

@geooo109
Copy link
Collaborator Author

geooo109 commented Oct 30, 2025

Current benchmarks don't show a significant performance slow down.

On the other hand, queries like the following (create a lot of AST transformation due to the NOT (x CONNECTOR y) => re-compute hash), create significant slow downs:

SELECT * FROM t WHERE
    NOT (
      NOT (
        NOT (
          NOT (
            NOT (
              (
                NOT ((a = 1 AND b = 2 AND TRUE) OR (c = 3 AND d = 4 OR FALSE)) AND
                NOT ((e = 5 OR f = 6 AND TRUE) AND (g = 7 OR h = 8 OR FALSE))
              ) OR (
                NOT (
                  NOT ((i = 9 AND j = 10) OR (k = 11 AND l = 12)) AND
                  NOT ((m = 13 OR n = 14) AND (o = 15 OR p = 16))
                )
              )
            ) AND (
              NOT (
                NOT (
                  NOT ((q = 17 AND r = 18) OR (s = 19 AND t = 20)) OR
                  NOT ((u = 21 OR v = 22) AND (w = 23 OR x = 24))
                ) AND (
                  (y = 25 OR NOT (z = 26 AND aa = 27))
                )
              )
            )
          ) OR (
            NOT (
              NOT (
                NOT (
                  (bb = 28 AND cc = 29) OR (dd = 30 AND ee = 31)
                ) AND (
                  (ff = 32 OR gg = 33) AND (hh = 34 OR ii = 35)
                )
              ) OR (
                NOT (
                  NOT ((jj = 36 AND kk = 37) OR (ll = 38 AND mm = 39))
                )
              )
            )
          )
        ) AND (
          NOT (
            NOT (
              NOT ((nn = 40 AND oo = 41) OR (pp = 42 AND qq = 43)) AND
              NOT ((rr = 44 OR ss = 45) AND (tt = 46 OR uu = 47))
            )
          ) OR (
            NOT (
              (vv = 48 AND NOT (ww = 49 OR xx = 50)) OR
              (NOT (yy = 51 AND zz = 52) AND aaa = 53)
            )
          )
        ) 
      )
    );

Old simplify (using the optimize function):
Total time for 50 runs: 8.7888s
Average time per run: 0.1758s

New simplify (using the optimize function):
Total time for 50 runs: 12.5861s
Average time per run: 0.2517s

@georgesittas
Copy link
Collaborator

I see, so stressing the "unhappy" code paths can result in a ~1.4x slowdown.. That's annoying, although somewhat expected. Let's pause for a bit and disuss to see how we can improve this- FYI @tobymao.

@geooo109 geooo109 force-pushed the geooo109/simplify_and branch from bea403f to 5fb0d4b Compare October 31, 2025 11:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The optimizer changes the query's result from a boolean value to the column's raw value

4 participants