Skip to content

Upgrade Apache Calcite from 1.40.0 to 1.42.0#18658

Open
yashmayya wants to merge 1 commit into
apache:masterfrom
yashmayya:calcite-1.42.0-upgrade
Open

Upgrade Apache Calcite from 1.40.0 to 1.42.0#18658
yashmayya wants to merge 1 commit into
apache:masterfrom
yashmayya:calcite-1.42.0-upgrade

Conversation

@yashmayya
Copy link
Copy Markdown
Contributor

Summary

Upgrades Apache Calcite from 1.40.0 to 1.42.0. Pinot's master was never moved to 1.41, so this folds in both the 1.40→1.41 and 1.41→1.42 deltas in a single bump.

The bulk of the upgrade is a faithful re-sync of Pinot's customized SQL parser grammar to upstream 1.42, plus a handful of targeted workarounds for behavioral changes Calcite introduced across these two releases. No public Pinot API or wire/segment format changes.

Changes

Dependency

  • pom.xml: calcite.version1.42.0 (covers calcite-core and calcite-babel).
  • Pinned joou-java-6 to 0.9.5 in dependencyManagement to resolve a dependency-convergence conflict: calcite-core 1.42 pulls joou-java-6 0.9.5 while the transitive avatica-core 1.28.0 still wants 0.9.4.
  • LICENSE-binary: bumped the Calcite/Avatica entries to 1.42.0 / 1.28.0, added org.jooq:joou-java-6:0.9.5, and dropped the now-unbundled avatica-metrics (Avatica 1.28 no longer pulls it).

SQL parser codegen sync (pinot-common/src/main/codegen)

  • Re-synced templates/Parser.jj, config.fmpp, and default_config.fmpp to upstream Calcite 1.42, preserving every PINOT CUSTOMIZATION region.
  • The new babel feature flags introduced upstream — includeStarExclude (SELECT * EXCLUDE/REPLACE), includeSelectBy (SELECT ... BY), and includeIntervalWithoutQualifier — are intentionally kept OFF. The grammar is synced but inactive: the multi-stage engine has no downstream support for these features, so enabling them would parse syntax the planner/runtime can't execute. They are wired through default_config.fmpp so a future change can flip them on deliberately.
  • UNSIGNED is added as a non-reserved keyword (needed for the unsigned integer types below).

Behavioral workarounds for 1.41/1.42 changes

  • CALCITE-7189 (non-strict GROUP BY): 1.41+ BABEL enables MySQL-style non-strict GROUP BY (wrapping non-grouped columns in ANY_VALUE()), but the implementation NPEs when a window function is combined with GROUP BY (e.g. SELECT MIN(col) OVER() FROM t GROUP BY col). Validator now uses a SqlDelegatingConformance over BABEL that overrides isNonStrictGroupBy() to false. This is also the semantically correct behavior for Pinot, which requires all non-aggregated columns to appear in GROUP BY. The feature remains present (un-reverted) in 1.42, so the override is retained.
  • CALCITE-7379 (decorrelation type assertion): the upstream fix does not fully cover the correlated-subquery shapes Pinot produces — a post-decorrelation Litmus.THROW type assertion still fires for a nullability-only divergence. PinotRelDecorrelator (new, see below) relaxes that one assertion: it logs a warning when the row types differ only in nullability and continues, but still fails fast on any structural type change.
  • CALCITE-7351: RelDataTypeSystem#getMaxNumericScale/getMaxNumericPrecision became final. TypeSystem drops the now-illegal overrides; the equivalent behavior is preserved via the type-specific getMaxScale/getMaxPrecision(DECIMAL) overrides Pinot already defines.
  • Filtered MIN/MAX nullability: 1.42 exposes SqlOperatorBinding#hasEmptyGroup(). PinotMinMaxReturnTypeInference now also treats a possibly-empty group as nullable (alongside the existing getGroupCount() == 0 / hasFilter() checks), matching the runtime's null-on-empty behavior for MIN(x) FILTER (WHERE ...).

Unsigned integer types (CALCITE-1466)

  • BABEL now parses UTINYINT/USMALLINT/UINTEGER/UBIGINT. Pinot has no native unsigned storage, so these are accepted and treated as their signed equivalents everywhere a SqlTypeName switch handles integers:
    • RelToPlanNodeConverter / v2 PRelToPlanNodeConverter: map to the narrowest signed type that holds the full range — UTINYINT/USMALLINTINT, UINTEGER/UBIGINTLONG (a signed INT would wrap UINTEGER values above 2³¹).
    • PinotEvaluateLiteralRule: fold a constant unsigned cast into its signed-equivalent type — Calcite cannot build a RexLiteral of an unsigned type, so without this CAST(5 AS INTEGER UNSIGNED) failed constant-folding.
    • TypeSystem.deriveSumType: widen to (signed) BIGINT like the signed integer types, so SUM over a sub-64-bit unsigned column accumulates into 64 bits instead of silently overflowing a 32-bit INT.
    • ArithmeticFunctionUtils.normalizeNumericType: normalize to the signed integral type so arithmetic over an unsigned operand stays integral rather than widening to DOUBLE.

New class

  • org.apache.pinot.calcite.sql2rel.PinotRelDecorrelator — a minimal subclass of Calcite's RelDecorrelator that exists solely to relax the CALCITE-7379 assertion described above. It lives under the org.apache.calcite.sql2rel package because the relevant members (CorelMap, decorrelate, etc.) are package/protected-subclass visible in Calcite.

Testing & validation

  • pinot-query-planner unit suite: 1262/1262 pass.
  • pinot-query-runtime result-correctness vs H2 — ResourceBasedQueriesTest 3571 pass / 0 fail (6 pre-existing skips), QueryRunnerTest 130/130.
  • OfflineClusterIntegrationTest run locally; updated testQueryWithRepeatedColumnsV2 to reflect that 1.42 now accepts repeated columns in GROUP BY but still rejects ambiguous repeated columns in ORDER BY.
  • New/updated regression tests pin every workaround: filtered MIN/MAX nullability, the CALCITE-7379 decorrelation path (the structural-vs-nullability divergence decision is extracted and unit-tested), the non-strict-GROUP-BY-with-window NPE, unsigned-type cast acceptance, SUM/arithmetic return types over unsigned operands, and single-stage unsigned casts.
  • The 8 pinot-query-planner/src/test/resources/queries/*.json EXPLAIN-plan snapshots are mechanically regenerated (label/whitespace deltas from upstream rule changes), not hand-edited.

Behavior & compatibility notes

  • Repeated GROUP BY key now accepted (MSE): under the multi-stage engine, a query with a repeated grouping key (e.g. SELECT x, COUNT(*) FROM t GROUP BY x, x) previously failed validation and now succeeds — Calcite 1.42 de-duplicates the repeated key. This is a benign relaxation (no previously-working query breaks), but during a mixed-version rolling upgrade the same query is rejected by a 1.40 broker and accepted by a 1.42 broker. Repeated columns in ORDER BY are still rejected as ambiguous (covered by the updated testQueryWithRepeatedColumnsV2).
  • Plan shape: semi → inner join in a few IN/EXISTS shapes. The 1.41 subquery-decorrelation rework rewrites the outer semi-join to an inner join in a small number of nested IN/semi-join shapes (visible in the regenerated JoinPlans.json/PhysicalOptimizerPlans.json). This is a sound, plan-equivalent rewrite — Calcite only applies it where the join's right input is already distinct (e.g. fed by an aggregate) — and result-equivalence is covered by the H2-comparison suites (ResourceBasedQueriesTest) and the integration tests, which all pass.
  • New parseable bitwise infix operators (<<, &, ^). The faithful parser sync makes these parseable for the first time (upstream-stock 1.42 additions to BinaryRowOperator, mapping to SqlStdOperatorTable.BIT_LEFT_SHIFT/BITAND_OPERATOR/BITXOR_OPERATOR); previously they were parse errors. No operator wiring is added by this PR. End-to-end status differs by engine: in the multi-stage engine PinotOperatorTable is a curated allow-list that does not register them, so such expressions fail at operator resolution (validation) — same status as the other synced-but-inert 1.42 grammar. In the single-stage engine the canonical names of &/^ (bitand/bitxor) coincide with Pinot's existing bitAnd/bitXor scalar functions, so a & b / a ^ b may resolve to those (a MySQL-style infix alias), while << has no Pinot equivalent and errors. This v1/v2 divergence is an inherent consequence of faithfully syncing the upstream grammar; explicitly wiring up or rejecting these operators is left as a separate, deliberate decision.
  • Unsigned casts in the single-stage engine. Because the single-stage parser also uses BABEL, CAST(x AS INTEGER UNSIGNED) (and the other unsigned types) is now parseable in v1 too. DataTypeConversionFunctions.cast maps the unsigned type-literal to its signed equivalent (UTINYINT/USMALLINT → INT, UINTEGER/UBIGINT → LONG), mirroring the multi-stage converter, so v1 and v2 treat unsigned casts consistently (previously the v1 path would have errored at execution with "Unknown data type"). Covered by DataTypeConversionFunctionsTest#testCastToUnsignedTypes.
  • No config keys, SPI signatures, enum/DataType additions, JSON/Protobuf fields, or DataTable/segment-version changes — nothing else has mixed-version visibility.

Notes for reviewers

  • The synced-but-disabled grammar (EXCLUDE/REPLACE/SELECT-BY/bare-INTERVAL) is deliberately inert — included to keep Parser.jj a faithful upstream sync rather than a divergent fork. Flipping the three fmpp flags is a separate, future decision.
  • The upstream colon-path field-access grammar (AddOptionalColonPath/ColonBracketSegment) was also synced and is likewise inert under Pinot — but it is gated by the upstream conformance method SqlConformance.isColonFieldAccessAllowed() (which returns false for BABEL), not by an fmpp flag. These productions are byte-for-byte from upstream Calcite 1.42.0 (verified against calcite-1.42.0 Parser.jj), so they intentionally carry no PINOT CUSTOMIZATION markers.
  • All PINOT CUSTOMIZATION markers from the prior grammar are preserved.
  • Oversized integer literal overflow (e.g. a UBIGINT literal above Long.MAX) wraps modulo the target signed type, identical to the existing single-stage behavior documented in RequestUtils.getLiteral. The final stored value is the same regardless of where the narrowing happens, so the MSE literal path is left consistent with single-stage rather than diverging.
  • Follow-up (not in this PR): the unsigned→signed handling touches several type-dispatch switches (the two converters, DataTypeConversionFunctions, ArithmeticFunctionUtils.normalizeNumericType, TypeSystem.deriveSumType). Each maps to a different target enum and uses case labels (which must be compile-time constants, so they can't delegate to a shared predicate), so the per-switch listing is largely inherent. The one genuinely-collapsible duplication is RelToPlanNodeConverter.convertToColumnDataType and the v2 PRelToPlanNodeConverter.convertToColumnDataType, which are byte-for-byte identical public static copies — a pre-existing smell this diff merely extends. Collapsing those two (have v2 delegate to v1) is worth a dedicated refactor; left out here to keep the diff scoped to the version bump.
  • The filtered MIN/MAX fix (hasEmptyGroup()) is a planning-time type-nullability correction — it lets the query validate under 1.42's stricter nullability checks. Pinot's DataSchema/ColumnDataType erases nullability, so this does not change runtime values; the empty-filtered-group → NULL runtime semantics are pre-existing and unchanged. The fix is covered by a compile-time regression test (testFilteredMinMaxAggregateNullability).

@yashmayya yashmayya added the dependencies Pull requests that update a dependency file label Jun 2, 2026
Bump calcite-core/babel to 1.42.0 (folds in both the 1.40->1.41 and
1.41->1.42 deltas, since master was never moved to 1.41) and pin
joou-java-6 to 0.9.5 to resolve a dependency-convergence conflict
between calcite-core and the transitive avatica-core 1.28.0.

Sync the customized SQL parser (Parser.jj + the fmpp configs) to upstream
1.42, preserving all PINOT CUSTOMIZATION regions. The new babel feature
flags (includeStarExclude, includeSelectBy, includeIntervalWithoutQualifier)
are intentionally kept OFF: the grammar is synced but inactive, as the
multi-stage engine has no downstream support for those features yet. The
upstream colon-path field-access grammar was likewise synced and is inert
under Pinot's BABEL conformance (isColonFieldAccessAllowed() returns false) -
gated by conformance rather than an fmpp flag. The sync also makes the bitwise
infix operators '<<', '&' and infix '^' parseable (upstream-stock 1.42
additions); they are not registered in PinotOperatorTable, so they are not yet
supported end-to-end.

Handle 1.42 behavioral changes:
- CALCITE-7189: Validator disables non-strict GROUP BY (BABEL enables it
  in 1.41+, which NPEs for window functions combined with GROUP BY).
- CALCITE-7379: PinotRelDecorrelator relaxes the post-decorrelation type
  assertion for the nullability-only divergence that still fires on some
  Pinot correlated-subquery shapes; it fails fast on structural changes.
- CALCITE-7351: drop the now-final getMaxNumericScale/getMaxNumericPrecision
  overrides (they delegate to the type-specific overrides Pinot keeps).
- Filtered MIN/MAX are now nullable via SqlOperatorBinding.hasEmptyGroup().
- Unsigned integer types (CALCITE-1466) parse under BABEL and are treated as
  their signed equivalents throughout (converters, literal folding, SUM return
  type, and arithmetic normalization), so e.g. SUM over an unsigned column
  widens to BIGINT instead of overflowing a 32-bit INT.

Regenerate the EXPLAIN plan snapshots and update/extend the affected tests.
@yashmayya yashmayya force-pushed the calcite-1.42.0-upgrade branch from b8bf3a6 to 3cbe3eb Compare June 2, 2026 21:14
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 2, 2026

Codecov Report

❌ Patch coverage is 97.05882% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 64.45%. Comparing base (f6b930b) to head (3cbe3eb).
⚠️ Report is 24 commits behind head on master.

Files with missing lines Patch % Lines
...he/pinot/calcite/sql2rel/PinotRelDecorrelator.java 95.23% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             master   #18658    +/-   ##
==========================================
  Coverage     64.45%   64.45%            
- Complexity     1282     1291     +9     
==========================================
  Files          3352     3366    +14     
  Lines        207171   208152   +981     
  Branches      32348    32497   +149     
==========================================
+ Hits         133534   134168   +634     
- Misses        62910    63201   +291     
- Partials      10727    10783    +56     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-21 64.45% <97.05%> (+<0.01%) ⬆️
temurin 64.45% <97.05%> (+<0.01%) ⬆️
unittests 64.45% <97.05%> (+<0.01%) ⬆️
unittests1 56.87% <97.05%> (+0.06%) ⬆️
unittests2 37.14% <26.47%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found 1 high-signal issue; see inline comment.

case UINTEGER:
// UBIGINT (0..2^64-1) has no wider signed type, so values above Long.MAX_VALUE wrap (two's-complement) - this is
// unavoidable and acceptable since Pinot has no unsigned storage type.
case UBIGINT:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepting UBIGINT here turns BIGINT UNSIGNED into a lossy signed LONG. Any value above Long.MAX_VALUE will now silently wrap, so this is a wrong-result regression rather than a harmless type downgrade. Since Pinot cannot represent the full unsigned 64-bit range, this needs to fail validation/planning instead of being mapped to LONG.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants