Typed columns from ScanIterator down; LE layouts move to VortexFormat #199

Merged: dfa1 merged 7 commits into main from chunk-columnname on Jul 4, 2026

@dfa1 dfa1 commented Jul 4, 2026

What

Two refactors completing the "strings at the boundary, types inside" arc, plus the new docs-with-every-change rule written into CLAUDE.md.

  1. f8ad15d1 — typed columns from ScanIterator down. Chunk's two parallel string-keyed maps (columns + columnDtypes) collapse into one unmodifiable SequencedMap<ColumnName, Chunk.Column> — each column's Array and DType travel together, so desync is unrepresentable, and schema order is now part of the contract (previously Map.of gave no order guarantee for 1–2 column chunks). ColumnName originates once in ScanIterator.initialize() from the file's already-policy-certified schema and flows typed through ChunkSpec, layout lookups, and zone caches. Public rims keep String sugar (chunk.column(String), RowFilter, columnZoneStats) converted exactly once; a policy-invalid query name fails fast — it could never match a certified column. Compute kernels verified: ColumnName construction is per-chunk-per-column, never per-row.
  2. c060d34f: LE_* layouts move from PTypeIO to VortexFormat. Endianness is a wire-format fact, not a ptype fact; VortexFormat (magic, trailer, version) is where format facts live. A sweep moves 116 files onto the single source, and six private duplicates are deleted, two of which spelled the identical constant SHORT_LE instead of LE_SHORT.
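
A minimal sketch of the typed-columns shape from item 1, not the code in this PR. The `ColumnName`, `Chunk`, and `Chunk.Column` bodies are hypothetical stand-ins: the name policy (non-blank), the `long[]` array, the `String` dtype, and the `NoSuchElementException` for absent columns are all assumptions made for illustration. It requires Java 21 for `SequencedMap`.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.NoSuchElementException;
import java.util.SequencedMap;

// Hypothetical typed name; the real policy check is not shown in this PR.
record ColumnName(String value) {
    ColumnName {
        // Assumed policy for illustration only: names must be non-blank.
        if (value == null || value.isBlank()) {
            throw new IllegalArgumentException("invalid column name: '" + value + "'");
        }
    }

    static ColumnName of(String raw) {
        return new ColumnName(raw);
    }
}

final class Chunk {
    // The array and its dtype live in one carrier, so they cannot desync.
    // long[] and String are placeholders for the real Array and DType.
    record Column(long[] array, String dtype) {}

    private final SequencedMap<ColumnName, Column> columns;

    Chunk(SequencedMap<ColumnName, Column> columns) {
        // Copy to keep insertion (schema) order, then make it unmodifiable.
        this.columns = Collections.unmodifiableSequencedMap(
                new LinkedHashMap<ColumnName, Column>(columns));
    }

    SequencedMap<ColumnName, Column> columns() {
        return columns;
    }

    Column column(ColumnName name) {
        Column c = columns.get(name);
        if (c == null) {
            // Assumed behavior; the PR only says the previous behavior is kept.
            throw new NoSuchElementException("no such column: " + name.value());
        }
        return c;
    }

    // String sugar at the public rim: converted to ColumnName exactly once.
    // A policy-invalid name fails here, before any lookup happens.
    Column column(String name) {
        return column(ColumnName.of(name));
    }
}
```

Because the map is keyed by `ColumnName` and each value is a single `Column`, there is no second map to forget to update, and iteration follows the order in which the schema inserted the columns.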

Also rides along: the CLAUDE.md "Documentation is part of every change" section (8a81331f), with docs/reference.md and CHANGELOG updated in-branch for both refactors, plus the review-parity @throws fix on Chunk.as (0c61bd42).
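
For the second refactor, the single-source idea can be sketched as below. This is an illustration under assumptions, not the project's code: the PR does not show how the `LE_*` constants are declared, so the sketch substitutes byte-array `VarHandle` views, and `FormatLayouts` and `PTypeReader` are hypothetical names standing in for `VortexFormat` and `PTypeIO`.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical stand-in for the format holder: the one place where
// little-endian access is defined, next to other wire-format facts.
final class FormatLayouts {
    static final VarHandle LE_SHORT =
            MethodHandles.byteArrayViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);
    static final VarHandle LE_INT =
            MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
    static final VarHandle LE_LONG =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

    private FormatLayouts() {}
}

// Hypothetical stand-in for the ptype mapper: it picks a shared layout
// per primitive type and declares no byte order of its own.
final class PTypeReader {
    static short readShort(byte[] buf, int offset) {
        // Plain get on a byte-array view tolerates unaligned offsets.
        return (short) FormatLayouts.LE_SHORT.get(buf, offset);
    }

    static int readInt(byte[] buf, int offset) {
        return (int) FormatLayouts.LE_INT.get(buf, offset);
    }

    static long readLong(byte[] buf, int offset) {
        return (long) FormatLayouts.LE_LONG.get(buf, offset);
    }

    private PTypeReader() {}
}
```

With one holder for the constants, a private copy under a reversed name such as `SHORT_LE` has no reason to exist, and every reader agrees on byte order by construction.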

Verification

  • ./mvnw verify green after every commit — all modules including the Rust-interop failsafe suite (no wire changes; the oracle is a pure regression check here).
  • Adversarial review on the typed-columns commit: no blockers; one should-fix (the @throws parity) applied; hot-path allocation audit explicitly confirmed clean.
  • ./mvnw javadoc:javadoc -pl core — zero output.

🤖 Generated with Claude Code

dfa1 and others added 7 commits July 4, 2026 20:01
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Chunk's two parallel string-keyed maps (columns + columnDtypes)
collapse into one SequencedMap<ColumnName, Chunk.Column> — the
Array and its DType travel together in the Column carrier, so
desync is unrepresentable, and schema order is now part of the
contract (previously Map.of gave no order guarantee for 1-2
column chunks).

ColumnName originates in ScanIterator.initialize(), parsed once
from the file's DType.Struct (the parse edge has already policy-
certified every name); ChunkSpec, layout lookups, zone-stat caches
and the column-map builders are typed end to end. Public rims keep
String sugar converted exactly once: chunk.column(String),
RowFilter references, columnZoneStats. A policy-invalid query name
fails fast with the policy message — it could never match a
certified column; valid-but-absent names keep the exact previous
behavior and messages.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review parity finding: as(String, ...) routes through ColumnName.of
like column(String) but did not document the IllegalArgumentException.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Endianness is a property of the wire format — trailer fields,
spec-table indexes, scaffolding, and element values are all
little-endian — so the shared unaligned LE layouts belong in
VortexFormat next to the magic and trailer shape, not in PTypeIO
(which keeps its real job: mapping ptypes onto those layouts).

Sweeps 116 files onto the single source and deletes the six private
copies (Trailer, LazyDecimalArray, GenericArray,
ChunkedEncodingDecoder, PcoTansDecoder, PcoEncodingDecoder), two of
which used reversed names (SHORT_LE) for byte-identical constants.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Found by IntelliJ inspections (value of colIdx is always 0) — dead
generality from the typed-columns refactor.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@dfa1 dfa1 merged commit 06d7aad into main Jul 4, 2026
6 checks passed
@dfa1 dfa1 deleted the chunk-columnname branch July 4, 2026 21:10