Member

@Weijun-H Weijun-H commented Jan 5, 2026

Which issue does this PR close?

  • Closes #NNN.

Rationale for this change

What changes are included in this PR?

This PR implements projection-aware field skipping in the arrow-json reader:

  1. New API: ReaderBuilder::with_projection(bool) enables opt-in field filtering
  2. Skip optimization: When enabled, JSON fields not in the schema are skipped during tape parsing rather than fully parsed and discarded later
  3. Fail-fast for strict_mode: Unknown fields now error immediately during tape parsing instead of waiting until array decoding

Behavior matrix:

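| strict_mode | projection | Behavior                                     |
|-------------|------------|----------------------------------------------|
| false       | false      | Parse all fields, ignore unknown (original)  |
| false       | true       | Skip unknown fields at tape level            |
| true        | *          | Error on first unknown field (fail-fast)     |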
[image]
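
As a reading aid, the behavior matrix above can be written as a single match over the two flags. This is a minimal sketch only: the `UnknownFieldBehavior` enum and the `unknown_field_behavior` function are hypothetical names used for illustration and are not part of arrow-json or of this PR's diff.

```rust
/// Hypothetical summary of what happens to a JSON field that is not in the schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UnknownFieldBehavior {
    /// Original behavior: the field is fully parsed, then ignored later.
    ParseThenIgnore,
    /// Projection enabled: the field is skipped during tape parsing.
    SkipAtTapeLevel,
    /// Strict mode: the first unknown field is an error (fail-fast).
    Error,
}

/// Maps the two builder flags to the behavior described in the matrix.
fn unknown_field_behavior(strict_mode: bool, projection: bool) -> UnknownFieldBehavior {
    match (strict_mode, projection) {
        // strict_mode wins regardless of the projection flag.
        (true, _) => UnknownFieldBehavior::Error,
        (false, true) => UnknownFieldBehavior::SkipAtTapeLevel,
        (false, false) => UnknownFieldBehavior::ParseThenIgnore,
    }
}
```

With both flags left at `false`, the sketch returns `ParseThenIgnore`, which matches the statement that the default behavior is unchanged.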

Are these changes tested?

Yes, all existing tests pass

Are there any user-facing changes?

Yes, new public API:

  • ReaderBuilder::with_projection(bool) - opt-in to skip unknown JSON fields during parsing

This is additive and does not break existing behavior (default is false).

@Weijun-H Weijun-H changed the title from "feat: add projection support to TapeDecoder for skipping unknown fields in json parsing" to "feat: add projection support to TapeDecoder for skipping unknown fields in json parsing (1.4x speedup)" Jan 5, 2026
@Weijun-H Weijun-H marked this pull request as ready for review January 5, 2026 17:24
Contributor

@scovich scovich left a comment

I really like the idea of skipping unwanted fields -- pure overhead to keep them -- but this PR feels overly complex/nested. I wonder if there's a "flatter" way to handle the situation?

Comment on lines 253 to 254
const SKIP_IN_STRING: u8 = 1 << 0; // 0x01
const SKIP_ESCAPE: u8 = 1 << 1; // 0x02
Contributor

Why not just use the hex constants directly, out of curiosity?

Contributor

Also -- my intuition is that these two flags are only needed because SkipValue does too much. The newly introduced code has a lot of looping and nesting, where the existing enum variants are quite flat. The difference seems to be that the existing variants hand off to a new state whenever they detect a state change?

So e.g. instead of messing with flags, one might declare three new enum variants, SkipValue, SkipString and SkipEscape, where each nests exclusively inside the one before it? e.g. if the projection skipped field foo, then the following JSON fragment:

{ 
  "foo": {
    "bar": "hello\nworld!"
  }
}

would:

  • push a SkipValue as soon as : detects that foo is not selected
  • push a SkipString as soon as it hits the opening " of the string
  • push a SkipEscape as soon as it hits the \ inside the string
  • pop once the escape was processed
  • pop once the closing " is found
  • pop once the next field starts (or whatever is currently the ending condition for SkipValue)
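
The push/pop sequence above could be sketched roughly as follows. This illustrates the suggestion and is not the PR's code: the `Skip` enum and the `skip_value` function are hypothetical, the sketch performs no JSON validation, and it assumes that a scalar value ends at the next `,`, `}`, `]`, or whitespace. Nested objects and arrays push a further `Skip::Value`, which is one possible way to handle nesting.

```rust
/// Hypothetical skip states, each nesting inside the one before it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Skip {
    /// Inside a skipped object or array.
    Value,
    /// Inside a string literal.
    String,
    /// Immediately after a backslash inside a string.
    Escape,
}

/// Skips one JSON value at the start of `input` (after optional whitespace).
/// Returns the number of bytes consumed, or `None` if the input ran out
/// before the value was complete.
fn skip_value(input: &[u8]) -> Option<usize> {
    let mut stack: Vec<Skip> = Vec::new();
    let mut i = 0;
    while i < input.len() && input[i].is_ascii_whitespace() {
        i += 1;
    }
    match *input.get(i)? {
        b'{' | b'[' => stack.push(Skip::Value),
        b'"' => stack.push(Skip::String),
        _ => {
            // Scalar (number, true, false, null): assumed to end at a delimiter.
            while i < input.len()
                && !matches!(input[i], b',' | b'}' | b']')
                && !input[i].is_ascii_whitespace()
            {
                i += 1;
            }
            return Some(i);
        }
    }
    i += 1;
    while i < input.len() {
        let b = input[i];
        i += 1;
        match stack.last().copied() {
            // The escaped byte is consumed; return to the enclosing string.
            Some(Skip::Escape) => {
                stack.pop();
            }
            Some(Skip::String) => match b {
                b'\\' => stack.push(Skip::Escape),
                b'"' => {
                    stack.pop();
                }
                _ => {}
            },
            Some(Skip::Value) => match b {
                b'"' => stack.push(Skip::String),
                b'{' | b'[' => stack.push(Skip::Value),
                b'}' | b']' => {
                    stack.pop();
                }
                _ => {}
            },
            None => break,
        }
        if stack.is_empty() {
            // Every skip state has been popped: the value is fully skipped.
            return Some(i);
        }
    }
    None
}
```

Applied to the bytes that follow `"foo":` in the fragment above, the sketch consumes the whole inner object and stops just before the outer closing brace. A `None` result stands in for the case where the input buffer is exhausted mid-value.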

Contributor

By the same token, one would arguably want to push multiple SkipValue states instead of tracking nesting depth with a new variable? But then enum variants start to proliferate (basically need two of each).

Would it instead make sense to have a single skip offset that is the first stack index being skipped?
And then have pairs of match arms that decide what state gets pushed vs. merely traversed?

Member Author

Changed to enum variants in a255860

Contributor

Hmm, when I ran the benchmark on my laptop:

  • Your original PR had 8% overhead (wide run) and 36% benefit (narrow run)
  • The revised PR has 9% overhead and 30% benefit

Member Author

To remove the regression, I added a projection option to ReaderBuilder that enables projection-aware parsing.

When enabled, JSON fields not present in the schema are skipped during tape parsing rather than being fully parsed and later ignored. This improves performance for narrow projections over wide JSON data.

[image]

Contributor

Re projection -- Very slick!

Were you able to repro the slowdown going from a421d89 (original version) to a2aa758 (extra enum variants)? It's pretty consistent for me.

Member Author

@Weijun-H Weijun-H Jan 7, 2026

> Were you able to repro the slowdown going from a421d89 (original version) to a2aa758 (extra enum variants)? It's pretty consistent for me.

This is also consistent on my side.
[image]

Contributor

@scovich scovich Jan 7, 2026

I threw an LLM at this whole situation during a boring meeting, and arrived at a surprisingly different potential approach, if you're game to try it out?

The short version is:

  • Keep the existing (highly optimized and efficient) decoding logic, but factor it out to a helper method that is generic over const SKIP: bool that says whether to actually store the parsed output.
  • Wrap that helper in decode and decode_skip methods, with a clean transition between the two: enter at the : match (like the PR does today), and decode_skip breaks back out to decode when the stack length drops back down.
  • We need a new boolean skipping field to handle cases where input bytes were exhausted while skipping (so the next call to decode can jump straight to decode_skip when starting the next buffer of bytes)
  • The state stack tracks everything related to skipping (small memory cost but very efficient).
  • No new tape decoder enum variants needed.

In theory, the approach should be simpler (less duplicated source code) while also having friendlier branching (fewer and/or more predictable branches).
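
The `const SKIP: bool` idea can be illustrated on a much smaller scale. The sketch below is an assumption-laden toy, not the arrow-json `TapeDecoder`: `ToyTape`, `scan_string`, `decode_string`, and `skip_string` are hypothetical names, and only string scanning is modeled. It shows one scanning routine shared by a storing path and a skipping path, with the choice made at compile time so each instantiation carries no runtime flag check.

```rust
/// Hypothetical stand-in for the tape: it only records raw string contents.
#[derive(Debug, Default)]
struct ToyTape {
    strings: Vec<String>,
}

/// Shared scanning logic, generic over whether the parsed output is stored.
/// `input` must start at the opening quote. Returns the number of bytes
/// consumed, or `None` if the closing quote was not found.
fn scan_string<const SKIP: bool>(input: &[u8], tape: &mut ToyTape) -> Option<usize> {
    if input.first() != Some(&b'"') {
        return None;
    }
    let mut i = 1;
    while i < input.len() {
        match input[i] {
            // Step over the backslash and the byte it escapes.
            b'\\' => i += 2,
            b'"' => {
                // `SKIP` is a compile-time constant, so this branch is
                // removed entirely in the skipping instantiation.
                if !SKIP {
                    let raw = String::from_utf8_lossy(&input[1..i]).into_owned();
                    tape.strings.push(raw);
                }
                return Some(i + 1);
            }
            _ => i += 1,
        }
    }
    None
}

/// Storing wrapper: scans the string and records it on the tape.
fn decode_string(input: &[u8], tape: &mut ToyTape) -> Option<usize> {
    scan_string::<false>(input, tape)
}

/// Skipping wrapper: scans the same bytes but records nothing.
fn skip_string(input: &[u8], tape: &mut ToyTape) -> Option<usize> {
    scan_string::<true>(input, tape)
}
```

Both wrappers consume the same number of bytes for the same input; only `decode_string` writes to the tape. Resuming a skip across buffer boundaries, which the proposal handles with a `skipping` field, is not modeled here.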

Is that something you'd want me to put a bit more time into exploring further?
Or something you'd prefer to dig into yourself?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 6, 2026
@Weijun-H Weijun-H force-pushed the tape-skip-in-json-parse branch from b427eeb to dfcbc97 Compare January 6, 2026 11:03
@Weijun-H Weijun-H force-pushed the tape-skip-in-json-parse branch from 08fcaaa to 8c1f4e9 Compare January 6, 2026 21:11