Performance 10588043604: Decode only required columns in processing pipeline with dynamic schema #2766
Conversation
```cpp
ARCTICDB_TRACE(log::codec(), "Creating segment");
SegmentInMemory segment_in_memory(std::move(descriptor));
decode_into_memory_segment(seg, hdr, segment_in_memory, desc);
segment_in_memory.set_row_data(std::max(segment_in_memory.row_count() - 1, ranges_and_key_.row_range().diff() - 1));
```
When can the row_range.diff() be different from the decoded segment size?
Also, out of curiosity, was this change needed to make this PR work, or is it just a general improvement? I agree it is more correct to e.g. populate the sparse map if a column is missing from the segment.
The decoded segment size is based on the longest column it sees while decoding. This change was needed to make the PR work in the case where:
- `columns=[]` (i.e. just the index column[s] on the V2 API)
- The data has a range index
- Some processing (e.g. head or tail) is applied

Covered in these tests.
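For concreteness, a minimal sketch of that scenario, assuming the standard arcticdb Python API (the LMDB URI and the library/symbol names are illustrative):

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic, LibraryOptions

# Illustrative setup: a dynamic-schema library backed by local LMDB
ac = Arctic("lmdb:///tmp/arcticdb_demo")
lib = ac.get_library(
    "demo", create_if_missing=True,
    library_options=LibraryOptions(dynamic_schema=True),
)

# Default RangeIndex, so the stored data has a range index rather than a
# physical index column
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10)})
lib.write("sym", df)

# columns=[] (just the index on the V2 API) plus head/tail processing:
# no data columns are decoded, so the row count has to come from the
# row range rather than from the longest decoded column
result = lib.head("sym", n=3, columns=[])
print(result.data)
```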
```diff
-if (!dynamic_schema || column_groups) {
-    get_column_bitset_in_context(query, pipeline_context);
-}
+get_column_bitset_in_context(query, pipeline_context);
```
Definitely not for this PR:
I think this will have a side effect of making something like read(columns=all_cols_but_one) slower for very short and wide dataframes.
It looks like there are a bunch of avoidable column name string munging operations (e.g. a bunch of copying here).
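A hedged sketch of the access pattern being described, again assuming the standard arcticdb Python API (shapes and names are illustrative):

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic

lib = Arctic("lmdb:///tmp/arcticdb_demo").get_library("demo", create_if_missing=True)

# Short and wide: 5 rows, 10,000 columns
cols = [f"c{i}" for i in range(10_000)]
lib.write("wide", pd.DataFrame(np.zeros((5, len(cols))), columns=cols))

# Selecting all-but-one column now always builds the column bitset, so the
# per-column-name work (lookups, copies) scales with the ~10,000 requested
# names even though only 5 rows come back
vit = lib.read("wide", columns=cols[:-1])
```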
Possibly, yeah. This is all a bit of a mess. We have 4 different fields in the pipeline context right now controlling slightly different aspects of this. The (hugely unnecessary) complexity of the code in query.hpp makes it hard to change confidently, even though all it is doing is picking out row and column slices from the index to read.
I think a multi-stage refactor is needed. One stage is to eliminate the pipeline context from everything after the data keys are read, and another is to clean up everything to do with column selection and index key filtering. I think the impetus will come from either:
- The current index key filtering is too slow for huge data
- We fully implement bucketize dynamic, when a lot of this code will need to be touched anyway
Reference Issues/PRs
10588043604
What does this implement or fix?
PipelineContext::overall_column_bitset_ is used to construct a hash set of columns that are to be decoded in the processing pipeline path. Prior to this change, this bitset was only populated for static schema and (unsupported) bucketize dynamic schema data. This resulted in all columns being decoded with dynamic schema, which becomes a performance bottleneck with very wide dataframes where only a few columns are required for the processing and output.
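As an illustration of the read pattern this speeds up, a hedged sketch assuming the standard arcticdb Python API (a QueryBuilder read goes through the processing pipeline path described above; names and sizes are illustrative):

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic, LibraryOptions, QueryBuilder

ac = Arctic("lmdb:///tmp/arcticdb_demo")
lib = ac.get_library(
    "wide_dynamic", create_if_missing=True,
    library_options=LibraryOptions(dynamic_schema=True),
)

# Very wide dynamic-schema symbol where a typical query touches few columns
cols = [f"c{i}" for i in range(5_000)]
lib.write("sym", pd.DataFrame(np.zeros((100, len(cols))), columns=cols))

# A processing-pipeline read that only needs two columns; previously every
# column in each segment was decoded under dynamic schema, now only the
# requested ones are
q = QueryBuilder()
q = q[q["c0"] > -1.0]
result = lib.read("sym", columns=["c0", "c1"], query_builder=q)
```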
Test coverage is already fairly comprehensive in test_basic_version_store.py::test_dynamic_schema_read_columns, and in test_read_index.py for read_index behaviour.