
Conversation

@alexowens90 (Collaborator)

Reference Issues/PRs

10588043604

What does this implement or fix?

PipelineContext::overall_column_bitset_ is used to construct a hash set of columns that are to be decoded in the processing pipeline path.
Prior to this change, this bitset was only populated for static schema and for (unsupported) bucketize dynamic schema data. As a result, all columns were decoded with dynamic schema, which becomes a performance bottleneck with very wide dataframes where only a few columns are required for the processing and output.
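
As a rough illustration (all names and types here are hypothetical, not ArcticDB's actual ones), the idea is to map the requested column names onto field indices in the segment descriptor, so the decode path can skip every field that is not marked:

#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch: mark which field indices in a segment descriptor
// correspond to requested columns, so only those get decoded.
std::vector<bool> build_column_bitset(
        const std::vector<std::string>& descriptor_fields,
        const std::unordered_set<std::string>& requested_columns) {
    std::vector<bool> required(descriptor_fields.size(), false);
    for (std::size_t idx = 0; idx < descriptor_fields.size(); ++idx) {
        if (requested_columns.count(descriptor_fields[idx]) > 0)
            required[idx] = true;  // decode this column; skip the rest
    }
    return required;
}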

Test coverage is already fairly comprehensive in test_basic_version_store.py::test_dynamic_schema_read_columns, and read_index behaviour is covered in test_read_index.py.

@alexowens90 alexowens90 self-assigned this Nov 18, 2025
@alexowens90 alexowens90 added the patch (Small change, should increase patch version) and performance labels Nov 18, 2025
@alexowens90 alexowens90 mentioned this pull request Nov 18, 2025
ARCTICDB_TRACE(log::codec(), "Creating segment");
SegmentInMemory segment_in_memory(std::move(descriptor));
decode_into_memory_segment(seg, hdr, segment_in_memory, desc);
segment_in_memory.set_row_data(std::max(segment_in_memory.row_count() - 1, ranges_and_key_.row_range().diff() - 1));
Collaborator

When can row_range.diff() be different from the decoded segment size?

Also, out of curiosity: was this change needed to make this PR work, or is it just a general improvement? I agree it is more correct to e.g. populate the sparse map if a column is missing from the segment.

Collaborator Author


The decoded segment size is based on the longest column seen while decoding. This change was needed to make the PR work in the case where:

  • columns=[] (i.e. just the index column[s] on the V2 API)
  • The data has a range index
  • Some processing (e.g. head or tail) is applied

Covered in these tests
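
A minimal sketch of the reconciliation (illustrative names, not ArcticDB's actual types): with columns=[] and a range index nothing is physically decoded, so the decoded row count is 0, while row_range().diff() still records how many rows the slice covers:

#include <algorithm>
#include <cstdint>

// Illustrative sketch, not the actual ArcticDB code: pick the last row index
// from whichever of the two row counts actually knows the slice's extent.
std::int64_t last_row_index(std::int64_t decoded_rows, std::int64_t range_rows) {
    // decoded_rows == 0 when no columns are decoded, so decoded_rows - 1
    // would be -1; range_rows - 1 then supplies the correct last index.
    return std::max(decoded_rows - 1, range_rows - 1);
}

For a 100-row slice read with columns=[], last_row_index(0, 100) yields 99, matching the set_row_data call quoted above.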

-if (!dynamic_schema || column_groups) {
-    get_column_bitset_in_context(query, pipeline_context);
-}
+get_column_bitset_in_context(query, pipeline_context);
Collaborator

Definitely not for this PR:

I think this will have a side effect of making something like read(columns=all_cols_but_one) slower for very short and wide dataframes.

It looks like there are a bunch of avoidable column name string munging operations (e.g. a bunch of copying here).
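
For illustration only (this is not the PR's code, just the kind of copy-free lookup being alluded to): with C++20 heterogeneous lookup, a set keyed by std::string can be probed with a std::string_view, so membership checks need no temporary string allocations:

#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <unordered_set>

// Transparent hash so the unordered_set accepts string_view lookups.
struct TransparentHash {
    using is_transparent = void;
    std::size_t operator()(std::string_view sv) const noexcept {
        return std::hash<std::string_view>{}(sv);
    }
};

using ColumnSet = std::unordered_set<std::string, TransparentHash, std::equal_to<>>;

bool column_required(const ColumnSet& columns, std::string_view name) {
    return columns.contains(name);  // no std::string copy of the queried name
}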

Collaborator Author


Possibly, yeah. This is all a bit of a mess. We have 4 different fields in the pipeline context right now controlling slightly different aspects of this. The (hugely unnecessary) complexity of the code in query.hpp makes it hard to change confidently, even though all it is doing is picking out row and column slices from the index to read.
I think there is a multi-stage refactor needed. One stage is to eliminate the pipeline context from everything after the data keys are read, and another is to clean up everything to do with column selection and index key filtering. I think the impetus will come from either:

  1. The current index key filtering being too slow for huge data
  2. Fully implementing bucketize dynamic, when a lot of this code will need to be touched anyway

@alexowens90 alexowens90 merged commit d69bc33 into master Nov 19, 2025
370 of 372 checks passed
@alexowens90 alexowens90 deleted the perf/10588043604/decode-only-required-columns-in-processing-pipeline-with-dynamic-schema branch November 19, 2025 10:23