Performance 10588043604: Decode only required columns in processing pipeline with dynamic schema #2766
Conversation
```cpp
ARCTICDB_TRACE(log::codec(), "Creating segment");
SegmentInMemory segment_in_memory(std::move(descriptor));
decode_into_memory_segment(seg, hdr, segment_in_memory, desc);
segment_in_memory.set_row_data(std::max(segment_in_memory.row_count() - 1, ranges_and_key_.row_range().diff() - 1));
```
When can the row_range.diff() be different from the decoded segment size?
Also, out of curiosity, was this change needed to make this PR work, or is it just a general improvement? I agree it is more correct to e.g. populate the sparse map if a column is missing from the segment.
The decoded segment size is based on the longest column it sees while decoding. This change was needed to make the PR work in the case where:
- `columns=[]` (i.e. just the index column[s] on the V2 API)
- The data has a range index
- Some processing (e.g. head or tail) is applied

Covered in these tests.
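For concreteness, a minimal sketch of that scenario, assuming the standard arcticdb Python API (the LMDB URI and the library/symbol names are illustrative):

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic, LibraryOptions

# Illustrative setup: a dynamic-schema library backed by local LMDB
ac = Arctic("lmdb:///tmp/arcticdb_demo")
lib = ac.get_library(
    "demo", create_if_missing=True,
    library_options=LibraryOptions(dynamic_schema=True),
)

# Default RangeIndex, so the stored data has a range index rather than a
# physical index column
df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10)})
lib.write("sym", df)

# columns=[] (just the index on the V2 API) plus head/tail processing:
# no data columns are decoded, so the row count has to come from the
# row range rather than from the longest decoded column
result = lib.head("sym", n=3, columns=[])
print(result.data)
```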
```diff
-if (!dynamic_schema || column_groups) {
-    get_column_bitset_in_context(query, pipeline_context);
-}
+get_column_bitset_in_context(query, pipeline_context);
```
Definitely not for this PR:
I think this will have a side effect of making something like read(columns=all_cols_but_one) slower for very short and wide dataframes.
It looks like there are a bunch of avoidable column name string munging operations (e.g. a bunch of copying here).
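A hedged sketch of the access pattern being described, again assuming the standard arcticdb Python API (shapes and names are illustrative):

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic

lib = Arctic("lmdb:///tmp/arcticdb_demo").get_library("demo", create_if_missing=True)

# Short and wide: 5 rows, 10,000 columns
cols = [f"c{i}" for i in range(10_000)]
lib.write("wide", pd.DataFrame(np.zeros((5, len(cols))), columns=cols))

# Selecting all-but-one column now always builds the column bitset, so the
# per-column-name work (lookups, copies) scales with the ~10,000 requested
# names even though only 5 rows come back
vit = lib.read("wide", columns=cols[:-1])
```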
Possibly, yeah. This is all a bit of a mess. We have 4 different fields in the pipeline context right now controlling slightly different aspects of this. The (hugely unnecessary) complexity of the code in query.hpp makes it hard to change confidently, even though all it is doing is picking out row and column slices from the index to read.
I think a multi-stage refactor is needed. One stage is to eliminate the pipeline context from everything after the data keys are read, and another is to clean up everything to do with column selection and index key filtering. I think the impetus will come from either:
- The current index key filtering is too slow for huge data
- We fully implement bucketize dynamic, when a lot of this code will need to be touched anyway
Reference Issues/PRs
10588043604
What does this implement or fix?
PipelineContext::overall_column_bitset_ is used to construct a hash set of columns that are to be decoded in the processing pipeline path. Prior to this change, this bitset was only populated for static schema and (unsupported) bucketize dynamic schema data. This resulted in all columns being decoded with dynamic schema, which becomes a performance bottleneck with very wide dataframes where only a few columns are required for the processing and output.
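As an illustration of the read pattern this speeds up, a hedged sketch assuming the standard arcticdb Python API (a QueryBuilder read goes through the processing pipeline path described above; names and sizes are illustrative):

```python
import numpy as np
import pandas as pd
from arcticdb import Arctic, LibraryOptions, QueryBuilder

ac = Arctic("lmdb:///tmp/arcticdb_demo")
lib = ac.get_library(
    "wide_dynamic", create_if_missing=True,
    library_options=LibraryOptions(dynamic_schema=True),
)

# Very wide dynamic-schema symbol where a typical query touches few columns
cols = [f"c{i}" for i in range(5_000)]
lib.write("sym", pd.DataFrame(np.zeros((100, len(cols))), columns=cols))

# A processing-pipeline read that only needs two columns; previously every
# column in each segment was decoded under dynamic schema, now only the
# requested ones are
q = QueryBuilder()
q = q[q["c0"] > -1.0]
result = lib.read("sym", columns=["c0", "c1"], query_builder=q)
```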
Test coverage is already fairly comprehensive in test_basic_version_store.py::test_dynamic_schema_read_columns, and in test_read_index.py for read_index behaviour.