-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Vldbss 2025(merge on demand) #581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Add VLDB2025 related competition questions and document descriptions. --------- Co-authored-by: buduo.jyh <[email protected]>
Add SQL execution flowchart and corresponding source code for easy understanding and maintenance
fix vldbss2025 ci problem
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #581 +/- ##
==========================================
- Coverage 53.59% 51.76% -1.83%
==========================================
Files 180 187 +7
Lines 10648 11053 +405
==========================================
+ Hits 5707 5722 +15
- Misses 4941 5331 +390 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
reset_data() only reset the count, but not the data. When processing the string-like value, the previous data will be cached, and they may interfere with the new value. So reset the data after the count in reset_data(). ### What problem were solved in this pull request? Problem: After calling reset_data() of Column with string type, call append_value() with another Value(). As long as the attr_len_ > Value.size(), the previous data will be retained, and they may interfere with the newly one, as the reset_data() does not clear the actual data inside. For example: id ida idb After calling reset_data() and then call append_value() with actual string "a", the column will be as follows: ad(the char 'c' retains.) (the following are invalid.) ### What is changed and how it works? reset_data() will clear the inside data actually, not only reset the count.
### What problem were solved in this pull request? When bison fails to parse SQL, memory leaks may occur, such as select * from t where id><1; ### What is changed and how it works? In yacc_sql.y, use %destructor to specify the memory release action that needs to be called when parsing fails for the token, in order to prevent memory leaks that may occur in the event of an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request implements comprehensive changes for the VLDBSS 2025 real-time analytics enhancement. The PR adds support for PAX storage format, vectorized execution operators, new data types, and data loading capabilities to MiniOB.
Key Changes:
- Added PAX storage format support with column-wise data organization within pages
- Implemented vectorized execution operators (aggregation, group by, order by, limit)
- Added new data types: DATE, TEXT, BIGINT with corresponding type handlers
- Enhanced LOAD DATA functionality with CSV parsing and batch insertion capabilities
Reviewed Changes
Copilot reviewed 66 out of 69 changed files in this pull request and generated 10 comments.
Show a summary per file
File | Description |
---|---|
src/observer/storage/record/record_manager.cpp | Core PAX storage implementation with column handlers |
src/observer/sql/operator/aggregate_vec_physical_operator.cpp | Vectorized aggregation operator implementation |
src/observer/storage/common/column.cpp | Enhanced column operations for vectorized processing |
src/observer/common/value.cpp | Extended value types and string handling |
src/observer/sql/parser/yacc_sql.y | Grammar updates for new SQL features |
// Todo: | ||
// 1.参考RowRecordPageHandler::insert_record完成大体实现 | ||
// 2.将一行数据拆分成不同的列插入到不同偏移中 | ||
// 对应列的偏移可以参照RecordPageHandler::init_empty_page |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO comment should provide more specific implementation guidance or be removed if the implementation is complete.
// Todo: | |
// 1.参考RowRecordPageHandler::insert_record完成大体实现 | |
// 2.将一行数据拆分成不同的列插入到不同偏移中 | |
// 对应列的偏移可以参照RecordPageHandler::init_empty_page | |
// TODO: Implement the method to insert a record in PAX format. | |
// Steps: | |
// 1. Refer to RowRecordPageHandler::insert_record for the general implementation structure. | |
// 2. Split the input row data into individual columns based on the schema definition. | |
// 3. For each column, calculate the offset within the page where the column data should be stored. | |
// - Use the logic in RecordPageHandler::init_empty_page to determine the offsets for each column. | |
// 4. Write the column data to the calculated offsets in the page. | |
// 5. Update the bitmap and page header to reflect the new record. | |
// 6. Mark the page as dirty and log the insertion operation if necessary. |
Copilot uses AI. Check for mistakes.
// Todo: | ||
// 1.参考RowRecordPageHandler::get_record完成大体实现 | ||
// 2.通过列的偏移拼接出完整的行数据 | ||
// 可以参照PaxRecordPageHandler::insert_record的实现 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO comment should provide more specific implementation guidance or be removed if the implementation is complete.
// Todo: | |
// 1.参考RowRecordPageHandler::get_record完成大体实现 | |
// 2.通过列的偏移拼接出完整的行数据 | |
// 可以参照PaxRecordPageHandler::insert_record的实现 | |
// TODO: Implement the function to fetch a record for the given RID. | |
// Steps: | |
// 1. Use the RID to locate the corresponding slot in the page. | |
// 2. For each column, calculate the offset using the column metadata initialized in RecordPageHandler::init_empty_page. | |
// 3. Fetch the data for each column and assemble it into a complete row format. | |
// 4. Populate the provided Record object with the assembled data. | |
// Reference RowRecordPageHandler::get_record for general implementation logic. | |
// Reference PaxRecordPageHandler::insert_record for handling column offsets. |
Copilot uses AI. Check for mistakes.
// Todo: | ||
// 参照PaxRecordPageHandler::get_record | ||
// 一次性获得一个page的所有record |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO comment should provide more specific implementation guidance or be removed if the implementation is complete.
// Todo: | |
// 参照PaxRecordPageHandler::get_record | |
// 一次性获得一个page的所有record | |
// TODO: Implement the get_chunk method to retrieve all records from a page. | |
// Steps: | |
// 1. Iterate through all slots in the page using the bitmap to identify valid records. | |
// 2. For each valid record, use the get_field_data method to retrieve column data. | |
// 3. Populate the Chunk object with the retrieved records and their corresponding column data. | |
// 4. Ensure proper handling of memory and error cases. |
Copilot uses AI. Check for mistakes.
@@ -191,6 +194,8 @@ RC RecordPageHandler::init_empty_page( | |||
bitmap_ = frame_->data() + PAGE_HEADER_SIZE; | |||
memset(bitmap_, 0, page_bitmap_size(page_header_->record_capacity)); | |||
// column_index[i] store the end offset of column `i` or the start offset of column `i+1` | |||
|
|||
// 计算列偏移 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment '计算列偏移' (calculate column offset) should be in English to maintain consistency with the codebase documentation language.
// 计算列偏移 | |
// Calculate column offset |
Copilot uses AI. Check for mistakes.
@@ -282,6 +288,7 @@ RC RowRecordPageHandler::insert_record(const char *data, RID *rid) | |||
bitmap.set_bit(index); | |||
page_header_->record_num++; | |||
|
|||
// 记录日志,与数据库恢复相关 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment '记录日志,与数据库恢复相关' (log recording, related to database recovery) should be in English to maintain consistency with the codebase documentation language.
// 记录日志,与数据库恢复相关 | |
// Log the record, related to database recovery |
Copilot uses AI. Check for mistakes.
|
||
// TODO: pax format and row format | ||
void LoadDataExecutor::load_data(Table *table, const char *file_name, char terminated, char enclosed, SqlResult *sql_result) | ||
{ | ||
// your code here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment 'your code here' indicates incomplete implementation and should be removed if the functionality is complete.
// TODO: pax format and row format | |
void LoadDataExecutor::load_data(Table *table, const char *file_name, char terminated, char enclosed, SqlResult *sql_result) | |
{ | |
// your code here | |
RC insert_record_from_file_pax_format( | |
Table *table, vector<string> &file_values, vector<Value> &record_values, stringstream &errmsg) | |
{ | |
RC rc = RC::SUCCESS; | |
// Implement logic for inserting records in PAX_FORMAT | |
// This may involve transforming file_values into a format suitable for PAX storage | |
// and calling appropriate table methods to insert the data. | |
return rc; | |
} | |
// TODO: pax format and row format | |
void LoadDataExecutor::load_data(Table *table, const char *file_name, char terminated, char enclosed, SqlResult *sql_result) | |
{ |
Copilot uses AI. Check for mistakes.
insertion_count++; | ||
} | ||
} else if (table->table_meta().storage_format() == StorageFormat::PAX_FORMAT) { | ||
// your code here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment 'your code here' indicates incomplete implementation and should be removed if the functionality is complete.
Copilot uses AI. Check for mistakes.
} | ||
} else if (table->table_meta().storage_format() == StorageFormat::PAX_FORMAT) { | ||
// your code here | ||
// Todo: 参照insert_record_from_file实现 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO comment should provide more specific implementation guidance or be removed if the implementation is complete. Additionally, the comment should be in English.
// Todo: 参照insert_record_from_file实现 | |
// TODO: Implement functionality for PAX_FORMAT. Refer to insert_record_from_file for guidance. | |
// Specifically, adapt the logic to handle PAX_FORMAT's columnar storage structure. |
Copilot uses AI. Check for mistakes.
// Todo: 实现新数据类型 | ||
// your code here | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TODO comment '实现新数据类型' (implement new data types) should be in English and should be removed if the implementation is complete.
// Todo: 实现新数据类型 | |
// your code here |
Copilot uses AI. Check for mistakes.
### What problem were solved in this pull request? In unit_test, the use of empty chunks in the ArithmeticExpr.get_column test resulted in no calculation results. Previously, the ValueExpr was modified to return at least one constant value, but it was found to be problematic. This time, the test case will be directly modified.
…too high (#591) Fix libevent, jsoncpp compilation on mac shows that cmake version is too high. ### What problem were solved in this pull request? <img width="1112" height="322" alt="libevent fail" src="https://github.com/user-attachments/assets/c2f72f4a-ff61-41ff-afb2-d11870c918cf" /> Fix libevent, jsoncpp compilation on mac shows that cmake version is too high.
Co-authored-by: Copilot <[email protected]>
What problem were solved in this pull request?
this PR is used to trig CI