Skip to content

Vldbss 2025(merge on demand) #581

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Vldbss 2025(merge on demand) #581

wants to merge 11 commits into from

Conversation

nautaa
Copy link
Member

@nautaa nautaa commented Jul 4, 2025

What problem were solved in this pull request?

this PR is used to trig CI

evtrouble and others added 3 commits July 3, 2025 14:52
Add VLDB2025 related competition questions and document descriptions.

---------

Co-authored-by: buduo.jyh <[email protected]>
Add SQL execution flowchart and corresponding source code for easy
understanding and maintenance
fix vldbss2025 ci problem
Copy link

codecov bot commented Jul 7, 2025

Codecov Report

Attention: Patch coverage is 8.61678% with 403 lines in your changes missing coverage. Please review.

Project coverage is 51.76%. Comparing base (d482f5e) to head (fd66920).

Files with missing lines Patch % Lines
src/observer/sql/expr/aggregate_state.cpp 0.00% 104 Missing ⚠️
src/observer/sql/expr/aggregate_hash_table.cpp 0.00% 36 Missing ⚠️
src/observer/storage/common/arena_allocator.cpp 0.00% 33 Missing ⚠️
src/observer/sql/expr/expression.cpp 8.82% 31 Missing ⚠️
src/observer/sql/parser/yacc_sql.y 0.00% 26 Missing ⚠️
src/observer/storage/common/column.cpp 33.33% 22 Missing ⚠️
src/observer/common/type/string_t.h 0.00% 20 Missing ⚠️
src/observer/sql/expr/aggregate_state.h 0.00% 14 Missing ⚠️
src/observer/common/value.cpp 13.33% 13 Missing ⚠️
src/observer/sql/expr/expression.h 0.00% 13 Missing ⚠️
... and 20 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #581      +/-   ##
==========================================
- Coverage   53.59%   51.76%   -1.83%     
==========================================
  Files         180      187       +7     
  Lines       10648    11053     +405     
==========================================
+ Hits         5707     5722      +15     
- Misses       4941     5331     +390     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

hnwyllmm and others added 3 commits July 14, 2025 17:05
reset_data() only reset the count, but not the data. When processing the
string-like value, the previous data will be cached, and they may
interfere with the new value. So reset the data after the count in
reset_data().

### What problem were solved in this pull request?

Problem:

After calling reset_data() of Column with string type, call
append_value() with another Value(). As long as the attr_len_ >
Value.size(), the previous data will be retained, and they may interfere
with the newly one, as the reset_data() does not clear the actual data
inside.

For example:
id
ida
idb

After calling reset_data() and then call append_value() with actual
string "a", the column will be as follows:
ad(the char 'c' retains.)
(the following are invalid.)

### What is changed and how it works?

reset_data() will clear the inside data actually, not only reset the
count.
### What problem were solved in this pull request?
When bison fails to parse SQL, memory leaks may occur,  such as 
select * from t where id><1;

### What is changed and how it works?
In yacc_sql.y, use %destructor to specify the memory release action that
needs to be called when parsing fails for the token, in order to prevent
memory leaks that may occur in the event of an error.
@hnwyllmm hnwyllmm requested a review from Copilot July 19, 2025 02:41
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This pull request implements comprehensive changes for the VLDBSS 2025 real-time analytics enhancement. The PR adds support for PAX storage format, vectorized execution operators, new data types, and data loading capabilities to MiniOB.

Key Changes:

  • Added PAX storage format support with column-wise data organization within pages
  • Implemented vectorized execution operators (aggregation, group by, order by, limit)
  • Added new data types: DATE, TEXT, BIGINT with corresponding type handlers
  • Enhanced LOAD DATA functionality with CSV parsing and batch insertion capabilities

Reviewed Changes

Copilot reviewed 66 out of 69 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/observer/storage/record/record_manager.cpp Core PAX storage implementation with column handlers
src/observer/sql/operator/aggregate_vec_physical_operator.cpp Vectorized aggregation operator implementation
src/observer/storage/common/column.cpp Enhanced column operations for vectorized processing
src/observer/common/value.cpp Extended value types and string handling
src/observer/sql/parser/yacc_sql.y Grammar updates for new SQL features

Comment on lines +427 to +430
// Todo:
// 1.参考RowRecordPageHandler::insert_record完成大体实现
// 2.将一行数据拆分成不同的列插入到不同偏移中
// 对应列的偏移可以参照RecordPageHandler::init_empty_page
Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comment should provide more specific implementation guidance or be removed if the implementation is complete.

Suggested change
// Todo:
// 1.参考RowRecordPageHandler::insert_record完成大体实现
// 2.将一行数据拆分成不同的列插入到不同偏移中
// 对应列的偏移可以参照RecordPageHandler::init_empty_page
// TODO: Implement the method to insert a record in PAX format.
// Steps:
// 1. Refer to RowRecordPageHandler::insert_record for the general implementation structure.
// 2. Split the input row data into individual columns based on the schema definition.
// 3. For each column, calculate the offset within the page where the column data should be stored.
// - Use the logic in RecordPageHandler::init_empty_page to determine the offsets for each column.
// 4. Write the column data to the calculated offsets in the page.
// 5. Update the bitmap and page header to reflect the new record.
// 6. Mark the page as dirty and log the insertion operation if necessary.

Copilot uses AI. Check for mistakes.

Comment on lines +467 to +470
// Todo:
// 1.参考RowRecordPageHandler::get_record完成大体实现
// 2.通过列的偏移拼接出完整的行数据
// 可以参照PaxRecordPageHandler::insert_record的实现
Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comment should provide more specific implementation guidance or be removed if the implementation is complete.

Suggested change
// Todo:
// 1.参考RowRecordPageHandler::get_record完成大体实现
// 2.通过列的偏移拼接出完整的行数据
// 可以参照PaxRecordPageHandler::insert_record的实现
// TODO: Implement the function to fetch a record for the given RID.
// Steps:
// 1. Use the RID to locate the corresponding slot in the page.
// 2. For each column, calculate the offset using the column metadata initialized in RecordPageHandler::init_empty_page.
// 3. Fetch the data for each column and assemble it into a complete row format.
// 4. Populate the provided Record object with the assembled data.
// Reference RowRecordPageHandler::get_record for general implementation logic.
// Reference PaxRecordPageHandler::insert_record for handling column offsets.

Copilot uses AI. Check for mistakes.

Comment on lines +478 to +480
// Todo:
// 参照PaxRecordPageHandler::get_record
// 一次性获得一个page的所有record
Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comment should provide more specific implementation guidance or be removed if the implementation is complete.

Suggested change
// Todo:
// 参照PaxRecordPageHandler::get_record
// 一次性获得一个page的所有record
// TODO: Implement the get_chunk method to retrieve all records from a page.
// Steps:
// 1. Iterate through all slots in the page using the bitmap to identify valid records.
// 2. For each valid record, use the get_field_data method to retrieve column data.
// 3. Populate the Chunk object with the retrieved records and their corresponding column data.
// 4. Ensure proper handling of memory and error cases.

Copilot uses AI. Check for mistakes.

@@ -191,6 +194,8 @@ RC RecordPageHandler::init_empty_page(
bitmap_ = frame_->data() + PAGE_HEADER_SIZE;
memset(bitmap_, 0, page_bitmap_size(page_header_->record_capacity));
// column_index[i] store the end offset of column `i` or the start offset of column `i+1`

// 计算列偏移
Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment '计算列偏移' (calculate column offset) should be in English to maintain consistency with the codebase documentation language.

Suggested change
// 计算列偏移
// Calculate column offset

Copilot uses AI. Check for mistakes.

@@ -282,6 +288,7 @@ RC RowRecordPageHandler::insert_record(const char *data, RID *rid)
bitmap.set_bit(index);
page_header_->record_num++;

// 记录日志,与数据库恢复相关
Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment '记录日志,与数据库恢复相关' (log recording, related to database recovery) should be in English to maintain consistency with the codebase documentation language.

Suggested change
// 记录日志,与数据库恢复相关
// Log the record, related to database recovery

Copilot uses AI. Check for mistakes.

Comment on lines +84 to +88

// TODO: pax format and row format
void LoadDataExecutor::load_data(Table *table, const char *file_name, char terminated, char enclosed, SqlResult *sql_result)
{
// your code here
Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment 'your code here' indicates incomplete implementation and should be removed if the functionality is complete.

Suggested change
// TODO: pax format and row format
void LoadDataExecutor::load_data(Table *table, const char *file_name, char terminated, char enclosed, SqlResult *sql_result)
{
// your code here
RC insert_record_from_file_pax_format(
Table *table, vector<string> &file_values, vector<Value> &record_values, stringstream &errmsg)
{
RC rc = RC::SUCCESS;
// Implement logic for inserting records in PAX_FORMAT
// This may involve transforming file_values into a format suitable for PAX storage
// and calling appropriate table methods to insert the data.
return rc;
}
// TODO: pax format and row format
void LoadDataExecutor::load_data(Table *table, const char *file_name, char terminated, char enclosed, SqlResult *sql_result)
{

Copilot uses AI. Check for mistakes.

insertion_count++;
}
} else if (table->table_meta().storage_format() == StorageFormat::PAX_FORMAT) {
// your code here
Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment 'your code here' indicates incomplete implementation and should be removed if the functionality is complete.

Copilot uses AI. Check for mistakes.

}
} else if (table->table_meta().storage_format() == StorageFormat::PAX_FORMAT) {
// your code here
// Todo: 参照insert_record_from_file实现
Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comment should provide more specific implementation guidance or be removed if the implementation is complete. Additionally, the comment should be in English.

Suggested change
// Todo: 参照insert_record_from_file实现
// TODO: Implement functionality for PAX_FORMAT. Refer to insert_record_from_file for guidance.
// Specifically, adapt the logic to handle PAX_FORMAT's columnar storage structure.

Copilot uses AI. Check for mistakes.

Comment on lines +17 to +19
// Todo: 实现新数据类型
// your code here

Copy link
Preview

Copilot AI Jul 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TODO comment '实现新数据类型' (implement new data types) should be in English and should be removed if the implementation is complete.

Suggested change
// Todo: 实现新数据类型
// your code here

Copilot uses AI. Check for mistakes.

evtrouble and others added 5 commits July 21, 2025 14:27
### What problem were solved in this pull request?
In unit_test, the use of empty chunks in the ArithmeticExpr.get_column
test resulted in no calculation results. Previously, the ValueExpr was
modified to return at least one constant value, but it was found to be
problematic. This time, the test case will be directly modified.
…too high (#591)

Fix libevent, jsoncpp compilation on mac shows that cmake version is too
high.

### What problem were solved in this pull request?
<img width="1112" height="322" alt="libevent fail"
src="https://github.com/user-attachments/assets/c2f72f4a-ff61-41ff-afb2-d11870c918cf"
/>
Fix libevent, jsoncpp compilation on mac shows that cmake version is too
high.
Co-authored-by: Copilot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants