-
Notifications
You must be signed in to change notification settings - Fork 81
feat(clp-s): Add support for losslessly compressing floating-point numbers; Bump archive format version to v0.4.1. #1176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds lossless float-format retention: new NodeTypes (FormattedFloat, DictionaryFloat), a 16-bit float-format encoding API (get_float_encoding, restore_encoded_float), per-element format/dictionary storage and wiring across ParsedMessage, readers/writers, JSON parser/serializer, CLI flag --retain-float-format, build changes, and tests/fixtures. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant CLI as "clp-s CLI"
participant Parser as "JsonParser"
participant PM as "ParsedMessage"
participant CW as "ColumnWriter"
participant AW as "ArchiveWriter"
User->>CLI: clp-s --retain-float-format ...
CLI->>Parser: JsonParserOption{retain_float_format}
Parser->>Parser: detect numeric token
alt retain_float_format AND encodable AND round-trip matches
Parser->>PM: add_value(node_id, double, format)
PM->>CW: variable_t(pair<double,float_format_t>)
CW->>CW: FormattedFloatColumnWriter.add_value
else retain_float_format AND encodable but round-trip fails
Parser->>PM: add_value(node_id, string)
PM->>CW: variable_t(string)
CW->>CW: DictionaryFloatColumnWriter.add_value (dict insert)
else retain_float_format off
Parser->>PM: add_value(node_id, double)
PM->>CW: variable_t(double)
CW->>CW: FloatColumnWriter.add_value
end
CW->>AW: store serialized blocks (values, formats or dict IDs)
AW->>User: archive stored
sequenceDiagram
autonumber
participant AR as "ArchiveReader"
participant CR as "ColumnReader"
participant SR as "SchemaReader"
participant JS as "JsonSerializer"
participant Out as "Output"
AR->>SR: generate_json_template()
SR->>JS: emit ops (AddFormattedFloat* included)
loop per row
AR->>CR: read column by NodeType
alt FormattedFloat
CR->>CR: load values + formats
CR->>JS: restore_encoded_float -> formatted string
else DictionaryFloat
CR->>CR: load dict IDs
CR->>JS: dictionary lookup -> string
else Float
CR->>JS: numeric double
end
JS->>Out: JSON line
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal). Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One important comment -- this new float encoding should be a new NodeType
w/ a new column reader and writer. We should avoid changing how the encoding for an existing NodeType
works.
…zero fraction, the exponent will be trimed incorrectly; Update the core-clp-s.md for the new flag and mention the limitation
One thing we'll need to figure out is whether we should add support to KV-IR streams, since we use them to view logs in the log viewer. E.g., if the user's original JSON log contained a float that was printed with a specific format (e.g., "-0.0e-0"), when converted to IR, that format wouldn't be retained). This wouldn't be a breaking change, so it seems doable. @LinZhihao-723 Any thoughts? |
From a format perspective, I think that should be feasible. We don't need to introduce any more types in the scheme tree, but add a CLP-encoded float type (so overall it works like the String type, where the encoded value can be a variable string or an encoded text AST). But we probably need to handle the case where the float can't be encoded into an 8-byte variable. In that case, we probably need to serialize it as a string. From the serializer's perspective, it probably requires more work: currently the serializer only takes msgpack. To trigger the CLP-float encoding, we may need to serialize the float as a string (which will have performance overhead), or we define a customized type (for which I don't know the details of how to do this. Need time to figure out). |
…lp into v0.4.0-dev-retain-float-format
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 22
🔭 Outside diff range comments (3)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
244-253
: Gate--retain-float-format
behind a config flagUnconditionally enabling this flag changes the output semantics for all CLP_S jobs and can break environments where the binary hasn’t been updated. Only append
--retain-float-format
when the user explicitly opts in (default False).• components/job-orchestration/job_orchestration/executor/compress/compression_task.py (around lines 244–253):
compression_cmd = [ str(clp_home / "bin" / "clp-s"), "c", str(archive_output_dir), "--print-archive-stats", "--target-encoded-size", str(clp_config.output.target_segment_size + clp_config.output.target_dictionaries_size), "--compression-level", str(clp_config.output.compression_level), - "--retain-float-format", ] # fmt: on + # only include the flag when retain_float_format is true (default: False) + if getattr(clp_config.output, "retain_float_format", False): + compression_cmd.append("--retain-float-format")• Configuration schema (e.g. in scheduler/config or your config definitions): add
output: retain_float_format: false• Ensure your
ClpIoConfig
(or equivalent) exposesretain_float_format: bool
with a default of False.components/core/src/clp_s/SchemaReader.cpp (1)
37-44
: FormattedFloat timestamps not handled (potential crash or incorrect metadata)mark_column_as_timestamp handles Float but not FormattedFloat. If a timestamp column can be encoded as FormattedFloat under the new flag, m_get_timestamp won’t be set, leading to incorrect timestamps or dereferencing an uninitialized functor in get_next_message_with_metadata.
Consider handling FormattedFloat analogously to Float:
} else if (m_timestamp_column->get_type() == NodeType::Float) { m_get_timestamp = [this]() { return static_cast<epochtime_t>( std::get<double>(static_cast<FloatColumnReader*>(m_timestamp_column) ->extract_value(m_cur_message)) ); }; + } else if (m_timestamp_column->get_type() == NodeType::FormattedFloat) { + m_get_timestamp = [this]() { + return static_cast<epochtime_t>( + std::get<double>(static_cast<FormattedFloatColumnReader*>(m_timestamp_column) + ->extract_value(m_cur_message)) + ); + }; }If FormattedFloatColumnReader does not expose extract_value as a double, we should add an accessor that returns the underlying numeric value rather than parsing the formatted string.
components/core/tests/test-clp_s-end_to_end.cpp (1)
110-142
: Security concern: Command injection vulnerability in shell commandsThe
literallyCompare
function constructs shell commands using string interpolation without proper escaping of file paths. If file paths contain special characters (spaces, semicolons, backticks, etc.), this could lead to command injection.Use proper escaping or consider using C++ file operations instead of shell commands:
+#include <fstream> +#include <algorithm> +#include <sstream> + +// Helper function to escape shell arguments +std::string escape_shell_arg(const std::string& arg) { + std::string escaped = "'"; + for (char c : arg) { + if (c == '\'') { + escaped += "'\\''"; + } else { + escaped += c; + } + } + escaped += "'"; + return escaped; +} + void literallyCompare( std::filesystem::path const& expected_output_json_path, std::filesystem::path const& extracted_json_path ) { auto command = fmt::format( - "cat {} | awk 'NF' | tr -d ' ' | sort > {}", - expected_output_json_path.string(), - cTestEndToEndExpectedOutputSortedFile + "cat {} | awk 'NF' | tr -d ' ' | sort > {}", + escape_shell_arg(expected_output_json_path.string()), + escape_shell_arg(std::string(cTestEndToEndExpectedOutputSortedFile)) );
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (30)
components/core/src/clp_s/ArchiveReader.cpp
(2 hunks)components/core/src/clp_s/ArchiveWriter.cpp
(1 hunks)components/core/src/clp_s/ColumnReader.cpp
(3 hunks)components/core/src/clp_s/ColumnReader.hpp
(1 hunks)components/core/src/clp_s/ColumnWriter.cpp
(1 hunks)components/core/src/clp_s/ColumnWriter.hpp
(2 hunks)components/core/src/clp_s/CommandLineArguments.cpp
(1 hunks)components/core/src/clp_s/CommandLineArguments.hpp
(2 hunks)components/core/src/clp_s/FloatFormatEncoding.hpp
(1 hunks)components/core/src/clp_s/JsonParser.cpp
(5 hunks)components/core/src/clp_s/JsonParser.hpp
(2 hunks)components/core/src/clp_s/JsonSerializer.hpp
(1 hunks)components/core/src/clp_s/SchemaReader.cpp
(4 hunks)components/core/src/clp_s/SchemaTree.cpp
(1 hunks)components/core/src/clp_s/SchemaTree.hpp
(1 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)components/core/tests/clp_s_test_utils.cpp
(2 hunks)components/core/tests/clp_s_test_utils.hpp
(2 hunks)components/core/tests/test-clp_s-delta-encode-log-order.cpp
(1 hunks)components/core/tests/test-clp_s-end_to_end.cpp
(4 hunks)components/core/tests/test-clp_s-range_index.cpp
(1 hunks)components/core/tests/test-clp_s-search.cpp
(5 hunks)components/core/tests/test_log_files/test_rounded_formatted_float.jsonl
(1 hunks)components/core/tests/test_log_files/test_rounded_formatted_float_rounded.jsonl
(1 hunks)components/core/tests/test_log_files/test_search_formatted_float.jsonl
(1 hunks)components/core/tests/test_log_files/test_valid_formatted_float.jsonl
(1 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py
(1 hunks)docs/src/dev-guide/design-retain-float-format.md
(1 hunks)docs/src/dev-guide/index.md
(1 hunks)docs/src/user-guide/core-clp-s.md
(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit Configuration File
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/src/clp_s/JsonParser.hpp
components/core/src/clp_s/ArchiveWriter.cpp
components/core/src/clp_s/CommandLineArguments.cpp
components/core/src/clp_s/SchemaTree.hpp
components/core/tests/clp_s_test_utils.cpp
components/core/src/clp_s/SchemaTree.cpp
components/core/src/clp_s/FloatFormatEncoding.hpp
components/core/src/clp_s/CommandLineArguments.hpp
components/core/src/clp_s/ColumnWriter.hpp
components/core/src/clp_s/ArchiveReader.cpp
components/core/src/clp_s/clp-s.cpp
components/core/src/clp_s/JsonSerializer.hpp
components/core/src/clp_s/ColumnReader.hpp
components/core/src/clp_s/ColumnWriter.cpp
components/core/tests/test-clp_s-search.cpp
components/core/src/clp_s/SchemaReader.cpp
components/core/tests/clp_s_test_utils.hpp
components/core/tests/test-clp_s-end_to_end.cpp
components/core/tests/test-clp_s-delta-encode-log-order.cpp
components/core/tests/test-clp_s-range_index.cpp
components/core/src/clp_s/JsonParser.cpp
components/core/src/clp_s/ColumnReader.cpp
🧠 Learnings (7)
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/clp_s_test_utils.cpp
components/core/tests/test-clp_s-search.cpp
components/core/tests/clp_s_test_utils.hpp
components/core/tests/test-clp_s-end_to_end.cpp
components/core/tests/test-clp_s-delta-encode-log-order.cpp
components/core/tests/test-clp_s-range_index.cpp
📚 Learning: 2025-08-06T23:57:49.762Z
Learnt from: gibber9809
PR: y-scope/clp#1163
File: components/core/src/clp_s/ColumnWriter.hpp:120-122
Timestamp: 2025-08-06T23:57:49.762Z
Learning: In the CLP codebase, the `encoded_log_dict_id_t` type alias in `components/core/src/clp_s/ColumnWriter.hpp` is appropriately placed as a class-scoped alias because the ColumnWriter.hpp header is already included in all places where this type is used, eliminating the need to move it to a common header.
Applied to files:
components/core/src/clp_s/ColumnWriter.hpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
Applied to files:
components/core/tests/test-clp_s-range_index.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
🧬 Code Graph Analysis (13)
components/core/src/clp_s/ArchiveWriter.cpp (2)
components/core/src/clp_s/ColumnReader.hpp (1)
FormattedFloat
(161-161)components/core/src/clp_s/ArchiveWriter.hpp (1)
writer
(278-278)
components/core/src/clp_s/CommandLineArguments.cpp (1)
components/core/src/clp_s/CommandLineArguments.hpp (1)
m_retain_float_format
(105-105)
components/core/src/clp_s/SchemaTree.hpp (1)
components/core/src/clp_s/ColumnReader.hpp (1)
FormattedFloat
(161-161)
components/core/src/clp_s/SchemaTree.cpp (1)
components/core/src/clp_s/ColumnReader.hpp (1)
FormattedFloat
(161-161)
components/core/src/clp_s/ArchiveReader.cpp (2)
components/core/src/clp_s/ColumnReader.hpp (1)
FormattedFloat
(161-161)components/core/src/clp_s/SchemaReader.hpp (5)
column_reader
(122-122)column_reader
(128-128)column_reader
(204-204)column_reader
(209-211)column_reader
(209-209)
components/core/src/clp_s/clp-s.cpp (1)
components/core/src/clp_s/ArchiveWriter.hpp (1)
option
(146-146)
components/core/src/clp_s/ColumnReader.hpp (1)
components/core/src/clp_s/Utils.hpp (2)
start
(223-225)start
(223-223)
components/core/src/clp_s/ColumnWriter.cpp (2)
components/core/src/clp_s/ParsedMessage.hpp (4)
add_value
(35-37)add_value
(35-35)value
(64-66)value
(64-64)components/core/src/clp_s/ColumnWriter.hpp (9)
value
(29-29)value
(59-59)value
(76-76)value
(94-94)value
(111-111)value
(129-129)value
(156-156)value
(213-213)value
(231-231)
components/core/tests/test-clp_s-search.cpp (3)
components/core/tests/test-clp_s-range_index.cpp (6)
get_test_input_path_relative_to_tests_dir
(36-36)get_test_input_path_relative_to_tests_dir
(52-54)get_test_input_path_relative_to_tests_dir
(52-52)get_test_input_local_path
(37-37)get_test_input_local_path
(56-60)get_test_input_local_path
(56-56)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(13-59)compress_archive
(13-20)components/core/tests/clp_s_test_utils.hpp (1)
compress_archive
(23-30)
components/core/src/clp_s/SchemaReader.cpp (6)
components/core/src/clp_s/JsonSerializer.hpp (4)
column
(133-136)column
(133-133)column
(138-143)column
(139-139)components/core/src/clp_s/search/Projection.hpp (1)
column
(42-42)components/core/src/clp_s/search/SchemaMatch.hpp (4)
column
(94-94)column
(101-101)column
(154-154)column
(161-161)components/core/src/clp_s/search/ast/ColumnDescriptor.hpp (1)
static_cast
(243-243)components/core/src/clp_s/SchemaReader.hpp (2)
m_cur_message
(236-236)m_columns
(130-130)components/core/src/clp_s/ColumnReader.hpp (1)
FormattedFloat
(161-161)
components/core/tests/test-clp_s-end_to_end.cpp (3)
components/core/tests/test-clp_s-search.cpp (6)
get_test_input_path_relative_to_tests_dir
(46-47)get_test_input_path_relative_to_tests_dir
(62-65)get_test_input_path_relative_to_tests_dir
(62-63)get_test_input_local_path
(48-48)get_test_input_local_path
(67-71)get_test_input_local_path
(67-67)components/core/tests/test-clp_s-range_index.cpp (6)
get_test_input_path_relative_to_tests_dir
(36-36)get_test_input_path_relative_to_tests_dir
(52-54)get_test_input_path_relative_to_tests_dir
(52-52)get_test_input_local_path
(37-37)get_test_input_local_path
(56-60)get_test_input_local_path
(56-56)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(13-59)compress_archive
(13-20)
components/core/src/clp_s/JsonParser.cpp (5)
components/core/src/clp_s/ArchiveWriter.hpp (2)
parent_node_id
(171-171)parent_node_id
(179-179)components/core/src/clp_s/CommandLineArguments.hpp (1)
m_retain_float_format
(105-105)components/core/src/clp_s/ColumnReader.hpp (2)
FormattedFloat
(161-161)Float
(138-138)components/core/src/clp_s/SchemaTree.hpp (1)
parent_node_id
(131-131)components/core/src/clp_s/JsonParser.hpp (3)
line
(95-95)line
(178-178)line
(185-185)
components/core/src/clp_s/ColumnReader.cpp (2)
components/core/src/clp_s/ColumnReader.hpp (9)
reader
(35-35)reader
(80-80)reader
(103-103)reader
(136-136)reader
(159-159)reader
(219-219)reader
(251-251)reader
(301-301)reader
(338-338)components/core/src/clp_s/Utils.hpp (4)
start
(223-225)start
(223-223)i
(217-221)i
(217-217)
🪛 LanguageTool
docs/src/user-guide/core-clp-s.md
[style] ~82-~82: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...at` flag to output decompressed data in exactly the same format it had before compression. For e...
(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)
docs/src/dev-guide/design-retain-float-format.md
[uncategorized] ~29-~29: Loose punctuation mark.
Context: ...ponent is denoted by E
or e
. * 00
: Not scientific * 01
: Scientific notat...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~40-~40: Loose punctuation mark.
Context: ...whether the exponent has a sign: * 00
: No sign * 01
: +
* 10
: -
For ex...
(UNLIKELY_OPENING_PUNCTUATION)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/core-clp-s.md
83-83: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
180-180: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
181-181: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
docs/src/dev-guide/design-retain-float-format.md
18-18: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
29-29: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
30-30: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
31-31: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
40-40: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
41-41: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
42-42: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
50-50: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
73-73: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
74-74: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
75-75: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
76-76: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
🔇 Additional comments (36)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
201-225
: No action needed –clp
doesn’t yet support--retain-float-format
I’ve confirmed that only the CLP_S path (_make_clp_s_command_and_env) appends--retain-float-format
and that theclp
binary (used by _make_clp_command_and_env) does not currently accept this flag. Once upstream adds support for--retain-float-format
toclp
, we can mirror the same gated logic in_make_clp_command_and_env
.components/core/src/clp_s/CommandLineArguments.hpp (2)
105-106
: LGTM: Getter for retain-float-format flagAccessor aligns with existing getter patterns and provides the necessary plumbing for the new CLI option.
186-186
: LGTM: Default-initialized retain-float-format flagDefaulting to false preserves existing behaviour unless the flag is explicitly set.
components/core/tests/test_log_files/test_search_formatted_float.jsonl (1)
1-13
: LGTM: Good coverage of formatted-float search casesWell-chosen values to exercise signs, exponent formatting (including leading zeros), rounding boundaries, and small/large magnitudes.
components/core/src/clp_s/JsonSerializer.hpp (1)
19-27
: Confirm enum Op value stability for backward compatibilityAddFormattedFloatField and AddFormattedFloatValue were inserted mid-enum in JsonSerializer.hpp. If any component persists or transmits Op by its numeric value (for example, on-disk templates, cached op streams or IPC), inserting new entries in the middle will shift all subsequent values and break compatibility. If Op is used strictly in-memory within a single process, this change is safe. Otherwise, new ops should be appended at the end or given explicit values to preserve existing ordinal assignments.
Files to review:
- components/core/src/clp_s/JsonSerializer.hpp (enum Op, around lines 19–27)
Suggested diff for safer ordering:
AddNullField, AddIntValue, AddFloatValue, AddFormattedFloatValue, AddBoolValue, AddStringField, AddArrayField, AddNullField, AddIntValue, AddFloatValue, - AddFormattedFloatField, - AddBoolValue, - AddStringField, - AddArrayField, - AddNullField, - AddIntValue, - AddFloatValue, - AddFormattedFloatValue, - AddBoolValue, + // New float formatting ops appended to preserve existing values + AddFormattedFloatField, + AddFormattedFloatValue,Please confirm that Op values are never serialized or persisted by numeric value.
components/core/src/clp_s/SchemaTree.cpp (1)
17-17
: LGTM! FormattedFloat maps to FloatT correctly.The addition of the
FormattedFloat
case to map toLiteralType::FloatT
is appropriate since formatted floats are still floating-point literals from a search perspective, even though they carry additional formatting metadata.components/core/tests/test_log_files/test_rounded_formatted_float.jsonl (1)
1-42
: Comprehensive test coverage for float formatting edge cases.The test data covers a good range of precision and formatting edge cases including:
- Values exceeding precision limits
- Leading/trailing zeros
- Scientific notation with various exponent formats
- Negative values and signs
- Mantissas greater than 10
This will help validate the FormattedFloat feature's ability to preserve formatting across these challenging scenarios.
components/core/tests/clp_s_test_utils.hpp (1)
17-17
: LGTM! Parameter addition is well-documented.The
retain_float_format
parameter is properly documented and positioned logically in the parameter list after the archive directory.Also applies to: 26-26
components/core/src/clp_s/ArchiveWriter.cpp (1)
319-321
: Add FormattedFloat column writer case — looks correctSwitch case cleanly mirrors existing Float handling and appends the correct ColumnWriter type, with a break to avoid fall-through.
docs/src/dev-guide/index.md (1)
86-87
: Documentation file confirmed
Verified thatdocs/src/dev-guide/design-retain-float-format.md
exists and is correctly referenced indocs/src/dev-guide/index.md
.components/core/src/clp_s/JsonParser.hpp (1)
228-229
: m_retain_float_format correctly assigned and usedThe constructor in JsonParser.cpp (line 94) initializes
m_retain_float_format
fromoption.retain_float_format
, and both parsing paths respect this flag:
- In
parse()
/parse_from_ir()
, the flag guards whether to capture the raw JSON token or the numeric value (lines 205, 293).- In the IR-to-archive logic, it switches between
NodeType::FormattedFloat
andNodeType::Float
based on the flag (line 409).No changes required.
components/core/src/clp_s/clp-s.cpp (2)
107-108
: Plumbing retain_float_format from CLI — LGTMCorrectly propagates the CLI flag into JsonParserOption; no behavioural issues spotted.
107-108
: CLI flag --retain-float-format is correctly defined and documentedI’ve verified that:
get_retain_float_format()
is implemented in components/core/src/clp_s/CommandLineArguments.hpp- The
--retain-float-format
option is registered in components/core/src/clp_s/CommandLineArguments.cpp withpo::bool_switch(&m_retain_float_format)
- The help text "Store the format of float numbers for lossless decompression." is included in the usage output
No further changes required.
components/core/tests/test-clp_s-delta-encode-log-order.cpp (2)
67-74
: Adapt test to new compress_archive signature — LGTMNew boolean parameter set to false fits the default for this test and maintains previous behaviour.
67-74
: compress_archive signature and calls are consistent
Thecompress_archive
helper now takes six parameters(file_path, archive_directory, retain_float_format, single_file_archive, structurize_arrays, file_type)
, and all test callsites in:
- test-clp_s-search.cpp
- test-clp_s-range_index.cpp
- test-clp_s-end_to_end.cpp
- test-clp_s-delta-encode-log-order.cpp
are updated to pass the
retain_float_format
boolean in the third position. No further changes needed.components/core/src/clp_s/CommandLineArguments.cpp (2)
251-253
: Add CLI flag to retain float format — LGTMThe addition is consistent with surrounding options and uses the correct bool_switch pattern. Nice placement and concise help text.
251-254
: Clarify or enforce--retain-float-format
scope for KV-IRIt isn’t obvious whether
--retain-float-format
is supported when--file-type kv-ir
. We should either update the help text to call out JSON-only behaviour or add a parse-time check to error out on KV-IR, to avoid user confusion.Suggestions:
- If KV-IR does not support retaining the original float syntax, tweak the help string:
components/core/src/clp_s/CommandLineArguments.cpp ──────────────────────────────────────────────────────────────── - "Store the format of float numbers for lossless decompression." + "Store the original textual format of floating-point numbers for lossless decompression (JSON input only)."- Alternatively, enforce at parse time:
if (m_retain_float_format && m_file_type == FileType::KV_IR) { throw po::validation_error{ po::validation_error::invalid_option_value, "retain-float-format", "only valid for JSON input" }; }- If KV-IR does already honour this flag, feel free to ignore this comment.
components/core/src/clp_s/ArchiveReader.cpp (2)
197-199
: Support for FormattedFloat in ordered columns — LGTMThe new case mirrors Float handling and correctly instantiates FormattedFloatColumnReader.
253-255
: Support for FormattedFloat in unordered columns — LGTMConsistent with the ordered-path addition and other column types.
components/core/tests/test-clp_s-range_index.cpp (1)
223-231
: Updated compress_archive call site to include retain_float_format — LGTMPassing false keeps previous behaviour for this test while aligning with the new signature.
components/core/tests/clp_s_test_utils.cpp (1)
16-20
: Plumbing retain_float_format through test helper — LGTMSignature and the internal option wiring look correct and maintain existing defaults at call sites that pass false.
Also applies to: 40-41
components/core/src/clp_s/SchemaReader.cpp (4)
130-149
: Ensure restored format yields valid JSON numeric tokensAppending the restore_format result directly as a JSON value is correct for standard numeric tokens. Please verify that restore_format never returns non-JSON numerics (e.g., NaN/Inf forms) unquoted; otherwise, output may become invalid JSON.
467-471
: Template support for FormattedFloat in structured arrays — LGTMAdds the correct serializer op and column push to keep ordering consistent.
557-561
: Template support for FormattedFloat in structured objects — LGTMMatches existing Float handling using the new formatted ops.
671-675
: Template support for FormattedFloat in top-level JSON generation — LGTMConsistent with array/object cases and integrates with projection/local-tree generation.
components/core/src/clp_s/JsonParser.cpp (5)
205-218
: LGTM! Consistent implementation of formatted float handling.The implementation correctly handles the new
retain_float_format
option by storing the raw JSON token for floats when the flag is enabled, and using the newFormattedFloat
node type.
293-302
: LGTM! Consistent handling in array parsing.The array parsing logic mirrors the object parsing implementation, correctly handling formatted floats when the flag is enabled.
409-410
: LGTM! Clean integration with existing type resolution.The type resolution logic cleanly integrates the new
FormattedFloat
type whenretain_float_format
is enabled.
431-436
: LGTM! Proper handling of formatted float values.The implementation correctly stores the raw token string for formatted floats while preserving numeric value handling for timestamp ingestion.
1056-1076
: LGTM! Good addition of user-defined metadata support.The implementation properly checks for user-defined metadata in the IR stream and adds each key-value pair to the current range, with appropriate error handling.
components/core/tests/test-clp_s-search.cpp (3)
46-48
: LGTM! Good parameterization of test input paths.The new helper functions improve test maintainability by allowing different test input files to be specified dynamically.
122-128
: LGTM! Proper handling of empty expected results.The validation logic correctly handles the case where no results are expected, improving test coverage for negative test cases.
259-295
: LGTM! Comprehensive test coverage for formatted floats.The new test case provides thorough coverage of formatted float search functionality with various query types and edge cases. The test correctly uses the
retain_float_format
flag when compressing the test archive.components/core/src/clp_s/ColumnWriter.hpp (1)
102-118
: LGTM! Well-structured column writer for formatted floats.The
FormattedFloatColumnWriter
class follows the established pattern of other column writers and properly stores both the numeric values and their format information. The dual-vector approach (m_values and m_format) is appropriate for maintaining per-value formatting.components/core/src/clp_s/ColumnWriter.cpp (1)
53-63
: Inefficient string trimming algorithmThe current approach modifies the string in-place using
erase
andremove_if
, which requires shifting all subsequent characters. This is O(n²) in the worst case for a string with many characters to remove.Consider building a new string with only valid characters instead:
- // Trim the raw string - float_str.erase( - std::remove_if( - float_str.begin(), - float_str.end(), - [](char c) { - return !std::isdigit(static_cast<unsigned char>(c)) && '.' != c && '+' != c - && '-' != c && 'E' != c && 'e' != c; - } - ), - float_str.end() - ); + // Build a trimmed string with only valid characters + std::string trimmed_str; + trimmed_str.reserve(float_str.length()); + for (char c : float_str) { + if (std::isdigit(static_cast<unsigned char>(c)) || '.' == c || '+' == c + || '-' == c || 'E' == c || 'e' == c) { + trimmed_str.push_back(c); + } + } + float_str = std::move(trimmed_str);Likely an incorrect or invalid review comment.
components/core/src/clp_s/ColumnReader.cpp (1)
201-204
: Preferfalse == std::isdigit
pattern per coding guidelinesThe code uses
false == std::isdigit(sci_str[0])
which follows the project's coding guidelines for boolean comparisons.
b19a245
to
0090052
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
🔭 Outside diff range comments (1)
components/core/tests/test-clp_s-end_to_end.cpp (1)
110-142
: Avoid usingtr -d ' '
which removes all spaces from JSON valuesThe
literallyCompare
function usestr -d ' '
to remove all spaces, which could mask issues where JSON string values legitimately contain spaces. This approach might cause false positives in test validation.Consider using
jq
for proper JSON normalization instead:void literallyCompare( std::filesystem::path const& expected_output_json_path, std::filesystem::path const& extracted_json_path ) { + // Use jq to normalize JSON format while preserving string values auto command = fmt::format( - "cat {} | awk 'NF' | tr -d ' ' | sort > {}", + "jq --sort-keys --compact-output '.' {} | sort > {}", expected_output_json_path.string(), cTestEndToEndExpectedOutputSortedFile ); auto result = std::system(command.c_str()); REQUIRE((0 == result)); REQUIRE((false == std::filesystem::is_empty(cTestEndToEndExpectedOutputSortedFile))); command = fmt::format( - "cat {} | awk 'NF' | tr -d ' ' | sort > {}", + "jq --sort-keys --compact-output '.' {} | sort > {}", extracted_json_path.string(), cTestEndToEndOutputSortedJson ); result = std::system(command.c_str()); REQUIRE((0 == result)); REQUIRE((false == std::filesystem::is_empty(cTestEndToEndOutputSortedJson)));
♻️ Duplicate comments (3)
components/core/tests/test_log_files/test_rounded_formatted_float_rounded.jsonl (1)
28-35
: Fix inconsistent JSON formattingLines 28-35 use compact JSON formatting without spaces around colons, while the rest of the file uses spaced formatting. This inconsistency could affect parsing tests if the JSON parser is sensitive to whitespace variations.
Apply this diff to maintain consistent formatting:
-{ "scientificWithGreaterThanTenMantissa":1.234567891234567E11 } -{ "negativeScientificWithGreaterThanTenMantissa":-1.234567891234567E11 } -{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision":-1.234567891234568E11 } -{ "scientificWithGreaterThanTenMantissaExceedingPrecision":1.234567891234568E11 } +{ "scientificWithGreaterThanTenMantissa": 1.234567891234567E11 } +{ "negativeScientificWithGreaterThanTenMantissa": -1.234567891234567E11 } +{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision": -1.234567891234568E11 } +{ "scientificWithGreaterThanTenMantissaExceedingPrecision": 1.234567891234568E11 }-{ "scientificWithGreaterThanTenMantissa3":1.234567891234567E+11 } -{ "negativeScientificWithGreaterThanTenMantissa3":-1.234567891234567E+11 } +{ "scientificWithGreaterThanTenMantissa3": 1.234567891234567E+11 } +{ "negativeScientificWithGreaterThanTenMantissa3": -1.234567891234567E+11 }components/core/tests/test-clp_s-end_to_end.cpp (2)
184-193
: Consider validating compression return valueThe return value from
compress_archive
is ignored, but it likely contains important statistics about the compression operation that should be validated to ensure the test is meaningful.Consider checking the returned statistics:
- REQUIRE_NOTHROW( - std::ignore = compress_archive( + auto archive_stats = compress_archive( get_test_input_local_path(cTestEndToEndValidFormattedFloatInputFile), std::string{cTestEndToEndArchiveDirectory}, true, single_file_archive, structurize_arrays, clp_s::FileType::Json - ) ); + REQUIRE_FALSE(archive_stats.empty());
213-222
: Consider validating compression return valueSimilar to the previous test, the return value should be validated here as well.
- REQUIRE_NOTHROW( - std::ignore = compress_archive( + auto archive_stats = compress_archive( get_test_input_local_path(cTestEndToEndRoundedFormattedFloatInputFile), std::string{cTestEndToEndArchiveDirectory}, true, single_file_archive, structurize_arrays, clp_s::FileType::Json - ) ); + REQUIRE_FALSE(archive_stats.empty());
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (9)
components/core/src/clp_s/ColumnReader.cpp
(3 hunks)components/core/src/clp_s/ColumnReader.hpp
(1 hunks)components/core/src/clp_s/ColumnWriter.cpp
(1 hunks)components/core/src/clp_s/CommandLineArguments.hpp
(2 hunks)components/core/tests/test-clp_s-end_to_end.cpp
(4 hunks)components/core/tests/test_log_files/test_rounded_formatted_float_rounded.jsonl
(1 hunks)components/core/tests/test_log_files/test_valid_formatted_float.jsonl
(1 hunks)docs/src/dev-guide/design-retain-float-format.md
(1 hunks)docs/src/user-guide/core-clp-s.md
(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit Configuration File
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/src/clp_s/CommandLineArguments.hpp
components/core/src/clp_s/ColumnWriter.cpp
components/core/src/clp_s/ColumnReader.hpp
components/core/src/clp_s/ColumnReader.cpp
components/core/tests/test-clp_s-end_to_end.cpp
🧠 Learnings (10)
📚 Learning: 2025-08-13T15:25:40.805Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:49-127
Timestamp: 2025-08-13T15:25:40.805Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, the FormattedFloatColumnWriter::add_value method intentionally allows std::stod to throw exceptions (std::invalid_argument or std::out_of_range) without catching them, as this is designed to terminate decompression when encountering invalid float strings for data integrity purposes.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
components/core/src/clp_s/ColumnReader.hpp
components/core/src/clp_s/ColumnReader.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-12-19T01:40:13.272Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.hpp:65-75
Timestamp: 2024-12-19T01:40:13.272Z
Learning: "try_get_pos" method in `Compressor.hpp` does not throw exceptions, as clarified by Bill-hbrhbr.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-14T01:03:27.144Z
Learnt from: LinZhihao-723
PR: y-scope/clp#450
File: components/core/src/clp/streaming_archive/reader/Segment.cpp:57-58
Timestamp: 2024-11-14T01:03:27.144Z
Learning: In `components/core/src/clp/streaming_archive/reader/Segment.cpp`, the method `m_decompressor.open()` does not have a return value.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-29T22:51:00.355Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:141-142
Timestamp: 2024-11-29T22:51:00.355Z
Learning: In 'components/core/tests/test-clp_s-end_to_end.cpp', using `WIFEXITED` and `WEXITSTATUS` with the return value of `std::system` is acceptable for checking the exit status of system commands, as per the C++ documentation.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
🧬 Code Graph Analysis (3)
components/core/src/clp_s/ColumnWriter.cpp (3)
components/core/src/clp_s/ParsedMessage.hpp (4)
add_value
(35-37)add_value
(35-35)value
(64-66)value
(64-64)components/core/src/clp_s/ColumnWriter.hpp (18)
value
(29-29)value
(59-59)value
(76-76)value
(94-94)value
(111-111)value
(129-129)value
(156-156)value
(213-213)value
(231-231)compressor
(35-35)compressor
(61-61)compressor
(78-78)compressor
(96-96)compressor
(113-113)compressor
(131-131)compressor
(158-158)compressor
(215-215)compressor
(233-233)components/core/src/clp_s/SchemaWriter.cpp (2)
store
(29-33)store
(29-29)
components/core/src/clp_s/ColumnReader.hpp (1)
components/core/src/clp_s/Utils.hpp (2)
start
(223-225)start
(223-223)
components/core/src/clp_s/ColumnReader.cpp (1)
components/core/src/clp_s/Utils.hpp (4)
start
(223-225)start
(223-223)i
(217-221)i
(217-217)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/core-clp-s.md
181-181: Trailing spaces
Expected: 0 or 2; Actual: 1
(MD009, no-trailing-spaces)
🪛 LanguageTool
docs/src/dev-guide/design-retain-float-format.md
[uncategorized] ~29-~29: Loose punctuation mark.
Context: ...ponent is denoted by E
or e
. - 00
: Not scientific - 01
: Scientific notat...
(UNLIKELY_OPENING_PUNCTUATION)
[uncategorized] ~40-~40: Loose punctuation mark.
Context: ...whether the exponent has a sign: - 00
: No sign - 01
: +
- 10
: -
For ex...
(UNLIKELY_OPENING_PUNCTUATION)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
🔇 Additional comments (4)
components/core/tests/test_log_files/test_valid_formatted_float.jsonl (1)
1-41
: LGTM!The test data file provides comprehensive coverage of formatted float edge cases, including standard decimal notation, scientific notation with various exponent formats, negative values, and nested structures.
components/core/src/clp_s/CommandLineArguments.hpp (1)
105-105
: LGTM!The getter is properly marked with
[[nodiscard]]
to match other getters in the class, preventing accidental ignoring of the return value.docs/src/user-guide/core-clp-s.md (2)
29-30
: LGTM!The new
--retain-float-format
option is clearly documented with appropriate placement in the compression options list.
81-85
: LGTM!The tip provides a clear and concise explanation of the
--retain-float-format
feature with helpful examples.
components/core/tests/test_log_files/test_invalid_formatted_float.jsonl
Outdated
Show resolved
Hide resolved
…oat.jsonl Co-authored-by: Xiaochong(Eddy) Wei <[email protected]>
Co-authored-by: Xiaochong(Eddy) Wei <[email protected]>
LGTM, but I can't approve the PR created by me. @kirkrodrigues @LinZhihao-723 could either of you approve this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deffering to @anlowee review of my changes to his branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (15)
components/core/CMakeLists.txt (1)
362-363
: Avoid duplicate compilation/ODR: remove FloatFormatEncoding from unit test sources or stop linking the production targetThese sources appear to already be compiled into the production clp_s library; compiling them again into unitTest risks duplicate symbols/ODR errors. Keep them only in one place (preferred: in the production library, and remove here).
Apply this diff to drop the duplicate entries from SOURCE_FILES_clp_s_unitTest:
- src/clp_s/FloatFormatEncoding.cpp - src/clp_s/FloatFormatEncoding.hppIf, for any reason, they’re not in the production target, move them there instead and keep them out of unitTest’s sources.
components/core/tests/test_log_files/test_invalid_formatted_float.jsonl (3)
18-25
: Expand exponent normalisation coverage: add lowercase ‘e’ variants and explicit ±0 exponents.Mirror Lines 18–25 with lowercase ‘e’, and add explicit zero-exponent cases with leading zeros for both signs. This aligns with prior feedback and hardens parser paths that normalise exponent forms.
+{ "scientificPositiveExponentExceedingDigitLimitWithLeadingZeros_lowercase": 1.234567891234567e00008 } +{ "negativeScientificPositiveExponentExceedingDigitLimitWithLeadingZeros_lowercase": -1.234567891234567e00008 } +{ "scientificPositiveExponentExceedingDigitLimitWithLeadingZerosWithPlus_lowercase": 1.234567891234567e+00008 } +{ "negativeScientificPositiveExponentExceedingDigitLimitWithLeadingZerosWithPlus_lowercase": -1.234567891234567e+00008 } +{ "scientificNegativeExponentExceedingDigitLimitWithLeadingZeros_lowercase": 1.234567891234567e-00008 } +{ "negativeScientificNegativeExponentExceedingDigitLimitWithLeadingZeros_lowercase": -1.234567891234567e-00008 } + +{ "scientificExponentZeroWithLeadingZerosAndPlus": 1.23E+00000 } +{ "negativeScientificExponentZeroWithLeadingZerosAndPlus": -1.23E+00000 } +{ "scientificExponentZeroWithLeadingZerosAndMinus": 1.23E-00000 } +{ "negativeScientificExponentZeroWithLeadingZerosAndMinus": -1.23E-00000 } +{ "scientificExponentZeroWithLeadingZerosAndPlus_lowercase": 1.23e+00000 } +{ "negativeScientificExponentZeroWithLeadingZerosAndPlus_lowercase": -1.23e+00000 } +{ "scientificExponentZeroWithLeadingZerosAndMinus_lowercase": 1.23e-00000 } +{ "negativeScientificExponentZeroWithLeadingZerosAndMinus_lowercase": -1.23e-00000 }
34-39
: Rename “ZeroMantissa” keys for accuracy (mantissa is unnormalised < 1).These aren’t zero mantissae; they’re < 1 and unnormalised. Improves readability and intent.
-{ "nonZeroScientificWithZeroMantissa": 0.00000001234567891234567E-3 } -{ "negativeNonZeroScientificWithZeroMantissa": -0.00000001234567891234567E-3 } -{ "nonZeroScientificWithZeroMantissaExceedingPrecision": 0.0000000123456789123456789E-3 } -{ "negativeNonZeroScientificWithZeroMantissaExceedingPrecision": -0.0000000123456789123456789E-3 } -{ "nonZeroScientificWithZeroMantissa2": 0.00000001234567891234567E-01 } -{ "negativeNonZeroScientificWithZeroMantissa2": -0.00000001234567891234567E-01 } +{ "unnormalisedMantissaLessThanOne": 0.00000001234567891234567E-3 } +{ "negativeUnnormalisedMantissaLessThanOne": -0.00000001234567891234567E-3 } +{ "unnormalisedMantissaLessThanOneExceedingPrecision": 0.0000000123456789123456789E-3 } +{ "negativeUnnormalisedMantissaLessThanOneExceedingPrecision": -0.0000000123456789123456789E-3 } +{ "unnormalisedMantissaLessThanOne2": 0.00000001234567891234567E-01 } +{ "negativeUnnormalisedMantissaLessThanOne2": -0.00000001234567891234567E-01 }
40-43
: Clarify intent for “non754ScientificFloat” and assert the specific failure path.If these are meant to be “cannot encode into 8-byte variable form → must serialise as string”, reflect that in names and ensure tests assert that rejection reason (not a generic parse failure).
-{ "non754ScientificFloat": 1.2345678912345679e+13} -{ "negativeNon754ScientificFloat": -1.2345678912345679e+13} -{ "non754ScientificFloatWithNegativeExponent": 1.2345678912345679e-13} -{ "negativeNon754ScientificFloatWithNegativeExponent": -1.2345678912345679e-13} +{ "notEncodableToEightByteVarFloat": 1.2345678912345679e+13 } +{ "negativeNotEncodableToEightByteVarFloat": -1.2345678912345679e+13 } +{ "notEncodableToEightByteVarFloatWithNegativeExponent": 1.2345678912345679e-13 } +{ "negativeNotEncodableToEightByteVarFloatWithNegativeExponent": -1.2345678912345679e-13 }Action: confirm the e2e test that consumes this fixture checks for the encoder’s specific “fallback-to-string” path or error code, not a broad “invalid float” outcome.
components/core/tests/test-clp_s-search.cpp (8)
237-244
: Name boolean/flag args in compress_archive for readability.Inline names prevent argument-order mistakes and match style elsewhere.
REQUIRE_NOTHROW( std::ignore = compress_archive( get_test_input_local_path(cTestSearchInputFile), std::string{cTestSearchArchiveDirectory}, std::string{cTestIdxKey}, - false, - single_file_archive, - structurize_arrays, + /*retain_float_format=*/false, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/structurize_arrays, clp_s::FileType::Json ) );
48-50
: Avoid duplicating helpers; add const to params.These forward declarations duplicate helpers already available; also prefer const-correctness.
Apply locally (and consider centralising in clp_s_test_utils.hpp/.cpp):
-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path) +auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path) -> std::filesystem::path; -auto get_test_input_local_path(std::string_view test_input_path) -> std::string; +auto get_test_input_local_path(std::string_view const test_input_path) -> std::string;
64-67
: Const-correctness (and consider centralising).Match other tests by making the parameter const; prefer moving these helpers to the shared test utils to avoid drift.
-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path) +auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path) -> std::filesystem::path { return std::filesystem::path{cTestInputFileDirectory} / test_input_path; }
69-73
: Const-correctness (and consider centralising).Same here; align with the pattern used in other suites.
-auto get_test_input_local_path(std::string_view test_input_path) -> std::string { +auto get_test_input_local_path(std::string_view const test_input_path) -> std::string { std::filesystem::path const current_file_path{__FILE__}; auto const tests_dir{current_file_path.parent_path()}; return (tests_dir / get_test_input_path_relative_to_tests_dir(test_input_path)).string(); }
257-268
: Extend coverage: add -0.0 and uppercase-exponent numeric cases.Unquoted literals are numeric; add negative zero and 1E-16 to harden semantics.
std::vector<std::pair<std::string, std::vector<int64_t>>> queries_and_results{ {R"aa(NOT formattedFloatValue: 0)aa", {0, 1, 2, 6, 7, 8, 9, 10, 11, 12}}, {R"aa(formattedFloatValue: 0)aa", {3, 4, 5}}, {R"aa(formattedFloatValue: 1e-16)aa", {6, 7}}, + {R"aa(formattedFloatValue: 1E-16)aa", {6, 7}}, + {R"aa(formattedFloatValue: -0.0)aa", {3, 4, 5}}, {R"aa(formattedFloatValue > 0.00)aa", {6, 7, 8, 9, 10, 11, 12}},
274-282
: Name boolean/flag args (formatted-float case) for clarity.Keep calls self-documenting.
std::ignore = compress_archive( get_test_input_local_path(cTestSearchFormattedFloatFile), std::string{cTestSearchArchiveDirectory}, std::nullopt, - true, - single_file_archive, - false, + /*retain_float_format=*/true, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/false, clp_s::FileType::Json )
285-288
: Capture generated params to aid triage on failures.Include single_file_archive in the failure context.
for (auto const& [query, expected_results] : queries_and_results) { - CAPTURE(query); + CAPTURE(query); + CAPTURE(single_file_archive); REQUIRE_NOTHROW(search(query, false, expected_results)); }
271-277
: Parametrisation: isolate archive output per run to avoid collisions.Parallel/parameterised runs can interfere when sharing one directory. Use a unique archive_dir per combo and pass it through.
- TestOutputCleaner const test_cleanup{{std::string{cTestSearchArchiveDirectory}}}; + auto const archive_dir = fmt::format("{}-ff-f{}", cTestSearchArchiveDirectory, single_file_archive); + TestOutputCleaner const test_cleanup{{archive_dir}};And below:
- std::string{cTestSearchArchiveDirectory}, + archive_dir,Note: parameterise search() to consume archive_dir instead of the global constant if you take this.
components/core/src/clp_s/ColumnWriter.cpp (3)
52-57
: Strong exception-safety for parallel vectors.If the second push throws, vectors can diverge. Make the operation atomic.
size_t FormattedFloatColumnWriter::add_value(ParsedMessage::variable_t& value) { auto const& [float_value, format]{std::get<std::pair<double, float_format_t>>(value)}; - m_values.push_back(float_value); - m_formats.push_back(format); + m_values.emplace_back(float_value); + try { + m_formats.emplace_back(format); + } catch (...) { + m_values.pop_back(); + throw; + } return sizeof(double) + sizeof(float_format_t); }
67-72
: Match VariableString writer: consider reservation/atomic push.Mirror the exception-safety approach (reserve or rollback on failure) to keep IDs in sync if an allocation throws.
- m_var_dict->add_entry(std::get<std::string>(value), id); - m_var_dict_ids.push_back(id); + m_var_dict->add_entry(std::get<std::string>(value), id); + try { + m_var_dict_ids.emplace_back(id); + } catch (...) { + // add_entry mutates dict but cannot be rolled back here; consider reserving capacity up front. + throw; + }(Optional: reserve expected capacity in the constructor when row count is known.)
59-65
: Add a runtime size check in release builds.Assert is compiled out; guard before writing to prevent mismatched arrays.
void FormattedFloatColumnWriter::store(ZstdCompressor& compressor) { - assert(m_formats.size() == m_values.size()); + assert(m_formats.size() == m_values.size()); + if (m_formats.size() != m_values.size()) { + throw OperationFailed(ErrorCode_Failure, __FILE__, __LINE__); + } auto const values_size = m_values.size() * sizeof(double); auto const format_size = m_formats.size() * sizeof(float_format_t); compressor.write(reinterpret_cast<char const*>(m_values.data()), values_size); compressor.write(reinterpret_cast<char const*>(m_formats.data()), format_size); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
components/core/CMakeLists.txt
(2 hunks)components/core/src/clp_s/ColumnWriter.cpp
(2 hunks)components/core/tests/test-clp_s-search.cpp
(5 hunks)components/core/tests/test_log_files/test_invalid_formatted_float.jsonl
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/src/clp_s/ColumnWriter.cpp
components/core/tests/test-clp_s-search.cpp
🧠 Learnings (15)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/CMakeLists.txt
components/core/tests/test-clp_s-search.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2025-08-13T15:25:40.805Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:49-127
Timestamp: 2025-08-13T15:25:40.805Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, the FormattedFloatColumnWriter::add_value method intentionally allows std::stod to throw exceptions (std::invalid_argument or std::out_of_range) without catching them, as this is designed to terminate decompression when encountering invalid float strings for data integrity purposes.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-12-19T01:40:13.272Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.hpp:65-75
Timestamp: 2024-12-19T01:40:13.272Z
Learning: "try_get_pos" method in `Compressor.hpp` does not throw exceptions, as clarified by Bill-hbrhbr.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2025-08-06T23:57:49.762Z
Learnt from: gibber9809
PR: y-scope/clp#1163
File: components/core/src/clp_s/ColumnWriter.hpp:120-122
Timestamp: 2025-08-06T23:57:49.762Z
Learning: In the CLP codebase, the `encoded_log_dict_id_t` type alias in `components/core/src/clp_s/ColumnWriter.hpp` is appropriately placed as a class-scoped alias because the ColumnWriter.hpp header is already included in all places where this type is used, eliminating the need to move it to a common header.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2025-01-16T16:48:30.016Z
Learnt from: LinZhihao-723
PR: y-scope/clp#653
File: components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.cpp:612-613
Timestamp: 2025-01-16T16:48:30.016Z
Learning: When using structured bindings with values that will be moved later (e.g., into a constructor), prefer using references (`auto&`) to avoid unnecessary copies, even when binding to the result of a temporary object that lives until the end of the full expression.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-13T09:29:56.553Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1283-1286
Timestamp: 2024-10-13T09:29:56.553Z
Learning: `clp::ffi::SchemaTree::Node::id_t` is of type `uint32_t`, so adding `1` to `INT32_MAX` is safe and does not cause integer overflow.
Applied to files:
components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/core/tests/test-clp_s-search.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.
Applied to files:
components/core/tests/test-clp_s-search.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/ColumnWriter.cpp (3)
components/core/src/clp_s/ParsedMessage.hpp (6)
add_value
(41-43)add_value
(41-41)value
(80-82)value
(80-80)value
(90-92)value
(90-90)components/core/src/clp_s/ColumnWriter.hpp (22)
value
(29-29)value
(59-59)value
(76-76)value
(94-94)value
(111-111)value
(131-131)value
(149-149)value
(176-176)value
(233-233)value
(251-251)compressor
(35-35)compressor
(61-61)compressor
(78-78)compressor
(96-96)compressor
(113-113)compressor
(133-133)compressor
(151-151)compressor
(178-178)compressor
(235-235)compressor
(253-253)id
(205-208)id
(206-206)components/core/src/clp_s/ArchiveReader.hpp (1)
m_var_dict
(105-105)
components/core/tests/test-clp_s-search.cpp (2)
components/core/tests/test-clp_s-end_to_end.cpp (6)
get_test_input_path_relative_to_tests_dir
(41-42)get_test_input_path_relative_to_tests_dir
(52-55)get_test_input_path_relative_to_tests_dir
(52-53)get_test_input_local_path
(43-43)get_test_input_local_path
(57-61)get_test_input_local_path
(57-57)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(14-64)compress_archive
(14-22)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (7)
components/core/CMakeLists.txt (1)
703-703
: LGTM: add test-FloatFormatEncoding to unit testsTest file addition looks good and fits existing Catch2 wiring.
components/core/tests/test_log_files/test_invalid_formatted_float.jsonl (2)
30-31
: Rename without leading hyphen looks good.Thanks for removing the leading dash; this avoids jq key-escaping hassles.
44-44
: Verify end-of-file has no stray literal.The diff shows a lone “44”. Please confirm the fixture ends with a newline and no extra token.
components/core/tests/test-clp_s-search.cpp (2)
44-44
: LGTM: Test fixture constant added.The per-file fixture constant is clear and consistent.
215-219
: LGTM: Using per-file path helper in assertion.The filename filter now references the parameterised path; matches expectations.
components/core/src/clp_s/ColumnWriter.cpp (2)
3-5
: LGTM: Missing headers added.Including , , makes the TU self-contained.
74-77
: LGTM: Store path mirrors VariableString writer.Size computation and write are correct.
{ "scientificWithGreaterThanTenMantissa": 123456789.1234567E3 } | ||
{ "negativeScientificWithGreaterThanTenMantissa": -123456789.1234567E3 } | ||
{ "scientificWithGreaterThanTenMantissaExceedingPrecision": 123456789.123456789E3 } | ||
{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision": -123456789.123456789E3 } | ||
{ "scientificWithGreaterThanTenMantissa2": 12.34567891234567E01 } | ||
{ "negativeScientificWithGreaterThanTenMantissa2": -12.34567891234567E01 } | ||
{ "scientificWithGreaterThanTenMantissa3": 123456789.1234567E+3 } | ||
{ "negativeScientificWithGreaterThanTenMantissa3": -123456789.1234567E+3 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Add mantissa boundary cases at exactly 10.0 to exercise canonicalisation.
Ensure rejection/normalisation when mantissa == 10 (often canonicalised to 1.x × 10^(e+1)).
+{ "scientificMantissaBoundaryEqualToTen": 10E0 }
+{ "negativeScientificMantissaBoundaryEqualToTen": -10E0 }
+{ "scientificMantissaBoundaryEqualToTenWithFraction": 10.000E+0 }
+{ "negativeScientificMantissaBoundaryEqualToTenWithFraction": -10.000E+0 }
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
{ "scientificWithGreaterThanTenMantissa": 123456789.1234567E3 } | |
{ "negativeScientificWithGreaterThanTenMantissa": -123456789.1234567E3 } | |
{ "scientificWithGreaterThanTenMantissaExceedingPrecision": 123456789.123456789E3 } | |
{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision": -123456789.123456789E3 } | |
{ "scientificWithGreaterThanTenMantissa2": 12.34567891234567E01 } | |
{ "negativeScientificWithGreaterThanTenMantissa2": -12.34567891234567E01 } | |
{ "scientificWithGreaterThanTenMantissa3": 123456789.1234567E+3 } | |
{ "negativeScientificWithGreaterThanTenMantissa3": -123456789.1234567E+3 } | |
{ "scientificWithGreaterThanTenMantissa": 123456789.1234567E3 } | |
{ "negativeScientificWithGreaterThanTenMantissa": -123456789.1234567E3 } | |
{ "scientificWithGreaterThanTenMantissaExceedingPrecision": 123456789.123456789E3 } | |
{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision": -123456789.123456789E3 } | |
{ "scientificWithGreaterThanTenMantissa2": 12.34567891234567E01 } | |
{ "negativeScientificWithGreaterThanTenMantissa2": -12.34567891234567E01 } | |
{ "scientificWithGreaterThanTenMantissa3": 123456789.1234567E+3 } | |
{ "negativeScientificWithGreaterThanTenMantissa3": -123456789.1234567E+3 } | |
{ "scientificMantissaBoundaryEqualToTen": 10E0 } | |
{ "negativeScientificMantissaBoundaryEqualToTen": -10E0 } | |
{ "scientificMantissaBoundaryEqualToTenWithFraction": 10.000E+0 } | |
{ "negativeScientificMantissaBoundaryEqualToTenWithFraction": -10.000E+0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
components/core/src/clp_s/SingleFileArchiveDefs.hpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/src/clp_s/SingleFileArchiveDefs.hpp
🧠 Learnings (1)
📚 Learning: 2025-01-28T03:02:30.542Z
Learnt from: davemarco
PR: y-scope/clp#698
File: components/core/src/clp/streaming_archive/Constants.hpp:9-9
Timestamp: 2025-01-28T03:02:30.542Z
Learning: In the CLP project's archive version format (semver), the patch version uses 16 bits (uint16_t), while major version uses 8 bits and minor version uses 16 bits.
Applied to files:
components/core/src/clp_s/SingleFileArchiveDefs.hpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
🔇 Additional comments (1)
components/core/src/clp_s/SingleFileArchiveDefs.hpp (1)
14-14
: Confirm archive‑compatibility: is a patch bump sufficient?
- Version constants: 0.4.1 — components/core/src/clp_s/SingleFileArchiveDefs.hpp.
- ArchiveWriter packs header.version as (major<<24)|(minor<<16)|patch — components/core/src/clp_s/ArchiveWriter.cpp.
- FormattedFloat / DictionaryFloat are only emitted when retain_float_format is enabled (components/core/src/clp_s/JsonParser.cpp); writers/readers for these types are present (ArchiveWriter.cpp, ArchiveReader.cpp, ColumnWriter/Reader, SchemaReader/SchemaTree).
- I did not find reader-side logic that explicitly validates ArchiveHeader.version and rejects newer archive formats.
Action: confirm whether released/older readers validate the archive header and will refuse archives containing unknown NodeType. If they do not, this is a backward‑incompatible format change and requires at least a minor version bump; if they do, a patch bump is acceptable.
constexpr uint8_t cArchiveMajorVersion = 0; | ||
constexpr uint8_t cArchiveMinorVersion = 4; | ||
constexpr uint16_t cArchivePatchVersion = 0; | ||
constexpr uint16_t cArchivePatchVersion = 1; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Document and enforce the version bit layout; add a packed constant.
This header stores version
as uint32_t
. Please document the 8/8/16 layout (major/minor/patch) and provide a single packed constant to keep writer/reader logic in sync. Also, confirm the intentional difference from the streaming archive’s learnt convention (which reportedly uses a 16‑bit minor); if that applies here, adjust accordingly.
Apply:
constexpr uint8_t cArchiveMajorVersion = 0;
-constexpr uint8_t cArchiveMinorVersion = 4;
-constexpr uint16_t cArchivePatchVersion = 1;
+constexpr uint8_t cArchiveMinorVersion = 4;
+constexpr uint16_t cArchivePatchVersion = 1;
+// Version packing: [8 bits major][8 bits minor][16 bits patch] -> uint32_t
+// Keep this in sync with reader/writer checks.
+constexpr uint32_t cArchiveVersionPacked =
+ (static_cast<uint32_t>(cArchiveMajorVersion) << 24) |
+ (static_cast<uint32_t>(cArchiveMinorVersion) << 16) |
+ static_cast<uint32_t>(cArchivePatchVersion);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
components/core/CMakeLists.txt (1)
360-361
: Prevent ODR/duplicate‑symbol errors: don’t compile FloatFormatEncoding into unitTest if already part of a linked clp_s target.These sources appear to be compiled into production (per prior review); compiling them again into unitTest risks duplicate symbols. Prefer linking the production target that owns FloatFormatEncoding and remove these from SOURCE_FILES_clp_s_unitTest.
Apply this diff:
set(SOURCE_FILES_clp_s_unitTest @@ - src/clp_s/FloatFormatEncoding.cpp - src/clp_s/FloatFormatEncoding.hppRun to verify ownership and linkages:
#!/bin/bash set -euo pipefail echo "Occurrences of FloatFormatEncoding in CMake:" rg -n --type cmake -C2 'FloatFormatEncoding\.(cpp|hpp)' components/core echo -e "\nProduction target(s) that compile it:" rg -nPU '(?s)add_(library|executable)\([^)]*\).*?FloatFormatEncoding\.cpp' components/core/src/clp_s/CMakeLists.txt || true echo -e "\nunitTest link interface (confirm we link the owning target instead of compiling the TU again):" sed -n '730,774p' components/core/CMakeLists.txt
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/CMakeLists.txt
(2 hunks)components/core/src/clp_s/CMakeLists.txt
(4 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/CMakeLists.txt
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2025-08-25T07:27:19.449Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1262
File: docs/src/dev-guide/components-core/index.md:50-50
Timestamp: 2025-08-25T07:27:19.449Z
Learning: In the CLP project, nlohmann_json version 3.11.3 is installed through their dependency management process, but they don't enforce version constraints in CMake find_package calls - they control versions through installation rather than CMake version pinning.
Applied to files:
components/core/src/clp_s/CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
🔇 Additional comments (3)
components/core/CMakeLists.txt (1)
705-705
: LGTM: new FloatFormatEncoding test is wired into unitTest.Once the duplicate‑compilation above is resolved, ensure unitTest links the production target that provides FloatFormatEncoding so this test still builds.
components/core/src/clp_s/CMakeLists.txt (2)
273-273
: Keep ystdlib::error_handling as PUBLICFloatFormatEncoding.hpp publicly includes <ystdlib/error_handling/Result.hpp> and exposes ystdlib::error_handling::Result in its API (components/core/src/clp_s/FloatFormatEncoding.hpp:9,54,57), so clp_s_archive_writer must retain PUBLIC linkage.
Likely an incorrect or invalid review comment.
337-337
: Keep ystdlib::error_handling PUBLIC — it's part of the reader's public APIcomponents/core/src/clp_s/FloatFormatEncoding.hpp includes <ystdlib/error_handling/Result.hpp> and exposes ystdlib::error_handling::Result in public signatures (components/core/src/clp_s/FloatFormatEncoding.hpp: lines 9, 54–57); do not change to PRIVATE.
Likely an incorrect or invalid review comment.
FloatFormatEncoding.cpp | ||
FloatFormatEncoding.hpp | ||
JsonFileIterator.cpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Adding FloatFormatEncoding to writer sources: SGTM
Good to see the encoding module compiled into the writer.
If this module will be shared broadly (writer, reader, json constructor, etc.), consider extracting it into a tiny common target (e.g., clp_s::float_format) and linking both writer/reader to it to avoid repeated compilation. Not urgent given existing project patterns.
🤖 Prompt for AI Agents
In components/core/src/clp_s/CMakeLists.txt around lines 229-231,
FloatFormatEncoding sources were added directly to the writer target; to avoid
duplicate compilation when this module is used by multiple targets, create a
small common library target (e.g., clp_s::float_format) that contains
FloatFormatEncoding.cpp/.hpp and any related sources/headers, set its include
directories and target properties, then remove those files from the writer
sources and link both the writer and any other consumers (reader, json
constructor) against clp_s::float_format using target_link_libraries so they
share a single compiled implementation.
FloatFormatEncoding.cpp | ||
FloatFormatEncoding.hpp | ||
JsonSerializer.hpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Adding FloatFormatEncoding to reader sources: SGTM
Mirrors the writer change; necessary for decode/restore path.
Same optional note as above: consider a small shared target if this module expands or is reused elsewhere.
🤖 Prompt for AI Agents
In components/core/src/clp_s/CMakeLists.txt around lines 299 to 301, the
FloatFormatEncoding source files are missing from the reader target; add
FloatFormatEncoding.cpp and FloatFormatEncoding.hpp to the reader sources (and
keep JsonSerializer.hpp) so the decode/restore path builds correctly, and
consider refactoring into a small shared/static library target if this module
will be reused to avoid duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
components/core/src/clp_s/CommandLineArguments.cpp (1)
886-889
: Expose the new flag in compression usage.Add a succinct line so users see the option without scrolling the full list.
Apply:
void CommandLineArguments::print_compression_usage() const { std::cerr << "Usage: " << m_program_name << " c [OPTIONS] ARCHIVES_DIR [FILE/DIR ...]" << std::endl; + std::cerr << " --retain-float-format Preserve original JSON float formatting" + << std::endl; }components/core/tests/test-clp_s-search.cpp (1)
160-162
: Fix designated initializers for clp_s::Path (use ‘=’, not brace-init)
.source{…}
/.path{…}
is invalid; use C++20 designated initializers.- auto archive_path = clp_s::Path{ - .source{clp_s::InputSource::Filesystem}, - .path{entry.path().string()} - }; + auto archive_path = clp_s::Path{ + .source = clp_s::InputSource::Filesystem, + .path = entry.path().string() + };components/core/tests/test-clp_s-end_to_end.cpp (1)
110-113
: Fix designated initializers for clp_s::Path in extract()Use
=
in designated initializers; brace‑init is invalid and will fail to compile.- constructor_option.archive_path = clp_s::Path{ - .source{clp_s::InputSource::Filesystem}, - .path{entry.path().string()} - }; + constructor_option.archive_path = clp_s::Path{ + .source = clp_s::InputSource::Filesystem, + .path = entry.path().string() + };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (12)
components/core/src/clp_s/CMakeLists.txt
(4 hunks)components/core/src/clp_s/CommandLineArguments.cpp
(1 hunks)components/core/src/clp_s/CommandLineArguments.hpp
(2 hunks)components/core/src/clp_s/JsonParser.cpp
(7 hunks)components/core/src/clp_s/JsonParser.hpp
(2 hunks)components/core/src/clp_s/clp-s.cpp
(1 hunks)components/core/tests/clp_s_test_utils.cpp
(2 hunks)components/core/tests/clp_s_test_utils.hpp
(2 hunks)components/core/tests/test-clp_s-delta-encode-log-order.cpp
(1 hunks)components/core/tests/test-clp_s-end_to_end.cpp
(5 hunks)components/core/tests/test-clp_s-range_index.cpp
(1 hunks)components/core/tests/test-clp_s-search.cpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/src/clp_s/clp-s.cpp
components/core/tests/clp_s_test_utils.hpp
components/core/src/clp_s/CommandLineArguments.hpp
components/core/src/clp_s/CommandLineArguments.cpp
components/core/tests/clp_s_test_utils.cpp
components/core/src/clp_s/JsonParser.hpp
components/core/src/clp_s/JsonParser.cpp
components/core/tests/test-clp_s-search.cpp
components/core/tests/test-clp_s-delta-encode-log-order.cpp
components/core/tests/test-clp_s-range_index.cpp
components/core/tests/test-clp_s-end_to_end.cpp
🧠 Learnings (16)
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/clp_s_test_utils.hpp
components/core/tests/clp_s_test_utils.cpp
components/core/tests/test-clp_s-search.cpp
components/core/tests/test-clp_s-delta-encode-log-order.cpp
components/core/tests/test-clp_s-range_index.cpp
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-08-25T07:27:19.449Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1262
File: docs/src/dev-guide/components-core/index.md:50-50
Timestamp: 2025-08-25T07:27:19.449Z
Learning: In the CLP project, nlohmann_json version 3.11.3 is installed through their dependency management process, but they don't enforce version constraints in CMake find_package calls - they control versions through installation rather than CMake version pinning.
Applied to files:
components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.
Applied to files:
components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/core/tests/test-clp_s-search.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.
Applied to files:
components/core/tests/test-clp_s-search.cpp
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
Applied to files:
components/core/tests/test-clp_s-delta-encode-log-order.cpp
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
Applied to files:
components/core/tests/test-clp_s-delta-encode-log-order.cpp
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-14T01:03:27.144Z
Learnt from: LinZhihao-723
PR: y-scope/clp#450
File: components/core/src/clp/streaming_archive/reader/Segment.cpp:57-58
Timestamp: 2024-11-14T01:03:27.144Z
Learning: In `components/core/src/clp/streaming_archive/reader/Segment.cpp`, the method `m_decompressor.open()` does not have a return value.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-29T22:51:00.355Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:141-142
Timestamp: 2024-11-29T22:51:00.355Z
Learning: In 'components/core/tests/test-clp_s-end_to_end.cpp', using `WIFEXITED` and `WEXITSTATUS` with the return value of `std::system` is acceptable for checking the exit status of system commands, as per the C++ documentation.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
🧬 Code graph analysis (4)
components/core/src/clp_s/clp-s.cpp (1)
components/core/src/clp_s/ArchiveWriter.hpp (1)
option
(146-146)
components/core/src/clp_s/JsonParser.cpp (6)
components/core/src/clp_s/ParsedMessage.hpp (10)
value
(80-82)value
(80-80)value
(90-92)value
(90-90)node_id
(45-47)node_id
(45-45)node_id
(55-57)node_id
(55-55)node_id
(65-67)node_id
(65-65)components/core/src/clp_s/FloatFormatEncoding.cpp (4)
restore_encoded_float
(263-328)restore_encoded_float
(263-264)get_float_encoding
(136-261)get_float_encoding
(136-137)components/core/src/clp_s/JsonParser.hpp (5)
JsonParser
(59-59)JsonParser
(62-62)line
(112-112)line
(195-195)line
(202-202)components/core/src/clp_s/ArchiveWriter.hpp (3)
option
(146-146)parent_node_id
(171-171)parent_node_id
(179-179)components/core/src/clp_s/ColumnReader.hpp (4)
FormattedFloat
(164-164)DictionaryFloat
(200-200)Float
(141-141)Integer
(85-85)components/core/src/clp_s/SchemaTree.hpp (1)
parent_node_id
(132-132)
components/core/tests/test-clp_s-search.cpp (2)
components/core/tests/test-clp_s-end_to_end.cpp (6)
get_test_input_path_relative_to_tests_dir
(41-42)get_test_input_path_relative_to_tests_dir
(52-55)get_test_input_path_relative_to_tests_dir
(52-53)get_test_input_local_path
(43-43)get_test_input_local_path
(57-61)get_test_input_local_path
(57-57)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(14-55)compress_archive
(14-21)
components/core/tests/test-clp_s-end_to_end.cpp (5)
components/core/tests/test-clp_s-search.cpp (6)
get_test_input_path_relative_to_tests_dir
(48-49)get_test_input_path_relative_to_tests_dir
(64-67)get_test_input_path_relative_to_tests_dir
(64-65)get_test_input_local_path
(50-50)get_test_input_local_path
(69-73)get_test_input_local_path
(69-69)components/core/src/clp_s/SchemaTree.hpp (2)
child_id
(113-113)child_id
(113-113)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(14-55)compress_archive
(14-21)components/core/tests/clp_s_test_utils.hpp (1)
compress_archive
(24-31)components/core/src/clp_s/ColumnReader.hpp (2)
FormattedFloat
(164-164)DictionaryFloat
(200-200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (ubuntu-24.04)
- GitHub Check: lint-check (macos-15)
🔇 Additional comments (30)
components/core/src/clp_s/CMakeLists.txt (4)
235-236
: Avoid duplicate compilation: extract FloatFormatEncoding into a small shared target.These sources are added here and again to the reader below. Create a tiny library (e.g., clp_s::float_format) and link both writer/reader to it to cut rebuild time and keep deps centralised.
Apply this diff in this block:
- FloatFormatEncoding.cpp - FloatFormatEncoding.hppThen add a small target (outside this block):
add_library(clp_s_float_format STATIC FloatFormatEncoding.cpp ) add_library(clp_s::float_format ALIAS clp_s_float_format) target_compile_features(clp_s_float_format PRIVATE cxx_std_20) target_include_directories(clp_s_float_format PUBLIC ../) target_link_libraries(clp_s_float_format PUBLIC ystdlib::error_handling) # Link it target_link_libraries(clp_s_archive_writer PRIVATE clp_s::float_format) target_link_libraries(clp_s_archive_reader PRIVATE clp_s::float_format)
305-306
: Deduplicate FloatFormatEncoding build by sharing a library.Same as writer: move these into clp_s::float_format and link reader to it.
Apply this diff in this block:
- FloatFormatEncoding.cpp - FloatFormatEncoding.hpp(Linking instructions provided in the earlier comment.)
279-279
: Keep ystdlib::error_handling PUBLIC — it’s used in clp_s public headersFloatFormatEncoding.hpp (includes Result) and kv_ir_search.hpp (includes ErrorCode, Result) expose ystdlib::error_handling types in the public API, so changing the link to PRIVATE is incorrect. Keep PUBLIC here; if retained, remove the redundant ystdlib::error_handling entry from the clp-s executable link list (around line 443).
Likely an incorrect or invalid review comment.
343-343
: Make ystdlib::error_handling PRIVATE unless clp_s public headers require itRepository scans produced no evidence that clp_s public headers expose ystdlib (rg returned no matches; a subsequent rg attempt errored on the 'hpp' file type). Change PUBLIC → PRIVATE in the target_link_libraries unless a public header includes or exposes ystdlib symbols/macros.
File: components/core/src/clp_s/CMakeLists.txt (around line 343)
- PUBLIC + PRIVATE absl::flat_hash_map clp::string_utils clp_s::io msgpack-cxx nlohmann_json::nlohmann_json - ystdlib::error_handling + ystdlib::error_handlingQuick verification command to run locally:
rg -n -C2 -e '\bystdlib\b' -e '#\sinclude\s<ystdlib' -e 'error_handling' -e '\bY_THROW\b' -e '\bY_EXPECT\b' components/core/src/clp_s/components/core/src/clp_s/CommandLineArguments.hpp (2)
105-105
: Good: accessor is [[nodiscard]].Consistent with nearby getters. No further action.
184-184
: Good: default-off flag.Member is explicitly initialised to false; safe default.
components/core/src/clp_s/CommandLineArguments.cpp (1)
247-251
: CLI flag wiring — LGTM.Option is correctly bound via bool_switch to m_retain_float_format.
components/core/tests/clp_s_test_utils.cpp (1)
18-21
: Plumbing looks correct.Param added and propagated to parser_option.retain_float_format; matches header.
Also applies to: 41-42
components/core/src/clp_s/clp-s.cpp (1)
104-104
: Log the flag and warn for KV‑IR inputs.Improve UX by logging the effective setting and warning when it won’t apply to IR inputs.
Apply:
option.print_archive_stats = command_line_arguments.print_archive_stats(); option.retain_float_format = command_line_arguments.get_retain_float_format(); + SPDLOG_DEBUG("retain_float_format={}", option.retain_float_format); + if (option.retain_float_format) { + bool has_ir_input{false}; + for (auto const& p : option.input_paths) { + if (std::string::npos != p.path.find(clp::ir::cIrFileExtension)) { + has_ir_input = true; + break; + } + } + if (has_ir_input) { + SPDLOG_WARN("--retain-float-format currently has no effect for KV-IR inputs."); + } + }components/core/src/clp_s/JsonParser.hpp (1)
44-46
: Add brief docs for retain_float_format.Clarify scope/impact; mirrors public CLI knob.
Apply:
struct JsonParserOption { @@ - bool retain_float_format{false}; + // When true, preserve original JSON float text formatting (e.g., decimal/scientific, precision). + // No effect for KV-IR inputs; may increase CPU/memory during parse/serialize. + bool retain_float_format{false}; @@ - bool m_retain_float_format{false}; + // Mirrors option.retain_float_format. + bool m_retain_float_format{false};Also applies to: 244-245
components/core/tests/test-clp_s-delta-encode-log-order.cpp (1)
73-76
: Name boolean args for readability.Inline names prevent argument-order slips.
Apply:
- std::nullopt, - false, - true, - false + std::nullopt, + /*retain_float_format=*/false, + /*single_file_archive=*/true, + /*structurize_arrays=*/falsecomponents/core/tests/test-clp_s-range_index.cpp (1)
227-230
: Name boolean args for readability.Annotate booleans at the call site.
Apply:
- std::nullopt, - false, - single_file_archive, - false + std::nullopt, + /*retain_float_format=*/false, + single_file_archive, + /*structurize_arrays=*/falsecomponents/core/tests/clp_s_test_utils.hpp (1)
28-31
: Signature change — verify all call sites updatedAutomated scan located compress_archive usages but the verification script aborted; manually confirm each call supplies the three new bool params (retain_float_format, single_file_archive, structurize_arrays) in the correct order.
Example hits (verify these call-sites):
- components/core/tests/test-clp_s-search.cpp:236 — args_count=6; trailing args: false, single_file_archive, structurize_arrays.
- components/core/tests/test-clp_s-search.cpp:273 — args_count=7; trailing args: true, single_file_archive, false, clp_s::FileType::Json.
Also check: components/core/tests/test-clp_s-range_index.cpp, components/core/tests/test-clp_s-delta-encode-log-order.cpp, components/core/tests/test-clp_s-end_to_end.cpp, components/core/tests/clp_s_test_utils.cpp.
If any call is missing or has reordered boolean arguments, update the call to match the signature in components/core/tests/clp_s_test_utils.hpp.
components/core/tests/test-clp_s-search.cpp (7)
44-44
: LGTM: new formatted-float fixture constantThe constant is clear and follows existing naming.
64-67
: Const‑correctness for helper definitionsMake parameters const for consistency with other tests.
-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path) +auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path) -> std::filesystem::path { return std::filesystem::path{cTestInputFileDirectory} / test_input_path; } -auto get_test_input_local_path(std::string_view test_input_path) -> std::string { +auto get_test_input_local_path(std::string_view const test_input_path) -> std::string { std::filesystem::path const current_file_path{__FILE__}; auto const tests_dir{current_file_path.parent_path()}; return (tests_dir / get_test_input_path_relative_to_tests_dir(test_input_path)).string(); }Also applies to: 69-73
233-244
: Name boolean args and isolate per‑run archive directory
- Annotate booleans to prevent arg‑order mistakes.
- Use a unique archive dir per parametrisation to avoid cross‑run interference.
- TestOutputCleaner const test_cleanup{{std::string{cTestSearchArchiveDirectory}}}; + auto const archive_dir = fmt::format("{}-s{}_f{}", + cTestSearchArchiveDirectory, + structurize_arrays, + single_file_archive); + TestOutputCleaner const test_cleanup{{archive_dir}}; @@ - std::ignore = compress_archive( - get_test_input_local_path(cTestSearchInputFile), - std::string{cTestSearchArchiveDirectory}, - std::string{cTestIdxKey}, - false, - single_file_archive, - structurize_arrays - ) + std::ignore = compress_archive( + get_test_input_local_path(cTestSearchInputFile), + archive_dir, + std::string{cTestIdxKey}, + /*retain_float_format=*/false, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/structurize_arrays + )
246-249
: Capture generated params to aid triageInclude all generated values in failure output.
- CAPTURE(query); + CAPTURE(query); + CAPTURE(structurize_arrays); + CAPTURE(single_file_archive);
256-267
: Extend cases: negative zero and uppercase exponentNumeric equality should treat -0.0 == 0 and accept 1E-16.
{R"aa(formattedFloatValue: 1e-16)aa", {6, 7}}, + {R"aa(formattedFloatValue: 1E-16)aa", {6, 7}}, + {R"aa(formattedFloatValue: -0.0)aa", {3, 4, 5}},
48-50
: Const‑correctness for helper declarations (and centralise)Accept
std::string_view const
to match usage elsewhere and avoid accidental mutation. Consider moving these into clp_s_test_utils.{hpp,cpp}.-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path) +auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path) -> std::filesystem::path; -auto get_test_input_local_path(std::string_view test_input_path) -> std::string; +auto get_test_input_local_path(std::string_view const test_input_path) -> std::string;
270-283
: Name boolean args and use unique archive dirMatch the style used above and avoid shared output paths across runs.
- TestOutputCleaner const test_cleanup{{std::string{cTestSearchArchiveDirectory}}}; + auto const archive_dir = fmt::format("{}-f{}", + cTestSearchArchiveDirectory, + single_file_archive); + TestOutputCleaner const test_cleanup{{archive_dir}}; @@ - std::ignore = compress_archive( - get_test_input_local_path(cTestSearchFormattedFloatFile), - std::string{cTestSearchArchiveDirectory}, - std::nullopt, - true, - single_file_archive, - false, - clp_s::FileType::Json - ) + std::ignore = compress_archive( + get_test_input_local_path(cTestSearchFormattedFloatFile), + archive_dir, + std::nullopt, + /*retain_float_format=*/true, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/false, + clp_s::FileType::Json + )components/core/tests/test-clp_s-end_to_end.cpp (6)
129-133
: Quote paths in shell commandsUnquoted paths break on spaces. Quote both arguments.
- auto command = fmt::format( - "jq --sort-keys --compact-output '.' {} | sort > {}", - extracted_json_path.string(), - cTestEndToEndOutputSortedJson - ); + auto command = fmt::format( + "jq --sort-keys --compact-output '.' \"{}\" | sort > \"{}\"", + extracted_json_path.string(), + cTestEndToEndOutputSortedJson + ); @@ - command = fmt::format( - "diff --unified {} {} > /dev/null", - cTestEndToEndOutputSortedJson, - get_test_input_local_path(cTestEndToEndInputFile) - ); + command = fmt::format( + "diff --unified \"{}\" \"{}\" > /dev/null", + cTestEndToEndOutputSortedJson, + get_test_input_local_path(cTestEndToEndInputFile) + );Also applies to: 141-147
150-182
: Quote paths in literallyCompare()Apply consistent quoting throughout.
- auto command = fmt::format( - "cat {} | awk 'NF' | tr -d ' ' | sort > {}", - expected_output_json_path.string(), - cTestEndToEndExpectedOutputSortedFile - ); + auto command = fmt::format( + "cat \"{}\" | awk 'NF' | tr -d ' ' | sort > \"{}\"", + expected_output_json_path.string(), + cTestEndToEndExpectedOutputSortedFile + ); @@ - command = fmt::format( - "cat {} | awk 'NF' | tr -d ' ' | sort > {}", - extracted_json_path.string(), - cTestEndToEndOutputSortedJson - ); + command = fmt::format( + "cat \"{}\" | awk 'NF' | tr -d ' ' | sort > \"{}\"", + extracted_json_path.string(), + cTestEndToEndOutputSortedJson + ); @@ - command = fmt::format( - "diff --unified {} {} > /dev/null", - cTestEndToEndExpectedOutputSortedFile, - cTestEndToEndOutputSortedJson - ); + command = fmt::format( + "diff --unified \"{}\" \"{}\" > /dev/null", + cTestEndToEndExpectedOutputSortedFile, + cTestEndToEndOutputSortedJson + );
197-206
: Check compress_archive stats and name boolean argsDon’t discard stats; assert non‑empty and annotate flags.
- REQUIRE_NOTHROW( - std::ignore = compress_archive( + REQUIRE_NOTHROW( + [&]{ + auto stats = compress_archive( get_test_input_local_path(cTestEndToEndInputFile), std::string{cTestEndToEndArchiveDirectory}, std::nullopt, - false, - single_file_archive, - structurize_arrays - ) + /*retain_float_format=*/false, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/structurize_arrays + ); + REQUIRE(false == stats.empty()); + }() );
220-237
: Validate stats and name booleans (valid formatted‑floats test); isolate per‑run dirMirror the improvements; prevent parallel interference by using a unique archive dir.
- TestOutputCleaner const test_cleanup{ - {std::string{cTestEndToEndArchiveDirectory}, + auto const archive_dir = fmt::format("{}-valid-s{}_f{}", + cTestEndToEndArchiveDirectory, + structurize_arrays, + single_file_archive); + TestOutputCleaner const test_cleanup{ + {archive_dir, std::string{cTestEndToEndOutputDirectory}, std::string{cTestEndToEndOutputSortedJson}, std::string{cTestEndToEndExpectedOutputSortedFile}} }; @@ - REQUIRE_NOTHROW( - std::ignore = compress_archive( + REQUIRE_NOTHROW( + [&]{ + auto stats = compress_archive( get_test_input_local_path(cTestEndToEndValidFormattedFloatInputFile), - std::string{cTestEndToEndArchiveDirectory}, + archive_dir, std::nullopt, - true, - single_file_archive, - structurize_arrays, + /*retain_float_format=*/true, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/structurize_arrays, clp_s::FileType::Json - ) + ); + REQUIRE(false == stats.empty()); + }() );
271-281
: Validate stats and name booleans (invalid formatted‑floats test); isolate per‑run dirApply the same fixes here.
- TestOutputCleaner const test_cleanup{ - {std::string{cTestEndToEndArchiveDirectory}, + auto const archive_dir = fmt::format("{}-invalid-s{}_f{}", + cTestEndToEndArchiveDirectory, + structurize_arrays, + single_file_archive); + TestOutputCleaner const test_cleanup{ + {archive_dir, std::string{cTestEndToEndOutputDirectory}, std::string{cTestEndToEndOutputSortedJson}, std::string{cTestEndToEndExpectedOutputSortedFile}} }; @@ - REQUIRE_NOTHROW( - std::ignore = compress_archive( + REQUIRE_NOTHROW( + [&]{ + auto stats = compress_archive( get_test_input_local_path(cTestEndToEndInvalidFormattedFloatInputFile), - std::string{cTestEndToEndArchiveDirectory}, + archive_dir, std::nullopt, - true, - single_file_archive, - structurize_arrays, + /*retain_float_format=*/true, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/structurize_arrays, clp_s::FileType::Json - ) + ); + REQUIRE(false == stats.empty()); + }() );
41-61
: Deduplicate test‑path helpers across test filesThese helpers duplicate those in test‑clp_s‑search.cpp; prefer centralising in clp_s_test_utils to avoid drift.
components/core/src/clp_s/JsonParser.cpp (4)
258-276
: Make token view const; avoid accidental reuseConst‑qualify the raw token view for clarity and to prevent accidental reassignment.
- auto double_value_str{trim_trailing_whitespace(cur_value.raw_json_token())}; + auto const double_value_str{ + trim_trailing_whitespace(cur_value.raw_json_token()) + };
365-383
: Make token view const (array path too)Mirror the const binding here as well.
- auto double_value_str{trim_trailing_whitespace(cur_value.raw_json_token())}; + auto const double_value_str{ + trim_trailing_whitespace(cur_value.raw_json_token()) + };
520-543
: Make token view const (ordered/object path)Keep consistent const‑binding across all three sites.
- auto double_value_str{trim_trailing_whitespace(line.raw_json_token())}; + auto const double_value_str{ + trim_trailing_whitespace(line.raw_json_token()) + };
121-131
: Locale‑safe whitespace check: cast to unsigned charstd::isspace requires unsigned char to avoid UB for negative char values.
auto trim_trailing_whitespace(std::string_view str) -> std::string_view { size_t substr_size{str.size()}; for (auto it{str.rbegin()}; str.rend() != it; ++it) { - if (std::isspace(static_cast<int>(*it))) { + if (std::isspace(static_cast<unsigned char>(*it))) { --substr_size; } else { break; } } return str.substr(0ULL, substr_size); }
case simdjson::ondemand::json_type::number: { | ||
simdjson::ondemand::number number_value = cur_value.get_number(); | ||
if (true == number_value.is_double()) { | ||
double double_value = number_value.get_double(); | ||
m_current_parsed_message.add_unordered_value(double_value); | ||
node_id = m_archive_writer | ||
->add_node(node_id_stack.top(), NodeType::Float, cur_key); | ||
if (m_retain_float_format) { | ||
auto double_value_str{trim_trailing_whitespace(cur_value.raw_json_token())}; | ||
auto const float_format_result{get_float_encoding(double_value_str)}; | ||
if (false == float_format_result.has_error() | ||
&& round_trip_is_identical( | ||
double_value_str, | ||
number_value.get_double(), | ||
float_format_result.value() | ||
)) | ||
{ | ||
m_current_parsed_message.add_unordered_value( | ||
number_value.get_double(), | ||
float_format_result.value() | ||
); | ||
node_id = m_archive_writer->add_node( | ||
node_id_stack.top(), | ||
NodeType::FormattedFloat, | ||
cur_key | ||
); | ||
} else { | ||
m_current_parsed_message.add_unordered_value(double_value_str); | ||
node_id = m_archive_writer->add_node( | ||
node_id_stack.top(), | ||
NodeType::DictionaryFloat, | ||
cur_key | ||
); | ||
} | ||
} else { | ||
double double_value = number_value.get_double(); | ||
m_current_parsed_message.add_unordered_value(double_value); | ||
node_id = m_archive_writer | ||
->add_node(node_id_stack.top(), NodeType::Float, cur_key); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Deduplicate float‑handling into a helper
The FormattedFloat/DictionaryFloat decision logic is repeated in 3 places. Extract a small helper (e.g., process_double_token(...)) to compute {node_type, parsed_message_addition} and call it from parse_obj_in_array/parse_array/parse_line to reduce drift.
Would you like me to draft the helper signature and a minimal refactor patch?
Also applies to: 362-395, 518-549
🤖 Prompt for AI Agents
components/core/src/clp_s/JsonParser.cpp lines 254-289 (also apply same change
at 362-395 and 518-549): duplicate logic deciding between FormattedFloat vs
DictionaryFloat (and adding the corresponding value to m_current_parsed_message
+ choosing node type) appears in three places; extract a helper like
process_double_token(cur_value, cur_key, m_retain_float_format,
m_archive_writer, m_current_parsed_message) that returns a pair{NodeType,
optional<string|double>-value-addition} or directly performs the
m_current_parsed_message.add_unordered_value and returns the NodeType to add;
replace the repeated blocks in parse_obj_in_array/parse_array/parse_line with a
single call to that helper and then call
m_archive_writer->add_node(node_id_stack.top(), returnedNodeType, cur_key) to
keep behavior identical while centralizing the float-format detection and
round-trip check.
* @param timestamp_key | ||
* @param retain_float_format | ||
* @param single_file_archive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Document scope: JSON-only effect.
Note in the doc that retain_float_format has no effect for KV‑IR inputs.
Apply:
- * @param retain_float_format
+ * @param retain_float_format
+ * Preserve original JSON float text formatting (e.g., decimal/scientific, precision).
+ * Note: No effect when compressing KV-IR inputs.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
* @param timestamp_key | |
* @param retain_float_format | |
* @param single_file_archive | |
* @param timestamp_key | |
* @param retain_float_format | |
* Preserve original JSON float text formatting (e.g., decimal/scientific, precision). | |
* Note: No effect when compressing KV-IR inputs. | |
* @param single_file_archive |
🤖 Prompt for AI Agents
In components/core/tests/clp_s_test_utils.hpp around lines 18 to 20, the
parameter documentation for retain_float_format lacks the scope note; update the
comment to state explicitly that retain_float_format only applies to JSON inputs
and has no effect for KV-IR (key-value IR) inputs. Edit the @param description
for retain_float_format to add one short sentence such as "Applies to JSON-only;
no effect for KV-IR inputs" so readers know the parameter is ignored for KV‑IR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/core/tests/test-clp_s-end_to_end.cpp (1)
111-113
: Fix C++20 designated initializers for clp_s::Path (use '=')Brace-init
.source{...}
/.path{...}
is invalid; this won’t compile.- constructor_option.archive_path = clp_s::Path{ - .source{clp_s::InputSource::Filesystem}, - .path{entry.path().string()} - }; + constructor_option.archive_path = clp_s::Path{ + .source = clp_s::InputSource::Filesystem, + .path = entry.path().string() + };
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
components/core/tests/test-clp_s-end_to_end.cpp
(5 hunks)components/core/tests/test-clp_s-search.cpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
⚙️ CodeRabbit configuration file
- Prefer
false == <expression>
rather than!<expression>
.
Files:
components/core/tests/test-clp_s-search.cpp
components/core/tests/test-clp_s-end_to_end.cpp
🧠 Learnings (10)
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/core/tests/test-clp_s-search.cpp
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/core/tests/test-clp_s-search.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.
Applied to files:
components/core/tests/test-clp_s-search.cpp
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-14T01:03:27.144Z
Learnt from: LinZhihao-723
PR: y-scope/clp#450
File: components/core/src/clp/streaming_archive/reader/Segment.cpp:57-58
Timestamp: 2024-11-14T01:03:27.144Z
Learning: In `components/core/src/clp/streaming_archive/reader/Segment.cpp`, the method `m_decompressor.open()` does not have a return value.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-29T22:51:00.355Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:141-142
Timestamp: 2024-11-29T22:51:00.355Z
Learning: In 'components/core/tests/test-clp_s-end_to_end.cpp', using `WIFEXITED` and `WEXITSTATUS` with the return value of `std::system` is acceptable for checking the exit status of system commands, as per the C++ documentation.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.
Applied to files:
components/core/tests/test-clp_s-end_to_end.cpp
🧬 Code graph analysis (2)
components/core/tests/test-clp_s-search.cpp (3)
components/core/tests/test-clp_s-end_to_end.cpp (6)
get_test_input_path_relative_to_tests_dir
(41-42)get_test_input_path_relative_to_tests_dir
(52-55)get_test_input_path_relative_to_tests_dir
(52-53)get_test_input_local_path
(43-43)get_test_input_local_path
(57-61)get_test_input_local_path
(57-57)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(14-55)compress_archive
(14-21)components/core/tests/clp_s_test_utils.hpp (1)
compress_archive
(24-31)
components/core/tests/test-clp_s-end_to_end.cpp (4)
components/core/tests/test-clp_s-search.cpp (6)
get_test_input_path_relative_to_tests_dir
(48-49)get_test_input_path_relative_to_tests_dir
(64-67)get_test_input_path_relative_to_tests_dir
(64-65)get_test_input_local_path
(50-50)get_test_input_local_path
(69-73)get_test_input_local_path
(69-69)components/core/src/clp_s/SchemaTree.hpp (2)
child_id
(113-113)child_id
(113-113)components/core/tests/clp_s_test_utils.cpp (2)
compress_archive
(14-55)compress_archive
(14-21)components/core/src/clp_s/ColumnReader.hpp (2)
FormattedFloat
(164-164)DictionaryFloat
(200-200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: package-image
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
- GitHub Check: manylinux_2_28-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: musllinux_1_2-x86_64-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (macos-15)
- GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (22)
components/core/tests/test-clp_s-search.cpp (13)
44-44
: Fixture constant LGTM.Clear naming; no issues.
218-218
: Use of parameterised input path LGTM.Good switch to per-file path resolver.
246-249
: Capture generated params for easier triage (repeat).Include parameter values in failure output.
- CAPTURE(query); + CAPTURE(query); + CAPTURE(structurize_arrays); + CAPTURE(single_file_archive); REQUIRE_NOTHROW(search(query, false, expected_results));
256-256
: New formatted-float test case LGTM.Good coverage of numeric operators and simple equality.
257-267
: Add -0.0 and uppercase exponent cases; clarify numeric vs lexical (repeat).Ensure numeric semantics handle -0.0 == 0 and exponent case-insensitivity.
{R"aa(formattedFloatValue: 1e-16)aa", {6, 7}}, + {R"aa(formattedFloatValue: -0.0)aa", {3, 4, 5}}, + {R"aa(formattedFloatValue: 1E-16)aa", {6, 7}},If lexical (exact-format) queries are also supported when quoted, consider adding quoted variants in a follow-up.
288-291
: Metadata first-record query reuse LGTM.Consistent with the other test; good sanity check.
48-50
: Deduplicate helpers; make params const (repeat).These helpers duplicate ones used elsewhere and lack const on parameters. Prefer centralising in clp_s_test_utils and apply const-correctness.
-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path) +auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path) -> std::filesystem::path; -auto get_test_input_local_path(std::string_view test_input_path) -> std::string; +auto get_test_input_local_path(std::string_view const test_input_path) -> std::string;
64-67
: Apply const-correctness to definition (repeat).Match signatures used in other tests and reduce accidental copies.
-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path) +auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path) -> std::filesystem::path { return std::filesystem::path{cTestInputFileDirectory} / test_input_path; }
69-73
: Apply const-correctness to definition (repeat).-auto get_test_input_local_path(std::string_view test_input_path) -> std::string { +auto get_test_input_local_path(std::string_view const test_input_path) -> std::string { std::filesystem::path const current_file_path{__FILE__}; auto const tests_dir{current_file_path.parent_path()}; return (tests_dir / get_test_input_path_relative_to_tests_dir(test_input_path)).string(); }
237-243
: Name boolean/flag args in compress_archive (repeat).Inline names prevent arg-order mistakes.
std::ignore = compress_archive( get_test_input_local_path(cTestSearchInputFile), std::string{cTestSearchArchiveDirectory}, - std::string{cTestIdxKey}, - false, - single_file_archive, - structurize_arrays + /*timestamp_key=*/std::string{cTestIdxKey}, + /*retain_float_format=*/false, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/structurize_arrays )
270-271
: Isolate archive dirs per parametrisation (repeat).Avoid cross-run interference in parallel runs by suffixing the directory.
- TestOutputCleaner const test_cleanup{{std::string{cTestSearchArchiveDirectory}}}; + auto archive_dir = fmt::format("{}-f{}", cTestSearchArchiveDirectory, single_file_archive); + TestOutputCleaner const test_cleanup{{archive_dir}}; @@ - std::string{cTestSearchArchiveDirectory}, + archive_dir,Also applies to: 274-276
272-281
: Name boolean/flag args in formatted-float compress_archive (repeat).std::ignore = compress_archive( get_test_input_local_path(cTestSearchFormattedFloatFile), - std::string{cTestSearchArchiveDirectory}, - std::nullopt, - true, - single_file_archive, - false + std::string{cTestSearchArchiveDirectory}, + /*timestamp_key=*/std::nullopt, + /*retain_float_format=*/true, + /*single_file_archive=*/single_file_archive, + /*structurize_arrays=*/false )
283-286
: Capture generated param in failure output.Add single_file_archive to CAPTURE to aid debugging.
- CAPTURE(query); + CAPTURE(query); + CAPTURE(single_file_archive); REQUIRE_NOTHROW(search(query, false, expected_results));components/core/tests/test-clp_s-end_to_end.cpp (9)
197-206
: Capture and assert compress_archive stats (no-floats test)Avoid discarding return value; assert non-empty stats to validate write-out.
- REQUIRE_NOTHROW( - std::ignore = compress_archive( + REQUIRE_NOTHROW([&]{ + auto stats = compress_archive( get_test_input_local_path(cTestEndToEndInputFile), std::string{cTestEndToEndArchiveDirectory}, std::nullopt, false, single_file_archive, structurize_arrays - ) - ); + ); + REQUIRE(false == stats.empty()); + }());
141-148
: Quote paths in diff invocationUnquoted paths break on spaces.
- command = fmt::format( - "diff --unified {} {} > /dev/null", - cTestEndToEndOutputSortedJson, - get_test_input_local_path(cTestEndToEndInputFile) - ); + command = fmt::format( + "diff --unified \"{}\" \"{}\" > /dev/null", + cTestEndToEndOutputSortedJson, + get_test_input_local_path(cTestEndToEndInputFile) + );
150-153
: Name is misleading: literallyCompare ignores whitespaceEither rename (e.g., compareIgnoringWhitespace) or add a brief comment stating it strips spaces and sorts lines.
155-178
: Quote all file paths in shell pipelines (cat/awk/tr/sort/diff)Consistently quote to avoid breakage with spaces/special chars.
- auto command = fmt::format( - "cat {} | awk 'NF' | tr -d ' ' | sort > {}", - expected_output_json_path.string(), - cTestEndToEndExpectedOutputSortedFile - ); + auto command = fmt::format( + "cat \"{}\" | awk 'NF' | tr -d ' ' | sort > \"{}\"", + expected_output_json_path.string(), + cTestEndToEndExpectedOutputSortedFile + ); @@ - command = fmt::format( - "cat {} | awk 'NF' | tr -d ' ' | sort > {}", - extracted_json_path.string(), - cTestEndToEndOutputSortedJson - ); + command = fmt::format( + "cat \"{}\" | awk 'NF' | tr -d ' ' | sort > \"{}\"", + extracted_json_path.string(), + cTestEndToEndOutputSortedJson + ); @@ - command = fmt::format( - "diff --unified {} {} > /dev/null", - cTestEndToEndExpectedOutputSortedFile, - cTestEndToEndOutputSortedJson - ); + command = fmt::format( + "diff --unified \"{}\" \"{}\" > /dev/null", + cTestEndToEndExpectedOutputSortedFile, + cTestEndToEndOutputSortedJson + );
227-236
: Also validate compress_archive stats (valid formatted-floats test)Mirror the earlier improvement.
- REQUIRE_NOTHROW( - std::ignore = compress_archive( + REQUIRE_NOTHROW([&]{ + auto stats = compress_archive( get_test_input_local_path(cTestEndToEndValidFormattedFloatInputFile), std::string{cTestEndToEndArchiveDirectory}, std::nullopt, true, single_file_archive, structurize_arrays - ) - ); + ); + REQUIRE(false == stats.empty()); + }());
270-279
: Also validate compress_archive stats (invalid formatted-floats test)Keep tests consistent and stronger.
- REQUIRE_NOTHROW( - std::ignore = compress_archive( + REQUIRE_NOTHROW([&]{ + auto stats = compress_archive( get_test_input_local_path(cTestEndToEndInvalidFormattedFloatInputFile), std::string{cTestEndToEndArchiveDirectory}, std::nullopt, true, single_file_archive, structurize_arrays - ) - ); + ); + REQUIRE(false == stats.empty()); + }());
247-251
: Pre-check extracted artefact exists before compare (minor safety)Small guard prior to file ops.
- auto extracted_json_path = extract(); + auto extracted_json_path = extract(); + REQUIRE(std::filesystem::exists(extracted_json_path)); + REQUIRE(std::filesystem::is_regular_file(extracted_json_path));
290-294
: Pre-check extracted artefact exists before compare (minor safety)Mirror the check in the other test.
- auto extracted_json_path = extract(); + auto extracted_json_path = extract(); + REQUIRE(std::filesystem::exists(extracted_json_path)); + REQUIRE(std::filesystem::is_regular_file(extracted_json_path));
52-61
: DRY: Share test-path helpers across testsThese helpers duplicate implementations in test-clp_s-search.cpp; move into a shared header/source (e.g., clp_s_test_utils.hpp/cpp).
void check_all_leaf_nodes_match_types(std::set<clp_s::NodeType> const& types) { | ||
clp_s::ArchiveReader archive_reader; | ||
for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) { | ||
REQUIRE_NOTHROW(archive_reader.open( | ||
clp_s::Path{ | ||
.source = clp_s::InputSource::Filesystem, | ||
.path = entry.path().string() | ||
}, | ||
clp_s::NetworkAuthOption{} | ||
)); | ||
auto const schema_tree{archive_reader.get_schema_tree()}; | ||
REQUIRE(nullptr != schema_tree); | ||
auto const root{schema_tree->get_object_subtree_node_id_for_namespace( | ||
clp_s::constants::cDefaultNamespace | ||
)}; | ||
REQUIRE(-1 != root); | ||
std::vector<int32_t> node_id_stack = {root}; | ||
while (false == node_id_stack.empty()) { | ||
auto const cur_node_id{node_id_stack.back()}; | ||
node_id_stack.pop_back(); | ||
auto const& cur_node{schema_tree->get_node(cur_node_id)}; | ||
|
||
if (false == cur_node.get_children_ids().empty()) { | ||
for (auto const child_id : cur_node.get_children_ids()) { | ||
node_id_stack.emplace_back(child_id); | ||
} | ||
continue; | ||
} | ||
REQUIRE(0 != types.count(cur_node.get_type())); | ||
} | ||
|
||
REQUIRE_NOTHROW(archive_reader.close()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Skip non-regular files when scanning the archive directory
Defensive guard prevents attempting to open dirs or temp files.
for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) {
+ if (false == std::filesystem::is_regular_file(entry.path())) {
+ continue;
+ }
REQUIRE_NOTHROW(archive_reader.open(
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
void check_all_leaf_nodes_match_types(std::set<clp_s::NodeType> const& types) { | |
clp_s::ArchiveReader archive_reader; | |
for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) { | |
REQUIRE_NOTHROW(archive_reader.open( | |
clp_s::Path{ | |
.source = clp_s::InputSource::Filesystem, | |
.path = entry.path().string() | |
}, | |
clp_s::NetworkAuthOption{} | |
)); | |
auto const schema_tree{archive_reader.get_schema_tree()}; | |
REQUIRE(nullptr != schema_tree); | |
auto const root{schema_tree->get_object_subtree_node_id_for_namespace( | |
clp_s::constants::cDefaultNamespace | |
)}; | |
REQUIRE(-1 != root); | |
std::vector<int32_t> node_id_stack = {root}; | |
while (false == node_id_stack.empty()) { | |
auto const cur_node_id{node_id_stack.back()}; | |
node_id_stack.pop_back(); | |
auto const& cur_node{schema_tree->get_node(cur_node_id)}; | |
if (false == cur_node.get_children_ids().empty()) { | |
for (auto const child_id : cur_node.get_children_ids()) { | |
node_id_stack.emplace_back(child_id); | |
} | |
continue; | |
} | |
REQUIRE(0 != types.count(cur_node.get_type())); | |
} | |
REQUIRE_NOTHROW(archive_reader.close()); | |
} | |
void check_all_leaf_nodes_match_types(std::set<clp_s::NodeType> const& types) { | |
clp_s::ArchiveReader archive_reader; | |
for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) { | |
if (false == std::filesystem::is_regular_file(entry.path())) { | |
continue; | |
} | |
REQUIRE_NOTHROW(archive_reader.open( | |
clp_s::Path{ | |
.source = clp_s::InputSource::Filesystem, | |
.path = entry.path().string() | |
}, | |
clp_s::NetworkAuthOption{} | |
)); | |
auto const schema_tree{archive_reader.get_schema_tree()}; | |
REQUIRE(nullptr != schema_tree); | |
auto const root{schema_tree->get_object_subtree_node_id_for_namespace( | |
clp_s::constants::cDefaultNamespace | |
)}; | |
REQUIRE(-1 != root); | |
std::vector<int32_t> node_id_stack = {root}; | |
while (false == node_id_stack.empty()) { | |
auto const cur_node_id{node_id_stack.back()}; | |
node_id_stack.pop_back(); | |
auto const& cur_node{schema_tree->get_node(cur_node_id)}; | |
if (false == cur_node.get_children_ids().empty()) { | |
for (auto const child_id : cur_node.get_children_ids()) { | |
node_id_stack.emplace_back(child_id); | |
} | |
continue; | |
} | |
REQUIRE(0 != types.count(cur_node.get_type())); | |
} | |
REQUIRE_NOTHROW(archive_reader.close()); | |
} | |
} |
🤖 Prompt for AI Agents
In components/core/tests/test-clp_s-end_to_end.cpp around lines 63–95, the
directory iteration attempts to open every directory entry and can fail on
non-regular files (directories, symlinks, temp files); add a defensive guard
immediately inside the loop to skip entries that are not regular files (e.g., if
(!std::filesystem::is_regular_file(entry.path())) continue;) so only regular
files are passed to archive_reader.open; place the check before calling
archive_reader.open and keep existing behavior for regular files unchanged.
auto single_file_archive = GENERATE(true, false); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick
Consider parametrising structurize_arrays here as well.
For parity with the first test, exercising both modes would catch schema-shape differences.
🤖 Prompt for AI Agents
In components/core/tests/test-clp_s-search.cpp around lines 268-269, the test
currently only parameterises single_file_archive; add a matching
parameterisation for structurize_arrays (e.g. auto structurize_arrays =
GENERATE(true, false)) so the test runs both modes and catches schema-shape
differences; ensure the rest of the test uses this variable where
structurize_arrays is consulted (or pass it into the function/setup) so both
combinations are exercised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deferring to @gibber9809's review.
Edited the PR title directly. Whoever wrote the previous PR title, watch out for misspellings, lol.
Description
This PR defines a new node type for archives, which is
FormattedFloat
. It stores encoded format information on top of the originalFloat
type. For the detailed encoding design, please check the doc changes.Checklist
breaking change.
Validation performed
Besides the existing datasets, I also added a pure float dataset (named
myown
in the tables) for corner case testing. I designed the following log lines for benchmarking:Explanation of each field:
”ts”
is the timestamp epoch, which is ensured to be ascending during the generation.”standardFloat”
is the standard expression of float, as in the JSON grammar: integer fraction. We can randomly generate the 1) sign (whether - for not); 2) integer digits; 3) fraction digits.”scientificFloat”
is the scientific notation of float, as in the JSON grammar: integer fraction exponent. We can randomly generate the 1) sign of mantissa (whether - or not) 2) the integer part of mantissa (1~9, or 0 if the value is 0); 3) the fraction part of mantissa; 4) the upper case E or e for the exponent; 5) sign of exponent (+, - or empty); 6) the exponent.”eitherStandardOrScientificFloat”
which is either random-generated”standardFloat”
or”scientificFloat”
.”unstructuredArray”
is a random size of”eitherStandardOrScientificFloat”
.“structuredArray”
is a random size of objects which contain one field ”float” which is generated in the same way as”eitherStandardOrScientificFloat”
.”msg”
is a randomly generated string, maybe we can get rid of this.I generated 1,000,000 loglines like this, the total size is 2.27GB.
Compression ratio results (the only flag other than
--retain-float-format
I specified was--timestamp-key
):Compression speed results:
I only benchmarked search performance for the MongoDB dataset with the queries in the papaer. I think the difference is just noise.
Here is the script I use to generate the float dataset:
For testing we:
FormattedFloat
and numbers we need to represent asDictionaryFloat
where we compress a test dataset, confirm that all of the data gets encoded asFormattedFloat
orDictionaryFloat
respectively, and verify that the decompression results are byte-accurateFormattedFloat
and confirm that they can be round-tripped correctlySummary by CodeRabbit
New Features
Tests
Chores