Skip to content

Conversation

anlowee
Copy link
Member

@anlowee anlowee commented Aug 8, 2025

Description

This PR defines a new node type for archives, which is FormattedFloat. It stores encoded format information on top of the original Float type. For the detailed encoding design, please check the doc changes.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Besides the existing datasets, I also added a pure float dataset (named myown in the tables) for corner case testing. I designed the following log lines for benchmarking:

{
  "ts": 1754934214000,
  "standardFloat": 1.234,
  "scientificFloat": 1.234E3,
  "eitherStandardOrScientificFloat": 1.234,
  "unstructuredArray": [1.234, 1.234E3],
  "structuredArray": [
    {
      "float": 1.234
    },
    {
      "float": 1.234E3
    }
  ],
  "msg": "abc"
}

Explanation of each field:

  • ”ts” is the timestamp epoch, which is ensured to be ascending during the generation.
  • ”standardFloat” is the standard expression of float, as in the JSON grammar: integer fraction. We can randomly generate the 1) sign (whether - for not); 2) integer digits; 3) fraction digits.
  • ”scientificFloat” is the scientific notation of float, as in the JSON grammar: integer fraction exponent. We can randomly generate the 1) sign of mantissa (whether - or not) 2) the integer part of mantissa (1~9, or 0 if the value is 0); 3) the fraction part of mantissa; 4) the upper case E or e for the exponent; 5) sign of exponent (+, - or empty); 6) the exponent.
  • ”eitherStandardOrScientificFloat” which is either random-generated ”standardFloat” or ”scientificFloat”.
  • ”unstructuredArray” is a random size of ”eitherStandardOrScientificFloat”.
  • “structuredArray” is a random size of objects which contain one field ”float” which is generated in the same way as ”eitherStandardOrScientificFloat”.
  • ”msg” is a randomly generated string, maybe we can get rid of this.

I generated 1,000,000 loglines like this, the total size is 2.27GB.

Compression ratio results (the only flag other than --retain-float-format I specified was --timestamp-key):

dataset new ratio w/ retain new ratio w/o retain old ratio
cockroachdb 27.97 27.97 27.97
mongodb 170.89 170.89 170.89
elasticsearch 142.72 142.72 142.72
spark-event-logs 57.65 57.65 57.65
postgresql 39.50 39.50 39.50
myown 2.45 2.46 2.46

Compression speed results:

dataset new time w/ retain (s) new time w/o retain (s) old time (s)
cockroachdb 138.902 140.63 137.484
mongodb 758.337 758.183 758.235
elasticsearch 73.781 74.48 73.793
spark-event-logs 20.502 20.495 20.42
postgresql 6.362 6.314 6.358
myown 83.059 80.793 80.701

I only benchmarked search performance for the MongoDB dataset with the queries in the papaer. I think the difference is just noise.

query index new time w/ retain (s) new time w/o retain (s) old time (s)
0 0.730 0.753 0.751
1 32.984 34.854 33.721
2 0.701 0.723 0.709
3 15.767 15.826 15.773
4 1.073 1.061 1.087
5 0.740 0.741 0.740

Here is the script I use to generate the float dataset:

import random
import string
import time

def gen_standard_float(rng: random.Random) -> str:
    sign = rng.choice(["", "-"])
    int_length = rng.randint(1, 15)
    int_part = str(rng.randint(1, 9)) + "".join(str(rng.randint(0, 9)) for _ in range(int_length - 1))
    frac_length = rng.randint(1, 16 - int_length)
    frac_part = "." + "".join(str(rng.randint(0, 9)) for _ in range(frac_length))
    return f"{sign}{int_part}{frac_part}"

def gen_scientific_float(rng: random.Random) -> str:
    exp_char = "E" if rng.random() < 0.5 else "e"
    exp_sign = rng.choice(["", "+", "-"])
    exp_part = rng.randint(0, 100)
    sign = rng.choice(["", "-"])
    if rng.random() < 0.01:
        frac_length = rng.randint(0, 15)
        frac_part = "" if 0 == frac_length else "." + "".join("0" for _ in range(frac_length))
        return f"{sign}0{frac_part}{exp_char}{exp_sign}{exp_part}"

    int_part = str(rng.randint(1, 9))
    frac_length = rng.randint(0, 15)
    frac_part = "" if 0 == frac_length else "." + "".join(str(rng.randint(0, 9)) for _ in range(frac_length))

    return f"{sign}{int_part}{frac_part}{exp_char}{exp_sign}{exp_part}"

def gen_either_float(rng: random.Random) -> str:
    return rng.choice([gen_standard_float(rng), gen_scientific_float(rng)])

def gen_msg(rng: random.Random, min_len=3, max_len=12) -> str:
    length = rng.randint(min_len, max_len)
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))

def main():
    ap = argparse.ArgumentParser(
        description="Generate JSON log lines with standard/scientific floats for benchmarking."
    )
    ap.add_argument("-n", "--count", type=int, default=10, help="number of log lines (default: 10)")
    ap.add_argument("--start-ts-ms", type=int, default=None,
                    help="starting epoch timestamp in ms (default: now in ms)")
    ap.add_argument("--interval-ms", type=int, default=10,
                    help="fixed increment in ms between lines (default: 10)")
    ap.add_argument("--unstructured-min", type=int, default=0,
                    help="min size of unstructuredArray (default: 0)")
    ap.add_argument("--unstructured-max", type=int, default=100,
                    help="max size of unstructuredArray (default: 100)")
    ap.add_argument("--structured-min", type=int, default=0,
                    help="min size of structuredArray (default: 0)")
    ap.add_argument("--structured-max", type=int, default=100,
                    help="max size of structuredArray (default: 100)")
    args = ap.parse_args()

    rng = random.Random(time.time())

    start_ts = args.start_ts_ms if args.start_ts_ms is not None else int(time.time() * 1000)

    for i in range(args.count):
        ts_str = str(start_ts + i * args.interval_ms)

        standard_str = gen_standard_float(rng)
        scientific_str = gen_scientific_float(rng)
        either_str = gen_either_float(rng)

        unstructured_length = rng.randint(args.unstructured_min, max(args.unstructured_min, args.unstructured_max))
        unstructured_str = "[]" if 0 == unstructured_length else f"[ {gen_either_float(rng)} ]" if 1 == unstructured_length else f"[ {gen_either_float(rng)}" + "".join(f", {gen_either_float(rng)}" for _ in range(unstructured_length - 1)) + " ]"
        
        structured_length = rng.randint(args.structured_min, max(args.structured_min, args.structured_max))
        structured_str = "[]" if 0 == structured_length else f'[ {{ "float": {gen_either_float(rng)} }} ]' if 1 == structured_length else f'[ {{ "float": {gen_either_float(rng)} }}' + "".join(f', {{ "float": {gen_either_float(rng)} }}' for _ in range(structured_length - 1)) + " ]"

        msg_str = gen_msg(rng)

        obj_str = f'{{ "ts": {ts_str}, "standardFloat": {standard_str}, "scientificFloat": {scientific_str}, "eitherStandardOrScientificFloat": {either_str}, "unstructuredArray": {unstructured_str}, "structuredArray": {structured_str}, "msg": "{msg_str}" }}'
        
        print(obj_str)

if __name__ == "__main__":
    main()

For testing we:

  • Added end to end compression/decompression tests for both floating-poing numbers we can represent as FormattedFloat and numbers we need to represent as DictionaryFloat where we compress a test dataset, confirm that all of the data gets encoded as FormattedFloat or DictionaryFloat respectively, and verify that the decompression results are byte-accurate
  • Added fuzzing unit tests where we generate different kinds of floating-point numbers that should be representable as FormattedFloat and confirm that they can be round-tripped correctly

Summary by CodeRabbit

  • New Features

    • New option to retain original float text during compression; extraction can reproduce exact formatted or dictionary-backed float representations.
    • JSON parsing, serialization and archive I/O now preserve and emit formatted/dictionary float variants when enabled.
    • Compression orchestration now exposes and uses --retain-float-format.
  • Tests

    • Added unit, fuzz, end-to-end and search tests plus fixtures and helpers covering formatted and dictionary float handling.
  • Chores

    • Archive patch version increment and build dependency updates.

Copy link
Contributor

coderabbitai bot commented Aug 8, 2025

Walkthrough

Adds lossless float-format retention: new NodeTypes (FormattedFloat, DictionaryFloat), a 16-bit float-format encoding API (get_float_encoding, restore_encoded_float), per-element format/dictionary storage and wiring across ParsedMessage, readers/writers, JSON parser/serializer, CLI flag --retain-float-format, build changes, and tests/fixtures.

Changes

Cohort / File(s) Summary
Float format encoding module
components/core/src/clp_s/FloatFormatEncoding.hpp, components/core/src/clp_s/FloatFormatEncoding.cpp, components/core/tests/test-FloatFormatEncoding.cpp, components/core/CMakeLists.txt, components/core/src/clp_s/CMakeLists.txt, components/core/cmake/Options/options.cmake
Introduce float_format_t (uint16_t), bitfield layout constants, implement get_float_encoding and restore_encoded_float, add fuzz tests and CMake wiring, and require ystdlib error_handling.
Parser & CLI propagation
components/core/src/clp_s/JsonParser.hpp, components/core/src/clp_s/JsonParser.cpp, components/core/src/clp_s/CommandLineArguments.hpp, components/core/src/clp_s/CommandLineArguments.cpp, components/core/src/clp_s/clp-s.cpp, components/job-orchestration/job_orchestration/executor/compress/compression_task.py
Add CLI option --retain-float-format; JsonParser gains retain_float_format and emits FormattedFloat/DictionaryFloat (or Float) based on encoding + round‑trip checks; CLI and orchestration propagate flag.
Schema & serialization
components/core/src/clp_s/SchemaTree.hpp, components/core/src/clp_s/SchemaTree.cpp, components/core/src/clp_s/SchemaReader.cpp, components/core/src/clp_s/JsonSerializer.hpp
Add NodeType::FormattedFloat and ::DictionaryFloat; map them to float literal type; add JsonSerializer::Op entries AddFormattedFloatField/AddFormattedFloatValue and handle template generation and JSON output.
Archive integration
components/core/src/clp_s/ArchiveWriter.cpp, components/core/src/clp_s/ArchiveReader.cpp
Initialize and wire FormattedFloat and DictionaryFloat column writers/readers alongside existing Float handling; pass variable-dictionary or format state where applicable.
Column readers & writers
components/core/src/clp_s/ColumnWriter.hpp, components/core/src/clp_s/ColumnWriter.cpp, components/core/src/clp_s/ColumnReader.hpp, components/core/src/clp_s/ColumnReader.cpp
Add FormattedFloatColumnWriter/Reader and DictionaryFloatColumnWriter/Reader. Writers collect per-element formats or dictionary IDs and serialize blocks; readers load formats/IDs and extract formatted strings via restore_encoded_float or dictionary lookup.
ParsedMessage support
components/core/src/clp_s/ParsedMessage.hpp
Extend variable_t to include std::pair<double,float_format_t> and add overloads to store formatted floats ordered/unordered.
Tests & helpers
components/core/tests/clp_s_test_utils.hpp, components/core/tests/clp_s_test_utils.cpp, components/core/tests/test-clp_s-end_to_end.cpp, components/core/tests/test-clp_s-range_index.cpp, components/core/tests/test-clp_s-delta-encode-log-order.cpp, components/core/tests/test-clp_s-search.cpp
Update compress_archive signature to accept retain_float_format, adjust callers, add end‑to‑end and search tests for formatted/dictionary floats, add helper utilities and normalization/compare logic.
Test data fixtures
components/core/tests/test_log_files/test_valid_formatted_float.jsonl, components/core/tests/test_log_files/test_invalid_formatted_float.jsonl, components/core/tests/test_log_files/test_search_formatted_float.jsonl
Add/extend JSONL fixtures covering valid and invalid formatted float representations and search cases.
Miscellaneous build/version
components/core/src/clp_s/SingleFileArchiveDefs.hpp, components/core/src/clp_s/CMakeLists.txt
Bump archive patch version (0 → 1) and add FloatFormatEncoding sources to clp_s targets.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as "clp-s CLI"
  participant Parser as "JsonParser"
  participant PM as "ParsedMessage"
  participant CW as "ColumnWriter"
  participant AW as "ArchiveWriter"

  User->>CLI: clp-s --retain-float-format ...
  CLI->>Parser: JsonParserOption{retain_float_format}
  Parser->>Parser: detect numeric token
  alt retain_float_format AND encodable AND round-trip matches
    Parser->>PM: add_value(node_id, double, format)
    PM->>CW: variable_t(pair<double,float_format_t>)
    CW->>CW: FormattedFloatColumnWriter.add_value
  else retain_float_format AND encodable but round-trip fails
    Parser->>PM: add_value(node_id, string)
    PM->>CW: variable_t(string)
    CW->>CW: DictionaryFloatColumnWriter.add_value (dict insert)
  else retain_float_format off
    Parser->>PM: add_value(node_id, double)
    PM->>CW: variable_t(double)
    CW->>CW: FloatColumnWriter.add_value
  end
  CW->>AW: store serialized blocks (values, formats or dict IDs)
  AW->>User: archive stored
Loading
sequenceDiagram
  autonumber
  participant AR as "ArchiveReader"
  participant CR as "ColumnReader"
  participant SR as "SchemaReader"
  participant JS as "JsonSerializer"
  participant Out as "Output"

  AR->>SR: generate_json_template()
  SR->>JS: emit ops (AddFormattedFloat* included)
  loop per row
    AR->>CR: read column by NodeType
    alt FormattedFloat
      CR->>CR: load values + formats
      CR->>JS: restore_encoded_float -> formatted string
    else DictionaryFloat
      CR->>CR: load dict IDs
      CR->>JS: dictionary lookup -> string
    else Float
      CR->>JS: numeric double
    end
    JS->>Out: JSON line
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately summarizes the primary change—adding lossless floating‑point compression—and clearly notes the archive format bump; it is specific, concise, and directly related to the changes in the diff.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One important comment -- this new float encoding should be a new NodeType w/ a new column reader and writer. We should avoid changing how the encoding for an existing NodeType works.

@kirkrodrigues
Copy link
Member

One thing we'll need to figure out is whether we should add support to KV-IR streams, since we use them to view logs in the log viewer. E.g., if the user's original JSON log contained a float that was printed with a specific format (e.g., "-0.0e-0"), when converted to IR, that format wouldn't be retained). This wouldn't be a breaking change, so it seems doable.

@LinZhihao-723 Any thoughts?

@LinZhihao-723
Copy link
Member

One thing we'll need to figure out is whether we should add support to KV-IR streams, since we use them to view logs in the log viewer. E.g., if the user's original JSON log contained a float that was printed with a specific format (e.g., "-0.0e-0"), when converted to IR, that format wouldn't be retained). This wouldn't be a breaking change, so it seems doable.

@LinZhihao-723 Any thoughts?

From a format perspective, I think that should be feasible. We don't need to introduce any more types in the scheme tree, but add a CLP-encoded float type (so overall it works like the String type, where the encoded value can be a variable string or an encoded text AST). But we probably need to handle the case where the float can't be encoded into an 8-byte variable. In that case, we probably need to serialize it as a string.

From the serializer's perspective, it probably requires more work: currently the serializer only takes msgpack. To trigger the CLP-float encoding, we may need to serialize the float as a string (which will have performance overhead), or we define a customized type (for which I don't know the details of how to do this. Need time to figure out).

@anlowee anlowee marked this pull request as ready for review August 12, 2025 15:28
@anlowee anlowee requested review from wraymo and a team as code owners August 12, 2025 15:28
@anlowee anlowee requested a review from kirkrodrigues August 12, 2025 15:38
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 22

🔭 Outside diff range comments (3)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)

244-253: Gate --retain-float-format behind a config flag

Unconditionally enabling this flag changes the output semantics for all CLP_S jobs and can break environments where the binary hasn’t been updated. Only append --retain-float-format when the user explicitly opts in (default False).

• components/job-orchestration/job_orchestration/executor/compress/compression_task.py (around lines 244–253):

     compression_cmd = [
         str(clp_home / "bin" / "clp-s"),
         "c", str(archive_output_dir),
         "--print-archive-stats",
         "--target-encoded-size",
         str(clp_config.output.target_segment_size + clp_config.output.target_dictionaries_size),
         "--compression-level", str(clp_config.output.compression_level),
-        "--retain-float-format",
     ]
     # fmt: on

+    # only include the flag when retain_float_format is true (default: False)
+    if getattr(clp_config.output, "retain_float_format", False):
+        compression_cmd.append("--retain-float-format")

• Configuration schema (e.g. in scheduler/config or your config definitions): add

output:
  retain_float_format: false

• Ensure your ClpIoConfig (or equivalent) exposes retain_float_format: bool with a default of False.

components/core/src/clp_s/SchemaReader.cpp (1)

37-44: FormattedFloat timestamps not handled (potential crash or incorrect metadata)

mark_column_as_timestamp handles Float but not FormattedFloat. If a timestamp column can be encoded as FormattedFloat under the new flag, m_get_timestamp won’t be set, leading to incorrect timestamps or dereferencing an uninitialized functor in get_next_message_with_metadata.

Consider handling FormattedFloat analogously to Float:

     } else if (m_timestamp_column->get_type() == NodeType::Float) {
         m_get_timestamp = [this]() {
             return static_cast<epochtime_t>(
                     std::get<double>(static_cast<FloatColumnReader*>(m_timestamp_column)
                                              ->extract_value(m_cur_message))
             );
         };
+    } else if (m_timestamp_column->get_type() == NodeType::FormattedFloat) {
+        m_get_timestamp = [this]() {
+            return static_cast<epochtime_t>(
+                    std::get<double>(static_cast<FormattedFloatColumnReader*>(m_timestamp_column)
+                                             ->extract_value(m_cur_message))
+            );
+        };
     }

If FormattedFloatColumnReader does not expose extract_value as a double, we should add an accessor that returns the underlying numeric value rather than parsing the formatted string.

components/core/tests/test-clp_s-end_to_end.cpp (1)

110-142: Security concern: Command injection vulnerability in shell commands

The literallyCompare function constructs shell commands using string interpolation without proper escaping of file paths. If file paths contain special characters (spaces, semicolons, backticks, etc.), this could lead to command injection.

Use proper escaping or consider using C++ file operations instead of shell commands:

+#include <fstream>
+#include <algorithm>
+#include <sstream>
+
+// Helper function to escape shell arguments
+std::string escape_shell_arg(const std::string& arg) {
+    std::string escaped = "'";
+    for (char c : arg) {
+        if (c == '\'') {
+            escaped += "'\\''";
+        } else {
+            escaped += c;
+        }
+    }
+    escaped += "'";
+    return escaped;
+}
+
 void literallyCompare(
         std::filesystem::path const& expected_output_json_path,
         std::filesystem::path const& extracted_json_path
 ) {
     auto command = fmt::format(
-            "cat {} | awk 'NF' | tr -d ' ' | sort > {}",
-            expected_output_json_path.string(),
-            cTestEndToEndExpectedOutputSortedFile
+            "cat {} | awk 'NF' | tr -d ' ' | sort > {}",
+            escape_shell_arg(expected_output_json_path.string()),
+            escape_shell_arg(std::string(cTestEndToEndExpectedOutputSortedFile))
     );
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 759f3b0 and 30c9b08.

📒 Files selected for processing (30)
  • components/core/src/clp_s/ArchiveReader.cpp (2 hunks)
  • components/core/src/clp_s/ArchiveWriter.cpp (1 hunks)
  • components/core/src/clp_s/ColumnReader.cpp (3 hunks)
  • components/core/src/clp_s/ColumnReader.hpp (1 hunks)
  • components/core/src/clp_s/ColumnWriter.cpp (1 hunks)
  • components/core/src/clp_s/ColumnWriter.hpp (2 hunks)
  • components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
  • components/core/src/clp_s/FloatFormatEncoding.hpp (1 hunks)
  • components/core/src/clp_s/JsonParser.cpp (5 hunks)
  • components/core/src/clp_s/JsonParser.hpp (2 hunks)
  • components/core/src/clp_s/JsonSerializer.hpp (1 hunks)
  • components/core/src/clp_s/SchemaReader.cpp (4 hunks)
  • components/core/src/clp_s/SchemaTree.cpp (1 hunks)
  • components/core/src/clp_s/SchemaTree.hpp (1 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/tests/clp_s_test_utils.cpp (2 hunks)
  • components/core/tests/clp_s_test_utils.hpp (2 hunks)
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp (1 hunks)
  • components/core/tests/test-clp_s-end_to_end.cpp (4 hunks)
  • components/core/tests/test-clp_s-range_index.cpp (1 hunks)
  • components/core/tests/test-clp_s-search.cpp (5 hunks)
  • components/core/tests/test_log_files/test_rounded_formatted_float.jsonl (1 hunks)
  • components/core/tests/test_log_files/test_rounded_formatted_float_rounded.jsonl (1 hunks)
  • components/core/tests/test_log_files/test_search_formatted_float.jsonl (1 hunks)
  • components/core/tests/test_log_files/test_valid_formatted_float.jsonl (1 hunks)
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1 hunks)
  • docs/src/dev-guide/design-retain-float-format.md (1 hunks)
  • docs/src/dev-guide/index.md (1 hunks)
  • docs/src/user-guide/core-clp-s.md (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit Configuration File

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/ArchiveWriter.cpp
  • components/core/src/clp_s/CommandLineArguments.cpp
  • components/core/src/clp_s/SchemaTree.hpp
  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/SchemaTree.cpp
  • components/core/src/clp_s/FloatFormatEncoding.hpp
  • components/core/src/clp_s/CommandLineArguments.hpp
  • components/core/src/clp_s/ColumnWriter.hpp
  • components/core/src/clp_s/ArchiveReader.cpp
  • components/core/src/clp_s/clp-s.cpp
  • components/core/src/clp_s/JsonSerializer.hpp
  • components/core/src/clp_s/ColumnReader.hpp
  • components/core/src/clp_s/ColumnWriter.cpp
  • components/core/tests/test-clp_s-search.cpp
  • components/core/src/clp_s/SchemaReader.cpp
  • components/core/tests/clp_s_test_utils.hpp
  • components/core/tests/test-clp_s-end_to_end.cpp
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-range_index.cpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/ColumnReader.cpp
🧠 Learnings (7)
📚 Learning: 2025-01-16T16:58:43.190Z
Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/clp_s_test_utils.cpp
  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/clp_s_test_utils.hpp
  • components/core/tests/test-clp_s-end_to_end.cpp
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-range_index.cpp
📚 Learning: 2025-08-06T23:57:49.762Z
Learnt from: gibber9809
PR: y-scope/clp#1163
File: components/core/src/clp_s/ColumnWriter.hpp:120-122
Timestamp: 2025-08-06T23:57:49.762Z
Learning: In the CLP codebase, the `encoded_log_dict_id_t` type alias in `components/core/src/clp_s/ColumnWriter.hpp` is appropriately placed as a class-scoped alias because the ColumnWriter.hpp header is already included in all places where this type is used, eliminating the need to move it to a common header.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.hpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.

Applied to files:

  • components/core/tests/test-clp_s-range_index.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
🧬 Code Graph Analysis (13)
components/core/src/clp_s/ArchiveWriter.cpp (2)
components/core/src/clp_s/ColumnReader.hpp (1)
  • FormattedFloat (161-161)
components/core/src/clp_s/ArchiveWriter.hpp (1)
  • writer (278-278)
components/core/src/clp_s/CommandLineArguments.cpp (1)
components/core/src/clp_s/CommandLineArguments.hpp (1)
  • m_retain_float_format (105-105)
components/core/src/clp_s/SchemaTree.hpp (1)
components/core/src/clp_s/ColumnReader.hpp (1)
  • FormattedFloat (161-161)
components/core/src/clp_s/SchemaTree.cpp (1)
components/core/src/clp_s/ColumnReader.hpp (1)
  • FormattedFloat (161-161)
components/core/src/clp_s/ArchiveReader.cpp (2)
components/core/src/clp_s/ColumnReader.hpp (1)
  • FormattedFloat (161-161)
components/core/src/clp_s/SchemaReader.hpp (5)
  • column_reader (122-122)
  • column_reader (128-128)
  • column_reader (204-204)
  • column_reader (209-211)
  • column_reader (209-209)
components/core/src/clp_s/clp-s.cpp (1)
components/core/src/clp_s/ArchiveWriter.hpp (1)
  • option (146-146)
components/core/src/clp_s/ColumnReader.hpp (1)
components/core/src/clp_s/Utils.hpp (2)
  • start (223-225)
  • start (223-223)
components/core/src/clp_s/ColumnWriter.cpp (2)
components/core/src/clp_s/ParsedMessage.hpp (4)
  • add_value (35-37)
  • add_value (35-35)
  • value (64-66)
  • value (64-64)
components/core/src/clp_s/ColumnWriter.hpp (9)
  • value (29-29)
  • value (59-59)
  • value (76-76)
  • value (94-94)
  • value (111-111)
  • value (129-129)
  • value (156-156)
  • value (213-213)
  • value (231-231)
components/core/tests/test-clp_s-search.cpp (3)
components/core/tests/test-clp_s-range_index.cpp (6)
  • get_test_input_path_relative_to_tests_dir (36-36)
  • get_test_input_path_relative_to_tests_dir (52-54)
  • get_test_input_path_relative_to_tests_dir (52-52)
  • get_test_input_local_path (37-37)
  • get_test_input_local_path (56-60)
  • get_test_input_local_path (56-56)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (13-59)
  • compress_archive (13-20)
components/core/tests/clp_s_test_utils.hpp (1)
  • compress_archive (23-30)
components/core/src/clp_s/SchemaReader.cpp (6)
components/core/src/clp_s/JsonSerializer.hpp (4)
  • column (133-136)
  • column (133-133)
  • column (138-143)
  • column (139-139)
components/core/src/clp_s/search/Projection.hpp (1)
  • column (42-42)
components/core/src/clp_s/search/SchemaMatch.hpp (4)
  • column (94-94)
  • column (101-101)
  • column (154-154)
  • column (161-161)
components/core/src/clp_s/search/ast/ColumnDescriptor.hpp (1)
  • static_cast (243-243)
components/core/src/clp_s/SchemaReader.hpp (2)
  • m_cur_message (236-236)
  • m_columns (130-130)
components/core/src/clp_s/ColumnReader.hpp (1)
  • FormattedFloat (161-161)
components/core/tests/test-clp_s-end_to_end.cpp (3)
components/core/tests/test-clp_s-search.cpp (6)
  • get_test_input_path_relative_to_tests_dir (46-47)
  • get_test_input_path_relative_to_tests_dir (62-65)
  • get_test_input_path_relative_to_tests_dir (62-63)
  • get_test_input_local_path (48-48)
  • get_test_input_local_path (67-71)
  • get_test_input_local_path (67-67)
components/core/tests/test-clp_s-range_index.cpp (6)
  • get_test_input_path_relative_to_tests_dir (36-36)
  • get_test_input_path_relative_to_tests_dir (52-54)
  • get_test_input_path_relative_to_tests_dir (52-52)
  • get_test_input_local_path (37-37)
  • get_test_input_local_path (56-60)
  • get_test_input_local_path (56-56)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (13-59)
  • compress_archive (13-20)
components/core/src/clp_s/JsonParser.cpp (5)
components/core/src/clp_s/ArchiveWriter.hpp (2)
  • parent_node_id (171-171)
  • parent_node_id (179-179)
components/core/src/clp_s/CommandLineArguments.hpp (1)
  • m_retain_float_format (105-105)
components/core/src/clp_s/ColumnReader.hpp (2)
  • FormattedFloat (161-161)
  • Float (138-138)
components/core/src/clp_s/SchemaTree.hpp (1)
  • parent_node_id (131-131)
components/core/src/clp_s/JsonParser.hpp (3)
  • line (95-95)
  • line (178-178)
  • line (185-185)
components/core/src/clp_s/ColumnReader.cpp (2)
components/core/src/clp_s/ColumnReader.hpp (9)
  • reader (35-35)
  • reader (80-80)
  • reader (103-103)
  • reader (136-136)
  • reader (159-159)
  • reader (219-219)
  • reader (251-251)
  • reader (301-301)
  • reader (338-338)
components/core/src/clp_s/Utils.hpp (4)
  • start (223-225)
  • start (223-223)
  • i (217-221)
  • i (217-217)
🪛 LanguageTool
docs/src/user-guide/core-clp-s.md

[style] ~82-~82: ‘exactly the same’ might be wordy. Consider a shorter alternative.
Context: ...at` flag to output decompressed data in exactly the same format it had before compression. For e...

(EN_WORDINESS_PREMIUM_EXACTLY_THE_SAME)

docs/src/dev-guide/design-retain-float-format.md

[uncategorized] ~29-~29: Loose punctuation mark.
Context: ...ponent is denoted by E or e. * 00: Not scientific * 01: Scientific notat...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~40-~40: Loose punctuation mark.
Context: ...whether the exponent has a sign: * 00: No sign * 01: + * 10: - For ex...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/core-clp-s.md

83-83: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)


180-180: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)


181-181: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)

docs/src/dev-guide/design-retain-float-format.md

18-18: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


29-29: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


30-30: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


31-31: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


40-40: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


41-41: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


42-42: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


50-50: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)


73-73: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


74-74: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


75-75: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


76-76: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
🔇 Additional comments (36)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)

201-225: No action needed – clp doesn’t yet support --retain-float-format
I’ve confirmed that only the CLP_S path (_make_clp_s_command_and_env) appends --retain-float-format and that the clp binary (used by _make_clp_command_and_env) does not currently accept this flag. Once upstream adds support for --retain-float-format to clp, we can mirror the same gated logic in _make_clp_command_and_env.

components/core/src/clp_s/CommandLineArguments.hpp (2)

105-106: LGTM: Getter for retain-float-format flag

Accessor aligns with existing getter patterns and provides the necessary plumbing for the new CLI option.


186-186: LGTM: Default-initialized retain-float-format flag

Defaulting to false preserves existing behaviour unless the flag is explicitly set.

components/core/tests/test_log_files/test_search_formatted_float.jsonl (1)

1-13: LGTM: Good coverage of formatted-float search cases

Well-chosen values to exercise signs, exponent formatting (including leading zeros), rounding boundaries, and small/large magnitudes.

components/core/src/clp_s/JsonSerializer.hpp (1)

19-27: Confirm enum Op value stability for backward compatibility

AddFormattedFloatField and AddFormattedFloatValue were inserted mid-enum in JsonSerializer.hpp. If any component persists or transmits Op by its numeric value (for example, on-disk templates, cached op streams or IPC), inserting new entries in the middle will shift all subsequent values and break compatibility. If Op is used strictly in-memory within a single process, this change is safe. Otherwise, new ops should be appended at the end or given explicit values to preserve existing ordinal assignments.

Files to review:

  • components/core/src/clp_s/JsonSerializer.hpp (enum Op, around lines 19–27)

Suggested diff for safer ordering:

     AddNullField,
     AddIntValue,
     AddFloatValue,
     AddFormattedFloatValue,
     AddBoolValue,
     AddStringField,
     AddArrayField,
     AddNullField,
     AddIntValue,
     AddFloatValue,
-    AddFormattedFloatField,
-    AddBoolValue,
-    AddStringField,
-    AddArrayField,
-    AddNullField,
-    AddIntValue,
-    AddFloatValue,
-    AddFormattedFloatValue,
-    AddBoolValue,
+    // New float formatting ops appended to preserve existing values
+    AddFormattedFloatField,
+    AddFormattedFloatValue,

Please confirm that Op values are never serialized or persisted by numeric value.

components/core/src/clp_s/SchemaTree.cpp (1)

17-17: LGTM! FormattedFloat maps to FloatT correctly.

The addition of the FormattedFloat case to map to LiteralType::FloatT is appropriate since formatted floats are still floating-point literals from a search perspective, even though they carry additional formatting metadata.

components/core/tests/test_log_files/test_rounded_formatted_float.jsonl (1)

1-42: Comprehensive test coverage for float formatting edge cases.

The test data covers a good range of precision and formatting edge cases including:

  • Values exceeding precision limits
  • Leading/trailing zeros
  • Scientific notation with various exponent formats
  • Negative values and signs
  • Mantissas greater than 10

This will help validate the FormattedFloat feature's ability to preserve formatting across these challenging scenarios.

components/core/tests/clp_s_test_utils.hpp (1)

17-17: LGTM! Parameter addition is well-documented.

The retain_float_format parameter is properly documented and positioned logically in the parameter list after the archive directory.

Also applies to: 26-26

components/core/src/clp_s/ArchiveWriter.cpp (1)

319-321: Add FormattedFloat column writer case — looks correct

Switch case cleanly mirrors existing Float handling and appends the correct ColumnWriter type, with a break to avoid fall-through.

docs/src/dev-guide/index.md (1)

86-87: Documentation file confirmed
Verified that docs/src/dev-guide/design-retain-float-format.md exists and is correctly referenced in docs/src/dev-guide/index.md.

components/core/src/clp_s/JsonParser.hpp (1)

228-229: m_retain_float_format correctly assigned and used

The constructor in JsonParser.cpp (line 94) initializes m_retain_float_format from option.retain_float_format, and both parsing paths respect this flag:

  • In parse()/parse_from_ir(), the flag guards whether to capture the raw JSON token or the numeric value (lines 205, 293).
  • In the IR-to-archive logic, it switches between NodeType::FormattedFloat and NodeType::Float based on the flag (line 409).

No changes required.

components/core/src/clp_s/clp-s.cpp (2)

107-108: Plumbing retain_float_format from CLI — LGTM

Correctly propagates the CLI flag into JsonParserOption; no behavioural issues spotted.


107-108: CLI flag --retain-float-format is correctly defined and documented

I’ve verified that:

  • get_retain_float_format() is implemented in components/core/src/clp_s/CommandLineArguments.hpp
  • The --retain-float-format option is registered in components/core/src/clp_s/CommandLineArguments.cpp with po::bool_switch(&m_retain_float_format)
  • The help text "Store the format of float numbers for lossless decompression." is included in the usage output

No further changes required.

components/core/tests/test-clp_s-delta-encode-log-order.cpp (2)

67-74: Adapt test to new compress_archive signature — LGTM

New boolean parameter set to false fits the default for this test and maintains previous behaviour.


67-74: compress_archive signature and calls are consistent
The compress_archive helper now takes six parameters (file_path, archive_directory, retain_float_format, single_file_archive, structurize_arrays, file_type), and all test callsites in:

  • test-clp_s-search.cpp
  • test-clp_s-range_index.cpp
  • test-clp_s-end_to_end.cpp
  • test-clp_s-delta-encode-log-order.cpp

are updated to pass the retain_float_format boolean in the third position. No further changes needed.

components/core/src/clp_s/CommandLineArguments.cpp (2)

251-253: Add CLI flag to retain float format — LGTM

The addition is consistent with surrounding options and uses the correct bool_switch pattern. Nice placement and concise help text.


251-254: Clarify or enforce --retain-float-format scope for KV-IR

It isn’t obvious whether --retain-float-format is supported when --file-type kv-ir. We should either update the help text to call out JSON-only behaviour or add a parse-time check to error out on KV-IR, to avoid user confusion.

Suggestions:

  • If KV-IR does not support retaining the original float syntax, tweak the help string:
     components/core/src/clp_s/CommandLineArguments.cpp
     ────────────────────────────────────────────────────────────────
     -                    "Store the format of float numbers for lossless decompression."
     +                    "Store the original textual format of floating-point numbers for lossless decompression (JSON input only)."
  • Alternatively, enforce at parse time:
     if (m_retain_float_format && m_file_type == FileType::KV_IR) {
         throw po::validation_error{
             po::validation_error::invalid_option_value,
             "retain-float-format", "only valid for JSON input"
         };
     }
  • If KV-IR does already honour this flag, feel free to ignore this comment.
components/core/src/clp_s/ArchiveReader.cpp (2)

197-199: Support for FormattedFloat in ordered columns — LGTM

The new case mirrors Float handling and correctly instantiates FormattedFloatColumnReader.


253-255: Support for FormattedFloat in unordered columns — LGTM

Consistent with the ordered-path addition and other column types.

components/core/tests/test-clp_s-range_index.cpp (1)

223-231: Updated compress_archive call site to include retain_float_format — LGTM

Passing false keeps previous behaviour for this test while aligning with the new signature.

components/core/tests/clp_s_test_utils.cpp (1)

16-20: Plumbing retain_float_format through test helper — LGTM

Signature and the internal option wiring look correct and maintain existing defaults at call sites that pass false.

Also applies to: 40-41

components/core/src/clp_s/SchemaReader.cpp (4)

130-149: Ensure restored format yields valid JSON numeric tokens

Appending the restore_format result directly as a JSON value is correct for standard numeric tokens. Please verify that restore_format never returns non-JSON numerics (e.g., NaN/Inf forms) unquoted; otherwise, output may become invalid JSON.


467-471: Template support for FormattedFloat in structured arrays — LGTM

Adds the correct serializer op and column push to keep ordering consistent.


557-561: Template support for FormattedFloat in structured objects — LGTM

Matches existing Float handling using the new formatted ops.


671-675: Template support for FormattedFloat in top-level JSON generation — LGTM

Consistent with array/object cases and integrates with projection/local-tree generation.

components/core/src/clp_s/JsonParser.cpp (5)

205-218: LGTM! Consistent implementation of formatted float handling.

The implementation correctly handles the new retain_float_format option by storing the raw JSON token for floats when the flag is enabled, and using the new FormattedFloat node type.


293-302: LGTM! Consistent handling in array parsing.

The array parsing logic mirrors the object parsing implementation, correctly handling formatted floats when the flag is enabled.


409-410: LGTM! Clean integration with existing type resolution.

The type resolution logic cleanly integrates the new FormattedFloat type when retain_float_format is enabled.


431-436: LGTM! Proper handling of formatted float values.

The implementation correctly stores the raw token string for formatted floats while preserving numeric value handling for timestamp ingestion.


1056-1076: LGTM! Good addition of user-defined metadata support.

The implementation properly checks for user-defined metadata in the IR stream and adds each key-value pair to the current range, with appropriate error handling.

components/core/tests/test-clp_s-search.cpp (3)

46-48: LGTM! Good parameterization of test input paths.

The new helper functions improve test maintainability by allowing different test input files to be specified dynamically.


122-128: LGTM! Proper handling of empty expected results.

The validation logic correctly handles the case where no results are expected, improving test coverage for negative test cases.


259-295: LGTM! Comprehensive test coverage for formatted floats.

The new test case provides thorough coverage of formatted float search functionality with various query types and edge cases. The test correctly uses the retain_float_format flag when compressing the test archive.

components/core/src/clp_s/ColumnWriter.hpp (1)

102-118: LGTM! Well-structured column writer for formatted floats.

The FormattedFloatColumnWriter class follows the established pattern of other column writers and properly stores both the numeric values and their format information. The dual-vector approach (m_values and m_format) is appropriate for maintaining per-value formatting.

components/core/src/clp_s/ColumnWriter.cpp (1)

53-63: Inefficient string trimming algorithm

The current approach modifies the string in-place using erase and remove_if, which requires shifting all subsequent characters. This is O(n²) in the worst case for a string with many characters to remove.

Consider building a new string with only valid characters instead:

-    // Trim the raw string
-    float_str.erase(
-            std::remove_if(
-                    float_str.begin(),
-                    float_str.end(),
-                    [](char c) {
-                        return !std::isdigit(static_cast<unsigned char>(c)) && '.' != c && '+' != c
-                               && '-' != c && 'E' != c && 'e' != c;
-                    }
-            ),
-            float_str.end()
-    );
+    // Build a trimmed string with only valid characters
+    std::string trimmed_str;
+    trimmed_str.reserve(float_str.length());
+    for (char c : float_str) {
+        if (std::isdigit(static_cast<unsigned char>(c)) || '.' == c || '+' == c 
+            || '-' == c || 'E' == c || 'e' == c) {
+            trimmed_str.push_back(c);
+        }
+    }
+    float_str = std::move(trimmed_str);

Likely an incorrect or invalid review comment.

components/core/src/clp_s/ColumnReader.cpp (1)

201-204: Prefer false == std::isdigit pattern per coding guidelines

The code uses false == std::isdigit(sci_str[0]) which follows the project's coding guidelines for boolean comparisons.

@anlowee anlowee force-pushed the xwei/formatted-double branch from b19a245 to 0090052 Compare August 13, 2025 16:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🔭 Outside diff range comments (1)
components/core/tests/test-clp_s-end_to_end.cpp (1)

110-142: Avoid using tr -d ' ' which removes all spaces from JSON values

The literallyCompare function uses tr -d ' ' to remove all spaces, which could mask issues where JSON string values legitimately contain spaces. This approach might cause false positives in test validation.

Consider using jq for proper JSON normalization instead:

 void literallyCompare(
         std::filesystem::path const& expected_output_json_path,
         std::filesystem::path const& extracted_json_path
 ) {
+    // Use jq to normalize JSON format while preserving string values
     auto command = fmt::format(
-            "cat {} | awk 'NF' | tr -d ' ' | sort > {}",
+            "jq --sort-keys --compact-output '.' {} | sort > {}",
             expected_output_json_path.string(),
             cTestEndToEndExpectedOutputSortedFile
     );
     auto result = std::system(command.c_str());
     REQUIRE((0 == result));
     REQUIRE((false == std::filesystem::is_empty(cTestEndToEndExpectedOutputSortedFile)));

     command = fmt::format(
-            "cat {} | awk 'NF' | tr -d ' ' | sort > {}",
+            "jq --sort-keys --compact-output '.' {} | sort > {}",
             extracted_json_path.string(),
             cTestEndToEndOutputSortedJson
     );
     result = std::system(command.c_str());
     REQUIRE((0 == result));
     REQUIRE((false == std::filesystem::is_empty(cTestEndToEndOutputSortedJson)));
♻️ Duplicate comments (3)
components/core/tests/test_log_files/test_rounded_formatted_float_rounded.jsonl (1)

28-35: Fix inconsistent JSON formatting

Lines 28-35 use compact JSON formatting without spaces around colons, while the rest of the file uses spaced formatting. This inconsistency could affect parsing tests if the JSON parser is sensitive to whitespace variations.

Apply this diff to maintain consistent formatting:

-{ "scientificWithGreaterThanTenMantissa":1.234567891234567E11 }
-{ "negativeScientificWithGreaterThanTenMantissa":-1.234567891234567E11 }
-{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision":-1.234567891234568E11 }
-{ "scientificWithGreaterThanTenMantissaExceedingPrecision":1.234567891234568E11 }
+{ "scientificWithGreaterThanTenMantissa": 1.234567891234567E11 }
+{ "negativeScientificWithGreaterThanTenMantissa": -1.234567891234567E11 }
+{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision": -1.234567891234568E11 }
+{ "scientificWithGreaterThanTenMantissaExceedingPrecision": 1.234567891234568E11 }
-{ "scientificWithGreaterThanTenMantissa3":1.234567891234567E+11 }
-{ "negativeScientificWithGreaterThanTenMantissa3":-1.234567891234567E+11 }
+{ "scientificWithGreaterThanTenMantissa3": 1.234567891234567E+11 }
+{ "negativeScientificWithGreaterThanTenMantissa3": -1.234567891234567E+11 }
components/core/tests/test-clp_s-end_to_end.cpp (2)

184-193: Consider validating compression return value

The return value from compress_archive is ignored, but it likely contains important statistics about the compression operation that should be validated to ensure the test is meaningful.

Consider checking the returned statistics:

-    REQUIRE_NOTHROW(
-            std::ignore = compress_archive(
+    auto archive_stats = compress_archive(
                     get_test_input_local_path(cTestEndToEndValidFormattedFloatInputFile),
                     std::string{cTestEndToEndArchiveDirectory},
                     true,
                     single_file_archive,
                     structurize_arrays,
                     clp_s::FileType::Json
-            )
     );
+    REQUIRE_FALSE(archive_stats.empty());

213-222: Consider validating compression return value

Similar to the previous test, the return value should be validated here as well.

-    REQUIRE_NOTHROW(
-            std::ignore = compress_archive(
+    auto archive_stats = compress_archive(
                     get_test_input_local_path(cTestEndToEndRoundedFormattedFloatInputFile),
                     std::string{cTestEndToEndArchiveDirectory},
                     true,
                     single_file_archive,
                     structurize_arrays,
                     clp_s::FileType::Json
-            )
     );
+    REQUIRE_FALSE(archive_stats.empty());
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b19a245 and 0090052.

📒 Files selected for processing (9)
  • components/core/src/clp_s/ColumnReader.cpp (3 hunks)
  • components/core/src/clp_s/ColumnReader.hpp (1 hunks)
  • components/core/src/clp_s/ColumnWriter.cpp (1 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
  • components/core/tests/test-clp_s-end_to_end.cpp (4 hunks)
  • components/core/tests/test_log_files/test_rounded_formatted_float_rounded.jsonl (1 hunks)
  • components/core/tests/test_log_files/test_valid_formatted_float.jsonl (1 hunks)
  • docs/src/dev-guide/design-retain-float-format.md (1 hunks)
  • docs/src/user-guide/core-clp-s.md (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit Configuration File

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/CommandLineArguments.hpp
  • components/core/src/clp_s/ColumnWriter.cpp
  • components/core/src/clp_s/ColumnReader.hpp
  • components/core/src/clp_s/ColumnReader.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
🧠 Learnings (10)
📚 Learning: 2025-08-13T15:25:40.805Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:49-127
Timestamp: 2025-08-13T15:25:40.805Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, the FormattedFloatColumnWriter::add_value method intentionally allows std::stod to throw exceptions (std::invalid_argument or std::out_of_range) without catching them, as this is designed to terminate decompression when encountering invalid float strings for data integrity purposes.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
  • components/core/src/clp_s/ColumnReader.hpp
  • components/core/src/clp_s/ColumnReader.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-07T21:38:35.979Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-07T21:38:35.979Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-12-19T01:40:13.272Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.hpp:65-75
Timestamp: 2024-12-19T01:40:13.272Z
Learning: "try_get_pos" method in `Compressor.hpp` does not throw exceptions, as clarified by Bill-hbrhbr.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-14T01:03:27.144Z
Learnt from: LinZhihao-723
PR: y-scope/clp#450
File: components/core/src/clp/streaming_archive/reader/Segment.cpp:57-58
Timestamp: 2024-11-14T01:03:27.144Z
Learning: In `components/core/src/clp/streaming_archive/reader/Segment.cpp`, the method `m_decompressor.open()` does not have a return value.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-29T22:51:00.355Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:141-142
Timestamp: 2024-11-29T22:51:00.355Z
Learning: In 'components/core/tests/test-clp_s-end_to_end.cpp', using `WIFEXITED` and `WEXITSTATUS` with the return value of `std::system` is acceptable for checking the exit status of system commands, as per the C++ documentation.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
🧬 Code Graph Analysis (3)
components/core/src/clp_s/ColumnWriter.cpp (3)
components/core/src/clp_s/ParsedMessage.hpp (4)
  • add_value (35-37)
  • add_value (35-35)
  • value (64-66)
  • value (64-64)
components/core/src/clp_s/ColumnWriter.hpp (18)
  • value (29-29)
  • value (59-59)
  • value (76-76)
  • value (94-94)
  • value (111-111)
  • value (129-129)
  • value (156-156)
  • value (213-213)
  • value (231-231)
  • compressor (35-35)
  • compressor (61-61)
  • compressor (78-78)
  • compressor (96-96)
  • compressor (113-113)
  • compressor (131-131)
  • compressor (158-158)
  • compressor (215-215)
  • compressor (233-233)
components/core/src/clp_s/SchemaWriter.cpp (2)
  • store (29-33)
  • store (29-29)
components/core/src/clp_s/ColumnReader.hpp (1)
components/core/src/clp_s/Utils.hpp (2)
  • start (223-225)
  • start (223-223)
components/core/src/clp_s/ColumnReader.cpp (1)
components/core/src/clp_s/Utils.hpp (4)
  • start (223-225)
  • start (223-223)
  • i (217-221)
  • i (217-217)
🪛 markdownlint-cli2 (0.17.2)
docs/src/user-guide/core-clp-s.md

181-181: Trailing spaces
Expected: 0 or 2; Actual: 1

(MD009, no-trailing-spaces)

🪛 LanguageTool
docs/src/dev-guide/design-retain-float-format.md

[uncategorized] ~29-~29: Loose punctuation mark.
Context: ...ponent is denoted by E or e. - 00: Not scientific - 01: Scientific notat...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~40-~40: Loose punctuation mark.
Context: ...whether the exponent has a sign: - 00: No sign - 01: + - 10: - For ex...

(UNLIKELY_OPENING_PUNCTUATION)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
🔇 Additional comments (4)
components/core/tests/test_log_files/test_valid_formatted_float.jsonl (1)

1-41: LGTM!

The test data file provides comprehensive coverage of formatted float edge cases, including standard decimal notation, scientific notation with various exponent formats, negative values, and nested structures.

components/core/src/clp_s/CommandLineArguments.hpp (1)

105-105: LGTM!

The getter is properly marked with [[nodiscard]] to match other getters in the class, preventing accidental ignoring of the return value.

docs/src/user-guide/core-clp-s.md (2)

29-30: LGTM!

The new --retain-float-format option is clearly documented with appropriate placement in the compression options list.


81-85: LGTM!

The tip provides a clear and concise explanation of the --retain-float-format feature with helpful examples.

@anlowee
Copy link
Member Author

anlowee commented Sep 11, 2025

LGTM, but I can't approve the PR created by me. @kirkrodrigues @LinZhihao-723 could either of you approve this PR?

@gibber9809 gibber9809 self-requested a review September 11, 2025 19:44
gibber9809
gibber9809 previously approved these changes Sep 11, 2025
Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deffering to @anlowee review of my changes to his branch.

wraymo
wraymo previously approved these changes Sep 12, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (15)
components/core/CMakeLists.txt (1)

362-363: Avoid duplicate compilation/ODR: remove FloatFormatEncoding from unit test sources or stop linking the production target

These sources appear to already be compiled into the production clp_s library; compiling them again into unitTest risks duplicate symbols/ODR errors. Keep them only in one place (preferred: in the production library, and remove here).

Apply this diff to drop the duplicate entries from SOURCE_FILES_clp_s_unitTest:

-    src/clp_s/FloatFormatEncoding.cpp
-    src/clp_s/FloatFormatEncoding.hpp

If, for any reason, they’re not in the production target, move them there instead and keep them out of unitTest’s sources.

components/core/tests/test_log_files/test_invalid_formatted_float.jsonl (3)

18-25: Expand exponent normalisation coverage: add lowercase ‘e’ variants and explicit ±0 exponents.

Mirror Lines 18–25 with lowercase ‘e’, and add explicit zero-exponent cases with leading zeros for both signs. This aligns with prior feedback and hardens parser paths that normalise exponent forms.

+{ "scientificPositiveExponentExceedingDigitLimitWithLeadingZeros_lowercase": 1.234567891234567e00008 }
+{ "negativeScientificPositiveExponentExceedingDigitLimitWithLeadingZeros_lowercase": -1.234567891234567e00008 }
+{ "scientificPositiveExponentExceedingDigitLimitWithLeadingZerosWithPlus_lowercase": 1.234567891234567e+00008 }
+{ "negativeScientificPositiveExponentExceedingDigitLimitWithLeadingZerosWithPlus_lowercase": -1.234567891234567e+00008 }
+{ "scientificNegativeExponentExceedingDigitLimitWithLeadingZeros_lowercase": 1.234567891234567e-00008 }
+{ "negativeScientificNegativeExponentExceedingDigitLimitWithLeadingZeros_lowercase": -1.234567891234567e-00008 }
+
+{ "scientificExponentZeroWithLeadingZerosAndPlus": 1.23E+00000 }
+{ "negativeScientificExponentZeroWithLeadingZerosAndPlus": -1.23E+00000 }
+{ "scientificExponentZeroWithLeadingZerosAndMinus": 1.23E-00000 }
+{ "negativeScientificExponentZeroWithLeadingZerosAndMinus": -1.23E-00000 }
+{ "scientificExponentZeroWithLeadingZerosAndPlus_lowercase": 1.23e+00000 }
+{ "negativeScientificExponentZeroWithLeadingZerosAndPlus_lowercase": -1.23e+00000 }
+{ "scientificExponentZeroWithLeadingZerosAndMinus_lowercase": 1.23e-00000 }
+{ "negativeScientificExponentZeroWithLeadingZerosAndMinus_lowercase": -1.23e-00000 }

34-39: Rename “ZeroMantissa” keys for accuracy (mantissa is unnormalised < 1).

These aren’t zero mantissae; they’re < 1 and unnormalised. Improves readability and intent.

-{ "nonZeroScientificWithZeroMantissa": 0.00000001234567891234567E-3 }
-{ "negativeNonZeroScientificWithZeroMantissa": -0.00000001234567891234567E-3 }
-{ "nonZeroScientificWithZeroMantissaExceedingPrecision": 0.0000000123456789123456789E-3 }
-{ "negativeNonZeroScientificWithZeroMantissaExceedingPrecision": -0.0000000123456789123456789E-3 }
-{ "nonZeroScientificWithZeroMantissa2": 0.00000001234567891234567E-01 }
-{ "negativeNonZeroScientificWithZeroMantissa2": -0.00000001234567891234567E-01 }
+{ "unnormalisedMantissaLessThanOne": 0.00000001234567891234567E-3 }
+{ "negativeUnnormalisedMantissaLessThanOne": -0.00000001234567891234567E-3 }
+{ "unnormalisedMantissaLessThanOneExceedingPrecision": 0.0000000123456789123456789E-3 }
+{ "negativeUnnormalisedMantissaLessThanOneExceedingPrecision": -0.0000000123456789123456789E-3 }
+{ "unnormalisedMantissaLessThanOne2": 0.00000001234567891234567E-01 }
+{ "negativeUnnormalisedMantissaLessThanOne2": -0.00000001234567891234567E-01 }

40-43: Clarify intent for “non754ScientificFloat” and assert the specific failure path.

If these are meant to be “cannot encode into 8-byte variable form → must serialise as string”, reflect that in names and ensure tests assert that rejection reason (not a generic parse failure).

-{ "non754ScientificFloat": 1.2345678912345679e+13}
-{ "negativeNon754ScientificFloat": -1.2345678912345679e+13}
-{ "non754ScientificFloatWithNegativeExponent": 1.2345678912345679e-13}
-{ "negativeNon754ScientificFloatWithNegativeExponent": -1.2345678912345679e-13}
+{ "notEncodableToEightByteVarFloat": 1.2345678912345679e+13 }
+{ "negativeNotEncodableToEightByteVarFloat": -1.2345678912345679e+13 }
+{ "notEncodableToEightByteVarFloatWithNegativeExponent": 1.2345678912345679e-13 }
+{ "negativeNotEncodableToEightByteVarFloatWithNegativeExponent": -1.2345678912345679e-13 }

Action: confirm the e2e test that consumes this fixture checks for the encoder’s specific “fallback-to-string” path or error code, not a broad “invalid float” outcome.

components/core/tests/test-clp_s-search.cpp (8)

237-244: Name boolean/flag args in compress_archive for readability.

Inline names prevent argument-order mistakes and match style elsewhere.

     REQUIRE_NOTHROW(
             std::ignore = compress_archive(
                     get_test_input_local_path(cTestSearchInputFile),
                     std::string{cTestSearchArchiveDirectory},
                     std::string{cTestIdxKey},
-                    false,
-                    single_file_archive,
-                    structurize_arrays,
+                    /*retain_float_format=*/false,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/structurize_arrays,
                     clp_s::FileType::Json
             )
     );

48-50: Avoid duplicating helpers; add const to params.

These forward declarations duplicate helpers already available; also prefer const-correctness.

Apply locally (and consider centralising in clp_s_test_utils.hpp/.cpp):

-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path)
+auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path)
         -> std::filesystem::path;
-auto get_test_input_local_path(std::string_view test_input_path) -> std::string;
+auto get_test_input_local_path(std::string_view const test_input_path) -> std::string;

64-67: Const-correctness (and consider centralising).

Match other tests by making the parameter const; prefer moving these helpers to the shared test utils to avoid drift.

-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path)
+auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path)
         -> std::filesystem::path {
     return std::filesystem::path{cTestInputFileDirectory} / test_input_path;
 }

69-73: Const-correctness (and consider centralising).

Same here; align with the pattern used in other suites.

-auto get_test_input_local_path(std::string_view test_input_path) -> std::string {
+auto get_test_input_local_path(std::string_view const test_input_path) -> std::string {
     std::filesystem::path const current_file_path{__FILE__};
     auto const tests_dir{current_file_path.parent_path()};
     return (tests_dir / get_test_input_path_relative_to_tests_dir(test_input_path)).string();
 }

257-268: Extend coverage: add -0.0 and uppercase-exponent numeric cases.

Unquoted literals are numeric; add negative zero and 1E-16 to harden semantics.

     std::vector<std::pair<std::string, std::vector<int64_t>>> queries_and_results{
             {R"aa(NOT formattedFloatValue: 0)aa", {0, 1, 2, 6, 7, 8, 9, 10, 11, 12}},
             {R"aa(formattedFloatValue: 0)aa", {3, 4, 5}},
             {R"aa(formattedFloatValue: 1e-16)aa", {6, 7}},
+            {R"aa(formattedFloatValue: 1E-16)aa", {6, 7}},
+            {R"aa(formattedFloatValue: -0.0)aa", {3, 4, 5}},
             {R"aa(formattedFloatValue > 0.00)aa", {6, 7, 8, 9, 10, 11, 12}},

274-282: Name boolean/flag args (formatted-float case) for clarity.

Keep calls self-documenting.

             std::ignore = compress_archive(
                     get_test_input_local_path(cTestSearchFormattedFloatFile),
                     std::string{cTestSearchArchiveDirectory},
                     std::nullopt,
-                    true,
-                    single_file_archive,
-                    false,
+                    /*retain_float_format=*/true,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/false,
                     clp_s::FileType::Json
             )

285-288: Capture generated params to aid triage on failures.

Include single_file_archive in the failure context.

     for (auto const& [query, expected_results] : queries_and_results) {
-        CAPTURE(query);
+        CAPTURE(query);
+        CAPTURE(single_file_archive);
         REQUIRE_NOTHROW(search(query, false, expected_results));
     }

271-277: Parametrisation: isolate archive output per run to avoid collisions.

Parallel/parameterised runs can interfere when sharing one directory. Use a unique archive_dir per combo and pass it through.

-    TestOutputCleaner const test_cleanup{{std::string{cTestSearchArchiveDirectory}}};
+    auto const archive_dir = fmt::format("{}-ff-f{}", cTestSearchArchiveDirectory, single_file_archive);
+    TestOutputCleaner const test_cleanup{{archive_dir}};

And below:

-                    std::string{cTestSearchArchiveDirectory},
+                    archive_dir,

Note: parameterise search() to consume archive_dir instead of the global constant if you take this.

components/core/src/clp_s/ColumnWriter.cpp (3)

52-57: Strong exception-safety for parallel vectors.

If the second push throws, vectors can diverge. Make the operation atomic.

 size_t FormattedFloatColumnWriter::add_value(ParsedMessage::variable_t& value) {
     auto const& [float_value, format]{std::get<std::pair<double, float_format_t>>(value)};
-    m_values.push_back(float_value);
-    m_formats.push_back(format);
+    m_values.emplace_back(float_value);
+    try {
+        m_formats.emplace_back(format);
+    } catch (...) {
+        m_values.pop_back();
+        throw;
+    }
     return sizeof(double) + sizeof(float_format_t);
 }

67-72: Match VariableString writer: consider reservation/atomic push.

Mirror the exception-safety approach (reserve or rollback on failure) to keep IDs in sync if an allocation throws.

-    m_var_dict->add_entry(std::get<std::string>(value), id);
-    m_var_dict_ids.push_back(id);
+    m_var_dict->add_entry(std::get<std::string>(value), id);
+    try {
+        m_var_dict_ids.emplace_back(id);
+    } catch (...) {
+        // add_entry mutates dict but cannot be rolled back here; consider reserving capacity up front.
+        throw;
+    }

(Optional: reserve expected capacity in the constructor when row count is known.)


59-65: Add a runtime size check in release builds.

Assert is compiled out; guard before writing to prevent mismatched arrays.

 void FormattedFloatColumnWriter::store(ZstdCompressor& compressor) {
-    assert(m_formats.size() == m_values.size());
+    assert(m_formats.size() == m_values.size());
+    if (m_formats.size() != m_values.size()) {
+        throw OperationFailed(ErrorCode_Failure, __FILE__, __LINE__);
+    }
     auto const values_size = m_values.size() * sizeof(double);
     auto const format_size = m_formats.size() * sizeof(float_format_t);
     compressor.write(reinterpret_cast<char const*>(m_values.data()), values_size);
     compressor.write(reinterpret_cast<char const*>(m_formats.data()), format_size);
 }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0aca1e2 and f6e1024.

📒 Files selected for processing (4)
  • components/core/CMakeLists.txt (2 hunks)
  • components/core/src/clp_s/ColumnWriter.cpp (2 hunks)
  • components/core/tests/test-clp_s-search.cpp (5 hunks)
  • components/core/tests/test_log_files/test_invalid_formatted_float.jsonl (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/ColumnWriter.cpp
  • components/core/tests/test-clp_s-search.cpp
🧠 Learnings (15)
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/CMakeLists.txt
  • components/core/tests/test-clp_s-search.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2025-08-13T15:25:40.805Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:49-127
Timestamp: 2025-08-13T15:25:40.805Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, the FormattedFloatColumnWriter::add_value method intentionally allows std::stod to throw exceptions (std::invalid_argument or std::out_of_range) without catching them, as this is designed to terminate decompression when encountering invalid float strings for data integrity purposes.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2025-09-11T13:15:39.930Z
Learnt from: anlowee
PR: y-scope/clp#1176
File: components/core/src/clp_s/ColumnWriter.cpp:67-72
Timestamp: 2025-09-11T13:15:39.930Z
Learning: In components/core/src/clp_s/ColumnWriter.cpp, when using sizeof with types like variable_dictionary_id_t, always use the fully qualified namespace (clp::variable_dictionary_id_t) to match the variable declaration and maintain consistency with other column writers in the codebase.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-12-19T01:40:13.272Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/src/clp/streaming_compression/lzma/Compressor.hpp:65-75
Timestamp: 2024-12-19T01:40:13.272Z
Learning: "try_get_pos" method in `Compressor.hpp` does not throw exceptions, as clarified by Bill-hbrhbr.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2025-08-06T23:57:49.762Z
Learnt from: gibber9809
PR: y-scope/clp#1163
File: components/core/src/clp_s/ColumnWriter.hpp:120-122
Timestamp: 2025-08-06T23:57:49.762Z
Learning: In the CLP codebase, the `encoded_log_dict_id_t` type alias in `components/core/src/clp_s/ColumnWriter.hpp` is appropriately placed as a class-scoped alias because the ColumnWriter.hpp header is already included in all places where this type is used, eliminating the need to move it to a common header.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2025-01-16T16:48:30.016Z
Learnt from: LinZhihao-723
PR: y-scope/clp#653
File: components/core/src/clp/ffi/ir_stream/ir_unit_deserialization_methods.cpp:612-613
Timestamp: 2025-01-16T16:48:30.016Z
Learning: When using structured bindings with values that will be moved later (e.g., into a constructor), prefer using references (`auto&`) to avoid unnecessary copies, even when binding to the result of a temporary object that lives until the end of the full expression.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2024-10-13T09:29:56.553Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1283-1286
Timestamp: 2024-10-13T09:29:56.553Z
Learning: `clp::ffi::SchemaTree::Node::id_t` is of type `uint32_t`, so adding `1` to `INT32_MAX` is safe and does not cause integer overflow.

Applied to files:

  • components/core/src/clp_s/ColumnWriter.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/core/tests/test-clp_s-search.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.

Applied to files:

  • components/core/tests/test-clp_s-search.cpp
🧬 Code graph analysis (2)
components/core/src/clp_s/ColumnWriter.cpp (3)
components/core/src/clp_s/ParsedMessage.hpp (6)
  • add_value (41-43)
  • add_value (41-41)
  • value (80-82)
  • value (80-80)
  • value (90-92)
  • value (90-90)
components/core/src/clp_s/ColumnWriter.hpp (22)
  • value (29-29)
  • value (59-59)
  • value (76-76)
  • value (94-94)
  • value (111-111)
  • value (131-131)
  • value (149-149)
  • value (176-176)
  • value (233-233)
  • value (251-251)
  • compressor (35-35)
  • compressor (61-61)
  • compressor (78-78)
  • compressor (96-96)
  • compressor (113-113)
  • compressor (133-133)
  • compressor (151-151)
  • compressor (178-178)
  • compressor (235-235)
  • compressor (253-253)
  • id (205-208)
  • id (206-206)
components/core/src/clp_s/ArchiveReader.hpp (1)
  • m_var_dict (105-105)
components/core/tests/test-clp_s-search.cpp (2)
components/core/tests/test-clp_s-end_to_end.cpp (6)
  • get_test_input_path_relative_to_tests_dir (41-42)
  • get_test_input_path_relative_to_tests_dir (52-55)
  • get_test_input_path_relative_to_tests_dir (52-53)
  • get_test_input_local_path (43-43)
  • get_test_input_local_path (57-61)
  • get_test_input_local_path (57-57)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (14-64)
  • compress_archive (14-22)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (7)
components/core/CMakeLists.txt (1)

703-703: LGTM: add test-FloatFormatEncoding to unit tests

Test file addition looks good and fits existing Catch2 wiring.

components/core/tests/test_log_files/test_invalid_formatted_float.jsonl (2)

30-31: Rename without leading hyphen looks good.

Thanks for removing the leading dash; this avoids jq key-escaping hassles.


44-44: Verify end-of-file has no stray literal.

The diff shows a lone “44”. Please confirm the fixture ends with a newline and no extra token.

components/core/tests/test-clp_s-search.cpp (2)

44-44: LGTM: Test fixture constant added.

The per-file fixture constant is clear and consistent.


215-219: LGTM: Using per-file path helper in assertion.

The filename filter now references the parameterised path; matches expectations.

components/core/src/clp_s/ColumnWriter.cpp (2)

3-5: LGTM: Missing headers added.

Including , , makes the TU self-contained.


74-77: LGTM: Store path mirrors VariableString writer.

Size computation and write are correct.

Comment on lines +26 to +33
{ "scientificWithGreaterThanTenMantissa": 123456789.1234567E3 }
{ "negativeScientificWithGreaterThanTenMantissa": -123456789.1234567E3 }
{ "scientificWithGreaterThanTenMantissaExceedingPrecision": 123456789.123456789E3 }
{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision": -123456789.123456789E3 }
{ "scientificWithGreaterThanTenMantissa2": 12.34567891234567E01 }
{ "negativeScientificWithGreaterThanTenMantissa2": -12.34567891234567E01 }
{ "scientificWithGreaterThanTenMantissa3": 123456789.1234567E+3 }
{ "negativeScientificWithGreaterThanTenMantissa3": -123456789.1234567E+3 }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Add mantissa boundary cases at exactly 10.0 to exercise canonicalisation.

Ensure rejection/normalisation when mantissa == 10 (often canonicalised to 1.x × 10^(e+1)).

+{ "scientificMantissaBoundaryEqualToTen": 10E0 }
+{ "negativeScientificMantissaBoundaryEqualToTen": -10E0 }
+{ "scientificMantissaBoundaryEqualToTenWithFraction": 10.000E+0 }
+{ "negativeScientificMantissaBoundaryEqualToTenWithFraction": -10.000E+0 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{ "scientificWithGreaterThanTenMantissa": 123456789.1234567E3 }
{ "negativeScientificWithGreaterThanTenMantissa": -123456789.1234567E3 }
{ "scientificWithGreaterThanTenMantissaExceedingPrecision": 123456789.123456789E3 }
{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision": -123456789.123456789E3 }
{ "scientificWithGreaterThanTenMantissa2": 12.34567891234567E01 }
{ "negativeScientificWithGreaterThanTenMantissa2": -12.34567891234567E01 }
{ "scientificWithGreaterThanTenMantissa3": 123456789.1234567E+3 }
{ "negativeScientificWithGreaterThanTenMantissa3": -123456789.1234567E+3 }
{ "scientificWithGreaterThanTenMantissa": 123456789.1234567E3 }
{ "negativeScientificWithGreaterThanTenMantissa": -123456789.1234567E3 }
{ "scientificWithGreaterThanTenMantissaExceedingPrecision": 123456789.123456789E3 }
{ "negativeScientificWithGreaterThanTenMantissaExceedingPrecision": -123456789.123456789E3 }
{ "scientificWithGreaterThanTenMantissa2": 12.34567891234567E01 }
{ "negativeScientificWithGreaterThanTenMantissa2": -12.34567891234567E01 }
{ "scientificWithGreaterThanTenMantissa3": 123456789.1234567E+3 }
{ "negativeScientificWithGreaterThanTenMantissa3": -123456789.1234567E+3 }
{ "scientificMantissaBoundaryEqualToTen": 10E0 }
{ "negativeScientificMantissaBoundaryEqualToTen": -10E0 }
{ "scientificMantissaBoundaryEqualToTenWithFraction": 10.000E+0 }
{ "negativeScientificMantissaBoundaryEqualToTenWithFraction": -10.000E+0 }

@gibber9809 gibber9809 dismissed stale reviews from wraymo and themself via 66a7e8e September 15, 2025 18:00
@gibber9809 gibber9809 changed the title feat: Add format information for float numbers to support lossless decompression. feat(clp-s): Add support for losslessly compression floating point numbers; Bump archive formation version to v0.4.1. Sep 15, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f6e1024 and 66a7e8e.

📒 Files selected for processing (1)
  • components/core/src/clp_s/SingleFileArchiveDefs.hpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/SingleFileArchiveDefs.hpp
🧠 Learnings (1)
📚 Learning: 2025-01-28T03:02:30.542Z
Learnt from: davemarco
PR: y-scope/clp#698
File: components/core/src/clp/streaming_archive/Constants.hpp:9-9
Timestamp: 2025-01-28T03:02:30.542Z
Learning: In the CLP project's archive version format (semver), the patch version uses 16 bits (uint16_t), while major version uses 8 bits and minor version uses 16 bits.

Applied to files:

  • components/core/src/clp_s/SingleFileArchiveDefs.hpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
🔇 Additional comments (1)
components/core/src/clp_s/SingleFileArchiveDefs.hpp (1)

14-14: Confirm archive‑compatibility: is a patch bump sufficient?

  • Version constants: 0.4.1 — components/core/src/clp_s/SingleFileArchiveDefs.hpp.
  • ArchiveWriter packs header.version as (major<<24)|(minor<<16)|patch — components/core/src/clp_s/ArchiveWriter.cpp.
  • FormattedFloat / DictionaryFloat are only emitted when retain_float_format is enabled (components/core/src/clp_s/JsonParser.cpp); writers/readers for these types are present (ArchiveWriter.cpp, ArchiveReader.cpp, ColumnWriter/Reader, SchemaReader/SchemaTree).
  • I did not find reader-side logic that explicitly validates ArchiveHeader.version and rejects newer archive formats.

Action: confirm whether released/older readers validate the archive header and will refuse archives containing unknown NodeType. If they do not, this is a backward‑incompatible format change and requires at least a minor version bump; if they do, a patch bump is acceptable.

Comment on lines 12 to 15
constexpr uint8_t cArchiveMajorVersion = 0;
constexpr uint8_t cArchiveMinorVersion = 4;
constexpr uint16_t cArchivePatchVersion = 0;
constexpr uint16_t cArchivePatchVersion = 1;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Document and enforce the version bit layout; add a packed constant.

This header stores version as uint32_t. Please document the 8/8/16 layout (major/minor/patch) and provide a single packed constant to keep writer/reader logic in sync. Also, confirm the intentional difference from the streaming archive’s learnt convention (which reportedly uses a 16‑bit minor); if that applies here, adjust accordingly.

Apply:

 constexpr uint8_t cArchiveMajorVersion = 0;
-constexpr uint8_t cArchiveMinorVersion = 4;
-constexpr uint16_t cArchivePatchVersion = 1;
+constexpr uint8_t  cArchiveMinorVersion = 4;
+constexpr uint16_t cArchivePatchVersion = 1;
+// Version packing: [8 bits major][8 bits minor][16 bits patch] -> uint32_t
+// Keep this in sync with reader/writer checks.
+constexpr uint32_t cArchiveVersionPacked =
+    (static_cast<uint32_t>(cArchiveMajorVersion) << 24) |
+    (static_cast<uint32_t>(cArchiveMinorVersion) << 16) |
+    static_cast<uint32_t>(cArchivePatchVersion);

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
components/core/CMakeLists.txt (1)

360-361: Prevent ODR/duplicate‑symbol errors: don’t compile FloatFormatEncoding into unitTest if already part of a linked clp_s target.

These sources appear to be compiled into production (per prior review); compiling them again into unitTest risks duplicate symbols. Prefer linking the production target that owns FloatFormatEncoding and remove these from SOURCE_FILES_clp_s_unitTest.

Apply this diff:

 set(SOURCE_FILES_clp_s_unitTest
@@
-    src/clp_s/FloatFormatEncoding.cpp
-    src/clp_s/FloatFormatEncoding.hpp

Run to verify ownership and linkages:

#!/bin/bash
set -euo pipefail

echo "Occurrences of FloatFormatEncoding in CMake:"
rg -n --type cmake -C2 'FloatFormatEncoding\.(cpp|hpp)' components/core

echo -e "\nProduction target(s) that compile it:"
rg -nPU '(?s)add_(library|executable)\([^)]*\).*?FloatFormatEncoding\.cpp' components/core/src/clp_s/CMakeLists.txt || true

echo -e "\nunitTest link interface (confirm we link the owning target instead of compiling the TU again):"
sed -n '730,774p' components/core/CMakeLists.txt
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 66a7e8e and c0b7e84.

📒 Files selected for processing (2)
  • components/core/CMakeLists.txt (2 hunks)
  • components/core/src/clp_s/CMakeLists.txt (4 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/CMakeLists.txt
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2025-08-25T07:27:19.449Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1262
File: docs/src/dev-guide/components-core/index.md:50-50
Timestamp: 2025-08-25T07:27:19.449Z
Learning: In the CLP project, nlohmann_json version 3.11.3 is installed through their dependency management process, but they don't enforce version constraints in CMake find_package calls - they control versions through installation rather than CMake version pinning.

Applied to files:

  • components/core/src/clp_s/CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
🔇 Additional comments (3)
components/core/CMakeLists.txt (1)

705-705: LGTM: new FloatFormatEncoding test is wired into unitTest.

Once the duplicate‑compilation above is resolved, ensure unitTest links the production target that provides FloatFormatEncoding so this test still builds.

components/core/src/clp_s/CMakeLists.txt (2)

273-273: Keep ystdlib::error_handling as PUBLIC

FloatFormatEncoding.hpp publicly includes <ystdlib/error_handling/Result.hpp> and exposes ystdlib::error_handling::Result in its API (components/core/src/clp_s/FloatFormatEncoding.hpp:9,54,57), so clp_s_archive_writer must retain PUBLIC linkage.

Likely an incorrect or invalid review comment.


337-337: Keep ystdlib::error_handling PUBLIC — it's part of the reader's public API

components/core/src/clp_s/FloatFormatEncoding.hpp includes <ystdlib/error_handling/Result.hpp> and exposes ystdlib::error_handling::Result in public signatures (components/core/src/clp_s/FloatFormatEncoding.hpp: lines 9, 54–57); do not change to PRIVATE.

Likely an incorrect or invalid review comment.

Comment on lines +229 to 231
FloatFormatEncoding.cpp
FloatFormatEncoding.hpp
JsonFileIterator.cpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Adding FloatFormatEncoding to writer sources: SGTM

Good to see the encoding module compiled into the writer.

If this module will be shared broadly (writer, reader, json constructor, etc.), consider extracting it into a tiny common target (e.g., clp_s::float_format) and linking both writer/reader to it to avoid repeated compilation. Not urgent given existing project patterns.

🤖 Prompt for AI Agents
In components/core/src/clp_s/CMakeLists.txt around lines 229-231,
FloatFormatEncoding sources were added directly to the writer target; to avoid
duplicate compilation when this module is used by multiple targets, create a
small common library target (e.g., clp_s::float_format) that contains
FloatFormatEncoding.cpp/.hpp and any related sources/headers, set its include
directories and target properties, then remove those files from the writer
sources and link both the writer and any other consumers (reader, json
constructor) against clp_s::float_format using target_link_libraries so they
share a single compiled implementation.

Comment on lines +299 to 301
FloatFormatEncoding.cpp
FloatFormatEncoding.hpp
JsonSerializer.hpp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Adding FloatFormatEncoding to reader sources: SGTM

Mirrors the writer change; necessary for decode/restore path.

Same optional note as above: consider a small shared target if this module expands or is reused elsewhere.

🤖 Prompt for AI Agents
In components/core/src/clp_s/CMakeLists.txt around lines 299 to 301, the
FloatFormatEncoding source files are missing from the reader target; add
FloatFormatEncoding.cpp and FloatFormatEncoding.hpp to the reader sources (and
keep JsonSerializer.hpp) so the decode/restore path builds correctly, and
consider refactoring into a small shared/static library target if this module
will be reused to avoid duplication.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
components/core/src/clp_s/CommandLineArguments.cpp (1)

886-889: Expose the new flag in compression usage.

Add a succinct line so users see the option without scrolling the full list.

Apply:

 void CommandLineArguments::print_compression_usage() const {
   std::cerr << "Usage: " << m_program_name << " c [OPTIONS] ARCHIVES_DIR [FILE/DIR ...]"
             << std::endl;
+  std::cerr << "  --retain-float-format    Preserve original JSON float formatting"
+            << std::endl;
 }
components/core/tests/test-clp_s-search.cpp (1)

160-162: Fix designated initializers for clp_s::Path (use ‘=’, not brace-init)

.source{…}/.path{…} is invalid; use C++20 designated initializers.

-        auto archive_path = clp_s::Path{
-                .source{clp_s::InputSource::Filesystem},
-                .path{entry.path().string()}
-        };
+        auto archive_path = clp_s::Path{
+                .source = clp_s::InputSource::Filesystem,
+                .path = entry.path().string()
+        };
components/core/tests/test-clp_s-end_to_end.cpp (1)

110-113: Fix designated initializers for clp_s::Path in extract()

Use = in designated initializers; brace‑init is invalid and will fail to compile.

-        constructor_option.archive_path = clp_s::Path{
-                .source{clp_s::InputSource::Filesystem},
-                .path{entry.path().string()}
-        };
+        constructor_option.archive_path = clp_s::Path{
+                .source = clp_s::InputSource::Filesystem,
+                .path = entry.path().string()
+        };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0b7e84 and 29fcbf2.

📒 Files selected for processing (12)
  • components/core/src/clp_s/CMakeLists.txt (4 hunks)
  • components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
  • components/core/src/clp_s/JsonParser.cpp (7 hunks)
  • components/core/src/clp_s/JsonParser.hpp (2 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/core/tests/clp_s_test_utils.cpp (2 hunks)
  • components/core/tests/clp_s_test_utils.hpp (2 hunks)
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp (1 hunks)
  • components/core/tests/test-clp_s-end_to_end.cpp (5 hunks)
  • components/core/tests/test-clp_s-range_index.cpp (1 hunks)
  • components/core/tests/test-clp_s-search.cpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/src/clp_s/clp-s.cpp
  • components/core/tests/clp_s_test_utils.hpp
  • components/core/src/clp_s/CommandLineArguments.hpp
  • components/core/src/clp_s/CommandLineArguments.cpp
  • components/core/tests/clp_s_test_utils.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-range_index.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
🧠 Learnings (16)
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/clp_s_test_utils.hpp
  • components/core/tests/clp_s_test_utils.cpp
  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-range_index.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-08-25T07:27:19.449Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1262
File: docs/src/dev-guide/components-core/index.md:50-50
Timestamp: 2025-08-25T07:27:19.449Z
Learning: In the CLP project, nlohmann_json version 3.11.3 is installed through their dependency management process, but they don't enforce version constraints in CMake find_package calls - they control versions through installation rather than CMake version pinning.

Applied to files:

  • components/core/src/clp_s/CMakeLists.txt
📚 Learning: 2024-10-07T21:16:41.660Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:769-779
Timestamp: 2024-10-07T21:16:41.660Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, when handling errors in `parse_from_ir`, prefer to maintain the current mix of try-catch and if-statements because specific messages are returned back up in some cases.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:13.322Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:13.322Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, validation and exception throwing are unnecessary in the `get_archive_node_id` method when processing nodes, and should not be added.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-12-10T16:03:08.691Z
Learnt from: gibber9809
PR: y-scope/clp#630
File: components/core/src/clp_s/JsonParser.cpp:702-703
Timestamp: 2024-12-10T16:03:08.691Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` function, validation and exception throwing for UTF-8 compliance of `curr_node.get_key_name()` are unnecessary and should be omitted.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-08T15:52:50.753Z
Learnt from: AVMatthews
PR: y-scope/clp#543
File: components/core/src/clp_s/JsonParser.cpp:581-627
Timestamp: 2024-10-08T15:52:50.753Z
Learning: In `components/core/src/clp_s/JsonParser.cpp`, within the `get_archive_node_id` method, throwing a string literal as an exception is acceptable practice.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2024-10-14T03:42:10.355Z
Learnt from: LinZhihao-723
PR: y-scope/clp#558
File: components/core/tests/test-ffi_KeyValuePairLogEvent.cpp:14-14
Timestamp: 2024-10-14T03:42:10.355Z
Learning: In the file `components/core/tests/test-ffi_KeyValuePairLogEvent.cpp`, including `<json/single_include/nlohmann/json.hpp>` is consistent with the project's coding standards.

Applied to files:

  • components/core/src/clp_s/JsonParser.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/core/tests/test-clp_s-search.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.

Applied to files:

  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.

Applied to files:

  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.

Applied to files:

  • components/core/tests/test-clp_s-delta-encode-log-order.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-14T01:03:27.144Z
Learnt from: LinZhihao-723
PR: y-scope/clp#450
File: components/core/src/clp/streaming_archive/reader/Segment.cpp:57-58
Timestamp: 2024-11-14T01:03:27.144Z
Learning: In `components/core/src/clp/streaming_archive/reader/Segment.cpp`, the method `m_decompressor.open()` does not have a return value.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-29T22:51:00.355Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:141-142
Timestamp: 2024-11-29T22:51:00.355Z
Learning: In 'components/core/tests/test-clp_s-end_to_end.cpp', using `WIFEXITED` and `WEXITSTATUS` with the return value of `std::system` is acceptable for checking the exit status of system commands, as per the C++ documentation.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
🧬 Code graph analysis (4)
components/core/src/clp_s/clp-s.cpp (1)
components/core/src/clp_s/ArchiveWriter.hpp (1)
  • option (146-146)
components/core/src/clp_s/JsonParser.cpp (6)
components/core/src/clp_s/ParsedMessage.hpp (10)
  • value (80-82)
  • value (80-80)
  • value (90-92)
  • value (90-90)
  • node_id (45-47)
  • node_id (45-45)
  • node_id (55-57)
  • node_id (55-55)
  • node_id (65-67)
  • node_id (65-65)
components/core/src/clp_s/FloatFormatEncoding.cpp (4)
  • restore_encoded_float (263-328)
  • restore_encoded_float (263-264)
  • get_float_encoding (136-261)
  • get_float_encoding (136-137)
components/core/src/clp_s/JsonParser.hpp (5)
  • JsonParser (59-59)
  • JsonParser (62-62)
  • line (112-112)
  • line (195-195)
  • line (202-202)
components/core/src/clp_s/ArchiveWriter.hpp (3)
  • option (146-146)
  • parent_node_id (171-171)
  • parent_node_id (179-179)
components/core/src/clp_s/ColumnReader.hpp (4)
  • FormattedFloat (164-164)
  • DictionaryFloat (200-200)
  • Float (141-141)
  • Integer (85-85)
components/core/src/clp_s/SchemaTree.hpp (1)
  • parent_node_id (132-132)
components/core/tests/test-clp_s-search.cpp (2)
components/core/tests/test-clp_s-end_to_end.cpp (6)
  • get_test_input_path_relative_to_tests_dir (41-42)
  • get_test_input_path_relative_to_tests_dir (52-55)
  • get_test_input_path_relative_to_tests_dir (52-53)
  • get_test_input_local_path (43-43)
  • get_test_input_local_path (57-61)
  • get_test_input_local_path (57-57)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (14-55)
  • compress_archive (14-21)
components/core/tests/test-clp_s-end_to_end.cpp (5)
components/core/tests/test-clp_s-search.cpp (6)
  • get_test_input_path_relative_to_tests_dir (48-49)
  • get_test_input_path_relative_to_tests_dir (64-67)
  • get_test_input_path_relative_to_tests_dir (64-65)
  • get_test_input_local_path (50-50)
  • get_test_input_local_path (69-73)
  • get_test_input_local_path (69-69)
components/core/src/clp_s/SchemaTree.hpp (2)
  • child_id (113-113)
  • child_id (113-113)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (14-55)
  • compress_archive (14-21)
components/core/tests/clp_s_test_utils.hpp (1)
  • compress_archive (24-31)
components/core/src/clp_s/ColumnReader.hpp (2)
  • FormattedFloat (164-164)
  • DictionaryFloat (200-200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: build-macos (macos-14, true)
  • GitHub Check: build-macos (macos-14, false)
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: lint-check (ubuntu-24.04)
  • GitHub Check: lint-check (macos-15)
🔇 Additional comments (30)
components/core/src/clp_s/CMakeLists.txt (4)

235-236: Avoid duplicate compilation: extract FloatFormatEncoding into a small shared target.

These sources are added here and again to the reader below. Create a tiny library (e.g., clp_s::float_format) and link both writer/reader to it to cut rebuild time and keep deps centralised.

Apply this diff in this block:

-        FloatFormatEncoding.cpp
-        FloatFormatEncoding.hpp

Then add a small target (outside this block):

add_library(clp_s_float_format STATIC
        FloatFormatEncoding.cpp
)
add_library(clp_s::float_format ALIAS clp_s_float_format)
target_compile_features(clp_s_float_format PRIVATE cxx_std_20)
target_include_directories(clp_s_float_format PUBLIC ../)
target_link_libraries(clp_s_float_format PUBLIC ystdlib::error_handling)

# Link it
target_link_libraries(clp_s_archive_writer PRIVATE clp_s::float_format)
target_link_libraries(clp_s_archive_reader PRIVATE clp_s::float_format)

305-306: Deduplicate FloatFormatEncoding build by sharing a library.

Same as writer: move these into clp_s::float_format and link reader to it.

Apply this diff in this block:

-        FloatFormatEncoding.cpp
-        FloatFormatEncoding.hpp

(Linking instructions provided in the earlier comment.)


279-279: Keep ystdlib::error_handling PUBLIC — it’s used in clp_s public headers

FloatFormatEncoding.hpp (includes Result) and kv_ir_search.hpp (includes ErrorCode, Result) expose ystdlib::error_handling types in the public API, so changing the link to PRIVATE is incorrect. Keep PUBLIC here; if retained, remove the redundant ystdlib::error_handling entry from the clp-s executable link list (around line 443).

Likely an incorrect or invalid review comment.


343-343: Make ystdlib::error_handling PRIVATE unless clp_s public headers require it

Repository scans produced no evidence that clp_s public headers expose ystdlib (rg returned no matches; a subsequent rg attempt errored on the 'hpp' file type). Change PUBLIC → PRIVATE in the target_link_libraries unless a public header includes or exposes ystdlib symbols/macros.

File: components/core/src/clp_s/CMakeLists.txt (around line 343)

-                PUBLIC
+                PRIVATE
                 absl::flat_hash_map
                 clp::string_utils
                 clp_s::io
                 msgpack-cxx
                 nlohmann_json::nlohmann_json
-                ystdlib::error_handling
+                ystdlib::error_handling

Quick verification command to run locally:
rg -n -C2 -e '\bystdlib\b' -e '#\sinclude\s<ystdlib' -e 'error_handling' -e '\bY_THROW\b' -e '\bY_EXPECT\b' components/core/src/clp_s/

components/core/src/clp_s/CommandLineArguments.hpp (2)

105-105: Good: accessor is [[nodiscard]].

Consistent with nearby getters. No further action.


184-184: Good: default-off flag.

Member is explicitly initialised to false; safe default.

components/core/src/clp_s/CommandLineArguments.cpp (1)

247-251: CLI flag wiring — LGTM.

Option is correctly bound via bool_switch to m_retain_float_format.

components/core/tests/clp_s_test_utils.cpp (1)

18-21: Plumbing looks correct.

Param added and propagated to parser_option.retain_float_format; matches header.

Also applies to: 41-42

components/core/src/clp_s/clp-s.cpp (1)

104-104: Log the flag and warn for KV‑IR inputs.

Improve UX by logging the effective setting and warning when it won’t apply to IR inputs.

Apply:

     option.print_archive_stats = command_line_arguments.print_archive_stats();
     option.retain_float_format = command_line_arguments.get_retain_float_format();
+    SPDLOG_DEBUG("retain_float_format={}", option.retain_float_format);
+    if (option.retain_float_format) {
+        bool has_ir_input{false};
+        for (auto const& p : option.input_paths) {
+            if (std::string::npos != p.path.find(clp::ir::cIrFileExtension)) {
+                has_ir_input = true;
+                break;
+            }
+        }
+        if (has_ir_input) {
+            SPDLOG_WARN("--retain-float-format currently has no effect for KV-IR inputs.");
+        }
+    }
components/core/src/clp_s/JsonParser.hpp (1)

44-46: Add brief docs for retain_float_format.

Clarify scope/impact; mirrors public CLI knob.

Apply:

 struct JsonParserOption {
@@
-    bool retain_float_format{false};
+    // When true, preserve original JSON float text formatting (e.g., decimal/scientific, precision).
+    // No effect for KV-IR inputs; may increase CPU/memory during parse/serialize.
+    bool retain_float_format{false};
@@
-    bool m_retain_float_format{false};
+    // Mirrors option.retain_float_format.
+    bool m_retain_float_format{false};

Also applies to: 244-245

components/core/tests/test-clp_s-delta-encode-log-order.cpp (1)

73-76: Name boolean args for readability.

Inline names prevent argument-order slips.

Apply:

-            std::nullopt,
-            false,
-            true,
-            false
+            std::nullopt,
+            /*retain_float_format=*/false,
+            /*single_file_archive=*/true,
+            /*structurize_arrays=*/false
components/core/tests/test-clp_s-range_index.cpp (1)

227-230: Name boolean args for readability.

Annotate booleans at the call site.

Apply:

-                    std::nullopt,
-                    false,
-                    single_file_archive,
-                    false
+                    std::nullopt,
+                    /*retain_float_format=*/false,
+                    single_file_archive,
+                    /*structurize_arrays=*/false
components/core/tests/clp_s_test_utils.hpp (1)

28-31: Signature change — verify all call sites updated

Automated scan located compress_archive usages but the verification script aborted; manually confirm each call supplies the three new bool params (retain_float_format, single_file_archive, structurize_arrays) in the correct order.

  • Example hits (verify these call-sites):

    • components/core/tests/test-clp_s-search.cpp:236 — args_count=6; trailing args: false, single_file_archive, structurize_arrays.
    • components/core/tests/test-clp_s-search.cpp:273 — args_count=7; trailing args: true, single_file_archive, false, clp_s::FileType::Json.
  • Also check: components/core/tests/test-clp_s-range_index.cpp, components/core/tests/test-clp_s-delta-encode-log-order.cpp, components/core/tests/test-clp_s-end_to_end.cpp, components/core/tests/clp_s_test_utils.cpp.

If any call is missing or has reordered boolean arguments, update the call to match the signature in components/core/tests/clp_s_test_utils.hpp.

components/core/tests/test-clp_s-search.cpp (7)

44-44: LGTM: new formatted-float fixture constant

The constant is clear and follows existing naming.


64-67: Const‑correctness for helper definitions

Make parameters const for consistency with other tests.

-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path)
+auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path)
         -> std::filesystem::path {
     return std::filesystem::path{cTestInputFileDirectory} / test_input_path;
 }
 
-auto get_test_input_local_path(std::string_view test_input_path) -> std::string {
+auto get_test_input_local_path(std::string_view const test_input_path) -> std::string {
     std::filesystem::path const current_file_path{__FILE__};
     auto const tests_dir{current_file_path.parent_path()};
     return (tests_dir / get_test_input_path_relative_to_tests_dir(test_input_path)).string();
 }

Also applies to: 69-73


233-244: Name boolean args and isolate per‑run archive directory

  • Annotate booleans to prevent arg‑order mistakes.
  • Use a unique archive dir per parametrisation to avoid cross‑run interference.
-    TestOutputCleaner const test_cleanup{{std::string{cTestSearchArchiveDirectory}}};
+    auto const archive_dir = fmt::format("{}-s{}_f{}",
+                                         cTestSearchArchiveDirectory,
+                                         structurize_arrays,
+                                         single_file_archive);
+    TestOutputCleaner const test_cleanup{{archive_dir}};
@@
-            std::ignore = compress_archive(
-                    get_test_input_local_path(cTestSearchInputFile),
-                    std::string{cTestSearchArchiveDirectory},
-                    std::string{cTestIdxKey},
-                    false,
-                    single_file_archive,
-                    structurize_arrays
-            )
+            std::ignore = compress_archive(
+                    get_test_input_local_path(cTestSearchInputFile),
+                    archive_dir,
+                    std::string{cTestIdxKey},
+                    /*retain_float_format=*/false,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/structurize_arrays
+            )

246-249: Capture generated params to aid triage

Include all generated values in failure output.

-        CAPTURE(query);
+        CAPTURE(query);
+        CAPTURE(structurize_arrays);
+        CAPTURE(single_file_archive);

256-267: Extend cases: negative zero and uppercase exponent

Numeric equality should treat -0.0 == 0 and accept 1E-16.

             {R"aa(formattedFloatValue: 1e-16)aa", {6, 7}},
+            {R"aa(formattedFloatValue: 1E-16)aa", {6, 7}},
+            {R"aa(formattedFloatValue: -0.0)aa", {3, 4, 5}},

48-50: Const‑correctness for helper declarations (and centralise)

Accept std::string_view const to match usage elsewhere and avoid accidental mutation. Consider moving these into clp_s_test_utils.{hpp,cpp}.

-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path)
+auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path)
         -> std::filesystem::path;
-auto get_test_input_local_path(std::string_view test_input_path) -> std::string;
+auto get_test_input_local_path(std::string_view const test_input_path) -> std::string;

270-283: Name boolean args and use unique archive dir

Match the style used above and avoid shared output paths across runs.

-    TestOutputCleaner const test_cleanup{{std::string{cTestSearchArchiveDirectory}}};
+    auto const archive_dir = fmt::format("{}-f{}",
+                                         cTestSearchArchiveDirectory,
+                                         single_file_archive);
+    TestOutputCleaner const test_cleanup{{archive_dir}};
@@
-            std::ignore = compress_archive(
-                    get_test_input_local_path(cTestSearchFormattedFloatFile),
-                    std::string{cTestSearchArchiveDirectory},
-                    std::nullopt,
-                    true,
-                    single_file_archive,
-                    false,
-                    clp_s::FileType::Json
-            )
+            std::ignore = compress_archive(
+                    get_test_input_local_path(cTestSearchFormattedFloatFile),
+                    archive_dir,
+                    std::nullopt,
+                    /*retain_float_format=*/true,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/false,
+                    clp_s::FileType::Json
+            )
components/core/tests/test-clp_s-end_to_end.cpp (6)

129-133: Quote paths in shell commands

Unquoted paths break on spaces. Quote both arguments.

-    auto command = fmt::format(
-            "jq --sort-keys --compact-output '.' {} | sort > {}",
-            extracted_json_path.string(),
-            cTestEndToEndOutputSortedJson
-    );
+    auto command = fmt::format(
+            "jq --sort-keys --compact-output '.' \"{}\" | sort > \"{}\"",
+            extracted_json_path.string(),
+            cTestEndToEndOutputSortedJson
+    );
@@
-    command = fmt::format(
-            "diff --unified {} {}  > /dev/null",
-            cTestEndToEndOutputSortedJson,
-            get_test_input_local_path(cTestEndToEndInputFile)
-    );
+    command = fmt::format(
+            "diff --unified \"{}\" \"{}\" > /dev/null",
+            cTestEndToEndOutputSortedJson,
+            get_test_input_local_path(cTestEndToEndInputFile)
+    );

Also applies to: 141-147


150-182: Quote paths in literallyCompare()

Apply consistent quoting throughout.

-    auto command = fmt::format(
-            "cat {} | awk 'NF' | tr -d ' ' | sort > {}",
-            expected_output_json_path.string(),
-            cTestEndToEndExpectedOutputSortedFile
-    );
+    auto command = fmt::format(
+            "cat \"{}\" | awk 'NF' | tr -d ' ' | sort > \"{}\"",
+            expected_output_json_path.string(),
+            cTestEndToEndExpectedOutputSortedFile
+    );
@@
-    command = fmt::format(
-            "cat {} | awk 'NF' | tr -d ' ' | sort > {}",
-            extracted_json_path.string(),
-            cTestEndToEndOutputSortedJson
-    );
+    command = fmt::format(
+            "cat \"{}\" | awk 'NF' | tr -d ' ' | sort > \"{}\"",
+            extracted_json_path.string(),
+            cTestEndToEndOutputSortedJson
+    );
@@
-    command = fmt::format(
-            "diff --unified {} {}  > /dev/null",
-            cTestEndToEndExpectedOutputSortedFile,
-            cTestEndToEndOutputSortedJson
-    );
+    command = fmt::format(
+            "diff --unified \"{}\" \"{}\" > /dev/null",
+            cTestEndToEndExpectedOutputSortedFile,
+            cTestEndToEndOutputSortedJson
+    );

197-206: Check compress_archive stats and name boolean args

Don’t discard stats; assert non‑empty and annotate flags.

-    REQUIRE_NOTHROW(
-            std::ignore = compress_archive(
+    REQUIRE_NOTHROW(
+            [&]{
+                auto stats = compress_archive(
                     get_test_input_local_path(cTestEndToEndInputFile),
                     std::string{cTestEndToEndArchiveDirectory},
                     std::nullopt,
-                    false,
-                    single_file_archive,
-                    structurize_arrays
-            )
+                    /*retain_float_format=*/false,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/structurize_arrays
+                );
+                REQUIRE(false == stats.empty());
+            }()
     );

220-237: Validate stats and name booleans (valid formatted‑floats test); isolate per‑run dir

Mirror the improvements; prevent parallel interference by using a unique archive dir.

-    TestOutputCleaner const test_cleanup{
-            {std::string{cTestEndToEndArchiveDirectory},
+    auto const archive_dir = fmt::format("{}-valid-s{}_f{}",
+                                         cTestEndToEndArchiveDirectory,
+                                         structurize_arrays,
+                                         single_file_archive);
+    TestOutputCleaner const test_cleanup{
+            {archive_dir,
              std::string{cTestEndToEndOutputDirectory},
              std::string{cTestEndToEndOutputSortedJson},
              std::string{cTestEndToEndExpectedOutputSortedFile}}
     };
@@
-    REQUIRE_NOTHROW(
-            std::ignore = compress_archive(
+    REQUIRE_NOTHROW(
+            [&]{
+                auto stats = compress_archive(
                     get_test_input_local_path(cTestEndToEndValidFormattedFloatInputFile),
-                    std::string{cTestEndToEndArchiveDirectory},
+                    archive_dir,
                     std::nullopt,
-                    true,
-                    single_file_archive,
-                    structurize_arrays,
+                    /*retain_float_format=*/true,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/structurize_arrays,
                     clp_s::FileType::Json
-            )
+                );
+                REQUIRE(false == stats.empty());
+            }()
     );

271-281: Validate stats and name booleans (invalid formatted‑floats test); isolate per‑run dir

Apply the same fixes here.

-    TestOutputCleaner const test_cleanup{
-            {std::string{cTestEndToEndArchiveDirectory},
+    auto const archive_dir = fmt::format("{}-invalid-s{}_f{}",
+                                         cTestEndToEndArchiveDirectory,
+                                         structurize_arrays,
+                                         single_file_archive);
+    TestOutputCleaner const test_cleanup{
+            {archive_dir,
              std::string{cTestEndToEndOutputDirectory},
              std::string{cTestEndToEndOutputSortedJson},
              std::string{cTestEndToEndExpectedOutputSortedFile}}
     };
@@
-    REQUIRE_NOTHROW(
-            std::ignore = compress_archive(
+    REQUIRE_NOTHROW(
+            [&]{
+                auto stats = compress_archive(
                     get_test_input_local_path(cTestEndToEndInvalidFormattedFloatInputFile),
-                    std::string{cTestEndToEndArchiveDirectory},
+                    archive_dir,
                     std::nullopt,
-                    true,
-                    single_file_archive,
-                    structurize_arrays,
+                    /*retain_float_format=*/true,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/structurize_arrays,
                     clp_s::FileType::Json
-            )
+                );
+                REQUIRE(false == stats.empty());
+            }()
     );

41-61: Deduplicate test‑path helpers across test files

These helpers duplicate those in test‑clp_s‑search.cpp; prefer centralising in clp_s_test_utils to avoid drift.

components/core/src/clp_s/JsonParser.cpp (4)

258-276: Make token view const; avoid accidental reuse

Const‑qualify the raw token view for clarity and to prevent accidental reassignment.

-                        auto double_value_str{trim_trailing_whitespace(cur_value.raw_json_token())};
+                        auto const double_value_str{
+                                trim_trailing_whitespace(cur_value.raw_json_token())
+                        };

365-383: Make token view const (array path too)

Mirror the const binding here as well.

-                        auto double_value_str{trim_trailing_whitespace(cur_value.raw_json_token())};
+                        auto const double_value_str{
+                                trim_trailing_whitespace(cur_value.raw_json_token())
+                        };

520-543: Make token view const (ordered/object path)

Keep consistent const‑binding across all three sites.

-                        auto double_value_str{trim_trailing_whitespace(line.raw_json_token())};
+                        auto const double_value_str{
+                                trim_trailing_whitespace(line.raw_json_token())
+                        };

121-131: Locale‑safe whitespace check: cast to unsigned char

std::isspace requires unsigned char to avoid UB for negative char values.

 auto trim_trailing_whitespace(std::string_view str) -> std::string_view {
     size_t substr_size{str.size()};
     for (auto it{str.rbegin()}; str.rend() != it; ++it) {
-        if (std::isspace(static_cast<int>(*it))) {
+        if (std::isspace(static_cast<unsigned char>(*it))) {
             --substr_size;
         } else {
             break;
         }
     }
     return str.substr(0ULL, substr_size);
 }

Comment on lines 202 to +289
case simdjson::ondemand::json_type::number: {
simdjson::ondemand::number number_value = cur_value.get_number();
if (true == number_value.is_double()) {
double double_value = number_value.get_double();
m_current_parsed_message.add_unordered_value(double_value);
node_id = m_archive_writer
->add_node(node_id_stack.top(), NodeType::Float, cur_key);
if (m_retain_float_format) {
auto double_value_str{trim_trailing_whitespace(cur_value.raw_json_token())};
auto const float_format_result{get_float_encoding(double_value_str)};
if (false == float_format_result.has_error()
&& round_trip_is_identical(
double_value_str,
number_value.get_double(),
float_format_result.value()
))
{
m_current_parsed_message.add_unordered_value(
number_value.get_double(),
float_format_result.value()
);
node_id = m_archive_writer->add_node(
node_id_stack.top(),
NodeType::FormattedFloat,
cur_key
);
} else {
m_current_parsed_message.add_unordered_value(double_value_str);
node_id = m_archive_writer->add_node(
node_id_stack.top(),
NodeType::DictionaryFloat,
cur_key
);
}
} else {
double double_value = number_value.get_double();
m_current_parsed_message.add_unordered_value(double_value);
node_id = m_archive_writer
->add_node(node_id_stack.top(), NodeType::Float, cur_key);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Deduplicate float‑handling into a helper

The FormattedFloat/DictionaryFloat decision logic is repeated in 3 places. Extract a small helper (e.g., process_double_token(...)) to compute {node_type, parsed_message_addition} and call it from parse_obj_in_array/parse_array/parse_line to reduce drift.

Would you like me to draft the helper signature and a minimal refactor patch?

Also applies to: 362-395, 518-549

🤖 Prompt for AI Agents
components/core/src/clp_s/JsonParser.cpp lines 254-289 (also apply same change
at 362-395 and 518-549): duplicate logic deciding between FormattedFloat vs
DictionaryFloat (and adding the corresponding value to m_current_parsed_message
+ choosing node type) appears in three places; extract a helper like
process_double_token(cur_value, cur_key, m_retain_float_format,
m_archive_writer, m_current_parsed_message) that returns a pair{NodeType,
optional<string|double>-value-addition} or directly performs the
m_current_parsed_message.add_unordered_value and returns the NodeType to add;
replace the repeated blocks in parse_obj_in_array/parse_array/parse_line with a
single call to that helper and then call
m_archive_writer->add_node(node_id_stack.top(), returnedNodeType, cur_key) to
keep behavior identical while centralizing the float-format detection and
round-trip check.

Comment on lines +18 to 20
* @param timestamp_key
* @param retain_float_format
* @param single_file_archive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Document scope: JSON-only effect.

Note in the doc that retain_float_format has no effect for KV‑IR inputs.

Apply:

- * @param retain_float_format
+ * @param retain_float_format
+ *        Preserve original JSON float text formatting (e.g., decimal/scientific, precision).
+ *        Note: No effect when compressing KV-IR inputs.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
* @param timestamp_key
* @param retain_float_format
* @param single_file_archive
* @param timestamp_key
* @param retain_float_format
* Preserve original JSON float text formatting (e.g., decimal/scientific, precision).
* Note: No effect when compressing KV-IR inputs.
* @param single_file_archive
🤖 Prompt for AI Agents
In components/core/tests/clp_s_test_utils.hpp around lines 18 to 20, the
parameter documentation for retain_float_format lacks the scope note; update the
comment to state explicitly that retain_float_format only applies to JSON inputs
and has no effect for KV-IR (key-value IR) inputs. Edit the @param description
for retain_float_format to add one short sentence such as "Applies to JSON-only;
no effect for KV-IR inputs" so readers know the parameter is ignored for KV‑IR.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/core/tests/test-clp_s-end_to_end.cpp (1)

111-113: Fix C++20 designated initializers for clp_s::Path (use '=')

Brace-init .source{...} / .path{...} is invalid; this won’t compile.

-        constructor_option.archive_path = clp_s::Path{
-                .source{clp_s::InputSource::Filesystem},
-                .path{entry.path().string()}
-        };
+        constructor_option.archive_path = clp_s::Path{
+                .source = clp_s::InputSource::Filesystem,
+                .path = entry.path().string()
+        };
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29fcbf2 and 777f126.

📒 Files selected for processing (2)
  • components/core/tests/test-clp_s-end_to_end.cpp (5 hunks)
  • components/core/tests/test-clp_s-search.cpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}

⚙️ CodeRabbit configuration file

  • Prefer false == <expression> rather than !<expression>.

Files:

  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
🧠 Learnings (10)
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/core/tests/test-clp_s-search.cpp
📚 Learning: 2024-11-29T22:50:17.206Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:109-110
Timestamp: 2024-11-29T22:50:17.206Z
Learning: In `components/core/tests/test-clp_s-end_to_end.cpp`, the success of `constructor.store()` is verified through `REQUIRE` statements and subsequent comparisons.

Applied to files:

  • components/core/tests/test-clp_s-search.cpp
  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-01T03:26:26.386Z
Learnt from: LinZhihao-723
PR: y-scope/clp#570
File: components/core/tests/test-ir_encoding_methods.cpp:376-399
Timestamp: 2024-11-01T03:26:26.386Z
Learning: In the test code (`components/core/tests/test-ir_encoding_methods.cpp`), exception handling for `msgpack::unpack` can be omitted because the Catch2 testing framework captures exceptions if they occur.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-12-04T15:50:37.827Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#614
File: components/core/tests/test-StreamingCompression.cpp:45-54
Timestamp: 2024-12-04T15:50:37.827Z
Learning: In `components/core/tests/test-StreamingCompression.cpp`, within the `compress` function, compressing the same data repeatedly by passing the same `src` pointer without advancing is intentional to test the compressor with the same data multiple times.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-14T01:03:27.144Z
Learnt from: LinZhihao-723
PR: y-scope/clp#450
File: components/core/src/clp/streaming_archive/reader/Segment.cpp:57-58
Timestamp: 2024-11-14T01:03:27.144Z
Learning: In `components/core/src/clp/streaming_archive/reader/Segment.cpp`, the method `m_decompressor.open()` does not have a return value.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-11-29T22:51:00.355Z
Learnt from: AVMatthews
PR: y-scope/clp#595
File: components/core/tests/test-clp_s-end_to_end.cpp:141-142
Timestamp: 2024-11-29T22:51:00.355Z
Learning: In 'components/core/tests/test-clp_s-end_to_end.cpp', using `WIFEXITED` and `WEXITSTATUS` with the return value of `std::system` is acceptable for checking the exit status of system commands, as per the C++ documentation.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2024-10-13T09:27:43.408Z
Learnt from: LinZhihao-723
PR: y-scope/clp#557
File: components/core/tests/test-ir_encoding_methods.cpp:1216-1286
Timestamp: 2024-10-13T09:27:43.408Z
Learning: In the unit test case `ffi_ir_stream_serialize_schema_tree_node_id` in `test-ir_encoding_methods.cpp`, suppressing the `readability-function-cognitive-complexity` warning is acceptable due to the expansion of Catch2 macros in C++ tests, and such test cases may not have readability issues.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-01-14T16:06:54.692Z
Learnt from: haiqi96
PR: y-scope/clp#646
File: components/core/src/clp/streaming_archive/writer/Archive.hpp:354-354
Timestamp: 2025-01-14T16:06:54.692Z
Learning: Member variables in C++ classes should be explicitly initialized in the constructor to prevent undefined behavior, as demonstrated in the Archive class where `m_use_single_file_archive` is initialized to `false`.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
📚 Learning: 2025-08-09T04:07:27.083Z
Learnt from: Bill-hbrhbr
PR: y-scope/clp#1156
File: components/core/CMakeLists.txt:772-772
Timestamp: 2025-08-09T04:07:27.083Z
Learning: In the CLP project's CMakeLists.txt, when reviewing changes related to the ${zstd_TARGET} variable usage in test linking, the team is planning a refactoring PR to improve this mechanism. Guards for undefined target variables should be deferred to that separate PR rather than being added in focused dependency migration PRs.

Applied to files:

  • components/core/tests/test-clp_s-end_to_end.cpp
🧬 Code graph analysis (2)
components/core/tests/test-clp_s-search.cpp (3)
components/core/tests/test-clp_s-end_to_end.cpp (6)
  • get_test_input_path_relative_to_tests_dir (41-42)
  • get_test_input_path_relative_to_tests_dir (52-55)
  • get_test_input_path_relative_to_tests_dir (52-53)
  • get_test_input_local_path (43-43)
  • get_test_input_local_path (57-61)
  • get_test_input_local_path (57-57)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (14-55)
  • compress_archive (14-21)
components/core/tests/clp_s_test_utils.hpp (1)
  • compress_archive (24-31)
components/core/tests/test-clp_s-end_to_end.cpp (4)
components/core/tests/test-clp_s-search.cpp (6)
  • get_test_input_path_relative_to_tests_dir (48-49)
  • get_test_input_path_relative_to_tests_dir (64-67)
  • get_test_input_path_relative_to_tests_dir (64-65)
  • get_test_input_local_path (50-50)
  • get_test_input_local_path (69-73)
  • get_test_input_local_path (69-69)
components/core/src/clp_s/SchemaTree.hpp (2)
  • child_id (113-113)
  • child_id (113-113)
components/core/tests/clp_s_test_utils.cpp (2)
  • compress_archive (14-55)
  • compress_archive (14-21)
components/core/src/clp_s/ColumnReader.hpp (2)
  • FormattedFloat (164-164)
  • DictionaryFloat (200-200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: package-image
  • GitHub Check: ubuntu-jammy-static-linked-bins
  • GitHub Check: ubuntu-jammy-dynamic-linked-bins
  • GitHub Check: ubuntu-jammy-lint
  • GitHub Check: musllinux_1_2-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-dynamic-linked-bins
  • GitHub Check: manylinux_2_28-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-dynamic-linked-bins
  • GitHub Check: musllinux_1_2-x86_64-static-linked-bins
  • GitHub Check: centos-stream-9-static-linked-bins
  • GitHub Check: build-macos (macos-15, false)
  • GitHub Check: build-macos (macos-15, true)
  • GitHub Check: lint-check (macos-15)
  • GitHub Check: lint-check (ubuntu-24.04)
🔇 Additional comments (22)
components/core/tests/test-clp_s-search.cpp (13)

44-44: Fixture constant LGTM.

Clear naming; no issues.


218-218: Use of parameterised input path LGTM.

Good switch to per-file path resolver.


246-249: Capture generated params for easier triage (repeat).

Include parameter values in failure output.

-        CAPTURE(query);
+        CAPTURE(query);
+        CAPTURE(structurize_arrays);
+        CAPTURE(single_file_archive);
         REQUIRE_NOTHROW(search(query, false, expected_results));

256-256: New formatted-float test case LGTM.

Good coverage of numeric operators and simple equality.


257-267: Add -0.0 and uppercase exponent cases; clarify numeric vs lexical (repeat).

Ensure numeric semantics handle -0.0 == 0 and exponent case-insensitivity.

             {R"aa(formattedFloatValue: 1e-16)aa", {6, 7}},
+            {R"aa(formattedFloatValue: -0.0)aa", {3, 4, 5}},
+            {R"aa(formattedFloatValue: 1E-16)aa", {6, 7}},

If lexical (exact-format) queries are also supported when quoted, consider adding quoted variants in a follow-up.


288-291: Metadata first-record query reuse LGTM.

Consistent with the other test; good sanity check.


48-50: Deduplicate helpers; make params const (repeat).

These helpers duplicate ones used elsewhere and lack const on parameters. Prefer centralising in clp_s_test_utils and apply const-correctness.

-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path)
+auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path)
         -> std::filesystem::path;
-auto get_test_input_local_path(std::string_view test_input_path) -> std::string;
+auto get_test_input_local_path(std::string_view const test_input_path) -> std::string;

64-67: Apply const-correctness to definition (repeat).

Match signatures used in other tests and reduce accidental copies.

-auto get_test_input_path_relative_to_tests_dir(std::string_view test_input_path)
+auto get_test_input_path_relative_to_tests_dir(std::string_view const test_input_path)
         -> std::filesystem::path {
     return std::filesystem::path{cTestInputFileDirectory} / test_input_path;
 }

69-73: Apply const-correctness to definition (repeat).

-auto get_test_input_local_path(std::string_view test_input_path) -> std::string {
+auto get_test_input_local_path(std::string_view const test_input_path) -> std::string {
     std::filesystem::path const current_file_path{__FILE__};
     auto const tests_dir{current_file_path.parent_path()};
     return (tests_dir / get_test_input_path_relative_to_tests_dir(test_input_path)).string();
 }

237-243: Name boolean/flag args in compress_archive (repeat).

Inline names prevent arg-order mistakes.

             std::ignore = compress_archive(
                     get_test_input_local_path(cTestSearchInputFile),
                     std::string{cTestSearchArchiveDirectory},
-                    std::string{cTestIdxKey},
-                    false,
-                    single_file_archive,
-                    structurize_arrays
+                    /*timestamp_key=*/std::string{cTestIdxKey},
+                    /*retain_float_format=*/false,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/structurize_arrays
             )

270-271: Isolate archive dirs per parametrisation (repeat).

Avoid cross-run interference in parallel runs by suffixing the directory.

-    TestOutputCleaner const test_cleanup{{std::string{cTestSearchArchiveDirectory}}};
+    auto archive_dir = fmt::format("{}-f{}", cTestSearchArchiveDirectory, single_file_archive);
+    TestOutputCleaner const test_cleanup{{archive_dir}};
@@
-                    std::string{cTestSearchArchiveDirectory},
+                    archive_dir,

Also applies to: 274-276


272-281: Name boolean/flag args in formatted-float compress_archive (repeat).

             std::ignore = compress_archive(
                     get_test_input_local_path(cTestSearchFormattedFloatFile),
-                    std::string{cTestSearchArchiveDirectory},
-                    std::nullopt,
-                    true,
-                    single_file_archive,
-                    false
+                    std::string{cTestSearchArchiveDirectory},
+                    /*timestamp_key=*/std::nullopt,
+                    /*retain_float_format=*/true,
+                    /*single_file_archive=*/single_file_archive,
+                    /*structurize_arrays=*/false
             )

283-286: Capture generated param in failure output.

Add single_file_archive to CAPTURE to aid debugging.

-        CAPTURE(query);
+        CAPTURE(query);
+        CAPTURE(single_file_archive);
         REQUIRE_NOTHROW(search(query, false, expected_results));
components/core/tests/test-clp_s-end_to_end.cpp (9)

197-206: Capture and assert compress_archive stats (no-floats test)

Avoid discarding return value; assert non-empty stats to validate write-out.

-    REQUIRE_NOTHROW(
-            std::ignore = compress_archive(
+    REQUIRE_NOTHROW([&]{
+            auto stats = compress_archive(
                     get_test_input_local_path(cTestEndToEndInputFile),
                     std::string{cTestEndToEndArchiveDirectory},
                     std::nullopt,
                     false,
                     single_file_archive,
                     structurize_arrays
-            )
-    );
+            );
+            REQUIRE(false == stats.empty());
+    }());

141-148: Quote paths in diff invocation

Unquoted paths break on spaces.

-    command = fmt::format(
-            "diff --unified {} {}  > /dev/null",
-            cTestEndToEndOutputSortedJson,
-            get_test_input_local_path(cTestEndToEndInputFile)
-    );
+    command = fmt::format(
+            "diff --unified \"{}\" \"{}\" > /dev/null",
+            cTestEndToEndOutputSortedJson,
+            get_test_input_local_path(cTestEndToEndInputFile)
+    );

150-153: Name is misleading: literallyCompare ignores whitespace

Either rename (e.g., compareIgnoringWhitespace) or add a brief comment stating it strips spaces and sorts lines.


155-178: Quote all file paths in shell pipelines (cat/awk/tr/sort/diff)

Consistently quote to avoid breakage with spaces/special chars.

-    auto command = fmt::format(
-            "cat {} | awk 'NF' | tr -d ' ' | sort > {}",
-            expected_output_json_path.string(),
-            cTestEndToEndExpectedOutputSortedFile
-    );
+    auto command = fmt::format(
+            "cat \"{}\" | awk 'NF' | tr -d ' ' | sort > \"{}\"",
+            expected_output_json_path.string(),
+            cTestEndToEndExpectedOutputSortedFile
+    );
@@
-    command = fmt::format(
-            "cat {} | awk 'NF' | tr -d ' ' | sort > {}",
-            extracted_json_path.string(),
-            cTestEndToEndOutputSortedJson
-    );
+    command = fmt::format(
+            "cat \"{}\" | awk 'NF' | tr -d ' ' | sort > \"{}\"",
+            extracted_json_path.string(),
+            cTestEndToEndOutputSortedJson
+    );
@@
-    command = fmt::format(
-            "diff --unified {} {}  > /dev/null",
-            cTestEndToEndExpectedOutputSortedFile,
-            cTestEndToEndOutputSortedJson
-    );
+    command = fmt::format(
+            "diff --unified \"{}\" \"{}\" > /dev/null",
+            cTestEndToEndExpectedOutputSortedFile,
+            cTestEndToEndOutputSortedJson
+    );

227-236: Also validate compress_archive stats (valid formatted-floats test)

Mirror the earlier improvement.

-    REQUIRE_NOTHROW(
-            std::ignore = compress_archive(
+    REQUIRE_NOTHROW([&]{
+            auto stats = compress_archive(
                     get_test_input_local_path(cTestEndToEndValidFormattedFloatInputFile),
                     std::string{cTestEndToEndArchiveDirectory},
                     std::nullopt,
                     true,
                     single_file_archive,
                     structurize_arrays
-            )
-    );
+            );
+            REQUIRE(false == stats.empty());
+    }());

270-279: Also validate compress_archive stats (invalid formatted-floats test)

Keep tests consistent and stronger.

-    REQUIRE_NOTHROW(
-            std::ignore = compress_archive(
+    REQUIRE_NOTHROW([&]{
+            auto stats = compress_archive(
                     get_test_input_local_path(cTestEndToEndInvalidFormattedFloatInputFile),
                     std::string{cTestEndToEndArchiveDirectory},
                     std::nullopt,
                     true,
                     single_file_archive,
                     structurize_arrays
-            )
-    );
+            );
+            REQUIRE(false == stats.empty());
+    }());

247-251: Pre-check extracted artefact exists before compare (minor safety)

Small guard prior to file ops.

-    auto extracted_json_path = extract();
+    auto extracted_json_path = extract();
+    REQUIRE(std::filesystem::exists(extracted_json_path));
+    REQUIRE(std::filesystem::is_regular_file(extracted_json_path));

290-294: Pre-check extracted artefact exists before compare (minor safety)

Mirror the check in the other test.

-    auto extracted_json_path = extract();
+    auto extracted_json_path = extract();
+    REQUIRE(std::filesystem::exists(extracted_json_path));
+    REQUIRE(std::filesystem::is_regular_file(extracted_json_path));

52-61: DRY: Share test-path helpers across tests

These helpers duplicate implementations in test-clp_s-search.cpp; move into a shared header/source (e.g., clp_s_test_utils.hpp/cpp).

Comment on lines +63 to +95
void check_all_leaf_nodes_match_types(std::set<clp_s::NodeType> const& types) {
clp_s::ArchiveReader archive_reader;
for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) {
REQUIRE_NOTHROW(archive_reader.open(
clp_s::Path{
.source = clp_s::InputSource::Filesystem,
.path = entry.path().string()
},
clp_s::NetworkAuthOption{}
));
auto const schema_tree{archive_reader.get_schema_tree()};
REQUIRE(nullptr != schema_tree);
auto const root{schema_tree->get_object_subtree_node_id_for_namespace(
clp_s::constants::cDefaultNamespace
)};
REQUIRE(-1 != root);
std::vector<int32_t> node_id_stack = {root};
while (false == node_id_stack.empty()) {
auto const cur_node_id{node_id_stack.back()};
node_id_stack.pop_back();
auto const& cur_node{schema_tree->get_node(cur_node_id)};

if (false == cur_node.get_children_ids().empty()) {
for (auto const child_id : cur_node.get_children_ids()) {
node_id_stack.emplace_back(child_id);
}
continue;
}
REQUIRE(0 != types.count(cur_node.get_type()));
}

REQUIRE_NOTHROW(archive_reader.close());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Skip non-regular files when scanning the archive directory

Defensive guard prevents attempting to open dirs or temp files.

     for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) {
+        if (false == std::filesystem::is_regular_file(entry.path())) {
+            continue;
+        }
         REQUIRE_NOTHROW(archive_reader.open(
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
void check_all_leaf_nodes_match_types(std::set<clp_s::NodeType> const& types) {
clp_s::ArchiveReader archive_reader;
for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) {
REQUIRE_NOTHROW(archive_reader.open(
clp_s::Path{
.source = clp_s::InputSource::Filesystem,
.path = entry.path().string()
},
clp_s::NetworkAuthOption{}
));
auto const schema_tree{archive_reader.get_schema_tree()};
REQUIRE(nullptr != schema_tree);
auto const root{schema_tree->get_object_subtree_node_id_for_namespace(
clp_s::constants::cDefaultNamespace
)};
REQUIRE(-1 != root);
std::vector<int32_t> node_id_stack = {root};
while (false == node_id_stack.empty()) {
auto const cur_node_id{node_id_stack.back()};
node_id_stack.pop_back();
auto const& cur_node{schema_tree->get_node(cur_node_id)};
if (false == cur_node.get_children_ids().empty()) {
for (auto const child_id : cur_node.get_children_ids()) {
node_id_stack.emplace_back(child_id);
}
continue;
}
REQUIRE(0 != types.count(cur_node.get_type()));
}
REQUIRE_NOTHROW(archive_reader.close());
}
void check_all_leaf_nodes_match_types(std::set<clp_s::NodeType> const& types) {
clp_s::ArchiveReader archive_reader;
for (auto const& entry : std::filesystem::directory_iterator(cTestEndToEndArchiveDirectory)) {
if (false == std::filesystem::is_regular_file(entry.path())) {
continue;
}
REQUIRE_NOTHROW(archive_reader.open(
clp_s::Path{
.source = clp_s::InputSource::Filesystem,
.path = entry.path().string()
},
clp_s::NetworkAuthOption{}
));
auto const schema_tree{archive_reader.get_schema_tree()};
REQUIRE(nullptr != schema_tree);
auto const root{schema_tree->get_object_subtree_node_id_for_namespace(
clp_s::constants::cDefaultNamespace
)};
REQUIRE(-1 != root);
std::vector<int32_t> node_id_stack = {root};
while (false == node_id_stack.empty()) {
auto const cur_node_id{node_id_stack.back()};
node_id_stack.pop_back();
auto const& cur_node{schema_tree->get_node(cur_node_id)};
if (false == cur_node.get_children_ids().empty()) {
for (auto const child_id : cur_node.get_children_ids()) {
node_id_stack.emplace_back(child_id);
}
continue;
}
REQUIRE(0 != types.count(cur_node.get_type()));
}
REQUIRE_NOTHROW(archive_reader.close());
}
}
🤖 Prompt for AI Agents
In components/core/tests/test-clp_s-end_to_end.cpp around lines 63–95, the
directory iteration attempts to open every directory entry and can fail on
non-regular files (directories, symlinks, temp files); add a defensive guard
immediately inside the loop to skip entries that are not regular files (e.g., if
(!std::filesystem::is_regular_file(entry.path())) continue;) so only regular
files are passed to archive_reader.open; place the check before calling
archive_reader.open and keep existing behavior for regular files unchanged.

Comment on lines +268 to +269
auto single_file_archive = GENERATE(true, false);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick

Consider parametrising structurize_arrays here as well.

For parity with the first test, exercising both modes would catch schema-shape differences.

🤖 Prompt for AI Agents
In components/core/tests/test-clp_s-search.cpp around lines 268-269, the test
currently only parameterises single_file_archive; add a matching
parameterisation for structurize_arrays (e.g. auto structurize_arrays =
GENERATE(true, false)) so the test runs both modes and catches schema-shape
differences; ensure the rest of the test uses this variable where
structurize_arrays is consulted (or pass it into the function/setup) so both
combinations are exercised.

@kirkrodrigues kirkrodrigues changed the title feat(clp-s): Add support for losslessly compression floating point numbers; Bump archive formation version to v0.4.1. feat(clp-s): Add support for losslessly compressing floating-point numbers; Bump archive format version to v0.4.1. Sep 23, 2025
Copy link
Member

@kirkrodrigues kirkrodrigues left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring to @gibber9809's review.

Edited the PR title directly. Whoever wrote the previous PR title, watch out for misspellings, lol.

@gibber9809 gibber9809 merged commit 82339c8 into y-scope:main Sep 23, 2025
27 of 31 checks passed
@anlowee anlowee deleted the xwei/formatted-double branch September 23, 2025 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants