Skip to content

feat: Add CLP connector #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 67 commits into
base: main
Choose a base branch
from
Open

feat: Add CLP connector #3

wants to merge 67 commits into from

Conversation

wraymo
Copy link
Member

@wraymo wraymo commented Apr 11, 2025

Overview

The current Presto–CLP connector PR introduces the coordinator-side implementation, along with a placeholder (dummy) worker implementation. Detailed information about the overall design is available in the corresponding RFC. This Velox PR focuses on the worker-side logic.

The Velox CLP connector enables query execution on CLP archives. The Velox worker receives split information and the associated KQL query from the Presto coordinator. For each split, it executes the KQL query against the relevant CLP archive to find matching messages and stores their indices.

To support lazy evaluation, the implementation creates lazy vectors that wrap a CLP column reader and the list of matching indices. When accessed during query execution, these vectors load and decode only the necessary data on demand.

Core Classes

ClpDataSource

This class extends DataSource and implements the addSplit and next methods. During initialization, it records the KQL query and archive source (S3 or local), then traverses the output type to map Presto fields to CLP projection fields. Only ARRAY(VARCHAR) and primitive leaf fields like BIGINT, DOUBLE, BOOLEAN and VARCHAR are projected.

When a split is added, a ClpCursor is created with the archive path and input source. The query is parsed and simplified into an AST. On next, the cursor finds matching row indices and, if any exist, returns a row vector composed of lazy vectors, which load data as needed during execution.

ClpCursor

This class manages the execution of a query over a CLP-S archive. It handles parsing and validation, loading schemas and archives, setting up projection fields, and filtering results. In CLP-S, records are partitioned by schemas. ClpCursor uses ClpQueryRunner to initialize the execution context for each schema and evaluate the filters. It will skip archives where dictionary lookups for string filters return no matches and only scan the relevant schemas of a specific archive. For example, consider a log dataset with the following records.

{"a": "Hello", "b": 2}
{"a": "World", "b": 0, "c": false}
{"a": "World", "c": true}

The three log messages have varying schemas. If we run a KQL query a: World AND b: 0, it will skip loading the third message because it's schema does not match the query (there's no b field). And if the query is a: random AND b: 0, it will even skip scanning the first two records, because random cannot be found in the dictionary.

ClpQueryRunner

This class extends the generic CLP QueryRunner to support ordered projection and row filtering. It initializes projected column readers and returns filtered row indices for each batch.

ClpVectorLoader

In CLP, values are decoded and read from a BaseColumnReader. The ClpVectorLoader is custom Velox VectorLoader that loads vectors from CLP column readers. It supports integers, floats, booleans, strings, and arrays of strings. It's used by lazy vectors to load data on demand using the previously stored row indices.

Summary by CodeRabbit

  • New Features
    • Introduced the CLP Connector for reading CLP archives from local filesystems and S3 with support for KQL query pushdown.
    • Added lazy loading and filtering of data with support for nested row types and complex query expressions.
    • Integrated build support for the CLP Connector, including new dependencies such as spdlog, msgpack-cxx, and ANTLR.
    • Provided a new search library and query engine to power CLP data access and filtering.
  • Tests
    • Added comprehensive unit tests and example NDJSON datasets to validate CLP Connector functionality and query correctness.
  • Documentation
    • Updated Velox connector documentation to include detailed descriptions and usage examples for the CLP Connector.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 22

♻️ Duplicate comments (2)
velox/connectors/clp/CMakeLists.txt (1)

1-44: 🛠️ Refactor suggestion

Conditionally compile with VELOX_ENABLE_CLP_CONNECTOR option

The entire CMakeLists.txt should be wrapped in a conditional check to respect the VELOX_ENABLE_CLP_CONNECTOR option that was added to the main CMakeLists.txt. This ensures the connector is only built when explicitly enabled.

+if(VELOX_ENABLE_CLP_CONNECTOR)
 # Copyright (c) Facebook, Inc. and its affiliates.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 set(CLP_SRC_DIR ${clp_SOURCE_DIR}/components/core/src)
 add_subdirectory(search_lib)

 velox_add_library(
   velox_clp_connector
   OBJECT
   ClpConnector.cpp
   ClpDataSource.cpp
   ClpTableHandle.cpp)

 velox_link_libraries(
   velox_clp_connector
   PRIVATE
     clp-s-search
     velox_connector
     simdjson::simdjson
     msgpack-cxx
     spdlog::spdlog
     absl::flat_hash_map
     Boost::filesystem
     Boost::headers)

 velox_include_directories(
   velox_clp_connector PUBLIC ${clp_SOURCE_DIR}/components/core/submodules
                             ${CLP_SRC_DIR})

 target_compile_features(velox_clp_connector PRIVATE cxx_std_20)

 if(${VELOX_BUILD_TESTING})
   add_subdirectory(tests)
 endif()
+endif()
velox/connectors/clp/ClpConnector.h (1)

65-69: ⚠️ Potential issue

Initialize TimestampPattern in default constructor

The TimestampPattern::init() is called in the explicit constructor but not in the default constructor. This could lead to initialization issues if the default constructor is used.

Apply the following fix:

-  ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {}
+  ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {
+    clp_s::TimestampPattern::init();
+  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 730aabf and 92f9fc0.

📒 Files selected for processing (11)
  • CMake/FindANTLR.cmake (1 hunks)
  • CMake/resolve_dependency_modules/clp.cmake (1 hunks)
  • CMakeLists.txt (4 hunks)
  • velox/connectors/clp/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/ClpConnector.h (1 hunks)
  • velox/connectors/clp/ClpConnectorSplit.h (1 hunks)
  • velox/connectors/clp/ClpDataSource.cpp (1 hunks)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.h (1 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/search_lib/ClpCursor.cpp
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/tests/ClpConnectorTest.cpp
🧬 Code Graph Analysis (2)
velox/connectors/clp/search_lib/ClpCursor.h (2)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
  • ClpCursor (33-37)
  • ClpCursor (39-43)
velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
  • numRows (64-66)
velox/connectors/clp/ClpDataSource.cpp (3)
velox/connectors/clp/ClpDataSource.h (6)
  • ClpDataSource (28-35)
  • columnType (68-70)
  • split (37-37)
  • type (86-91)
  • size (39-40)
  • completedRows_ (52-54)
velox/connectors/clp/ClpConnector.h (1)
  • outputType (39-45)
velox/connectors/clp/search_lib/ClpCursor.h (1)
  • query (70-72)
🪛 GitHub Actions: Run Checks
velox/connectors/clp/ClpConnectorSplit.h

[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.

velox/connectors/clp/search_lib/ClpCursor.cpp

[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.

velox/connectors/clp/ClpConnector.h

[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.

velox/connectors/clp/tests/ClpConnectorTest.cpp

[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.

🪛 Cppcheck (2.10-2)
velox/connectors/clp/tests/ClpConnectorTest.cpp

[error] 84-84: syntax error

(syntaxError)

🔇 Additional comments (13)
CMake/resolve_dependency_modules/clp.cmake (1)

1-38: LGTM: Proper dependency resolution for CLP module

The implementation correctly handles fetching and setting up the CLP dependency, including error handling for submodule updates. The download is properly secured with HTTPS and the approach aligns with other dependency modules in the project.

CMakeLists.txt (4)

126-126: LGTM: Well-defined build option for CLP connector

Adding the option with a default value of ON is consistent with other connector options in the file.


173-173: LGTM: Properly disable CLP connector for minimal builds

The change correctly disables the CLP connector when using minimal build options, consistent with the handling of other optional components.


434-438: LGTM: Required Boost components added for CLP connector

The addition of iostreams and url components to the Boost libraries list is appropriate for supporting the CLP connector's functionality.


496-497:

✅ Verification successful

Verify if spdlog is still required

A past review comment suggests that spdlog might not be needed anymore. Verify if this dependency is still required for the CLP connector.


🏁 Script executed:

#!/bin/bash
# Check if spdlog is used in the CLP connector code
echo "Checking for spdlog usage in CLP connector code..."
rg -l "spdlog" velox/connectors/clp/

Length of output: 230


spdlog dependency remains required

Search results confirm that spdlog is still referenced in the CLP connector’s build files:

• velox/connectors/clp/CMakeLists.txt
• velox/connectors/clp/search_lib/CMakeLists.txt

No changes needed—keep the existing set(spdlog_SOURCE BUNDLED) and velox_resolve_dependency(spdlog 1.12.0) lines.

CMake/FindANTLR.cmake (3)

15-17: LGTM: Properly attributed ANTLR module source

Good practice to include attribution to the original source of the module.


31-39: LGTM: Secure download of ANTLR JAR

The implementation correctly handles downloading the ANTLR JAR if not found, using HTTPS and verifying the download with a SHA256 checksum for security.


69-164: LGTM: Well-structured ANTLR target macro

The ANTLR_TARGET macro is well-structured and handles various options for generating lexer, parser, listener, and visitor components. It properly sets up dependencies and creates appropriate custom build commands.

velox/connectors/clp/ClpDataSource.cpp (1)

155-166: Use filtered row count, not scanned row count, for progress metrics

completedRows_ += rowsScanned; increments the metric with all scanned rows, not the rows that will actually be delivered (rowsFiltered).
Depending on how the metric is consumed, this can be misleading.

Would you like to change this to rowsFiltered, or clarify the semantic of “completed rows” in the header comment?

velox/connectors/clp/search_lib/ClpCursor.cpp (1)

60-68: Handle loadArchive failure explicitly

If loadArchive() returns an error, the cursor silently stops scanning and next() will return no rows without explaining why.
Propagate the error (e.g., via an exception or dedicated status accessor) so the caller can surface it.

velox/connectors/clp/search_lib/ClpCursor.h (1)

1-15: LGTM: License header properly included

The license header is correctly included at the top of the file.

velox/connectors/clp/ClpConnector.h (2)

1-15: LGTM: License header properly included

The license header is correctly included at the top of the file.

🧰 Tools
🪛 GitHub Actions: Run Checks

[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.


71-77:

❓ Verification inconclusive

Unused executor parameters in newConnector method

The ioExecutor and cpuExecutor parameters are accepted but not used in the connector instantiation. This might indicate missing functionality or unclear design.

Consider either:

  1. Using these parameters in the connector, or
  2. Documenting why they're unused

🏁 Script executed:

#!/bin/bash
# Check how other connectors use executors
echo "Checking usage of executors in other connectors:"
rg "newConnector.*ioExecutor.*cpuExecutor" --type cpp velox/connectors/ | grep -v "clp"

Length of output: 194


Unused executor parameters in ClpConnector::newConnector

The ioExecutor and cpuExecutor arguments are accepted by newConnector but never forwarded or stored when constructing the ClpConnector. This divergence from the signature suggests one of the following actions is needed:

  • Forward the executors to ClpConnector (e.g. extend its constructor and member variables).
  • Remove these parameters from the newConnector signature if they’re not required.
  • Add a clear comment in ClpConnector.h explaining why they’re intentionally unused.

Location to address:

  • velox/connectors/clp/ClpConnector.h: lines 71–77

CMakeLists.txt Outdated
Comment on lines 492 to 507
if(${VELOX_ENABLE_CLP_CONNECTOR})
set(clp_SOURCE BUNDLED)
velox_resolve_dependency(clp)

set(spdlog_SOURCE BUNDLED)
velox_resolve_dependency(spdlog 1.12.0)

velox_set_source(msgpack-cxx)
velox_resolve_dependency(msgpack-cxx)

velox_set_source(absl)
velox_resolve_dependency(absl)

find_package(CURL 7.68.0 REQUIRED)
find_package(ANTLR REQUIRED)
endif()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Check for potential conflicts with absl dependency

There might be a conflict between setting absl as a source here and in the testing block (lines 538-539). If both conditions can be true simultaneously, this could lead to duplicate dependency resolution calls.


🏁 Script executed:

#!/bin/bash
# Check if there could be conflicts when resolving absl dependency
echo "Checking potential conflicts with absl resolution..."
echo "CLP connector block (lines 492-507):"
grep -A 3 "velox_set_source(absl)" CMakeLists.txt

echo -e "\nChecking if build options can be enabled simultaneously:"
grep "VELOX_ENABLE_CLP_CONNECTOR.*ON" CMakeLists.txt
grep "VELOX_BUILD_TESTING.*ON" CMakeLists.txt

Length of output: 780


🏁 Script executed:

#!/bin/bash
# Show context around both absl resolution occurrences
grep -n -C5 "velox_set_source(absl)" CMakeLists.txt

# Find where VELOX_BUILD_TESTING is defined
grep -R -n "option.*VELOX_BUILD_TESTING" CMakeLists.txt

# Find any conditionals guarding the testing block
grep -R -n "if.*VELOX_BUILD_TESTING" CMakeLists.txt

Length of output: 1101


Prevent duplicate absl dependency resolution

Since both VELOX_ENABLE_CLP_CONNECTOR (lines 492–507) and VELOX_BUILD_TESTING (lines 536–543) default to ON, you currently invoke

velox_set_source(absl)
velox_resolve_dependency(absl)

twice. This can produce redundant configuration or warnings.

Consider consolidating absl resolution by:

  • Moving the absl calls out of both if(...) blocks so they run once.
  • Or adding a guard in the second block to skip if absl is already resolved.

Locations to update:

  • CMakeLists.txt: lines 502–503 (CLP connector block)
  • CMakeLists.txt: lines 538–539 (Testing block)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

♻️ Duplicate comments (5)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)

166-170: ⚠️ Potential issue

Duplicate timestamp-range evaluation – remove the second pass

EvaluateTimestampIndex is executed twice (lines 166-170 and again 238-242). The second invocation provides no additional information and costs an extra scan over the timestamp dictionary. Please delete one of the two blocks to avoid redundant work.

This issue has been raised in a previous review and remains unresolved.

Also applies to: 238-242


125-135: 🧹 Nitpick (assertive)

Improve readability by separating assignment from condition

The C++17 init-statement is valid, but coupling assignment and condition makes it easy to mis-read as an accidental assignment. A tiny refactor clarifies intent without changing behaviour:

-  OrOfAndForm standardizePass;
-  if (expr_ = standardizePass.run(expr_);
-      std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
+  OrOfAndForm standardizePass;
+  expr_ = standardizePass.run(expr_);
+  if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) {
     VLOG(2) << "Query '" << query_ << "' is logically false";
     return ErrorCode::LogicalError;
   }
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)

63-68: 🧹 Nitpick (assertive)

tableName parameter is unused – clean up helper signature

makeClpSplit() receives tableName but ignores it, leading to dead parameters and confusing call sites. Either forward the value to ClpConnectorSplit (if required) or drop the parameter:

-exec::Split makeClpSplit(
-    const std::string& tableName,
-    const std::string& splitPath) {
+exec::Split makeClpSplit(const std::string& splitPath) {
   return exec::Split(
       std::make_shared<ClpConnectorSplit>(kClpConnectorId, splitPath));
 }

78-81: ⚠️ Potential issue

Path resolution brittle – use source-relative path instead of CWD

The tests assume the working directory contains an examples/ folder. This fails when the test runner is started from a different location.

Resolve the path relative to the source file:

-static std::string getExampleFilePath(const std::string& filePath) {
-  std::string current_path = fs::current_path().c_str();
-  return current_path + "/examples/" + filePath;
+static std::string getExampleFilePath(const std::string& filePath) {
+  static const fs::path kExamplesDir =
+      fs::path(__FILE__).parent_path() / "examples";
+  return (kExamplesDir / filePath).string();
 }
velox/connectors/clp/ClpConnector.h (1)

65-68: ⚠️ Potential issue

Initialise TimestampPattern in the default constructor as well.

A past review already pointed this out, but the default constructor still omits the call to clp_s::TimestampPattern::init(). Any code path that instantiates ClpConnectorFactory via the default constructor will leave the static timestamp-matching infrastructure un-initialised, causing runtime failures when parsing CLP timestamps.

-  ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {}
+  ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {
+    clp_s::TimestampPattern::init();
+  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92f9fc0 and c466884.

📒 Files selected for processing (4)
  • velox/connectors/clp/ClpConnector.h (1 hunks)
  • velox/connectors/clp/ClpConnectorSplit.h (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.cpp (1 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/search_lib/ClpCursor.cpp
  • velox/connectors/clp/tests/ClpConnectorTest.cpp
🪛 Cppcheck (2.10-2)
velox/connectors/clp/tests/ClpConnectorTest.cpp

[error] 84-84: syntax error

(syntaxError)

🔇 Additional comments (1)
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)

84-125: Static analysis “syntax error” is a false positive

Cppcheck flags a syntax error at the start of the first test case, but the code is valid C++. No action required.

🧰 Tools
🪛 Cppcheck (2.10-2)

[error] 84-84: syntax error

(syntaxError)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
velox/connectors/clp/search_lib/CMakeLists.txt (1)

33-91: Manual listing of source files is error-prone
Maintaining the long CLP_SRC_FILES list by hand is brittle and has already caused duplicates in the past. Consider automating this with:

file(GLOB_RECURSE CLP_SRC_FILES
     "${CLP_SRC_DIR}/clp_s/*.cpp")

or CMake’s target_sources(... DIRECTORY ...) to recursively pick up all .cpp under clp_s.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c466884 and 6410857.

📒 Files selected for processing (1)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
🔇 Additional comments (2)
velox/connectors/clp/search_lib/CMakeLists.txt (2)

1-13: License header present and correct
The Apache 2.0 header is included at the top and matches project conventions.


15-21: Inconsistent variable naming for CLP source directories
The script alternates between clp_SOURCE_DIR (lowercase “clp”) and CLP_SRC_DIR (uppercase) when referring to the CLP codebase. This can lead to undefined-variable errors if one of them isn’t set in the parent CMake context. Please verify that both variables are properly defined, or—better yet—unify to a single consistently named variable (or use CMAKE_CURRENT_SOURCE_DIR for local paths).

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (11)
velox/connectors/clp/ClpDataSource.cpp (4)

42-46: Follow the boolean-expression guideline (false == expr)
The project’s coding standard forbids unary ! in favour of the symmetrical form.
Please rewrite the emptiness check accordingly.


107-113: Handle unknown splitSource_ explicitly

If an unsupported splitSource_ value ever reaches this branch, cursor_ will
stay nullptr and lead to undefined behaviour later on. Add an else that
throws a VELOX_USER_FAIL.


115-116: Check the return status of executeQuery

ClpCursor::executeQuery() returns an ErrorCode, but it is silently ignored
here. Propagate or at least log failures so that query issues aren’t swallowed.


138-140: Avoid eager “all-null” bitmap allocation

allocateNulls(size, …, bits::kNull) sets every bit only to have
ClpVectorLoader overwrite them. Pass nullptr instead and let the loader
allocate only when needed.

velox/connectors/clp/search_lib/ClpCursor.h (2)

40-40: Assign an explicit value to Unknown

Giving Unknown an explicit (e.g., -1) value safeguards against accidental
changes when the enum is reordered or extended.


90-90: getProjectedColumns() breaks const-correctness

A const method returning a non-const reference lets callers mutate internal
state through a supposedly const interface. Return a const& or drop const
from the method.

velox/connectors/clp/search_lib/ClpCursor.cpp (5)

60-68: open_packed_streams() exceptions leave the cursor inconsistent

If archiveReader_->open_packed_streams() throws, currentArchiveLoaded_
remains false but some earlier state has already been mutated. Wrap the call
in a try/catch, reset internal flags, and convert the failure into an
appropriate ErrorCode.


104-110: Thread-safety risk: shared mutable static vector

Returning a reference to a function-local static vector from a const method
exposes mutable state across threads without synchronisation. Use
static const or return by value.


125-130: Clarify assignment-in-condition for readability

The C++17 if (init; cond) form is legal but easy to mis-read here. Split the
assignment and the check into two statements.


150-160: Map archive-open failures to more specific error codes

All exceptions from archiveReader_->open() are collapsed into
InternalError, obscuring actionable diagnostics (file not found, permission
denied, etc.). Catch specific exceptions and translate them into dedicated
ErrorCodes.


166-170: Duplicate timestamp-range evaluation

EvaluateTimestampIndex is executed twice with identical logic, incurring
unnecessary work. Remove one invocation to avoid redundant CPU and I/O.

Also applies to: 238-242

📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6410857 and ec28c71.

📒 Files selected for processing (3)
  • velox/connectors/clp/ClpDataSource.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.h (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/search_lib/ClpCursor.cpp
🧬 Code Graph Analysis (1)
velox/connectors/clp/search_lib/ClpCursor.h (2)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
  • ClpCursor (33-37)
  • ClpCursor (39-43)
velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
  • numRows (64-66)

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
add_executable(velox_clp_connector_test ClpConnectorTest.cpp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this to velox/Makefile, there is a release-unittest: release

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this? Didn't find it

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (5)
velox/connectors/clp/CMakeLists.txt (1)

17-27: Still missing VELOX_ENABLE_CLP_CONNECTOR guard

The library is built un-conditionally, which breaks builds where the connector is disabled.
The same issue was raised in a previous round of review and has not been addressed.

velox/connectors/clp/search_lib/CMakeLists.txt (2)

22-23: Prefer CMAKE_CURRENT_SOURCE_DIR for portability

Using CMAKE_SOURCE_DIR breaks when this sub-project is included in a larger build.
The same concern has been brought up previously.


33-92: Manually curated 200-line source list is hard to maintain

Re-consider generating the list with file(GLOB_RECURSE …) or a helper script to avoid future omissions and duplicates. This was highlighted in an earlier review round.

velox/connectors/clp/ClpConnector.h (2)

17-21: Add the required STL header for std::unordered_map.

The file directly uses std::unordered_map (line 41-43) but does not include the <unordered_map> header. Relying on an indirect include from another Velox header is fragile and violates the "include what you use" rule.

 #pragma once

 #include "velox/connectors/Connector.h"
 #include "velox/connectors/clp/ClpConfig.h"
+#include <unordered_map>

67-73: Silence unused-parameter warnings for the executors.

ioExecutor and cpuExecutor are accepted but not used, which triggers -Wunused-parameter with many toolchains. Cast them to void or use [[maybe_unused]] to keep the signature future-proof while maintaining clean builds.

 std::shared_ptr<Connector> newConnector(
     const std::string& id,
     std::shared_ptr<const config::ConfigBase> config,
-    folly::Executor* ioExecutor,
-    folly::Executor* cpuExecutor) override {
+    folly::Executor* /*ioExecutor*/,
+    folly::Executor* /*cpuExecutor*/) override {
   return std::make_shared<ClpConnector>(id, config);
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec28c71 and 0cd9943.

📒 Files selected for processing (5)
  • CMake/FindANTLR.cmake (1 hunks)
  • velox/connectors/clp/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/ClpConnector.cpp (1 hunks)
  • velox/connectors/clp/ClpConnector.h (1 hunks)
  • velox/connectors/clp/search_lib/CMakeLists.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/ClpConnector.cpp
🧬 Code Graph Analysis (1)
velox/connectors/clp/ClpConnector.h (3)
velox/connectors/clp/ClpConnectorSplit.h (1)
  • nodiscard (28-30)
velox/connectors/clp/ClpConfig.h (2)
  • nodiscard (37-47)
  • nodiscard (54-57)
velox/connectors/clp/ClpTableHandle.h (2)
  • nodiscard (32-34)
  • nodiscard (36-38)
🔇 Additional comments (12)
velox/connectors/clp/CMakeLists.txt (1)

14-14: Verify that clp_SOURCE_DIR is defined before use

This variable is consumed to build include paths for CLP sources, but it is not declared in this file.
If a parent CMakeLists does not always guarantee its definition (e.g., the CLP option is OFF), configuration will fail.

if(NOT DEFINED clp_SOURCE_DIR)
  message(FATAL_ERROR "clp_SOURCE_DIR is not defined – did you forget to enable/fetch the CLP dependency?")
endif()
velox/connectors/clp/ClpConnector.cpp (7)

1-15: LGTM: License header correctly included.

The file includes the proper Apache 2.0 license header as required by the project.


17-20: LGTM: Required includes are present.

The file includes the necessary headers for TimestampPattern, ClpConnector, and ClpDataSource.


22-26: LGTM: Constructor implementation is correct.

The constructor properly initializes the base Connector class and wraps the config in a ClpConfig shared pointer.


28-41: LGTM: Data source creation is properly implemented.

The createDataSource method correctly instantiates a ClpDataSource with all required parameters from the connector context.


43-49: LGTM: Data sink implementation marked as NYI.

The createDataSink method properly uses VELOX_NYI to indicate this functionality is not yet implemented, which is appropriate for this PR.


51-54: LGTM: Default constructor now initializes TimestampPattern.

The default constructor initializes the TimestampPattern correctly, addressing a previous review comment.


56-59: LGTM: Parameterized constructor initializes TimestampPattern.

The parameterized constructor also initializes the TimestampPattern as required.

velox/connectors/clp/ClpConnector.h (4)

34-36: LGTM: Dynamic filter support is explicitly disabled.

The method correctly returns false to indicate that dynamic filters are not supported by this connector.


38-44: LGTM: Data source creation method is properly declared.

The createDataSource method declaration follows the expected interface for Velox connectors.


46-48: LGTM: Split preload support is explicitly disabled.

The method correctly returns false to indicate that split preloading is not supported by this connector.


50-55: LGTM: Data sink creation method is properly declared.

The createDataSink method is declared and implemented in the cpp file with a NYI error, which is appropriate for the current stage of implementation.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

♻️ Duplicate comments (23)
velox/connectors/clp/ClpConnectorSplit.h (2)

19-19: Add the missing fmt include to guarantee self-contained header

toString() relies on fmt::format, but the file does not include <fmt/format.h>. Relying on transitive includes breaks the One-Definition Rule hygiene and can lead to fragile build failures when the include order changes.

 #include "velox/connectors/Connector.h"
+#include <fmt/format.h>

32-32: 🧹 Nitpick (assertive)

Consider making splitPath_ private

splitPath_ is exposed as a public data member. Once exposed, any caller can rely on its existence, which tightens coupling to the struct's internal layout and hinders future refactors.

A simple fix is to mark it private and expose a const accessor:

-  const std::string splitPath_;
+private:
+  const std::string splitPath_;
+
+public:
+  [[nodiscard]] const std::string& splitPath() const { return splitPath_; }
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (1)

52-54: ⚠️ Potential issue

Use explicit boolean comparison instead of logical negation

According to the coding guidelines, you should prefer false == <expression> rather than !<expression>.

-    if (!foundReader) {
+    if (false == foundReader) {
      projectedColumns_.push_back(nullptr);
    }
CMake/FindANTLR.cmake (2)

20-20: 🛠️ Refactor suggestion

Avoid leaking compile definitions globally

add_definitions(-DANTLR4CPP_STATIC) affects every target in the build.
Define this property only on the imported ANTLR targets to prevent accidental ODR or linkage issues in unrelated modules.


157-162: ⚠️ Potential issue

-no-listener conflicts with opt-in listener generation

The custom command always passes -no-listener, but the macro adds -listener when the caller requests listeners (line 125). ANTLR picks the last flag, so the generated listeners might be silently dropped.

  COMMAND
    ${Java_JAVA_EXECUTABLE} -jar ${ANTLR_EXECUTABLE} ${InputFile} -o
-    ${ANTLR_${Name}_OUTPUT_DIR} -no-listener -Dlanguage=Cpp
+    ${ANTLR_${Name}_OUTPUT_DIR} -Dlanguage=Cpp
    ${ANTLR_TARGET_COMPILE_FLAGS}
CMake/ExternalAntlr4Cpp.cmake (1)

68-80: Re-use quoted form in all generator checks for consistency

Follow-up to the previous comment—use the quoted form in every if("${CMAKE_GENERATOR}" MATCHES …) block so the file is internally consistent and future-proof.

velox/connectors/clp/ClpDataSource.cpp (4)

105-112: Guard against unsupported storage types

If storageType_ is neither kFS nor kS3, cursor_ remains nullptr, leading to undefined behaviour later.
Add an explicit else branch that throws a VELOX_USER_FAIL.


113-114: Handle the return status of executeQuery()

executeQuery() returns an error/status code that is currently ignored.
Capturing and validating it would surface query failures early and ease debugging.


136-138: Avoid eager null-buffer allocation for lazy vectors

Allocating a full nulls bitmap only to have the ClpVectorLoader overwrite it wastes memory and cycles.
Pass nullptr and let the loader allocate nulls on demand.

-  vector->setNulls(allocateNulls(size, pool_, bits::kNull));
+  // Defer null allocation; ClpVectorLoader will allocate only if needed.

40-44: 🧹 Nitpick (assertive)

Follow project boolean-expression guideline

The code base prefers false == expr over unary !expr.

-  if (auto query = clpTableHandle->kqlQuery(); query && !query->empty()) {
+  if (auto query = clpTableHandle->kqlQuery(); query && false == query->empty()) {
velox/connectors/clp/search_lib/ClpCursor.h (1)

90-90: Still exposes mutable state through a const interface

getProjectedColumns() returns const std::vector<...>&, but the underlying pointer elements remain mutable.
If mutation is not intended, return a vector of const BaseColumnReader* or add const to the pointee type as well.

velox/connectors/clp/search_lib/ClpCursor.cpp (4)

66-71: open_packed_streams() may throw – cursor left in inconsistent state

If archiveReader_->open_packed_streams() throws, currentArchiveLoaded_ is already set to true, and subsequent calls will dereference a half-initialised cursor.
Wrap the call in a try/catch, reset flags on failure, and propagate a meaningful error code.


109-110: static non-const vector remains a data-race hazard

kEmpty should be declared const (or returned by value) to prevent unsynchronised writes from other threads that might accidentally mutate the vector.

-  static std::vector<clp_s::BaseColumnReader*> kEmpty;
+  static const std::vector<clp_s::BaseColumnReader*> kEmpty;

126-145: Chained assignment inside if statement hurts readability

The init-statement form is valid C++17, but it obscures the assignment and has already caused confusion in previous reviews.
Assign first, then test the result in a separate if to improve clarity.


167-171: Duplicate timestamp-range evaluation

EvaluateTimestampIndex is executed twice with identical logic.
Remove one invocation to avoid wasted work and simplify the control flow.

Also applies to: 239-243

velox/connectors/clp/tests/ClpConnectorTest.cpp (5)

108-109: 🧹 Nitpick (assertive)

Improve code formatting for better readability.

The calls to getResults() with the makeClpSplit() function span multiple lines and could be formatted more consistently.

-  auto output =
-      getResults(plan, {makeClpSplit(getExampleFilePath("test_1.clps"))});
+  auto output = getResults(plan, {
+      makeClpSplit(getExampleFilePath("test_1.clps"))
+  });

154-155: 🧹 Nitpick (assertive)

Improve code formatting for better readability.

Same formatting issue as in the previous test case.

-  auto output =
-      getResults(plan, {makeClpSplit(getExampleFilePath("test_1.clps"))});
+  auto output = getResults(plan, {
+      makeClpSplit(getExampleFilePath("test_1.clps"))
+  });

198-199: 🧹 Nitpick (assertive)

Improve code formatting for better readability.

Same formatting issue as in the previous test cases.

-  auto output =
-      getResults(plan, {makeClpSplit(getExampleFilePath("test_2.clps"))});
+  auto output = getResults(plan, {
+      makeClpSplit(getExampleFilePath("test_2.clps"))
+  });

246-247: 🧹 Nitpick (assertive)

Improve code formatting for better readability.

Same formatting issue as in the previous test cases.

-  auto output =
-      getResults(plan, {makeClpSplit(getExampleFilePath("test_2.clps"))});
+  auto output = getResults(plan, {
+      makeClpSplit(getExampleFilePath("test_2.clps"))
+  });

294-295: 🧹 Nitpick (assertive)

Improve code formatting for better readability.

Same formatting issue as in the previous test cases.

-  auto output =
-      getResults(plan, {makeClpSplit(getExampleFilePath("test_2.clps"))});
+  auto output = getResults(plan, {
+      makeClpSplit(getExampleFilePath("test_2.clps"))
+  });
velox/connectors/clp/ClpConnector.h (3)

17-21: ⚠️ Potential issue

Add the required STL header for std::unordered_map.

This file directly uses std::unordered_map (line 42) but does not include the <unordered_map> header. Relying on an indirect include from another Velox header is fragile and violates the "include what you use" rule.

 #pragma once

 #include "velox/connectors/Connector.h"
 #include "velox/connectors/clp/ClpConfig.h"
+#include <unordered_map>

50-55: ⚠️ Potential issue

Clarify or remove the unimplemented createDataSink.

The createDataSink() method is declared but not implemented according to the PR description, which only mentions the connector's ability to read CLP archives, not write to them. Leaving the declaration without implementation will cause link-time errors.

Either:

  1. Remove the override if writing is not supported in this version, or
  2. Add an implementation that throws a clear exception, such as:
std::unique_ptr<DataSink> createDataSink(
    RowTypePtr inputType,
    std::shared_ptr<ConnectorInsertTableHandle> connectorInsertTableHandle,
    ConnectorQueryCtx* connectorQueryCtx,
    CommitStrategy commitStrategy) override {
  VELOX_NYI("CLP data sink not implemented");
}

60-65: 🛠️ Refactor suggestion

Initialize TimestampPattern in the default constructor.

The TimestampPattern initialization is missing from the default constructor, which could lead to issues if the default constructor is used.

 class ClpConnectorFactory : public ConnectorFactory {
 public:
   static constexpr const char* kClpConnectorName = "clp";

-  ClpConnectorFactory();
+  ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {
+    // Initialize timestamp pattern for CLP
+    clp_s::TimestampPattern::init();
+  }
   explicit ClpConnectorFactory(const char* connectorName);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cd9943 and 0053396.

📒 Files selected for processing (12)
  • CMake/ExternalAntlr4Cpp.cmake (1 hunks)
  • CMake/FindANTLR.cmake (1 hunks)
  • velox/connectors/clp/ClpConfig.h (1 hunks)
  • velox/connectors/clp/ClpConnector.h (1 hunks)
  • velox/connectors/clp/ClpConnectorSplit.h (1 hunks)
  • velox/connectors/clp/ClpDataSource.cpp (1 hunks)
  • velox/connectors/clp/ClpDataSource.h (1 hunks)
  • velox/connectors/clp/ClpTableHandle.h (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.h (1 hunks)
  • velox/connectors/clp/search_lib/ClpQueryRunner.cpp (1 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/search_lib/ClpQueryRunner.cpp
  • velox/connectors/clp/search_lib/ClpCursor.cpp
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/tests/ClpConnectorTest.cpp
🧬 Code Graph Analysis (7)
velox/connectors/clp/ClpConnectorSplit.h (3)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
  • splitPath (63-66)
  • splitPath (63-63)
velox/connectors/clp/ClpConfig.h (1)
  • nodiscard (35-38)
velox/connectors/clp/ClpConnector.h (2)
  • nodiscard (29-32)
  • nodiscard (34-36)
velox/connectors/clp/ClpConfig.h (1)
velox/connectors/clp/ClpConnector.h (2)
  • nodiscard (29-32)
  • nodiscard (34-36)
velox/connectors/clp/ClpDataSource.h (1)
velox/connectors/clp/ClpConnector.h (1)
  • outputType (38-44)
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (3)
velox/connectors/clp/search_lib/ClpQueryRunner.h (3)
  • schemaReader (53-56)
  • projectedColumns_ (72-74)
  • numRows (64-66)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
  • fetchNext (53-102)
  • fetchNext (53-55)
velox/connectors/clp/search_lib/ClpCursor.h (1)
  • numRows (81-83)
velox/connectors/clp/search_lib/ClpCursor.h (2)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
  • ClpCursor (33-37)
  • ClpCursor (39-43)
velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
  • numRows (64-66)
velox/connectors/clp/ClpTableHandle.h (4)
velox/connectors/clp/ClpConfig.h (1)
  • nodiscard (35-38)
velox/connectors/clp/ClpConnectorSplit.h (1)
  • nodiscard (28-30)
velox/connectors/clp/ClpConnector.h (2)
  • nodiscard (29-32)
  • nodiscard (34-36)
velox/connectors/hive/TableHandle.h (1)
  • tableName_ (146-148)
velox/connectors/clp/ClpConnector.h (3)
velox/connectors/clp/ClpConfig.h (1)
  • nodiscard (35-38)
velox/connectors/clp/ClpConnectorSplit.h (1)
  • nodiscard (28-30)
velox/connectors/clp/ClpTableHandle.h (3)
  • nodiscard (39-41)
  • nodiscard (43-45)
  • nodiscard (47-49)
🪛 Cppcheck (2.10-2)
velox/connectors/clp/tests/ClpConnectorTest.cpp

[error] 82-82: syntax error

(syntaxError)

🔇 Additional comments (10)
velox/connectors/clp/ClpConnectorSplit.h (1)

28-30: Split path representation looks good.

The toString() method implementation uses fmt::format correctly and provides a descriptive prefix for debugging. Nice improvement over the previous version.

velox/connectors/clp/ClpConfig.h (2)

35-38: Good use of [[nodiscard]]

Proper use of the [[nodiscard]] attribute for a method that returns a reference that shouldn't be ignored.


30-33: 🧹 Nitpick (assertive)

Consider taking ownership via pass-by-value

The constructor takes a shared pointer by const reference and then moves it, which is inefficient. Since you're taking ownership of the pointer, consider taking it by value to leverage move semantics automatically:

-  explicit ClpConfig(std::shared_ptr<const config::ConfigBase> config) {
+  explicit ClpConfig(std::shared_ptr<const config::ConfigBase> config) {
     VELOX_CHECK_NOT_NULL(config, "Config is null for CLP initialization");
-    config_ = std::move(config);
+    config_ = std::move(config);
   }

This change doesn't actually modify code but improves readability by signaling the intended ownership transfer.

Likely an incorrect or invalid review comment.

velox/connectors/clp/search_lib/ClpQueryRunner.cpp (2)

41-50: The column lookup logic is well-implemented

The code correctly searches for a matching column reader from the provided column map based on node IDs. This approach supports flexible mapping between projected fields and available columns.


62-80: 🧹 Nitpick (assertive)

Ensure consistent types in the fetchNext method

The method signature uses uint64_t numRows but the local variable is declared as size_t rowsfiltered. While they're likely the same size on your platform, it's best practice to use consistent types.

uint64_t ClpQueryRunner::fetchNext(
    size_t numRows,
    const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices) {
-  size_t rowsfiltered = 0;
-  size_t rowsScanned = 0;
+  uint64_t rowsfiltered = 0;
+  uint64_t rowsScanned = 0;
  while (curMessage_ < numMessages_) {
    if (filter(curMessage_)) {
      filteredRowIndices->emplace_back(curMessage_);
      rowsfiltered += 1;
    }

    curMessage_ += 1;
    rowsScanned += 1;
    if (rowsfiltered >= numRows) {
      break;
    }
  }
  return rowsScanned;
}

Likely an incorrect or invalid review comment.

CMake/FindANTLR.cmake (1)

33-42: Good file download implementation with SHA256 verification

The implementation correctly checks if the JAR exists before downloading and uses proper SHA256 verification, making it secure and efficient.

velox/connectors/clp/ClpTableHandle.h (1)

29-37: 🧹 Nitpick (assertive)

std::move on an l-value is ineffective here

The constructor parameter kqlQuery is an l-value, so std::move(kqlQuery) just performs a copy of the smart pointer.
Either accept the parameter by value and std::move it, or drop the move call altogether:

-      std::shared_ptr<std::string> kqlQuery)
+      std::shared_ptr<std::string> kqlQuery)
       : ConnectorTableHandle(connectorId),
         tableName_(tableName),
-        storageType_(storageType),
-        kqlQuery_(std::move(kqlQuery)) {}
+        storageType_(storageType),
+        kqlQuery_(std::move(kqlQuery)) {} // if param passed by value

Likely an incorrect or invalid review comment.

velox/connectors/clp/tests/ClpConnectorTest.cpp (2)

1-16: LGTM! License header is properly included.

The Apache 2.0 license header is correctly included at the beginning of the file.


313-317: LGTM! Main function properly sets up the test environment.

The main function correctly initializes Google Test and Folly before running the tests.

velox/connectors/clp/ClpConnector.h (1)

67-73: LGTM! Properly handles unused parameters.

Unused parameters are correctly commented out to avoid compiler warnings, which is a good practice.

Comment on lines +40 to +46
if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(Configuration))
elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*")
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(CONFIGURATION))
else()
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist)
endif()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Quote generator variable in if() to avoid empty-string pitfalls

When CMAKE_GENERATOR is empty, if(${CMAKE_GENERATOR} MATCHES …) raises a CMake error.
Quote the variable to make the comparison safe:

-if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
+if("${CMAKE_GENERATOR}" MATCHES "Visual Studio.*")

Apply the same pattern to the other generator checks below (lines 40-46, 42-44, 68-80).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(Configuration))
elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*")
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(CONFIGURATION))
else()
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist)
endif()
if("${CMAKE_GENERATOR}" MATCHES "Visual Studio.*")
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(Configuration))
elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*")
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(CONFIGURATION))
else()
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist)
endif()

2. fix a bug in FindANTLR.cmake
3. download antlr jar to build directory instead of the source directory
4. add StorageType to ClpTableHandle
5. use consistent types for filteredRows across different methods
6. remove archiveSource_ from ClpConfig and config docs, use storageType_ in ClpDataSource
7. minor improvements in logs and method signatures
8. rename splitPath to path in ClpConnectorSplit
@wraymo wraymo force-pushed the clp_integration_refactored branch from 0053396 to f26ba13 Compare May 12, 2025 21:49
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (23)
velox/connectors/clp/ClpConnectorSplit.h (2)

30-31: Keep data members private to reduce coupling.

path_ is exposed publicly. Exposing internal state invites unnecessary dependencies and prevents later refactors. The accessor in the base class already exposes the split id; if callers still need the path, provide a const std::string& path() const method and make the member private.


19-21: ⚠️ Potential issue

Add missing fmt include to keep the header self-contained.

toString() relies on fmt::format, but <fmt/format.h> is not included. Depending on transitive includes breaks ODR hygiene and occasionally fails on different compilers.

 #include "velox/connectors/Connector.h"
+#include <fmt/format.h>
velox/connectors/clp/ClpConfig.h (1)

19-20: Remove unused Boost include.

<boost/algorithm/string.hpp> is not referenced in this header and adds an expensive dependency to every translation unit that includes ClpConfig.h.

CMake/FindANTLR.cmake (2)

20-21: Avoid propagating global compile definitions.

add_definitions(-DANTLR4CPP_STATIC) leaks into every target. Limit the definition to the imported ANTLR targets with target_compile_definitions() to prevent ODR issues.


157-162: -no-listener still conflicts with optional listener generation.

Passing -no-listener unconditionally overrides a caller’s -listener flag, silently dropping listener code when requested.

velox/connectors/clp/ClpTableHandle.h (2)

43-45: Remove redundant const qualifier on enum return value
The accessor returns the enum by value, therefore the leading const is meaningless and slightly misleading.

-  [[nodiscard]] const StorageType storageType() const {
+  [[nodiscard]] StorageType storageType() const {

47-49: Make the returned query string read-only
Returning a shared_ptr<std::string> gives callers write-access to the KQL text although the handle is conceptually immutable. Adopt a shared_ptr<const std::string> to enforce read-only semantics:

-  [[nodiscard]] const std::shared_ptr<std::string>& kqlQuery() const {
+  [[nodiscard]] const std::shared_ptr<const std::string>& kqlQuery() const {

This also requires adjusting the data-member and constructor parameter types.

CMake/ExternalAntlr4Cpp.cmake (1)

40-46: Quote CMAKE_GENERATOR in comparisons to avoid CMake errors
If CMAKE_GENERATOR is an empty string the un-quoted variable expands to nothing and if() fails to parse. The original recommendation to quote the variable is still unaddressed:

-if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
+if("${CMAKE_GENERATOR}" MATCHES "Visual Studio.*")
...
-elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*")
+elseif("${CMAKE_GENERATOR}" MATCHES "Xcode.*")
...
-if(${CMAKE_GENERATOR} MATCHES ".* Makefiles")
+if("${CMAKE_GENERATOR}" MATCHES ".* Makefiles")

Apply the same quoting to every generator check in this file.

Also applies to: 68-80

velox/connectors/clp/ClpDataSource.h (1)

99-101: Progress counters need atomicity for thread-safety
completedRows_ and completedBytes_ may be read concurrently by Velox. Wrap them in std::atomic<uint64_t> (or Velox’s atomic wrapper) to prevent data races.

velox/connectors/clp/ClpDataSource.cpp (6)

68-73: Use size_t for loop counter to match rowType.size()
rowType.size() returns size_t; using uint32_t risks truncation and compiler warnings on very wide rows.

-    for (uint32_t i = 0; i < rowType.size(); ++i) {
+    for (size_t i = 0; i < rowType.size(); ++i) {

113-114: Handle or propagate the result of executeQuery()
The return status is ignored; failures will go unnoticed and make debugging hard. Capture the returned error code and react appropriately (log, propagate, or throw).


136-138: Avoid eager “all-null” allocation for LazyVectors
Allocating and setting every bit to null wastes memory and work because the loader will overwrite them. Pass nullptr for the nulls buffer and let the loader allocate when needed.

-  vector->setNulls(allocateNulls(size, pool_, bits::kNull));
+  // Defer null-buffer allocation; ClpVectorLoader will allocate if required.

125-132: Loop counter type mismatch inside createVector()

Same issue as above: prefer size_t to align with rowType.size().

-    for (uint32_t i = 0; i < rowType.size(); ++i) {
+    for (size_t i = 0; i < rowType.size(); ++i) {

162-165: ⚠️ Potential issue

Compilation failure: readerIndex type mismatch

readerIndex is declared as uint64_t but createVector() requires size_t&.
Change the variable to size_t (or adjust the function signature).

-  uint64_t readerIndex = 0;
+  size_t readerIndex = 0;

38-44: 🧹 Nitpick (assertive)

Follow project boolean-expression style
Coding guidelines prefer false == expr over !expr.

-  if (auto query = clpTableHandle->kqlQuery(); query && !query->empty()) {
+  if (auto query = clpTableHandle->kqlQuery(); query && false == query->empty()) {
velox/connectors/clp/search_lib/ClpCursor.h (1)

93-98: 🧹 Nitpick (assertive)

Add concrete documentation of possible ErrorCode returns

preprocessQuery() advertises that it returns an ErrorCode, but callers (and future maintainers) have no idea which codes to expect under which circumstances. A concise list of the concrete values that may be returned and their meaning would eliminate guess-work and prevent misuse.

-   * @return The error code.
+   * @return ErrorCode::Success on success, or one of:
+   *         - ErrorCode::InvalidQuerySyntax   – the KQL could not be parsed
+   *         - ErrorCode::LogicalError         – the query simplifies to FALSE
+   *         - ErrorCode::SchemaNotFound       – no schema matches the query
+   *         - ErrorCode::InvalidTimestampRange– query’s time range has no hits
+   *         - ErrorCode::InternalError        – unexpected failure
velox/connectors/clp/search_lib/ClpCursor.cpp (5)

17-18: <sstream> include still missing – compilation is not guaranteed

std::istringstream is used in preprocessQuery() (line 114) but <sstream> is not included. Some tool-chains may compile thanks to transitive includes, others will fail.

 #include <glog/logging.h>
+#include <sstream>  // for std::istringstream

104-111: Thread-safety issue: return reference to mutable static vector

kEmpty is mutable and non-const, therefore shared mutation across threads is a data-race.

-  static std::vector<clp_s::BaseColumnReader*> kEmpty;
+  static const std::vector<clp_s::BaseColumnReader*> kEmpty;

167-171: Duplicate timestamp-range evaluation – remove one pass

EvaluateTimestampIndex is executed twice with identical logic, wasting CPU and I/O. Deleting the second block (lines 239-243) keeps behaviour intact.

-  EvaluateTimestampIndex timestamp_index(timestampDict);
-  if (EvaluatedValue::False == timestamp_index.run(expr_)) {
-    VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'";
-    return ErrorCode::InvalidTimestampRange;
-  }

Also applies to: 239-243


151-161: Convert generic InternalError into actionable codes

All failures from archiveReader_->open() collapse into InternalError, losing the root cause (e.g., file-not-found vs. permission denied). Distinct codes—or at least a descriptive message—would hugely simplify debugging.

Consider mapping common exceptions to specific ErrorCodes (FileNotFound, PermissionDenied, …).


60-71: ⚠️ Potential issue

Uncaught exception from open_packed_streams() can leave cursor in invalid state

archiveReader_->open_packed_streams() may throw (e.g., corrupt archive).
When that happens:

  1. currentArchiveLoaded_ is already true,
  2. queryRunner_ has been allocated but not fully initialised.

Subsequent calls to fetchNext() will dereference partially initialised members, risking SIGSEGV.

Wrap the call in a try / catch, reset state on failure, and propagate an ErrorCode.

-    archiveReader_->open_packed_streams();
-    currentArchiveLoaded_ = true;
+    try {
+      archiveReader_->open_packed_streams();
+      currentArchiveLoaded_ = true;
+    } catch (const std::exception& ex) {
+      VLOG(2) << "Failed to open packed streams: " << ex.what();
+      currentArchiveLoaded_ = false;
+      errorCode_ = ErrorCode::InternalError;
+      return 0;
+    }
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)

76-79: Hard-coded CWD path breaks when tests run from build directory

Deriving the examples path from fs::current_path() assumes the test executable is launched from the project root. CTest/Ninja usually run tests from the build tree, causing a “file not found”.

Prefer computing the path relative to this source file:

-    std::string current_path = fs::current_path().string();
-    return current_path + "/examples/" + filePath;
+    std::filesystem::path here = __FILE__;
+    return (here.parent_path() / "examples" / filePath).string();
velox/connectors/clp/ClpConnector.h (1)

17-21: ⚠️ Potential issue

Missing include for std::unordered_map.

The createDataSource method uses std::unordered_map but this header is not included, leading to compilation failures if no transitive include is present. Please add:

 #include "velox/connectors/clp/ClpConfig.h"
+#include <unordered_map>
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0053396 and f26ba13.

📒 Files selected for processing (12)
  • CMake/ExternalAntlr4Cpp.cmake (1 hunks)
  • CMake/FindANTLR.cmake (1 hunks)
  • velox/connectors/clp/ClpConfig.h (1 hunks)
  • velox/connectors/clp/ClpConnector.h (1 hunks)
  • velox/connectors/clp/ClpConnectorSplit.h (1 hunks)
  • velox/connectors/clp/ClpDataSource.cpp (1 hunks)
  • velox/connectors/clp/ClpDataSource.h (1 hunks)
  • velox/connectors/clp/ClpTableHandle.h (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpCursor.h (1 hunks)
  • velox/connectors/clp/search_lib/ClpQueryRunner.cpp (1 hunks)
  • velox/connectors/clp/tests/ClpConnectorTest.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/search_lib/ClpQueryRunner.cpp
  • velox/connectors/clp/search_lib/ClpCursor.cpp
  • velox/connectors/clp/ClpDataSource.cpp
  • velox/connectors/clp/tests/ClpConnectorTest.cpp
🧬 Code Graph Analysis (7)
velox/connectors/clp/ClpConnectorSplit.h (3)
velox/connectors/clp/ClpConfig.h (1)
  • nodiscard (35-38)
velox/connectors/clp/ClpConnector.h (2)
  • nodiscard (29-32)
  • nodiscard (34-36)
velox/connectors/clp/ClpTableHandle.h (3)
  • nodiscard (39-41)
  • nodiscard (43-45)
  • nodiscard (47-49)
velox/connectors/clp/ClpConfig.h (1)
velox/connectors/clp/ClpConnector.h (2)
  • nodiscard (29-32)
  • nodiscard (34-36)
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (3)
velox/connectors/clp/search_lib/ClpQueryRunner.h (3)
  • schemaReader (53-56)
  • projectedColumns_ (72-74)
  • numRows (64-66)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
  • fetchNext (53-102)
  • fetchNext (53-55)
velox/connectors/clp/search_lib/ClpCursor.h (1)
  • numRows (81-83)
velox/connectors/clp/ClpTableHandle.h (4)
velox/connectors/clp/ClpConnectorSplit.h (1)
  • nodiscard (26-28)
velox/connectors/clp/ClpConfig.h (1)
  • nodiscard (35-38)
velox/connectors/clp/ClpConnector.h (2)
  • nodiscard (29-32)
  • nodiscard (34-36)
velox/connectors/hive/TableHandle.h (1)
  • tableName_ (146-148)
velox/connectors/clp/ClpDataSource.h (1)
velox/connectors/clp/ClpConnector.h (1)
  • outputType (38-44)
velox/connectors/clp/search_lib/ClpCursor.h (2)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
  • ClpCursor (33-37)
  • ClpCursor (39-43)
velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
  • numRows (64-66)
velox/connectors/clp/ClpConnector.h (3)
velox/connectors/clp/ClpConnectorSplit.h (1)
  • nodiscard (26-28)
velox/connectors/clp/ClpConfig.h (1)
  • nodiscard (35-38)
velox/connectors/clp/ClpTableHandle.h (3)
  • nodiscard (39-41)
  • nodiscard (43-45)
  • nodiscard (47-49)
🪛 Cppcheck (2.10-2)
velox/connectors/clp/tests/ClpConnectorTest.cpp

[error] 82-82: syntax error

(syntaxError)

🔇 Additional comments (7)
velox/connectors/clp/search_lib/ClpCursor.cpp (1)

92-97: Potential stale data in filteredRowIndices across schema iterations

filteredRowIndices is reused between schema batches but never cleared.
If a previous schema produced matches and a later one does not, callers may mis-interpret residual indices.

Please either:

  1. filteredRowIndices->clear() at the top of fetchNext(), or
  2. Document that the caller must pass in a fresh (empty) vector each time.
velox/connectors/clp/ClpConnector.h (6)

1-15: License header present and correct.

The Apache 2.0 license header is properly included at the top of the file.


34-37: Dynamic filters are correctly disabled.

Overriding canAddDynamicFilter() to return false aligns with the CLP connector’s current capabilities.


38-44: Data source creation signature is sound.

The createDataSource override correctly declares all required parameters. Ensure the implementation rigorously validates and maps the Presto column handles to CLP projections.


46-48: Split preload support is correctly disabled.

Returning false from supportsSplitPreload() is appropriate since CLP archives do not support split preloading.


60-66: Ensure TimestampPattern is initialized in all constructors.

The default constructor of ClpConnectorFactory needs to call clp_s::TimestampPattern::init(); to mirror the explicit constructor and avoid uninitialized state:

-ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {}
+ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {
+  clp_s::TimestampPattern::init();
+}

[ suggest_essential_refactor]


67-74: Connector factory instantiation is correct.

The newConnector method cleanly constructs a ClpConnector and properly silences unused executor parameters.

2. use consistent type for numRows in ClpQueryRunner
3. remove spdlog version in CMakeLists.txt and change find_package to resolve_dependency for curl
4. fix readerIndex type in ClpDataSource
5. add SMALLINT in ClpDataSource
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (10)
velox/connectors/clp/CMakeLists.txt (2)

29-31: Modern CMake: avoid dereferencing variables in if() conditions.

if(${VELOX_BUILD_TESTING}) first expands the variable and then evaluates the
result as a string, which breaks if the value is unset or empty.
Prefer:

-if(${VELOX_BUILD_TESTING})
+if (VELOX_BUILD_TESTING)

This pattern is already recommended in earlier reviews.


17-23: 🛠️ Refactor suggestion

Add build-option guard to honour VELOX_ENABLE_CLP_CONNECTOR.

This file is always processed, even when the top-level option explicitly disables the connector (see lines 166-182 of the root CMakeLists.txt).
Wrap the entire library definition with the same guard used elsewhere to avoid configuring unused targets and their heavy third-party dependencies.

+if (VELOX_ENABLE_CLP_CONNECTOR)
 velox_add_library(
   velox_clp_connector
   OBJECT
   ClpConnector.cpp
   ClpDataSource.cpp
   ClpTableHandle.cpp)
+endif()
CMakeLists.txt (1)

492-508: Duplicate dependency resolution for absl, curl, etc.

absl (and other libs) are resolved here under the CLP connector guard and again in the unit-testing block at lines 537-543.
When both VELOX_ENABLE_CLP_CONNECTOR and VELOX_BUILD_TESTING are ON (the defaults), CMake re-invokes velox_set_source()/velox_resolve_dependency() on the same package, which can emit warnings or produce inconsistent settings.

Consolidate into a single place or add a if(NOT TARGET absl::absl) style guard in the second occurrence.

velox/connectors/clp/search_lib/ClpQueryRunner.cpp (1)

52-54: Follow boolean-expression guideline.

Per the project coding rules, favour an explicit comparison: false == foundReader instead of !foundReader.

-    if (!foundReader) {
+    if (false == foundReader) {
velox/connectors/clp/ClpDataSource.cpp (6)

40-44: Use explicit boolean comparison as per project style.

-  if (auto query = clpTableHandle->kqlQuery(); query && !query->empty()) {
+  if (auto query = clpTableHandle->kqlQuery(); query && false == query->empty()) {

68-74: Loop counter should match rowType.size() return type.

rowType.size() returns size_t; using uint32_t risks truncation for wide
rows and triggers signed/unsigned warnings.

-    for (uint32_t i = 0; i < rowType.size(); ++i) {
+    for (size_t i = 0; i < rowType.size(); ++i) {

114-115: Check return value of executeQuery for early failures.

cursor_->executeQuery() returns an ErrorCode, but the code discards it.
Capturing and validating the result will surface issues sooner and avoids
silently proceeding with an invalid cursor state.

-  cursor_->executeQuery(kqlQuery_, fields_);
+  auto err = cursor_->executeQuery(kqlQuery_, fields_);
+  VELOX_CHECK(err == search_lib::ErrorCode::Success,
+              "executeQuery failed with error code {}", static_cast<int>(err));

138-139: Avoid eagerly zero-initialising null buffers for lazy vectors.

LazyVector will allocate and fill the nulls buffer on demand. Passing a
pre-initialised all-null buffer wastes memory and CPU every batch.

-  vector->setNulls(allocateNulls(size, pool_, bits::kNull));
+  // Defer null-buffer allocation; ClpVectorLoader will allocate if needed.

120-123: ⚠️ Potential issue

Potential type mismatch between uint64_t and size_t.

Several APIs in the search library use size_t for row indices. Mixing
std::vector<uint64_t> here can force needless casts and may not compile if
the loader expects size_t.

Confirm both sides use the same type; otherwise switch to size_t:

-    const std::shared_ptr<std::vector<uint64_t>>& filteredRows,
+    const std::shared_ptr<std::vector<size_t>>& filteredRows,

Similarly update the allocation sites and ClpVectorLoader.


106-113: ⚠️ Potential issue

Guard against unsupported storage types.

If future code adds a new StorageType, cursor_ would stay nullptr,
leading to undefined behaviour when dereferenced. Add an explicit else
branch that throws a user-level error.

   } else if (storageType_ == ClpTableHandle::StorageType::kS3) {
     cursor_ = std::make_unique<search_lib::ClpCursor>(
         clp_s::InputSource::Network, clpSplit->path_);
+  } else {
+    VELOX_USER_FAIL(
+        "Unsupported storage type '{}'",
+        static_cast<int>(storageType_));
   }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f26ba13 and 9834a69.

📒 Files selected for processing (4)
  • CMakeLists.txt (4 hunks)
  • velox/connectors/clp/CMakeLists.txt (1 hunks)
  • velox/connectors/clp/ClpDataSource.cpp (1 hunks)
  • velox/connectors/clp/search_lib/ClpQueryRunner.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.

**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

  • velox/connectors/clp/search_lib/ClpQueryRunner.cpp
  • velox/connectors/clp/ClpDataSource.cpp
🧬 Code Graph Analysis (1)
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (3)
velox/connectors/clp/search_lib/ClpQueryRunner.h (3)
  • schemaReader (53-56)
  • projectedColumns_ (72-74)
  • numRows (64-66)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
  • fetchNext (53-102)
  • fetchNext (53-55)
velox/connectors/clp/search_lib/ClpCursor.h (1)
  • numRows (81-83)

Comment on lines +65 to +72
size_t rowsfiltered = 0;
size_t rowsScanned = 0;
while (curMessage_ < numMessages_) {
if (filter(curMessage_)) {
filteredRowIndices->emplace_back(curMessage_);
rowsfiltered += 1;
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick (assertive)

Minor: consistent naming and type for counters.

rowsfiltered mixes snake-case with camel-case and differs from the later
rowsScanned. Moreover, both counters track row counts and can exceed size_t
on some platforms; prefer uint64_t for symmetry with numRows.

-  size_t rowsfiltered = 0;
+  uint64_t rowsFiltered = 0;
...
-      rowsfiltered += 1;
+      rowsFiltered += 1;
...
-    if (rowsfiltered >= numRows) {
+    if (rowsFiltered >= numRows) {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants