-
Notifications
You must be signed in to change notification settings - Fork 0
feat: Add CLP connector #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 22
♻️ Duplicate comments (2)
velox/connectors/clp/CMakeLists.txt (1)
1-44
: 🛠️ Refactor suggestionConditionally compile with VELOX_ENABLE_CLP_CONNECTOR option
The entire CMakeLists.txt should be wrapped in a conditional check to respect the
VELOX_ENABLE_CLP_CONNECTOR
option that was added to the main CMakeLists.txt. This ensures the connector is only built when explicitly enabled.+if(VELOX_ENABLE_CLP_CONNECTOR) # Copyright (c) Facebook, Inc. and its affiliates. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. set(CLP_SRC_DIR ${clp_SOURCE_DIR}/components/core/src) add_subdirectory(search_lib) velox_add_library( velox_clp_connector OBJECT ClpConnector.cpp ClpDataSource.cpp ClpTableHandle.cpp) velox_link_libraries( velox_clp_connector PRIVATE clp-s-search velox_connector simdjson::simdjson msgpack-cxx spdlog::spdlog absl::flat_hash_map Boost::filesystem Boost::headers) velox_include_directories( velox_clp_connector PUBLIC ${clp_SOURCE_DIR}/components/core/submodules ${CLP_SRC_DIR}) target_compile_features(velox_clp_connector PRIVATE cxx_std_20) if(${VELOX_BUILD_TESTING}) add_subdirectory(tests) endif() +endif()velox/connectors/clp/ClpConnector.h (1)
65-69
:⚠️ Potential issueInitialize TimestampPattern in default constructor
The
TimestampPattern::init()
is called in the explicit constructor but not in the default constructor. This could lead to initialization issues if the default constructor is used.Apply the following fix:
- ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {} + ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) { + clp_s::TimestampPattern::init(); + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (11)
CMake/FindANTLR.cmake
(1 hunks)CMake/resolve_dependency_modules/clp.cmake
(1 hunks)CMakeLists.txt
(4 hunks)velox/connectors/clp/CMakeLists.txt
(1 hunks)velox/connectors/clp/ClpConnector.h
(1 hunks)velox/connectors/clp/ClpConnectorSplit.h
(1 hunks)velox/connectors/clp/ClpDataSource.cpp
(1 hunks)velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.h
(1 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
velox/connectors/clp/search_lib/ClpCursor.cpp
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/tests/ClpConnectorTest.cpp
🧬 Code Graph Analysis (2)
velox/connectors/clp/search_lib/ClpCursor.h (2)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
ClpCursor
(33-37)ClpCursor
(39-43)velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
numRows
(64-66)
velox/connectors/clp/ClpDataSource.cpp (3)
velox/connectors/clp/ClpDataSource.h (6)
ClpDataSource
(28-35)columnType
(68-70)split
(37-37)type
(86-91)size
(39-40)completedRows_
(52-54)velox/connectors/clp/ClpConnector.h (1)
outputType
(39-45)velox/connectors/clp/search_lib/ClpCursor.h (1)
query
(70-72)
🪛 GitHub Actions: Run Checks
velox/connectors/clp/ClpConnectorSplit.h
[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.
velox/connectors/clp/search_lib/ClpCursor.cpp
[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.
velox/connectors/clp/ClpConnector.h
[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.
velox/connectors/clp/tests/ClpConnectorTest.cpp
[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.
🪛 Cppcheck (2.10-2)
velox/connectors/clp/tests/ClpConnectorTest.cpp
[error] 84-84: syntax error
(syntaxError)
🔇 Additional comments (13)
CMake/resolve_dependency_modules/clp.cmake (1)
1-38
: LGTM: Proper dependency resolution for CLP moduleThe implementation correctly handles fetching and setting up the CLP dependency, including error handling for submodule updates. The download is properly secured with HTTPS and the approach aligns with other dependency modules in the project.
CMakeLists.txt (4)
126-126
: LGTM: Well-defined build option for CLP connectorAdding the option with a default value of ON is consistent with other connector options in the file.
173-173
: LGTM: Properly disable CLP connector for minimal buildsThe change correctly disables the CLP connector when using minimal build options, consistent with the handling of other optional components.
434-438
: LGTM: Required Boost components added for CLP connectorThe addition of
iostreams
andurl
components to the Boost libraries list is appropriate for supporting the CLP connector's functionality.
496-497
:✅ Verification successful
Verify if spdlog is still required
A past review comment suggests that spdlog might not be needed anymore. Verify if this dependency is still required for the CLP connector.
🏁 Script executed:
#!/bin/bash # Check if spdlog is used in the CLP connector code echo "Checking for spdlog usage in CLP connector code..." rg -l "spdlog" velox/connectors/clp/Length of output: 230
spdlog dependency remains required
Search results confirm that spdlog is still referenced in the CLP connector’s build files:
• velox/connectors/clp/CMakeLists.txt
• velox/connectors/clp/search_lib/CMakeLists.txtNo changes needed—keep the existing
set(spdlog_SOURCE BUNDLED)
andvelox_resolve_dependency(spdlog 1.12.0)
lines.CMake/FindANTLR.cmake (3)
15-17
: LGTM: Properly attributed ANTLR module sourceGood practice to include attribution to the original source of the module.
31-39
: LGTM: Secure download of ANTLR JARThe implementation correctly handles downloading the ANTLR JAR if not found, using HTTPS and verifying the download with a SHA256 checksum for security.
69-164
: LGTM: Well-structured ANTLR target macroThe
ANTLR_TARGET
macro is well-structured and handles various options for generating lexer, parser, listener, and visitor components. It properly sets up dependencies and creates appropriate custom build commands.velox/connectors/clp/ClpDataSource.cpp (1)
155-166
: Use filtered row count, not scanned row count, for progress metrics
completedRows_ += rowsScanned;
increments the metric with all scanned rows, not the rows that will actually be delivered (rowsFiltered
).
Depending on how the metric is consumed, this can be misleading.Would you like to change this to
rowsFiltered
, or clarify the semantic of “completed rows” in the header comment?velox/connectors/clp/search_lib/ClpCursor.cpp (1)
60-68
: HandleloadArchive
failure explicitlyIf
loadArchive()
returns an error, the cursor silently stops scanning andnext()
will return no rows without explaining why.
Propagate the error (e.g., via an exception or dedicated status accessor) so the caller can surface it.velox/connectors/clp/search_lib/ClpCursor.h (1)
1-15
: LGTM: License header properly includedThe license header is correctly included at the top of the file.
velox/connectors/clp/ClpConnector.h (2)
1-15
: LGTM: License header properly includedThe license header is correctly included at the top of the file.
🧰 Tools
🪛 GitHub Actions: Run Checks
[error] 1-1: Formatting issues detected. Please apply fix using the provided patch.
71-77
:❓ Verification inconclusive
Unused executor parameters in
newConnector
methodThe
ioExecutor
andcpuExecutor
parameters are accepted but not used in the connector instantiation. This might indicate missing functionality or unclear design.Consider either:
- Using these parameters in the connector, or
- Documenting why they're unused
🏁 Script executed:
#!/bin/bash # Check how other connectors use executors echo "Checking usage of executors in other connectors:" rg "newConnector.*ioExecutor.*cpuExecutor" --type cpp velox/connectors/ | grep -v "clp"Length of output: 194
Unused executor parameters in ClpConnector::newConnector
The
ioExecutor
andcpuExecutor
arguments are accepted by newConnector but never forwarded or stored when constructing the ClpConnector. This divergence from the signature suggests one of the following actions is needed:
- Forward the executors to ClpConnector (e.g. extend its constructor and member variables).
- Remove these parameters from the newConnector signature if they’re not required.
- Add a clear comment in ClpConnector.h explaining why they’re intentionally unused.
Location to address:
- velox/connectors/clp/ClpConnector.h: lines 71–77
CMakeLists.txt
Outdated
if(${VELOX_ENABLE_CLP_CONNECTOR}) | ||
set(clp_SOURCE BUNDLED) | ||
velox_resolve_dependency(clp) | ||
|
||
set(spdlog_SOURCE BUNDLED) | ||
velox_resolve_dependency(spdlog 1.12.0) | ||
|
||
velox_set_source(msgpack-cxx) | ||
velox_resolve_dependency(msgpack-cxx) | ||
|
||
velox_set_source(absl) | ||
velox_resolve_dependency(absl) | ||
|
||
find_package(CURL 7.68.0 REQUIRED) | ||
find_package(ANTLR REQUIRED) | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Check for potential conflicts with absl dependency
There might be a conflict between setting absl
as a source here and in the testing block (lines 538-539). If both conditions can be true simultaneously, this could lead to duplicate dependency resolution calls.
🏁 Script executed:
#!/bin/bash
# Check if there could be conflicts when resolving absl dependency
echo "Checking potential conflicts with absl resolution..."
echo "CLP connector block (lines 492-507):"
grep -A 3 "velox_set_source(absl)" CMakeLists.txt
echo -e "\nChecking if build options can be enabled simultaneously:"
grep "VELOX_ENABLE_CLP_CONNECTOR.*ON" CMakeLists.txt
grep "VELOX_BUILD_TESTING.*ON" CMakeLists.txt
Length of output: 780
🏁 Script executed:
#!/bin/bash
# Show context around both absl resolution occurrences
grep -n -C5 "velox_set_source(absl)" CMakeLists.txt
# Find where VELOX_BUILD_TESTING is defined
grep -R -n "option.*VELOX_BUILD_TESTING" CMakeLists.txt
# Find any conditionals guarding the testing block
grep -R -n "if.*VELOX_BUILD_TESTING" CMakeLists.txt
Length of output: 1101
Prevent duplicate absl dependency resolution
Since both VELOX_ENABLE_CLP_CONNECTOR
(lines 492–507) and VELOX_BUILD_TESTING
(lines 536–543) default to ON, you currently invoke
velox_set_source(absl)
velox_resolve_dependency(absl)
twice. This can produce redundant configuration or warnings.
Consider consolidating absl resolution by:
- Moving the absl calls out of both
if(...)
blocks so they run once. - Or adding a guard in the second block to skip if absl is already resolved.
Locations to update:
- CMakeLists.txt: lines 502–503 (CLP connector block)
- CMakeLists.txt: lines 538–539 (Testing block)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 9
♻️ Duplicate comments (5)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
166-170
:⚠️ Potential issueDuplicate timestamp-range evaluation – remove the second pass
EvaluateTimestampIndex
is executed twice (lines 166-170 and again 238-242). The second invocation provides no additional information and costs an extra scan over the timestamp dictionary. Please delete one of the two blocks to avoid redundant work.This issue has been raised in a previous review and remains unresolved.
Also applies to: 238-242
125-135
: 🧹 Nitpick (assertive)Improve readability by separating assignment from condition
The C++17 init-statement is valid, but coupling assignment and condition makes it easy to mis-read as an accidental assignment. A tiny refactor clarifies intent without changing behaviour:
- OrOfAndForm standardizePass; - if (expr_ = standardizePass.run(expr_); - std::dynamic_pointer_cast<EmptyExpr>(expr_)) { + OrOfAndForm standardizePass; + expr_ = standardizePass.run(expr_); + if (std::dynamic_pointer_cast<EmptyExpr>(expr_)) { VLOG(2) << "Query '" << query_ << "' is logically false"; return ErrorCode::LogicalError; }velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
63-68
: 🧹 Nitpick (assertive)
tableName
parameter is unused – clean up helper signature
makeClpSplit()
receivestableName
but ignores it, leading to dead parameters and confusing call sites. Either forward the value toClpConnectorSplit
(if required) or drop the parameter:-exec::Split makeClpSplit( - const std::string& tableName, - const std::string& splitPath) { +exec::Split makeClpSplit(const std::string& splitPath) { return exec::Split( std::make_shared<ClpConnectorSplit>(kClpConnectorId, splitPath)); }
78-81
:⚠️ Potential issuePath resolution brittle – use source-relative path instead of CWD
The tests assume the working directory contains an
examples/
folder. This fails when the test runner is started from a different location.Resolve the path relative to the source file:
-static std::string getExampleFilePath(const std::string& filePath) { - std::string current_path = fs::current_path().c_str(); - return current_path + "/examples/" + filePath; +static std::string getExampleFilePath(const std::string& filePath) { + static const fs::path kExamplesDir = + fs::path(__FILE__).parent_path() / "examples"; + return (kExamplesDir / filePath).string(); }velox/connectors/clp/ClpConnector.h (1)
65-68
:⚠️ Potential issueInitialise
TimestampPattern
in the default constructor as well.A past review already pointed this out, but the default constructor still omits the call to
clp_s::TimestampPattern::init()
. Any code path that instantiatesClpConnectorFactory
via the default constructor will leave the static timestamp-matching infrastructure un-initialised, causing runtime failures when parsing CLP timestamps.- ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {} + ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) { + clp_s::TimestampPattern::init(); + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
velox/connectors/clp/ClpConnector.h
(1 hunks)velox/connectors/clp/ClpConnectorSplit.h
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.cpp
(1 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
velox/connectors/clp/search_lib/ClpCursor.cpp
velox/connectors/clp/tests/ClpConnectorTest.cpp
🪛 Cppcheck (2.10-2)
velox/connectors/clp/tests/ClpConnectorTest.cpp
[error] 84-84: syntax error
(syntaxError)
🔇 Additional comments (1)
velox/connectors/clp/tests/ClpConnectorTest.cpp (1)
84-125
: Static analysis “syntax error” is a false positiveCppcheck flags a syntax error at the start of the first test case, but the code is valid C++. No action required.
🧰 Tools
🪛 Cppcheck (2.10-2)
[error] 84-84: syntax error
(syntaxError)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
velox/connectors/clp/search_lib/CMakeLists.txt (1)
33-91
: Manual listing of source files is error-prone
Maintaining the longCLP_SRC_FILES
list by hand is brittle and has already caused duplicates in the past. Consider automating this with:file(GLOB_RECURSE CLP_SRC_FILES "${CLP_SRC_DIR}/clp_s/*.cpp")or CMake’s
target_sources(... DIRECTORY ...)
to recursively pick up all.cpp
underclp_s
.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)
🔇 Additional comments (2)
velox/connectors/clp/search_lib/CMakeLists.txt (2)
1-13
: License header present and correct
The Apache 2.0 header is included at the top and matches project conventions.
15-21
: Inconsistent variable naming for CLP source directories
The script alternates betweenclp_SOURCE_DIR
(lowercase “clp”) andCLP_SRC_DIR
(uppercase) when referring to the CLP codebase. This can lead to undefined-variable errors if one of them isn’t set in the parent CMake context. Please verify that both variables are properly defined, or—better yet—unify to a single consistently named variable (or useCMAKE_CURRENT_SOURCE_DIR
for local paths).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (11)
velox/connectors/clp/ClpDataSource.cpp (4)
42-46
: Follow the boolean-expression guideline (false == expr
)
The project’s coding standard forbids unary!
in favour of the symmetrical form.
Please rewrite the emptiness check accordingly.
107-113
: Handle unknownsplitSource_
explicitlyIf an unsupported
splitSource_
value ever reaches this branch,cursor_
will
staynullptr
and lead to undefined behaviour later on. Add anelse
that
throws aVELOX_USER_FAIL
.
115-116
: Check the return status ofexecuteQuery
ClpCursor::executeQuery()
returns anErrorCode
, but it is silently ignored
here. Propagate or at least log failures so that query issues aren’t swallowed.
138-140
: Avoid eager “all-null” bitmap allocation
allocateNulls(size, …, bits::kNull)
sets every bit only to have
ClpVectorLoader
overwrite them. Passnullptr
instead and let the loader
allocate only when needed.velox/connectors/clp/search_lib/ClpCursor.h (2)
40-40
: Assign an explicit value toUnknown
Giving
Unknown
an explicit (e.g.,-1
) value safeguards against accidental
changes when the enum is reordered or extended.
90-90
:getProjectedColumns()
breaks const-correctnessA
const
method returning a non-const reference lets callers mutate internal
state through a supposedly const interface. Return aconst&
or dropconst
from the method.velox/connectors/clp/search_lib/ClpCursor.cpp (5)
60-68
:open_packed_streams()
exceptions leave the cursor inconsistentIf
archiveReader_->open_packed_streams()
throws,currentArchiveLoaded_
remainsfalse
but some earlier state has already been mutated. Wrap the call
in a try/catch, reset internal flags, and convert the failure into an
appropriateErrorCode
.
104-110
: Thread-safety risk: shared mutable static vectorReturning a reference to a function-local
static
vector from a const method
exposes mutable state across threads without synchronisation. Use
static const
or return by value.
125-130
: Clarify assignment-in-condition for readabilityThe C++17
if (init; cond)
form is legal but easy to mis-read here. Split the
assignment and the check into two statements.
150-160
: Map archive-open failures to more specific error codesAll exceptions from
archiveReader_->open()
are collapsed into
InternalError
, obscuring actionable diagnostics (file not found, permission
denied, etc.). Catch specific exceptions and translate them into dedicated
ErrorCode
s.
166-170
: Duplicate timestamp-range evaluation
EvaluateTimestampIndex
is executed twice with identical logic, incurring
unnecessary work. Remove one invocation to avoid redundant CPU and I/O.Also applies to: 238-242
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (3)
velox/connectors/clp/ClpDataSource.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.h
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/search_lib/ClpCursor.cpp
🧬 Code Graph Analysis (1)
velox/connectors/clp/search_lib/ClpCursor.h (2)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
ClpCursor
(33-37)ClpCursor
(39-43)velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
numRows
(64-66)
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
add_executable(velox_clp_connector_test ClpConnectorTest.cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this to velox/Makefile
, there is a release-unittest: release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this? Didn't find it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
♻️ Duplicate comments (5)
velox/connectors/clp/CMakeLists.txt (1)
17-27
: Still missingVELOX_ENABLE_CLP_CONNECTOR
guardThe library is built un-conditionally, which breaks builds where the connector is disabled.
The same issue was raised in a previous round of review and has not been addressed.velox/connectors/clp/search_lib/CMakeLists.txt (2)
22-23
: PreferCMAKE_CURRENT_SOURCE_DIR
for portabilityUsing
CMAKE_SOURCE_DIR
breaks when this sub-project is included in a larger build.
The same concern has been brought up previously.
33-92
: Manually curated 200-line source list is hard to maintainRe-consider generating the list with
file(GLOB_RECURSE …)
or a helper script to avoid future omissions and duplicates. This was highlighted in an earlier review round.velox/connectors/clp/ClpConnector.h (2)
17-21
: Add the required STL header forstd::unordered_map
.The file directly uses
std::unordered_map
(line 41-43) but does not include the<unordered_map>
header. Relying on an indirect include from another Velox header is fragile and violates the "include what you use" rule.#pragma once #include "velox/connectors/Connector.h" #include "velox/connectors/clp/ClpConfig.h" +#include <unordered_map>
67-73
: Silence unused-parameter warnings for the executors.
ioExecutor
andcpuExecutor
are accepted but not used, which triggers-Wunused-parameter
with many toolchains. Cast them tovoid
or use[[maybe_unused]]
to keep the signature future-proof while maintaining clean builds.std::shared_ptr<Connector> newConnector( const std::string& id, std::shared_ptr<const config::ConfigBase> config, - folly::Executor* ioExecutor, - folly::Executor* cpuExecutor) override { + folly::Executor* /*ioExecutor*/, + folly::Executor* /*cpuExecutor*/) override { return std::make_shared<ClpConnector>(id, config); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
CMake/FindANTLR.cmake
(1 hunks)velox/connectors/clp/CMakeLists.txt
(1 hunks)velox/connectors/clp/ClpConnector.cpp
(1 hunks)velox/connectors/clp/ClpConnector.h
(1 hunks)velox/connectors/clp/search_lib/CMakeLists.txt
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
velox/connectors/clp/ClpConnector.cpp
🧬 Code Graph Analysis (1)
velox/connectors/clp/ClpConnector.h (3)
velox/connectors/clp/ClpConnectorSplit.h (1)
nodiscard
(28-30)velox/connectors/clp/ClpConfig.h (2)
nodiscard
(37-47)nodiscard
(54-57)velox/connectors/clp/ClpTableHandle.h (2)
nodiscard
(32-34)nodiscard
(36-38)
🔇 Additional comments (12)
velox/connectors/clp/CMakeLists.txt (1)
14-14
: Verify thatclp_SOURCE_DIR
is defined before useThis variable is consumed to build include paths for CLP sources, but it is not declared in this file.
If a parent CMakeLists does not always guarantee its definition (e.g., the CLP option is OFF), configuration will fail.if(NOT DEFINED clp_SOURCE_DIR) message(FATAL_ERROR "clp_SOURCE_DIR is not defined – did you forget to enable/fetch the CLP dependency?") endif()velox/connectors/clp/ClpConnector.cpp (7)
1-15
: LGTM: License header correctly included.The file includes the proper Apache 2.0 license header as required by the project.
17-20
: LGTM: Required includes are present.The file includes the necessary headers for TimestampPattern, ClpConnector, and ClpDataSource.
22-26
: LGTM: Constructor implementation is correct.The constructor properly initializes the base Connector class and wraps the config in a ClpConfig shared pointer.
28-41
: LGTM: Data source creation is properly implemented.The createDataSource method correctly instantiates a ClpDataSource with all required parameters from the connector context.
43-49
: LGTM: Data sink implementation marked as NYI.The createDataSink method properly uses VELOX_NYI to indicate this functionality is not yet implemented, which is appropriate for this PR.
51-54
: LGTM: Default constructor now initializes TimestampPattern.The default constructor initializes the TimestampPattern correctly, addressing a previous review comment.
56-59
: LGTM: Parameterized constructor initializes TimestampPattern.The parameterized constructor also initializes the TimestampPattern as required.
velox/connectors/clp/ClpConnector.h (4)
34-36
: LGTM: Dynamic filter support is explicitly disabled.The method correctly returns false to indicate that dynamic filters are not supported by this connector.
38-44
: LGTM: Data source creation method is properly declared.The createDataSource method declaration follows the expected interface for Velox connectors.
46-48
: LGTM: Split preload support is explicitly disabled.The method correctly returns false to indicate that split preloading is not supported by this connector.
50-55
: LGTM: Data sink creation method is properly declared.The createDataSink method is declared and implemented in the cpp file with a NYI error, which is appropriate for the current stage of implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
♻️ Duplicate comments (23)
velox/connectors/clp/ClpConnectorSplit.h (2)
19-19
: Add the missingfmt
include to guarantee self-contained header
toString()
relies onfmt::format
, but the file does not include<fmt/format.h>
. Relying on transitive includes breaks the One-Definition Rule hygiene and can lead to fragile build failures when the include order changes.#include "velox/connectors/Connector.h" +#include <fmt/format.h>
32-32
: 🧹 Nitpick (assertive)Consider making
splitPath_
private
splitPath_
is exposed as a public data member. Once exposed, any caller can rely on its existence, which tightens coupling to the struct's internal layout and hinders future refactors.A simple fix is to mark it
private
and expose a const accessor:- const std::string splitPath_; +private: + const std::string splitPath_; + +public: + [[nodiscard]] const std::string& splitPath() const { return splitPath_; }velox/connectors/clp/search_lib/ClpQueryRunner.cpp (1)
52-54
:⚠️ Potential issueUse explicit boolean comparison instead of logical negation
According to the coding guidelines, you should prefer
false == <expression>
rather than!<expression>
.- if (!foundReader) { + if (false == foundReader) { projectedColumns_.push_back(nullptr); }CMake/FindANTLR.cmake (2)
20-20
: 🛠️ Refactor suggestionAvoid leaking compile definitions globally
add_definitions(-DANTLR4CPP_STATIC)
affects every target in the build.
Define this property only on the imported ANTLR targets to prevent accidental ODR or linkage issues in unrelated modules.
157-162
:⚠️ Potential issue
-no-listener
conflicts with opt-in listener generationThe custom command always passes
-no-listener
, but the macro adds-listener
when the caller requests listeners (line 125). ANTLR picks the last flag, so the generated listeners might be silently dropped.COMMAND ${Java_JAVA_EXECUTABLE} -jar ${ANTLR_EXECUTABLE} ${InputFile} -o - ${ANTLR_${Name}_OUTPUT_DIR} -no-listener -Dlanguage=Cpp + ${ANTLR_${Name}_OUTPUT_DIR} -Dlanguage=Cpp ${ANTLR_TARGET_COMPILE_FLAGS}CMake/ExternalAntlr4Cpp.cmake (1)
68-80
: Re-use quoted form in all generator checks for consistencyFollow-up to the previous comment—use the quoted form in every
if("${CMAKE_GENERATOR}" MATCHES …)
block so the file is internally consistent and future-proof.velox/connectors/clp/ClpDataSource.cpp (4)
105-112
: Guard against unsupported storage typesIf
storageType_
is neitherkFS
norkS3
,cursor_
remainsnullptr
, leading to undefined behaviour later.
Add an explicitelse
branch that throws aVELOX_USER_FAIL
.
113-114
: Handle the return status ofexecuteQuery()
executeQuery()
returns an error/status code that is currently ignored.
Capturing and validating it would surface query failures early and ease debugging.
136-138
: Avoid eager null-buffer allocation for lazy vectorsAllocating a full nulls bitmap only to have the
ClpVectorLoader
overwrite it wastes memory and cycles.
Passnullptr
and let the loader allocate nulls on demand.- vector->setNulls(allocateNulls(size, pool_, bits::kNull)); + // Defer null allocation; ClpVectorLoader will allocate only if needed.
40-44
: 🧹 Nitpick (assertive)Follow project boolean-expression guideline
The code base prefers
false == expr
over unary!expr
.- if (auto query = clpTableHandle->kqlQuery(); query && !query->empty()) { + if (auto query = clpTableHandle->kqlQuery(); query && false == query->empty()) {velox/connectors/clp/search_lib/ClpCursor.h (1)
90-90
: Still exposes mutable state through aconst
interface
getProjectedColumns()
returnsconst std::vector<...>&
, but the underlying pointer elements remain mutable.
If mutation is not intended, return a vector ofconst BaseColumnReader*
or addconst
to the pointee type as well.velox/connectors/clp/search_lib/ClpCursor.cpp (4)
66-71
:open_packed_streams()
may throw – cursor left in inconsistent stateIf
archiveReader_->open_packed_streams()
throws,currentArchiveLoaded_
is already set totrue
, and subsequent calls will dereference a half-initialised cursor.
Wrap the call in a try/catch, reset flags on failure, and propagate a meaningful error code.
109-110
:static
non-const vector remains a data-race hazard
kEmpty
should be declaredconst
(or returned by value) to prevent unsynchronised writes from other threads that might accidentally mutate the vector.- static std::vector<clp_s::BaseColumnReader*> kEmpty; + static const std::vector<clp_s::BaseColumnReader*> kEmpty;
126-145
: Chained assignment insideif
statement hurts readabilityThe init-statement form is valid C++17, but it obscures the assignment and has already caused confusion in previous reviews.
Assign first, then test the result in a separateif
to improve clarity.
167-171
: Duplicate timestamp-range evaluation
EvaluateTimestampIndex
is executed twice with identical logic.
Remove one invocation to avoid wasted work and simplify the control flow.Also applies to: 239-243
velox/connectors/clp/tests/ClpConnectorTest.cpp (5)
108-109
: 🧹 Nitpick (assertive)Improve code formatting for better readability.
The calls to
getResults()
with themakeClpSplit()
function span multiple lines and could be formatted more consistently.- auto output = - getResults(plan, {makeClpSplit(getExampleFilePath("test_1.clps"))}); + auto output = getResults(plan, { + makeClpSplit(getExampleFilePath("test_1.clps")) + });
154-155
: 🧹 Nitpick (assertive)Improve code formatting for better readability.
Same formatting issue as in the previous test case.
- auto output = - getResults(plan, {makeClpSplit(getExampleFilePath("test_1.clps"))}); + auto output = getResults(plan, { + makeClpSplit(getExampleFilePath("test_1.clps")) + });
198-199
: 🧹 Nitpick (assertive)Improve code formatting for better readability.
Same formatting issue as in the previous test cases.
- auto output = - getResults(plan, {makeClpSplit(getExampleFilePath("test_2.clps"))}); + auto output = getResults(plan, { + makeClpSplit(getExampleFilePath("test_2.clps")) + });
246-247
: 🧹 Nitpick (assertive)Improve code formatting for better readability.
Same formatting issue as in the previous test cases.
- auto output = - getResults(plan, {makeClpSplit(getExampleFilePath("test_2.clps"))}); + auto output = getResults(plan, { + makeClpSplit(getExampleFilePath("test_2.clps")) + });
294-295
: 🧹 Nitpick (assertive)Improve code formatting for better readability.
Same formatting issue as in the previous test cases.
- auto output = - getResults(plan, {makeClpSplit(getExampleFilePath("test_2.clps"))}); + auto output = getResults(plan, { + makeClpSplit(getExampleFilePath("test_2.clps")) + });velox/connectors/clp/ClpConnector.h (3)
17-21
:⚠️ Potential issueAdd the required STL header for
std::unordered_map
.This file directly uses
std::unordered_map
(line 42) but does not include the<unordered_map>
header. Relying on an indirect include from another Velox header is fragile and violates the "include what you use" rule.#pragma once #include "velox/connectors/Connector.h" #include "velox/connectors/clp/ClpConfig.h" +#include <unordered_map>
50-55
:⚠️ Potential issueClarify or remove the unimplemented
createDataSink
.The
createDataSink()
method is declared but not implemented according to the PR description, which only mentions the connector's ability to read CLP archives, not write to them. Leaving the declaration without implementation will cause link-time errors.Either:
- Remove the override if writing is not supported in this version, or
- Add an implementation that throws a clear exception, such as:
std::unique_ptr<DataSink> createDataSink( RowTypePtr inputType, std::shared_ptr<ConnectorInsertTableHandle> connectorInsertTableHandle, ConnectorQueryCtx* connectorQueryCtx, CommitStrategy commitStrategy) override { VELOX_NYI("CLP data sink not implemented"); }
60-65
: 🛠️ Refactor suggestionInitialize TimestampPattern in the default constructor.
The TimestampPattern initialization is missing from the default constructor, which could lead to issues if the default constructor is used.
class ClpConnectorFactory : public ConnectorFactory { public: static constexpr const char* kClpConnectorName = "clp"; - ClpConnectorFactory(); + ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) { + // Initialize timestamp pattern for CLP + clp_s::TimestampPattern::init(); + } explicit ClpConnectorFactory(const char* connectorName);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (12)
CMake/ExternalAntlr4Cpp.cmake
(1 hunks)CMake/FindANTLR.cmake
(1 hunks)velox/connectors/clp/ClpConfig.h
(1 hunks)velox/connectors/clp/ClpConnector.h
(1 hunks)velox/connectors/clp/ClpConnectorSplit.h
(1 hunks)velox/connectors/clp/ClpDataSource.cpp
(1 hunks)velox/connectors/clp/ClpDataSource.h
(1 hunks)velox/connectors/clp/ClpTableHandle.h
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.h
(1 hunks)velox/connectors/clp/search_lib/ClpQueryRunner.cpp
(1 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
velox/connectors/clp/search_lib/ClpQueryRunner.cpp
velox/connectors/clp/search_lib/ClpCursor.cpp
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/tests/ClpConnectorTest.cpp
🧬 Code Graph Analysis (7)
velox/connectors/clp/ClpConnectorSplit.h (3)
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
splitPath
(63-66)splitPath
(63-63)velox/connectors/clp/ClpConfig.h (1)
nodiscard
(35-38)velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)
velox/connectors/clp/ClpConfig.h (1)
velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)
velox/connectors/clp/ClpDataSource.h (1)
velox/connectors/clp/ClpConnector.h (1)
outputType
(38-44)
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (3)
velox/connectors/clp/search_lib/ClpQueryRunner.h (3)
schemaReader
(53-56)projectedColumns_
(72-74)numRows
(64-66)velox/connectors/clp/search_lib/ClpCursor.cpp (2)
fetchNext
(53-102)fetchNext
(53-55)velox/connectors/clp/search_lib/ClpCursor.h (1)
numRows
(81-83)
velox/connectors/clp/search_lib/ClpCursor.h (2)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
ClpCursor
(33-37)ClpCursor
(39-43)velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
numRows
(64-66)
velox/connectors/clp/ClpTableHandle.h (4)
velox/connectors/clp/ClpConfig.h (1)
nodiscard
(35-38)velox/connectors/clp/ClpConnectorSplit.h (1)
nodiscard
(28-30)velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)velox/connectors/hive/TableHandle.h (1)
tableName_
(146-148)
velox/connectors/clp/ClpConnector.h (3)
velox/connectors/clp/ClpConfig.h (1)
nodiscard
(35-38)velox/connectors/clp/ClpConnectorSplit.h (1)
nodiscard
(28-30)velox/connectors/clp/ClpTableHandle.h (3)
nodiscard
(39-41)nodiscard
(43-45)nodiscard
(47-49)
🪛 Cppcheck (2.10-2)
velox/connectors/clp/tests/ClpConnectorTest.cpp
[error] 82-82: syntax error
(syntaxError)
🔇 Additional comments (10)
velox/connectors/clp/ClpConnectorSplit.h (1)
28-30
: Split path representation looks good.The
toString()
method implementation usesfmt::format
correctly and provides a descriptive prefix for debugging. Nice improvement over the previous version.velox/connectors/clp/ClpConfig.h (2)
35-38
: Good use of [[nodiscard]]Proper use of the [[nodiscard]] attribute for a method that returns a reference that shouldn't be ignored.
30-33
: 🧹 Nitpick (assertive)Consider taking ownership via pass-by-value
The constructor takes a shared pointer by const reference and then moves it, which is inefficient. Since you're taking ownership of the pointer, consider taking it by value to leverage move semantics automatically:
- explicit ClpConfig(std::shared_ptr<const config::ConfigBase> config) { + explicit ClpConfig(std::shared_ptr<const config::ConfigBase> config) { VELOX_CHECK_NOT_NULL(config, "Config is null for CLP initialization"); - config_ = std::move(config); + config_ = std::move(config); }This change doesn't actually modify code but improves readability by signaling the intended ownership transfer.
Likely an incorrect or invalid review comment.
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (2)
41-50
: The column lookup logic is well-implementedThe code correctly searches for a matching column reader from the provided column map based on node IDs. This approach supports flexible mapping between projected fields and available columns.
62-80
: 🧹 Nitpick (assertive)Ensure consistent types in the fetchNext method
The method signature uses
uint64_t numRows
but the local variable is declared assize_t rowsfiltered
. While they're likely the same size on your platform, it's best practice to use consistent types.uint64_t ClpQueryRunner::fetchNext( size_t numRows, const std::shared_ptr<std::vector<uint64_t>>& filteredRowIndices) { - size_t rowsfiltered = 0; - size_t rowsScanned = 0; + uint64_t rowsfiltered = 0; + uint64_t rowsScanned = 0; while (curMessage_ < numMessages_) { if (filter(curMessage_)) { filteredRowIndices->emplace_back(curMessage_); rowsfiltered += 1; } curMessage_ += 1; rowsScanned += 1; if (rowsfiltered >= numRows) { break; } } return rowsScanned; }Likely an incorrect or invalid review comment.
CMake/FindANTLR.cmake (1)
33-42
: Good file download implementation with SHA256 verificationThe implementation correctly checks if the JAR exists before downloading and uses proper SHA256 verification, making it secure and efficient.
velox/connectors/clp/ClpTableHandle.h (1)
29-37
: 🧹 Nitpick (assertive)
std::move
on an l-value is ineffective hereThe constructor parameter
kqlQuery
is an l-value, sostd::move(kqlQuery)
just performs a copy of the smart pointer.
Either accept the parameter by value andstd::move
it, or drop the move call altogether:- std::shared_ptr<std::string> kqlQuery) + std::shared_ptr<std::string> kqlQuery) : ConnectorTableHandle(connectorId), tableName_(tableName), - storageType_(storageType), - kqlQuery_(std::move(kqlQuery)) {} + storageType_(storageType), + kqlQuery_(std::move(kqlQuery)) {} // if param passed by valueLikely an incorrect or invalid review comment.
velox/connectors/clp/tests/ClpConnectorTest.cpp (2)
1-16
: LGTM! License header is properly included.The Apache 2.0 license header is correctly included at the beginning of the file.
313-317
: LGTM! Main function properly sets up the test environment.The main function correctly initializes Google Test and Folly before running the tests.
velox/connectors/clp/ClpConnector.h (1)
67-73
: LGTM! Properly handles unused parameters.Unused parameters are correctly commented out to avoid compiler warnings, which is a good practice.
if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*") | ||
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(Configuration)) | ||
elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*") | ||
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(CONFIGURATION)) | ||
else() | ||
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist) | ||
endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Quote generator variable in if()
to avoid empty-string pitfalls
When CMAKE_GENERATOR
is empty, if(${CMAKE_GENERATOR} MATCHES …)
raises a CMake error.
Quote the variable to make the comparison safe:
-if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*")
+if("${CMAKE_GENERATOR}" MATCHES "Visual Studio.*")
Apply the same pattern to the other generator checks below (lines 40-46, 42-44, 68-80).
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*") | |
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(Configuration)) | |
elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*") | |
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(CONFIGURATION)) | |
else() | |
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist) | |
endif() | |
if("${CMAKE_GENERATOR}" MATCHES "Visual Studio.*") | |
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(Configuration)) | |
elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*") | |
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist/$(CONFIGURATION)) | |
else() | |
set(ANTLR4_OUTPUT_DIR ${ANTLR4_ROOT}/runtime/Cpp/dist) | |
endif() |
2. fix a bug in FindANTLR.cmake 3. download antlr jar to build directory instead of the source directory 4. add StorageType to ClpTableHandle 5. use consistent types for filteredRows across different methods 6. remove archiveSource_ from ClpConfig and config docs, use storageType_ in ClpDataSource 7. minor improvements in logs and method signatures 8. rename splitPath to path in ClpConnectorSplit
0053396
to
f26ba13
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
♻️ Duplicate comments (23)
velox/connectors/clp/ClpConnectorSplit.h (2)
30-31
: Keep data members private to reduce coupling.
path_
is exposed publicly. Exposing internal state invites unnecessary dependencies and prevents later refactors. The accessor in the base class already exposes the split id; if callers still need the path, provide aconst std::string& path() const
method and make the memberprivate
.
19-21
:⚠️ Potential issueAdd missing
fmt
include to keep the header self-contained.
toString()
relies onfmt::format
, but<fmt/format.h>
is not included. Depending on transitive includes breaks ODR hygiene and occasionally fails on different compilers.#include "velox/connectors/Connector.h" +#include <fmt/format.h>
velox/connectors/clp/ClpConfig.h (1)
19-20
: Remove unused Boost include.
<boost/algorithm/string.hpp>
is not referenced in this header and adds an expensive dependency to every translation unit that includesClpConfig.h
.CMake/FindANTLR.cmake (2)
20-21
: Avoid propagating global compile definitions.
add_definitions(-DANTLR4CPP_STATIC)
leaks into every target. Limit the definition to the imported ANTLR targets withtarget_compile_definitions()
to prevent ODR issues.
157-162
:-no-listener
still conflicts with optional listener generation.Passing
-no-listener
unconditionally overrides a caller’s-listener
flag, silently dropping listener code when requested.velox/connectors/clp/ClpTableHandle.h (2)
43-45
: Remove redundantconst
qualifier on enum return value
The accessor returns the enum by value, therefore the leadingconst
is meaningless and slightly misleading.- [[nodiscard]] const StorageType storageType() const { + [[nodiscard]] StorageType storageType() const {
47-49
: Make the returned query string read-only
Returning ashared_ptr<std::string>
gives callers write-access to the KQL text although the handle is conceptually immutable. Adopt ashared_ptr<const std::string>
to enforce read-only semantics:- [[nodiscard]] const std::shared_ptr<std::string>& kqlQuery() const { + [[nodiscard]] const std::shared_ptr<const std::string>& kqlQuery() const {This also requires adjusting the data-member and constructor parameter types.
CMake/ExternalAntlr4Cpp.cmake (1)
40-46
: QuoteCMAKE_GENERATOR
in comparisons to avoid CMake errors
IfCMAKE_GENERATOR
is an empty string the un-quoted variable expands to nothing andif()
fails to parse. The original recommendation to quote the variable is still unaddressed:-if(${CMAKE_GENERATOR} MATCHES "Visual Studio.*") +if("${CMAKE_GENERATOR}" MATCHES "Visual Studio.*") ... -elseif(${CMAKE_GENERATOR} MATCHES "Xcode.*") +elseif("${CMAKE_GENERATOR}" MATCHES "Xcode.*") ... -if(${CMAKE_GENERATOR} MATCHES ".* Makefiles") +if("${CMAKE_GENERATOR}" MATCHES ".* Makefiles")Apply the same quoting to every generator check in this file.
Also applies to: 68-80
velox/connectors/clp/ClpDataSource.h (1)
99-101
: Progress counters need atomicity for thread-safety
completedRows_
andcompletedBytes_
may be read concurrently by Velox. Wrap them instd::atomic<uint64_t>
(or Velox’s atomic wrapper) to prevent data races.velox/connectors/clp/ClpDataSource.cpp (6)
68-73
: Usesize_t
for loop counter to matchrowType.size()
rowType.size()
returnssize_t
; usinguint32_t
risks truncation and compiler warnings on very wide rows.- for (uint32_t i = 0; i < rowType.size(); ++i) { + for (size_t i = 0; i < rowType.size(); ++i) {
113-114
: Handle or propagate the result ofexecuteQuery()
The return status is ignored; failures will go unnoticed and make debugging hard. Capture the returned error code and react appropriately (log, propagate, or throw).
136-138
: Avoid eager “all-null” allocation for LazyVectors
Allocating and setting every bit to null wastes memory and work because the loader will overwrite them. Passnullptr
for the nulls buffer and let the loader allocate when needed.- vector->setNulls(allocateNulls(size, pool_, bits::kNull)); + // Defer null-buffer allocation; ClpVectorLoader will allocate if required.
125-132
: Loop counter type mismatch insidecreateVector()
Same issue as above: prefer
size_t
to align withrowType.size()
.- for (uint32_t i = 0; i < rowType.size(); ++i) { + for (size_t i = 0; i < rowType.size(); ++i) {
162-165
:⚠️ Potential issueCompilation failure:
readerIndex
type mismatch
readerIndex
is declared asuint64_t
butcreateVector()
requiressize_t&
.
Change the variable tosize_t
(or adjust the function signature).- uint64_t readerIndex = 0; + size_t readerIndex = 0;
38-44
: 🧹 Nitpick (assertive)Follow project boolean-expression style
Coding guidelines preferfalse == expr
over!expr
.- if (auto query = clpTableHandle->kqlQuery(); query && !query->empty()) { + if (auto query = clpTableHandle->kqlQuery(); query && false == query->empty()) {velox/connectors/clp/search_lib/ClpCursor.h (1)
93-98
: 🧹 Nitpick (assertive)Add concrete documentation of possible
ErrorCode
returns
preprocessQuery()
advertises that it returns anErrorCode
, but callers (and future maintainers) have no idea which codes to expect under which circumstances. A concise list of the concrete values that may be returned and their meaning would eliminate guess-work and prevent misuse.- * @return The error code. + * @return ErrorCode::Success on success, or one of: + * - ErrorCode::InvalidQuerySyntax – the KQL could not be parsed + * - ErrorCode::LogicalError – the query simplifies to FALSE + * - ErrorCode::SchemaNotFound – no schema matches the query + * - ErrorCode::InvalidTimestampRange– query’s time range has no hits + * - ErrorCode::InternalError – unexpected failurevelox/connectors/clp/search_lib/ClpCursor.cpp (5)
17-18
:<sstream>
include still missing – compilation is not guaranteed
std::istringstream
is used inpreprocessQuery()
(line 114) but<sstream>
is not included. Some tool-chains may compile thanks to transitive includes, others will fail.#include <glog/logging.h> +#include <sstream> // for std::istringstream
104-111
: Thread-safety issue: return reference to mutable static vector
kEmpty
is mutable and non-const, therefore shared mutation across threads is a data-race.- static std::vector<clp_s::BaseColumnReader*> kEmpty; + static const std::vector<clp_s::BaseColumnReader*> kEmpty;
167-171
: Duplicate timestamp-range evaluation – remove one pass
EvaluateTimestampIndex
is executed twice with identical logic, wasting CPU and I/O. Deleting the second block (lines 239-243) keeps behaviour intact.- EvaluateTimestampIndex timestamp_index(timestampDict); - if (EvaluatedValue::False == timestamp_index.run(expr_)) { - VLOG(2) << "No matching timestamp ranges for query '" << query_ << "'"; - return ErrorCode::InvalidTimestampRange; - }Also applies to: 239-243
151-161
: Convert genericInternalError
into actionable codesAll failures from
archiveReader_->open()
collapse intoInternalError
, losing the root cause (e.g., file-not-found vs. permission denied). Distinct codes—or at least a descriptive message—would hugely simplify debugging.Consider mapping common exceptions to specific
ErrorCode
s (FileNotFound, PermissionDenied, …).
60-71
:⚠️ Potential issueUncaught exception from
open_packed_streams()
can leave cursor in invalid state
archiveReader_->open_packed_streams()
may throw (e.g., corrupt archive).
When that happens:
currentArchiveLoaded_
is alreadytrue
,queryRunner_
has been allocated but not fully initialised.Subsequent calls to
fetchNext()
will dereference partially initialised members, risking SIGSEGV.Wrap the call in a
try / catch
, reset state on failure, and propagate anErrorCode
.- archiveReader_->open_packed_streams(); - currentArchiveLoaded_ = true; + try { + archiveReader_->open_packed_streams(); + currentArchiveLoaded_ = true; + } catch (const std::exception& ex) { + VLOG(2) << "Failed to open packed streams: " << ex.what(); + currentArchiveLoaded_ = false; + errorCode_ = ErrorCode::InternalError; + return 0; + }velox/connectors/clp/tests/ClpConnectorTest.cpp (1)
76-79
: Hard-coded CWD path breaks when tests run from build directoryDeriving the examples path from
fs::current_path()
assumes the test executable is launched from the project root. CTest/Ninja usually run tests from the build tree, causing a “file not found”.Prefer computing the path relative to this source file:
- std::string current_path = fs::current_path().string(); - return current_path + "/examples/" + filePath; + std::filesystem::path here = __FILE__; + return (here.parent_path() / "examples" / filePath).string();velox/connectors/clp/ClpConnector.h (1)
17-21
:⚠️ Potential issueMissing include for
std::unordered_map
.The
createDataSource
method usesstd::unordered_map
but this header is not included, leading to compilation failures if no transitive include is present. Please add:#include "velox/connectors/clp/ClpConfig.h" +#include <unordered_map>
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (12)
CMake/ExternalAntlr4Cpp.cmake
(1 hunks)CMake/FindANTLR.cmake
(1 hunks)velox/connectors/clp/ClpConfig.h
(1 hunks)velox/connectors/clp/ClpConnector.h
(1 hunks)velox/connectors/clp/ClpConnectorSplit.h
(1 hunks)velox/connectors/clp/ClpDataSource.cpp
(1 hunks)velox/connectors/clp/ClpDataSource.h
(1 hunks)velox/connectors/clp/ClpTableHandle.h
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpCursor.h
(1 hunks)velox/connectors/clp/search_lib/ClpQueryRunner.cpp
(1 hunks)velox/connectors/clp/tests/ClpConnectorTest.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
velox/connectors/clp/search_lib/ClpQueryRunner.cpp
velox/connectors/clp/search_lib/ClpCursor.cpp
velox/connectors/clp/ClpDataSource.cpp
velox/connectors/clp/tests/ClpConnectorTest.cpp
🧬 Code Graph Analysis (7)
velox/connectors/clp/ClpConnectorSplit.h (3)
velox/connectors/clp/ClpConfig.h (1)
nodiscard
(35-38)velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)velox/connectors/clp/ClpTableHandle.h (3)
nodiscard
(39-41)nodiscard
(43-45)nodiscard
(47-49)
velox/connectors/clp/ClpConfig.h (1)
velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (3)
velox/connectors/clp/search_lib/ClpQueryRunner.h (3)
schemaReader
(53-56)projectedColumns_
(72-74)numRows
(64-66)velox/connectors/clp/search_lib/ClpCursor.cpp (2)
fetchNext
(53-102)fetchNext
(53-55)velox/connectors/clp/search_lib/ClpCursor.h (1)
numRows
(81-83)
velox/connectors/clp/ClpTableHandle.h (4)
velox/connectors/clp/ClpConnectorSplit.h (1)
nodiscard
(26-28)velox/connectors/clp/ClpConfig.h (1)
nodiscard
(35-38)velox/connectors/clp/ClpConnector.h (2)
nodiscard
(29-32)nodiscard
(34-36)velox/connectors/hive/TableHandle.h (1)
tableName_
(146-148)
velox/connectors/clp/ClpDataSource.h (1)
velox/connectors/clp/ClpConnector.h (1)
outputType
(38-44)
velox/connectors/clp/search_lib/ClpCursor.h (2)
velox/connectors/clp/search_lib/ClpCursor.cpp (2)
ClpCursor
(33-37)ClpCursor
(39-43)velox/connectors/clp/search_lib/ClpQueryRunner.h (1)
numRows
(64-66)
velox/connectors/clp/ClpConnector.h (3)
velox/connectors/clp/ClpConnectorSplit.h (1)
nodiscard
(26-28)velox/connectors/clp/ClpConfig.h (1)
nodiscard
(35-38)velox/connectors/clp/ClpTableHandle.h (3)
nodiscard
(39-41)nodiscard
(43-45)nodiscard
(47-49)
🪛 Cppcheck (2.10-2)
velox/connectors/clp/tests/ClpConnectorTest.cpp
[error] 82-82: syntax error
(syntaxError)
🔇 Additional comments (7)
velox/connectors/clp/search_lib/ClpCursor.cpp (1)
92-97
: Potential stale data infilteredRowIndices
across schema iterations
filteredRowIndices
is reused between schema batches but never cleared.
If a previous schema produced matches and a later one does not, callers may mis-interpret residual indices.Please either:
filteredRowIndices->clear()
at the top offetchNext()
, or- Document that the caller must pass in a fresh (empty) vector each time.
velox/connectors/clp/ClpConnector.h (6)
1-15
: License header present and correct.The Apache 2.0 license header is properly included at the top of the file.
34-37
: Dynamic filters are correctly disabled.Overriding
canAddDynamicFilter()
to returnfalse
aligns with the CLP connector’s current capabilities.
38-44
: Data source creation signature is sound.The
createDataSource
override correctly declares all required parameters. Ensure the implementation rigorously validates and maps the Presto column handles to CLP projections.
46-48
: Split preload support is correctly disabled.Returning
false
fromsupportsSplitPreload()
is appropriate since CLP archives do not support split preloading.
60-66
: EnsureTimestampPattern
is initialized in all constructors.The default constructor of
ClpConnectorFactory
needs to callclp_s::TimestampPattern::init();
to mirror the explicit constructor and avoid uninitialized state:-ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) {} +ClpConnectorFactory() : ConnectorFactory(kClpConnectorName) { + clp_s::TimestampPattern::init(); +}[ suggest_essential_refactor]
67-74
: Connector factory instantiation is correct.The
newConnector
method cleanly constructs aClpConnector
and properly silences unused executor parameters.
2. use consistent type for numRows in ClpQueryRunner 3. remove spdlog version in CMakeLists.txt and change find_package to resolve_dependency for curl 4. fix readerIndex type in ClpDataSource 5. add SMALLINT in ClpDataSource
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (10)
velox/connectors/clp/CMakeLists.txt (2)
29-31
: Modern CMake: avoid dereferencing variables inif()
conditions.
if(${VELOX_BUILD_TESTING})
first expands the variable and then evaluates the
result as a string, which breaks if the value is unset or empty.
Prefer:-if(${VELOX_BUILD_TESTING}) +if (VELOX_BUILD_TESTING)This pattern is already recommended in earlier reviews.
17-23
: 🛠️ Refactor suggestionAdd build-option guard to honour
VELOX_ENABLE_CLP_CONNECTOR
.This file is always processed, even when the top-level option explicitly disables the connector (see lines 166-182 of the root
CMakeLists.txt
).
Wrap the entire library definition with the same guard used elsewhere to avoid configuring unused targets and their heavy third-party dependencies.+if (VELOX_ENABLE_CLP_CONNECTOR) velox_add_library( velox_clp_connector OBJECT ClpConnector.cpp ClpDataSource.cpp ClpTableHandle.cpp) +endif()CMakeLists.txt (1)
492-508
: Duplicate dependency resolution forabsl
,curl
, etc.
absl
(and other libs) are resolved here under the CLP connector guard and again in the unit-testing block at lines 537-543.
When bothVELOX_ENABLE_CLP_CONNECTOR
andVELOX_BUILD_TESTING
are ON (the defaults), CMake re-invokesvelox_set_source()
/velox_resolve_dependency()
on the same package, which can emit warnings or produce inconsistent settings.Consolidate into a single place or add a
if(NOT TARGET absl::absl)
style guard in the second occurrence.velox/connectors/clp/search_lib/ClpQueryRunner.cpp (1)
52-54
: Follow boolean-expression guideline.Per the project coding rules, favour an explicit comparison:
false == foundReader
instead of!foundReader
.- if (!foundReader) { + if (false == foundReader) {velox/connectors/clp/ClpDataSource.cpp (6)
40-44
: Use explicit boolean comparison as per project style.- if (auto query = clpTableHandle->kqlQuery(); query && !query->empty()) { + if (auto query = clpTableHandle->kqlQuery(); query && false == query->empty()) {
68-74
: Loop counter should matchrowType.size()
return type.
rowType.size()
returnssize_t
; usinguint32_t
risks truncation for wide
rows and triggers signed/unsigned warnings.- for (uint32_t i = 0; i < rowType.size(); ++i) { + for (size_t i = 0; i < rowType.size(); ++i) {
114-115
: Check return value ofexecuteQuery
for early failures.
cursor_->executeQuery()
returns anErrorCode
, but the code discards it.
Capturing and validating the result will surface issues sooner and avoids
silently proceeding with an invalid cursor state.- cursor_->executeQuery(kqlQuery_, fields_); + auto err = cursor_->executeQuery(kqlQuery_, fields_); + VELOX_CHECK(err == search_lib::ErrorCode::Success, + "executeQuery failed with error code {}", static_cast<int>(err));
138-139
: Avoid eagerly zero-initialising null buffers for lazy vectors.
LazyVector
will allocate and fill the nulls buffer on demand. Passing a
pre-initialised all-null buffer wastes memory and CPU every batch.- vector->setNulls(allocateNulls(size, pool_, bits::kNull)); + // Defer null-buffer allocation; ClpVectorLoader will allocate if needed.
120-123
:⚠️ Potential issuePotential type mismatch between
uint64_t
andsize_t
.Several APIs in the search library use
size_t
for row indices. Mixing
std::vector<uint64_t>
here can force needless casts and may not compile if
the loader expectssize_t
.Confirm both sides use the same type; otherwise switch to
size_t
:- const std::shared_ptr<std::vector<uint64_t>>& filteredRows, + const std::shared_ptr<std::vector<size_t>>& filteredRows,Similarly update the allocation sites and
ClpVectorLoader
.
106-113
:⚠️ Potential issueGuard against unsupported storage types.
If future code adds a new
StorageType
,cursor_
would staynullptr
,
leading to undefined behaviour when dereferenced. Add an explicitelse
branch that throws a user-level error.} else if (storageType_ == ClpTableHandle::StorageType::kS3) { cursor_ = std::make_unique<search_lib::ClpCursor>( clp_s::InputSource::Network, clpSplit->path_); + } else { + VELOX_USER_FAIL( + "Unsupported storage type '{}'", + static_cast<int>(storageType_)); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (4)
CMakeLists.txt
(4 hunks)velox/connectors/clp/CMakeLists.txt
(1 hunks)velox/connectors/clp/ClpDataSource.cpp
(1 hunks)velox/connectors/clp/search_lib/ClpQueryRunner.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}
: - Preferfalse == <expression>
rather than!<expression>
.
velox/connectors/clp/search_lib/ClpQueryRunner.cpp
velox/connectors/clp/ClpDataSource.cpp
🧬 Code Graph Analysis (1)
velox/connectors/clp/search_lib/ClpQueryRunner.cpp (3)
velox/connectors/clp/search_lib/ClpQueryRunner.h (3)
schemaReader
(53-56)projectedColumns_
(72-74)numRows
(64-66)velox/connectors/clp/search_lib/ClpCursor.cpp (2)
fetchNext
(53-102)fetchNext
(53-55)velox/connectors/clp/search_lib/ClpCursor.h (1)
numRows
(81-83)
size_t rowsfiltered = 0; | ||
size_t rowsScanned = 0; | ||
while (curMessage_ < numMessages_) { | ||
if (filter(curMessage_)) { | ||
filteredRowIndices->emplace_back(curMessage_); | ||
rowsfiltered += 1; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧹 Nitpick (assertive)
Minor: consistent naming and type for counters.
rowsfiltered
mixes snake-case with camel-case and differs from the later
rowsScanned
. Moreover, both counters track row counts and can exceed size_t
on some platforms; prefer uint64_t
for symmetry with numRows
.
- size_t rowsfiltered = 0;
+ uint64_t rowsFiltered = 0;
...
- rowsfiltered += 1;
+ rowsFiltered += 1;
...
- if (rowsfiltered >= numRows) {
+ if (rowsFiltered >= numRows) {
Overview
The current Presto–CLP connector PR introduces the coordinator-side implementation, along with a placeholder (dummy) worker implementation. Detailed information about the overall design is available in the corresponding RFC. This Velox PR focuses on the worker-side logic.
The Velox CLP connector enables query execution on CLP archives. The Velox worker receives split information and the associated KQL query from the Presto coordinator. For each split, it executes the KQL query against the relevant CLP archive to find matching messages and stores their indices.
To support lazy evaluation, the implementation creates lazy vectors that wrap a CLP column reader and the list of matching indices. When accessed during query execution, these vectors load and decode only the necessary data on demand.
Core Classes
ClpDataSource
This class extends
DataSource
and implements theaddSplit
andnext
methods. During initialization, it records the KQL query and archive source (S3 or local), then traverses the output type to map Presto fields to CLP projection fields. OnlyARRAY(VARCHAR)
and primitive leaf fields likeBIGINT
,DOUBLE
,BOOLEAN
andVARCHAR
are projected.When a split is added, a
ClpCursor
is created with the archive path and input source. The query is parsed and simplified into an AST. Onnext
, the cursor finds matching row indices and, if any exist, returns a row vector composed of lazy vectors, which load data as needed during execution.ClpCursor
This class manages the execution of a query over a CLP-S archive. It handles parsing and validation, loading schemas and archives, setting up projection fields, and filtering results. In CLP-S, records are partitioned by schemas.
ClpCursor
usesClpQueryRunner
to initialize the execution context for each schema and evaluate the filters. It will skip archives where dictionary lookups for string filters return no matches and only scan the relevant schemas of a specific archive. For example, consider a log dataset with the following records.The three log messages have varying schemas. If we run a KQL query
a: World AND b: 0
, it will skip loading the third message because it's schema does not match the query (there's nob
field). And if the query isa: random AND b: 0
, it will even skip scanning the first two records, becauserandom
cannot be found in the dictionary.ClpQueryRunner
This class extends the generic CLP
QueryRunner
to support ordered projection and row filtering. It initializes projected column readers and returns filtered row indices for each batch.ClpVectorLoader
In CLP, values are decoded and read from a
BaseColumnReader
. TheClpVectorLoader
is custom VeloxVectorLoader
that loads vectors from CLP column readers. It supports integers, floats, booleans, strings, and arrays of strings. It's used by lazy vectors to load data on demand using the previously stored row indices.Summary by CodeRabbit