-
Notifications
You must be signed in to change notification settings - Fork 21
[MOD-11650] Fix Out-of-Bounds Write in Vector Preprocessing #784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
in VecSimIndexAbstract move data members to private when possible DONT FIX LEAK YET Factories: move NewAbstractInitParams to a general location (new file factory_utils) replace it in all factories
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #784 +/- ##
==========================================
+ Coverage 96.64% 96.66% +0.01%
==========================================
Files 125 126 +1
Lines 7724 7707 -17
==========================================
- Hits 7465 7450 -15
+ Misses 259 257 -2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
use sanitizer in pull request
This caused a buffer overflow because setup<TieredIndexParams> didn't set the data type to INT8, creating a float32 index instead. DataBlock::addElement() tries to copy dim * sizeof(float) bytes, but the allocated buffer is only dim * sizeof(int8) bytes, causing a read overflow.
fix leaks that will be moved to a separate PR it was failing only with codecov becuase only there we use FP64_TESTS=1 Prevent template deduction errors in GenerateAndAddVector by making data_t parameter non-deducible Used std::type_identity<data_t>::type for the value parameter to force explicit template specification (e.g., GenerateAndAddVector<double>()) instead of allowing compiler to incorrectly deduce int from literal values, which caused buffer overflows when index expected different data types.
This reverts commit af844ec.
af844ec will be addressed in a separate PR to allow backporting to all version brnaches failing logs |
Sanitizer will also be added in a separate PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok, please make sure the todo in brute_force_factory.cpp
is complete or remove the TODO comment and we're good to go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a critical buffer read overflow vulnerability in vector preprocessing by properly distinguishing between input blob sizes and stored data sizes throughout the vector similarity pipeline. The core issue was that preprocessors were receiving incorrect size parameters, causing them to read beyond available input data boundaries.
Key changes include:
- Added
inputBlobSize
member to track original input vector size separately from processed storage size - Updated all preprocessor API calls to use correct input blob sizes instead of stored data sizes
- Centralized factory logic to ensure consistent parameter initialization across all index types
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
src/VecSim/vec_sim_index.h | Added inputBlobSize member and updated preprocessor calls to use correct sizes |
src/VecSim/utils/vec_utils.h/.cpp | Renamed VecSimParams_GetDataSize to VecSimParams_GetStoredDataSize for clarity |
src/VecSim/index_factories/factory_utils.h | New centralized factory utility template for consistent parameter initialization |
src/VecSim/index_factories/*.cpp | Updated factories to use centralized parameter creation and correct size calculations |
tests/unit/test_*.cpp | Fixed buffer overflows in tests and added comprehensive Cosine metric test coverage |
tests/unit/unit_test_utils.cpp | Updated to use renamed getStoredDataSize() method |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
fix getQueryBlob in tiered add getHNSWIterator to tiered batch itertor if its BUILD_TESTS
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin 8.2
git worktree add -d .worktree/backport-784-to-8.2 origin/8.2
cd .worktree/backport-784-to-8.2
git switch --create backport-784-to-8.2
git cherry-pick -x 9fb223ae78c5effccfcef44953a22e9d89df40f9 |
This PR addresses a potential buffer read overflow issue resulting from incorrect blob size handling during vector preprocessing.
The core issue was confusion between
storedDataSize
(processed blob size) and the actual input blob size, passingdataSize
(stored data size) to the preprocessors instead of the actual input blob size.This mismatch caused preprocessors to copy more data than available in the input blob, leading to out-of-bounds reads.
Correct Behavior
The system should properly distinguish between two different size concepts:
inputBlobSize
: Size of the original input vector blob (dim * sizeof(type)
)storedDataSize
: Size of vectors after preprocessing (may include extra bytes for normalization)Data Size Relationships:
Non-Tiered Indexes:
storedDataSize = inputBlobSize = dim * sizeof(type)
storedDataSize = inputBlobSize + 4
(norm stored at end)Tiered Indexes:
storedDataSize = inputBlobSize
What Was Fixed
The fix ensures memory safety by properly distinguishing between input blob sizes and stored data sizes throughout the vector similarity pipeline.
The fix was implemented by:
inputBlobSize
as a new member to theVecSimIndexAbstract
class andAbstractIndexInitParams
struct to explicitly track the original input vector sizeinputBlobSize
instead ofstoredDataSize
when processing input vectors, ensuring preprocessors only access the actual available input datainputBlobSize
represents the original input size (dim * sizeof(type)
) whilestoredDataSize
represents the final processed size (which may include extra bytes for normalizationAdditional Changes
Centralized Factory Logic:
src/VecSim/index_factories/factory_utils.h
(new)NewAbstractInitParams()
template function to standardize parameter initialization across all factories, eliminating code duplication and ensuring consistentinputBlobSize
calculation logic throughout the codebaseRenamed
dataSize
→StoredDataSize
for ClarityTest Fixes - Buffer Overflow in Element Size Estimation:
test_int8.cpp
,test_uint8.cpp
TieredIndexParams
was created fromHNSWParams
that didn't explicitly set thetype
field.The buffer overflow occurred because when the test called
addVector
,DataBlock::addElement()
tried to copydim * sizeof(float)
bytes (16 bytes for dim=4), but the allocated buffer was onlydim * sizeof(int8_t)
bytes (4 bytes for dim=4), leading to out-of-bounds memory access during vector operations.Test Fixes - Compiler Warnings:
test_svs.cpp
,test_svs_multi.cpp
,test_svs_tiered.cpp
quantBits
constexpr. The compiler couldn't prove at compile-time thatquantBits
was never zero, even with runtime checks. Usingconstexpr
moves the evaluation to compile-time, allowingif constexpr
to eliminate the division code whenquantBits == VecSimSvsQuant_NONE
, satisfying static analysis requirements.Fixed Element Size Estimation in Factory Functions:
src/VecSim/index_factories/brute_force_factory.cpp
,src/VecSim/index_factories/hnsw_factory.cpp
EstimateElementSize()
functions to useVecSimParams_GetStoredDataSize()
instead ofdim * sizeof(type)
for accurate memory estimation. This ensures size calculations account for additional storage requirements (such as normalization data for INT8/UINT8 with Cosine metric) and provides consistent estimation across all index types. Added comprehensive test coverage for Cosine metric element size estimation in INT8 and UINT8 test suites to validate the fix.Fixed Query Blob Access in Tiered Batch Iterator:
src/VecSim/batch_iterator.h
,src/VecSim/algorithms/hnsw/hnsw_tiered.h
getQueryBlob()
virtual in the baseVecSimBatchIterator
class and overrode it inTieredHNSW_BatchIterator
to properly delegate to the frontend iterator. This was necessary because the tiered batch iterator itself doesn't own any query vector - it relies on its internal frontend and backend iterators to manage their own query copies.Enhanced Batch Iterator Testing for INT8 Cosine Metric:
tests/unit/test_int8.cpp
,src/VecSim/algorithms/hnsw/hnsw_tiered.h
,src/VecSim/batch_iterator.h
CosineBlobCorrectness
test to explicitly validate query blob transfer from frontend to backend in tiered batch iterators:BruteForce
iterator to the backendHNSW
iterator.