[MOD-11650] Fix Out-of-Bounds Write in Vector Preprocessing #784

meiravgri · 2025-09-28T04:46:07Z

This PR addresses a potential buffer read overflow caused by incorrect blob size handling during vector preprocessing.
The core issue was confusion between storedDataSize (the processed blob size) and the actual input blob size: dataSize (the stored data size) was passed to the preprocessors instead of the actual input blob size.
This mismatch caused preprocessors to copy more data than was available in the input blob, leading to out-of-bounds reads.

Correct Behavior

The system should properly distinguish between two different size concepts:

inputBlobSize: Size of the original input vector blob (dim * sizeof(type))
storedDataSize: Size of vectors after preprocessing (may include extra bytes for normalization)

Data Size Relationships:

Non-Tiered Indexes:

No preprocessing required: storedDataSize = inputBlobSize = dim * sizeof(type)
INT8/UINT8 + Cosine: storedDataSize = inputBlobSize + 4 (norm stored at end)

Tiered Indexes:

Frontend index: Follows non-tiered rules above
Backend index: Receives preprocessed blobs, so storedDataSize = inputBlobSize
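
As an illustration only, the size relationships above can be sketched as two small functions. The enums and function names here are hypothetical stand-ins, not the library's actual API; the 4-byte norm is taken from the "+ 4" stated above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical enums for illustration; the library's real type and metric enums differ.
enum class ElemType { Float32, Float64, Int8, UInt8 };
enum class Metric { L2, IP, Cosine };

// Bytes appended for the norm in the INT8/UINT8 + Cosine case (the "+ 4" above).
constexpr std::size_t kNormBytes = 4;

constexpr std::size_t elemSize(ElemType t) {
    switch (t) {
    case ElemType::Float32: return sizeof(float);
    case ElemType::Float64: return sizeof(double);
    case ElemType::Int8:    return sizeof(std::int8_t);
    case ElemType::UInt8:   return sizeof(std::uint8_t);
    }
    return 0; // unreachable for valid enum values
}

// Size of the raw vector handed in by the caller: dim * sizeof(type).
constexpr std::size_t inputBlobSize(ElemType t, std::size_t dim) {
    return dim * elemSize(t);
}

// Size of the vector as stored after preprocessing (frontend / non-tiered rules).
constexpr std::size_t storedDataSize(ElemType t, Metric m, std::size_t dim) {
    const bool appendsNorm =
        m == Metric::Cosine && (t == ElemType::Int8 || t == ElemType::UInt8);
    return inputBlobSize(t, dim) + (appendsNorm ? kNormBytes : 0);
}
```

For dim = 4, an INT8 + Cosine vector has an input blob of 4 bytes and a stored size of 8 bytes. A tiered backend index would treat those 8 bytes as its own input, since it receives blobs that are already preprocessed.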

What Was Fixed

The fix ensures memory safety by properly distinguishing between input blob sizes and stored data sizes throughout the vector similarity pipeline.

The fix was implemented by:

Adding inputBlobSize as a new member to the VecSimIndexAbstract class and AbstractIndexInitParams struct to explicitly track the original input vector size
Modifying all preprocessor API calls to use inputBlobSize instead of storedDataSize when processing input vectors, ensuring preprocessors only access the actual available input data
Establishing clear size relationships where inputBlobSize represents the original input size (dim * sizeof(type)) while storedDataSize represents the final processed size (which may include extra bytes for normalization)
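
A minimal sketch of why the size passed to a preprocessor matters, assuming a simplified INT8 + Cosine preprocessor that appends a float norm. The function name and signature are hypothetical and are not the library's preprocessor API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical preprocessor for INT8 + Cosine: copies the input and appends its norm.
// `inputBlobSize` is the number of readable bytes in `input` (one byte per element).
std::vector<std::uint8_t> preprocessInt8Cosine(const std::int8_t *input,
                                               std::size_t inputBlobSize) {
    assert(input != nullptr);
    const std::size_t storedDataSize = inputBlobSize + sizeof(float);
    std::vector<std::uint8_t> stored(storedDataSize);

    // Read exactly inputBlobSize bytes. Copying storedDataSize bytes here would
    // read sizeof(float) bytes past the end of the caller's buffer.
    std::memcpy(stored.data(), input, inputBlobSize);

    float sumSquares = 0.0f;
    for (std::size_t i = 0; i < inputBlobSize; ++i) {
        sumSquares += static_cast<float>(input[i]) * static_cast<float>(input[i]);
    }
    const float norm = std::sqrt(sumSquares);

    // The norm occupies the extra bytes at the end of the stored blob.
    std::memcpy(stored.data() + inputBlobSize, &norm, sizeof(float));
    return stored;
}
```

The output buffer is sized by the stored size, while every read from the caller's buffer is bounded by the input size.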

Additional Changes

Centralized Factory Logic:
- File: src/VecSim/index_factories/factory_utils.h (new)
- Changes: Created NewAbstractInitParams() template function to standardize parameter initialization across all factories, eliminating code duplication and ensuring consistent inputBlobSize calculation logic throughout the codebase
Renamed dataSize → StoredDataSize for Clarity
Test Fixes - Buffer Overflow in Element Size Estimation:
- Files: test_int8.cpp, test_uint8.cpp
- Fix: Fixed buffer overflow in INT8/UINT8 element size estimation tests where the TieredIndexParams was created from HNSWParams that didn't explicitly set the type field.
  The buffer overflow occurred because when the test called addVector, DataBlock::addElement() tried to copy dim * sizeof(float) bytes (16 bytes for dim=4), but the allocated buffer was only dim * sizeof(int8_t) bytes (4 bytes for dim=4), leading to out-of-bounds memory access during vector operations.
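
The test failure mode can be sketched with a hypothetical parameter struct whose value-initialized type field falls back to float32. The names below are illustrative only and do not mirror HNSWParams or TieredIndexParams.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical element-type enum: the zero value is Float32, so a parameter
// struct that never sets `type` silently describes a float32 index.
enum class ElemType { Float32 = 0, Int8 };

struct IndexParams {
    ElemType type;
    std::size_t dim;
};

// Number of bytes the index would copy from the caller's buffer per vector.
constexpr std::size_t bytesCopiedPerVector(const IndexParams &p) {
    return p.dim * (p.type == ElemType::Int8 ? sizeof(std::int8_t) : sizeof(float));
}
```

With dim = 4 and the type left unset, the index copies 4 * sizeof(float) bytes while the test only allocated 4 bytes of int8 data; setting the type explicitly brings the two sizes back in line.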
Test Fixes - Compiler Warnings:
- Files: test_svs.cpp, test_svs_multi.cpp, test_svs_tiered.cpp
- Fix: Fixed compiler warnings about potential division by zero by making quantBits constexpr. The compiler couldn't prove at compile-time that quantBits was never zero, even with runtime checks. Using constexpr moves the evaluation to compile-time, allowing if constexpr to eliminate the division code when quantBits == VecSimSvsQuant_NONE, satisfying static analysis requirements.
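
A rough sketch of the constexpr approach, using a hypothetical quantization enum and a made-up size computation (the real tests compute something different). It assumes only that the "no quantization" value is zero, as the division-by-zero warning implies.

```cpp
#include <cstddef>

// Hypothetical stand-in for the quantization enum; NONE is assumed to be 0.
enum QuantBits { Quant_NONE = 0, Quant_4 = 4, Quant_8 = 8 };

// With the bit width known at compile time, `if constexpr` discards the
// division entirely when quantization is disabled, so no division by a
// possibly-zero value is ever instantiated.
template <QuantBits quantBits>
constexpr std::size_t packedBytes(std::size_t dim) {
    if constexpr (quantBits == Quant_NONE) {
        return dim * sizeof(float); // no quantization: full-precision elements
    } else {
        constexpr std::size_t elemsPerByte = 8 / quantBits; // never 0 in this branch
        return (dim + elemsPerByte - 1) / elemsPerByte;     // round up to whole bytes
    }
}
```

This needs C++17 for `if constexpr`. A plain runtime `if` would leave the division in the compiled code path, which is what the static analysis complained about.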
Fixed Element Size Estimation in Factory Functions:
- Files: src/VecSim/index_factories/brute_force_factory.cpp, src/VecSim/index_factories/hnsw_factory.cpp
- Changes: Updated EstimateElementSize() functions to use VecSimParams_GetStoredDataSize() instead of dim * sizeof(type) for accurate memory estimation. This ensures size calculations account for additional storage requirements (such as normalization data for INT8/UINT8 with Cosine metric) and provides consistent estimation across all index types. Added comprehensive test coverage for Cosine metric element size estimation in INT8 and UINT8 test suites to validate the fix.
Fixed Query Blob Access in Tiered Batch Iterator:
- Files: src/VecSim/batch_iterator.h, src/VecSim/algorithms/hnsw/hnsw_tiered.h
- Changes: Made getQueryBlob() virtual in the base VecSimBatchIterator class and overrode it in TieredHNSW_BatchIterator to properly delegate to the frontend iterator. This was necessary because the tiered batch iterator itself doesn't own any query vector - it relies on its internal frontend and backend iterators to manage their own query copies.
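
The delegation pattern described above can be sketched as follows. The classes are heavily simplified, hypothetical versions of the batch iterator hierarchy and only model ownership of the query blob.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified base iterator that owns a copy of the query blob.
class BatchIterator {
public:
    explicit BatchIterator(std::vector<char> queryBlob) : query_(std::move(queryBlob)) {}
    virtual ~BatchIterator() = default;

    // Virtual so a composite iterator can redirect to the iterator that owns the blob.
    virtual const std::vector<char> &getQueryBlob() const { return query_; }

private:
    std::vector<char> query_;
};

// Hypothetical tiered iterator: it holds no query copy of its own.
class TieredBatchIterator : public BatchIterator {
public:
    explicit TieredBatchIterator(std::unique_ptr<BatchIterator> frontend)
        : BatchIterator(std::vector<char>{}), frontend_(std::move(frontend)) {}

    // Delegate to the frontend iterator, which manages its own query copy.
    const std::vector<char> &getQueryBlob() const override {
        return frontend_->getQueryBlob();
    }

private:
    std::unique_ptr<BatchIterator> frontend_;
};
```

Without the virtual call, code holding a base-class reference to the tiered iterator would read the empty blob stored in the base instead of the frontend's query.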
Enhanced Batch Iterator Testing for INT8 Cosine Metric:
- Files: tests/unit/test_int8.cpp, src/VecSim/algorithms/hnsw/hnsw_tiered.h, src/VecSim/batch_iterator.h
- Changes: Extended the CosineBlobCorrectness test to explicitly validate query blob transfer from frontend to backend in tiered batch iterators:
- Added direct verification that the query blob (including the appended norm) is correctly preserved during the transfer from the frontend BruteForce iterator to the backend HNSW iterator.

jit-ci · 2025-09-28T04:46:18Z

Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset.

In case there are security findings, they will be communicated to you as a comment inside the PR.

Hope you’ll enjoy using Jit.

Questions? Comments? Want to learn more? Get in touch with us.

In VecSimIndexAbstract, move data members to private when possible. DONT FIX LEAK YET. Factories: move NewAbstractInitParams to a general location (new file factory_utils) and replace it in all factories.

codecov · 2025-09-29T12:25:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.66%. Comparing base (f68bb6b) to head (74e4fef).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #784      +/-   ##
==========================================
+ Coverage   96.64%   96.66%   +0.01%     
==========================================
  Files         125      126       +1     
  Lines        7724     7707      -17     
==========================================
- Hits         7465     7450      -15     
+ Misses        259      257       -2


use sanitizer in pull request

This caused a buffer overflow because setup<TieredIndexParams> didn't set the data type to INT8, creating a float32 index instead. DataBlock::addElement() tries to copy dim * sizeof(float) bytes, but the allocated buffer is only dim * sizeof(int8) bytes, causing a read overflow.

… constexpr

Fix leaks (these fixes will be moved to a separate PR). It was failing only with codecov because only there we use FP64_TESTS=1.
Prevent template deduction errors in GenerateAndAddVector by making the data_t parameter non-deducible: use std::type_identity<data_t>::type for the value parameter to force explicit template specification (e.g., GenerateAndAddVector<double>()) instead of allowing the compiler to incorrectly deduce int from literal values, which caused buffer overflows when the index expected a different data type.
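
A small sketch of the non-deduced-context technique, using a hypothetical helper rather than the real GenerateAndAddVector. The identity trait is defined locally so the sketch also builds as C++17; it is equivalent to C++20's std::type_identity.

```cpp
#include <cstddef>
#include <vector>

// Local equivalent of C++20 std::type_identity (an assumption made for portability).
template <typename T>
struct type_identity {
    using type = T;
};

// Hypothetical helper: builds `dim` elements of type data_t, all equal to `value`.
// Wrapping the parameter type in type_identity puts it in a non-deduced context,
// so data_t must be spelled out and a literal such as 5 cannot select int.
template <typename data_t>
std::vector<data_t> makeFilledVector(std::size_t dim,
                                     typename type_identity<data_t>::type value) {
    return std::vector<data_t>(dim, value);
}
```

Calling `makeFilledVector<double>(3, 5)` yields three doubles, while `makeFilledVector(3, 5)` fails to compile because data_t can no longer be deduced from the literal.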

This reverts commit af844ec.

meiravgri · 2025-09-30T09:14:35Z

af844ec will be addressed in a separate PR to allow backporting to all version branches

failing logs
https://github.com/RedisAI/VectorSimilarity/actions/runs/18124904227/job/51581732846?pr=784

meiravgri · 2025-09-30T09:18:18Z

Sanitizer will also be added in a separate PR
A successful run of the sanitizer can be found here:
https://github.com/RedisAI/VectorSimilarity/actions/runs/18119763274/job/51564624387

ofiryanai

Looks ok, please make sure the todo in brute_force_factory.cpp is complete or remove the TODO comment and we're good to go

src/VecSim/index_factories/brute_force_factory.cpp

add tests

Copilot

Pull Request Overview

This PR fixes a critical buffer read overflow vulnerability in vector preprocessing by properly distinguishing between input blob sizes and stored data sizes throughout the vector similarity pipeline. The core issue was that preprocessors were receiving incorrect size parameters, causing them to read beyond available input data boundaries.

Key changes include:

Added inputBlobSize member to track original input vector size separately from processed storage size
Updated all preprocessor API calls to use correct input blob sizes instead of stored data sizes
Centralized factory logic to ensure consistent parameter initialization across all index types

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.


File	Description
src/VecSim/vec_sim_index.h	Added `inputBlobSize` member and updated preprocessor calls to use correct sizes
src/VecSim/utils/vec_utils.h/.cpp	Renamed `VecSimParams_GetDataSize` to `VecSimParams_GetStoredDataSize` for clarity
src/VecSim/index_factories/factory_utils.h	New centralized factory utility template for consistent parameter initialization
src/VecSim/index_factories/*.cpp	Updated factories to use centralized parameter creation and correct size calculations
tests/unit/test_*.cpp	Fixed buffer overflows in tests and added comprehensive Cosine metric test coverage
tests/unit/unit_test_utils.cpp	Updated to use renamed `getStoredDataSize()` method


tests/unit/test_index_test_utils.cpp

fix getQueryBlob in tiered; add getHNSWIterator to tiered batch iterator if it's BUILD_TESTS

src/VecSim/index_factories/brute_force_factory.cpp

tests/unit/test_int8.cpp

github-actions · 2025-10-08T18:01:16Z

Backport failed for 8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.2
git worktree add -d .worktree/backport-784-to-8.2 origin/8.2
cd .worktree/backport-784-to-8.2
git switch --create backport-784-to-8.2
git cherry-pick -x 9fb223ae78c5effccfcef44953a22e9d89df40f9

rename dataSize->getStoredDataSize

6a1b1a4

meiravgri added 3 commits September 28, 2025 07:57

missing rename

9860204

format

b4143aa

add input blob size to AbstractIndexInitParams

b3b735a

In VecSimIndexAbstract, move data members to private when possible. DONT FIX LEAK YET. Factories: move NewAbstractInitParams to a general location (new file factory_utils) and replace it in all factories.

meiravgri added 12 commits September 29, 2025 17:55

try sanitizer

ea00886

run sanitizer

f03e559

runanyway

aad4b0c

revert task unit test

766a412

use sanitizer in pull request

fix leak in input blob size

8dfb610

run codecov with sanitizer (also intek)

5b93332

some fixes and assertion

ff1e318

fix uint8

6827d65

fix again

a11d342

fix possible warning about division by zero by checking quantBits with…

580443d

… constexpr

meiravgri changed the title ~~rename dataSize->getStoredDataSize~~ Fix Out-of-Bounds Write in Vector Preprocessing Sep 30, 2025

meiravgri changed the title ~~Fix Out-of-Bounds Write in Vector Preprocessing~~ [MOD-11650] Fix Out-of-Bounds Write in Vector Preprocessing Sep 30, 2025

Revert "TO REVERT!"

9cff30c

This reverts commit af844ec.

revert ci changes

8b5bcd5

meiravgri requested a review from ofiryanai September 30, 2025 10:32

ofiryanai previously approved these changes Sep 30, 2025

View reviewed changes

src/VecSim/index_factories/brute_force_factory.cpp Outdated Show resolved Hide resolved

calculate EstimateElementSize according to the stored vector size

2f979ed

add tests

meiravgri dismissed ofiryanai’s stale review via 2f979ed September 30, 2025 16:40

meiravgri requested a review from ofiryanai September 30, 2025 16:43

ofiryanai previously approved these changes Sep 30, 2025

View reviewed changes

meiravgri requested review from Copilot and GuyAv46 September 30, 2025 17:00

Copilot AI reviewed Sep 30, 2025

View reviewed changes

tests/unit/test_index_test_utils.cpp Show resolved Hide resolved

tests/unit/test_index_test_utils.cpp Show resolved Hide resolved

tests/unit/test_index_test_utils.cpp Show resolved Hide resolved

revert unrelated change in cmake.san

2296850

meiravgri dismissed ofiryanai’s stale review via 2296850 October 1, 2025 05:30

add batch iterator blob correctness to int8 tests

c0443da

fix getQueryBlob in tiered; add getHNSWIterator to tiered batch iterator if it's BUILD_TESTS

GuyAv46 reviewed Oct 8, 2025

View reviewed changes

src/VecSim/index_factories/brute_force_factory.cpp Outdated Show resolved Hide resolved

tests/unit/test_int8.cpp Show resolved Hide resolved

apply suggesting

74e4fef

meiravgri requested a review from GuyAv46 October 8, 2025 14:41

GuyAv46 approved these changes Oct 8, 2025

View reviewed changes

meiravgri enabled auto-merge October 8, 2025 15:03

meiravgri added this pull request to the merge queue Oct 8, 2025

meiravgri added backport 0.8 backport 8.2 and removed backport 0.8 labels Oct 8, 2025

Merged via the queue into main with commit 9fb223a Oct 8, 2025
21 checks passed

meiravgri deleted the meiravg_fix_out_of_bound_write branch October 8, 2025 18:01
