meiravgri
Collaborator

@meiravgri meiravgri commented Sep 28, 2025

This PR addresses a potential buffer read overflow issue resulting from incorrect blob size handling during vector preprocessing.
The root cause was a mix-up between storedDataSize (the size of the processed blob) and the actual input blob size: the preprocessors were passed dataSize (the stored data size) where the input blob size was required.
This mismatch caused preprocessors to copy more data than available in the input blob, leading to out-of-bounds reads.

Correct Behavior

The system should properly distinguish between two different size concepts:

  • inputBlobSize: Size of the original input vector blob (dim * sizeof(type))
  • storedDataSize: Size of vectors after preprocessing (may include extra bytes for normalization)

Data Size Relationships:

Non-Tiered Indexes:

  • No preprocessing required: storedDataSize = inputBlobSize = dim * sizeof(type)
  • INT8/UINT8 + Cosine: storedDataSize = inputBlobSize + 4 (norm stored at end)

Tiered Indexes:

  • Frontend index: Follows non-tiered rules above
  • Backend index: Receives preprocessed blobs, so storedDataSize = inputBlobSize
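The size rules above can be sketched as a small helper. This is only an illustration under stated assumptions: the enums and the functions elemSize, inputBlobSize and storedDataSize are hypothetical names, not the library's API, and the 4 extra bytes are taken from the "+ 4" rule listed above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enums for illustration; the library's real type names differ.
enum class ElemType { FLOAT32, FLOAT64, INT8, UINT8 };
enum class Metric { L2, IP, Cosine };

// Extra bytes appended for the norm, per the "+ 4" rule in the description.
constexpr std::size_t kNormBytes = 4;

inline std::size_t elemSize(ElemType t) {
    switch (t) {
    case ElemType::FLOAT32: return sizeof(float);
    case ElemType::FLOAT64: return sizeof(double);
    case ElemType::INT8:
    case ElemType::UINT8:   return 1;
    }
    assert(false && "unknown element type");
    return 0;
}

// Size of the blob the caller hands in: dim * sizeof(type).
inline std::size_t inputBlobSize(ElemType t, std::size_t dim) {
    return dim * elemSize(t);
}

// Size of the blob after preprocessing (what the index stores).
inline std::size_t storedDataSize(ElemType t, Metric m, std::size_t dim) {
    std::size_t size = inputBlobSize(t, dim);
    const bool isInt8Like = (t == ElemType::INT8 || t == ElemType::UINT8);
    if (isInt8Like && m == Metric::Cosine) {
        size += kNormBytes; // norm stored at the end of the vector
    }
    return size;
}
```

For dim = 4, an INT8 + Cosine index would take a 4-byte input blob and store 8 bytes, while a FLOAT32 index stores the same 16 bytes it receives.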

What Was Fixed

The fix ensures memory safety by properly distinguishing between input blob sizes and stored data sizes throughout the vector similarity pipeline.

The fix was implemented by:

  1. Adding inputBlobSize as a new member to the VecSimIndexAbstract class and AbstractIndexInitParams struct to explicitly track the original input vector size
  2. Modifying all preprocessor API calls to use inputBlobSize instead of storedDataSize when processing input vectors, ensuring preprocessors only access the actual available input data
  3. Establishing clear size relationships where inputBlobSize represents the original input size (dim * sizeof(type)) while storedDataSize represents the final processed size (which may include extra bytes for normalization)
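To make the failure mode concrete, here is a minimal sketch of an INT8 + Cosine preprocessing step. The function preprocessInt8Cosine is hypothetical and is not the library's preprocessor; it assumes the norm is a 4-byte float appended after the vector. The point it illustrates is that the read from the caller's buffer is bounded by inputBlobSize, while only the destination buffer is sized by storedDataSize.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float norm");

// Hypothetical INT8 + Cosine preprocessor: copies the input vector and
// appends its L2 norm. `inputBlobSize` is the number of bytes the caller
// actually owns (dim * sizeof(int8_t)).
std::vector<std::uint8_t> preprocessInt8Cosine(const std::int8_t *input,
                                               std::size_t inputBlobSize) {
    assert(input != nullptr);
    const std::size_t storedDataSize = inputBlobSize + sizeof(float);
    std::vector<std::uint8_t> stored(storedDataSize);

    // Read exactly inputBlobSize bytes. Copying storedDataSize bytes here
    // would read sizeof(float) bytes past the end of the caller's buffer.
    std::memcpy(stored.data(), input, inputBlobSize);

    // Compute the norm from the input elements only.
    float sumSquares = 0.0f;
    for (std::size_t i = 0; i < inputBlobSize; ++i) {
        const float v = static_cast<float>(input[i]);
        sumSquares += v * v;
    }
    const float norm = std::sqrt(sumSquares);

    // The extra bytes exist only in the destination buffer.
    std::memcpy(stored.data() + inputBlobSize, &norm, sizeof(float));
    return stored;
}
```

Calling it with a 4-element int8 vector yields an 8-byte blob whose last 4 bytes hold the norm.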

Additional Changes

  • Centralized Factory Logic:

    • File: src/VecSim/index_factories/factory_utils.h (new)
    • Changes: Created NewAbstractInitParams() template function to standardize parameter initialization across all factories, eliminating code duplication and ensuring consistent inputBlobSize calculation logic throughout the codebase
  • Renamed dataSize → storedDataSize for Clarity

  • Test Fixes - Buffer Overflow in Element Size Estimation:

    • Files: test_int8.cpp, test_uint8.cpp
    • Fix: Fixed buffer overflow in INT8/UINT8 element size estimation tests where the TieredIndexParams was created from HNSWParams that didn't explicitly set the type field.
      The buffer overflow occurred because when the test called addVector, DataBlock::addElement() tried to copy dim * sizeof(float) bytes (16 bytes for dim=4), but the allocated buffer was only dim * sizeof(int8_t) bytes (4 bytes for dim=4), leading to out-of-bounds memory access during vector operations.
  • Test Fixes - Compiler Warnings:

    • Files: test_svs.cpp, test_svs_multi.cpp, test_svs_tiered.cpp
    • Fix: Fixed compiler warnings about potential division by zero by making quantBits constexpr. The compiler couldn't prove at compile-time that quantBits was never zero, even with runtime checks. Using constexpr moves the evaluation to compile-time, allowing if constexpr to eliminate the division code when quantBits == VecSimSvsQuant_NONE, satisfying static analysis requirements.
  • Fixed Element Size Estimation in Factory Functions:

    • Files: src/VecSim/index_factories/brute_force_factory.cpp, src/VecSim/index_factories/hnsw_factory.cpp
    • Changes: Updated EstimateElementSize() functions to use VecSimParams_GetStoredDataSize() instead of dim * sizeof(type) for accurate memory estimation. This ensures size calculations account for additional storage requirements (such as normalization data for INT8/UINT8 with Cosine metric) and provides consistent estimation across all index types. Added comprehensive test coverage for Cosine metric element size estimation in INT8 and UINT8 test suites to validate the fix.
  • Fixed Query Blob Access in Tiered Batch Iterator:

    • Files: src/VecSim/batch_iterator.h, src/VecSim/algorithms/hnsw/hnsw_tiered.h
    • Changes: Made getQueryBlob() virtual in the base VecSimBatchIterator class and overrode it in TieredHNSW_BatchIterator to properly delegate to the frontend iterator. This was necessary because the tiered batch iterator itself doesn't own any query vector - it relies on its internal frontend and backend iterators to manage their own query copies.
  • Enhanced Batch Iterator Testing for INT8 Cosine Metric:

    • Files: tests/unit/test_int8.cpp, src/VecSim/algorithms/hnsw/hnsw_tiered.h, src/VecSim/batch_iterator.h
    • Changes: Extended the CosineBlobCorrectness test to explicitly validate query blob transfer from frontend to backend in tiered batch iterators:
    • Added direct verification that the query blob (including the appended norm) is correctly preserved during the transfer from the frontend BruteForce iterator to the backend HNSW iterator.
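The getQueryBlob() delegation described above can be sketched as follows. The class names BatchIteratorBase and TieredBatchIterator are simplified stand-ins, not the library's VecSimBatchIterator or TieredHNSW_BatchIterator, and the query blob is modeled as a plain byte vector for illustration.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for a batch iterator that owns a copy of its query.
class BatchIteratorBase {
public:
    explicit BatchIteratorBase(std::vector<char> query) : query_(std::move(query)) {}
    virtual ~BatchIteratorBase() = default;

    // Virtual so that wrappers without their own query can redirect the call.
    virtual const void *getQueryBlob() const { return query_.data(); }

private:
    std::vector<char> query_;
};

// Simplified stand-in for a tiered iterator: it owns no query of its own,
// so it forwards to the frontend iterator, which does.
class TieredBatchIterator : public BatchIteratorBase {
public:
    explicit TieredBatchIterator(std::unique_ptr<BatchIteratorBase> frontend)
        : BatchIteratorBase(std::vector<char>{}), frontend_(std::move(frontend)) {}

    const void *getQueryBlob() const override { return frontend_->getQueryBlob(); }

private:
    std::unique_ptr<BatchIteratorBase> frontend_;
};
```

Without the override, a caller holding the tiered iterator through a base pointer would get the tiered iterator's own (empty) query storage rather than the frontend's preprocessed blob.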

in VecSimIndexAbstract move data members to private when possible

DONT FIX LEAK YET

Factories:
move NewAbstractInitParams to a general location (new file factory_utils)
replace it in all factories

codecov bot commented Sep 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.66%. Comparing base (f68bb6b) to head (74e4fef).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #784      +/-   ##
==========================================
+ Coverage   96.64%   96.66%   +0.01%     
==========================================
  Files         125      126       +1     
  Lines        7724     7707      -17     
==========================================
- Hits         7465     7450      -15     
+ Misses        259      257       -2     


use sanitizer in pull request
This caused a buffer overflow because setup<TieredIndexParams> didn't set the data type to INT8, creating a float32 index instead.
DataBlock::addElement() tries to copy dim * sizeof(float) bytes, but the allocated buffer is only dim * sizeof(int8) bytes, causing a read overflow.
fix leaks that will be moved to a separate PR
it was failing only with codecov because only there we use FP64_TESTS=1

Prevent template deduction errors in GenerateAndAddVector by making data_t parameter non-deducible

Used std::type_identity<data_t>::type for the value parameter to force explicit template specification (e.g., GenerateAndAddVector<double>()) instead of allowing compiler to incorrectly deduce int from literal values, which caused buffer overflows when index expected different data types.
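A minimal sketch of the non-deducible parameter technique follows. The helper makeVector is hypothetical and much simpler than GenerateAndAddVector. The non_deduced struct is a local equivalent of C++20's std::type_identity, used here only so the sketch also compiles with older standards.

```cpp
#include <cstddef>
#include <vector>

// Local equivalent of C++20 std::type_identity: wrapping a template
// parameter in a nested ::type makes it a non-deduced context.
template <typename T>
struct non_deduced {
    using type = T;
};

// Hypothetical helper: builds a vector of `dim` elements, all set to `value`.
// Because `value` is in a non-deduced context, data_t cannot be inferred from
// the argument, so the caller must write makeVector<double>(...) explicitly.
template <typename data_t>
std::vector<data_t> makeVector(std::size_t dim, typename non_deduced<data_t>::type value) {
    return std::vector<data_t>(dim, value);
}
```

With this signature, makeVector(4, 1) does not compile, whereas makeVector<double>(4, 1) converts the literal to double and produces 8-byte elements instead of silently producing int elements.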
@meiravgri meiravgri changed the title from "rename dataSize->getStoredDataSize" to "Fix Out-of-Bounds Write in Vector Preprocessing" Sep 30, 2025
@meiravgri meiravgri changed the title from "Fix Out-of-Bounds Write in Vector Preprocessing" to "[MOD-11650] Fix Out-of-Bounds Write in Vector Preprocessing" Sep 30, 2025
This reverts commit af844ec.
@meiravgri
Collaborator Author

meiravgri commented Sep 30, 2025

af844ec will be addressed in a separate PR to allow backporting to all version branches

failing logs
https://github.com/RedisAI/VectorSimilarity/actions/runs/18124904227/job/51581732846?pr=784

@meiravgri
Collaborator Author

Sanitizer will also be added in a separate PR
A successful run of the sanitizer can be found here:
https://github.com/RedisAI/VectorSimilarity/actions/runs/18119763274/job/51564624387

@meiravgri meiravgri requested a review from ofiryanai September 30, 2025 10:32
ofiryanai
ofiryanai previously approved these changes Sep 30, 2025
Collaborator

@ofiryanai ofiryanai left a comment


Looks OK. Please make sure the TODO in brute_force_factory.cpp is completed, or remove the TODO comment, and we're good to go

ofiryanai
ofiryanai previously approved these changes Sep 30, 2025
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a critical buffer read overflow vulnerability in vector preprocessing by properly distinguishing between input blob sizes and stored data sizes throughout the vector similarity pipeline. The core issue was that preprocessors were receiving incorrect size parameters, causing them to read beyond available input data boundaries.

Key changes include:

  • Added inputBlobSize member to track original input vector size separately from processed storage size
  • Updated all preprocessor API calls to use correct input blob sizes instead of stored data sizes
  • Centralized factory logic to ensure consistent parameter initialization across all index types

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.

File: Description

  • src/VecSim/vec_sim_index.h: Added inputBlobSize member and updated preprocessor calls to use correct sizes
  • src/VecSim/utils/vec_utils.h/.cpp: Renamed VecSimParams_GetDataSize to VecSimParams_GetStoredDataSize for clarity
  • src/VecSim/index_factories/factory_utils.h: New centralized factory utility template for consistent parameter initialization
  • src/VecSim/index_factories/*.cpp: Updated factories to use centralized parameter creation and correct size calculations
  • tests/unit/test_*.cpp: Fixed buffer overflows in tests and added comprehensive Cosine metric test coverage
  • tests/unit/unit_test_utils.cpp: Updated to use renamed getStoredDataSize() method


fix getQueryBlob in tiered
add getHNSWIterator to tiered batch iterator if BUILD_TESTS is set
@meiravgri meiravgri requested a review from GuyAv46 October 8, 2025 14:41
@meiravgri meiravgri enabled auto-merge October 8, 2025 15:03
@meiravgri meiravgri added this pull request to the merge queue Oct 8, 2025
Merged via the queue into main with commit 9fb223a Oct 8, 2025
21 checks passed
@meiravgri meiravgri deleted the meiravg_fix_out_of_bound_write branch October 8, 2025 18:01

github-actions bot commented Oct 8, 2025

Backport failed for 8.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin 8.2
git worktree add -d .worktree/backport-784-to-8.2 origin/8.2
cd .worktree/backport-784-to-8.2
git switch --create backport-784-to-8.2
git cherry-pick -x 9fb223ae78c5effccfcef44953a22e9d89df40f9


3 participants