
Conversation

@ffelixg

@ffelixg ffelixg commented Dec 31, 2025

Work Item / Issue Reference

GitHub Issue: #391


Summary

This PR enlarges the fetch buffer for char/varchar columns by a factor of 4 to account for characters that take up more space in UTF-8 than in whichever encoding the database is using.

It also adds a test that passes at each commit and therefore tracks how the interface evolves. I removed some of the fallback mechanisms, and I hope the evolution of the error messages across the different iterations of the test shows why that is preferable.

For the SQLGetData path for wchars, a columnSize == 0 check is added to avoid needing the fallback branch for nvarchar(max). Previously, SQLGetData was called once with a buffer of length 0, and only then, after entering the fallback branch, did the real attempt to fetch data start. columnSize == 0 was also the only case where the fallback branch behaved correctly; anything else discarded the first columnSize characters.
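As a sketch of the dispatch described above (Python for brevity, with illustrative names only; the real implementation is C++ in ddbc_bindings.cpp): when the driver reports columnSize == 0, as it does for nvarchar(max), the fetch goes straight to LOB streaming instead of first probing SQLGetData with an empty buffer.

```python
# Hypothetical sketch -- function and parameter names are illustrative,
# not the actual identifiers in ddbc_bindings.cpp.
def fetch_wchar_column(column_size, fetch_lob, fetch_bounded):
    if column_size == 0:
        # (max) types report size 0: stream the LOB directly,
        # skipping the zero-length SQLGetData probe call.
        return fetch_lob()
    # Bounded types: a single fetch with a correctly sized buffer.
    return fetch_bounded(column_size)
```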

A test that covers all of latin1 and documents behavior with unmapped characters is added as well.

Note that if the database uses a UTF-8 collation, a buffer of size columnSize would be enough, since columnSize then reports the number of bytes rather than the number of characters (a distinction that only matters for UTF-8).

A columnSize == 0 check is also added to FetchLobColumnData, which was likely missed. That way FetchLobColumnData gets called directly instead of going through a SQLGetData call with an empty buffer first.
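To illustrate why columnSize alone under-sizes the buffer (a Python sketch; cp1252 stands in here for a single-byte SQL Server collation):

```python
# '€' occupies 1 byte in a single-byte code page such as cp1252, but the
# non-Windows driver hands data back re-encoded as UTF-8, where it needs 3.
value = "€"
assert len(value.encode("cp1252")) == 1  # fits in varchar(1)
assert len(value.encode("utf-8")) == 3   # needs 3 bytes after conversion

# The PR's sizing rule over-allocates by design:
column_size = 1
fetch_buffer_size = 4 * column_size + 1  # +1 for the one-byte null terminator
assert fetch_buffer_size >= len(value.encode("utf-8")) + 1
```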
Copilot AI review requested due to automatic review settings December 31, 2025 01:21
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses a buffer overflow issue when fetching VARCHAR columns containing characters that require more bytes in UTF-8 encoding than in the database's character encoding. The fix multiplies the fetch buffer size by 4 for CHAR/VARCHAR columns and removes fallback mechanisms that previously masked these issues.

Key Changes:

  • Quadrupled buffer allocation for CHAR/VARCHAR columns to account for UTF-8 conversion overhead
  • Removed isLob field from column info structures and simplified LOB detection logic
  • Replaced fallback mechanisms (FetchLobColumnData calls) with explicit error messages when buffers overflow
  • Added test case validating special character handling (ß character that requires 2 bytes in UTF-8)

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.

File Description
tests/test_004_cursor.py Adds test for VARCHAR buffer sizing with special character (ß) that requires 2 bytes in UTF-8
mssql_python/pybind/ddbc_bindings.h Removes isLob field and FetchLobColumnData forward declaration; updates ProcessChar, ProcessWChar, and ProcessBinary to throw errors instead of falling back to LOB streaming
mssql_python/pybind/ddbc_bindings.cpp Implements 4x buffer multiplier for VARCHAR columns in SQLGetData_wrap, SQLBindColums, FetchBatchData, and calculateRowSize; removes lobColumns parameter from FetchBatchData signature; adds columnSize == 0 check for WCHAR LOB detection


@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jahnvi480
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jahnvi480
Contributor

@ffelixg Can you please have a look at why the tests are failing and make the required changes?

@ffelixg
Author

ffelixg commented Jan 6, 2026

I couldn't reproduce the failure locally, but I think sys.objects might not have had enough rows in CI for what I was trying to use it for. Curiously, it worked in the first CI run. Maybe the recent merges from main changed the number of objects somehow, or the test was just flaky. Either way, this part no longer uses sys.objects.

MSVC seems to be configured more restrictively than GCC regarding unused parameters, so I removed hStmt from the columnProcessors.

Could you rerun the CI, @jahnvi480?

@jahnvi480
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bewithgaurav
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bewithgaurav bewithgaurav linked an issue Jan 8, 2026 that may be closed by this pull request
@jahnvi480
Contributor

@ffelixg Can you please check why the tests are failing only on Windows?
Please let me know if you need any help here.

@ffelixg
Author

ffelixg commented Jan 8, 2026

@jahnvi480 It's the issues I commented on in #391. I changed the tests so that they pass; on Windows they execute a separate branch that asserts the problematic cases return exceptions or corrupt data.

My original issue regarding the buffer size is fixed, and everything works as I would expect on Linux now. Since the Windows driver returns latin1 data, which can't have multiple bytes per character, that issue was actually only affecting Linux/macOS.

Regarding the additional issues I found while writing the tests, I think you should discuss internally how you want to restructure the API. The current behavior would require the user to write platform-specific code, which is not great. My suggestion would be to request wide chars from the ODBC driver, as I wrote in the issue.

Actually addressing that is probably outside the scope of this PR/issue. Maybe we could sneak in a fix for fetchmany/fetchall ignoring setdecoding; that seems fairly straightforward.

@jahnvi480
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

- uint64_t fetchBufferSize = columnSize + 1 /* null-termination */;
+ // Multiply by 4 because utf8 conversion by the driver might
+ // turn varchar(x) into up to 3*x (maybe 4*x?) bytes.
+ uint64_t fetchBufferSize = 4 * columnSize + 1 /* null-termination */;
Contributor

Is there a specific reason for this number, or is it arbitrary?

Author

@ffelixg ffelixg Jan 9, 2026


If we are fetching varchar(x), the driver gives us x in the columnSize variable. The driver (msodbcsql18), however, does not tell us which collation the varchar(x) column uses. The non-Windows drivers also convert the data to UTF-8 no matter which collation the column uses. So if it's a UTF-8 collation, x equals the maximum number of bytes and we're fine. If varchar(x) uses any other single-byte collation, we must allocate a buffer large enough that the result of a conversion to UTF-8 fits. From my understanding, UTF-8 encodes most characters of other collations with 2 bytes; some later additions to latin1, like "€", may require 3 bytes.

I've also verified this by iterating over all possible SQL Server collations and converting the result to UTF-8. If any single-byte collation had a character that requires more than 3 bytes, the following script would raise an error. (I also posted this in a Copilot review comment.)

SET NOCOUNT ON;
DECLARE @collation_name NVARCHAR(128);
DECLARE collation_cursor CURSOR FOR SELECT name FROM fn_helpcollations();
OPEN collation_cursor;
FETCH NEXT FROM collation_cursor INTO @collation_name;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @collation_name NOT LIKE N'%_UTF8%'
    BEGIN
        DECLARE @sql NVARCHAR(MAX) = N'
        declare @t1 table (a varchar(1) collate ' + @collation_name + N')
        declare @t2 table (a varchar(4) collate Latin1_General_100_CI_AI_SC_UTF8)
        insert into @t1 select top 256 cast(row_number() over(order by (select 1)) - 1 as binary(1)) a from sys.objects
        insert into @t2 select cast(a as nvarchar(10)) from @t1
        if (select max(datalength(a)) from @t2) > 3
            throw 50000, ''datalength too big'', 1
        ';
        EXEC sp_executesql @sql;
    END
    
    FETCH NEXT FROM collation_cursor INTO @collation_name;
END
CLOSE collation_cursor;
DEALLOCATE collation_cursor;

Therefore x needs to be multiplied by at least 3. The +1 for the null terminator is fine, since the null terminator takes 1 byte regardless of encoding/collation.
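The same kind of check can be approximated locally with Python's codecs, using cp1252 as a stand-in for a single-byte collation (this is an illustration, not a substitute for the SQL Server verification above):

```python
# Find the longest UTF-8 re-encoding of any single cp1252 character.
max_utf8_len = 0
for byte in range(256):
    try:
        ch = bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are unmapped
    max_utf8_len = max(max_utf8_len, len(ch.encode("utf-8")))
print(max_utf8_len)  # 3, e.g. '€' (0x80) and '™' (0x99)
```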

@github-actions

github-actions bot commented Jan 9, 2026

📊 Code Coverage Report

🔥 Diff Coverage

64%


🎯 Overall Coverage

76%


📈 Total Lines Covered: 5442 out of 7122
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (62.5%): Missing lines 2963-2966,3037-3040,3307-3310
  • mssql_python/pybind/ddbc_bindings.h (68.4%): Missing lines 809-810,879-880,922-923

Summary

  • Total: 51 lines
  • Missing: 18 lines
  • Coverage: 64%

📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.row.py: 66.2%
mssql_python.helpers.py: 67.5%
mssql_python.pybind.ddbc_bindings.cpp: 69.3%
mssql_python.pybind.ddbc_bindings.h: 71.9%
mssql_python.pybind.connection.connection.cpp: 73.6%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 84.1%




Development

Successfully merging this pull request may close these issues:

  • Utf8-related buffer size issue with varchar fetches

4 participants