FIX: varchar columnsize does not account for utf8 conversion #392
…less SQLGetData fallback
Also adds a columnSize == 0 check to the LOB detection, which was likely missed. That way FetchLobColumnData gets called directly instead of going through a SQLGetData call with an empty buffer first.
Pull request overview
This PR addresses a buffer overflow issue when fetching VARCHAR columns containing characters that require more bytes in UTF-8 encoding than in the database's character encoding. The fix multiplies the fetch buffer size by 2 for CHAR/VARCHAR columns and removes fallback mechanisms that previously masked these issues.
Key Changes:
- Doubled buffer allocation for CHAR/VARCHAR columns to account for UTF-8 conversion overhead
- Removed `isLob` field from column info structures and simplified LOB detection logic
- Replaced fallback mechanisms (FetchLobColumnData calls) with explicit error messages when buffers overflow
- Added test case validating special character handling (ß character that requires 2 bytes in UTF-8)
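The underlying mismatch is easy to reproduce outside the driver: a character such as ß takes one byte in a single-byte encoding like latin1 but two bytes once converted to UTF-8. A standalone Python illustration (not mssql_python code):

```python
# 'ß' (U+00DF): 1 byte in a single-byte encoding, 2 bytes in UTF-8.
# A buffer sized for varchar(1) in the database's encoding therefore
# overflows after the driver's UTF-8 conversion.
text = "ß"
print(len(text.encode("latin-1")))  # 1
print(len(text.encode("utf-8")))    # 2
```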
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tests/test_004_cursor.py | Adds test for VARCHAR buffer sizing with special character (ß) that requires 2 bytes in UTF-8 |
| mssql_python/pybind/ddbc_bindings.h | Removes isLob field and FetchLobColumnData forward declaration; updates ProcessChar, ProcessWChar, and ProcessBinary to throw errors instead of falling back to LOB streaming |
| mssql_python/pybind/ddbc_bindings.cpp | Implements 2x buffer multiplier for VARCHAR columns in SQLGetData_wrap, SQLBindColums, FetchBatchData, and calculateRowSize; removes lobColumns parameter from FetchBatchData signature; adds columnSize == 0 check for WCHAR LOB detection |
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@ffelixg Can you please have a look at why the tests are failing and make the required changes?
I couldn't reproduce the failure locally, but MSVC seems to be configured more restrictively than GCC regarding unused parameters. I removed hStmt from the columnProcessors. Could you rerun the CI @jahnvi480?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@ffelixg Can you please check why the tests are only failing on Windows?
@jahnvi480 It's the issues I've commented on in #391. I changed the tests so that they pass, and they now execute a separate branch on Windows where they assert that the problematic cases return exceptions or corrupt data. My original issue regarding the buffer size is fixed, and everything works as I would expect on Linux now. Since the Windows driver returns latin1 data, which can't have multiple bytes per character, that issue was actually only affecting Linux/macOS.

Regarding the additional issues I found while writing the tests, I think you should discuss internally how you want to restructure the API. The current behavior would require the user to write platform-specific code, which is not great. My suggestion would be requesting wide chars from the ODBC driver, like I wrote in the issue. Actually addressing that is probably outside the scope of this PR/issue. Maybe we could sneak in a fix for fetchmany/fetchall ignoring setdecoding; that seems fairly straightforward.
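The platform split can be seen with plain Python (an illustration of the encoding mismatch only, not mssql_python API code): the Windows driver returns latin1 (assumed here to mean cp1252) bytes, while the Linux/macOS drivers return UTF-8, so no single fixed decode serves both platforms.

```python
# Same logical value, different wire encodings depending on platform.
raw_windows = "ß".encode("cp1252")  # b'\xdf' from the Windows driver
raw_linux = "ß".encode("utf-8")     # b'\xc3\x9f' from the Linux driver

print(raw_windows.decode("cp1252"))  # 'ß' -- correct on Windows
print(raw_linux.decode("utf-8"))     # 'ß' -- correct on Linux/macOS
print(raw_linux.decode("cp1252"))    # 'ÃŸ' -- mojibake with the wrong decode
```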
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
- uint64_t fetchBufferSize = columnSize + 1 /* null-termination */;
+ // Multiply by 4 because utf8 conversion by the driver might
+ // turn varchar(x) into up to 3*x (maybe 4*x?) bytes.
+ uint64_t fetchBufferSize = 4 * columnSize + 1 /* null-termination */;
Is there a specific reason for this number, or is it arbitrary?
If we are fetching varchar(x), the driver gives us x in the columnSize variable. However, the driver (msodbcsql18) does not tell us which collation the varchar(x) column uses. The non-Windows drivers also convert the data to UTF-8 no matter which collation the column uses. So if it's a UTF-8 collation, x equals the maximum number of bytes and we're fine. If varchar(x) uses any other single-byte collation, we must allocate a buffer large enough that the result of a conversion to UTF-8 fits. From my understanding, UTF-8 encodes most characters of single-byte collations with 2 bytes; some later additions, like "€" in latin1, may require 3 bytes.
I've also verified this by iterating over all possible SQL Server collations and converting the result to UTF-8. If there were a single-byte collation with a character that requires more than 3 bytes, the following script would raise an error. (Posted this also in a Copilot review comment.)
SET NOCOUNT ON;
DECLARE @collation_name NVARCHAR(128);
DECLARE collation_cursor CURSOR FOR SELECT name FROM fn_helpcollations();
OPEN collation_cursor;
FETCH NEXT FROM collation_cursor INTO @collation_name;
WHILE @@FETCH_STATUS = 0
BEGIN
IF @collation_name NOT LIKE N'%_UTF8%'
BEGIN
DECLARE @sql NVARCHAR(MAX) = N'
declare @t1 table (a varchar(1) collate ' + @collation_name + N')
declare @t2 table (a varchar(4) collate Latin1_General_100_CI_AI_SC_UTF8)
insert into @t1 select top 256 cast(row_number() over(order by (select 1)) - 1 as binary(1)) a from sys.objects
insert into @t2 select cast(a as nvarchar(10)) from @t1
if (select max(datalength(a)) from @t2) > 3
throw 50000, ''datalength too big'', 1
';
EXEC sp_executesql @sql;
END
FETCH NEXT FROM collation_cursor INTO @collation_name;
END
CLOSE collation_cursor;
DEALLOCATE collation_cursor;

Therefore x needs to be multiplied by at least 3. The +1 for the null terminator is fine, since the null terminator takes 1 byte no matter the encoding/collation.
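The same bound can be sanity-checked client-side. A small Python sketch (assuming SQL Server's Latin1 collations correspond to Windows code page 1252) that scans every byte of a single-byte code page and reports the worst-case UTF-8 expansion:

```python
def max_utf8_expansion(codec: str = "cp1252") -> int:
    """Worst-case UTF-8 byte length of any single byte in `codec`."""
    worst = 0
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte undefined in this code page
        worst = max(worst, len(ch.encode("utf-8")))
    return worst

# "€" (0x80 in cp1252, U+20AC) is the kind of late addition that needs
# 3 UTF-8 bytes; nothing in cp1252 needs more.
print(max_utf8_expansion("cp1252"))  # 3
```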
Work Item / Issue Reference
Summary
This PR enlarges the fetch buffer size for char/varchar columns by a factor of 4 to account for characters which take up more space in UTF-8 than in whichever encoding the database is using.
It also adds a test, which passes at each commit and therefore tracks how the interface changes. I removed some of the fallback mechanisms, and I hope the evolution of the error messages over the different iterations of the test shows why that's preferable.
For the SQLGetData path for wchars, a `columnSize == 0` check is added to avoid needing the fallback branch for `nvarchar(max)`. Previously, SQLGetData was called once with a buffer of length 0, and only then did the real attempt to fetch data start, after entering the fallback branch. `columnSize == 0` was actually also the only case where the fallback branch behaved correctly; anything else discarded the first `columnSize` characters.

A test that covers all of latin1 and documents behavior with unmapped characters is added as well.
Note that if the database is using a UTF-8 collation, then a buffer of size `columnSize` would be enough, as it tells us the number of bytes and not the number of characters (this distinction only matters for UTF-8).
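Putting the sizing rules together, a hypothetical sketch (Python for brevity, with illustrative names rather than the actual ddbc_bindings internals): `columnSize == 0` signals (n)varchar(max) and routes to LOB streaming; anything else gets the 4x multiplier plus a null terminator.

```python
from typing import Optional

def fetch_buffer_bytes(column_size: int) -> Optional[int]:
    """Illustrative CHAR/VARCHAR fetch-buffer sizing.

    Returns None to signal LOB streaming, since (n)varchar(max)
    reports columnSize == 0.
    """
    if column_size == 0:
        return None  # stream via the LOB path instead of a sized buffer
    # A single-byte collation can expand to up to 3 bytes per character
    # after the driver's UTF-8 conversion; 4x leaves headroom, and +1
    # covers the null terminator (always 1 byte regardless of encoding).
    return 4 * column_size + 1

print(fetch_buffer_bytes(10))  # 41
print(fetch_buffer_bytes(0))   # None
```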