FIX: varchar columnsize does not account for utf8 conversion #392
…less SQLGetData fallback
Also adds a columnSize == 0 check to the LOB detection, which was likely missed. That way FetchLobColumnData gets called directly instead of going through a SQLGetData call with an empty buffer first.
Pull request overview
This PR addresses a buffer overflow issue when fetching VARCHAR columns containing characters that require more bytes in UTF-8 encoding than in the database's character encoding. The fix multiplies the fetch buffer size by 2 for CHAR/VARCHAR columns and removes fallback mechanisms that previously masked these issues.
Key Changes:
- Doubled buffer allocation for CHAR/VARCHAR columns to account for UTF-8 conversion overhead
- Removed `isLob` field from column info structures and simplified LOB detection logic
- Replaced fallback mechanisms (FetchLobColumnData calls) with explicit error messages when buffers overflow
- Added test case validating special character handling (ß character that requires 2 bytes in UTF-8)
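The underlying mismatch is easy to reproduce outside the driver: a character such as ß takes one byte in a single-byte encoding like latin1 but two bytes once converted to UTF-8. A standalone Python illustration (not mssql_python code):

```python
# 'ß' (U+00DF): 1 byte in a single-byte encoding, 2 bytes in UTF-8.
# A buffer sized for varchar(1) in the database's encoding therefore
# overflows after the driver's UTF-8 conversion.
text = "ß"
print(len(text.encode("latin-1")))  # 1
print(len(text.encode("utf-8")))    # 2
```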
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tests/test_004_cursor.py | Adds test for VARCHAR buffer sizing with special character (ß) that requires 2 bytes in UTF-8 |
| mssql_python/pybind/ddbc_bindings.h | Removes isLob field and FetchLobColumnData forward declaration; updates ProcessChar, ProcessWChar, and ProcessBinary to throw errors instead of falling back to LOB streaming |
| mssql_python/pybind/ddbc_bindings.cpp | Implements 2x buffer multiplier for VARCHAR columns in SQLGetData_wrap, SQLBindColums, FetchBatchData, and calculateRowSize; removes lobColumns parameter from FetchBatchData signature; adds columnSize == 0 check for WCHAR LOB detection |
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@ffelixg Can you please have a look at why the tests are failing and make the required changes?
I couldn't reproduce the failure locally, but MSVC seems to be configured more restrictively than GCC regarding unused parameters. I removed hStmt from the columnProcessors. Could you rerun the CI @jahnvi480?
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@ffelixg Can you please check why the tests are only failing on Windows?
@jahnvi480 It's the issues I've commented on in #391. I changed the tests so that they pass, and they now execute a separate branch on Windows where they assert that the problematic cases return exceptions or corrupt data. My original issue regarding the buffer size is fixed, and everything works as I would expect on Linux now. Since the Windows driver returns latin1 data, which can't have multiple bytes per character, that issue was actually only affecting Linux/macOS.

Regarding the additional issues I found while writing the tests, I think you should discuss internally how you want to restructure the API. The current behavior would require the user to write platform-specific code, which is not great. My suggestion would be requesting wide chars from the ODBC driver, like I wrote in the issue. Actually addressing that is probably outside the scope of this PR/issue. Maybe we could sneak in a fix for fetchmany/fetchall ignoring setdecoding; that seems fairly straightforward.
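The platform split can be seen with plain Python (an illustration of the encoding mismatch only, not mssql_python API code): the Windows driver returns latin1 (assumed here to mean cp1252) bytes, while the Linux/macOS drivers return UTF-8, so no single fixed decode serves both platforms.

```python
# Same logical value, different wire encodings depending on platform.
raw_windows = "ß".encode("cp1252")  # b'\xdf' from the Windows driver
raw_linux = "ß".encode("utf-8")     # b'\xc3\x9f' from the Linux driver

print(raw_windows.decode("cp1252"))  # 'ß' -- correct on Windows
print(raw_linux.decode("utf-8"))     # 'ß' -- correct on Linux/macOS
print(raw_linux.decode("cp1252"))    # 'ÃŸ' -- mojibake with the wrong decode
```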
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
- uint64_t fetchBufferSize = columnSize + 1 /* null-termination */;
+ // Multiply by 4 because utf8 conversion by the driver might
+ // turn varchar(x) into up to 3*x (maybe 4*x?) bytes.
+ uint64_t fetchBufferSize = 4 * columnSize + 1 /* null-termination */;
Is there a specific reason for this number, or is it arbitrary?
If we are fetching varchar(x), the driver gives us x in the columnSize variable. However, the driver (msodbcsql18) does not tell us which collation the varchar(x) column uses. The non-Windows drivers also convert the data to UTF-8 no matter which collation the column uses. So if it's a UTF-8 collation, x equals the maximum number of bytes and we're fine. If varchar(x) uses any other single-byte collation, we must allocate a buffer large enough that the result of a conversion to UTF-8 fits. From my understanding, UTF-8 encodes most characters of single-byte collations with 2 bytes; some later additions, like "€" in latin1, may require 3 bytes.
I've also verified this by iterating over all possible SQL Server collations and converting the result to UTF-8. If there were a single-byte collation with a character that requires more than 3 bytes, the following script would raise an error. (Posted this also in a Copilot review comment.)
SET NOCOUNT ON;
DECLARE @collation_name NVARCHAR(128);
DECLARE collation_cursor CURSOR FOR SELECT name FROM fn_helpcollations();
OPEN collation_cursor;
FETCH NEXT FROM collation_cursor INTO @collation_name;
WHILE @@FETCH_STATUS = 0
BEGIN
IF @collation_name NOT LIKE N'%_UTF8%'
BEGIN
DECLARE @sql NVARCHAR(MAX) = N'
declare @t1 table (a varchar(1) collate ' + @collation_name + N')
declare @t2 table (a varchar(4) collate Latin1_General_100_CI_AI_SC_UTF8)
insert into @t1 select top 256 cast(row_number() over(order by (select 1)) - 1 as binary(1)) a from sys.objects
insert into @t2 select cast(a as nvarchar(10)) from @t1
if (select max(datalength(a)) from @t2) > 3
throw 50000, ''datalength too big'', 1
';
EXEC sp_executesql @sql;
END
FETCH NEXT FROM collation_cursor INTO @collation_name;
END
CLOSE collation_cursor;
DEALLOCATE collation_cursor;

Therefore x needs to be multiplied by at least 3. The +1 for the null terminator is fine, since the null terminator takes 1 byte no matter the encoding/collation.
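The same bound can be sanity-checked client-side. A small Python sketch (assuming SQL Server's Latin1 collations correspond to Windows code page 1252) that scans every byte of a single-byte code page and reports the worst-case UTF-8 expansion:

```python
def max_utf8_expansion(codec: str = "cp1252") -> int:
    """Worst-case UTF-8 byte length of any single byte in `codec`."""
    worst = 0
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte undefined in this code page
        worst = max(worst, len(ch.encode("utf-8")))
    return worst

# "€" (0x80 in cp1252, U+20AC) is the kind of late addition that needs
# 3 UTF-8 bytes; nothing in cp1252 needs more.
print(max_utf8_expansion("cp1252"))  # 3
```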
Work Item / Issue Reference
Summary
This PR enlarges the fetch buffer size for char/varchar columns by a factor of 4 to account for characters which take up more space in UTF-8 than in whichever encoding the database is using.
It also adds a test, which passes at each commit and therefore tracks how the interface changes. I removed some of the fallback mechanisms, and I hope the evolution of the error messages over the different iterations of the test shows why that's preferable.
For the SQLGetData path for wchars, a `columnSize == 0` check is added to avoid needing the fallback branch for `nvarchar(max)`. Previously, SQLGetData was called once with a buffer of length 0, and only then did the real attempt to fetch data start, after entering the fallback branch. `columnSize == 0` was actually also the only case where the fallback branch behaved correctly; anything else discarded the first `columnSize` characters.

A test that covers all of latin1 and documents behavior with unmapped characters is added as well.
Note that if the database is using a UTF-8 collation, then a buffer of size `columnSize` would be enough, as it tells us the number of bytes and not the number of characters (this distinction only matters for UTF-8).
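Putting the sizing rules together, a hypothetical sketch (Python for brevity, with illustrative names rather than the actual ddbc_bindings internals): `columnSize == 0` signals (n)varchar(max) and routes to LOB streaming; anything else gets the 4x multiplier plus a null terminator.

```python
from typing import Optional

def fetch_buffer_bytes(column_size: int) -> Optional[int]:
    """Illustrative CHAR/VARCHAR fetch-buffer sizing.

    Returns None to signal LOB streaming, since (n)varchar(max)
    reports columnSize == 0.
    """
    if column_size == 0:
        return None  # stream via the LOB path instead of a sized buffer
    # A single-byte collation can expand to up to 3 bytes per character
    # after the driver's UTF-8 conversion; 4x leaves headroom, and +1
    # covers the null terminator (always 1 byte regardless of encoding).
    return 4 * column_size + 1

print(fetch_buffer_bytes(10))  # 41
print(fetch_buffer_bytes(0))   # None
```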