-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #60105
base: main
Are you sure you want to change the base?
Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #60105
Conversation
The most recent change on main seems to also have 17 failing tasks so I think these failures are unrelated to the change in my PR but please let me know if that's wrong! |
|
||
# Try and greedily convert to string | ||
# Will fail if the object is bytes | ||
arr = arr_cls._from_sequence(arr, dtype=new_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this ideally return pandas.ArrowDtype(pyarrow.binary())
type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense thanks! so looks like the previous logic was not taking into account pyarrow types when doing this conversation so I've added logic similar to my initial change where we try to convert to a pyarrow string but then fall back to binary if we run into an invalid error (i.e. we tried to parse but it failed due to an encoding error). Please let me know what you think! Was also considering trying to type check the contents of arr
to see if it has string or bytes data but seems like greedily trying to convert ends up being better performance in most cases (since we might have to search the whole arr to see if one of the elements is a bytes sequence that can't be converted to a string)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely appreciate what we are trying to do here although I'm wary about trying to infer "non-first class" types at runtime.
Have you tried using an ADBC driver instead? That should be Arrow-native and yield the proper dtypes
@@ -968,9 +972,31 @@ def convert(arr): | |||
# i.e. maybe_convert_objects didn't convert | |||
arr = maybe_infer_to_datetimelike(arr) | |||
if dtype_backend != "numpy" and arr.dtype == np.dtype("O"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the wrong place to be doing this; in the sql.py module can't we read in the type of the database and only try to convert BINARY types to Arrow binary types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on some local testing using the ADBC driver I can confirm that it yields a pandas.core.arrays.arrow.array.ArrowExtensionArray
with ._dtype
of pandas.ArrowDtype
. When the query returns a bytes type column we get a .type
of bytes
, and likewise a .type
of string
is returned for a string type column. Seems like we don't need to do any conversions when using the ADBC driver as you've stated if I'm understanding correctly here!
Wondering if it makes sense to remove the code here trying to convert based on a dtype_backend != 'numpy'
since this will fix the cause of the exception in the issue? and maybe raise an exception when trying to use a pyarrow
dtype_backend
with the SQLiteDatabase
connection type here: https://github.com/pandas-dev/pandas/blob/main/pandas/io/sql.py#L695 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the general problem is that pandas does not have a first class "binary" data type, so I'm not sure how to solve this for anything but the pyarrow backend.
With the pyarrow backend, I think you can still move this logic to sql.py
and check the type of the column coming back from the database. If it is a binary type in the database, using the PyArrow binary type with that backend makes sense.
Not sure if @mroeschke has other thoughts to the general issue. This is likely another good use case to track in PDEP-13 #58455
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is the incorrect place to handle this conversion logic and this should only be a valid conversion for the pyarrow backend (ArrowExtensionArray._from_sequence
should be able to return a binary type with binary data.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in sql.py
it looks like the result of this conversion is being overwritten when the dtype_backend
is pyarrow
: https://github.com/pandas-dev/pandas/blob/main/pandas/io/sql.py#L181 and the dtype returned by the current logic is ArrowDtype(pa.binary())
for the example in the issue, so maybe just removing the conversion logic is all that's needed to resolve this issue? I've removed the block doing the conversion and added a test case showing that the resulting df has a dtype of ArrowDtype(pa.binary())
when the dtype_backend='pyarrow'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice - simplification is always good and I think this implementation looks reasonable :-)
Just need a few tweaks to the test case
pandas/tests/io/test_sql.py
Outdated
select cast(x'0123456789abcdef0123456789abcdef' as blob) a | ||
""" | ||
df = pd.read_sql(query, sqlite_buildin, dtype_backend=dtype_backend) | ||
assert df.a.values[0] == b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use our built-in test helpers instead? I think you can just do:
result = pd.read_sql(...)
expected = pd.DataFrame({"a": ...}, dtype=pd.ArrowDtype(pa.binary()))
tm.assert_frame_equal(result, expected)
What data type does this produce currently with the numpy_nullable
backend - object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for sure, changed the testing logic over to using this! for numpy_nullable
and lib.no_default
the dtype returned is an object
pandas/tests/io/test_sql.py
Outdated
|
||
|
||
@pytest.mark.parametrize("dtype_backend", ["pyarrow", "numpy_nullable", lib.no_default]) | ||
def test_bytes_column(sqlite_buildin, dtype_backend): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_bytes_column(sqlite_buildin, dtype_backend): | |
def test_bytes_column(all_connectable, dtype_backend): |
Should test this against all databases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure thing! I was trying to pass in the cartesian product of the all_connectable
and dtype_backend
arrays using itertools.product
to @pytest.mark.parametrize
but was running into issues with the connections getting passed. I instead made it so the connectables are being passed in the parametrize
and then we loop through the dtypes in the test. Would love to know if there's a better way to do this so we're testing each dtype_backend
/connection
combination independently
pandas/tests/io/test_sql.py
Outdated
@pytest.mark.parametrize("dtype_backend", ["pyarrow", "numpy_nullable", lib.no_default]) | ||
def test_bytes_column(sqlite_buildin, dtype_backend): | ||
pa = pytest.importorskip("pyarrow") | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is well intentioned but can you remove the docstring? We don't use them in tests.
Instead, you can just add a comment pointing to the github issue number in the function body
trying to figure out why this test is failing https://github.com/pandas-dev/pandas/actions/runs/12484030959/job/34840808563?pr=60105 despite passing locally and in the other unit test checks - seems to be due to a library version mismatch between the different unit test CI checks and the |
@kastkeepitjumpinlikekangaroos when running that locally you would need to set an environment variable: PANDAS_FUTURE_INFER_STRING=1 python -m pytest pandas/tests/io/test_sql.py Can you reproduce the error if you do that? |
…pandas into fix-59242
@WillAyd ah yes thank you!! was able to reproduce and fix the failing test locally. Waiting on CI now but should be passing there too. Please let me know if there's another way to be checking for the |
pandas/tests/io/test_sql.py
Outdated
conn = request.getfixturevalue(con) | ||
pa = pytest.importorskip("pyarrow") | ||
|
||
dtype = "O" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why is this defaulting to object out here? That seems like something to avoid for the pyarrow backend.
I think you are doing that in this test anyway, but the structure is confusing. I would just set up the expected data type from the backend up front, so something like:
if dtype_backend == "pyarrow":
exp_dtype = ...
else:
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, let me know if the modified structure is less confusing!
pandas/tests/io/test_sql.py
Outdated
dtype = ( | ||
pd.ArrowDtype(pa.string()) | ||
if "adbc" not in con | ||
else pd.ArrowDtype(pa.opaque(pa.binary(), "bit", "PostgreSQL")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this an opaque type instead of just binary? That seems like a bug somewhere in this implementation or the driver
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is coming from adbc-driver-postgresql
, I guess they've chosen opaque
because this is a bit of a weird type in postgres. https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-BIT-STRINGS vs https://www.postgresql.org/docs/current/datatype-binary.html
postgres=# select x'0011';
?column?
------------------
0000000000010001
(1 row)
postgres=# select pg_typeof(x'0011');
pg_typeof
-----------
bit
(1 row)
postgres=# select cast('0011' as bytea);
bytea
------------
\x30303131
(1 row)
postgres=# select pg_typeof(cast('0011' as bytea));
pg_typeof
-----------
bytea
(1 row)
@@ -2261,7 +2261,9 @@ def type(self): | |||
elif pa.types.is_null(pa_type): | |||
# TODO: None? pd.NA? pa.null? | |||
return type(pa_type) | |||
elif isinstance(pa_type, pa.ExtensionType): | |||
elif isinstance(pa_type, pa.ExtensionType) or isinstance( | |||
pa_type, pa.OpaqueType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm AFAIK the OpaqueType is returned when there is no suitable database type, but isn't binary already returned directly? It is a bit unclear what this additional isinstance
check is for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to add this because adbc-driver-postgresql
is returning an OpaqueType
- without this included an exception is thrown
val = b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef" | ||
if "postgres" in con: | ||
if "adbc" in con: | ||
val = b"\x00\x00\x00\x80\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does postgres have different content than the other databases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the x'...'
syntax in sqlite and mysql represents the same data (called a blob literal in sqlite ]) but in postgres this is slighly different in that it's not a bytea
type , but instead a bit
. Most of the drivers return this as a string representation of the bits (which is why val = "0000000100100011010001..."
in one of the cases)
pandas/tests/io/test_sql.py
Outdated
conn = request.getfixturevalue(con) | ||
pa = pytest.importorskip("pyarrow") | ||
|
||
val = b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does #Eg
represent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the raw bytes returned from the hex string "0123456789abcdef0123456789abcdef"
- I've changed the test to use bytes.fromhex("0123456789abcdef0123456789abcdef")
to make things more clear here
if "adbc" in con: | ||
val = b"\x00\x00\x00\x80\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef" | ||
else: | ||
val = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be the binary representation of the bytes? I'm a bit confused why this test has so many different expected values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah exactly and this is due to postgres returning the bit
type which most of the drivers turn into this string
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.This PR allows for bytes based data to be returned instead of throwing an exception like before