fix: convert to datetime64[us] when TIMESTAMP or DATETIME is empty #858
pandas_gbq/gbq.py
Outdated
        and pandas.api.types.is_datetime64_dtype(df[name])
        and not pandas.api.types.is_datetime64_ns_dtype(df[name])
    ):
        df[name] = df[name].dt.as_unit("ns")
I'd actually prefer we use us as the units where possible. Technically, ns can cause data loss, as BigQuery stores its DATETIME/TIMESTAMP values with microsecond precision.
Could we update the documentation, instead?
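One way to read the data-loss concern above: a signed 64-bit count of nanoseconds since 1970 only spans roughly the years 1677 to 2262, while BigQuery DATETIME/TIMESTAMP values run from year 0001 to 9999. The sketch below works that range out with plain standard-library arithmetic so it does not depend on a particular pandas version. The helper names (`ns_bounds`, `fits_in_ns`) are hypothetical and are not part of pandas-gbq.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit integer can hold


def ns_bounds():
    """Return (earliest, latest) datetimes that fit in int64 nanoseconds
    since the epoch, truncated inward to whole microseconds."""
    whole_us = INT64_MAX // 1000  # nanoseconds -> whole microseconds
    earliest = EPOCH - timedelta(microseconds=whole_us)
    latest = EPOCH + timedelta(microseconds=whole_us)
    return earliest, latest


def fits_in_ns(value):
    """True if a naive datetime lies inside the nanosecond-resolution range."""
    earliest, latest = ns_bounds()
    return earliest <= value <= latest


if __name__ == "__main__":
    print(ns_bounds())
    # Extreme values that BigQuery accepts but nanosecond resolution cannot hold.
    print(fits_in_ns(datetime(1, 1, 1)), fits_in_ns(datetime(9999, 12, 31)))
```

Microsecond resolution in the same 64-bit integer covers the whole 0001 to 9999 span, which is why us is the closer match for these column types.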
Thank you for your comment. I agree that returning values in μs is preferable, but in that case, I would like to make changes to ensure that even when the DataFrame is empty, it returns values in μs. The background of this Pull Request is that we are using pandera, and pandera distinguishes between ns and μs, so we want to unify them.
class TableSchema:
    timestamp: Series[Annotated[pd.DatetimeTZDtype, 'ns', 'UTC']] = pa.Field()
I propose that if pandas >= 2.0.0, pandas-gbq returns datetime[us]; otherwise, it returns datetime[ns] as before. What do you think about this?
I propose that if pandas >= 2.0.0, pandas-gbq returns datetime[us]; otherwise, it returns datetime[ns] as before. What do you think about this?
I like that!
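The version-gated proposal could be sketched roughly as follows. This is only an illustration of the idea, not the code merged in this pull request: the function names are hypothetical, and the version is passed in as a string (for example the value of `pandas.__version__`) so the sketch runs without pandas installed.

```python
def choose_datetime_unit(pandas_version: str) -> str:
    """Pick the datetime resolution for a given pandas version string.

    Assumption: pandas 2.0.0 and later can hold non-nanosecond
    datetimes, so microseconds are used there; older versions
    keep the historical nanosecond behaviour.
    """
    major = int(pandas_version.split(".")[0])
    return "us" if major >= 2 else "ns"


def datetime_dtype_name(pandas_version: str) -> str:
    """Build the dtype string, e.g. 'datetime64[us]'."""
    return f"datetime64[{choose_datetime_unit(pandas_version)}]"


if __name__ == "__main__":
    for version in ("1.5.3", "2.0.0", "2.1.0"):
        print(version, datetime_dtype_name(version))
```

Only the major version component is inspected here, which keeps pre-release strings such as "2.0.0rc1" working without a version-parsing dependency.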
There's one failing system test after I updated the tests to look for the correct dtype on pandas 2.1.0 and above.
I took a look at the CSV file created and it looks like this:
For some reason, the TIMESTAMP column loads fine, but BigQuery says there are too many errors when I try to load the DATETIME column as DATETIME (the CSV loads fine if I declare it as STRING). I've spent a bit too much time on this right now, so I'll try to take another look next week.
I contributed a few commits to fix most of the failing tests, but there is one remaining failing system test on Python 3.12:
tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema[load_csv-issue365-extreme-datetimes]
E pandas_gbq.exceptions.GenericGBQException: Reason: 400 Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0; reason: invalid, message: Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0; reason: invalid, message: Error while reading data, error message: Invalid datetime string "1-01-01 00:00:00.000000"; line_number: 1 byte_offset_to_start_of_line: 0 column_index: 2 column_name: "datetime_col" column_type: DATETIME value: "1-01-01 00:00:00...."
Strangely, this format works fine for TIMESTAMP columns in CSV. Per https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv the datetime format is YYYY-MM-DD HH:MM:SS[.SSSSSS], which this seems to be. Needs some investigation to see why this isn't working now and only with Python 3.12 (newer pandas versions).
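One detail visible in the error message is that the rejected value starts with a one-digit year ("1-01-01"), whereas the documented format spells the year with four digits. Whether that is the actual cause is not confirmed in this thread. The sketch below only shows how a naive datetime can be rendered with a zero-padded year using the standard library, and how a value can be checked against the documented pattern. The names `BQ_DATETIME_RE` and `format_datetime_for_csv` are hypothetical and do not come from pandas-gbq.

```python
import re
from datetime import datetime

# Pattern for the documented layout YYYY-MM-DD HH:MM:SS[.SSSSSS].
BQ_DATETIME_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d{6})?$"
)


def format_datetime_for_csv(value: datetime) -> str:
    """Render a naive datetime with a four-digit, zero-padded year.

    isoformat() always pads the year to four digits, unlike
    strftime("%Y"), whose padding is platform dependent.
    """
    return value.isoformat(sep=" ", timespec="microseconds")


if __name__ == "__main__":
    extreme = datetime(1, 1, 1)
    text = format_datetime_for_csv(extreme)
    print(text, bool(BQ_DATETIME_RE.match(text)))
    # The string from the failing test does not match the documented pattern.
    print(bool(BQ_DATETIME_RE.match("1-01-01 00:00:00.000000")))
```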
Fixes #852
The documentation says pandas-gbq returns datetime64[ns] or object, but with pandas 2 it returns datetime64[us].
https://googleapis.dev/python/pandas-gbq/latest/reading.html#inferring-the-dataframe-s-dtypes