
fix: convert to datetime64[us] when TIMESTAMP or DATETIME is empty #858

Open · wants to merge 12 commits into main

Conversation


@kitagry commented Jan 19, 2025

Fixes #852

According to the documentation, pandas-gbq returns datetime64[ns] or object, but under pandas 2 it actually returns datetime64[us].

https://googleapis.dev/python/pandas-gbq/latest/reading.html#inferring-the-dataframe-s-dtypes
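The mismatch is easy to reproduce in pandas 2 itself (a minimal sketch, independent of pandas-gbq; the values are illustrative):

```python
import pandas as pd

# pandas 2 supports multiple datetime resolutions; to_datetime still
# defaults to nanoseconds, while .dt.as_unit switches the resolution.
s = pd.Series(pd.to_datetime(["2025-01-19 07:19:00"]))
print(s.dtype)                   # datetime64[ns]
print(s.dt.as_unit("us").dtype)  # datetime64[us]
```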


@kitagry requested review from a team as code owners January 19, 2025 07:19
@kitagry requested a review from tswast January 19, 2025 07:19
```python
    and pandas.api.types.is_datetime64_dtype(df[name])
    and not pandas.api.types.is_datetime64_ns_dtype(df[name])
):
    df[name] = df[name].dt.as_unit("ns")
```
Collaborator:

I'd actually prefer we use us as the units where possible. Technically, ns can cause data loss, as BigQuery stores its DATETIME/TIMESTAMP values with microsecond precision.

Could we update the documentation, instead?

Author:

Thank you for your comment. I agree that returning values in μs is preferable, but in that case I'd like to make sure the DataFrame comes back in μs even when it is empty. The background of this pull request is that we use pandera, which distinguishes ns from μs, so we want the dtype to be consistent.

```python
import pandas as pd
import pandera as pa
from pandera.typing import Series
from typing import Annotated

class TableSchema(pa.DataFrameModel):
    timestamp: Series[Annotated[pd.DatetimeTZDtype, "ns", "UTC"]] = pa.Field()
```

Author:

I propose that if pandas >= 2.0.0, pandas-gbq returns datetime[us]; otherwise, it returns datetime[ns] as before. What do you think about this?
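A minimal sketch of that check (the helper name here is hypothetical, not pandas-gbq's actual code):

```python
import pandas

# Hypothetical helper: pick the datetime unit from the installed
# pandas major version, per the proposal above.
def preferred_datetime_unit() -> str:
    major = int(pandas.__version__.split(".")[0])
    return "us" if major >= 2 else "ns"

print(preferred_datetime_unit())
```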

Collaborator:

> I propose that if pandas >= 2.0.0, pandas-gbq returns datetime[us]; otherwise, it returns datetime[ns] as before. What do you think about this?

I like that!

@kitagry kitagry force-pushed the fix-datetime-dtype branch from b11b39c to 99b0113 Compare January 29, 2025 13:02
@kitagry changed the title from "fix: convert datetime64[us] to datetime64[ns] for TIMESTAMP and DATETIME" to "fix: convert to datetime64[us] when TIMESTAMP or DATETIME is empty" Jan 29, 2025
@tswast tswast enabled auto-merge (squash) February 5, 2025 22:30
@tswast tswast disabled auto-merge February 6, 2025 18:37
@tswast (Collaborator) commented Feb 6, 2025

There's one failing system test after I updated them to look for the correct dtype on pandas 2.1.0 and above.

```shell
pytest 'tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema[load_csv-issue365-extreme-datetimes]'
```

I took a look at the CSV file created and it looks like this:

```
1,0001-01-01,1-01-01 00:00:00.000000,1-01-01 00:00:00.000000
2,1970-01-01,1970-01-01 00:00:00.000000,1970-01-01 00:00:00.000000
3,9999-12-31,9999-12-31 23:59:59.999999,9999-12-31 23:59:59.999999
```

For some reason the TIMESTAMP column loads fine, but BigQuery reports too many errors when I try to load the last column as DATETIME (the CSV loads fine if I declare that column as STRING).

I've spent a bit too much time on this right now, so I'll try to take another look next week.

@tswast (Collaborator) left a comment:

I contributed a few commits to fix most of the failing tests, but there is one remaining failing system test on Python 3.12:

```
tests/system/test_to_gbq.py::test_dataframe_round_trip_with_table_schema[load_csv-issue365-extreme-datetimes]
E           pandas_gbq.exceptions.GenericGBQException: Reason: 400 Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0; reason: invalid, message: Error while reading data, error message: CSV processing encountered too many errors, giving up. Rows: 3; errors: 1; max bad: 0; error percent: 0; reason: invalid, message: Error while reading data, error message: Invalid datetime string "1-01-01 00:00:00.000000"; line_number: 1 byte_offset_to_start_of_line: 0 column_index: 2 column_name: "datetime_col" column_type: DATETIME value: "1-01-01 00:00:00...."
```

Strangely, this format works fine for TIMESTAMP columns in CSV. Per https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv the DATETIME format is YYYY-MM-DD HH:MM:SS[.SSSSSS], which this value appears to match. This needs some investigation to see why it fails now, and only on Python 3.12 (newer pandas versions).
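One possible culprit (an assumption, not confirmed in this thread) is that the year is not zero-padded to four digits when the CSV is serialized; padding it restores the documented format:

```python
# "1-01-01 00:00:00.000000" is rejected, while the documented CSV DATETIME
# format is YYYY-MM-DD HH:MM:SS[.SSSSSS] with a four-digit year.
bad = "1-01-01 00:00:00.000000"
date_part, time_part = bad.split(" ")
year, month, day = date_part.split("-")
fixed = f"{int(year):04d}-{month}-{day} {time_part}"
print(fixed)  # 0001-01-01 00:00:00.000000
```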

@tswast self-assigned this and unassigned suzmue Feb 13, 2025
Successfully merging this pull request may close these issues.

Timestamp returns a different type depending on whether the data is empty or not.
4 participants