Type conversion on load datetime64[ns] ->datetime64[ns, UTC] #216


Open
shraik opened this issue Mar 30, 2025 · 23 comments
Labels
bug Something isn't working

Comments

@shraik commented Mar 30, 2025

When reading a table with dates back into pandas, the UTC time zone is added to the dtype.
This is confusing.
Is this correct, or a bug?

Package versions:

crate 2.0.0
pandas 2.2.3
SQLAlchemy 2.0.39
sqlalchemy-cratedb 0.42.0.dev0

Test case:

import sqlalchemy as sa
import pandas as pd

data = {
    "date_1": ["2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01", "2027-12-30"],
    "date_2": ["2020-09-24", "2020-10-24", "2020-11-24", "2020-12-24", "2027-09-24"],
}
df_data = pd.DataFrame.from_dict(data, dtype="datetime64[ns]")
print(df_data.dtypes)
print(df_data.sort_values(by="date_1").reset_index(drop=True))

dburi = "crate://panduser:[email protected]:4200?ssl=false"
engine = sa.create_engine(dburi, echo=False)
conn = engine.connect()

df_data.to_sql(
    "test_date",
    conn,
    if_exists="replace",
    index=False,
)
conn.exec_driver_sql("REFRESH TABLE test_date;")
df_load = pd.read_sql_table("test_date", conn)


print("\ndataframe after loading")
df_load = df_load.sort_values(by="date_1").reset_index(drop=True)
print(df_load.dtypes)
print(df_load)

Schema:

(screenshot of the table schema)

Output:

date_1    datetime64[ns]
date_2    datetime64[ns]
dtype: object
      date_1     date_2
0 2020-01-01 2020-09-24
1 2021-01-01 2020-10-24
2 2022-01-01 2020-11-24
3 2023-01-01 2020-12-24
4 2027-12-30 2027-09-24

dataframe after loading
date_1    datetime64[ns, UTC]
date_2    datetime64[ns, UTC]
dtype: object
                     date_1                    date_2
0 2020-01-01 00:00:00+00:00 2020-09-24 00:00:00+00:00
1 2021-01-01 00:00:00+00:00 2020-10-24 00:00:00+00:00
2 2022-01-01 00:00:00+00:00 2020-11-24 00:00:00+00:00
3 2023-01-01 00:00:00+00:00 2020-12-24 00:00:00+00:00
4 2027-12-30 00:00:00+00:00 2027-09-24 00:00:00+00:00

After loading, I remove the time zone like this:

df2 = df_load.select_dtypes("datetimetz")
df_load[df2.columns] = df2.apply(lambda x: x.dt.tz_convert(None))
@amotl (Member) commented Mar 30, 2025

Hi @shraik. Thanks for another report about potential type mapping improvements. We will look into it.

When inserting dates, the UTC timezone is added to the dtype. This is confusing. Is this correct or a bug?

We will probably start investigating by comparing against PostgreSQL, in order to get a feeling for whether the behavior is intended with CrateDB, or whether something should be improved, most likely within the SQLAlchemy dialect implementation.

@shraik (Author) commented Mar 30, 2025

I checked using the Docker image postgres:13; the time zone is not added there.

For testing, I added the library:

pip install psycopg2
#Successfully installed psycopg2-2.9.10

and changed two lines:

dburi = "postgresql://panduser:[email protected]:5432/pandas_base"
# conn.exec_driver_sql("REFRESH TABLE test_date;")

Maybe this will help you in testing.

@amotl (Member) commented Mar 30, 2025

Hi again. We've investigated your observations, thank you again.

The outcome is that this is currently expected behavior, because CrateDB does not store DATE types natively. They are stored as BIGINT types, in the same spirit as TIMESTAMP types, and on the way back they naturally converge into timezone-aware DATETIME types, because that's probably the default mapping. Weird, but in this case expected.

However, pandas provides an easy workaround to adjust the type mapping for date and datetime columns, using the parse_dates option of read_sql_table.

pd.read_sql_table("test_date", conn, parse_dates={"date_1": "date", "date_2": "date"})

I think it is a good idea to add this to our documentation in one way or another, so let's keep the issue open as a reminder for that.
While it's a slight obstacle, are you able to work with that outcome?

@shraik (Author) commented Mar 31, 2025

Yes, I can work with this date conversion option, it's not a problem.
The loading option you suggested requires specifying the names of the date columns before loading. This is not very convenient if you need to read several tables with different structures. For my task, it turned out to be easier to write a universal wrapper that finds and removes time zones from the loaded dataframe, as I showed above.

PS:
Since the returned data contains the UTC time zone, perhaps the data type name in the schema should be renamed from "timestamp without time zone"?
:)

@amotl (Member) commented Mar 31, 2025

Hi @shraik.

Yes, I can work with this date conversion option, it's not a problem.

Excellent, thanks.

The loading option you suggested requires specifying the names of the columns with dates before loading. This is not very convenient if you need to read several tables with different structures. For my task, it turned out to be easier to make a universal wrapper for searching and removing time zones from the loaded dataframe, as I indicated above.

I see, and I also kind of expected that. Without doing schema introspection beforehand, it is certainly inconvenient. We could easily add such a "universal wrapper" to sqlalchemy_cratedb.support.pandas, if it might be handy for you and others. Those are also my original intentions with GH-128, so feel free to make a start by sharing your code, possibly by way of a pull request?

I think the snippet you've shared above would already make an excellent start.

df2 = df_load.select_dtypes("datetimetz")
df_load[df2.columns] = df2.apply(lambda x: x.dt.tz_convert(None))
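A minimal sketch of what such a universal wrapper could look like, built around that snippet. The function name drop_timezones is hypothetical and not an existing sqlalchemy_cratedb API:

```python
import pandas as pd

def drop_timezones(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where every timezone-aware datetime column
    is converted to naive UTC datetimes (datetime64[ns])."""
    df = df.copy()
    # select_dtypes("datetimetz") finds all tz-aware datetime columns,
    # regardless of their names, so no schema knowledge is required.
    for col in df.select_dtypes("datetimetz").columns:
        df[col] = df[col].dt.tz_convert(None)
    return df

# Usage: a frame with one tz-aware column becomes fully naive.
df = pd.DataFrame({"ts": pd.to_datetime(["2020-01-01"]).tz_localize("UTC")})
out = drop_timezones(df)
print(out["ts"].dtype)  # datetime64[ns]
```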

Since the returned data contains the UTC time zone, perhaps the data type name in the schema should be renamed from "timestamp without time zone"?

We will be happy to improve anything where you can spot flaws, in order to incrementally improve. Is it in this case a particular spot in the documentation you are referring to?

@amotl (Member) commented Mar 31, 2025

Since the returned data contains the UTC time zone, perhaps the data type name in the schema should be renamed from "timestamp without time zone"?

We will be happy to improve anything where you can spot flaws, in order to incrementally improve. Is it in this case a particular spot in the documentation you are referring to?

Ah! Currently, when storing dt.date objects through pandas, they manifest as TIMESTAMP WITHOUT TIME ZONE, but when reading them back, you get an aware datetime object instead? Is that the anomaly you are looking at here, and asking to eventually improve by instead using TIMESTAMP WITH TIME ZONE for storage?

@shraik (Author) commented Mar 31, 2025

I meant the display of the type in the web interface. There it is indicated without a time zone, but when loading I received dates with a time zone. This is where the initial misunderstanding came from.

(screenshot of the schema shown in the web interface)

@amotl (Member) commented Mar 31, 2025

Thank you for clarifying. We will see if we can improve those little details, given that DATE is in a twilight zone anyway.

The DATE type was not designed to allow time-of-day information (i.e., it is supposed to have a resolution of one day).
However, CrateDB allows you to violate that constraint by casting any number of milliseconds within limits to a DATE type. The result is then returned as a TIMESTAMP.

I don't see a reason not to converge DATE types into the physical TIMESTAMP WITH TIME ZONE type, as you are suggesting, when possible without introducing other quirks.

What do you think, @surister, @matriv, or @kneth?

@amotl (Member) commented Mar 31, 2025

[...] to eventually improve, by instead using TIMESTAMP WITH TIME ZONE for storing what has been handed in using a DATE type [...]

I wonder if it's the custom JSON encoder in the lower-level Python driver that would need to be improved here, specifically where dt.date objects are handled?

@matriv (Contributor) commented Mar 31, 2025

In my opinion, those dates, inserted as strings in the first place, should indeed be stored as a timestamp WITHOUT time zone type. When retrieving them as objects, I guess it's mandatory to have a timezone, so UTC is the correct one to set there. I don't think we should change anything about this behavior, unless there is some other object type which doesn't have a timezone.

This topic is always weird. With PostgreSQL, if you use timestamp WITH time zone and store a value like 2025-03-21 11:22:33.123 Europe/Berlin, the timezone info is only used to normalize it internally to 2025-03-21 10:22:33.123 UTC (stored as UTC milliseconds). If you then retrieve such a column with a JDBC client on a system set to Europe/Athens, you will receive 2025-03-21 12:22:33.123 Europe/Athens. So the original timezone is used only to convert to UTC for storage, and is then lost forever; you can never know what it was.
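This normalize-to-UTC-and-forget behavior can be sketched with plain stdlib datetimes (Europe/Berlin is UTC+1 on that date, before the DST switch):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# A timestamp written with an explicit zone ...
berlin = datetime(2025, 3, 21, 11, 22, 33, 123000, tzinfo=ZoneInfo("Europe/Berlin"))

# ... is normalized to UTC for storage; the original zone is discarded.
stored = berlin.astimezone(ZoneInfo("UTC"))
print(stored)  # 2025-03-21 10:22:33.123000+00:00

# A client in another zone sees the same instant, shifted to its own zone.
print(stored.astimezone(ZoneInfo("Europe/Athens")))  # 2025-03-21 12:22:33.123000+02:00
```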

@amotl (Member) commented Mar 31, 2025

That's all true, thank you. Still,

The DATE type was not designed to allow time-of-day information (i.e., it is supposed to have a resolution of one day).

So, looking at this specific detail, even setting aside the other obstacles about losing the timezone information when storing dates or times (which is always the case), I think it does not matter much which data type is selected, i.e. it won't hurt to choose the timezone-aware one in order to get rid of this miniature I/O anomaly?

@amotl (Member) commented Mar 31, 2025

Compare

start investigating by comparing against PostgreSQL in order to get a feeling whether the behavior is intended with CrateDB

I've used your example program, now also at pandas_cratedb_date_type.py, to check and compare PostgreSQL and CrateDB.

Observations

Both store ingress DATE types in this context using the same data type, which is TIMESTAMP WITHOUT TIME ZONE.

PostgreSQL

postgres=# \d+ test_date
                                                  Table "public.test_date"
 Column |            Type             | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+-----------------------------+-----------+----------+---------+---------+-------------+--------------+-------------
 date_1 | timestamp without time zone |           |          |         | plain   |             |              |
 date_2 | timestamp without time zone |           |          |         | plain   |             |              |

CrateDB

cr> show create table test_date;
+------------------------------------------------+
| SHOW CREATE TABLE doc.test_date                |
+------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "doc"."test_date" ( |
|    "date_1" TIMESTAMP WITHOUT TIME ZONE,       |
|    "date_2" TIMESTAMP WITHOUT TIME ZONE        |
| )                                              |
| CLUSTERED INTO 4 SHARDS                        |
| WITH (                                         |
|    column_policy = 'strict',                   |
|    number_of_replicas = '0-1'                  |
| )                                              |
+------------------------------------------------+

Recap

@matriv: There is an I/O anomaly when using pandas with CrateDB; the outcome differs from PostgreSQL. What we are looking at here is whether we can improve the situation.

PostgreSQL

dataframe before
date_1    datetime64[ns]

dataframe after
date_1    datetime64[ns]

CrateDB

dataframe before
date_1    datetime64[ns]

dataframe after
date_1    datetime64[ns, UTC]

@matriv (Contributor) commented Apr 1, 2025

@amotl Could you please clarify what you propose?

For me, DATE should be timestamp WITHOUT time zone, because its hour/min/sec/millis are all 0, and it actually only denotes a day in whatever context the client wants to use it.
If tz info is added then, imho, it becomes confusing: it means that when storing, for example, 2025-03-31 in a client tz of UTC+1, it will be converted to 2025-03-30 23:00:00.000, or the opposite shift happens if you retrieve it with a client on a specific timezone. So even adding UTC is, imho, not the correct way to go.

Unless I'm confused and you propose something different.
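The date-rollback effect described here can be seen with plain stdlib datetimes (note that Etc/GMT-1 means UTC+01:00, due to the POSIX sign convention):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Midnight "2025-03-31" in a UTC+1 client zone ...
d = datetime(2025, 3, 31, 0, 0, tzinfo=ZoneInfo("Etc/GMT-1"))

# ... lands on the previous day once normalized to UTC:
print(d.astimezone(ZoneInfo("UTC")))  # 2025-03-30 23:00:00+00:00
```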

@amotl (Member) commented Apr 1, 2025

Hi. I think it's clear that both database servers behave in the same way, using the data type TIMESTAMP WITHOUT TIME ZONE to store this field, so it's all good on that end. However, as outlined above, when using CrateDB with pandas, the returned data type is timezone-aware (datetime64[ns, UTC]), while it shouldn't be (datetime64[ns]).

My intention is to find the flaw and mitigate it where possible, because it's confusing to users.

@matriv (Contributor) commented Apr 1, 2025

So the fix should be applied there: when we return DATE and timestamp WITHOUT time zone values, we shouldn't add the UTC timezone. I'm bringing this up because in a previous comment you said:

I don't see a reason not to converge DATE types into the physical TIMESTAMP WITH TIME ZONE type, as you are suggesting, when possible without introducing other quirks.

which imho is not the way to go.

@amotl (Member) commented Apr 1, 2025

Yes, you are right; hereby I am officially retracting my previous statement. Sorry if that stirred confusion. CrateDB does the same as PostgreSQL, so it's all right in this regard. The fix needs to be applied somewhere in the Python client layers, where possible. Thanks! 🍀

@shraik (Author) commented Apr 2, 2025

Another example of strange behavior:
If you save dates with time zones, there is no time zone after loading.
Why isn't the UTC time zone added in this case?

Package versions:

crate 2.0.0
geojson 3.2.0
greenlet 3.1.1
numpy 2.2.4
orjson 3.10.15
pandas 2.2.3
pip 25.0.1
psycopg2 2.9.10
pyarrow 19.0.1
python-dateutil 2.9.0.post0
pytz 2025.1
six 1.17.0
SQLAlchemy 2.0.39
sqlalchemy-cratedb 0.42.0.dev0
typing_extensions 4.12.2
tzdata 2025.1
urllib3 2.3.0
verlib2 0.2.0

Test example:

import sqlalchemy as sa
import pandas as pd

data = {
    "date_1": ["2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01", "2027-12-30"],
    "date_2": ["2020-09-24", "2020-10-24", "2020-11-24", "2020-12-24", "2027-09-24"],
}
df_data = pd.DataFrame.from_dict(data, dtype="datetime64[ns]")

df_data[["date_1", "date_2"]] = df_data[["date_1", "date_2"]].apply(
    lambda x: x.dt.tz_localize("Asia/Krasnoyarsk")
)

print(df_data.dtypes)
print(df_data.sort_values(by="date_1").reset_index(drop=True))

dburi = "crate://panduser:[email protected]:4200?ssl=false"

engine = sa.create_engine(dburi, echo=False)
conn = engine.connect()

df_data.to_sql(
    "test_date",
    conn,
    if_exists="replace",
    index=False,
)
conn.exec_driver_sql("REFRESH TABLE test_date;")
df_load = pd.read_sql_table("test_date", conn)

print("\ndataframe after loading")
df_load = df_load.sort_values(by="date_1").reset_index(drop=True)
print(df_load.dtypes)
print(df_load)

Output:

date_1    datetime64[ns, Asia/Krasnoyarsk]
date_2    datetime64[ns, Asia/Krasnoyarsk]
dtype: object
                     date_1                    date_2
0 2020-01-01 00:00:00+07:00 2020-09-24 00:00:00+07:00
1 2021-01-01 00:00:00+07:00 2020-10-24 00:00:00+07:00
2 2022-01-01 00:00:00+07:00 2020-11-24 00:00:00+07:00
3 2023-01-01 00:00:00+07:00 2020-12-24 00:00:00+07:00
4 2027-12-30 00:00:00+07:00 2027-09-24 00:00:00+07:00

dataframe after loading
date_1    datetime64[ns]
date_2    datetime64[ns]
dtype: object
               date_1              date_2
0 2019-12-31 17:00:00 2020-09-23 17:00:00
1 2020-12-31 17:00:00 2020-10-23 17:00:00
2 2021-12-31 17:00:00 2020-11-23 17:00:00
3 2022-12-31 17:00:00 2020-12-23 17:00:00
4 2027-12-29 17:00:00 2027-09-23 17:00:00

(screenshot of the table schema)

@amotl (Member) commented Apr 3, 2025

Thank you very much for your report again. We will also look into this. It feels like the type mapper needs more improvements.

@amotl (Member) commented Apr 11, 2025

Hi again,

we just identified the spot in pandas where CrateDB/SQLAlchemy follows a different code path than PostgreSQL/SQLAlchemy.

The CrateDB dialect apparently returns sqltype.timezone=True for those columns; that's why we can observe the outcome you are presenting here.

# we have a timezone capable type
if not sqltype.timezone:
    return datetime
return DatetimeTZDtype
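The effect of that branch can be reproduced with plain pandas: a naive datetime series maps to datetime64[ns], while a tz-aware one carries a DatetimeTZDtype. A small illustration (not the pandas internals themselves):

```python
import pandas as pd
from pandas import DatetimeTZDtype

# The same instants, once naive and once localized to UTC.
naive = pd.Series(pd.to_datetime(["2020-01-01", "2021-01-01"]))
aware = naive.dt.tz_localize("UTC")

print(naive.dtype)  # datetime64[ns]
print(aware.dtype)  # datetime64[ns, UTC]

# When sqltype.timezone is (wrongly) True, pandas picks the tz-aware dtype:
assert aware.dtype == DatetimeTZDtype(tz="UTC")
```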

We will see what we can do about it.

With kind regards,
Andreas.

@amotl (Member) commented Apr 11, 2025

Here is a little pure-SQLAlchemy reproducer which demonstrates the problem around column.type.timezone.

import sqlalchemy as sa

def reflect():
    dburi = "crate://"
    # dburi = "postgresql://postgres@localhost:5433/"
    engine = sa.create_engine(dburi)
    with engine.connect() as conn:
        conn.execute(sa.text("CREATE TABLE IF NOT EXISTS t2 (date TIMESTAMP WITHOUT TIME ZONE)"))
        conn.commit()
    metadata = sa.MetaData()
    inspector = sa.inspect(engine)
    table = sa.Table("t2", metadata)
    inspector.reflect_table(table, None)
    for column in table.columns:
        print("column:", column, column.type, column.type.timezone)

reflect()

@amotl (Member) commented Apr 11, 2025

Well, that's an obvious and silly mixup coming from GH-24, which has now been fixed. The offending mapping read:

"timestamp": sqltypes.TIMESTAMP,
"timestamp with time zone": sqltypes.TIMESTAMP(timezone=False),
"timestamp without time zone": sqltypes.TIMESTAMP(timezone=True),
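For contrast, this is how the mapping would presumably look once the "with/without" variants are swapped back so the timezone flag matches the SQL type name (a sketch, not the actual committed code; TYPE_MAP is a placeholder name):

```python
from sqlalchemy import types as sqltypes

# Corrected mapping sketch: the timezone flag now agrees with the type name.
TYPE_MAP = {
    "timestamp": sqltypes.TIMESTAMP,
    "timestamp with time zone": sqltypes.TIMESTAMP(timezone=True),
    "timestamp without time zone": sqltypes.TIMESTAMP(timezone=False),
}

print(TYPE_MAP["timestamp without time zone"].timezone)  # False
```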

@amotl (Member) commented Apr 11, 2025

We just applied a fix per 04f475d and released sqlalchemy-cratedb==0.42.0.dev2. Can you validate that this resolves the problem you observed?

amotl added the bug label on Apr 11, 2025
@shraik (Author) commented Apr 15, 2025 via email
