Type conversion on load datetime64[ns] ->datetime64[ns, UTC] #216


Open
shraik opened this issue Mar 30, 2025 · 23 comments
Labels
bug Something isn't working

Comments

@shraik commented Mar 30, 2025

When reading a table with dates back into pandas, the UTC time zone is added to the dtype.
This is confusing.
Is this correct, or a bug?

Package versions:

crate 2.0.0
pandas 2.2.3
SQLAlchemy 2.0.39
sqlalchemy-cratedb 0.42.0.dev0

Test case:

import sqlalchemy as sa
import pandas as pd

data = {
    "date_1": ["2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01", "2027-12-30"],
    "date_2": ["2020-09-24", "2020-10-24", "2020-11-24", "2020-12-24", "2027-09-24"],
}
df_data = pd.DataFrame.from_dict(data, dtype="datetime64[ns]")
print(df_data.dtypes)
print(df_data.sort_values(by="date_1").reset_index(drop=True))

dburi = "crate://panduser:[email protected]:4200?ssl=false"
engine = sa.create_engine(dburi, echo=False)
conn = engine.connect()

df_data.to_sql(
    "test_date",
    conn,
    if_exists="replace",
    index=False,
)
conn.exec_driver_sql("REFRESH TABLE test_date;")
df_load = pd.read_sql_table("test_date", conn)


print("\ndataframe after loading")
df_load = df_load.sort_values(by="date_1").reset_index(drop=True)
print(df_load.dtypes)
print(df_load)

Schema:

(screenshot of the table schema)

Output:

date_1    datetime64[ns]
date_2    datetime64[ns]
dtype: object
      date_1     date_2
0 2020-01-01 2020-09-24
1 2021-01-01 2020-10-24
2 2022-01-01 2020-11-24
3 2023-01-01 2020-12-24
4 2027-12-30 2027-09-24

dataframe after loading
date_1    datetime64[ns, UTC]
date_2    datetime64[ns, UTC]
dtype: object
                     date_1                    date_2
0 2020-01-01 00:00:00+00:00 2020-09-24 00:00:00+00:00
1 2021-01-01 00:00:00+00:00 2020-10-24 00:00:00+00:00
2 2022-01-01 00:00:00+00:00 2020-11-24 00:00:00+00:00
3 2023-01-01 00:00:00+00:00 2020-12-24 00:00:00+00:00
4 2027-12-30 00:00:00+00:00 2027-09-24 00:00:00+00:00

After loading, I remove the time zone like this:

df2 = df_load.select_dtypes("datetimetz")
df_load[df2.columns] = df2.apply(lambda x: x.dt.tz_convert(None))
@amotl (Member) commented Mar 30, 2025

Hi @shraik. Thanks for another report about potential type mapping improvements. We will look into it.

When inserting dates, the UTC timezone is added to the dtype. This is confusing. Is this correct or a bug?

We will probably start investigating by comparing against PostgreSQL, in order to get a feeling for whether the behavior is intended with CrateDB, or whether something should be improved, most likely within the SQLAlchemy dialect implementation.

@shraik (Author) commented Mar 30, 2025

I checked using the Docker image postgres:13; the time zone is not added there.

For testing, I added the library:

pip install psycopg2
#Successfully installed psycopg2-2.9.10

and changed two lines:

dburi = "postgresql://panduser:[email protected]:5432/pandas_base"
# conn.exec_driver_sql("REFRESH TABLE test_date;")

Maybe this will help you in testing.

@amotl (Member) commented Mar 30, 2025

Hi again. We've investigated your observations, thank you again.

The outcome is that this is currently expected behavior, because CrateDB does not store DATE types natively. They are stored as BIGINT types, in the same spirit as TIMESTAMP types, and on the way back they naturally converge into timezone-aware DATETIME types, because that's probably the default mapping. Weird, but in this case expected.

However, pandas provides an easy workaround to adjust the type mapping for date and datetime columns, using the parse_dates option of read_sql_table.

pd.read_sql_table("test_date", conn, parse_dates={"date_1": "date", "date_2": "date"})

I think it is a good idea to add this to our documentation in one way or another, so let's keep the issue open as a reminder for that.
While it's a slight obstacle, are you able to work with that outcome?

@shraik (Author) commented Mar 31, 2025

Yes, I can work with this date conversion option, it's not a problem.
The loading option you suggested requires specifying the names of the date columns before loading. This is not very convenient if you need to read several tables with different structures. For my task, it turned out to be easier to write a universal wrapper that finds and removes time zones from the loaded dataframe, as I showed above.

PS:
Since the returned data contains the UTC time zone, perhaps the data type name in the schema should be renamed from "timestamp without time zone"?
:)

@amotl (Member) commented Mar 31, 2025

Hi @shraik.

Yes, I can work with this date conversion option, it's not a problem.

Excellent, thanks.

The loading option you suggested requires specifying the names of the columns with dates before loading. This is not very convenient if you need to read several tables with different structures. For my task, it turned out to be easier to make a universal wrapper for searching and removing time zones from the loaded dataframe, as I indicated above.

I see, and I also kind of expected that. Without doing schema introspection beforehand, it is certainly inconvenient. We could easily add such a "universal wrapper" to sqlalchemy_cratedb.support.pandas, if it might be handy for you and others. Those are also my original intentions with GH-128, so feel free to make a start by sharing your code, possibly by way of a pull request?

I think the snippet you've shared above would already make an excellent start.

df2 = df_load.select_dtypes("datetimetz")
df_load[df2.columns] = df2.apply(lambda x: x.dt.tz_convert(None))
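A minimal sketch of what such a universal wrapper could look like, built around that snippet. The function name drop_timezones is hypothetical and not an existing sqlalchemy_cratedb API:

```python
import pandas as pd

def drop_timezones(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where every timezone-aware datetime column
    is converted to naive UTC datetimes (datetime64[ns])."""
    df = df.copy()
    # select_dtypes("datetimetz") finds all tz-aware datetime columns,
    # regardless of their names, so no schema knowledge is required.
    for col in df.select_dtypes("datetimetz").columns:
        df[col] = df[col].dt.tz_convert(None)
    return df

# Usage: a frame with one tz-aware column becomes fully naive.
df = pd.DataFrame({"ts": pd.to_datetime(["2020-01-01"]).tz_localize("UTC")})
out = drop_timezones(df)
print(out["ts"].dtype)  # datetime64[ns]
```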

Since the returned data contains the UTC time zone, perhaps the data type name in the schema should be renamed from "timestamp without time zone"?

We will be happy to improve anything where you can spot flaws, in order to incrementally improve. Is it in this case a particular spot in the documentation you are referring to?

@amotl (Member) commented Mar 31, 2025

Since the returned data contains the UTC time zone, perhaps the data type name in the schema should be renamed from "timestamp without time zone"?

We will be happy to improve anything where you can spot flaws, in order to incrementally improve. Is it in this case a particular spot in the documentation you are referring to?

Ah! Currently, when storing dt.date objects through pandas, they manifest as TIMESTAMP WITHOUT TIME ZONE, but when reading them back, you get an aware datetime object instead? Is that the anomaly you are looking at here, and asking to eventually improve by instead using TIMESTAMP WITH TIME ZONE for storage?

@shraik (Author) commented Mar 31, 2025

I meant the display of the type in the web interface. There it is indicated without a time zone, but when loading I received dates with a time zone. This is where the initial misunderstanding came from.

(screenshot of the schema shown in the web interface)

@amotl (Member) commented Mar 31, 2025

Thank you for clarifying. We will see if we can improve those little details, given that DATE is in a twilight zone anyway.

The DATE type was not designed to allow time-of-day information (i.e., it is supposed to have a resolution of one day).
However, CrateDB allows you to violate that constraint by casting any number of milliseconds within limits to a DATE type. The result is then returned as a TIMESTAMP.

I don't see a reason not to converge DATE types into the physical TIMESTAMP WITH TIME ZONE type, as you are suggesting, when possible without introducing other quirks.

What do you think, @surister, @matriv, or @kneth?

@amotl (Member) commented Mar 31, 2025

[...] to eventually improve, by instead using TIMESTAMP WITH TIME ZONE for storing what has been handed in using a DATE type [...]

I wonder if it's the custom JSON encoder in the lower-level Python driver that would need to be improved here, specifically where dt.date objects are handled?

@matriv (Contributor) commented Mar 31, 2025

In my opinion, those dates, inserted as strings in the first place, should indeed be stored as a timestamp WITHOUT time zone type. When retrieving them as objects, I guess it's mandatory to have a timezone, so UTC is the correct one to set there. I don't think we should change anything about this behavior, unless there is some other object type which doesn't have a timezone.

This topic is always weird. With PostgreSQL, if you use timestamp WITH time zone and store a value like 2025-03-21 11:22:33.123 Europe/Berlin, the timezone info is only used to normalize it internally to 2025-03-21 10:22:33.123 UTC (stored as UTC milliseconds). If you then retrieve such a column with a JDBC client on a system set to Europe/Athens, you will receive 2025-03-21 12:22:33.123 Europe/Athens. So the original timezone is used only to convert to UTC for storage, and is then lost forever; you can never know what it was.
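This normalize-to-UTC-and-forget behavior can be sketched with plain stdlib datetimes (Europe/Berlin is UTC+1 on that date, before the DST switch):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# A timestamp written with an explicit zone ...
berlin = datetime(2025, 3, 21, 11, 22, 33, 123000, tzinfo=ZoneInfo("Europe/Berlin"))

# ... is normalized to UTC for storage; the original zone is discarded.
stored = berlin.astimezone(ZoneInfo("UTC"))
print(stored)  # 2025-03-21 10:22:33.123000+00:00

# A client in another zone sees the same instant, shifted to its own zone.
print(stored.astimezone(ZoneInfo("Europe/Athens")))  # 2025-03-21 12:22:33.123000+02:00
```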

@amotl (Member) commented Mar 31, 2025

That's all true, thank you. Still,

The DATE type was not designed to allow time-of-day information (i.e., it is supposed to have a resolution of one day).

So, looking at this specific detail, even setting aside the other obstacles about losing the timezone information when storing dates or times (which is always the case), I think it does not matter much which data type is selected, i.e. it won't hurt to choose the timezone-aware one in order to get rid of this miniature I/O anomaly?

@amotl (Member) commented Mar 31, 2025

Compare

start investigating by comparing against PostgreSQL in order to get a feeling whether the behavior is intended with CrateDB

I've used your example program, now also at pandas_cratedb_date_type.py, to check and compare PostgreSQL and CrateDB.

Observations

Both store ingress DATE types in this context using the same data type, which is TIMESTAMP WITHOUT TIME ZONE.

PostgreSQL

postgres=# \d+ test_date
                                                  Table "public.test_date"
 Column |            Type             | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+-----------------------------+-----------+----------+---------+---------+-------------+--------------+-------------
 date_1 | timestamp without time zone |           |          |         | plain   |             |              |
 date_2 | timestamp without time zone |           |          |         | plain   |             |              |

CrateDB

cr> show create table test_date;
+------------------------------------------------+
| SHOW CREATE TABLE doc.test_date                |
+------------------------------------------------+
| CREATE TABLE IF NOT EXISTS "doc"."test_date" ( |
|    "date_1" TIMESTAMP WITHOUT TIME ZONE,       |
|    "date_2" TIMESTAMP WITHOUT TIME ZONE        |
| )                                              |
| CLUSTERED INTO 4 SHARDS                        |
| WITH (                                         |
|    column_policy = 'strict',                   |
|    number_of_replicas = '0-1'                  |
| )                                              |
+------------------------------------------------+

Recap

@matriv: There is an I/O anomaly when using pandas with CrateDB; the outcome differs from PostgreSQL. What we are looking at here is whether we can improve the situation.

PostgreSQL

dataframe before
date_1    datetime64[ns]

dataframe after
date_1    datetime64[ns]

CrateDB

dataframe before
date_1    datetime64[ns]

dataframe after
date_1    datetime64[ns, UTC]

@matriv (Contributor) commented Apr 1, 2025

@amotl Could you please clarify what you propose?

For me, DATE should be timestamp WITHOUT time zone, because its hour/min/sec/millis are all 0, and it actually only denotes a day in whatever context the client wants to use it.
If tz info is added then, imho, it becomes confusing: it means that when storing, for example, 2025-03-31 in a client tz of UTC+1, it will be converted to 2025-03-30 23:00:00.000, or the opposite shift happens if you retrieve it with a client on a specific timezone. So even adding UTC is, imho, not the correct way to go.

Unless I'm confused and you propose something different.
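The date-rollback effect described here can be seen with plain stdlib datetimes (note that Etc/GMT-1 means UTC+01:00, due to the POSIX sign convention):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Midnight "2025-03-31" in a UTC+1 client zone ...
d = datetime(2025, 3, 31, 0, 0, tzinfo=ZoneInfo("Etc/GMT-1"))

# ... lands on the previous day once normalized to UTC:
print(d.astimezone(ZoneInfo("UTC")))  # 2025-03-30 23:00:00+00:00
```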

@amotl (Member) commented Apr 1, 2025

Hi. I think it's clear that both database servers behave in the same way, using the data type TIMESTAMP WITHOUT TIME ZONE to store this field, so it's all good on that end. However, as outlined above, when using CrateDB with pandas, the returned data type is timezone-aware (datetime64[ns, UTC]), while it shouldn't be (datetime64[ns]).

My intention is to find the flaw and mitigate it where possible, because it's confusing to users.

@matriv (Contributor) commented Apr 1, 2025

So the fix should be applied there: when we return DATE and timestamp WITHOUT time zone values, we shouldn't add the UTC timezone. I'm bringing this up because in a previous comment you said:

I don't see a reason not to converge DATE types into the physical TIMESTAMP WITH TIME ZONE type, as you are suggesting, when possible without introducing other quirks.

which imho is not the way to go.

@amotl (Member) commented Apr 1, 2025

Yes, you are right; hereby I am officially retracting my previous statement. Sorry if that stirred confusion. CrateDB does the same as PostgreSQL, so it's all right in this regard. The fix needs to be applied somewhere in the Python client layers, where possible. Thanks! 🍀

@shraik (Author) commented Apr 2, 2025

Another example of strange behavior:
If you save dates with time zones, there is no time zone after loading.
Why isn't the UTC time zone added in this case?

Package versions:

crate 2.0.0
geojson 3.2.0
greenlet 3.1.1
numpy 2.2.4
orjson 3.10.15
pandas 2.2.3
pip 25.0.1
psycopg2 2.9.10
pyarrow 19.0.1
python-dateutil 2.9.0.post0
pytz 2025.1
six 1.17.0
SQLAlchemy 2.0.39
sqlalchemy-cratedb 0.42.0.dev0
typing_extensions 4.12.2
tzdata 2025.1
urllib3 2.3.0
verlib2 0.2.0

Test example:

import sqlalchemy as sa
import pandas as pd

data = {
    "date_1": ["2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01", "2027-12-30"],
    "date_2": ["2020-09-24", "2020-10-24", "2020-11-24", "2020-12-24", "2027-09-24"],
}
df_data = pd.DataFrame.from_dict(data, dtype="datetime64[ns]")

df_data[["date_1", "date_2"]] = df_data[["date_1", "date_2"]].apply(
    lambda x: x.dt.tz_localize("Asia/Krasnoyarsk")
)

print(df_data.dtypes)
print(df_data.sort_values(by="date_1").reset_index(drop=True))

dburi = "crate://panduser:[email protected]:4200?ssl=false"

engine = sa.create_engine(dburi, echo=False)
conn = engine.connect()

df_data.to_sql(
    "test_date",
    conn,
    if_exists="replace",
    index=False,
)
conn.exec_driver_sql("REFRESH TABLE test_date;")
df_load = pd.read_sql_table("test_date", conn)

print("\ndataframe after loading")
df_load = df_load.sort_values(by="date_1").reset_index(drop=True)
print(df_load.dtypes)
print(df_load)

Output:

date_1    datetime64[ns, Asia/Krasnoyarsk]
date_2    datetime64[ns, Asia/Krasnoyarsk]
dtype: object
                     date_1                    date_2
0 2020-01-01 00:00:00+07:00 2020-09-24 00:00:00+07:00
1 2021-01-01 00:00:00+07:00 2020-10-24 00:00:00+07:00
2 2022-01-01 00:00:00+07:00 2020-11-24 00:00:00+07:00
3 2023-01-01 00:00:00+07:00 2020-12-24 00:00:00+07:00
4 2027-12-30 00:00:00+07:00 2027-09-24 00:00:00+07:00

dataframe after loading
date_1    datetime64[ns]
date_2    datetime64[ns]
dtype: object
               date_1              date_2
0 2019-12-31 17:00:00 2020-09-23 17:00:00
1 2020-12-31 17:00:00 2020-10-23 17:00:00
2 2021-12-31 17:00:00 2020-11-23 17:00:00
3 2022-12-31 17:00:00 2020-12-23 17:00:00
4 2027-12-29 17:00:00 2027-09-23 17:00:00

(screenshot of the table schema)

@amotl (Member) commented Apr 3, 2025

Thank you very much for your report again. We will also look into this. It feels like the type mapper needs more improvements.

@amotl (Member) commented Apr 11, 2025

Hi again,

we just identified the spot in pandas where CrateDB/SQLAlchemy follows a different code path than PostgreSQL/SQLAlchemy.

The CrateDB dialect apparently returns sqltype.timezone=True for those columns; that's why we can observe the outcome you are presenting here.

# we have a timezone capable type
if not sqltype.timezone:
    return datetime
return DatetimeTZDtype
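The effect of that branch can be reproduced with plain pandas: a naive datetime series maps to datetime64[ns], while a tz-aware one carries a DatetimeTZDtype. A small illustration (not the pandas internals themselves):

```python
import pandas as pd
from pandas import DatetimeTZDtype

# The same instants, once naive and once localized to UTC.
naive = pd.Series(pd.to_datetime(["2020-01-01", "2021-01-01"]))
aware = naive.dt.tz_localize("UTC")

print(naive.dtype)  # datetime64[ns]
print(aware.dtype)  # datetime64[ns, UTC]

# When sqltype.timezone is (wrongly) True, pandas picks the tz-aware dtype:
assert aware.dtype == DatetimeTZDtype(tz="UTC")
```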

We will see what we can do about it.

With kind regards,
Andreas.

@amotl (Member) commented Apr 11, 2025

Here is a little pure-SQLAlchemy reproducer which demonstrates the problem around column.type.timezone.

import sqlalchemy as sa

def reflect():
    dburi = "crate://"
    # dburi = "postgresql://postgres@localhost:5433/"
    engine = sa.create_engine(dburi)
    with engine.connect() as conn:
        conn.execute(sa.text("CREATE TABLE IF NOT EXISTS t2 (date TIMESTAMP WITHOUT TIME ZONE)"))
        conn.commit()
    metadata = sa.MetaData()
    inspector = sa.inspect(engine)
    table = sa.Table("t2", metadata)
    inspector.reflect_table(table, None)
    for column in table.columns:
        print("column:", column, column.type, column.type.timezone)

reflect()

@amotl (Member) commented Apr 11, 2025

Well, that's an obvious and silly mixup coming from GH-24, which has now been fixed. The offending mapping read:

"timestamp": sqltypes.TIMESTAMP,
"timestamp with time zone": sqltypes.TIMESTAMP(timezone=False),
"timestamp without time zone": sqltypes.TIMESTAMP(timezone=True),
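For contrast, this is how the mapping would presumably look once the "with/without" variants are swapped back so the timezone flag matches the SQL type name (a sketch, not the actual committed code; TYPE_MAP is a placeholder name):

```python
from sqlalchemy import types as sqltypes

# Corrected mapping sketch: the timezone flag now agrees with the type name.
TYPE_MAP = {
    "timestamp": sqltypes.TIMESTAMP,
    "timestamp with time zone": sqltypes.TIMESTAMP(timezone=True),
    "timestamp without time zone": sqltypes.TIMESTAMP(timezone=False),
}

print(TYPE_MAP["timestamp without time zone"].timezone)  # False
```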

@amotl (Member) commented Apr 11, 2025

We just applied a fix per 04f475d and released sqlalchemy-cratedb==0.42.0.dev2. Can you validate that this resolves the problem you observed?

amotl added the bug label on Apr 11, 2025
@shraik (Author) commented Apr 15, 2025 via email
