Description
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
In a Jupyter notebook:
import pandas as pd
import numpy as np
n = 1_000_000
%time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None), n))
%time b = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz='UTC'), n))
CPU times: user 1.88 s, sys: 11.9 ms, total: 1.89 s
Wall time: 1.88 s
CPU times: user 315 ms, sys: 3.99 ms, total: 319 ms
Wall time: 319 ms
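For anyone reproducing this outside a notebook, here is a minimal sketch of the same comparison as a plain script. The helper name `time_construction` is hypothetical, not part of pandas. Unlike the `%time` lines above, it times only the `pd.Series(...)` call and excludes `np.repeat`, so the absolute numbers will not match the output above exactly.

```python
import time

import numpy as np
import pandas as pd


def time_construction(tz, n=1_000_000):
    """Build a Series from n identical Timestamps; return (seconds, series).

    Hypothetical helper for illustration only. The object array is
    created before the timer starts, so only pd.Series(...) is measured.
    """
    values = np.repeat(pd.Timestamp("2019-01-01 00:00:00", tz=tz), n)
    start = time.perf_counter()
    series = pd.Series(values)
    elapsed = time.perf_counter() - start
    return elapsed, series


if __name__ == "__main__":
    for tz in (None, "UTC"):
        elapsed, series = time_construction(tz)
        print(f"tz={tz!r}: {elapsed:.3f} s, dtype={series.dtype}")
```

Timings vary by machine and pandas version, so a single run should be treated as indicative rather than as a benchmark.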
Installed Versions
INSTALLED VERSIONS
commit : a671b5a
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-76-generic
Version : #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_GB.utf8
LANG : C.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.1.4
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.57.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.21.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None
Prior Performance
No response
Activity
Title changed from "PERF: timezoned series created 10x faster than non-timezoned series" to "PERF: timezoned series created 6x faster than non-timezoned series"
rhshadrach commented on Jan 14, 2024
Thanks for the report, confirmed on main. Almost all the time is being spent in Series construction. Further investigations are welcome.
jrmylow commented on Jan 16, 2024
I've started poking into this, posting a summary note before I pick this up later:
Profiled both cases with cProfile.
The big difference comes up in the call to objects_to_datetime64, here: pandas/pandas/core/arrays/datetimes.py, line 2368 in e379692.
That goes through pandas/pandas/_libs/tslib.pyx, line 414 in e379692.
It then reaches conversion.pyx: pandas/pandas/_libs/tslibs/conversion.pyx, line 773 in e379692.
In case b, where the value has a tzinfo attribute, it calls out to convert_datetime_to_tsobject, while case a calls out to (<_Timestamp>val)._as_creso(creso, round_ok=True)._value
At this stage I probably still need to find a way to confirm if there is a significant performance difference between these branches, and whether there is a reasonable fix.
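A rough sketch of how the two cases could be profiled with cProfile, for anyone who wants to repeat the comparison. The helper `profile_construction` is a hypothetical name, not a pandas function. Note that cProfile generally does not see inside compiled Cython code, so the branches in tslib.pyx and conversion.pyx may only show up as time attributed to their Python-level callers.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_construction(tz, n=100_000, top=10):
    """Profile pd.Series(...) over n identical Timestamps; return the report.

    Hypothetical helper for illustration only. The object array is built
    before profiling starts so that only Series construction is recorded.
    """
    values = np.repeat(pd.Timestamp("2019-01-01 00:00:00", tz=tz), n)

    profiler = cProfile.Profile()
    profiler.enable()
    pd.Series(values)
    profiler.disable()

    # Sort by cumulative time so the outermost expensive calls come first.
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    for tz in (None, "UTC"):
        print(f"--- tz={tz!r} ---")
        print(profile_construction(tz))
```

Comparing the two reports side by side should show where the cumulative time diverges, although which function names appear depends on the pandas version installed.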
jrmylow commented on Jan 16, 2024
If I had to guess - it would seem that the _TSObject that is created in case b is lighter/faster than the _Timestamp created in case a?
mroeschke commented on Jan 30, 2024
The difference is in the resolution conversion to ns. AFAICT, the non-tz case actually does a resolution conversion from "s" to "ns", while the tz case just creates a new object, "replacing" the "s" resolution with "ns".
Technically, both should return "s" resolution here.
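The resolutions involved can be inspected directly. The sketch below assumes pandas 2.0 or later, where `Timestamp.unit` and `Timestamp.as_unit` are available. The helper `describe_units` is a hypothetical name, and the units printed depend on the pandas version, so no expected output is shown.

```python
import numpy as np
import pandas as pd


def describe_units(tz, n=1_000):
    """Return (scalar unit, Series dtype) for the Timestamp used in the report.

    Hypothetical helper for illustration only; requires pandas >= 2.0.
    """
    ts = pd.Timestamp("2019-01-01 00:00:00", tz=tz)
    series = pd.Series(np.repeat(ts, n))
    return ts.unit, str(series.dtype)


if __name__ == "__main__":
    for tz in (None, "UTC"):
        unit, dtype = describe_units(tz)
        print(f"tz={tz!r}: Timestamp unit={unit!r}, Series dtype={dtype}")

    # An explicit conversion to nanoseconds keeps the same instant and
    # only changes the stored resolution.
    ts = pd.Timestamp("2019-01-01 00:00:00")
    ts_ns = ts.as_unit("ns")
    print(ts_ns.unit, ts_ns == ts)
```

If the scalar's unit and the unit in the resulting Series dtype differ, a resolution change happened during construction, which is the step discussed above.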