PERF: timezoned series created 6x faster than non-timezoned series #56860

@soerenwolfers

Description

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this issue exists on the latest version of pandas.
  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

In a jupyter notebook

import pandas as pd
import numpy as np
n = 1_000_000
%time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None), n))
%time b = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz='UTC'), n))
CPU times: user 1.88 s, sys: 11.9 ms, total: 1.89 s
Wall time: 1.88 s
CPU times: user 315 ms, sys: 3.99 ms, total: 319 ms
Wall time: 319 ms
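
The `%time` magic above only works in IPython/Jupyter. As a rough sketch for reproducing the comparison in a plain Python script, the snippet below uses `time.perf_counter` instead. The helper name `time_series_construction` and the smaller `n` are assumptions made for illustration, and unlike the original cells it times only the `pd.Series(...)` call, not `np.repeat`. Timings will vary by machine and pandas version, so no expected numbers are given.

```python
import time

import numpy as np
import pandas as pd


def time_series_construction(tz, n=100_000):
    """Hypothetical helper: return (seconds, Series) for a Series of n repeated Timestamps."""
    # Build the object array first so only Series construction is timed.
    values = np.repeat(pd.Timestamp("2019-01-01 00:00:00", tz=tz), n)
    start = time.perf_counter()
    series = pd.Series(values)
    elapsed = time.perf_counter() - start
    return elapsed, series


naive_seconds, a = time_series_construction(tz=None)
utc_seconds, b = time_series_construction(tz="UTC")
print(f"tz=None:  {naive_seconds:.3f} s")
print(f"tz='UTC': {utc_seconds:.3f} s")
```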

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-76-generic
Version : #83-Ubuntu SMP Thu Jun 15 19:16:32 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_GB.utf8
LANG : C.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.4
numpy : 1.24.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : 0.57.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.21.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

Prior Performance

No response

Activity

added the labels Needs Triage (Issue that has not been reviewed by a pandas team member) and Performance (Memory or execution speed performance) on Jan 13, 2024
changed the title from "PERF: timezoned series created 10x faster than non-timezoned series" to "PERF: timezoned series created 6x faster than non-timezoned series" on Jan 13, 2024
rhshadrach (Member) commented on Jan 14, 2024

Thanks for the report, confirmed on main. Almost all the time is being spent in Series construction. Further investigations are welcome.

added the labels Timezones (Timezone data dtype), Constructors (Series/DataFrame/Index/pd.array Constructors), and Timestamp (pd.Timestamp and associated methods), and removed Needs Triage (Issue that has not been reviewed by a pandas team member) on Jan 14, 2024
jrmylow (Contributor) commented on Jan 16, 2024

I've started poking into this, posting a summary note before I pick this up later:

  • Using cProfile, the big difference shows up in the call to objects_to_datetime64:
    def objects_to_datetime64(
  • This then calls into Cython code, which cProfile does not pick up:
    cpdef array_to_datetime(
  • The key difference between the two cases appears to be in the next call, which is in conversion.pyx:
    cdef int64_t parse_pydatetime(
  • Specifically, because case b's values have tzinfo set, it calls convert_datetime_to_tsobject, while case a calls (<_Timestamp>val)._as_creso(creso, round_ok=True)._value

At this stage I still need to find a way to confirm whether there is a significant performance difference between these branches, and whether there is a reasonable fix.
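
For anyone wanting to repeat the profiling step described above, here is a minimal sketch using the standard-library `cProfile` and `pstats` modules. The variable names and the reduced `n` are assumptions for illustration. As noted in the list, time spent inside Cython functions is attributed to the Python-level caller, so this only narrows the search down to the Python entry point.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

n = 100_000
# Case a from the report: tz-naive Timestamps in an object array.
values = np.repeat(pd.Timestamp("2019-01-01 00:00:00", tz=None), n)

profiler = cProfile.Profile()
profiler.enable()
pd.Series(values)
profiler.disable()

# Collect the report in a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(15)
report = buffer.getvalue()
print(report)
```

Running the same script with `tz="UTC"` and comparing the two reports is one way to see where the cumulative time diverges.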

jrmylow (Contributor) commented on Jan 16, 2024

If I had to guess, it would seem that the _TSObject created in case b is lighter/faster than the _Timestamp created in case a?

mroeschke (Member) commented on Jan 30, 2024

The difference is in the resolution conversion to ns:

In [13]: %time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None), n))
CPU times: user 809 ms, sys: 6.04 ms, total: 815 ms
Wall time: 813 ms

In [14]: %time a = pd.Series(np.repeat(pd.Timestamp('2019-01-01 00:00:00', tz=None).as_unit("ns"), n))
CPU times: user 344 ms, sys: 3.31 ms, total: 347 ms
Wall time: 347 ms

AFAICT, the non-tz case actually does a resolution conversion from "s" to "ns" while the tz case just creates a new object "replacing" the "s" resolution with "ns".

Technically, both should return "s" resolution here.
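
To check the resolution explanation locally, a small sketch like the following inspects `Timestamp.unit` before and after `as_unit("ns")` and times Series construction for both inputs. The variable names, the reduced `n`, and the use of `timeit` are assumptions for illustration. The thread reports that the string-parsed Timestamp has "s" resolution on the version discussed; the printed unit and dtypes may differ on other pandas versions, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

ts = pd.Timestamp("2019-01-01 00:00:00", tz=None)
ts_ns = ts.as_unit("ns")  # explicit conversion to nanosecond resolution

print("parsed unit:   ", ts.unit)
print("converted unit:", ts_ns.unit)

n = 100_000
default_values = np.repeat(ts, n)
ns_values = np.repeat(ts_ns, n)

from_default = pd.Series(default_values)
from_ns = pd.Series(ns_values)
print("dtypes:", from_default.dtype, from_ns.dtype)

# Time only the Series construction, repeated a few times for stability.
default_seconds = timeit.timeit(lambda: pd.Series(default_values), number=3)
ns_seconds = timeit.timeit(lambda: pd.Series(ns_values), number=3)
print(f"parsed unit input: {default_seconds:.3f} s for 3 runs")
print(f"ns unit input:     {ns_seconds:.3f} s for 3 runs")
```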
