Performance of deep DataTrees #9511

Open

Labels: contrib-help-wanted, topic-DataTree, topic-internals, topic-performance

TomNicholas opened on Sep 17, 2024

Member

What is your issue?

The DataTree structure was not designed with performance of very large trees in mind. It doesn't do anything obviously wasteful, but the priority has been making decisions about the data model and user API, with performance secondary. Now that the model is more established (or soon should be), we're in a better position to talk about improving performance.

There are two possible performance issues that @shoyer pointed out:

- The internal structure is a lot of linked Python classes, so operations like tree traversal involve many method calls. This is good for clarity and for evolving a prototype, but will introduce significant overhead per tree operation.
- There are one or two places which might cause quadratic scaling with tree depth. In particular, inserting a node via the DataTree.__init__ constructor causes the entire tree to be checked for consistency, so creating a tree by repeatedly using this constructor could be quadratically expensive. DataTree.from_dict could be optimized to avoid this problem because it builds from the root, so subtrees can be checked as they are added.
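The quadratic-scaling concern in the second point can be illustrated with a toy model. This is only a sketch, not xarray's implementation: `Node`, `count_subtree`, and both builder functions are hypothetical names, and the "consistency check" is reduced to counting visited nodes.

```python
# Toy model only: counts how many nodes a consistency check would visit.
# None of these names are xarray APIs.

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}


def count_subtree(node):
    """Visit `node` and everything below it, returning the number of nodes."""
    total = 0
    stack = [node]
    while stack:
        current = stack.pop()
        total += 1
        stack.extend(current.children.values())
    return total


def root_of(node):
    """Walk parent links up to the root."""
    while node.parent is not None:
        node = node.parent
    return node


def build_chain_full_check(depth):
    """Attach nodes one at a time, re-checking the whole tree each time."""
    visits = 0
    tip = Node("root")
    for i in range(depth):
        child = Node(f"n{i}", parent=tip)
        tip.children[child.name] = child
        visits += count_subtree(root_of(child))  # whole-tree check
        tip = child
    return visits


def build_chain_local_check(depth):
    """Attach nodes from the root down, checking only the new subtree."""
    visits = 0
    tip = Node("root")
    for i in range(depth):
        child = Node(f"n{i}", parent=tip)
        tip.children[child.name] = child
        visits += count_subtree(child)  # only the subtree just added
        tip = child
    return visits
```

For a chain of depth d, the whole-tree variant visits d(d+3)/2 nodes in total, while the local variant visits d.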

I personally think that the primary use case of a DataTree is small numbers of nodes, each containing large arrays (rather than large numbers of nodes containing small arrays). But I'm sure someone will immediately be like "well in my use case I need a tree with 10k nodes" 😆

In fact because it is possible to represent huge amounts of archival data with a single DataTree, someone will probably do something like attempt to represent the entire CMIP6 catalog as a DataTree and then complain after hitting a performance limit...

If anyone has ideas for how to improve performance without changing user API let's use this issue to collate and track them.

(Note that this issue is separate from the question of dask in datatree; see #9355, #9502, #9504. Here I'm talking specifically about optimizations that can be performed even without dask installed.)

cc @Illviljan who I'm sure has thoughts about this

Illviljan

Contributor

I'd say my use case is a lot of files with a lot of variables at different resolutions/groups.
I don't think I'm alone in this; this thread, for example, discusses thousands of files/nodes: #8925.
Here are some PRs that target many variables: #7222, #9012.

My experience with datatree is still limited, since I have had a hard time getting past:

- Waiting 1.5 minutes to actually have a datatree object. One bottleneck I saw was the excessive copying when using from_dict. Inside a backend I don't think mutability is much of a concern, so I want no-copy paths (Add copy option in DataTree.from_dict #9193).
  - A related example is the fastpath argument in xr.Variable (debatable whether it is public for 3rd-party users/backends), which is pretty much mandatory if you want fast backends for datasets.
- Reprs that take 3 minutes; exploring or debugging is not fun then (Datatree: Dynamically populate the HTML repr #9350).
- Slow interpolation and concatenation of the files when trying to make fewer but larger arrays (Parallelize map_over_subtree #9502 was an idea to improve that).

shoyer

Member

Trees with thousands of nodes are certainly a compelling use-case, especially with lazy data. A simple improvement would be to automatically truncate reprs when they get too large.
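As a rough sketch of what automatic truncation could look like, the function below renders a tree of group names and stops once a line budget is reached. It is an illustration only: the tree is modelled as a plain nested dict, and `render_tree` and `max_nodes` are hypothetical names, not part of xarray.

```python
def render_tree(tree, max_nodes=10):
    """Render a nested dict of group names, stopping after `max_nodes` lines."""
    lines = []
    # An explicit stack avoids Python's recursion limit on very deep trees.
    stack = [(name, child, 0) for name, child in reversed(list(tree.items()))]
    while stack:
        if len(lines) == max_nodes:
            lines.append("...")  # stop early instead of walking the rest
            break
        name, children, depth = stack.pop()
        lines.append("  " * depth + name)
        stack.extend(
            (n, c, depth + 1) for n, c in reversed(list(children.items()))
        )
    return "\n".join(lines)
```

Because the walk stops at the budget, the cost of the repr depends on `max_nodes` and the width of the visited levels rather than on the total number of nodes.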

I guess we might be able to improve performance of large trees by up to ~10x with clever optimizations of the existing code, but if we need ~100x performance gains we will need to think about alternative strategies. There are limits on how far you can optimize pure Python code with thousands or millions of objects.

One solution that comes to mind, with minimal implications for Xarray's API, is lazy creation/loading of sub-trees. You would write something like open_datatree(..., load_depth=2) to load only the first two levels of the tree into memory, with lower levels in the hierarchy only populated when accessed/needed.
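A minimal sketch of the lazy-loading idea, under stated assumptions: `LazyNode`, `list_children`, and the in-memory `STORE` are hypothetical stand-ins for a backend that can list the child groups of a path, and `load_depth` here only mirrors the proposed keyword; nothing below is an existing xarray API.

```python
class LazyNode:
    """A tree node whose children are listed only when first needed."""

    def __init__(self, path, list_children, load_depth=0):
        self.path = path
        self._list_children = list_children  # callable: path -> child names
        self._children = None  # None means "not loaded yet"
        if load_depth > 0:
            self._load(load_depth - 1)

    def _load(self, child_depth):
        self._children = {
            name: LazyNode(f"{self.path}/{name}", self._list_children, child_depth)
            for name in self._list_children(self.path)
        }

    @property
    def loaded(self):
        return self._children is not None

    @property
    def children(self):
        if self._children is None:
            self._load(0)  # populate on first access
        return self._children


# Hypothetical store: maps a group path to the names of its child groups.
STORE = {"": ["a", "b"], "/a": ["x"], "/b": [], "/a/x": ["deep"], "/a/x/deep": []}
calls = []  # records which paths were actually listed


def list_children(path):
    calls.append(path)
    return STORE.get(path, [])


root = LazyNode("", list_children, load_depth=2)
```

With `load_depth=2`, only the root and its direct children are listed up front; the node at `/a/x` exists but its own children are not listed until `.children` is accessed on it.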

benbovy

Member

About the performance of the (static html) reprs, I'm afraid there's no way around truncating them for large trees.

There are many more possibilities with dynamic (widget) reprs supporting bi-directional communication. https://github.com/benbovy/xarray-fancy-repr doesn't work with DataTree yet but I think it would be pretty straightforward to support it (all the repr parts are already available as reusable react components). It would also work seamlessly with lazy loading of sub-trees. The "hardest" task would be to design some UI elements for navigating into large trees.

aladinor

Contributor

Hi everyone,

I've been working with hierarchical structures to store weather radar data. We’re leveraging xradar and datatree to manage these datasets efficiently. Currently, we are using the standard WMO Cfradial2.1/FM301 format to build a datatree model using xradar. Then, the data is stored in Zarr format.

This data model stores historical weather radar datasets in Zarr format while supporting real-time updates as radar networks operate continuously. It leverages a Zarr-append pattern for seamless data integration.

I think our data model works, at least in this beta stage; however, as the dataset grows, we’ve noticed longer load times when opening/reading the Zarr store using open_datatree. As shown in the following snippet, the time to open the dataset grows as its size increases:

For ~15 GB in size, open_datatree takes around 5.73 seconds

For ~80 GB in size, open_datatree takes around 11.6 seconds

I've worked with larger datasets, which take more time to open/read.

The datatree structure contains 11 nodes, each representing a point where live-updating data is appended. This is a minimal reproducible example, in case you want to look at it.

import s3fs
import xarray as xr
from time import time


def main():
    print(xr.__version__)
    st = time()
    # Connect anonymously to the S3-compatible bucket
    URL = 'https://js2.jetstream-cloud.org:8001/'
    path = 'pythia/radar/erad2024'
    fs = s3fs.S3FileSystem(anon=True,
                           client_kwargs=dict(endpoint_url=URL))
    file = s3fs.S3Map(f"{path}/zarr_radar/Guaviare_test.zarr", s3=fs)

    # Open the datatree stored in Zarr (chunks={} gives dask-backed arrays)
    dtree = xr.backends.api.open_datatree(
        file,
        engine='zarr',
        consolidated=True,
        chunks={}
    )
    print(f"total time: {time() - st}")


if __name__ == "__main__":
    main()

and the output is

2024.9.1.dev23+g52f13d44
total time: 5.198976516723633

For more information about the data model, you can check this raw2zarr GitHub repo and the poster we presented at the SciPy conference.

aladinor

Contributor

Following up on my previous post, I found out that when using open_groups_as_dict, we create a StoreBackendEntrypoint() that allows us to retrieve the datasets for each node.

https://github.com/pydata/xarray/blob/f01096fef402485092c7132dfd042cc8f467ed09/xarray/backends/zarr.py#L1367C2-L1382C47

However, I discovered that using the open_dataset method instead of StoreBackendEntrypoint() improves the reading/opening time:

        for path_group, store in stores.items():
            # store_entrypoint = StoreBackendEntrypoint()
            # 
            # with close_on_error(store):
            #     group_ds = store_entrypoint.open_dataset(
            #         store,
            #         mask_and_scale=mask_and_scale,
            #         decode_times=decode_times,
            #         concat_characters=concat_characters,
            #         decode_coords=decode_coords,
            #         drop_variables=drop_variables,
            #         use_cftime=use_cftime,
            #         decode_timedelta=decode_timedelta,
            #     )

            group_ds = open_dataset(
                filename_or_obj,
                store=store,
                group=path_group,
                engine="zarr",
                mask_and_scale=mask_and_scale,
                decode_times=decode_times,
                concat_characters=concat_characters,
                decode_coords=decode_coords,
                drop_variables=drop_variables,
                use_cftime=use_cftime,
                decode_timedelta=decode_timedelta,
                **kwargs,
            )
            group_name = str(NodePath(path_group))
            groups_dict[group_name] = group_ds

I got the following results by running a test locally over the minimum reproducible example.

2024.9.1.dev23+g52f13d44
total time: 3.808659553527832

We went from ~5.2 to 3.8 seconds (around 1.37x faster).

Please let me know your thoughts.

TomNicholas mentioned this on Oct 17, 2024: Slow open_datatree for zarr stores with many coordinate variables #9640

TomNicholas

Member, Author

@aladinor I've raised your problem as a separate issue in #9640, in an attempt to keep this issue as a "meta-issue" for discussing performance considerations in the overall implementation of datatree. Your issue seems to be specific to the zarr backend.

TomNicholas

Member, Author

> About the performance of the (static html) reprs, I'm afraid there's no way around truncating them for large trees.
>
> https://github.com/benbovy/xarray-fancy-repr doesn't work with DataTree yet but I think it would be pretty straightforward to support it

@benbovy that's interesting! #9633 brings the datatree (static) html repr up to date, so you or anyone else who is interested in playing with this (@jsignell perhaps?) can start from there.

> The "hardest" task would be to design some UI elements for navigating into large trees.

Presumably you're referring to more than just clickable dropdown arrows here?

benbovy mentioned this on Dec 20, 2024: Integrate Treescope into Xarray's interactive HTML repr #9324

benbovy mentioned this on Feb 24, 2025: DataTree html repr is very slow #10052

jsignell

Contributor

> One solution that comes to mind, with minimal implications for Xarray's API, is lazy creation/loading of sub-trees. You would write something like open_datatree(..., load_depth=2) to only load in the first two levels of the tree into memory, with lower levels in the hierarchy only populated when accessed/needed.

This feels like the right path forward to get substantial improvements for deep trees. Naively I can even imagine it being useful to use load_depth=0 to just load a tree with the names of all the groups to allow walking the tree without loading any array-level data (even coordinate variables) until I get to the node of interest.
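One way to picture the names-only tree is as a skeleton built purely from group paths. The sketch below assumes the backend can cheaply list every group path in the store (an assumption, not something this thread confirms); `skeleton_from_paths` and the example group names are hypothetical.

```python
def skeleton_from_paths(paths):
    """Build a nested dict of group names from slash-separated group paths."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            if part:  # skip the empty part produced by the root path "/"
                node = node.setdefault(part, {})
    return root


# Hypothetical group paths; no array data is touched to build the skeleton.
skeleton = skeleton_from_paths(["/", "/sweep_0", "/sweep_0/cal", "/sweep_1"])
```

Walking `skeleton` then only involves dictionary lookups, and array-level data would be loaded once a node of interest is reached.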

TomNicholas

Member, Author

> This feels like the right path forward to get substantial improvements for deep trees.

It might be, but we need a much clearer idea of what is actually slow before we go anywhere near that. We need more benchmarks (xarray has a performance benchmark suite that uses ASV), and we need to identify and fix the known and unknown low-hanging fruit first.
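As a starting point for such benchmarks, here is a sketch of what an ASV-style suite for deep trees might look like. The class name, parameter values, and tree layout are assumptions made for illustration and are not taken from xarray's existing benchmark suite; the `params`, `param_names`, `setup`, and `time_*` conventions are ASV's.

```python
try:
    import xarray as xr
except ImportError:  # lets the file import even where xarray is missing
    xr = None


class DeepTreeSuite:
    """Hypothetical ASV suite: a single chain of nested groups, `depth` levels deep."""

    params = [4, 16, 64]
    param_names = ["depth"]

    def setup(self, depth):
        if xr is None or not hasattr(xr, "DataTree"):
            raise NotImplementedError  # ASV treats this as "skip benchmark"
        ds = xr.Dataset({"v": ("x", [0.0, 1.0])})
        # Keys "g0", "g0/g1", "g0/g1/g2", ... describe one deep chain of groups.
        self.datasets = {
            "/".join(f"g{i}" for i in range(level + 1)): ds
            for level in range(depth)
        }
        self.tree = xr.DataTree.from_dict(self.datasets)

    def time_from_dict(self, depth):
        # Construction cost as a function of tree depth.
        xr.DataTree.from_dict(self.datasets)

    def time_walk_subtree(self, depth):
        # Traversal cost: iterate over every node in the tree.
        for _ in self.tree.subtree:
            pass
```

Comparing timings across the `depth` values would show whether construction or traversal grows faster than linearly.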

Activity

Illviljan commented on Sep 17, 2024

shoyer commented on Sep 22, 2024

benbovy commented on Sep 30, 2024

aladinor commented on Oct 10, 2024

aladinor commented on Oct 10, 2024

TomNicholas commented on Oct 17, 2024

TomNicholas commented on Oct 17, 2024

jsignell commented on Jun 12, 2025

TomNicholas commented on Jun 16, 2025
