Description
What is your issue?
The `DataTree` structure was not designed with performance of very large trees in mind. It doesn't do anything obviously wasteful, but the priority has been making decisions about the data model and user API, with performance secondary. Now that the model is more established (or soon should be), we're in a better position to talk about improving performance.
There are two possible performance issues that @shoyer pointed out:
- The internal structure is a lot of linked Python classes, resulting in a lot of method calls to do things like tree traversal. This is good for clarity and evolving a prototype, but will introduce significant overhead per tree operation.
- There are one or two places which might cause quadratic scaling with tree depth. In particular, inserting a node via the `DataTree.__init__` constructor will cause the entire tree to be checked for consistency, so creating a tree by repeatedly using this constructor could be quadratically expensive. `DataTree.from_dict` could be optimized to remove this problem because it creates from the root, so you can just check subtrees as they are added.
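To make the quadratic-scaling concern concrete, here is a minimal sketch using a toy node class. `ToyNode`, `attach_checking_whole_tree`, `attach_checking_new_subtree`, and `build_chain` are hypothetical names made up for illustration. This is not xarray's implementation; it only counts how many per-node checks each strategy would perform when a deep chain of nodes is built one node at a time.

```python
class ToyNode:
    """Hypothetical stand-in for a tree node (not xarray's DataTree)."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = {}

    def root(self):
        """Walk parent links up to the root of the tree."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node

    def subtree(self):
        """Yield this node and all of its descendants (iterative depth-first)."""
        stack = [self]
        while stack:
            node = stack.pop()
            yield node
            stack.extend(node.children.values())


def attach_checking_whole_tree(parent, child, counter):
    """Attach child, then visit every node in the whole tree (assumed worst case)."""
    parent.children[child.name] = child
    child.parent = parent
    for _ in parent.root().subtree():
        counter[0] += 1  # stands in for one consistency check


def attach_checking_new_subtree(parent, child, counter):
    """Attach child, then visit only the newly added subtree."""
    parent.children[child.name] = child
    child.parent = parent
    for _ in child.subtree():
        counter[0] += 1  # stands in for one consistency check


def build_chain(n, attach):
    """Build a chain of n nodes below a root and return (root, number of checks)."""
    counter = [0]
    root = ToyNode("root")
    node = root
    for i in range(n):
        child = ToyNode(f"n{i}")
        attach(node, child, counter)
        node = child
    return root, counter[0]


_, whole_tree_checks = build_chain(100, attach_checking_whole_tree)
_, subtree_checks = build_chain(100, attach_checking_new_subtree)
print(whole_tree_checks, subtree_checks)
```

In this toy model, the number of checks for the whole-tree strategy grows roughly as n²/2 with chain depth n, while the subtree-only strategy grows linearly. Whether the real constructor behaves like the first function would need to be confirmed by profiling.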
I personally think that the primary use case of a `DataTree` is small numbers of nodes, each containing large arrays (rather than large numbers of nodes containing small arrays). But I'm sure someone will immediately be like "well in my use case I need a tree with 10k nodes" 😆
In fact because it is possible to represent huge amounts of archival data with a single `DataTree`, someone will probably do something like attempt to represent the entire CMIP6 catalog as a `DataTree` and then complain after hitting a performance limit...
If anyone has ideas for how to improve performance without changing the user API, let's use this issue to collate and track them.
(Note that this issue is different from the issue of dask in datatree. (xref #9355, #9502, #9504) Here I'm talking specifically about optimizations that can be performed even without dask installed.)
cc @Illviljan who I'm sure has thoughts about this
Activity
Illviljan commented on Sep 17, 2024
I'd say my use case is a lot of files with a lot of variables with different resolutions/groups.
I don't think I'm completely alone in this; this thread, for example, discusses thousands of files/nodes: #8925.
Here are some PRs that target many variables: #7222, #9012
My experience with datatree is still limited, since I have had a hard time getting past working with `from_dict`. When inside a backend I don't think mutability is much of a concern and therefore want no-copy paths.
- Add copy option in DataTree.from_dict #9193
- The `fastpath` argument in `xr.Variable` (debatable if public for 3rd party users/backends) is pretty much mandatory if you want fast backends for datasets.
shoyer commented on Sep 22, 2024
Trees with thousands of nodes are certainly a compelling use-case, especially with lazy data. A simple improvement would be to automatically truncate reprs when they get too large.
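As a rough sketch of what automatic truncation could look like, the snippet below renders a nested dict as an indented text tree and collapses children beyond a limit. The function `truncated_repr` and its `max_children` parameter are hypothetical and only illustrate the idea; they are not part of xarray.

```python
def truncated_repr(tree, name="/", max_children=3, indent=0):
    """Return the lines of a text repr, showing at most max_children per node.

    `tree` is a toy nested dict mapping child names to child dicts.
    """
    lines = [" " * indent + name]
    children = list(tree.items())
    for child_name, child in children[:max_children]:
        lines.extend(truncated_repr(child, child_name, max_children, indent + 2))
    hidden = len(children) - max_children
    if hidden > 0:
        # Summarize the children that were skipped instead of printing them.
        lines.append(" " * (indent + 2) + f"... ({hidden} more)")
    return lines


# A toy tree: one root with ten empty groups.
wide_tree = {f"g{i}": {} for i in range(10)}
print("\n".join(truncated_repr(wide_tree)))
```

With a fixed per-node limit, the output still grows with tree depth, so a real implementation would probably also need a cap on total lines or on depth.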
I guess we might be able to improve performance of large trees by up to ~10x with clever optimizations of the existing code, but if we need ~100x performance gains we will need to think about alternative strategies. There are limits on how far you can optimize pure Python code with thousands or millions of objects.
One solution that comes to mind, with minimal implications for Xarray's API, is lazy creation/loading of sub-trees. You would write something like `open_datatree(..., load_depth=2)` to load only the first two levels of the tree into memory, with lower levels in the hierarchy only populated when accessed/needed.
benbovy commented on Sep 30, 2024
About the performance of the (static html) reprs, I'm afraid there's no way around truncating them for large trees.
There are many more possibilities with dynamic (widget) reprs supporting bi-directional communication. https://github.com/benbovy/xarray-fancy-repr doesn't work with DataTree yet but I think it would be pretty straightforward to support it (all the repr parts are already available as reusable React components). It would also work seamlessly with lazy loading of sub-trees. The "hardest" task would be to design some UI elements for navigating into large trees.
aladinor commented on Oct 10, 2024
Hi everyone,
I've been working with hierarchical structures to store weather radar data. We're leveraging xradar and datatree to manage these datasets efficiently. Currently, we are using the standard WMO Cfradial2.1/FM301 format to build a datatree model using `xradar`. Then, the data is stored in `Zarr` format.
This data model stores historical weather radar datasets in `Zarr` format while supporting real-time updates as radar networks operate continuously. It leverages a Zarr-append pattern for seamless data integration.
I think our data model works, at least in this beta stage; however, as the dataset grows, we've noticed longer load times when opening/reading the `Zarr` store using `open_datatree`. As shown in the following snippet, the time to open the dataset grows as its size increases:
For ~15 GB in size, `open_datatree` takes around 5.73 seconds
For ~80 GB in size, `open_datatree` takes around 11.6 seconds
I've worked with larger datasets, which take more time to open/read.
The datatree structure contains 11 nodes, each representing a point where live-updating data is appended. This is a minimal reproducible example, in case you want to look at it.
and the output is
For more information about the data model, you can check this `raw2zarr` GitHub repo and the poster we presented at the SciPy conference.
aladinor commented on Oct 10, 2024
Following up on my previous post, I found out that when using `open_groups_as_dict`, we create a `StoreBackendEntrypoint()` that allows us to retrieve the `datasets` for each node.
https://github.com/pydata/xarray/blob/f01096fef402485092c7132dfd042cc8f467ed09/xarray/backends/zarr.py#L1367C2-L1382C47
However, I discovered that using the `open_dataset` method instead of `StoreBackendEntrypoint()` improves the reading/opening time.
I got the following results by running a test locally over the minimum reproducible example.
We went from ~5.2 to 3.8 seconds (around 1.37x faster).
Please let me know your thoughts.
TomNicholas commented on Oct 17, 2024
@aladinor I've raised your problem as a separate issue in #9640, in an attempt to keep this issue as a "meta-issue" for discussing performance considerations in the overall implementation of datatree. Your issue seems to be specific to the zarr backend.
TomNicholas commented on Oct 17, 2024
@benbovy that's interesting! #9633 updates the datatree (static) html repr to be up to date, so you or anyone else who is interested in playing with this (@jsignell perhaps?) can start from there.
Presumably you're referring to more than just clickable dropdown arrows here?
jsignell commented on Jun 12, 2025
This feels like the right path forward to get substantial improvements for deep trees. Naively I can even imagine it being useful to use `load_depth=0` to just load a tree with the names of all the groups to allow walking the tree without loading any array-level data (even coordinate variables) until I get to the node of interest.
TomNicholas commented on Jun 16, 2025
It might be, but we need a much clearer idea of what is actually slow before we go anywhere near that. We need more benchmarks (xarray has a performance benchmark test suite that uses ASV), and we need to identify and solve the known and unknown low-hanging fruit first.
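For anyone who wants to contribute benchmarks, here is a minimal sketch in the ASV style (a class with `params`, `param_names`, a `setup` method, and `time_*` methods). The class name `TreeConstruction` and the helpers `build_tree` and `count_nodes` are hypothetical, and the benchmark times a toy nested-dict tree rather than a real `DataTree`; a real benchmark would swap in the xarray call being measured.

```python
import timeit


def build_tree(paths):
    """Build a toy nested-dict tree from slash-separated paths."""
    root = {}
    for path in paths:
        node = root
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return root


def count_nodes(tree):
    """Count the nodes in a toy tree, including the root."""
    return 1 + sum(count_nodes(child) for child in tree.values())


class TreeConstruction:
    """ASV-style benchmark: ASV calls setup() and then each time_* method
    once per value listed in params."""

    params = [10, 100, 1000]
    param_names = ["n_nodes"]

    def setup(self, n_nodes):
        # Ten leaf nodes per group, so the tree is two levels deep.
        self.paths = [f"/group{i // 10}/node{i}" for i in range(n_nodes)]

    def time_build(self, n_nodes):
        build_tree(self.paths)


if __name__ == "__main__":
    # Stand-alone runner so the sketch can be tried without ASV installed.
    for n in TreeConstruction.params:
        bench = TreeConstruction()
        bench.setup(n)
        best = min(timeit.repeat(lambda: bench.time_build(n), number=10, repeat=3))
        print(f"n_nodes={n}: {best / 10:.6f} s per build")
```

Parameterizing over the number of nodes makes scaling behaviour visible, which is the kind of evidence needed to tell linear from quadratic costs.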