Make AbstractVariable a subtype of AbstractDiskArray #35

lupemba · 2025-05-03T14:01:13Z

This PR requires JuliaIO/DiskArrays.jl#255 and JuliaIO/DiskArrays.jl#260

See #32 for background.

This is a draft of how we can make AbstractVariable <: AbstractDiskArray plus how to use the AbstractSubDiskArray for views.

This draft does not address:

Performance
(solved) Documentation
PermutedDiskArray

This PR will probably contain a few breaking changes, but most things will behave identically. One notable change is that SubVariable is no longer a subtype of AbstractVariable.

This update will require packages that implement the CommonDataModel interface to stop defining getindex and setindex!. Instead, they should define DiskArrays.readblock! and DiskArrays.writeblock!, which most of them already do.
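As a rough illustration of the migration described above, here is a minimal sketch, assuming DiskArrays.jl is installed. `ToyVariable` is a hypothetical in-memory type that subtypes `DiskArrays.AbstractDiskArray` directly (rather than `CommonDataModel.AbstractVariable`) only to keep the example self-contained; a real backend would read from and write to a file inside the two block methods.

```julia
using DiskArrays

# Hypothetical in-memory stand-in for a file-backed variable.
struct ToyVariable{T,N} <: DiskArrays.AbstractDiskArray{T,N}
    data::Array{T,N}
end

Base.size(v::ToyVariable) = size(v.data)

# Read the block described by the unit ranges `r` into the preallocated `aout`.
function DiskArrays.readblock!(v::ToyVariable, aout, r::AbstractUnitRange...)
    aout .= view(v.data, r...)
    return aout
end

# Write the block `ain` into the region described by the unit ranges `r`.
function DiskArrays.writeblock!(v::ToyVariable, ain, r::AbstractUnitRange...)
    view(v.data, r...) .= ain
    return ain
end

data = collect(reshape(1.0:20.0, 4, 5))
v = ToyVariable(copy(data))

block = v[2:3, 1:2]          # generic getindex from DiskArrays ends up in readblock!
v[1:2, 1:2] = zeros(2, 2)    # generic setindex! from DiskArrays ends up in writeblock!
```

Note that no `Base.getindex` or `Base.setindex!` method is defined for `ToyVariable`; indexing is handled entirely by the generic DiskArrays methods.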

Project.toml

src/CommonDataModel.jl

src/cfvariable.jl

lupemba · 2025-05-03T14:06:20Z

src/groupby.jl

 Base.BroadcastStyle(::DefaultArrayStyle,::ReducedGroupedVariableStyle) = ReducedGroupedVariableStyle()
 Base.BroadcastStyle(::ReducedGroupedVariableStyle,::DefaultArrayStyle) = ReducedGroupedVariableStyle()

+Base.BroadcastStyle(::DiskArrays.ChunkStyle,::ReducedGroupedVariableStyle) = ReducedGroupedVariableStyle()


ReducedGroupedVariable uses a different broadcasting style than DiskArrays. This might lead to some confusion.
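To make the precedence rule in the diff above more concrete, the following sketch uses only Base. `ToyChunkStyle` and `ToyReducedStyle` are hypothetical stand-ins for `DiskArrays.ChunkStyle` and `ReducedGroupedVariableStyle`; the real styles may carry type parameters and behave differently in detail.

```julia
using Base.Broadcast: BroadcastStyle, result_style

# Hypothetical stand-ins for DiskArrays.ChunkStyle and ReducedGroupedVariableStyle.
struct ToyChunkStyle <: BroadcastStyle end
struct ToyReducedStyle <: BroadcastStyle end

# Binary rule: when both styles meet in one broadcast, the reduced style wins.
Base.Broadcast.BroadcastStyle(::ToyChunkStyle, ::ToyReducedStyle) = ToyReducedStyle()

# Base tries both argument orders, so a single rule covers both directions.
winner_ab = result_style(ToyChunkStyle(), ToyReducedStyle())
winner_ba = result_style(ToyReducedStyle(), ToyChunkStyle())
```

Without such a rule, Base has no way to pick between two unrelated styles, so mixing them in a single broadcast expression would not resolve to the reduced style.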

src/memory_dataset.jl

lupemba · 2025-05-03T14:10:20Z

src/variable.jl

 and data are copied from `src` as well as the variable name (unless provide by `name`).
 """
-function defVar(dest::AbstractDataset,varname::SymbolOrString,srcvar::AbstractVariable; kwargs...)
+function defVar(dest::AbstractDataset,varname::SymbolOrString,srcvar::Union{AbstractVariable, SubVariable}; kwargs...)


I updated ::AbstractVariable to ::Union{AbstractVariable, SubVariable} in the places where I thought it made sense, but I might have missed some.

Isn't SubVariable <: AbstractVariable anyway ?

yes, this is indeed the case

CommonDataModel.jl/src/types.jl

Line 158 in 68f2050

struct SubVariable{T,N,TA,TI,TAttrib,TV} <: AbstractVariable{T,N}

No, see the changes to types.jl

https://github.com/JuliaGeo/CommonDataModel.jl/pull/35/files#diff-525588e68b2421901be164965272940effa093de733a3fd36c4b9a4344b8c20cR55

SubVariable is no longer a subtype of AbstractVariable but is now a subtype of AbstractSubDiskArray

struct SubVariable{T,N,P,I,L} <: DiskArrays.AbstractSubDiskArray{T,N,P,I,L}

This is done to use the DiskArray implementation of views as discussed in JuliaGeo/NCDatasets.jl#274

Ah of course

I have added ::Union{AbstractVariable, SubVariable} in all the places needed to extend the following functions for SubVariable: filter, coord, ancillaryvariables, groupby, select. I think that should cover the public interface of CommonDataModel.
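The effect of the `Union` signatures discussed in this thread can be sketched in plain Julia. All names below (`MiniAbstractVariable`, `MiniVariable`, `MiniSubVariable`, `mini_name`, `describe`) are hypothetical and only mirror the shape of the real type layout.

```julia
# Hypothetical miniature of the type layout: the view type is no longer
# a subtype of the abstract variable type.
abstract type MiniAbstractVariable end

struct MiniVariable <: MiniAbstractVariable
    name::String
end

struct MiniSubVariable           # deliberately NOT <: MiniAbstractVariable
    parent::MiniVariable
end

mini_name(v::MiniVariable) = v.name
mini_name(v::MiniSubVariable) = mini_name(v.parent)

# Accepting the Union keeps the function usable for both variables and views.
describe(v::Union{MiniAbstractVariable, MiniSubVariable}) = "variable " * mini_name(v)

v = MiniVariable("sst")
sv = MiniSubVariable(v)
```

A method declared only for `MiniAbstractVariable` would raise a `MethodError` for `sv`, which is why each public function needs the wider signature.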

lupemba · 2025-05-03T14:11:00Z

src/variable.jl



+function DiskArrays.haschunks(v::AbstractVariable) 
+    storage, chunksizes = chunking(v) 


I tried to reuse the existing chunking method.
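The diff excerpt above is truncated, so the following is only a sketch of how a chunking query could be mapped onto the DiskArrays chunk traits, assuming DiskArrays.jl is installed. `ToyChunkedVariable` and `toy_chunking` are hypothetical, and the `(storage, chunksizes)` return shape with `:chunked` / `:contiguous` symbols is an assumption, not a statement about what the PR does.

```julia
using DiskArrays

# Hypothetical variable that records its storage layout.
struct ToyChunkedVariable{T,N} <: DiskArrays.AbstractDiskArray{T,N}
    data::Array{T,N}
    storage::Symbol              # assumed to be :chunked or :contiguous
    chunksizes::NTuple{N,Int}
end

Base.size(v::ToyChunkedVariable) = size(v.data)

# Stand-in for `chunking(v)`; the return shape is an assumption.
toy_chunking(v::ToyChunkedVariable) = (v.storage, v.chunksizes)

# Translate the storage flag into the DiskArrays chunk trait.
function DiskArrays.haschunks(v::ToyChunkedVariable)
    storage, _ = toy_chunking(v)
    return storage == :chunked ? DiskArrays.Chunked() : DiskArrays.Unchunked()
end

# Expose the chunk grid so DiskArrays can iterate block by block.
DiskArrays.eachchunk(v::ToyChunkedVariable) =
    DiskArrays.GridChunks(v, toy_chunking(v)[2])

chunked = ToyChunkedVariable(zeros(4, 6), :chunked, (2, 3))
contiguous = ToyChunkedVariable(zeros(4, 6), :contiguous, (4, 6))
```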

test/test_groupby.jl

test/test_scaling.jl

test/test_subvariable.jl

lupemba · 2025-05-03T14:15:33Z

@rafaqz @tiemvanderdeure @Alexander-Barth,
Here is my work so far. The tests pass when I add the DiskArrays changes from JuliaIO/DiskArrays.jl#255 and JuliaIO/DiskArrays.jl#260

rafaqz · 2025-05-03T18:33:15Z

Great, we should get those DiskArrays PRs in then (sorry got distracted with other things)

tiemvanderdeure · 2025-05-05T07:03:45Z

Yeah we just need JuliaIO/DiskArrays.jl#249 and then we can merge JuliaIO/DiskArrays.jl#255. I think both are pretty much ready to merge?

rafaqz · 2025-05-05T08:50:28Z

I just need to fix views in 249

lupemba · 2025-06-23T09:40:19Z

src/groupby.jl

+    aout,
+    indexes::Vararg{OrdinalRange, N}) where {T,N}
+
+    aout .= Base.getindex(gr,indexes...)


Currently Base.getindex is defined for ReducedGroupedVariable, so the DiskArrays.jl implementation is not used. An alternative would be to use the DiskArrays getindex and implement more logic in readblock!. That would mean more work on ReducedGroupedVariable, and I cannot see any benefit in doing it right now.

lupemba · 2025-06-27T12:32:10Z

@rafaqz @tiemvanderdeure @Alexander-Barth,

This PR is now ready for review. There are still some unresolved comments that we will have to discuss before merging the PR.
I have made draft PRs to all the "datasets" packages to test that this update will work.

Most of the changes are small but TIFFDatasets required a bit more work since it was not using DiskArrays.

I think the path forward should be.

Review this PR and solve outstanding comments.
Do some minor performance tests
Merge this PR and release CommonDataModel.jl v0.4
Review NCDatasets.jl PR and rerun CI.
Release new NCDatasets.jl version. Some of the other packages use NCDatasets as a test dependency.
Update the remaining packages.

lupemba · 2025-06-27T12:33:45Z

The CI / Documentation is failing because https://psl.noaa.gov/thredds/fileServer/Datasets/noaa.oisst.v2.highres/sst.day.mean.2023.nc is unavailable at the moment.

felixcremer · 2025-08-13T13:07:26Z

I just started playing around with it in Rasters and this will also need some overhaul in Rasters.jl to bring it to the new behaviour.

rafaqz · 2025-08-13T23:33:56Z

Yeah Rasters had to hack around the old behavior a bit to get at the inner disk arrays, maybe some of that will break.

tiemvanderdeure · 2025-08-14T05:52:16Z

Probably we can just revert rafaqz/Rasters.jl#892 once all of this is working

Alexander-Barth · 2025-08-27T18:50:51Z

This looks really good ! Thank you @lupemba !!!

What kind of breaking changes do you expect (besides SubVariable no longer being a subtype of AbstractVariable)?

lupemba · 2025-08-27T19:36:14Z

What kind of breaking changes do you expect (besides SubVariable no longer being a subtype of AbstractVariable)?

@Alexander-Barth, I am glad you like the PR 😄 The main breaking change is that packages implementing CommonDataModel.jl have to define DiskArrays.readblock! and DiskArrays.writeblock! instead of Base.getindex and Base.setindex! when implementing a new AbstractVariable. This PR to ZarrDatasets.jl is a good example: https://github.com/JuliaGeo/ZarrDatasets.jl/pull/13/files

This is the only other breaking change that I am aware of. The effect on the users of the ***Datasets.jl packages should be very minimal. I therefore think we can get away with non-breaking updates to the ***Datasets.jl packages.

lupemba · 2025-08-31T14:19:47Z

@Alexander-Barth and @rafaqz
It would be nice if you could approve this PR or let me know if anything more needs to be changed. I would like to merge this within the next week so it doesn't drag out forever and go stale.

rafaqz · 2025-09-01T05:33:59Z

src/memory_dataset.jl

+
+    root = _root(v)
+    for idim = findall(size(v) .> sz)
+        dname = v.dimnames[idim]


Suggested change

dname = v.dimnames[idim]

dname = dimnames(v)[idim]

There is a bunch of direct field access in this PR for fields that have interface methods.

Yes, this code is simply copied from the Base.setindex! method that was before. I think the best approach is to open a separate issue and PR to reduce "direct field access". This can be looked at after this PR is completed.
#42

I don't think we should expand the scope of this PR anymore.

Sure, it just looks like new code to me and you use accessor methods everywhere else...

lupemba · 2025-09-08T07:48:19Z

@Alexander-Barth and @rafaqz.
Is there something missing or can we merge this PR soon?
I would really like to get it over the finish line in the next couple of days.

rafaqz · 2025-09-08T08:36:41Z

I would too! But I can't really merge without an approval from @Alexander-Barth

Alexander-Barth · 2025-09-08T13:20:59Z

Thanks @lupemba, thank you for the great work!
I am wondering if we already have some performance tests?
Maybe this is useful:
https://github.com/JuliaGeo/NCDatasets.jl/blob/master/test/perf/benchmark-julia-NCDatasets.jl

lupemba · 2025-09-08T17:10:43Z

Thanks @lupemba, thank you for the great work! I am wondering if we already have some performance tests? Maybe this is useful: https://github.com/JuliaGeo/NCDatasets.jl/blob/master/test/perf/benchmark-julia-NCDatasets.jl

I don't have a Linux machine where I can use sudo to run the benchmark. These results are therefore just from my Mac with cache enabled, using Julia 1.11.4. I did two runs on each branch.

Prepare common data model 0.4 NCDatasets.jl#279 : run1 393.500 ms, run2 390.996 ms
NCDatasets.jl master : run1 254.573 ms, run2 255.408 ms

It looks like this PR makes NCDatasets.jl around 50% slower for the benchmark.
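The "around 50%" figure can be checked directly from the four timings quoted above. The short calculation below uses only those numbers; the helper `mean2` is just a local convenience.

```julia
# Timings in milliseconds, copied from the two runs per branch reported above.
pr_runs     = [393.500, 390.996]   # NCDatasets.jl#279
master_runs = [254.573, 255.408]   # NCDatasets.jl master

mean2(x) = sum(x) / length(x)

# Ratio of mean runtimes; values above 1 mean the PR branch is slower.
ratio = mean2(pr_runs) / mean2(master_runs)
slowdown_percent = round(100 * (ratio - 1); digits = 1)

println("PR is about ", slowdown_percent, "% slower than master")
```

With only two runs per branch this is a rough estimate, but the spread between runs (a few milliseconds) is small compared with the gap between branches (roughly 137 ms).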

@Alexander-Barth, for me raw file io is rarely the bottleneck of my code (disregarding network file system overhead). Would it be acceptable to merge the PR like this and work on performance issues after?
Improvements to performance could require updates to DiskArrays.jl.

rafaqz · 2025-09-09T01:23:34Z

Would be good to get a more pinpoint understanding of what is slower, and nice to get those timings to match. But we should note that we have been optimizing different things. If you had broadcasts in the tests they would be faster overall.

(and yes, merging first then optimizing later is best for me; this PR will risk dying here if it's not merged until performance is identical, and it will be easier to help if branches aren't needed for all packages involved)

lupemba · 2025-09-09T06:11:36Z

Would be good to get a more pinpoint understanding of what is slower

I will profile the benchmark to find out where the time increase comes from.

Alexander-Barth · 2025-09-09T12:51:07Z

Here are the numbers that I got on Linux (with drop-cache), compared to python and R:

Module	median	minimum	mean	std. dev.
R-ncdf4	0.545	0.509	0.546	0.016
python-netCDF4	0.693	0.685	0.695	0.008
julia-NCDatasets (master)	0.498	0.492	0.500	0.008
julia-NCDatasets (pr)	0.702	0.675	0.706	0.016

Before, we were faster than R-ncdf4 and python-netCDF4 for this benchmark. Now it seems that we are about the same as python-netCDF4...
(but of course broadcasting is supported with the PR and much faster than the default fall-back)

Alexander-Barth · 2025-09-09T13:30:10Z

Maybe this block, getindex_disk_nobatch!, is creating the overhead. I don't see it in the master version.
Note that the master version also uses DiskArrays, but at a deeper/different level. I am still surprised by this change in runtime.

Maybe we are allocating the memory unnecessarily?

Line 101, is an allocation if out is Nothing:

https://github.com/JuliaIO/DiskArrays.jl/blob/main/src/indexing.jl#L101
https://github.com/JuliaIO/DiskArrays.jl/blob/main/src/indexing.jl#L208-L214

tiemvanderdeure · 2025-09-09T13:41:08Z

src/cfvariable.jl

+    data = similar(aout, eltype(parent_var))
+    DiskArrays.readblock!(parent_var, data, indexes...)
+
+    aout .= CFtransformdata(data,fill_and_missing_values(v),scale_factor(v),add_offset(v),


Suggested change

aout .= CFtransformdata(data,fill_and_missing_values(v),scale_factor(v),add_offset(v),

CFtransformdata!(aout, data,fill_and_missing_values(v),scale_factor(v),add_offset(v),

time_origin(v),time_factor(v),maskingvalue(v))

I think using CFtransformdata! resolves the performance issue.

As it currently stands, this allocates unnecessarily.

I just quickly tried the change and it fixes the performance issues on my Mac. The slowdown relative to master is now just 5% in the test I ran.

Note that the change also requires removing eltype(v) from the function input.
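The allocation difference behind this suggestion can be reproduced with a small Base-only sketch. `toy_transform` and `toy_transform!` are hypothetical stand-ins for `CFtransformdata` and `CFtransformdata!` that apply only a scale and an offset; the real functions also handle fill values, time units, and masking.

```julia
# Hypothetical stand-ins for CFtransformdata / CFtransformdata!.
toy_transform(data, scale, offset) = data .* scale .+ offset   # returns a new array

function toy_transform!(out, data, scale, offset)
    out .= data .* scale .+ offset                              # fused, fills `out`
    return out
end

# Pattern before the fix: build a temporary array, then copy it into `out`.
function fill_with_copy!(out, data)
    out .= toy_transform(data, 0.01, 20.0)
    return out
end

# Pattern after the fix: write straight into `out`.
fill_in_place!(out, data) = toy_transform!(out, data, 0.01, 20.0)

data = rand(Int16, 512, 512)
out_copy = Matrix{Float64}(undef, size(data))
out_inplace = similar(out_copy)

# Warm up so that compilation is not counted in the measurements.
fill_with_copy!(out_copy, data)
fill_in_place!(out_inplace, data)

bytes_copy = @allocated fill_with_copy!(out_copy, data)
bytes_inplace = @allocated fill_in_place!(out_inplace, data)
println("copy: ", bytes_copy, " bytes, in place: ", bytes_inplace, " bytes")
```

Both variants produce the same values; the first one additionally allocates a temporary array the size of the output on every call.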

lupemba · 2025-09-09T14:24:51Z

@rafaqz and @Alexander-Barth,

I have made a commit based on @tiemvanderdeure's suggestion and I now observe just a 5% slowdown compared with the master branch for the benchmark 😄

Alexander-Barth · 2025-09-10T07:03:12Z

This is great! Thanks a lot to @lupemba @tiemvanderdeure and @rafaqz !

Alexander-Barth · 2025-09-10T07:42:41Z

Is there anything left before we make a release of CommonDataModel?

For the release notes:

CommonDataModel.AbstractVariable is now a subtype of DiskArrays.AbstractDiskArray
Broadcasting now benefits from the optimized methods in DiskArrays
CommonDataModel.SubVariable is no longer a subtype of CommonDataModel.AbstractVariable

Is there more to be said about breaking behavior?

tiemvanderdeure · 2025-09-10T13:18:15Z

Is there more to be said about breaking behavior?

Maybe add this?

This update will require packages that implement the CommonDataModel interface to stop defining getindex and setindex!. Instead, they should define DiskArrays.readblock! and DiskArrays.writeblock!, which most of them already do.

Make AbstractVariable a subtype of AbstractDiskArray

81f24cd

lupemba commented May 3, 2025

felixcremer marked this pull request as ready for review May 15, 2025 16:36

lupemba mentioned this pull request May 28, 2025

add abstract types for views and permuted arrays JuliaIO/DiskArrays.jl#255

Merged

lupemba added 2 commits June 19, 2025 19:08

Bump DiskArrays

a30bf71

Bump version and minor fixes based on NCDatasets tests

2e06265

lupemba mentioned this pull request Jun 22, 2025

Prepare common data model 0.4 JuliaGeo/NCDatasets.jl#279

Merged

lupemba added 2 commits June 23, 2025 10:32

Forward readblock! and writeblock! CFVariable

7c382b8

Replace DiskArrays._replace_colon with two copied lines.

0321c61

lupemba marked this pull request as draft June 23, 2025 08:55

Implement DiskArrays.readblock! for ReducedGroupedVariable

c6cb2f0

lupemba commented Jun 23, 2025

Update documentation and add MetopDatasets.jl to integration tests

f9df8eb

lupemba marked this pull request as ready for review June 27, 2025 12:20

tiemvanderdeure mentioned this pull request Jul 8, 2025

DiskIndex is not inferred JuliaIO/DiskArrays.jl#266

Closed

Alexander-Barth and others added 5 commits August 9, 2025 18:33

add Base.propertynames (issue JuliaGeo#34)

1edfc96

remove trailing white-space

7232d96

GLMakie is not used

1472682

fix version numbers for documentation

5159cb7

Broken test works with new DiskArrays.

e9f6418

felixcremer mentioned this pull request Aug 13, 2025

Raster from sub Zarr group fails on getindex rafaqz/Rasters.jl#1001

Open

Merge branch 'main' into subtype-AbstractDiskArray

5d1a5e8

rafaqz approved these changes Sep 1, 2025

rafaqz reviewed Sep 1, 2025

tiemvanderdeure reviewed Sep 9, 2025

fix performance issue.

f31d810

Alexander-Barth merged commit 94433fa into JuliaGeo:main Sep 10, 2025
7 of 10 checks passed

Alexander-Barth mentioned this pull request Sep 10, 2025

Broadcast over CFVariable is very slow #21

Closed

This was referenced Sep 10, 2025

cfvariable type promotion works different for scalar indexing. #33

Closed

DiskArrays.jl compatability #8

Closed

DiskArrays for CFVariable #9

Closed

asinghvi17 mentioned this pull request Sep 28, 2025

Forward DiskArrays.jl methods to Variable JuliaGeo/GRIBDatasets.jl#34

Closed


