Skip to content

Conversation

@lupemba
Copy link
Contributor

@lupemba lupemba commented May 3, 2025

This PR requires JuliaIO/DiskArrays.jl#255 and JuliaIO/DiskArrays.jl#260

See #32 for background.

This is a draft of how we can make AbstractVariable <: AbstractDiskArray plus how to use the AbstractSubDiskArray for views.

This draft does not address:

  • Performance
  • (solved) Documentation
  • PermutedDiskArray

This PR will probably contain a few breaking changes but most things will behave identically. One notable change is that SubVariable is not longer a subtype of AbstractVariable.

This update will requirer packages that implement the CommonDataModel interface to stop defining getindex and setindex. Instead they should define DiskArrays.readblock!and DiskArrays.writeblock! which most of them already do.

Base.BroadcastStyle(::DefaultArrayStyle,::ReducedGroupedVariableStyle) = ReducedGroupedVariableStyle()
Base.BroadcastStyle(::ReducedGroupedVariableStyle,::DefaultArrayStyle) = ReducedGroupedVariableStyle()

Base.BroadcastStyle(::DiskArrays.ChunkStyle,::ReducedGroupedVariableStyle) = ReducedGroupedVariableStyle()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReducedGroupedVariable use different broadcasting than DiskArrays. This might lead to some confusion.

and data are copied from `src` as well as the variable name (unless provide by `name`).
"""
function defVar(dest::AbstractDataset,varname::SymbolOrString,srcvar::AbstractVariable; kwargs...)
function defVar(dest::AbstractDataset,varname::SymbolOrString,srcvar::Union{AbstractVariable, SubVariable}; kwargs...)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated ::AbstractVariable to ::Union{AbstractVariable, SubVariable} the places I thought it made sense but I might have missed some.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't SubVariable <: AbstractVariable anyway ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is indeed the case

struct SubVariable{T,N,TA,TI,TAttrib,TV} <: AbstractVariable{T,N}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, see the changes to types.jl

https://github.com/JuliaGeo/CommonDataModel.jl/pull/35/files#diff-525588e68b2421901be164965272940effa093de733a3fd36c4b9a4344b8c20cR55

SubVariable is no longer a subtype of AbstractVariable but is now a subtype of AbstractSubDiskArray

struct SubVariable{T,N,P,I,L} <: DiskArrays.AbstractSubDiskArray{T,N,P,I,L}

This is done to use the DiskArray implementation of views as discussed in JuliaGeo/NCDatasets.jl#274

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah of course

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have add ::Union{AbstractVariable, SubVariable} all the places needed to extend the following functions for SubVariable: filter, coord, ancillaryvariables, groupby, select. I think that should cover the public interface of CommonDataModel



function DiskArrays.haschunks(v::AbstractVariable)
storage, chunksizes = chunking(v)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried reuse the exiting chunking method.

@lupemba
Copy link
Contributor Author

lupemba commented May 3, 2025

@rafaqz @tiemvanderdeure @Alexander-Barth,
Here is my work so far. The tests passes when I add the DiskArrays changes from JuliaIO/DiskArrays.jl#255 and JuliaIO/DiskArrays.jl#260

@rafaqz
Copy link
Member

rafaqz commented May 3, 2025

Great, we should get those DiskArrays PRs in then (sorry got distracted with other things)

@tiemvanderdeure
Copy link
Contributor

Yeah we just need JuliaIO/DiskArrays.jl#249 and then we can merge JuliaIO/DiskArrays.jl#255. I think both are pretty much ready to merge?

@rafaqz
Copy link
Member

rafaqz commented May 5, 2025

I just need to fix views in 249

@lupemba lupemba marked this pull request as draft June 23, 2025 08:55
aout,
indexes::Vararg{OrdinalRange, N}) where {T,N}

aout .= Base.getindex(gr,indexes...)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently Base.getindex is defined for ReducedGroupedVariable so the DiskArray.jl implementation is not used. An alternative would be to use the DiskArray getindex and implement more logic in readblock!. This is more ReducedGroupedVariable and I can not see any benefit of doing it right now.

@lupemba lupemba marked this pull request as ready for review June 27, 2025 12:20
@lupemba
Copy link
Contributor Author

lupemba commented Jun 27, 2025

@rafaqz @tiemvanderdeure @Alexander-Barth,

This PR is now ready for review. There is still some unresolved comments that we will have to discuss before merging the PR.
I have made draft PRs to all the "datasets" packages to test that the this update will work.

Most of the changes are small but TIFFDatasets required a bit more work since it was not using DiskArrays.

I think the path forward should be.

  • Review this PR and solve outstanding comments.
  • Do some minor performance tests
  • Merge this PR and release CommonDataModel.jl v0.4
  • Review NCDatasets.jl PR and rerun CI.
  • Release new NCDatasets.jl version. Some of the other packages uses NCDatasets as a test dependency.
  • Update the remaining packages.

@lupemba
Copy link
Contributor Author

lupemba commented Jun 27, 2025

The CI / Documentation is failing because https://psl.noaa.gov/thredds/fileServer/Datasets/noaa.oisst.v2.highres/sst.day.mean.2023.nc is unavailable at the moment.

@felixcremer
Copy link
Member

I just started playing around with it in Rasters and this will also need some overhaul in Rasters.jl to bring it to the new behaviour.

@rafaqz
Copy link
Member

rafaqz commented Aug 13, 2025

Yeah Rasters had to hack around the old behavior a bit to get at the inner disk arrays, maybe some of that will break.

@tiemvanderdeure
Copy link
Contributor

Probably we can just revert rafaqz/Rasters.jl#892 once all of this is working

@Alexander-Barth
Copy link
Member

This looks really good ! Thank you @lupemba !!!

What kind of breaking changes do you expect (besides SubVariable is not longer a subtype of AbstractVariable) ?

@lupemba
Copy link
Contributor Author

lupemba commented Aug 27, 2025

What kind of breaking changes do you expect (besides SubVariable is not longer a subtype of AbstractVariable) ?

@Alexander-Barth, I am glad you like the PR 😄 The main breaking change is that packages implementing CommonDataModel.jl have to define DiskArray.readblock! and DiskArray.writeblock! instead of Base.getindex and Base.setindex! when implementing a new AbstractVariable. This PR to ZarrDatasets.jl is a good example https://github.com/JuliaGeo/ZarrDatasets.jl/pull/13/files

This is the only other breaking change that I am aware of. The effect on the users of the ***Datasets.jl packages should be very minimal. I therefore think we can get away with a non breaking updates to the ***Datasets.jl packages.

@lupemba
Copy link
Contributor Author

lupemba commented Aug 31, 2025

@Alexander-Barth and @rafaqz
It would be nice if you could approve this PR or let me know if anything more needs to be change. I would like to merge this within the next week so it doesn't drag out forever and go stale.


root = _root(v)
for idim = findall(size(v) .> sz)
dname = v.dimnames[idim]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dname = v.dimnames[idim]
dname = dimnames(v)[idim]

There is a bunch of direct field access in this PR for fields that have interface methods

Copy link
Contributor Author

@lupemba lupemba Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this code is simply copied from the Base.setindex! method that was before. I think the best approach is to open a separate issue and PR to reduce "direct field access". This can be looked at after this PR is completed.
#42

I don't think we should expand the scope of this PR anymore.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it just looks like new code to me and you use assessor methods everywhere else...

@lupemba
Copy link
Contributor Author

lupemba commented Sep 8, 2025

@Alexander-Barth and @rafaqz.
Is there something missing or can we merge this PR soon?
I would really like to get it over the finish line in the next couple of days.

@rafaqz
Copy link
Member

rafaqz commented Sep 8, 2025

I would too! Buy I can't really merge without an approval from @Alexander-Barth

@Alexander-Barth
Copy link
Member

Thanks @lupemba thank you for great work!
I am wondering if we have already some performance test?
Maybe this is useful:
https://github.com/JuliaGeo/NCDatasets.jl/blob/master/test/perf/benchmark-julia-NCDatasets.jl

@lupemba
Copy link
Contributor Author

lupemba commented Sep 8, 2025

Thanks @lupemba thank you for great work! I am wondering if we have already some performance test? Maybe this is useful: https://github.com/JuliaGeo/NCDatasets.jl/blob/master/test/perf/benchmark-julia-NCDatasets.jl

I don't have any linux machine where I can use sodu to run the benchmark. Theses results are therefore just from my mac with cache enabled using julia 1.11.4. I did two runs on each branch.

It looks like this PR makes NCDatasets.jl around 50% slower for the benchmark.

@Alexander-Barth, for me raw file io is rarely the bottleneck of my code (disregarding network file system overhead). Would it be acceptable to merge the PR like this and work on performance issues after?
Improvements to performance could require updates to DiskArray.jl.

@rafaqz
Copy link
Member

rafaqz commented Sep 9, 2025

Would be good to get a more pinpoint understanding of what is slower, and nice to get those timings to match. But we should note that we have been optimizing different things. If you had broadcasts in the tests they would be faster overall.

(and yes merging first then optimizing later is best for me, this PR will risk dying here if its not merged until performance is identical, and it will be easier to help if branches aren't needed for all packages involved)

@lupemba
Copy link
Contributor Author

lupemba commented Sep 9, 2025

Would be good to get a more pinpoint understanding of what is slower

I will profile the benchmark to find out where the time increase comes from.

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Sep 9, 2025

Here are the numbers that I got on Linux (with drop-cache), compared to python and R:

Module median minimum mean std. dev.
R-ncdf4 0.545 0.509 0.546 0.016
python-netCDF4 0.693 0.685 0.695 0.008
julia-NCDatasets (master) 0.498 0.492 0.500 0.008
julia-NCDatasets (pr) 0.702 0.675 0.706 0.016

Before we were faster than R-ncdf4 and python-netCDF4 for this benchmark. Now it seems that we are about the same as python-netCDF4...
(but of course broadcasting is supported with the PR and much faster that the default fall-back)

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Sep 9, 2025

grafik

Maybe this block getindex_disk_nobatch! is creating the overhead. I don't see it in the master version.
Note that also in the master version is also using DiskArrays, but at deeper/different level. I am still surprised by this change in runtime.

Maybe we are allocating the memory unnecessarily?

Line 101, is an allocation if out is Nothing:

https://github.com/JuliaIO/DiskArrays.jl/blob/main/src/indexing.jl#L101
https://github.com/JuliaIO/DiskArrays.jl/blob/main/src/indexing.jl#L208-L214

data = similar(aout, eltype(parent_var))
DiskArrays.readblock!(parent_var, data, indexes...)

aout .= CFtransformdata(data,fill_and_missing_values(v),scale_factor(v),add_offset(v),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
aout .= CFtransformdata(data,fill_and_missing_values(v),scale_factor(v),add_offset(v),
CFtransformdata!(aout, data,fill_and_missing_values(v),scale_factor(v),add_offset(v),
time_origin(v),time_factor(v),maskingvalue(v))

using CFtransformdata! I think resolves the performance issue

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it currently is this allocates unnecessarily

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just quickly tried the change and it fixes the performance issues on my Mac. The slow down relative to master is now just 5% in the test I run.

Note that the change also requires removing eltype(v) from the function input.

@lupemba
Copy link
Contributor Author

lupemba commented Sep 9, 2025

@rafaqz and @Alexander-Barth,

I have made a commit base on @tiemvanderdeure suggestion and I now just observe a 5% slowdown compared with the master branch for the benchmark 😄

@Alexander-Barth
Copy link
Member

This is great! Thanks a lot to @lupemba @tiemvanderdeure and @rafaqz !

@Alexander-Barth Alexander-Barth merged commit 94433fa into JuliaGeo:main Sep 10, 2025
7 of 10 checks passed
@Alexander-Barth
Copy link
Member

Is there anything left before we make a release of CommonDataModel?

For the release notes:

  • CommonDataModel.AbstractVariable is now a subtype of DiskArrays.AbstractDiskArray
  • Broadcasting benefit now from the optimized methods from DiskArrays
  • CommonDataModel.SubVariable is not longer a subtype of CommonDataModel.AbstractVariable

It there more to be said about breaking behavior?

@tiemvanderdeure
Copy link
Contributor

It there more to be said about breaking behavior?

Maybe add this?

This update will requirer packages that implement the CommonDataModel interface to stop defining getindex and setindex. Instead they should define DiskArrays.readblock!and DiskArrays.writeblock! which most of them already do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants