- 
        Couldn't load subscription status. 
- Fork 6
Make AbstractVariable a subtype of AbstractDiskArray #35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make AbstractVariable a subtype of AbstractDiskArray #35
Conversation
| Base.BroadcastStyle(::DefaultArrayStyle,::ReducedGroupedVariableStyle) = ReducedGroupedVariableStyle() | ||
| Base.BroadcastStyle(::ReducedGroupedVariableStyle,::DefaultArrayStyle) = ReducedGroupedVariableStyle() | ||
|  | ||
| Base.BroadcastStyle(::DiskArrays.ChunkStyle,::ReducedGroupedVariableStyle) = ReducedGroupedVariableStyle() | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReducedGroupedVariable use different broadcasting than DiskArrays. This might lead to some confusion.
| and data are copied from `src` as well as the variable name (unless provide by `name`). | ||
| """ | ||
| function defVar(dest::AbstractDataset,varname::SymbolOrString,srcvar::AbstractVariable; kwargs...) | ||
| function defVar(dest::AbstractDataset,varname::SymbolOrString,srcvar::Union{AbstractVariable, SubVariable}; kwargs...) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated ::AbstractVariable to ::Union{AbstractVariable, SubVariable} the places I thought it made sense but I might have missed some.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't SubVariable <: AbstractVariable anyway ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this is indeed the case
CommonDataModel.jl/src/types.jl
Line 158 in 68f2050
| struct SubVariable{T,N,TA,TI,TAttrib,TV} <: AbstractVariable{T,N} | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, see the changes to types.jl
SubVariable is no longer a subtype of AbstractVariable but is now a subtype of AbstractSubDiskArray
struct SubVariable{T,N,P,I,L} <: DiskArrays.AbstractSubDiskArray{T,N,P,I,L}This is done to use the DiskArray implementation of views as discussed in JuliaGeo/NCDatasets.jl#274
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have add ::Union{AbstractVariable, SubVariable} all the places needed to extend the following functions for SubVariable: filter, coord, ancillaryvariables, groupby, select. I think that should cover the public interface of CommonDataModel
|  | ||
|  | ||
| function DiskArrays.haschunks(v::AbstractVariable) | ||
| storage, chunksizes = chunking(v) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried reuse the exiting chunking method.
| @rafaqz @tiemvanderdeure @Alexander-Barth, | 
| Great, we should get those DiskArrays PRs in then (sorry got distracted with other things) | 
| Yeah we just need JuliaIO/DiskArrays.jl#249 and then we can merge JuliaIO/DiskArrays.jl#255. I think both are pretty much ready to merge? | 
| I just need to fix views in 249 | 
| aout, | ||
| indexes::Vararg{OrdinalRange, N}) where {T,N} | ||
|  | ||
| aout .= Base.getindex(gr,indexes...) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently Base.getindex is defined for ReducedGroupedVariable so the DiskArray.jl implementation is not used. An alternative would be to use the DiskArray getindex and implement more logic in readblock!.  This is more ReducedGroupedVariable and I can not see any benefit of doing it right now.
| @rafaqz @tiemvanderdeure @Alexander-Barth, This PR is now ready for review. There is still some unresolved comments that we will have to discuss before merging the PR. 
 Most of the changes are small but TIFFDatasets required a bit more work since it was not using DiskArrays. I think the path forward should be. 
 | 
| The CI / Documentation is failing because https://psl.noaa.gov/thredds/fileServer/Datasets/noaa.oisst.v2.highres/sst.day.mean.2023.nc is unavailable at the moment. | 
| I just started playing around with it in Rasters and this will also need some overhaul in Rasters.jl to bring it to the new behaviour. | 
| Yeah Rasters had to hack around the old behavior a bit to get at the inner disk arrays, maybe some of that will break. | 
| Probably we can just revert rafaqz/Rasters.jl#892 once all of this is working | 
| This looks really good ! Thank you @lupemba !!! What kind of breaking changes do you expect (besides SubVariable is not longer a subtype of AbstractVariable) ? | 
| 
 @Alexander-Barth, I am glad you like the PR 😄  The main breaking change is that packages implementing CommonDataModel.jl have to define  This is the only other breaking change that I am aware of. The effect on the users of the ***Datasets.jl packages should be very minimal. I therefore think we can get away with a non breaking updates to the ***Datasets.jl packages. | 
| @Alexander-Barth  and @rafaqz | 
|  | ||
| root = _root(v) | ||
| for idim = findall(size(v) .> sz) | ||
| dname = v.dimnames[idim] | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| dname = v.dimnames[idim] | |
| dname = dimnames(v)[idim] | 
There is a bunch of direct field access in this PR for fields that have interface methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this code is simply copied from the Base.setindex! method that was before. I think the best approach is to open a separate issue and PR to reduce "direct field access". This can be looked at after this PR is completed.
#42
I don't think we should expand the scope of this PR anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it just looks like new code to me and you use assessor methods everywhere else...
| @Alexander-Barth and @rafaqz. | 
| I would too! Buy I can't really merge without an approval from @Alexander-Barth | 
| Thanks @lupemba thank you for great work! | 
| 
 I don't have any linux machine where I can use sodu to run the benchmark. Theses results are therefore just from my mac with cache enabled using julia 1.11.4. I did two runs on each branch. 
 It looks like this PR makes NCDatasets.jl around 50% slower for the benchmark. @Alexander-Barth, for me raw file io is rarely the bottleneck of my code (disregarding network file system overhead). Would it be acceptable to merge the PR like this and work on performance issues after? | 
| Would be good to get a more pinpoint understanding of what is slower, and nice to get those timings to match. But we should note that we have been optimizing different things. If you had broadcasts in the tests they would be faster overall. (and yes merging first then optimizing later is best for me, this PR will risk dying here if its not merged until performance is identical, and it will be easier to help if branches aren't needed for all packages involved) | 
| 
 I will profile the benchmark to find out where the time increase comes from. | 
| Here are the numbers that I got on Linux (with drop-cache), compared to python and R: 
 Before we were faster than R-ncdf4 and python-netCDF4 for this benchmark. Now it seems that we are about the same as python-netCDF4... | 
|   Maybe this block  Maybe we are allocating the memory unnecessarily? Line 101, is an allocation if   https://github.com/JuliaIO/DiskArrays.jl/blob/main/src/indexing.jl#L101 | 
        
          
                src/cfvariable.jl
              
                Outdated
          
        
      | data = similar(aout, eltype(parent_var)) | ||
| DiskArrays.readblock!(parent_var, data, indexes...) | ||
|  | ||
| aout .= CFtransformdata(data,fill_and_missing_values(v),scale_factor(v),add_offset(v), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| aout .= CFtransformdata(data,fill_and_missing_values(v),scale_factor(v),add_offset(v), | |
| CFtransformdata!(aout, data,fill_and_missing_values(v),scale_factor(v),add_offset(v), | |
| time_origin(v),time_factor(v),maskingvalue(v)) | 
using CFtransformdata! I think resolves the performance issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it currently is this allocates unnecessarily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just quickly tried the change and it fixes the performance issues on my Mac. The slow down relative to master is now just 5% in the test I run.
Note that the change also requires removing eltype(v) from the function input.
| @rafaqz and @Alexander-Barth, I have made a commit base on @tiemvanderdeure suggestion and I now just observe a 5% slowdown compared with the master branch for the benchmark 😄 | 
| This is great! Thanks a lot to @lupemba @tiemvanderdeure and @rafaqz ! | 
| Is there anything left before we make a release of CommonDataModel? For the release notes: 
 It there more to be said about breaking behavior? | 
| 
 Maybe add this? 
 | 
This PR requires JuliaIO/DiskArrays.jl#255 and JuliaIO/DiskArrays.jl#260
See #32 for background.
This is a draft of how we can make
AbstractVariable <: AbstractDiskArrayplus how to use theAbstractSubDiskArrayfor views.This draft does not address:
PermutedDiskArrayThis PR will probably contain a few breaking changes but most things will behave identically. One notable change is that
SubVariableis not longer a subtype ofAbstractVariable.This update will requirer packages that implement the CommonDataModel interface to stop defining
getindexandsetindex. Instead they should defineDiskArrays.readblock!andDiskArrays.writeblock!which most of them already do.