Skip to content

Add functionality to use arbitrary singular values in ConcatDiskArrays #256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: fg/cat
Choose a base branch
from

Conversation

felixcremer
Copy link
Contributor

This adds the possibility to use any value instead of only missing in ConcatDiskArrays.
We can even mix the singular values.

@felixcremer
Copy link
Contributor Author

felixcremer commented Apr 11, 2025

@rafaqz this is the functionality that you wanted in #248 . This should make #248 good to go.

src/cat.jl Outdated
et = Base.nonmissingtype(eltype(arrays))
T = Union{Missing,eltype(et)}
T = promotetype(typeof(fill), eltype(et))
Copy link
Collaborator

@rafaqz rafaqz Apr 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the passed in array would still contain either arrays or missing values, but we would just return fill for those sections, so fill would need to be a struct field.

So what we have here is kinda half way. Probably we should either drop the fill keyword (and just rely on the array values) or keep it as a struct field and use missing in the array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to drop the fill keyword. I actual thought I already did so.
With my approach we can have different fill values in different parts of the concatenated array.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But is that actually useful? I can't imagine a real use case for multiple fill values. And putting missing in the array seems clearer for representing missing arrays.

The missing value used in the array of arrays doesn't have to be coupled to the fill value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it might be a bit dangerous to allow arbitrary values to represent missings. OTOH one never knows when this might become useful for some reason. I am ok with both ways and would currently slightly prefer @felixcremer s solution because there is already an implementation. If we want to be a bit more on the safe side, we could introduce a small wrapper type like

struct MissingTile{F}
    fillvalue::F
end

that explicitly marks missing entries to be concatenated and how they should be filled. The main reason why the current solution might be problematic is for DiskArrays with arrays as eltype, but maybe this is overthinking things a bit too much.

felixcremer and others added 2 commits April 12, 2025 22:27
Co-authored-by: Rafael Schouten <[email protected]>
Co-authored-by: Rafael Schouten <[email protected]>
src/cat.jl Outdated
et = Base.nonmissingtype(eltype(arrays))
T = Union{Missing,eltype(et)}
T = promotetype(typeof(fill), eltype(et))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
T = promotetype(typeof(fill), eltype(et))
T = Union{Missing,eltype(et)}

Copy link
Collaborator

@rafaqz rafaqz Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you have to read all the fill values in the array and promote them? Missing is not necessarily in the type?

Co-authored-by: Rafael Schouten <[email protected]>
@felixcremer
Copy link
Contributor Author

Why do the tests not run on this pr?

@felixcremer
Copy link
Contributor Author

I took this for a spin and I am not sure, whether it is this change or #256 or ConcatDiskArray even before but saving a ConcatDiskArray where I have some missings takes a very long time. Saving the european dataset took the whole weekend.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants