Estimate final size of a dataset #9954
-
In one of the applications we are developing, we use this code to estimate the size that a dataset will have when saved to a file. So, given a dataset:

```python
# Product of all dimension sizes
coordinates_size = 1
for coordinate_name in dataset.sizes:
    coordinates_size *= dataset[coordinate_name].size

# Multiply by the number of data variables and the item size of the first
# one, then divide to convert bytes to (roughly) megabytes
estimate_size = (
    coordinates_size
    * len(list(dataset.data_vars))
    * dataset[list(dataset.data_vars)[0]].dtype.itemsize
    / 1048e3
)
```

Basically, we multiply the sizes of all dimensions by the number of data variables and the item size of the values we are saving. It gets trickier, though, when we want to estimate the size of a compressed output file. Do any of you have an idea on how to tackle this? I can just tell the users that the final size will be smaller than this estimate, or multiply it by a static factor, but I would like to know if there is some way to approximate it... some sort of 'predicting' the factor from the variance of the dataset and its size? Thanks in advance, and I hope it is intriguing for you as well!
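For a concrete sense of what the formula above produces, here is a toy illustration (added for this write-up; the dataset contents and variable names are hypothetical):

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset: two float64 variables on a 100 x 200 grid
dataset = xr.Dataset(
    {
        "temperature": (("x", "y"), np.zeros((100, 200))),
        "pressure": (("x", "y"), np.zeros((100, 200))),
    },
    coords={"x": np.arange(100), "y": np.arange(200)},
)

# Formula above: 100 * 200 grid points * 2 variables * 8 bytes / 1048e3
# ≈ 0.31 "MB" of raw, uncompressed values
```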
-
Is there a reason why you don't just use `ds.nbytes`?
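For reference, a minimal usage sketch (the file name below is hypothetical):

```python
import xarray as xr

ds = xr.open_dataset("example.nc")  # hypothetical input file

# Total size of all data variables and coordinates, in bytes
print(f"~{ds.nbytes / 1e6:.1f} MB uncompressed")
```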
Knowing the compression factor exactly without looking at all the data is impossible by definition, because the size of the final file depends entirely on the actual data values in your arrays: if your arrays all contained the same value repeated over and over, any decent compression algorithm should compress that down to almost nothing, but if your data contained random, uncorrelated noise, lossless compression won't make it smaller at all (and may even make it bigger).
There might be some clever statistical way of estimating this, but working out the compression factor beyond a rough rule of thumb is probably a similar amount of work to just compressing the data and finding out... Regardless, whilst it's an interesting question, it's not really one that has much to do with xarray specifically - you might be better off asking on stackoverflow or similar. Sorry!
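If a rough number is still wanted, one pragmatic middle ground (my own suggestion, not from the reply above) is to compress a small sample of each variable and extrapolate the observed ratio; note that NetCDF4/Zarr compress chunk by chunk, often with a shuffle filter, so the real on-disk ratio will differ somewhat:

```python
import zlib
import numpy as np
import xarray as xr

def estimate_compressed_size(ds: xr.Dataset, sample_bytes: int = 10_000_000) -> float:
    """Very rough estimate: zlib-compress a leading slice of each data
    variable and scale the observed ratio up to the whole dataset."""
    sampled_raw = 0
    sampled_packed = 0
    for var in ds.data_vars.values():
        raw = np.ascontiguousarray(var.values).tobytes()[:sample_bytes]
        sampled_raw += len(raw)
        sampled_packed += len(zlib.compress(raw, level=6))
    if sampled_raw == 0:
        return float(ds.nbytes)
    return ds.nbytes * sampled_packed / sampled_raw
```

A leading slice is not always representative (e.g. data sorted by time), so sampling a few scattered chunks instead would give a safer estimate.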