Estimate final size of a dataset #9954
-
In one of the applications we are developing, we use this code to estimate the size that a dataset will have when saved to a file. So, given a dataset:

```python
# Product of all dimension sizes
coordinates_size = 1
for coordinate_name in dataset.sizes:
    coordinates_size *= dataset[coordinate_name].size

# Multiply by the number of data variables and the item size of the first
# one, then divide to convert bytes to (roughly) megabytes
estimate_size = (
    coordinates_size
    * len(list(dataset.data_vars))
    * dataset[list(dataset.data_vars)[0]].dtype.itemsize
    / 1048e3
)
```

Basically, we multiply the sizes of all dimensions by the number of data variables and the item size of the values we are saving. It gets trickier, though, when we want to estimate the size of a compressed output file. Do any of you have an idea on how to tackle this? I can just tell the users that the final size will be smaller than this estimate, or multiply it by a static factor, but I would like to know if there is some way to approximate it... some sort of 'predicting' the factor from the variance of the dataset and its size? Thanks in advance, and I hope it is intriguing for you as well!
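For a concrete sense of what the formula above produces, here is a toy illustration (added for this write-up; the dataset contents and variable names are hypothetical):

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset: two float64 variables on a 100 x 200 grid
dataset = xr.Dataset(
    {
        "temperature": (("x", "y"), np.zeros((100, 200))),
        "pressure": (("x", "y"), np.zeros((100, 200))),
    },
    coords={"x": np.arange(100), "y": np.arange(200)},
)

# Formula above: 100 * 200 grid points * 2 variables * 8 bytes / 1048e3
# ≈ 0.31 "MB" of raw, uncompressed values
```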
-
Is there a reason why you don't just use `ds.nbytes`?
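For reference, a minimal usage sketch (the file name below is hypothetical):

```python
import xarray as xr

ds = xr.open_dataset("example.nc")  # hypothetical input file

# Total size of all data variables and coordinates, in bytes
print(f"~{ds.nbytes / 1e6:.1f} MB uncompressed")
```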
Knowing the compression factor exactly without looking at all the data is impossible by definition, because the size of the final file depends entirely on the actual data values in your arrays: if your arrays all contained the same value repeated over and over, any decent compression algorithm should compress that down to almost nothing, but if your data contained random, uncorrelated noise, lossless compression won't make it smaller at all (and may even make it bigger).
There might be some clever statistical way of estimating this, but working out the compression factor beyond a rough rule of thumb is probably a similar amount of work to just compressing the data and finding out... Regardless, whilst it's an interesting question, it's not really one that has much to do with xarray specifically - you might be better off asking on stackoverflow or similar. Sorry!
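If a rough number is still wanted, one pragmatic middle ground (my own suggestion, not from the reply above) is to compress a small sample of each variable and extrapolate the observed ratio; note that NetCDF4/Zarr compress chunk by chunk, often with a shuffle filter, so the real on-disk ratio will differ somewhat:

```python
import zlib
import numpy as np
import xarray as xr

def estimate_compressed_size(ds: xr.Dataset, sample_bytes: int = 10_000_000) -> float:
    """Very rough estimate: zlib-compress a leading slice of each data
    variable and scale the observed ratio up to the whole dataset."""
    sampled_raw = 0
    sampled_packed = 0
    for var in ds.data_vars.values():
        raw = np.ascontiguousarray(var.values).tobytes()[:sample_bytes]
        sampled_raw += len(raw)
        sampled_packed += len(zlib.compress(raw, level=6))
    if sampled_raw == 0:
        return float(ds.nbytes)
    return ds.nbytes * sampled_packed / sampled_raw
```

A leading slice is not always representative (e.g. data sorted by time), so sampling a few scattered chunks instead would give a safer estimate.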