
Using level_of_measurement='interval' with arbitrary values is extremely memory intensive and slow #38

Open
agrizzli opened this issue Jan 17, 2025 · 0 comments


agrizzli commented Jan 17, 2025

I am computing Krippendorff's alpha with level_of_measurement='interval' on arbitrary data without a restricted value domain. This means there are:

  • gold standard values [-1.0, 0.0, 1.0] (see the attached targets.csv file)
  • predicted values as real numbers within the mathematical interval -1.25 to 1.25 (see the attached preds.csv file)
import numpy as np
import krippendorff

preds = np.loadtxt('preds.csv', delimiter=',')
targets = np.loadtxt('targets.csv', delimiter=',')

krippendorff.alpha(reliability_data=[preds, targets], level_of_measurement='interval')

When running the computation, the process uses an extremely large amount of memory and is much slower than the same computation with the restricted domain values [-1.0, -0.5, 0.0, 0.5, 1.0]. The more items the data contains, the slower and more memory-hungry it becomes. I tested it on two systems with about 2,400 data points in total:

  1. MacOS with python=3.9.6, numpy=1.26.4, and krippendorff=0.6.0

=> The computation takes more than 60 seconds and consumes 32 GB of memory, but finishes.

  2. Ubuntu Linux with python=3.10.12, numpy=1.26.4, and krippendorff=0.6.0

=> The computation takes too long (I had to kill the process because of the memory usage) and allocates some hundreds of GB within seconds. (Unfortunately, in one unattended run it even depleted all RAM, and the OS killed important processes...)

The memory consumption appears to explode at line 117:

unnormalized_coincidences = value_counts[..., np.newaxis] * value_counts[:, np.newaxis, :] - diagonals
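If I read the broadcasting right, value_counts presumably has shape (n_units, V), where V is the number of distinct values in the reliability data, so this product broadcasts to an (n_units, V, V) intermediate. A rough back-of-the-envelope sketch (the helper name and the concrete sizes are mine, for illustration only):

```python
def broadcast_bytes(n_units: int, n_distinct: int, itemsize: int = 8) -> int:
    """Size in bytes of the (n_units, V, V) float64 intermediate array."""
    return n_units * n_distinct * n_distinct * itemsize

# With continuous predictions, nearly every value is distinct, so V grows
# with the data. If, say, both n_units and V are around 2,400:
print(broadcast_bytes(2_400, 2_400) / 1e9)  # ≈ 110.6 GB
```

That cubic growth would explain why a restricted domain (V = 5) is fast while arbitrary floats are not.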

preds.csv
targets.csv

I am not sure whether this problem is related to this numpy bug report:
numpy/numpy#26395
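As a workaround (not a fix), quantizing the predictions to a coarse grid before calling krippendorff.alpha bounds the number of distinct values and brings the computation back to the restricted-domain regime. A minimal sketch, with synthetic data standing in for preds.csv:

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.uniform(-1.25, 1.25, size=1_200)  # synthetic stand-in for preds.csv

print(np.unique(preds).size)        # nearly every float is distinct
preds_binned = np.round(preds, 1)   # snap to a 0.1 grid: at most ~25 values
print(np.unique(preds_binned).size)

# krippendorff.alpha(reliability_data=[preds_binned, targets],
#                    level_of_measurement='interval')
# should then behave like the restricted-domain case, at the cost of
# some precision in the predictions.
```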
