
Using level_of_measurement='interval' with arbitrary values is extremely memory intensive and slow #38

Open
agrizzli opened this issue Jan 17, 2025 · 0 comments


agrizzli commented Jan 17, 2025

I am computing Krippendorff's alpha with level_of_measurement='interval' on arbitrary data without a restricted value domain. This means there are:

  • gold standard values [-1.0, 0.0, 1.0] (see the attached targets.csv file)
  • predicted values as real numbers within the mathematical interval -1.25 to 1.25 (see the attached preds.csv file)
import numpy as np
import krippendorff

preds = np.loadtxt('preds.csv', delimiter=',')
targets = np.loadtxt('targets.csv', delimiter=',')

krippendorff.alpha(reliability_data=[preds, targets], level_of_measurement='interval')

When running the computation, the process uses an extremely large amount of memory and is much slower than the same computation with the restricted domain values [-1.0, -0.5, 0.0, 0.5, 1.0]. The more items the data contains, the slower and more memory-hungry it becomes. I tested it on two systems with about 2,400 data points in total:

  1. MacOS with python=3.9.6, numpy=1.26.4, and krippendorff=0.6.0

=> The computation takes more than 60 seconds and consumes 32 GB of memory, but finishes.

  2. Ubuntu Linux with python=3.10.12, numpy=1.26.4, and krippendorff=0.6.0

=> The computation takes too long (I had to kill the process because of the memory usage) and allocates some hundreds of GB within seconds. (Unfortunately, in one unattended run it even depleted all RAM, and the OS killed important processes...)

The memory consumption appears to explode at line 117:

unnormalized_coincidences = value_counts[..., np.newaxis] * value_counts[:, np.newaxis, :] - diagonals
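If I read the broadcasting right, value_counts presumably has shape (n_units, V), where V is the number of distinct values in the reliability data, so this product broadcasts to an (n_units, V, V) intermediate. A rough back-of-the-envelope sketch (the helper name and the concrete sizes are mine, for illustration only):

```python
def broadcast_bytes(n_units: int, n_distinct: int, itemsize: int = 8) -> int:
    """Size in bytes of the (n_units, V, V) float64 intermediate array."""
    return n_units * n_distinct * n_distinct * itemsize

# With continuous predictions, nearly every value is distinct, so V grows
# with the data. If, say, both n_units and V are around 2,400:
print(broadcast_bytes(2_400, 2_400) / 1e9)  # ≈ 110.6 GB
```

That cubic growth would explain why a restricted domain (V = 5) is fast while arbitrary floats are not.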

preds.csv
targets.csv

I am not sure whether this problem is related to this numpy bug report:
numpy/numpy#26395
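As a workaround (not a fix), quantizing the predictions to a coarse grid before calling krippendorff.alpha bounds the number of distinct values and brings the computation back to the restricted-domain regime. A minimal sketch, with synthetic data standing in for preds.csv:

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.uniform(-1.25, 1.25, size=1_200)  # synthetic stand-in for preds.csv

print(np.unique(preds).size)        # nearly every float is distinct
preds_binned = np.round(preds, 1)   # snap to a 0.1 grid: at most ~25 values
print(np.unique(preds_binned).size)

# krippendorff.alpha(reliability_data=[preds_binned, targets],
#                    level_of_measurement='interval')
# should then behave like the restricted-domain case, at the cost of
# some precision in the predictions.
```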
