After trying to fasten the kilosort/phy reading and subselecting units with sevral way:
It appears that the main slow down is to force the spike vector to be fully lexsorted:
spikes = spikes[np.lexsort((spikes["unit_index"], spikes["sample_index"], spikes["segment_index"]))]
It use to be partially lexsorted :
spikes = spikes[np.lexsort((spikes["sample_index"], spikes["segment_index"]))]
This commit recent commit 456265c has introduced this "need of fully lexsorted".
In short does 2 spikes with same sample_index need to be ordered by unit_index ?
Yes : we need full lexsort
Pro :
- It ensure always an order of the spiek vector because spike amplitude, amplitude scaling, spike location is a vector related to this spike vector.
Cons :
- it is a big slow down when selecting units
- constructing the spike vector (with many spikes) can be very slow
No : we need partial lexsort
Pro :
- We do not care 2 spikes of same sample_index to be ordered by units for any computation (CCG)
- We only need to ensure that the spike vector is always constructed/read/cached in the same consistent order (for instance by reading the phy/kilosort spike_clusters.npy)
Cons :
- some future part of code could rely on the fact this vector is fully lexsorted
My personnal taste is second case.
The main problem : the spikeinterface codebase use to be No : we need partial lexsort and in february 2025 we switched to Yes : we need full lexsort without too much backward compatibility in mind. Zarr is now opening the internal spike vector and force lexsort in the __init__() which is a big big slow down but also can break the order of spike amplitude, amplitide scaling, ... it this extension was computed before this change (v 0.104.0) and the analyzer was saved in the zarr format before this version.
I would strongly vote for removing the behavior (force full lexsort) and go back to previous situation. But we need to find a better way for backward compatibility to avoid the same mistake.
@grahamfindlay @alejoe91 @chrishalcrow @yger @tayheau
Any opinion on this ?
PS : the first reader able to find more than 10 typos in this post will have a glass of wine.
After trying to fasten the kilosort/phy reading and subselecting units with sevral way:
get_unit_spike_trainsand performance improvements #4502It appears that the main slow down is to force the spike vector to be fully lexsorted:
spikes = spikes[np.lexsort((spikes["unit_index"], spikes["sample_index"], spikes["segment_index"]))]It use to be partially lexsorted :
spikes = spikes[np.lexsort((spikes["sample_index"], spikes["segment_index"]))]This commit recent commit 456265c has introduced this "need of fully lexsorted".
In short does 2 spikes with same sample_index need to be ordered by unit_index ?
Yes : we need full lexsort
Pro :
Cons :
No : we need partial lexsort
Pro :
Cons :
My personnal taste is second case.
The main problem : the spikeinterface codebase use to be No : we need partial lexsort and in february 2025 we switched to Yes : we need full lexsort without too much backward compatibility in mind. Zarr is now opening the internal spike vector and force lexsort in the
__init__()which is a big big slow down but also can break the order of spike amplitude, amplitide scaling, ... it this extension was computed before this change (v 0.104.0) and the analyzer was saved in the zarr format before this version.I would strongly vote for removing the behavior (force full lexsort) and go back to previous situation. But we need to find a better way for backward compatibility to avoid the same mistake.
@grahamfindlay @alejoe91 @chrishalcrow @yger @tayheau
Any opinion on this ?
PS : the first reader able to find more than 10 typos in this post will have a glass of wine.