Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Currently, pandas.IntervalArray suffers from 3 major limitations:
1. All intervals in an array are limited to the same closedness (i.e. the same array can store only closed intervals or only open intervals). An earlier concern, that each interval must have the same closedness on both sides, apparently no longer applies.
2. Intervals do not allow missing values.
   - In particular, one cannot represent unbounded intervals for data types that lack an actual infinity value, like int32.
3. Some dtypes, like string, are not allowed.
A practical application for (1) that I am very interested in is storing information about the range of valid values for the columns of another DataFrame.
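To make this use case concrete, here is a minimal pure-Python sketch of per-column validity ranges where each range carries its own closedness and may be unbounded on either side. The `Bounds` class, the column names, and the convention that `None` means "unbounded" are all assumptions made for illustration; none of them are existing pandas API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Bounds:
    """Hypothetical validity range; a bound of None means unbounded on that side."""

    lower: Optional[float] = None
    upper: Optional[float] = None
    lower_inclusive: bool = True
    upper_inclusive: bool = True

    def contains(self, value: float) -> bool:
        # Reject values below the lower bound, or equal to it when it is open.
        if self.lower is not None:
            if value < self.lower or (value == self.lower and not self.lower_inclusive):
                return False
        # Reject values above the upper bound, or equal to it when it is open.
        if self.upper is not None:
            if value > self.upper or (value == self.upper and not self.upper_inclusive):
                return False
        return True


# Each column gets its own closedness, which a single IntervalArray cannot express.
valid_ranges = {
    "temperature_kelvin": Bounds(lower=0.0),                               # [0, +inf)
    "probability": Bounds(lower=0.0, upper=1.0),                           # [0, 1]
    "learning_rate": Bounds(lower=0.0, upper=1.0, lower_inclusive=False),  # (0, 1]
}

row = {"temperature_kelvin": 293.15, "probability": 1.0, "learning_rate": 0.0}
violations = [name for name, value in row.items() if not valid_ranges[name].contains(value)]
```

Here `learning_rate` is the only violation, because its lower bound is open while the other two ranges are closed at 0.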
Feature Description
Given the better integration with pyarrow since pandas 2.0, we can recreate IntervalDtype using pyarrow.struct:

    import pyarrow as pa

    def arrow_interval_dtype(subtype):
        # One field per bound, plus one inclusiveness flag per bound.
        fields = [
            ("lower_bound", subtype),
            ("upper_bound", subtype),
            ("lower_inclusive", pa.bool_()),
            ("upper_inclusive", pa.bool_()),
        ]
        return pa.struct(fields)

Unlike the current IntervalDtype, this would solve all 3 major problems at once:
1. Each element of the resulting StructArray can have separate closedness.
2. Pyarrow data types all support missing values.
3. We can in principle use any ordered data type for the subtype.
Alternative Solutions
None.
Additional Context
A common request is adding extra operations for interval dtypes:
- ENH: Interval type should support intersection, union & overlaps & difference #21998
- ENH: Arithmetic operations on intervals #43629
- API: Implement interval-point joins #21901
- Features which Interval / IntervalIndex should probably have #19480
Additionally, one could imagine having an IntervalUnion type that can represent finite unions of intervals, combining the interval type discussed here with pyarrow's list type. This type would naturally arise when performing unions of intervals, such as [0, 2]∪[3, 5]. The nice thing here is that the resulting space is mathematically closed under the standard set operations (union, intersection, complement, difference).