[C++][Compute] Add percentile rank function #45190

pitrou · 2025-01-07T15:54:50Z

Describe the enhancement requested

Arrow C++ already offers a rank function: https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions

It would be useful to add a percentile rank function according to this definition: https://en.wikipedia.org/wiki/Percentile_rank

Proposed API:

/// \brief Percentile rank options
class ARROW_EXPORT PercentileRankOptions : public FunctionOptions {
 public:
  explicit PercentileRankOptions(std::vector<SortKey> sort_keys = {},
                                 NullPlacement null_placement = NullPlacement::AtEnd,
                                 double factor = 1.0);
  /// Convenience constructor for array inputs
  explicit PercentileRankOptions(SortOrder order,
                                 NullPlacement null_placement = NullPlacement::AtEnd,
                                 double factor = 1.0)
      : PercentileRankOptions({SortKey("", order)}, null_placement, factor) {}

  static constexpr char const kTypeName[] = "PercentileRankOptions";
  static PercentileRankOptions Defaults() { return PercentileRankOptions(); }

  /// Column key(s) to order by and how to order by these sort keys.
  std::vector<SortKey> sort_keys;
  /// Whether nulls and NaNs are placed at the start or at the end
  NullPlacement null_placement;
  /// Factor to apply to the output.
  /// Use 1.0 for results in (0, 1), 100.0 for percentages, etc.
  double factor;
};

Component(s)

C++

The text was updated successfully, but these errors were encountered:

pitrou · 2025-01-07T16:22:59Z

cc @icexelloss

icexelloss · 2025-01-08T19:18:05Z

This looks good to me. One thing I might add is that there are a few different favors of percentile rank, e.g.

(1) scale to 1/n+1, 2/n+1, …, n/n+1 (0-1 exclusive)
(2) scale to 0/n-1, 1/n-1, … n-1/n-1 (0-1 inclusive)

So might be useful to expose the option for the user to choose which one.

pitrou · 2025-01-09T12:28:33Z

Do you have pointers to the existence of the "inclusive" flavor?

pitrou · 2025-01-09T12:30:45Z

Also, the formula given in https://en.wikipedia.org/wiki/Percentile_rank takes ties into account and matches neither of your two suggestions. It'd rather stick to the Wikipedia definition if there's no strong reason to do otherwise.

icexelloss · 2025-01-09T14:33:40Z

tldr: I agree with the wikipedia definition.

Full thought:

In Python, I actually didn't find any library that implements the wikipedia definition, which is quite surprising. The two reference implementation I found is Pandas and Scipy, both agree with each other but not the wikipedia one:

For the example input from wikipedia:

s = pd.Series([1, 2, 3, 3, 3, 4, 4, 5, 5, 7])

The wikipedia result is

0    0.05
1    0.15
2    0.35
3    0.35
4    0.35
5    0.60
6    0.60
7    0.80
8    0.80
9    0.95
dtype: float64

both Pandas and Scipy gives me

0    0.10
1    0.20
2    0.40
3    0.40
4    0.40
5    0.65
6    0.65
7    0.85
8    0.85
9    1.00
dtype: float64

However, the Pandas rank is not ideal because it is not (0, 1) exclusive and have issues, say calling norm.ppf(rank) (it will introduce inf values), so internally we "adjust" the Pandas rank scaling from (1/N, .... N/N) to (1/N+1, ... N/N+1) to make it (0-1) exclusive, but looking at the wikipedia definition it is probably better to use that instead of tweaking the scaling.

pitrou added the Type: enhancement label Jan 7, 2025

pitrou self-assigned this Jan 7, 2025

github-actions bot added the Component: C++ label Jan 7, 2025

This was referenced Jan 9, 2025

[C++][Compute] Refactor Rank implementation #45216

Open

GH-45216: [C++][Compute] Refactor Rank implementation #45217

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++][Compute] Add percentile rank function #45190

[C++][Compute] Add percentile rank function #45190

pitrou commented Jan 7, 2025

pitrou commented Jan 7, 2025

icexelloss commented Jan 8, 2025

pitrou commented Jan 9, 2025

pitrou commented Jan 9, 2025

icexelloss commented Jan 9, 2025

[C++][Compute] Add percentile rank function #45190

[C++][Compute] Add percentile rank function #45190

Comments

pitrou commented Jan 7, 2025

Describe the enhancement requested

Component(s)

pitrou commented Jan 7, 2025

icexelloss commented Jan 8, 2025

pitrou commented Jan 9, 2025

pitrou commented Jan 9, 2025

icexelloss commented Jan 9, 2025