Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Compute] Add percentile rank function #45190

Open
pitrou opened this issue Jan 7, 2025 · 5 comments
Open

[C++][Compute] Add percentile rank function #45190

pitrou opened this issue Jan 7, 2025 · 5 comments

Comments

@pitrou
Copy link
Member

pitrou commented Jan 7, 2025

Describe the enhancement requested

Arrow C++ already offers a rank function: https://arrow.apache.org/docs/cpp/compute.html#sorts-and-partitions

It would be useful to add a percentile rank function according to this definition: https://en.wikipedia.org/wiki/Percentile_rank

Proposed API:

/// \brief Percentile rank options
class ARROW_EXPORT PercentileRankOptions : public FunctionOptions {
 public:
  explicit PercentileRankOptions(std::vector<SortKey> sort_keys = {},
                                 NullPlacement null_placement = NullPlacement::AtEnd,
                                 double factor = 1.0);
  /// Convenience constructor for array inputs
  explicit PercentileRankOptions(SortOrder order,
                                 NullPlacement null_placement = NullPlacement::AtEnd,
                                 double factor = 1.0)
      : PercentileRankOptions({SortKey("", order)}, null_placement, factor) {}

  static constexpr char const kTypeName[] = "PercentileRankOptions";
  static PercentileRankOptions Defaults() { return PercentileRankOptions(); }

  /// Column key(s) to order by and how to order by these sort keys.
  std::vector<SortKey> sort_keys;
  /// Whether nulls and NaNs are placed at the start or at the end
  NullPlacement null_placement;
  /// Factor to apply to the output.
  /// Use 1.0 for results in (0, 1), 100.0 for percentages, etc.
  double factor;
};

Component(s)

C++

@pitrou
Copy link
Member Author

pitrou commented Jan 7, 2025

cc @icexelloss

@icexelloss
Copy link
Contributor

This looks good to me. One thing I might add is that there are a few different favors of percentile rank, e.g.

(1) scale to 1/n+1, 2/n+1, …, n/n+1 (0-1 exclusive)
(2) scale to 0/n-1, 1/n-1, … n-1/n-1 (0-1 inclusive)

So might be useful to expose the option for the user to choose which one.

@pitrou
Copy link
Member Author

pitrou commented Jan 9, 2025

Do you have pointers to the existence of the "inclusive" flavor?

@pitrou
Copy link
Member Author

pitrou commented Jan 9, 2025

Also, the formula given in https://en.wikipedia.org/wiki/Percentile_rank takes ties into account and matches neither of your two suggestions. It'd rather stick to the Wikipedia definition if there's no strong reason to do otherwise.

@icexelloss
Copy link
Contributor

tldr: I agree with the wikipedia definition.

Full thought:

In Python, I actually didn't find any library that implements the wikipedia definition, which is quite surprising. The two reference implementation I found is Pandas and Scipy, both agree with each other but not the wikipedia one:

For the example input from wikipedia:

s = pd.Series([1, 2, 3, 3, 3, 4, 4, 5, 5, 7])

The wikipedia result is

0    0.05
1    0.15
2    0.35
3    0.35
4    0.35
5    0.60
6    0.60
7    0.80
8    0.80
9    0.95
dtype: float64

both Pandas and Scipy gives me

0    0.10
1    0.20
2    0.40
3    0.40
4    0.40
5    0.65
6    0.65
7    0.85
8    0.85
9    1.00
dtype: float64

However, the Pandas rank is not ideal because it is not (0, 1) exclusive and have issues, say calling norm.ppf(rank) (it will introduce inf values), so internally we "adjust" the Pandas rank scaling from (1/N, .... N/N) to (1/N+1, ... N/N+1) to make it (0-1) exclusive, but looking at the wikipedia definition it is probably better to use that instead of tweaking the scaling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants