First, I am fully aware that this request has already been made and declined (#450). However, I hope that this request is different because it includes a PR (#544) with a resolution that I hope works.
TLDR: instead of rejecting logical inputs for truth or estimates as errors, just internally convert them to factors and then everything else works fine, with no headaches for yardstick developers.
This request started out as a bug report on hardhat. It evolved into an explanation for why tidymodels and yardstick don't support logical outcomes for binary prediction. Please excuse me for the lengthy quotation, but I think it's important to state here the motivation for my PR.
- The real-world data has qualitative meaning, so users should be forced to encode the outcomes according to their qualitative meaning.
- Since binary data has intrinsic qualitative meaning, it should be encoded as factors, which is the native R datatype for encoding qualitative meaning, just like multinomial outcomes.
- Binary data lacks labels to distinguish the FALSE from the TRUE cases.
For the first point, I know you did not use the word "forced", but that is how I understood your point. And this is where I think I disagree. On one hand, I agree that as a package author, I should be disciplined to follow conventions such as tidymodels. On the other hand, I think that my packages should be sufficiently flexible to accept all reasonable kinds of input data that users provide. I consider it quite unreasonable to reject logical outcomes as "unreasonable" or "illegitimate" on the package level.
logical is such a fundamental data format (for predictors and outcomes alike). It is the most natural and intuitive format for binary outcomes. I definitely think that any binary prediction modelling package should be able to cleanly and naturally handle logical outcomes without forcing the user to encode them as factors.
For the second and third points, I understand the logic, but again, I don't think such logic should be imposed on users. The resolution to me is very natural: the two levels should be labelled as "FALSE" and "TRUE". I don't understand why that would be complicated. If users want something better than that, then they should follow your advice and encode the outcomes as factors. But that should be their choice.
One important point that you implied here but made explicit in your 2012 "rant" is that it is unclear which of the two binary cases should be considered the primary or "positive" class. However, here I think your argument is self-defeating (unless I misunderstand it). This is only a problem when your advice is followed and binary outcomes are encoded as factors--indeed, then there is no natural choice of the first or the second level as the positive case. … However, when the binary outcome is kept logical, then there is no issue. The clear and natural choice is that TRUE is the positive case. There's no ambiguity there. …
My point here is not to convince you or anyone else to change the tidymodels conventions, but my point is hopefully to convince you to give first-class support for "binary outcomes as logical". … I hope that my arguments are sufficient to respect the choice to support such functionality for package designers who disagree with this controversial point, rather than shutting out much of the fantastic tidymodels infrastructure from us just because of this point of disagreement.
So, as a proposed resolution to this issue, my PR does the following: instead of rejecting logical inputs as errors, it converts them to factors (and sets TRUE to be the event_level if applicable). This solution is very simple and works elegantly in the PR. Its only inconvenience is that it required adding the following code (or versions thereof) in almost 30 places throughout the package, right before any call to check_class_metric() or check_prob_metric(), which check for valid inputs:
if (is.logical(truth)) {
  event_level <- "second" # TRUE is second level of levels(factor(truth))
  truth <- factor(truth)
  estimate <- factor(estimate)
}
This is done in the PR (with a couple of tests) and it works great. Crucially, it doesn't mess with any code that assumes working with factors, so it should be easy to maintain and extend for future functionality. Additionally, since this converts a previous error condition to an accepted condition, it should not break any working code anywhere downstream or in users' scripts.
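To make the idea concrete outside the package, here is a minimal base R sketch of the same conversion as a standalone helper. The function name coerce_logical_inputs() is hypothetical and is not part of yardstick or of the PR. It also makes two assumptions that go beyond the snippet above: it sets the factor levels explicitly, so that both levels exist even when the data contain only TRUE or only FALSE, and it leaves a non-logical estimate (for example, numeric class probabilities) untouched.

```r
# Hypothetical helper, not part of yardstick: coerce logical inputs to
# factors before any factor-only validation runs.
coerce_logical_inputs <- function(truth, estimate, event_level = "first") {
  if (is.logical(truth)) {
    # Assumption: fix the levels explicitly so that both "FALSE" and "TRUE"
    # are present even if only one of them occurs in the data.
    lvls <- c(FALSE, TRUE)
    truth <- factor(truth, levels = lvls)

    # Only convert estimate when it is logical too; numeric probabilities
    # are passed through unchanged.
    if (is.logical(estimate)) {
      estimate <- factor(estimate, levels = lvls)
    }

    # TRUE is the second level, so it becomes the event of interest.
    event_level <- "second"
  }
  list(truth = truth, estimate = estimate, event_level = event_level)
}

truth <- c(TRUE, FALSE, TRUE, TRUE)
estimate <- c(TRUE, FALSE, FALSE, TRUE)

out <- coerce_logical_inputs(truth, estimate)
levels(out$truth)      # "FALSE" "TRUE"
out$event_level        # "second"
table(truth = out$truth, estimate = out$estimate)
```

Factor inputs pass through this helper unchanged, which is why the conversion should not affect code that already works with factors.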
Could you please accept it? Again, the goal of this request is so that developers like me who believe that binary outcomes are more naturally logicals than factors can still benefit from the fantastic yardstick infrastructure, without creating any development or maintenance headaches for the yardstick developers.
If this solution works for yardstick, it can hopefully be extended to other tidymodels packages as relevant.