Expectations 0.7.2 #137
My first take on this: I'm assuming that a lot of the naming is still placeholder? I'm not commenting on naming here.
I'm confused by this? These three can all be set in problem.yaml, the last two will always have a value (since there is a default for when they are not explicitly set).
Why does the name have to be part of the result? Wouldn't it feel more natural to have a testcase (which has a name) map to a result? Also, given that the name is here, why is it optional? A testcase always has a name.
It would feel more natural to me if
This is not how aggregate verdicts are currently defined for scoring problems.
A testcase or group could also start with
I don't think we need these. Well... we obviously don't need them since it's "just" syntactic sugar, but I don't think we should have them.
...or value (which is what the code actually says)
This feels very strange. When and why would you use this?
Please elaborate on the lot of good it will do. This seems to be a point I'm not getting above.
Shouldn't they though? |
In my understanding these are not something you can write in
All of this is only for the purposes of precisely specifying how to go from the
Can you explain how they are defined? I'm not quite sure.
I definitely want them. It's super useful to specify
No, I don't think so. Generally, |
Strongly disagree. (Again, it feels as if we are rehashing conversations from three months ago.) I very much want to be able to do this. (A quadratic time submission that passes the partially_accepted/thore.py:
sample: accepted
group1: accepted
group2: accepted
group3: time limit exceeded
group4: wrong answer

For an IOI-style problem, I have 20 of these submissions. I need the abbreviations. |
No, I would say that it's quite unclear. We need to improve that: It's clearly not simply "first error" though.
Ok, yeah, we probably (i.e. definitely) need some kind of shorthand. Do you feel the same about all of them?
It allows you to say things about the submissions (rather than the testcases). I.e. this submission will be judged as X. It can also allow for short circuiting. Seems to me to be the most basic thing we would want to express. |
Yes, I agree that we need some kind of shorthand. |
I agree that shorthand and verifying aggregate verdict/score are both very nice to have. |
It does not have to (the same effect would indeed be achieved with a higher-order datatype such as a map), it’s merely the conceptually most minimal way to give sense to “the first one” for a set of

#result_for_testcase: [#name]: #result

and then define

In short
No, I want fine-grained targeting of testdata to always start with
Assume you have a two-testcase scoring problem (with

Consider now:

thore.py:
secret:
permitted_testcase_scores: [0, 50]

This means that

On the other hand, consider:

ragnar.cpp:
secret:
permitted_aggregate_score: [0, 75]

This means that Ragnar’s submission gets at most 75 points in total over both

In principle, it makes sense to write:

my_heuristic.cpp:
permitted_aggregate_score: [0,90] // total score is max 90
permitted_testcase_scores: [0,10] // individual score on each testcase is max 10

Since there are many different ways of defining scoring problems (in particular, in the Swedish Olympics, using both subtasks and a bespoke scoring output validator for the last subtask), it is useful to support both concepts. So: |
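To make the distinction concrete, here is a small sketch of how the two kinds of score expectations could be checked. The field names `permitted_testcase_scores` and `permitted_aggregate_score` come from the proposal above, but the checking logic, and in particular using `sum()` as the aggregation, are my own simplifying assumptions (real aggregation depends on `testdata.yaml`):

```python
# Illustrative sketch, not normative: per-testcase vs aggregate score ranges.
# Using sum() as the aggregate is an assumption for this example.
def within(value, bounds):
    low, high = bounds
    return low <= value <= high

def check(testcase_scores,
          permitted_testcase_scores=None,
          permitted_aggregate_score=None):
    # Every individual testcase score must lie in the permitted range.
    if permitted_testcase_scores is not None and \
            not all(within(s, permitted_testcase_scores) for s in testcase_scores):
        return False
    # The total over all testcases must lie in the permitted range.
    if permitted_aggregate_score is not None and \
            not within(sum(testcase_scores), permitted_aggregate_score):
        return False
    return True

# Thore's submission: at most 50 on each individual testcase
print(check([0, 50], permitted_testcase_scores=(0, 50)))   # True
# Ragnar's submission: at most 75 in total over both testcases
print(check([50, 50], permitted_aggregate_score=(0, 75)))  # False
```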
The definition of how verdicts are aggregated in the specification is this:
I think this is close, except for
Is there anything else wrong with it? My suggestion to clarify both issues in 0.7.1 above is

// Aggregate verdict
// =================
// For a linearly ordered sequence of verdicts, the aggregate verdict is
// * the first occurrence of "WA", "TLE", or "RTE" if it exists
// * "AC" if the list is empty or only contains "AC" The linear ordering would be given by lexicographic ordering of |
I definitely agree that |
@thorehusfeldt
I would say that it should be
It should say that the result must be as if you ran the test cases in lexicographic order and take the first non-accepted verdict. Rather than rewording the spec to carefully avoid saying which order things should run, I would suggest adding something explicitly allowing "optimizations" (or whatever it should be called?) as long as the externally observable behavior is as specified.
What do you mean? The definition quoted clearly is restricted to pass-fail problems, and just removing "For pass-fail problems" gives us an incorrect description of how the verdict of scoring problems is determined. I agree that the description should not be limited to pass-fail problems.
I still don't understand the purpose of this text? The canonical description of how verdicts are aggregated should be in the spec, and we shouldn't have a separate copy (or much worse, a different definition) in the spec of the expectations. If this text is just for completeness while discussing the proposal, I guess it's fine. I would rather just refer to the correct section of the spec though. (Currently https://www.kattis.com/problem-package-format/spec/2023-07-draft.html#grading). As pointed out above, the description is in fact not correct for scoring problems. Why is the ordering mentioned outside of the comment, and not in it? |
I guess then it is an error to specify |
What is your guess based on? I don't think it should be.
Yes, that sounds reasonable. |
Why was it removed? And what does that mean exactly? |
The spec doesn't currently define verdicts for scoring problems. All it says is
So either we don't allow
I think that at the end of the meeting we agreed that if you want to exclude a submission from the default Since |
Agreed. I think we should define them.
Ok, I will look at that.
I guess I don't understand what this structure we are defining is. How can one say that |
No, we agreed (and I feel "agreed" is a bit strong) that all submissions must be in a subdirectory of

Some examples:

accepted:
used_for_timing: true
accepted/slow.py:
used_for_timing: false
# The above is ok, it sets it on the directory (this would typically be implied), and overrides it on the submission
---
other/sol1.py:
used_for_timing: true
# This is ok, it only sets the value once
---
accepted/*.py:
used_for_timing: false
accepted/fast.py:
used_for_timing: true
# This is NOT ok, it sets the value twice on the same level (the submission)

In any case... There were originally two main use cases we wanted to cover w.r.t.
Later we also touched on:
...but I'm not sure if we actually think anybody would ever want this, or if it was more theoretical. We also touched on "the same but opposite", i.e. that

There are 2 kinds of margins (lower and upper) and two kinds of exceptions (exclude and include), so the remaining very theoretical case would be:
Now, cases 1 & 2 are something that our users want to do often. There is clearly a need for this. Case 3 has never been requested, but I could maybe see how it could be useful? Case 4 has never been requested and I don't see how it would be useful or even reasonable. All these cases, but certainly 1 & 2, can be done quite nicely with the semantics shown in the examples below. What happens if we instead switch to the "you have to move it to a different directory" line of thinking (but keep the "all submissions must be in a subdirectory of I don't think this last option is better, mostly because the feedback in the past has been lukewarm, but I certainly don't think it's terrible. |
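The override rule from the earlier examples ("more path segments override fewer; two different values at the same depth is an error") can be sketched as follows. This is my own reading of the examples; the prefix-matching behavior and all names here are assumptions, not spec:

```python
# Illustrative sketch of the proposed override rule for keys like
# used_for_timing: deeper (more path segments) wins; a conflict at the
# same depth is an error. Matching a key against a path or any of its
# directory prefixes is my assumption based on the examples above.
import fnmatch

def _matches(pattern, path):
    # "accepted" should cover "accepted/slow.py", so try every prefix.
    parts = path.split("/")
    return any(
        fnmatch.fnmatch("/".join(parts[:i]), pattern)
        for i in range(1, len(parts) + 1)
    )

def resolve(rules, path):
    by_depth = {}
    for pattern, value in rules.items():
        if _matches(pattern, path):
            by_depth.setdefault(pattern.count("/"), set()).add(value)
    if not by_depth:
        return None  # no rule applies
    values = by_depth[max(by_depth)]
    if len(values) > 1:
        raise ValueError("conflicting values at the same level for " + path)
    return values.pop()

rules = {"accepted": True, "accepted/slow.py": False}
print(resolve(rules, "accepted/slow.py"))  # False: submission overrides dir
print(resolve(rules, "accepted/fast.py"))  # True: only the directory rule

# The NOT-ok example: both keys have the same number of path segments,
# so resolve({"accepted/*.py": False, "accepted/fast.py": True},
#            "accepted/fast.py") raises ValueError.
```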
Regarding abbreviations again. It seems we all agree that we should have some kind of shorthand. Below are some kinds of things I would like to be able to do with a shorthand. I'll avoid using the currently defined abbreviations in the examples, they could clearly cover several of these needs, but not all. Don't get too bogged down on the syntax, but of course, feel free to comment on it.

other/sol1.py: AC
# gets a verdict of AC (i.e. aggregated verdict)
other/sol2.py: WA
# gets a verdict of WA
other/sol3.py: 50
# gets a score of 50 (i.e. aggregated score)
other/sol3.py: 80-100
# gets a score in the 80-100 range
wrong_answer/sol4.py:
sample: AC
# gets an aggregated verdict for the data group `sample`
other/sol5.py:
secret/group1: 20
secret/group2: 20
secret/group3: 20
secret/group4: 0-40
# gets a score of 20 on groups 1-3 and in the 0-40 range on group 4
accepted/kinda_slow.py: do_not_use_for_timing
# Don't use for timing. This one does not feel important, it's barely shorter than the non-shorthand version.

The most common and simple things that you can't do (well) with the older spec, and that we wanted to add support for with the expectations framework, were the following, in order of how commonly it's been requested:
These things should be easy to do. With the examples above they would be. Thoughts? |
Quoting from #135, but I think it's still relevant
But this only works because there are no other files that also match those globs (and it's a bit hacky). What you really would want to be able to say is

thore.py: wrong answer
thore.py:
sample: accepted

But that is not legal YAML. There is a workaround that almost always works (obviously you can create cases where it doesn't, but they are not reasonable), although it still feels hacky:

thore.py: wrong answer
thore.py*:
sample: accepted

The same kind of things apply to the shorthands I suggested just above. E.g.:

thore.py: 80
thore.py:
used_for_timing: false

But the longhand is not much longer so I think it's fine:

thore.py:
score: 80
used_for_timing: false

In fact, I just realise that with a long enough file name (including in this case, so not very long at all) the "longhand" is shorter. :) |
Minor comments
|
Back to timing

I have found two ways to express fine-grained timing-related constraints as part of the expectations framework and laid them out in proposals 0.5 and 0.6. There is a less satisfying (to me) third way that I explain in the current issue (called 0.7). To summarise:
I have not seen a well-defined proposal for an overridable boolean flag that would make sense of the following:

accepted: accepted
accepted/thore:
with_margin: true
accepted/*.py:
with_margin: false
accepted/thore.py:
secret: { with_margin: true }

The only suggestion so far is Arnar’s, which maybe is the following:
For this to make sense, it becomes important, e.g., whether the Hence my decision. The default is |
Here are Fredrik’s desires expressed in the actual syntax of 0.7, with some comments.

other/sol1.py:
permitted_testcase_verdicts: ["AC"]
# gets a verdict of AC (i.e. aggregated verdict)
# Thore: I would just create `other_ac` (as below) and put sol1.py there
other/sol2.py: wrong answer
# gets a verdict of WA
other/sol3.py: 50
# gets a score of 50 (i.e. aggregated score)
# Thore: Note that in 0.7 this means the running time is less than timelimit (without margin).
other/sol3.py: [80, 100]
# gets a score in the 80-100 range
wrong_answer/sol4.py:
sample: accepted
# gets an aggregated verdict for the data group `sample`
other/sol5.py:
secret/group1: 20
secret/group2: 20
secret/group3: 20
secret/group4: [0, 40]
# gets a score of 20 on groups 1-3 and in the 0-40 range on group 4
other_ac/kinda_slow.py:
permitted_testcase_verdicts: ["AC"]
# Don't use for timing. This one does not feel important, it's barely shorter than the non-shorthand version. |
No. The most important thing for me and Johan, as expressed clearly in August, is to be able to express required and permitted testcase verdicts per testgroup. This is the entire core of the proposal, and has been so since August, in each of its iterations (of which we now have the 7th.) The minimal expressiveness we want is this:

partially_accepted/quadratic.py:
sample:
permitted_testgroup_verdicts: ["AC"]
secret/group1:
permitted_testgroup_verdicts: ["AC"]
secret/group2:
permitted_testgroup_verdicts: ["TLE", "AC"]
required_testgroup_verdicts: ["TLE"]
secret/group3:
permitted_testgroup_verdicts: ["TLE", "AC", "WA"]
required_testgroup_verdicts: ["WA"]

(Everything else, in particular the abbreviations, is syntactic sugar.) In particular, many authors of scoring problems think of subtasks as “this should kill the brute-force submission with

The decision of whether subtask one should get 15 or 21 points is different from that. Points-per-subtask are defined in

Johan was really admirably clear about this in August, and his explanation was exactly what got me to completely change the expectations proposal to be about expressing required and permitted testcase verdicts per testgroup (instead of aggregate numeral scores or aggregate verdicts). This is the right way of doing it, as agreed on in August. Every single proposal about expectations since then – three months of exchanges – has been grounded in that mindset. |
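A minimal checker for this core expressiveness could look like the sketch below. The field names come from the example above; the semantics are my reading of it (every observed verdict in a group must be permitted, and every required verdict must occur at least once in the group), so treat this as an assumption, not the spec:

```python
# Illustrative checker for per-testgroup permitted/required verdicts.
# Semantics assumed: observed verdicts must be a subset of permitted,
# and each required verdict must appear at least once in the group.
def satisfies(expectations, observed):
    for group, exp in expectations.items():
        verdicts = set(observed.get(group, []))
        permitted = exp.get("permitted_testgroup_verdicts")
        if permitted is not None and not verdicts <= set(permitted):
            return False
        required = exp.get("required_testgroup_verdicts")
        if required is not None and not set(required) <= verdicts:
            return False
    return True

quadratic_py = {
    "sample": {"permitted_testgroup_verdicts": ["AC"]},
    "secret/group1": {"permitted_testgroup_verdicts": ["AC"]},
    "secret/group2": {"permitted_testgroup_verdicts": ["TLE", "AC"],
                      "required_testgroup_verdicts": ["TLE"]},
}
print(satisfies(quadratic_py, {"sample": ["AC"],
                               "secret/group1": ["AC", "AC"],
                               "secret/group2": ["AC", "TLE"]}))  # True
print(satisfies(quadratic_py, {"sample": ["AC"],
                               "secret/group1": ["AC"],
                               "secret/group2": ["AC", "AC"]}))   # False: no TLE
```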
Are you
IMO the

Using this definition for aggregate verdicts is a bit problematic for scoring problems. The relationship between verdict and score should be such that the score may only exist and be non-0 if the verdict is AC. We should specify clearly whether the score exists and is 0 for rejected verdicts, or if it simply isn't defined, but it doesn't change the general point here.
I don't think anybody was second guessing YAML, I agree that changing the file format is orthogonal to and out of scope of this discussion. The question that (I think) you are answering was how to best express (in YAML) the gist of:

thore.py: wrong answer
thore.py:
sample: accepted

Given that that is not valid YAML. Not which other format to switch to where it would be legal. Do you have any thoughts on that? |
As mentioned, I strongly prefer to keep to the set of verdicts defined elsewhere (i.e. CLICS). It does behave really well when applied to subsets of testcases though. Fundamentally, I see the time limit verification as a separate process from the verdict, and I don't think these two concepts should be forced together.
This does fix the inconsistency with CLICS. Other than that is functionally identical to 0.5, meaning it will have the same benefits and drawbacks, except being less concise (and beautiful). That also means my fundamental issue above applies here as well.
We had put some limits on how to override it, but it's not non-overridable. In fact, if it's truly non-overridable it does not solve the two main (and only?) issues this was designed to solve. Namely, "I want to exclude an accepted submission from the normal time limit verification" and "I want to exclude a time limit exceeded submission from the normal time limit verification". I don't quite agree with the "difficult to decode for the human reader". I would say that that is rather the main benefit of this model. My reasoning is that having the ability to tag submissions with "exclude this from margin stuff" is very close to the main use cases.
Towards the end of the call we had (or at least worked towards) a definition that said that more path segments overrides fewer path segments, and setting different values with the same number of path segments is an error. We never talked about applying this to testcases but that would make sense, so with this we get:
Is that what you intended it to mean?
My intent when suggesting this was definitely to view the default to be explicitly set. I have not heard anyone saying or implying something else.
Is this referring to:
I think we all agree that the default is |
Replying to this comment from Thore
|
Those things are not mutually exclusive. Are you saying that what I listed was:
I would say that "being able to express required and permitted testcase verdicts per testgroup" is a very powerful (and very useful) feature, but is neither "simple" nor "commonly requested". Note that not being "commonly requested" doesn't mean that few would want it. I think most would, but that they had not thought about it. So, what you said can be "the most important thing" at the same time as what I listed was "The most common and simple things that you can't do (well) with the older spec, and that we wanted to add support for with the expectations framework". Are we actually disagreeing on something here? If so, what? |
The definition of the
This definition subsumes the case where

There is indeed a definition of “submission verdict” in the August draft of the spec:
This is well-defined for nonempty

Note that

(Let me repeat that I am not even advocating to add
unintended mistakes, plural. You say "unfortunate formulation about running testcases", I assume that's one of the mistakes you're referring to (and strictly speaking I agree), what other mistake(s) are you referring to?
For pass-fail problems, I agree. I also think that it is consistent, i.e. the "submission verdict" of a pass-fail problem is the same as the "aggregate verdict" of all the testcase verdicts ordered alphabetically on testcase path.
Why is a definition of "aggregate verdict" for arbitrary subsets needed "to make sense of even basic operations of the entire problem package"? Based on the fact that you are not advocating for having an "aggregate verdict", I'm assuming that you don't see a need for it? I wanted to add what we named "submission verdict" above (which I originally called "verdict" and we renamed to "aggregate verdict"), and I see a use for that. I would argue that "aggregate verdicts" are not so relevant for arbitrary subsets, there the concepts of |
I agree, and furthermore I believe there is consensus on this point. To be clear, consensus that we want "to be able to express required and permitted testcase verdicts per testgroup", not that we don't want anything else. And to even clearer, I'm not implying that you are saying that we want nothing else.
That sounds good. We have that in the current (and most/all previous) proposals. I also think there is consensus here.
Agreed. My point when arguing "against" the abbreviations was to focus the syntactic sugar (because it does have some cost) on the "common and simple" cases. In any case, the details of the syntactic sugar are less important than the underlying logic (but they're not unimportant).
I'm not entirely sure I understand exactly what you mean, but I think I agree.
Agreed.
Agreed.
That's a much more opinionated (and subjective) claim. I definitely think it's fine (and useful) to do verification by specifying the expected score for a set of submission. This is how it's typically done at IOI.
And we seem to agree on all that?
Why does it have to be instead of and not in addition to?
|
I'll try to summarize a little bit... Things we agree on:
Important things that we disagree on:
Things that we don't quite agree on, but that are less important:
After summarizing this it definitely feels to me that we are very close. Did I miss something? |
Note on regex/prefix matching

One clarification: the matching rule that I like is that
This means that

In particular, the submission

Also, the testcases in testgroup

I find this notationally very lightweight, and extremely simple to communicate. (And implementation is already done, because |
As a user I think this is pretty nice. I was going to write that it's limiting as an implementer since "all the regex dialects out there are slightly different", but it turns out at least for Python there is a quite nice list of things that are supported. And since anyway we'll mostly do this in Python I'm OK with it. https://docs.python.org/3/library/re.html#regular-expression-syntax

Some remarks:
|
I don't like using a specific language's regex implementation as a specification, as it basically forces everyone to use that language (Python in this case) to implement it. I know there's no single uniform regex specification, but I think we should stick to some sufficiently generic subset of regex that is supported in essentially all languages, for example POSIX Extended Regular Expressions, or maybe Perl Regex. But I also agree with Ragnar that simple globbing (possibly with |
I will not fight globbing, and I understand the very good point about

My main concern is that I am emotionally attached to:

accepted: accepted # don’t want to write `accepted/*`
partially_accepted/thore.py:
secret/group1: accepted # don’t want to write `secret/group1/*`
secret/group2: wrong answer Maybe the “and directories” in Ragnar’s post above is exactly what I’m looking for. Let me try:
Here, it is understood that
I think I like this. |
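The convention being converged on here ("a key matches a path or any directory prefix of it") is easy to prototype. A sketch of that reading, with the caveat that Python's `fnmatch` lets `*` match across `/`, unlike shell globbing, so a real implementation would likely want stricter segment-wise matching:

```python
# Sketch (my reading, not normative) of glob-with-directories matching:
# a key matches a submission or testcase if it glob-matches the full path
# or one of its directory prefixes, so plain `accepted` covers everything
# under accepted/. Caveat: fnmatch's `*` also matches `/`.
import fnmatch

def key_matches(key, path):
    parts = path.split("/")
    prefixes = ("/".join(parts[:i]) for i in range(1, len(parts) + 1))
    return any(fnmatch.fnmatch(prefix, key) for prefix in prefixes)

print(key_matches("accepted", "accepted/th.py"))            # True
print(key_matches("accepted/*.py", "accepted/th.py"))       # True
print(key_matches("*/th.py", "wrong_answer/th.py"))         # True
print(key_matches("secret/group1", "secret/group1/004-x"))  # True
print(key_matches("secret/group2", "secret/group1/004-x"))  # False
```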
This sounds very good! Python's glob has

It would be nice to do brace expansion as well,
OK, I now have a minimal expectations framework (

Only interesting observation (after actually implementing it and validating data): YAML doesn’t think

"accepted/*.py": accepted ## matches {accepted/th.py, accepted/ragnar.py}
"*/th.py" : accepted ## matches {accepted/th.py, wrong_answer/th.py}.
---
brute_force.py:
"*/*-small-*" : accepted ## matches secret/group3/032-small-disconnected

I think that’s perfectly fine. I will today
|
Closing, evolved into #145 |
Only |
That would depend on the (possibly undocumented) implementation of your YAML parser or your editor. The YAML specification itself is quite clear, but some tools are more lenient and make educated guesses about what “looks like a string”. (A situation that I find disastrous for a format used for specification, but that’s where we are. There are other famous examples with

In any case, the definition of “what looks like a string to YAML” is outside of the expectations framework. In the above example, the tools that I tried do the right thing for |
I don't understand your answer. It seems tangential, but I might be missing something? My point was that if it was only the string

If it's in fact any string starting with

Which of the above are you saying that it is? |
You are not missing anything; there is no issue I’m worried about, I used the formulation “interesting observation” above, not “end of the world”. I implemented the syntax of the expectations framework (as you can see in #145), including an online validator of the current specification that you can play around with in your browser. (Link repeated: https://cuelang.org/play/?id=baug8IzTJVU#cue@export@cue) You can add and remove

The only observation is that when we previously sketched examples for desirable syntax like:

*/th.py: accepted

then we can’t have that. Instead, we should think of it as

"*/th.py": accepted |
Here is my best shot at Expectations 0.7.2, much of it in CUE:
Changed in 0.7.1: Removed `false` from the values for `with_margin`

Changed in 0.7.2: Restricted `expectation` keys based on `problem.type`

What changed?

The most visible change is that the `#expectations` struct is now a lot larger. Here it is again:

The key names distinguish between `testcase_verdicts` and `aggregate_verdict`. Symmetrically, the `score` field has split into two (and this will do a lot of good!).

The new boolean `with_margin` has also been added. In total, 3 new fields, and several renamings. Have a look.

Also, `#result` got a new field

We need this for sorting verdicts (so we can compute aggregate verdicts), and for applying the scoring aggregation rules (which depend on `testdata.yaml` file contents and their full path names).

I added the `!` to make it explicitly mandatory.

I didn’t bother to specify the aggregate verdict in CUE, it’s not clearer than the prose expression I wrote down.

Note that none of the abbreviations use aggregated verdicts or scores. (But they do use `with_margin`.)