
Meaning of submissions/<verdict> directories #87

Closed
RagnarGrootKoerkamp opened this issue Jul 24, 2023 · 8 comments · Fixed by #158

@RagnarGrootKoerkamp
Collaborator

RagnarGrootKoerkamp commented Jul 24, 2023

Currently they are specified as

  • accepted: Accepted as a correct solution for all test files
  • wrong_answer: Wrong answer for some test file, but is not too slow and does not crash for any test file
  • time_limit_exceeded: Too slow for some test file. May also give wrong answer but not crash for any test file
  • run_time_error: Crashes for some test file

This is basically what you get if you grade with 'priorities', where AC < WA < TLE < RTE. In practice though, ICPC uses 'first failure' mode.
This means that these directories do not actually correspond to the expected final verdict.
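
To make the difference concrete, here is a minimal sketch of the two aggregation modes (the verdict and function names are only illustrative, not something from the spec or the tooling):

```python
# Hypothetical sketch of two ways to turn per-test-case verdicts into a
# final verdict; not actual judge code.
PRIORITY = {"AC": 0, "WA": 1, "TLE": 2, "RTE": 3}

def priority_verdict(case_verdicts):
    # 'Priorities' mode: the worst verdict over all test cases wins.
    return max(case_verdicts, key=lambda v: PRIORITY[v])

def first_failure_verdict(case_verdicts):
    # 'First failure' mode (ICPC): the verdict of the first non-AC case,
    # in judging order.
    return next((v for v in case_verdicts if v != "AC"), "AC")

# Example where the two modes disagree:
cases = ["AC", "WA", "TLE"]
assert priority_verdict(cases) == "TLE"      # matches the current directory spec
assert first_failure_verdict(cases) == "WA"  # what an ICPC-style judge reports
```

So a submission placed in time_limit_exceeded under the current wording could end up with a final verdict of WA on an ICPC-style judge.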

Does anybody implement priority-based verdicts for testing these submissions?
If not, I'd say we should change them to just mean that the final verdict (using whatever grader is in place) must be X.

@RagnarGrootKoerkamp
Collaborator Author

See also #17

@niemela
Member

niemela commented Jul 25, 2023

For accepted pretty much any interpretation means the same.

For time_limit_exceeded it's not enough to just "judge as usual and use the final verdict", since we must also make sure that there is a sufficient margin. Also, we definitely want to allow these to not always give the right answer, and "normal judging" might then judge them as WA instead.

For wrong_answer and run_time_error both interpretations could work, but "X for some test file" is more like what we do for time_limit_exceeded than "final verdict is X".

@jsannemo
Contributor

Whenever I write ICPC problems, I usually want the semantics:

  • accepted is always correct, never too slow, and never crashes.
  • time_limit_exceeded must get TLE on some test case. It should usually be correct and crash-free, but that's not a hard requirement.
  • wrong_answer must get WA on some test case. It should usually be fast enough and crash-free, but that's not a hard requirement.
  • run_time_error: here goes general crap that happens to crash sometimes, but could just as well TLE or WA. I never want to force that something crashes (sometimes I try to trigger a failing assert, but that could just as well print nothing instead of crashing).

The most common examples of non-exclusive TLE/WA/RTE failures:

  • I write a brute force that's too slow, but also crashes because of too high memory requirements on some test case.
  • I write an incorrect solution that, due to the bug, can result in a division by zero and crash.
  • I write some randomized heuristic solution that sometimes finds a sub-optimal answer, and sometimes takes too many iterations to find an answer.

Sometimes, an RTE submission might even get accepted! Memory bugs are tricky that way. Most of the time those aren't very interesting though; in particular, it's not something you want or can validate against anyway.

In conclusion, I agree with Fredrik on TLE, think that the same interpretation should be used for WA, but that RTE can be a kind of "whatever" folder (I would personally scrap it in favor of a general "rejected" one).

@RagnarGrootKoerkamp
Collaborator Author

RagnarGrootKoerkamp commented Jul 27, 2023

Ok so @jsannemo basically prefers WA and TLE to mean 'at least one testcase gives WA resp. TLE'. That sounds reasonable to me. It is also easy to implement and still allows 'lazy judging', since you can stop as soon as one case gives the expected WA/TLE verdict.
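
A minimal sketch of that lazy check (the helper and parameter names are hypothetical, not anything from the existing tooling):

```python
def directory_expectation_met(run_on, test_cases, expected):
    # Lazily verify "at least one test case gives <expected>": stop judging
    # as soon as the expected verdict has been observed once.
    for tc in test_cases:
        if run_on(tc) == expected:
            return True
    return False
```

(For accepted you of course still have to run every test case.)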

This differs from the current spec, but seems backwards compatible, since a final first-error verdict of X implies at least one case gives X.

I would then keep this consistent for RTE, where at least one case should RTE. I actually find quite a lot of actual RTE submissions in our archives, e.g. when explicit assertions in submissions fail.

I'm also happy to add rejected which only implies at least one non-AC testcase.

Re timing (a rough sketch of these margins follows the list):

  • TLE should be > timelimit*buffer
  • AC should be < timelimit/buffer
  • 'not slow enough' TLE (<timelimit*buffer) can be in rejected
  • 'AC not used for timing': rejected isn't the right place, and it's not clear where instead; maybe mixed or ignored or just in submissions/ directly.
  • 'either AC or TLE is ok': not clear, but probably the same mixed/, ignored/, or submissions/ as the case above.
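
A rough sketch of those margin checks (the 2x safety factor is only a placeholder, not a proposed value):

```python
def timing_class(runtime, time_limit, safety=2.0):
    # Classify a submission's slowest runtime against the margins above.
    if runtime < time_limit / safety:
        return "ok for accepted"             # comfortably under the limit
    if runtime > time_limit * safety:
        return "ok for time_limit_exceeded"  # comfortably over the limit
    return "too close to the limit"          # rejected/, mixed/, or similar
```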

@RagnarGrootKoerkamp
Collaborator Author

Hmm, with 'X implies at least one case of X', putting a brute force in TLE does not guarantee it's AC on all other cases. But that's OK; we will be able to use the more fine-grained expectations.yaml for that.

@Tagl
Collaborator

Tagl commented Jul 27, 2023

I support letting run_time_error behave similarly to the other two.
We can add rejected for not AC.

Sidenote: should it not be runtime_error?

@RagnarGrootKoerkamp
Collaborator Author

Ok, let's add rejected.

Suggestions for a name for an 'anything goes' directory? Assuming we want a specific subdir for it. (Otherwise I'd spec that anything outside the specified dirs is allowed but ignored by default.)

This whole runtime vs run time vs running time (which means something else) situation is a mess in the English language :/
Renaming to runtime_error is OK with me, but note that we have TLE = time limit exceeded, so RTE = run time error is consistent; we don't call it RE = runtime error.

@Tagl
Collaborator

Tagl commented Jul 28, 2023

No need to change; it seems runtime, run-time, and run time are all commonly used and interchangeable.

Do we want to disallow subdirs other than the ones we define? I think the answer to that determines the answer to having anything or something similar.
