Specify results, verdicts, time limit margins, and expectations (0.5) #133
Comments
I would argue that the current naming is depth 3 (
It's a bit unclear what is meant by "rename as above"; are you suggesting to replace
I assume that the
Shouldn't this be
I would argue that the output is always validated, but not always correct. So, maybe
Should we actually allow a score of 0? (I think maybe we should, but I'm not 100% sure.)
This seems to assume that
It's a bit unclear what exactly "never compared to anything above" actually means, but I do think I understand the gist of it. This sounds reasonable to me.
I don't like this alternative. Mixing strings and numbers (and calling that variable "time") sounds worse in every way.
I'm not sure I agree that
But that's not how you write it. The CUE specification says non-negative.
Memory use: Team message: Programming language:
It's unclear to me what "required" means here. Are you saying that it's only defined if
Here is a complete specification of Expectations-no-minus. Pay attention to
To me, this seems like a lot of new infrastructure and redundancy merely to avoid the #registry
#registry: close({[string]: #root_expectations})
// Problem-wide constants
// ======================
// Set by the judging environment
time_limit: >0
tle_margin: >1 // a.k.a. problem.limits.time_multipliers.time_limit_to_tle
ac_margin: >1 // a.k.a. problem.limits.time_multipliers.ac_to_time_limit
// Testcase Result
// ===============
// The result of running a submission on a testcase
#result: {
time!: >=0
if time <= time_limit {
// When the submission terminates within the time limit
// its exit status is either a success or a failure:
terminated_successfully!: bool
if terminated_successfully {
// The output of successfully terminated submissions is
// validated by the output validator, the resulting values are:
output_validated!: bool
message?: string
score?: >=0
}
}
}
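// Illustrative note (not part of the proposal text): output_validated, message, and score
// can only be present when the submission terminated successfully within the time limit.
// E.g. with a concrete time_limit: 2.0, the result {time: 5.0} carries no further fields,
// while {time: 0.8, terminated_successfully: true, output_validated: true} is a complete
// accepted result.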
// Testcase Verdict
// ================
//
// The verdict of a testcase summarises the result as one of four values
#verdict: "AC" | "WA" | "RTE" | "TLE"
#verdict_for_result: {
// The verdict for the _result
_result: #result
if _result.time >= time_limit {"TLE"}
if _result.time < time_limit {
if !_result.terminated_successfully {"RTE"}
if _result.terminated_successfully {
if !_result.output_validated {"WA"}
if _result.output_validated { "AC"}
}
}
}
// Testcase Timing
// ================
//
// The timing of a testcase summarises the result as one of four values
#timing: "fast enough with margin" | "fast enough" | "too slow" | "too slow with margin"
#timing_for_result: {
// The verdict for the _result
_result: #result
let t = _result.time
if t >= time_limit {
if t >= time_limit * tle_margin {"too slow with margin"}
if t < time_limit * tle_margin {"too slow"}
}
if t < time_limit {
if t >= time_limit / ac_margin {"fast enough"}
if t < time_limit / ac_margin {"fast enough with margin"}
}
}
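// Worked example (illustrative values, not part of the proposal): with time_limit: 2.0,
// ac_margin: 2, and tle_margin: 1.5,
//   time: 0.8 is "fast enough with margin"  (0.8 < 2.0 / 2)
//   time: 1.5 is "fast enough"              (2.0 / 2 <= 1.5 < 2.0)
//   time: 2.5 is "too slow"                 (2.0 <= 2.5 < 2.0 * 1.5)
//   time: 3.1 is "too slow with margin"     (3.1 >= 2.0 * 1.5)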
// Expectations
// ============
#root_expectations: {
// Set expectations for all testcases
#expectation | #range | #abbreviation
// And/or set them per testgroup:
[=~"^(sample|secret)"]: #expectation | #range | #abbreviation
}
// Often-used expectations are specified in terms of abbreviations
#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"
// Scoring problems can set the range
#range: number | ordered_tuple
ordered_tuple: tuple=[number, number & >=tuple[0]]
// In general, we can set fine-grained expectations in terms of which verdicts and timings
// are allowed and disallowed
#expectation: {
permitted_verdicts?: [...#verdict] // only these verdicts may appear
required_verdicts?: [...#verdict] // at least one of these verdicts must appear
permitted_timings?: [...#timing] // only these timings may appear
required_timings?: [...#timing] // at least one of these timings must appear
message?: string // this judgemessage must appear
score?: #range
}
// Each abbreviation stands for a set of required and permitted verdicts, as follows:
_expectation_for_abbreviation: {
_abbreviation: #abbreviation
if _abbreviation == "accepted" {
permitted_verdicts: ["AC"]
permitted_timings: ["fast enough with margin"]
}
if _abbreviation == "wrong answer" {
permitted_verdicts: ["AC", "WA"]
required_verdicts: ["WA"]
}
if _abbreviation == "runtime exception" {
permitted_verdicts: ["AC", "RTE"]
required_verdicts: ["RTE"]
}
if _abbreviation == "time limit exceeded" {
permitted_verdicts: ["AC", "TLE"]
required_timings: ["too slow with margin"]
}
if _abbreviation == "does not terminate" {
permitted_verdicts: ["AC", "RTE", "TLE"]
required_verdicts: ["RTE", "TLE"]
}
if _abbreviation == "not accepted" {
required_verdicts: ["RTE", "TLE", "WA"]
}
} & #expectation
As one of many use cases beyond specifying the semantics of the defaults required_verdicts: ["AC"]
Note that this is not the same expectation as submissions/sluggish_ac and specifying the above
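As a small illustration (the submission name "mostly_fast" and all values are invented; only the definitions above are used), a registry entry for a submission that may only be accepted or time out, must time out at least once, and must be too slow with margin on some secret testcase could look like:

#registry & {
	"mostly_fast": {
		permitted_verdicts: ["AC", "TLE"]
		required_verdicts: ["TLE"]
		secret: required_timings: ["too slow with margin"]
	}
}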
I do disagree with that, but there is a way to get around it by writing the output validator such that a matching judge message appears, so I guess it's fine to not have it included.
Became #135
Goals
Primary Design Goals
Precisely specify the meaning of “verdict” and “result”
Define the semantics for setting the time limit within “safety” margins from judge submissions
Provide clear semantics for expectations
Secondary Design Goals
Allow bespoke expectation profiles beyond "accepted", "wrong answer", etc.
Expectations for judge messages (“on some testcase, the submission produces `guess out of bounds`”) so that we can specify full test coverage for custom output validators
Expectations for scores (“this submission gets at most 65 points on all testcases”)
Allow expectations to be specified for test groups like `sample` or `secret/group1` or even `secret/03-petersen_graph`
Provide a schema for expectations specified in a separate `expectations.yaml` file
Constraints
Out of scope:
Givens
Relevant metadata is given in `problem.yaml`, specified as follows
Timing-related abbreviations
For readability, the present document introduces aliases to the metadata specified in `#problem`:
RfC1: Reconsider the key names in `#problem`: avoid depth 4, rename as above
Proposal: Results and Verdicts
Result
We define the result of running a submission on a testcase for a single-pass problem.
Decisions made here, this is RfC2.
This defines “result” (not “outcome” or “verdict”).
`time!` is not optional. Every result has a `time`, even those that are aborted or crashed. We promise that time is never compared to anything above `tle_margin * time_limit`, so any value above that is the same.
Alternatives:
A sum type `time!: "aborted" | >= 0 & <= tle_margin * time_limit` or even `time!: "crashed" | "aborted" | >= 0 & <= tle_margin * time_limit`, by whatever names we can come up with for crashed/failed/terminated/aborted.
Every result instead has
`aborted!: bool`; only those results that are not aborted must have a `time!: >= 0 & <= tle_margin * time_limit`. The problem, here and above, is that `aborted` means many things (such as a runtime exception due to recursion stack overflow).
Careful distinction between successful termination (of the process) and successful output validation; note the required field
`output_validated!`: every submission that `terminated_successfully` has its output validated.
Scores are positive
The judge message is called `message`. Alternative: `judge_message`.
Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.
Should `result.score` be a required field for `problem.type == "scoring"`? (I think yes.)
Is this good enough for multipass problems as well, including interactive problems? (Honest question.) What needs to be understood is which execution times are cumulative, for instance. This is out of scope for me, but worth thinking about.
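For concreteness, a hypothetical result instance (the concrete values are invented; the field names follow the `#result` sketch in the comment above):

example_result: {
	time:                    0.8  // execution time in seconds, within the time limit
	terminated_successfully: true // the process exited without error
	output_validated:        true // the output validator accepted the output
	score:                   25   // only meaningful for scoring problems
}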
Verdict
The `#verdict` is a shorthand of the `#result`.
There are six verdicts:
(Note 2: Verdicts may or may not be part of the information shown to the solver by a contest system, either individually or in some summary.
However, the behaviour of contest systems is beyond the scope of this specification.)
Decisions made here, this is RfC3:
A `#verdict` is derived from a `#result`, so it is something a submission has on a single testcase.
Compared to the as-of-fall-2022 obsolete terminology of `problemtools`’s default grader, the specification introduces the verdicts `TLE-` and `AC-` defined below. Read them as “time limit exceeded without margin” and “accepted without margin.” There can be many other names for these, such as `TLEW`/`ACW` or `TLE?`/`AC?` or `TLEw`/`ACw`. An alternative is to introduce `AC!` for “accepted with margin” and `TLE!` for “time limit exceeded with margin”, which are arguably clearer (but change established semantics).
`CE` or `JE` are not verdicts. (Making them verdicts requires a richer `#result`.)
What verdicts mean
We can specify the semantics of each `#verdict` quite readably in CUE:
I think this is uncontroversial, but let’s call it RfC4.
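For reference, a sketch along the lines of the `#verdict_for_result` mapping in the comment above (the margin verdicts `TLE-` and `AC-` are omitted here; they would be derived analogously from `tle_margin` and `ac_margin`):

#verdict_for_result: {
	_result: #result
	if _result.time >= time_limit {"TLE"}
	if _result.time < time_limit {
		if !_result.terminated_successfully {"RTE"}
		if _result.terminated_successfully {
			if !_result.output_validated {"WA"}
			if _result.output_validated {"AC"}
		}
	}
}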
Proposal: Expectations
The role of expectations is to
during problem development, ensure expected behaviour of author submissions on testdata as the problem definition, testdata, and author submissions are changed or added
define which author submissions are used to set the time limit
Expectations defined
An expectation (for a submission) is defined for a set of results, primarily in terms of verdicts.
A set of results satisfies an expectation if
for each result, the result's verdict is contained in `permitted`
at least one of the verdicts in `required` appears among the verdicts of the set of results
the `message` appears (as a substring) in at least one `result.message` of the set of results
every `result.score` is inside `expectation.range`
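As a small illustration (the verdicts and the expectation below are invented, and the field names follow the `#expectation` struct in the CUE sketch above):

// Three results with verdicts "AC", "AC", "WA" satisfy this expectation:
// every verdict appears in permitted_verdicts, and the required verdict "WA" occurs at least once.
_example_expectation: {
	permitted_verdicts: ["AC", "WA"]
	required_verdicts:  ["WA"]
}
// The same three results would not satisfy {permitted_verdicts: ["AC"]},
// since the verdict "WA" would not be permitted.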
RfC5:
should `expectation.score` be a required field for `problem.type == "scoring"`?
do we need to be able to express “each of these verdicts must appear” (in addition to the above `required`, which means some of these verdicts must appear)? To me, this just seems lazy: instead, make a separate submission or a separate testgroup and specify each of the required verdicts separately.
Common abbreviations
Typically a problem author will not use the fine-grained `#expectations` struct, but instead use a common abbreviation:
Note the TODOs; this is RfC6.
Note the difference between the `#abbreviation` with value `accepted` and the `#verdict` with value `AC`.
The former, `accepted`, is a claim about a set of verdicts; the latter, `AC`, pertains to a single testcase.
The abbreviations can be used as the names of directories; for instance, a submission in `<problemname>/wrong_answer` is supposed to satisfy the expectations specified by the abbreviation `wrong answer`.
Expectations for testdata
Terminology
author submission: A submission written by a problem author, typically used during development. It is evaluated by the developer framework.
solver submission: A submission written by a solver, typically during a contest or course after the problem is installed. It is evaluated by a judging system.
submission: An author submission or a solver submission
execution time: prefer this to running time (because of the confusion with runtime, which is something very different)
Spelling: