Description
Goals
Primary Design Goals
- Precisely specify the meaning of “verdict” and “result”
- Define the semantics for setting the time limit within “safety” margins from judge submissions
- Provide clear semantics for expectations
Secondary Design Goals
- Allow bespoke expectation profiles beyond “accepted”, “wrong answer”, etc.
- Expectations for judge messages (“on some testcase, the submission produces guess out of bounds”) so that we can specify full test coverage for custom output validators
- Expectations for scores (“this submission gets at most 65 points on all testcases”)
- Allow expectations to be specified for test groups like sample or secret/group1 or even secret/03-petersen_graph
- Provide a schema for expectations specified in a separate expectations.yaml file
Constraints
- Stay close to existing terminology and traditions
Out of scope:
- how a contest system should give feedback to the solver
- interactive and multipass problems
Givens
Relevant metadata is given in problem.yaml, specified as follows:
#problem: {
  type: "pass-fail" | "scoring"
  limits?: {
    time_limit: number
    time_multipliers: {
      ac_to_time_limit:  number
      time_limit_to_tle: number
    }
    ...
  }
  ...
}
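For concreteness, here is a hypothetical instance of #problem; all numbers are invented for illustration, and the examples later in this document assume these values:
// Hypothetical example values, for illustration only
problem: #problem & {
  type: "pass-fail"
  limits: {
    time_limit: 2.0
    time_multipliers: {
      ac_to_time_limit:  2.0
      time_limit_to_tle: 1.5
    }
  }
}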
Timing-related abbreviations
For readability, the present document introduces aliases for the metadata specified in #problem:
let time_limit = problem.limits.time_limit
let ac_margin = problem.limits.time_multipliers.ac_to_time_limit
let tle_margin = problem.limits.time_multipliers.time_limit_to_tle
RfC1: Reconsider the key names in #problem: avoid depth 4, rename as above.
Proposal: Results and Verdicts
Result
We define the result of running a submission on a testcase for a single-pass problem.
// The result of running a submission on a testcase
#result: {
  time!: >=0
  if time <= time_limit {
    // When the submission terminates within the time limit,
    // its exit status is either a success or a failure:
    terminated_successfully!: bool
    if terminated_successfully {
      // If the submission terminated successfully, its output is
      // validated by the output validator; the resulting values are:
      output_validated!: bool
      message?:          string
      score?:            >=0
    }
  }
}
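For example, a hypothetical result for a submission that terminated successfully within the time limit of 2.0 seconds assumed above, with validated output and 40 points, looks as follows (all values invented for illustration):
// A hypothetical #result instance
r: #result & {
  time:                    0.8
  terminated_successfully: true
  output_validated:        true
  score:                   40
}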
Decisions made here; this is RfC2.
- This defines “result” (not “outcome” or “verdict”).
- time! is not optional. Every result has a time, even those that are aborted or crashed. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is equivalent. Alternatives:
  - A sum type time!: "aborted" | >=0 & <= tle_margin * time_limit, or even time!: "crashed" | "aborted" | >=0 & <= tle_margin * time_limit, by whatever names we can come up with for crashed/failed/terminated/aborted.
  - Every result instead has aborted!: bool; only those results that are not aborted must have a time!: >=0 & <= tle_margin * time_limit. The problem, here and above, is that aborted means many things (such as a runtime exception due to recursion stack overflow).
- Careful distinction between successful termination (of the process) and successful output validation; note the required field output_validated!: every submission that terminated_successfully has its output validated.
- Scores are nonnegative (score?: >=0).
- The judge message is called message. Alternative: judge_message.
- Nothing else is part of the result (memory use, team message, programming language). Maybe it should be.
- Should result.score be a required field for problem.type == "scoring"? (I think yes.)
- Is this good enough for multipass problems as well, including interactive problems? (Honest question.) What needs to be understood is which execution times are cumulative, for instance. This is out of scope for me, but worth thinking about.
Verdict
The #verdict is a shorthand for the #result.
There are six verdicts:
#verdict: "AC" | "AC-" | "WA" | "RTE" | "TLE" | "TLE-"
(Note: Verdicts may or may not be part of the information shown to the solver by a contest system, either individually or in some summary. However, the behaviour of contest systems is beyond the scope of this specification.)
Decisions made here; this is RfC3:
- A #verdict is derived from a #result, so it is something a submission has on a single testcase.
- Compared to the as-of-fall-2022 obsolete terminology of problemtools’s default grader, the specification introduces the verdicts AC- and TLE-, defined below. Read them as “accepted without margin” and “time limit exceeded without margin”. There can be many other names for these, such as TLEW/ACW or TLE?/AC? or TLEw/ACw. An alternative is to introduce AC! for “accepted with margin” and TLE! for “time limit exceeded with margin”, which are arguably clearer (but change established semantics).
- CE or JE are not verdicts. (Making them verdicts requires a richer #result.)
What verdicts mean
We can specify the semantics of each #verdict quite readably in CUE:
#verdict_for_result: {
  // The verdict for _result
  _result: #result
  let t = _result.time
  if t > time_limit*tle_margin {"TLE"}
  if t > time_limit && t <= time_limit*tle_margin {"TLE-"}
  if t <= time_limit {
    if !_result.terminated_successfully {"RTE"}
    if _result.terminated_successfully {
      if !_result.output_validated {"WA"}
      if _result.output_validated {
        if t <= time_limit/ac_margin {"AC"}
        if t > time_limit/ac_margin {"AC-"}
      }
    }
  }
}
I think this is uncontroversial, but let’s call it RfC4.
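As a sanity check, here is what the rule yields for a few hypothetical results, using the example values assumed in the Givens section (time_limit = 2.0, ac_margin = 2.0, tle_margin = 1.5, so “with margin” means t <= 1.0 and the run is aborted above 3.0):
// Hypothetical results and the verdicts derived from them
verdict_examples: {
  ac:       #verdict_for_result & {_result: {time: 0.8, terminated_successfully: true, output_validated: true}}  // "AC"
  ac_bare:  #verdict_for_result & {_result: {time: 1.5, terminated_successfully: true, output_validated: true}}  // "AC-"
  wa:       #verdict_for_result & {_result: {time: 1.5, terminated_successfully: true, output_validated: false}} // "WA"
  rte:      #verdict_for_result & {_result: {time: 1.5, terminated_successfully: false}}                         // "RTE"
  tle_bare: #verdict_for_result & {_result: {time: 2.7}} // "TLE-": over the limit, but within tle_margin
  tle:      #verdict_for_result & {_result: {time: 3.5}} // "TLE": aborted above time_limit * tle_margin
}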
Proposal: Expectations
The role of expectations is to
- ensure, during problem development, the expected behaviour of author submissions on testdata as the problem definition, testdata, and author submissions are changed or added, and
- define which author submissions are used to set the time limit.
Expectations defined
An expectation (for a submission) is defined for a set of results, primarily in terms of verdicts.
#expectation: {
  permitted?: [...#verdict] // only these verdicts may appear
  required?:  [...#verdict] // at least one of these verdicts must appear
  message?:   string        // this judge message must appear
  score?:     #range
}

#range: number | ordered_tuple
ordered_tuple: tuple=[number, number & >=tuple[0]]
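For example, with the single-number form assumed to pin the score exactly:
// Hypothetical #range values, for illustration only
exact_score:   #range & 100      // the score must be exactly 100
bounded_score: #range & [0, 65]  // the score must lie in [0, 65], inclusive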
A set of results satisfies an expectation if, for each field that is present,
- every result’s verdict is contained in permitted,
- at least one of the verdicts in required appears among the verdicts of the set of results,
- the message appears (as a substring) in at least one result.message of the set of results, and
- every result.score lies inside expectation.score.
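For example, the goal from above, “this submission gets at most 65 points on all testcases”, and a typical time-limit-exceeded expectation could be written as follows (field names invented for illustration):
// Hypothetical expectations, for illustration only
at_most_65: #expectation & {
  score: [0, 65] // every result.score is between 0 and 65
}
times_out: #expectation & {
  permitted: ["AC", "AC-", "TLE", "TLE-"] // no WA or RTE on any testcase
  required:  ["TLE"]                      // and at least one testcase truly exceeds the margin
}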
RfC5:
- Should expectation.score be a required field for problem.type == "scoring"?
- Do we need to be able to express “each of these verdicts must appear” (in addition to the above required, which means some of these verdicts must appear)? To me, this just seems lazy: instead, make a separate submission or a separate testgroup and specify each of the required verdicts separately.
Common abbreviations
Typically a problem author will not use the fine-grained #expectation struct, but instead use a common abbreviation:
#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"
_expectation_for_abbreviation: {
  _abbreviation: #abbreviation
  if _abbreviation == "accepted" {
    permitted: ["AC"]
  }
  if _abbreviation == "wrong answer" {
    permitted: ["AC", "WA"] // TODO is AC- also permitted?
    required:  ["WA"]
  }
  if _abbreviation == "runtime exception" {
    permitted: ["AC", "RTE"] // TODO is AC- also permitted?
    required:  ["RTE"]
  }
  if _abbreviation == "time limit exceeded" {
    permitted: ["AC", "AC-", "TLE", "TLE-"] // TODO not WA and not RTE, right?
    required:  ["TLE"]
  }
  if _abbreviation == "does not terminate" {
    permitted: ["AC", "AC-", "RTE", "TLE", "TLE-"] // TODO not WA, right?
    required:  ["RTE", "TLE"]
  }
  if _abbreviation == "not accepted" {
    required: ["RTE", "TLE", "WA"] // TODO should this include "TLE-"? // TODO is this useful at all?
  }
} & #expectation
- Note the TODOs; this is RfC6.
- Note the difference between the #abbreviation with value accepted and the #verdict with value AC. The former, accepted, is a claim about a set of verdicts; the latter, AC, pertains to a single testcase.
- The abbreviations can be used as the names of directories; for instance, a submission in <problemname>/wrong_answer is supposed to satisfy the expectation specified by the abbreviation wrong answer.
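As an illustration, unfolding the abbreviation wrong answer through _expectation_for_abbreviation yields the following concrete expectation (modulo the open TODOs; the field name wa is invented):
wa: _expectation_for_abbreviation & {_abbreviation: "wrong answer"}
// wa evaluates to {permitted: ["AC", "WA"], required: ["WA"]}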
Expectations for testdata
#root_expectations: {
  // Set expectations for all testcases
  #expectation | #range | #abbreviation
  // And/or set them per testgroup or testcase
  [=~"^(sample|secret)"]: #expectation | #range | #abbreviation
}
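As a sketch of the intended use, the per-testgroup expectations for a single submission might look as follows (group names taken from the goals section; shown in CUE syntax rather than the equivalent YAML of expectations.yaml):
// Hypothetical per-testgroup expectations for one submission
sample:                     "accepted"
"secret/group1":            "accepted"
"secret/03-petersen_graph": {permitted: ["AC", "WA"], required: ["WA"]}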
Terminology
author submission: A submission written by a problem author, typically used during development. It is evaluated by the development framework.
solver submission: A submission written by a solver, typically during a contest or course after the problem is installed. It is evaluated by a judging system.
submission: An author submission or a solver submission.
execution time: Prefer this term to “running time” (because of the confusion with “runtime”, which is something very different).
Spelling:
- time limit
- testcase
- testgroup
- testdata
- runtime