Specify results, verdicts, time limit margins, and expectations (0.5) #133

Closed
@thorehusfeldt

Description

Goals

Primary Design Goals

  1. Precisely specify the meaning of “verdict” and “result”

  2. Define the semantics for setting the time limit within “safety” margins derived from judge submissions

  3. Provide clear semantics for expectations

Secondary Design Goals

  1. Allow bespoke expectation profiles beyond "accepted", "wrong answer", etc.

  2. Expectations for judge messages (“on some testcase, the submission produces guess out of bounds”) so that we can specify full test coverage for custom output validators

  3. Expectations for scores (”this submission gets at most 65 points on all testcases”)

  4. Allow expectations to be specified for test groups like sample or secret/group1 or even secret/03-petersen_graph

  5. Provide a schema for expectations specified in a separate expectations.yaml file

Constraints

  1. Stay close to existing terminology and traditions

Out of scope:

  • how a contest system should give feedback to the solver
  • interactive and multipass problems

Givens

Relevant metadata is given in problem.yaml, specified as follows:

#problem: {
    type: "pass-fail" | "scoring"
    limits?: {
        time_limit: number
        time_multipliers: {
            ac_to_time_limit: number
            time_limit_to_tle: number
        }   
        ...
    }
    ...
}
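
For concreteness, a minimal instance that validates against #problem might look as follows; all values are invented for illustration:

// A hypothetical problem configuration (example values only)
problem: #problem & {
	type: "pass-fail"
	limits: {
		time_limit: 2.0
		time_multipliers: {
			ac_to_time_limit:  2.0
			time_limit_to_tle: 1.5
		}
	}
}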

Timing-related abbreviations

For readability, the present document introduces aliases for the metadata specified in #problem:

let time_limit = problem.limits.time_limit
let ac_margin = problem.limits.time_multipliers.ac_to_time_limit
let tle_margin = problem.limits.time_multipliers.time_limit_to_tle

RfC1: Reconsider the key names in #problem: avoid nesting four levels deep; rename as above.

Proposal: Results and Verdicts

Result

We define the result of running a submission on a testcase for a single-pass problem.

// The result of running a submission on a testcase
#result: {
	time!: >=0
	if time <= time_limit {
		// When the submission terminates within the time limit
		// its exit status is either a success or a failure:
		terminated_successfully!: bool
		if terminated_successfully {
			// If the submission terminated successfully, its output is
			// validated by the output validator; the resulting values are:
			output_validated!: bool
			message?:          string
			score?:            >=0
		}
	}
}
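
As a sketch, here are two hypothetical instances of #result (values invented, with time_limit in scope as above). Note that a result whose time exceeds the time limit carries no further fields:

// A run that terminated successfully and whose output was accepted
accepted_result: #result & {
	time:                    0.73
	terminated_successfully: true
	output_validated:        true
}

// A run that ran far beyond the time limit; no other fields are required
timed_out_result: #result & {
	time: 100
}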

Decisions made here (this is RfC2):

  1. This defines “result” (not “outcome” or “verdict”).

  2. time! is not optional: every result has a time, even for runs that were aborted or crashed. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that threshold is equivalent.
    Alternatives:

    1. A sum type time!: "aborted" | >= 0 & <= tle_margin * time_limit or even time!: "crashed" | "aborted" | >= 0 & <= tle_margin * time_limit, by whatever names we can come up with for crashed/failed/terminated/aborted.

    2. Every result instead has aborted!: bool; only those results that are not aborted must have a time!: >= 0 & <= tle_margin * time_limit. The problem, here and above, is that aborted means many things (such as a runtime exception due to recursion stack overflow).

  3. Careful distinction between successful termination (of the process) and successful output validation; note the required field output_validated!: every submission that terminated successfully has its output validated.

  4. Scores are nonnegative (score?: >=0)

  5. The judge message is called message. Alternative: judge_message.

  6. Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.

  7. Should result.score be a required field for problem.type == "scoring"? (I think yes.)

  8. Is this good enough for multipass problems as well, including interactive problems? (Honest question.) What needs to be understood is which execution times are cumulative, for instance. This is out of scope for me, but worth thinking about.

Verdict

A #verdict is a shorthand for a #result.
There are six verdicts:

#verdict: "AC" | "AC-" | "WA" | "RTE" | "TLE" | "TLE-"

(Note: Verdicts may or may not be part of the information shown to the solver by a contest system, either individually or in some summary.
However, the behaviour of contest systems is beyond the scope of this specification.)

Decisions made here, this is RfC3:

  1. A #verdict is derived from a #result, so it is something a submission has on a single testcase.

  2. Compared to the as-of-fall-2022 obsolete terminology of problemtools’s default grader, the specification introduces the verdicts TLE- and AC- defined below. Read them as “time limit exceeded without margin” and “accepted without margin.” There can be many other names for these, such as TLEW/ACW or TLE?/AC? or TLEw/ACw. An alternative is to introduce AC! for “accepted with margin” and TLE! for “time limit exceeded with margin”, which are arguably clearer (but would change established semantics).

  3. CE or JE are not verdicts. (Making them verdicts requires a richer #result.)

What verdicts mean

We can specify the semantics of each #verdict quite readably in CUE:

#verdict_for_result: {
	// The verdict for the _result
	_result:  #result
	let t = _result.time
	if t > time_limit*tle_margin {"TLE"}
	if t > time_limit && t <= time_limit*tle_margin {"TLE-"}
	if t <= time_limit {
		if !_result.terminated_successfully {"RTE"}
		if _result.terminated_successfully {
			if !_result.output_validated {"WA"}
			if _result.output_validated {
				if t <= time_limit/ac_margin {"AC"}
				if t > time_limit/ac_margin {"AC-"}
			}
		}
	}
}
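
A usage sketch with invented numbers: take time_limit = 2.0, ac_margin = 2.0, and tle_margin = 1.5, so the verdict boundaries fall at 1.0 (AC/AC-), 2.0 (AC-/TLE-), and 3.0 (TLE-/TLE) seconds:

// Hypothetical: an accepted run at 0.8 s gets the verdict "AC",
// since 0.8 <= time_limit/ac_margin == 1.0
v: #verdict_for_result & {
	_result: {
		time:                    0.8
		terminated_successfully: true
		output_validated:        true
	}
}
// At time: 1.4 the verdict would be "AC-", at 2.5 "TLE-", at 3.5 "TLE"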

I think this is uncontroversial, but let’s call it RfC4.

Proposal: Expectations

The role of expectations is to

  1. ensure, during problem development, the expected behaviour of author submissions on testdata as the problem definition, testdata, and author submissions are changed or added, and

  2. define which author submissions are used to set the time limit.

Expectations defined

An expectation (for a submission) is defined for a set of results, primarily in terms of verdicts.

#expectation: {
	permitted?: [...#verdict] // only these verdicts may appear
	required?: [...#verdict] // at least one of these verdicts must appear
	message?:                string // this judge message must appear
	score?:                  #range
}

#range: number | ordered_tuple // a single value or an inclusive interval
ordered_tuple: tuple=[number, number & >=tuple[0]] // [lo, hi] with lo <= hi

A set of results satisfies an expectation if

  1. each result's verdict is contained in permitted (when permitted is given),

  2. at least one of the verdicts in required appears among the verdicts of the set of results,

  3. the message appears (as a substring) in at least one result.message of the set of results, and

  4. every result.score lies within expectation.score.
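
As an example of a bespoke profile (Secondary Design Goal 1), the following hypothetical expectation says that the submission must get WA on at least one testcase, may otherwise only get AC or AC-, and never scores above 65 points:

// A hypothetical bespoke expectation profile; name and numbers invented
partially_correct: #expectation & {
	permitted: ["AC", "AC-", "WA"]
	required:  ["WA"]
	score:     [0, 65] // every result.score must lie in [0, 65]
}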

RfC5:

  1. should expectation.score be a required field for problem.type == "scoring"?

  2. do we need to be able to express “each of these verdicts must appear” (in addition to the above required, which means some of these verdicts must appear)? To me, this just seems lazy: instead,
    make a separate submission or a separate testgroup and specify each of the required verdicts separately.

Common abbreviations

Typically a problem author will not use the fine-grained #expectation struct,
but instead use a common abbreviation:

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

_expectation_for_abbreviation: {
	_abbreviation: #abbreviation
	if _abbreviation == "accepted" {
		permitted: ["AC"]
	}
	if _abbreviation == "wrong answer" {
		permitted: ["AC", "WA"] // TODO is AC- also permitted?
		required: ["WA"]
	}
	if _abbreviation == "runtime exception" {
		permitted: ["AC", "RTE"] // TODO is AC- also permitted?
		required: ["RTE"]
	}
	if _abbreviation == "time limit exceeded" {
		permitted: ["AC", "AC-", "TLE", "TLE-"] // TODO not WA and not RTE, right?
		required: ["TLE"]
	}
	if _abbreviation == "does not terminate" {
		permitted: ["AC", "AC-", "RTE", "TLE", "TLE-"] // TODO not WA, right?
		required: ["RTE", "TLE"]
	}
	if _abbreviation == "not accepted" {
		required: ["RTE", "TLE", "WA"] // TODO should this include "TLE-"? // TODO is this useful at all?
	}
} & #expectation
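
A usage sketch: unifying the map with a concrete abbreviation yields the corresponding expectation.

// Hypothetical usage; wa evaluates to the expectation for "wrong answer",
// i.e. {permitted: ["AC", "WA"], required: ["WA"]}
wa: _expectation_for_abbreviation & {_abbreviation: "wrong answer"}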

  1. Note the TODOs; this is RfC6.

  2. Note the difference between the #abbreviation with value accepted and the #verdict with value AC.
    The former, accepted, is a claim about a set of verdicts; the latter, AC, pertains to a single testcase.

  3. The abbreviations can be used as the names of directories; for instance, a submission in <problemname>/wrong_answer is supposed to satisfy the expectations specified by the abbreviation wrong answer.

Expectations for testdata

#root_expectations: {
	// Set expectations for all testcases
	#expectation | #range | #abbreviation

	// And/or set them per testgroup or testcase
	[=~"^(sample|secret)"]: #expectation | #range | #abbreviation
}
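
A sketch of what an entry in a separate expectations.yaml could look like, written here as CUE data; the submission key and the group names are invented for illustration:

// Hypothetical expectations for one submission, keyed by its path
"wrong_answer/th.py": {
	sample: "accepted" // abbreviation, applies to the sample group
	"secret/group1": { // a full #expectation for one secret group
		permitted: ["AC", "WA"]
		required:  ["WA"]
	}
	"secret/03-petersen_graph": [0, 65] // a #range for a single testgroup
}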

Terminology

author submission
: A submission written by a problem author, typically used during development. It is evaluated by the developer framework.

solver submission
: A submission written by a solver, typically during a contest or course after the problem is installed. It is evaluated by a judging system.

submission
: An author submission or a solver submission

execution time
: Prefer this to “running time” (because of the confusion with “runtime”, which is something very different)

Spelling:

  • time limit
  • testcase
  • testgroup
  • testdata
  • runtime
