
Specify results, verdicts, time limit margins, and expectations (0.5) #133


Closed
thorehusfeldt opened this issue Nov 19, 2023 · 10 comments


thorehusfeldt commented Nov 19, 2023

Goals

Primary Design Goals

  1. Precisely specify the meaning of “verdict” and “result”

  2. Define the semantics for setting time limit within “safety” margins from judge submissions

  3. Provide clear semantics for expectations

Secondary Design Goals

  1. Allow bespoke expectation profiles beyond "accepted", "wrong answer", etc.

  2. Expectations for judge messages (“on some testcase, the submission produces guess out of bounds”) so that we can specify full test coverage for custom output validators

  3. Expectations for scores (”this submission gets at most 65 points on all testcases”)

  4. Allow expectations to be specified for test groups like sample or secret/group1 or even secret/03-petersen_graph

  5. Provide a schema for expectations specified in a separate expectations.yaml file

Constraints

  1. Stay close to existing terminology and traditions

Out of scope:

  • how should a contest system give feedback to the solver
  • interactive multipass

Givens

Relevant metadata is given in problem.yaml, specified as follows:

#problem {
    type: "pass-fail" | "scoring"
    limits?: {
        time_limit: number
        time_multipliers: {
            ac_to_time_limit: number
            time_limit_to_tle: number
        }   
        ...
    }
    ...
}
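
For concreteness, a hypothetical instance (all values invented for illustration; the worked examples below reuse them):

#problem & {
	type: "pass-fail"
	limits: {
		time_limit: 2.0
		time_multipliers: {
			ac_to_time_limit:  2.0
			time_limit_to_tle: 1.5
		}
	}
}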

Timing-related abbreviations

For readability, the present document introduces aliases for the metadata specified in #problem:

let time_limit = problem.limits.time_limit
let ac_margin = problem.limits.time_multipliers.ac_to_time_limit
let tle_margin = problem.limits.time_multipliers.time_limit_to_tle

RfC1: Reconsider the key names in #problem: avoid depth 4, rename as above

Proposal: Results and Verdicts

Result

We define the result of running a submission on a testcase for a single-pass problem.

// The result of running a submission on a testcase
#result: {
	time!: >=0
	if time <= time_limit {
		// When the submission terminates within the time limit
		// its exit status is either a success or a failure:
		terminated_successfully!: bool
		if terminated_successfully {
			// The output of successfully terminated submissions is
			// validated by the output validator; the resulting values are:
			output_validated!: bool
			message?:          string
			score?:            >=0
		}
	}
}
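
For illustration, two hypothetical results (field values invented), assuming the metadata sketched above (time_limit = 2.0):

// Terminated successfully within the time limit; output was validated
result_ok: #result & {
	time:                    0.8
	terminated_successfully: true
	output_validated:        true
	message:                 "correct"
}

// Exceeded every margin; by design, no other fields apply
result_slow: #result & {
	time: 10.0
}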

Decisions made here, this is RfC2.

  1. This defines “result” (not “outcome” or “verdict”).

  2. time! is not optional. Every result has a time, even those that are aborted or crashed. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is the same.
    Alternatives:

    1. A sum type time!: "aborted" | >= 0 & <= tle_margin * time_limit or even time!: "crashed" | "aborted" | >= 0 & <= tle_margin * time_limit, by whatever names we can come up with for crashed/failed/terminated/aborted.

    2. Every result instead has aborted!: bool; only those results that are not aborted must have a time!: >= 0 & <= tle_margin * time_limit. The problem, here and above, is that aborted means many things (such as a runtime exception due to recursion stack overflow).

  3. Careful distinction between successful termination (of the process) and successful output validation; note the required field output_validated!: every submission that terminated_successfully has its output validated.

  4. Scores are positive

  5. The judge message is called message. Alternative: judge_message.

  6. Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.

  7. Should result.score be a required field for problem.type == "scoring"? (I think yes.)

  8. Is this good enough for multipass problems as well, including interactive problems? (Honest question.) What needs to be understood is which execution times are cumulative, for instance. This is out of scope for me, but worth thinking about.

Verdict

The #verdict is a shorthand for the #result.
There are six verdicts:

#verdict: "AC" | "AC-" | "WA" | "RTE" | "TLE" | "TLE-"

(Note: Verdicts may or may not be part of the information shown to the solver by a contest system, either individually or in some summary.
However, the behaviour of contest systems is beyond the scope of this specification.)

Decisions made here, this is RfC3:

  1. A #verdict is derived from a #result, so it is something a submission has on a single testcase.

  2. Compared to the as-of-fall-2022 obsolete terminology of problemtools’s default grader, the specification introduces the verdicts TLE- and AC-, defined below. Read them as “time limit exceeded without margin” and “accepted without margin”, respectively. There can be many other names for these, such as TLEW/ACW or TLE?/AC? or TLEw/ACw. An alternative is to introduce AC! for “accepted with margin” and TLE! for “time limit exceeded with margin”, which are arguably clearer (but change established semantics).

  3. CE or JE are not verdicts. (Making them verdicts requires a richer #result.)

What verdicts mean

We can specify the semantics of each #verdict quite readably in CUE:

#verdict_for_result: {
	// The verdict for the _result
	_result:  #result
	let t = _result.time
	if t > time_limit*tle_margin {"TLE"}
	if t > time_limit && t <= time_limit*tle_margin {"TLE-"}
	if t <= time_limit {
		if !_result.terminated_successfully {"RTE"}
		if _result.terminated_successfully {
			if !_result.output_validated {"WA"}
			if _result.output_validated {
				if t <= time_limit/ac_margin {"AC"}
				if t > time_limit/ac_margin {"AC-"}
			}
		}
	}
}
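
With the hypothetical metadata sketched above (time_limit = 2.0, ac_margin = 2.0, tle_margin = 1.5), a successfully terminating, output-validated run would receive:

//       t <= 1.0  ->  "AC"    (within time_limit/ac_margin)
// 1.0 < t <= 2.0  ->  "AC-"
// 2.0 < t <= 3.0  ->  "TLE-"
// 3.0 < t         ->  "TLE"   (beyond time_limit*tle_margin)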

I think this is uncontroversial, but let’s call it RfC4.

Proposal: Expectations

The role of expectations is to

  1. ensure, during problem development, the expected behaviour of author submissions on testdata as the problem definition, testdata, and author submissions are changed or added, and

  2. define which author submissions are used to set the time limit.

Expectations defined

An expectation (for a submission) is defined for a set of results, primarily in terms of verdicts.

#expectation: {
	permitted?: [...#verdict] // only these verdicts may appear
	required?: [...#verdict] // at least one of these verdicts must appear
	message?:                string // this judge message must appear
	score?:                  #range
}

#range: number | ordered_tuple
ordered_tuple: tuple=[number, number & >=tuple[0]]
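
For example, a hypothetical expectation for a submission on a scoring problem (all values invented for illustration) could read:

partially_scoring: #expectation & {
	permitted: ["AC"]   // no testcase may fail
	score:     [40, 65] // every result.score lies in this closed range
}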

A set of results satisfies an expectation if

  1. for each result, the result's verdict is contained in permitted

  2. at least one of the verdicts in required appears among the verdicts of the set of results

  3. the message appears (as a substring) in at least one result.message of the set of results

  4. every result.score lies inside expectation.score
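
For illustration (hypothetical verdicts): the expectation permitted: ["AC", "WA"], required: ["WA"] is satisfied by a result set with verdicts {AC, AC, WA}, but neither by {AC, RTE, WA} (RTE is not permitted) nor by {AC, AC} (no required WA appears).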

RfC5:

  1. should expectation.score be a required field for problem.type == "scoring"?

  2. do we need to be able to express “each of these verdicts must appear” (in addition to the above required, which means some of these verdicts must appear)? To me, this just seems lazy: instead,
    make a separate submission or a separate testgroup and specify each of the required verdicts separately.

Common abbreviations

Typically a problem author will not use the fine-grained #expectation struct,
but instead use a common abbreviation:

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

_expectation_for_abbreviation: {
	_abbreviation: #abbreviation
	if _abbreviation == "accepted" {
		permitted: ["AC"]
	}
	if _abbreviation == "wrong answer" {
		permitted: ["AC", "WA"] // TODO is AC- also permitted?
		required: ["WA"]
	}
	if _abbreviation == "runtime exception" {
		permitted: ["AC", "RTE"] // TODO is AC- also permitted?
		required: ["RTE"]
	}
	if _abbreviation == "time limit exceeded" {
		permitted: ["AC", "AC-", "TLE", "TLE-"] // TODO not WA and not RTE, right?
		required: ["TLE"]
	}
	if _abbreviation == "does not terminate" {
		permitted: ["AC", "AC-", "RTE", "TLE", "TLE-"] // TODO not WA, right?
		required: ["RTE", "TLE"]
	}
	if _abbreviation == "not accepted" {
		required: ["RTE", "TLE", "WA"] // TODO should this include "TLE-"? // TODO is this useful at all?
	}
} & #expectation

  1. Note the TODOs; this is RfC6.

  2. Note the difference between the #abbreviation with value accepted and the #verdict with value AC.
    The former, accepted, is a claim about a set of verdicts; the latter, AC, pertains to a single testcase.

  3. The abbreviations can be used as the names of directories; for instance, a submission in <problemname>/wrong_answer is supposed to satisfy the expectations specified by the abbreviation wrong answer.
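
For instance, a hypothetical layout (directory and file names invented for illustration) could look like this:

<problemname>/
	accepted/           // submissions expected to satisfy "accepted"
	wrong_answer/       // submissions expected to satisfy "wrong answer"
	does_not_terminate/ // submissions expected to satisfy "does not terminate"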

Expectations for testdata

#root_expectations: {
	// Set expectations for all testcases
	#expectation | #range | #abbreviation

	// And/or set them per testgroup or testcase
	[=~"^(sample|secret)"]: #expectation | #range | #abbreviation
}
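
A hypothetical instance (testgroup names invented for illustration), combining a problem-wide abbreviation with a per-testgroup refinement, might look like:

#root_expectations & {
	"wrong answer"     // on the whole testdata: at least one WA, nothing but AC or WA
	sample: "accepted" // ... but every sample testcase must pass
}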

Terminology

author submission
: A submission written by a problem author, typically used during development. It is evaluated by the developer framework.

solver submission
: A submission written by a solver, typically during a contest or course after the problem is installed. It is evaluated by a judging system.

submission
: An author submission or a solver submission

execution time
: Prefer this to running time (because of the confusion with runtime, which is something very different).

Spelling:

  • time limit
  • testcase
  • testgroup
  • testdata
  • runtime

niemela commented Nov 26, 2023

RfC1: Reconsider the key names in #problem: avoid depth 4, rename as above

I would argue that the current naming is depth 3 (limits / time_multipliers / ac_to_time_limit or time_limit_to_tle), but also that it's more than needed.

It's a bit unclear what is meant by "rename as above"; are you suggesting to replace time_multipliers and its subkeys with tle_margin and ac_margin? I would prefer them to have the time_ prefix.

I assume that the tle_margin = ....ac_to_time_limit and ac_margin = ....time_limit_to_tle is a typo, and you meant it the other way around? Is that right?


niemela commented Nov 26, 2023

if time <= time_limit {

Shouldn't this be < (strictly less)?

output_validated!: bool

I would argue that the output is always validated, but not always correct. So, maybe output_correct?

score?:            >=0

Should we actually allow score of 0? (I think maybe we should, but I'm not 100% sure).


niemela commented Nov 26, 2023

if time <= time_limit {

2. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is the same.

This seems to assume that time_limit is known at the time of running every testcase, but that isn't necessarily the case.


niemela commented Nov 26, 2023

time! is not optional. Every result has a time, even those that are aborted or crashed. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is the same. Alternatives:

It's a bit unclear what exactly "never compared to anything above" actually means, but I do think I understand the gist of it.

This sounds reasonable to me.

[Alt 1]
A sum type time!: "aborted" | >= 0 & <= tle_margin * time_limit or even time!: "crashed" | "aborted" | >= 0 & <= tle_margin * time_limit, by whatever names we can come up with for crashed/failed/terminated/aborted.

I don't like this alternative. Mixing strings and numbers (and calling that variable "time") sounds worse in every way.

[Alt 2]
Every result instead has aborted!: bool; only those results that are not aborted must have a time!: >= 0 & <= tle_margin * time_limit. The problem, here and above, is that aborted means many things (such as a runtime exception due to recursion stack overflow).

I'm not sure I agree that aborted should include RTE. To me, in this context, aborted means that the system killed the submission. A runtime exception is the submission killing itself (i.e. aborted != crashed). Running out of memory is maybe a corner case as well? In any case, I don't like this alternative either.


niemela commented Nov 26, 2023

4. Scores are positive

But that's not how you write it. The CUE-specification says non-negative.


niemela commented Nov 26, 2023

6. Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.

Memory use:
if time is part of the result, why isn't memory use? They are equally useful in determining the verdict. In fact, why is time part of the result, when we only actually need the verdict? I'm a bit confused about what the purpose of defining "result" (in this way) is.

Team message:
I think this is fine to leave out. I don't think we should be asserting on team messages.

programming language:
Same.


niemela commented Nov 26, 2023

7. Should result.score be a required field for problem.type == "scoring"? (I think yes.)

It's unclear to me what "required" means here. Are you saying that it's only defined if time <= time_limit & terminated_successfully, but if so it's required? It's a bit strange that terminated_successfully and output_validated are treated differently.


thorehusfeldt commented Nov 27, 2023

Here is a complete specification of Expectations-no-minus.

Pay attention to

  1. the set of #verdict (which has lost AC- and TLE-) and
  2. the new concept #timing (which discretises result.time)
  3. The #expectations struct has learned two new keys, required_timings and permitted_timings.
  4. does not terminate has now changed semantics (I can no longer specify that the TLE execution exceeds the time limit with margin.)

To me, this seems like a lot of new infrastructure and redundancy merely to avoid the minus verdicts, but if this makes the pill easier to swallow: fine. Note that the #timings "fast enough" and "too slow" are never needed in any use case I can think of, so maybe they can be collapsed (but I don’t know what the collapsed timing should be called).

#registry

#registry: close({[string]: #root_expectations})

// Problem-wide constants
// ======================
// Set by the judging environment
time_limit: >0
tle_margin: >1 // a.k.a. problem.limits.time_multipliers.time_limit_to_tle
ac_margin:  >1 // a.k.a. problem.limits.time_multipliers.ac_to_time_limit

// Testcase Result
// ===============
// The result of running a submission on a testcase

#result: {
	time!: >=0
	if time <= time_limit {
		// When the submission terminates within the time limit
		// its exit status is either a success or a failure:
		terminated_successfully!: bool
		if terminated_successfully {
			// The output of successfully terminated submissions is
			// validated by the output validator, the resulting values are:
			output_validated!: bool
			message?:          string
			score?:            >=0
		}
	}
}

// Testcase Verdict
// ================
//
// The verdict of a testcase summarises the result as one of four values

#verdict: "AC" | "WA" | "RTE" | "TLE"

#verdict_for_result: {
	// The verdict for the _result
	_result: #result
	if _result.time >= time_limit {"TLE"}
	if _result.time < time_limit {
		if !_result.terminated_successfully {"RTE"}
		if _result.terminated_successfully {
			if !_result.output_validated {"WA"}
			if _result.output_validated {"AC"}
		}
	}
}

// Testcase Timing
// ================
//
// The timing of a testcase summarises the result as one of four values

#timing: "fast enough with margin" | "fast enough" | "too slow" | "too slow with margin"

#timing_for_result: {
	// The timing for the _result
	_result: #result
	let t = _result.time
	if t >= time_limit {
		if t >= time_limit*tle_margin {"too slow with margin"}
		if t < time_limit*tle_margin {"too slow"}
	}
	if t < time_limit {
		if t >= time_limit/ac_margin {"fast enough"}
		if t < time_limit/ac_margin {"fast enough with margin"}
	}
}
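
// For illustration (hypothetical values, not part of the specification):
// with time_limit = 2.0, ac_margin = 2.0, and tle_margin = 1.5,
//
//        t <  1.0  ->  "fast enough with margin"
// 1.0 <= t <  2.0  ->  "fast enough"
// 2.0 <= t <  3.0  ->  "too slow"
// 3.0 <= t         ->  "too slow with margin"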

// Expectations
// ============

#root_expectations: {
	// Set expectations for all testcases
	#expectation | #range | #abbreviation

	// And/or set them per testgroup:
	[=~"^(sample|secret)"]: #expectation | #range | #abbreviation
}

// Often-used expectations are specified in terms of abbreviations

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

// Scoring problems can set the range

#range: number | ordered_tuple

ordered_tuple: tuple=[number, number & >=tuple[0]]

// In general, we can set fine-grained expectations in terms of which verdicts and timings
// are allowed and disallowed

#expectation: {
	permitted_verdicts?: [...#verdict] // only these verdicts may appear
	required_verdicts?: [...#verdict]  // at least one of these verdicts must appear
	permitted_timings?: [...#timing]   // only these timings may appear
	required_timings?: [...#timing]    // at least one of these timings must appear
	
	message?:                string // this judge message must appear
	score?:                  #range
}

// Each abbreviation stands for a set of required and permitted verdicts, as follows:

_expectation_for_abbreviation: {
	_abbreviation: #abbreviation
	if _abbreviation == "accepted" {
		permitted_verdicts: ["AC"]
		permitted_timings: ["fast enough with margin"]
	}
	if _abbreviation == "wrong answer" {
		permitted_verdicts: ["AC", "WA"]
		required_verdicts: ["WA"]
	}
	if _abbreviation == "runtime exception" {
		permitted_verdicts: ["AC", "RTE"]
		required_verdicts: ["RTE"]
	}
	if _abbreviation == "time limit exceeded" {
		permitted_verdicts: ["AC", "TLE"]
		required_timings: ["too slow with margin"]
	}
	if _abbreviation == "does not terminate" {
		permitted_verdicts: ["AC", "RTE", "TLE"] 
		required_verdicts: ["RTE", "TLE"]
	}
	if _abbreviation == "not accepted" {
		required_verdicts: ["RTE", "TLE", "WA"]
	}
} & #expectation

As one of many use cases beyond specifying the semantics of the default /submissions directories, the framework allows, for instance, defining the #expectation

    required_verdicts: ["AC"]

Note that this is not the same expectation as accepted. A problem author could, for instance, adopt the policy of putting their Java submissions into the directory

submissions/sluggish_ac

and specifying the above #expectation for that directory. (For instance, such an association could be registered in the expectations.yaml file.) This would be a good place to put submissions that “should actually still get AC, but may be pretty close to the time limit”.
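
For illustration, such a registration could be expressed against the #registry schema above (the directory name sluggish_ac is invented, and the exact layout of expectations.yaml is still open):

#registry & {
	"sluggish_ac": {
		required_verdicts: ["AC"]
	}
}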


Tagl commented Nov 27, 2023

Team message:
I think this is fine to leave out. I don't think we should be asserting on team messages.

I do disagree with that, but there is a way to get around it by writing the output validator such that a matching judge message appears, so I guess it's fine to not have it included.

@thorehusfeldt

Became #135
