
Specify results, verdicts, time limit margins, and expectations (0.5) #133


Closed
thorehusfeldt opened this issue Nov 19, 2023 · 10 comments


thorehusfeldt commented Nov 19, 2023

Goals

Primary Design Goals

  1. Precisely specify the meaning of “verdict” and “result”

  2. Define the semantics for setting time limit within “safety” margins from judge submissions

  3. Provide clear semantics for expectations

Secondary Design Goals

  1. Allow bespoke expectation profiles beyond "accepted", "wrong answer", etc.

  2. Expectations for judge messages (“on some testcase, the submission produces guess out of bounds”) so that we can specify full test coverage for custom output validators

  3. Expectations for scores (”this submission gets at most 65 points on all testcases”)

  4. Allow expectations to be specified for test groups like sample or secret/group1 or even secret/03-petersen_graph

  5. Provide a schema for expectations specified in a separate expectations.yaml file

Constraints

  1. Stay close to existing terminology and traditions

Out of scope:

  • how should a contest system give feedback to the solver
  • interactive multipass

Givens

Relevant metadata is given in problem.yaml, specified as follows:

#problem {
    type: "pass-fail" | "scoring"
    limits?: {
        time_limit: number
        time_multipliers: {
            ac_to_time_limit: number
            time_limit_to_tle: number
        }   
        ...
    }
    ...
}
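
For concreteness, a hypothetical instance (all values invented for illustration; the worked examples below reuse them):

#problem & {
	type: "pass-fail"
	limits: {
		time_limit: 2.0
		time_multipliers: {
			ac_to_time_limit:  2.0
			time_limit_to_tle: 1.5
		}
	}
}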

Timing-related abbreviations

For readability, the present document introduces aliases for the metadata specified in #problem:

let time_limit = problem.limits.time_limit
let ac_margin = problem.limits.time_multipliers.ac_to_time_limit
let tle_margin = problem.limits.time_multipliers.time_limit_to_tle

RfC1: Reconsider the key names in #problem: avoid depth 4, rename as above

Proposal: Results and Verdicts

Result

We define the result of running a submission on a testcase for a single-pass problem.

// The result of running a submission on a testcase
#result: {
	time!: >=0
	if time <= time_limit {
		// When the submission terminates within the time limit
		// its exit status is either a success or a failure:
		terminated_successfully!: bool
		if terminated_successfully {
			// The output of successfully terminated submissions is
			// validated by the output validator; the resulting values are:
			output_validated!: bool
			message?:          string
			score?:            >=0
		}
	}
}
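
For illustration, two hypothetical results (field values invented), assuming the metadata sketched above (time_limit = 2.0):

// Terminated successfully within the time limit; output was validated
result_ok: #result & {
	time:                    0.8
	terminated_successfully: true
	output_validated:        true
	message:                 "correct"
}

// Exceeded every margin; by design, no other fields apply
result_slow: #result & {
	time: 10.0
}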

Decisions made here, this is RfC2.

  1. This defines “result” (not “outcome” or “verdict”).

  2. time! is not optional. Every result has a time, even those that are aborted or crashed. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is the same.
    Alternatives:

    1. A sum type time!: "aborted" | >= 0 & <= tle_margin * time_limit or even time!: "crashed" | "aborted" | >= 0 & <= tle_margin * time_limit, by whatever names we can come up with for crashed/failed/terminated/aborted.

    2. Every result instead has aborted!: bool; only those results that are not aborted must have a time!: >= 0 & <= tle_margin * time_limit. The problem, here and above, is that aborted means many things (such as a runtime exception due to recursion stack overflow).

  3. Careful distinction between successful termination (of the process) and successful output validation; note the required field output_validated!: every submission that terminated_successfully has its output validated.

  4. Scores are positive

  5. The judge message is called message. Alternative: judge_message.

  6. Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.

  7. Should result.score be a required field for problem.type == "scoring"? (I think yes.)

  8. Is this good enough for multipass problems as well, including interactive problems? (Honest question.) What needs to be understood is which execution times are cumulative, for instance. This is out of scope for me, but worth thinking about.

Verdict

The #verdict is a shorthand for the #result.
There are six verdicts:

#verdict: "AC" | "AC-" | "WA" | "RTE" | "TLE" | "TLE-"

(Note: Verdicts may or may not be part of the information shown to the solver by a contest system, either individually or in some summary.
However, the behaviour of contest systems is beyond the scope of this specification.)

Decisions made here, this is RfC3:

  1. A #verdict is derived from a #result, so it is something a submission has on a single testcase.

  2. Compared to the as-of-fall-2022 obsolete terminology of problemtools’s default grader, the specification introduces the verdicts TLE- and AC-, defined below. Read them as “time limit exceeded without margin” and “accepted without margin”, respectively. There can be many other names for these, such as TLEW/ACW or TLE?/AC? or TLEw/ACw. An alternative is to introduce AC! for “accepted with margin” and TLE! for “time limit exceeded with margin”, which are arguably clearer (but change established semantics).

  3. CE or JE are not verdicts. (Making them verdicts requires a richer #result.)

What verdicts mean

We can specify the semantics of each #verdict quite readably in CUE:

#verdict_for_result: {
	// The verdict for the _result
	_result:  #result
	let t = _result.time
	if t > time_limit*tle_margin {"TLE"}
	if t > time_limit && t <= time_limit*tle_margin {"TLE-"}
	if t <= time_limit {
		if !_result.terminated_successfully {"RTE"}
		if _result.terminated_successfully {
			if !_result.output_validated {"WA"}
			if _result.output_validated {
				if t <= time_limit/ac_margin {"AC"}
				if t > time_limit/ac_margin {"AC-"}
			}
		}
	}
}
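
With the hypothetical metadata sketched above (time_limit = 2.0, ac_margin = 2.0, tle_margin = 1.5), a successfully terminating, output-validated run would receive:

//       t <= 1.0  ->  "AC"    (within time_limit/ac_margin)
// 1.0 < t <= 2.0  ->  "AC-"
// 2.0 < t <= 3.0  ->  "TLE-"
// 3.0 < t         ->  "TLE"   (beyond time_limit*tle_margin)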

I think this is uncontroversial, but let’s call it RfC4.

Proposal: Expectations

The role of expectations is to

  1. ensure, during problem development, the expected behaviour of author submissions on testdata as the problem definition, testdata, and author submissions are changed or added, and

  2. define which author submissions are used to set the time limit.

Expectations defined

An expectation (for a submission) is defined for a set of results, primarily in terms of verdicts.

#expectation: {
	permitted?: [...#verdict] // only these verdicts may appear
	required?: [...#verdict] // at least one of these verdicts must appear
	message?:                string // this judge message must appear
	score?:                  #range
}

#range: number | ordered_tuple
ordered_tuple: tuple=[number, number & >=tuple[0]]
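
For example, a hypothetical expectation for a submission on a scoring problem (all values invented for illustration) could read:

partially_scoring: #expectation & {
	permitted: ["AC"]   // no testcase may fail
	score:     [40, 65] // every result.score lies in this closed range
}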

A set of results satisfies an expectation if

  1. for each result, the result's verdict is contained in permitted

  2. at least one of the verdicts in required appears among the verdicts of the set of results

  3. the message appears (as a substring) in at least one result.message of the set of results

  4. every result.score lies inside expectation.score
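
For illustration (hypothetical verdicts): the expectation permitted: ["AC", "WA"], required: ["WA"] is satisfied by a result set with verdicts {AC, AC, WA}, but neither by {AC, RTE, WA} (RTE is not permitted) nor by {AC, AC} (no required WA appears).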

RfC5:

  1. should expectation.score be a required field for problem.type == "scoring"?

  2. do we need to be able to express “each of these verdicts must appear” (in addition to the above required, which means some of these verdicts must appear)? To me, this just seems lazy: instead,
    make a separate submission or a separate testgroup and specify each of the required verdicts separately.

Common abbreviations

Typically a problem author will not use the fine-grained #expectation struct,
but instead use a common abbreviation:

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

_expectation_for_abbreviation: {
	_abbreviation: #abbreviation
	if _abbreviation == "accepted" {
		permitted: ["AC"]
	}
	if _abbreviation == "wrong answer" {
		permitted: ["AC", "WA"] // TODO is AC- also permitted?
		required: ["WA"]
	}
	if _abbreviation == "runtime exception" {
		permitted: ["AC", "RTE"] // TODO is AC- also permitted?
		required: ["RTE"]
	}
	if _abbreviation == "time limit exceeded" {
		permitted: ["AC", "AC-", "TLE", "TLE-"] // TODO not WA and not RTE, right?
		required: ["TLE"]
	}
	if _abbreviation == "does not terminate" {
		permitted: ["AC", "AC-", "RTE", "TLE", "TLE-"] // TODO not WA, right?
		required: ["RTE", "TLE"]
	}
	if _abbreviation == "not accepted" {
		required: ["RTE", "TLE", "WA"] // TODO should this include "TLE-"? // TODO is this useful at all?
	}
} & #expectation

  1. Note the TODOs; this is RfC6.

  2. Note the difference between the #abbreviation with value accepted and the #verdict with value AC.
    The former, accepted, is a claim about a set of verdicts; the latter, AC, pertains to a single testcase.

  3. The abbreviations can be used as the names of directories; for instance, a submission in <problemname>/wrong_answer is supposed to satisfy the expectations specified by the abbreviation wrong answer.
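
For instance, a hypothetical layout (directory and file names invented for illustration) could look like this:

<problemname>/
	accepted/           // submissions expected to satisfy "accepted"
	wrong_answer/       // submissions expected to satisfy "wrong answer"
	does_not_terminate/ // submissions expected to satisfy "does not terminate"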

Expectations for testdata

#root_expectations: {
	// Set expectations for all testcases
	#expectation | #range | #abbreviation

	// And/or set them per testgroup or testcase
	[=~"^(sample|secret)"]: #expectation | #range | #abbreviation
}
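
A hypothetical instance (testgroup names invented for illustration), combining a problem-wide abbreviation with a per-testgroup refinement, might look like:

#root_expectations & {
	"wrong answer"     // on the whole testdata: at least one WA, nothing but AC or WA
	sample: "accepted" // ... but every sample testcase must pass
}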

Terminology

author submission
: A submission written by a problem author, typically used during development. It is evaluated by the developer framework.

solver submission
: A submission written by a solver, typically during a contest or course after the problem is installed. It is evaluated by a judging system.

submission
: An author submission or a solver submission

execution time
: Prefer this to running time (because of the confusion with runtime, which is something very different).

Spelling:

  • time limit
  • testcase
  • testgroup
  • testdata
  • runtime

niemela commented Nov 26, 2023

RfC1: Reconsider the key names in #problem: avoid depth 4, rename as above

I would argue that the current naming is depth 3 (limits / time_multipliers / ac_to_time_limit or time_limit_to_tle), but also that it's more than needed.

It's a bit unclear what is meant by "rename as above"; are you suggesting to replace time_multipliers and its subkeys with tle_margin and ac_margin? I would prefer them to have the time_ prefix.

I assume that the tle_margin = ....ac_to_time_limit and ac_margin = ....time_limit_to_tle is a typo, and you meant it the other way around? Is that right?


niemela commented Nov 26, 2023

if time <= time_limit {

Shouldn't this be < (strictly less)?

output_validated!: bool

I would argue that the output is always validated, but not always correct. So, maybe output_correct?

score?:            >=0

Should we actually allow score of 0? (I think maybe we should, but I'm not 100% sure).


niemela commented Nov 26, 2023

if time <= time_limit {

2. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is the same.

This seems to assume that time_limit is known at the time of running every testcase, but that isn't necessarily the case.


niemela commented Nov 26, 2023

time! is not optional. Every result has a time, even those that are aborted or crashed. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is the same. Alternatives:

It's a bit unclear what exactly "never compared to anything above" actually means, but I do think I understand the gist of it.

This sounds reasonable to me.

[Alt 1]
A sum type time!: "aborted" | >= 0 & <= tle_margin * time_limit or even time!: "crashed" | "aborted" | >= 0 & <= tle_margin * time_limit, by whatever names we can come up with for crashed/failed/terminated/aborted.

I don't like this alternative. Mixing strings and numbers (and calling that variable "time") sounds worse in every way.

[Alt 2]
Every result instead has aborted!: bool; only those results that are not aborted must have a time!: >= 0 & <= tle_margin * time_limit. The problem, here and above, is that aborted means many things (such as a runtime exception due to recursion stack overflow).

I'm not sure I agree that aborted should include RTE. To me, in this context, aborted means that the system killed the submission. A runtime exception is the submission killing itself (i.e. aborted != crashed). Running out of memory is maybe a corner case as well? In any case, I don't like this alternative either.


niemela commented Nov 26, 2023

4. Scores are positive

But that's not how you write it. The CUE-specification says non-negative.


niemela commented Nov 26, 2023

6. Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.

Memory use:
if time is part of the result, why isn't memory use? They are equally useful in determining the verdict. In fact, why is time part of the result, when we only actually need the verdict? I'm a bit confused about what the purpose of defining "result" (in this way) is.

Team message:
I think this is fine to leave out. I don't think we should be asserting on team messages.

programming language:
Same.


niemela commented Nov 26, 2023

7. Should result.score be a required field for problem.type == "scoring"? (I think yes.)

It's unclear to me what "required" means here. Are you saying that it's only defined if time <= time_limit & terminated_successfully, but if so it's required? It's a bit strange that terminated_successfully and output_validated are treated differently.


thorehusfeldt commented Nov 27, 2023

Here is a complete specification of Expectations-no-minus.

Pay attention to

  1. the set of #verdict (which has lost AC- and TLE-) and
  2. the new concept #timing (which discretises result.time)
  3. The #expectations struct has learned two new keys, required_timings and permitted_timings.
  4. does not terminate has now changed semantics (I can no longer specify that the TLE execution exceeds the time limit with margin.)

To me, this seems like a lot of new infrastructure and redundancy merely to avoid the minus verdicts, but if this makes the pill easier to swallow: fine. Note that the #timings "fast enough" and "too slow" are never needed in any use case I can think of, so maybe they can be collapsed (but I don’t know what the collapsed timing should be called).

#registry

#registry: close({[string]: #root_expectations})

// Problem-wide constants
// ======================
// Set by the judging environment
time_limit: >0
tle_margin: >1 // a.k.a. problem.limits.time_multipliers.time_limit_to_tle
ac_margin:  >1 // a.k.a. problem.limits.time_multipliers.ac_to_time_limit

// Testcase Result
// ===============
// The result of running a submission on a testcase

#result: {
	time!: >=0
	if time <= time_limit {
		// When the submission terminates within the time limit
		// its exit status is either a success or a failure:
		terminated_successfully!: bool
		if terminated_successfully {
			// The output of successfully terminated submissions is
			// validated by the output validator, the resulting values are:
			output_validated!: bool
			message?:          string
			score?:            >=0
		}
	}
}

// Testcase Verdict
// ================
//
// The verdict of a testcase summarises the result as one of four values

#verdict: "AC" | "WA" | "RTE" | "TLE"

#verdict_for_result: {
	// The verdict for the _result
	_result: #result
	if _result.time >= time_limit {"TLE"}
	if _result.time < time_limit {
		if !_result.terminated_successfully {"RTE"}
		if _result.terminated_successfully {
			if !_result.output_validated {"WA"}
			if _result.output_validated {"AC"}
		}
	}
}

// Testcase Timing
// ================
//
// The timing of a testcase summarises the result as one of four values

#timing: "fast enough with margin" | "fast enough" | "too slow" | "too slow with margin"

#timing_for_result: {
	// The timing for the _result
	_result: #result
	let t = _result.time
	if t >= time_limit {
		if t >= time_limit*tle_margin {"too slow with margin"}
		if t < time_limit*tle_margin {"too slow"}
	}
	if t < time_limit {
		if t >= time_limit/ac_margin {"fast enough"}
		if t < time_limit/ac_margin {"fast enough with margin"}
	}
}
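
// For illustration (hypothetical values, not part of the specification):
// with time_limit = 2.0, ac_margin = 2.0, and tle_margin = 1.5,
//
//        t <  1.0  ->  "fast enough with margin"
// 1.0 <= t <  2.0  ->  "fast enough"
// 2.0 <= t <  3.0  ->  "too slow"
// 3.0 <= t         ->  "too slow with margin"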

// Expectations
// ============

#root_expectations: {
	// Set expectations for all testcases
	#expectation | #range | #abbreviation

	// And/or set them per testgroup:
	[=~"^(sample|secret)"]: #expectation | #range | #abbreviation
}

// Often-used expectations are specified in terms of abbreviations

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

// Scoring problems can set the range

#range: number | ordered_tuple

ordered_tuple: tuple=[number, number & >=tuple[0]]

// In general, we can set fine-grained expectations in terms of which verdicts and timings
// are allowed and disallowed

#expectation: {
	permitted_verdicts?: [...#verdict] // only these verdicts may appear
	required_verdicts?: [...#verdict]  // at least one of these verdicts must appear
	permitted_timings?: [...#timing]   // only these timings may appear
	required_timings?: [...#timing]    // at least one of these timings must appear
	
	message?:                string // this judge message must appear
	score?:                  #range
}

// Each abbreviation stands for a set of required and permitted verdicts, as follows:

_expectation_for_abbreviation: {
	_abbreviation: #abbreviation
	if _abbreviation == "accepted" {
		permitted_verdicts: ["AC"]
		permitted_timings: ["fast enough with margin"]
	}
	if _abbreviation == "wrong answer" {
		permitted_verdicts: ["AC", "WA"]
		required_verdicts: ["WA"]
	}
	if _abbreviation == "runtime exception" {
		permitted_verdicts: ["AC", "RTE"]
		required_verdicts: ["RTE"]
	}
	if _abbreviation == "time limit exceeded" {
		permitted_verdicts: ["AC", "TLE"]
		required_timings: ["too slow with margin"]
	}
	if _abbreviation == "does not terminate" {
		permitted_verdicts: ["AC", "RTE", "TLE"] 
		required_verdicts: ["RTE", "TLE"]
	}
	if _abbreviation == "not accepted" {
		required_verdicts: ["RTE", "TLE", "WA"]
	}
} & #expectation

As one of many use cases beyond specifying the semantics of the default /submissions directories, the framework allows, for instance, defining the #expectation

    required_verdicts: ["AC"]

Note that this is not the same expectation as accepted. A problem author could, for instance, adopt the policy of putting their Java submissions into the directory

submissions/sluggish_ac

and specifying the above #expectation for that directory. (For instance, such an association could be registered in the expectations.yaml file.) This would be a good place to put submissions that “should actually still get AC, but may be pretty close to the time limit”.
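
For illustration, such a registration could be expressed against the #registry schema above (the directory name sluggish_ac is invented, and the exact layout of expectations.yaml is still open):

#registry & {
	"sluggish_ac": {
		required_verdicts: ["AC"]
	}
}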


Tagl commented Nov 27, 2023

Team message:
I think this is fine to leave out. I don't think we should be asserting on team messages.

I do disagree with that, but there is a way to get around it by writing the output validator such that a matching judge message appears, so I guess it's fine to not have it included.

@thorehusfeldt

Became #135
