
Expectations 0.7.2 #137





Closed
thorehusfeldt opened this issue Dec 1, 2023 · 41 comments

@thorehusfeldt
Contributor

thorehusfeldt commented Dec 1, 2023

Here is my best shot at Expectations 0.7.2, much of it in CUE:

Changed in 0.7.1: Removed false from the values for with_margin

Changed in 0.7.2: Restricted expectation keys based on problem.type

// The registry maps submission name patterns to expectations.
// A submission whose name matches a pattern must satisfy its expectations.
// (In particular, a submission can match several patterns, and thus
// needs to satisfy several expectations.)
// The semantics of "matching a pattern" are not defined here
// (possibilities: regexen, prefix, globbing).

#registry: close({[string]: #root_expectations})

// Problem-wide constants
// ======================
// Set by the judging environment
time_limit: >0
// Set in problem.yaml; these are problem.timing.tle_margin and problem.timing.ac_margin
tle_margin: >1
ac_margin:  >1

// Testcase Result
// ===============
// The result of running a submission on a testcase

#result: {
	testcase_name!: #name
	time!:          >=0
	if time < time_limit {
		// When the submission terminates within the time limit
		// its exit status is either a success or a failure:
		terminated_successfully!: bool
		if terminated_successfully {
			// The output of successfully terminated submissions is
			// validated by the output validator, the resulting values are:
			output_validated!: bool
			message?:          string
			score?:            >=0
		}
	}
}

// The verdicts that can occur in the expectations framework are:

#verdict: "AC" | "WA" | "RTE" | "TLE"

// Testcase Verdict
// ================
//
// The verdict of (a submission's execution on) a single testcase is
// defined in terms of the testcase result in terms of time, termination
// status, and output validation as follows.

#verdict_for_result: {
	// The verdict for the _result
	_result: #result
	let t = _result.time
	if t >= time_limit {"TLE"}
	if t < time_limit {
		if !_result.terminated_successfully {"RTE"}
		if _result.terminated_successfully {
			if !_result.output_validated {"WA"}
			if _result.output_validated {"AC"}
		}
	}
}

// Aggregate verdict
// =================

// For a linearly ordered sequence of verdicts, the aggregate verdict is
// * the first occurrence of "WA", "TLE", or "RTE" if it exists
// * "AC" if the list is empty or only contains "AC"


// Expectations
// ============

#root_expectations: {
	// Set expectations for all testcases
	#expectation | #range | #abbreviation

	// And/or set them for testcases matching a pattern (such as testgroups)
	[=~"^(sample|secret)"]: #expectation | #range | #abbreviation
}

// Often-used expectations are specified in terms of abbreviations

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

// Scoring problems can set the range

#range: number | ordered_tuple

ordered_tuple: tuple=[number, number & >=tuple[0]]

// In general, we can set fine-grained expectations in terms of which verdicts, timing,
// messages, and scores are allowed and disallowed for a set of results R

#expectation: {
	permitted_testgroup_verdicts?: [...#verdict] // only these testcase verdicts may appear
	required_testgroup_verdicts?: [...#verdict]  // at least one of these testcase verdicts must appear
	message?:     string // this judgemessage must appear
	with_margin?: true   // Set m = max(r.time) over all r in R. Then m < time_limit / ac_margin or m >= time_limit * tle_margin

	// pass-fail problems only:
	permitted_aggregate_verdict?: [...#verdict] // the aggregate verdict must be in this list

	// scoring problems only:
	permitted_testcase_scores?: #range // all testcase scores must be in range
	permitted_aggregate_score?: #range // the aggregate score must be in range
}

// Useful abbreviations
// ====================
//
// Each abbreviation is a shorthand for a common #expectation struct, as follows:

_expectation_for_abbreviation: {
	_abbreviation: #abbreviation
	if _abbreviation == "accepted" {
		permitted_testcase_verdicts: ["AC"]
		with_margin: true
	}
	if _abbreviation == "wrong answer" {
		permitted_testcase_verdicts: ["AC", "WA"]
		required_testcase_verdicts: ["WA"]
	}
	if _abbreviation == "runtime exception" {
		permitted_testcase_verdicts: ["AC", "RTE"]
		required_testcase_verdicts: ["RTE"]
	}
	if _abbreviation == "time limit exceeded" {
		permitted_testcase_verdicts: ["AC", "TLE"]
		required_testcase_verdicts: ["TLE"]
		with_margin: true
	}
	if _abbreviation == "does not terminate" {
		permitted_testcase_verdicts: ["AC", "RTE", "TLE"]
		required_testcase_verdicts: ["RTE", "TLE"]
	}
	if _abbreviation == "not accepted" {
		required_testcase_verdicts: ["RTE", "TLE", "WA"]
	}
} & #expectation

What changed?

The most visible change is that the #expectation struct is now a lot larger. Here it is again:

#expectation: {
	permitted_testgroup_verdicts?: [...#verdict] // only these testcase verdicts may appear
	required_testgroup_verdicts?: [...#verdict]  // at least one of these testcase verdicts must appear
	permitted_aggregate_verdict?: [...#verdict]  // the aggregate verdict must be in this list
	message?:     string // this judgemessage must appear
	permitted_testcase_scores?: #range // all testcase scores must be in range
	permitted_aggregate_score?: #range // the aggregate score must be in range
	with_margin?: true // Set m = max(r.time) over all r in R. Then m < time_limit / ac_margin or m >= time_limit * tle_margin
}
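(For illustration only — a minimal Python sketch of the with_margin rule from the comment above; the names are assumptions and this is not part of the CUE specification:)

def with_margin_ok(times, time_limit, ac_margin, tle_margin):
    # The slowest result must be safely below the limit (by ac_margin)
    # or safely above it (by tle_margin); the grey zone in between is not allowed.
    if not times:
        return True
    m = max(times)
    return m < time_limit / ac_margin or m >= time_limit * tle_margin

# With time_limit=2.0, ac_margin=2, tle_margin=1.5:
assert with_margin_ok([0.8], 2.0, 2, 1.5)       # safely accepted
assert with_margin_ok([3.5], 2.0, 2, 1.5)       # safely too slow
assert not with_margin_ok([1.7], 2.0, 2, 1.5)   # in the grey zone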
  1. The key names distinguish between testcase_verdicts and aggregate_verdict. Symmetrically, the score field has split into two (and this will do a lot of good!).

  2. The new boolean with_margin has also been added. In total, 3 new fields, and several renamings. Have a look.

  3. Also, #result got a new field

#result: {
   testcase_name!: #name
   ... 
}

We need this for sorting verdicts (so we can compute aggregate verdicts), and for applying the scoring aggregation rules (which depends on testdata.yaml file contents and their full path names).

I added the ! to make it explicitly mandatory.

  4. I didn’t bother to specify the aggregate verdict in CUE; it’s not clearer than the prose expression I wrote down.

  5. Note that none of the abbreviations use aggregated verdicts or scores. (But they do use with_margin)

@niemela
Member

niemela commented Dec 1, 2023

My first take on this:

I'm assuming that a lot of the naming is still placeholder? I'm not commenting on naming here.

// Set by the judging environment
time_limit: >0
// Set in problem.yaml; these are problem.timing.tle_margin and problem.timing.ac_margin
tle_margin: >1
ac_margin:  >1

I'm confused by this? These three can all be set in problem.yaml, the last two will always have a value (since there is a default for when they are not explicitly set).

#result: {
	testcase_name: #name

Why does the name have to be part of the result? Wouldn't it feel more natural to have a testcase (which has a name) map to a result?

Also, given that the name is here, why is it optional? A testcase always has a name.

	if time <= time_limit {
		// When the submission terminates within the time limit
		// its exit status is either a success or a failure:
		terminated_successfully!: bool

It would feel more natural to me if terminated_successfully is always defined, same with output_validated. Why should these be conditional?

// For a linearly ordered sequence of verdicts, the aggregate verdict is
// * the first occurrence of "WA", "TLE", or "RTE" if it exists
// * "AC" if the list is empty or only contains "AC"

This is not how aggregate verdicts are currently defined for scoring problems.

	// And/or set them for testcases matching a pattern (such as testgroups)
	[=~"^(sample|secret)"]: #expectation | #range | #abbreviation

A testcase or group could also start with *.

// Often-used expectations are specified in terms of abbreviations

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

I don't think we need these. Well... we obviously don't need them since it's "just" syntactic sugar, but I don't think we should have them.

// Scoring problems can set the range

...or value (which is what the code actually says)

	permitted_testcase_scores?: #range // all testcase scores must be in range

This feels very strange. When and why would you use this?

  1. The key names distinguish between testcase_verdicts and aggregate_verdict. Symmetrically, the score field has split into two (and this will do a lot of good!).

Please elaborate on the lot of good it will do. This seems to be a point I'm not getting above.

5. Note that none of the abbreviations use aggregated verdicts or scores. (But they do use with_margin)

Shouldn't they though?

@RagnarGrootKoerkamp
Collaborator

// Set by the judging environment
time_limit: >0
// Set in problem.yaml; these are problem.timing.tle_margin and problem.timing.ac_margin
tle_margin: >1
ac_margin:  >1

I'm confused by this? These three can all be set in problem.yaml, the last two will always have a value (since there is a default for when they are not explicitly set).

In my understanding these are not something you can write in expectations.yaml since they're not part of the #registry. They're only here to specify #result.

#result: {
	testcase_name: #name

Why does the name have to be part of the result? Wouldn't it feel more natural to have a testcase (which has a name) map to a result?

Also, given that the name is here, why is it optional? A testcase always has a name.

	if time <= time_limit {
		// When the submission terminates within the time limit
		// its exit status is either a success or a failure:
		terminated_successfully!: bool

It would feel more natural to me if terminated_successfully is always defined, same with output_validated. Why should these be conditional?

All of this is only for the purposes of precisely specifying how to go from the #result (its running time, exit code, and correctness of output) on a single testcase to a #verdict for that testcase. But I agree it's a bit confusing. It may be better to write this down in a different way.
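For illustration only, one such different way — a minimal Python sketch of the same mapping (not part of the proposal); the names time, terminated_successfully, and output_validated are taken from #result above, and time_limit is assumed given:

def verdict_for_result(time, time_limit, terminated_successfully=False, output_validated=False):
    # Mirrors #verdict_for_result: timing first, then exit status, then output validation.
    if time >= time_limit:
        return "TLE"
    if not terminated_successfully:
        return "RTE"
    return "AC" if output_validated else "WA"

# Example: finished within the limit, but the output validator rejected the output.
assert verdict_for_result(1.2, time_limit=2.0, terminated_successfully=True, output_validated=False) == "WA"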

// For a linearly ordered sequence of verdicts, the aggregate verdict is
// * the first occurrence of "WA", "TLE", or "RTE" if it exists
// * "AC" if the list is empty or only contains "AC"

This is not how aggregate verdicts are currently defined for scoring problems.

Can you explain how they are defined? I'm not quite sure.

A testcase or group could also start with *.

// Often-used expectations are specified in terms of abbreviations

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

I don't think we need these. Well... we obviously don't need them since it's "just" syntactic sugar, but I don't think we should have them.

I definitely want them. It's super useful to specify wrong_answer/*: sample: accepted without having to fully write out what accepted means.

  5. Note that none of the abbreviations use aggregated verdicts or scores. (But they do use with_margin)

Shouldn't they though?

No, I don't think so. permitted and required already pin things down very precisely

Generally, permitted_testcase_verdicts and required_testcase_verdicts give me all I need. I don't really see what an additional permitted_aggregate_verdict gives me on top of that. @niemela Could you give an example of where exactly this would be useful?

@thorehusfeldt
Contributor Author

I don't think we need [abbreviations]. Well... we obviously don't need them since it's "just" syntactic sugar, but I don't think we should have them.

Strongly disagreed. (Again, it feels as if we are rehashing conversations from three months ago.)

I very much want to be able to do this. (Say, a quadratic-time submission that passes the n=16 and n=1000 testgroups, times out on the n=100000 testgroup, and gets wrong answer on the testgroup with disconnected graphs.)

partially_accepted/thore.py:
    sample: accepted
    group1: accepted
    group2: accepted
    group3: time limit exceeded
    group4: wrong answer

For an IOI-style problem, I have 20 of these submissions. I need the abbreviations.

@niemela
Member

niemela commented Dec 1, 2023

[...]
This is not how aggregate verdicts are currently defined for scoring problems.

Can you explain how they are defined? I'm not quite sure.

No, I would say that it's quite unclear. We need to improve that:
https://www.kattis.com/problem-package-format/spec/2023-07-draft.html#grading

It's clearly not simply "first error" though.

I definitely want them. It's super useful to specify wrong_answer/*: sample: accepted without having to fully write out what accepted means.

Ok, yeah, we probably (i.e. definitely) need some kind of shorthand. Do you feel the same about all of them?

I don't really see what an additional permitted_aggregate_verdict gives me on top of that. @niemela Could you give an example of where exactly this would be useful?

It allows you to say things about the submissions (rather than the testcases). I.e. this submission will be judged as X. It can also allow for short circuiting. Seems to me to be the most basic thing we would want to express.

@niemela
Member

niemela commented Dec 1, 2023

For an IOI-style problem, I have 20 of these submissions. I need the abbreviations.

Yes, I agree that we need some kind of shorthand.

@Tagl
Collaborator

Tagl commented Dec 1, 2023

I agree that shorthand and verifying aggregate verdict/score are both very nice to have.

@thorehusfeldt thorehusfeldt changed the title Expectations 0.7 Expectations 0.7.1 Dec 1, 2023
@thorehusfeldt
Contributor Author

thorehusfeldt commented Dec 1, 2023

Why does the name have to be part of the result? Wouldn't it feel more natural to have a testcase (which has a name) map to a result?

It does not have to (the same effect would indeed be achieved with a higher-order datatype such as a map), it’s merely the conceptually most minimal way to give sense to “the first one” for a set of #results (and we need this in order to humour your wish to express aggregate_verdict.) If you like adding more complexity, we can introduce yet another concept, namely

#result_for_testcase: [#name]: #result

and then define aggregate verdict in terms of a #result_for_testcase object instead of (like now) defining it for [...#result] (which means “list of #result”.) This would be more verbose (you’d have to write something like “the verdict of the value of the lexicographically first key of the #result_for_testcase map that maps to a result whose verdict is not "AC"”.)

In short (x,y) is at a lower order of abstraction than lambda x: y. Remember that I’m specifying meaning not implementation, so I use the least complicated structures that suffice to express the semantics.

testcase or group could also start with *.

No, I want fine-grained targeting of testdata to always start with sample or secret. This is to avoid matching against data/invalid_inputs/foo.in, which should never be targeted by an expectation. This is a deliberate constraint to the syntax that aims to increase clarity of intent.

Please elaborate on the lot of good it will do

Assume you have a two-testcase scoring problem (with secret/a and secret/b). The max scores are 100. The scoring aggregation rule is sum.

Consider now:

thore.py:
  secret:
    permitted_testcase_scores: [0, 50]

This means that thore.py scores at most 50 points on secret/a and at most 50 points on secret/b. (It’s a claim about a set of numbers.)

On the other hand, consider:

ragnar.cpp:
  secret:
    permitted_aggregate_score: [0, 75]

This means that Ragnar’s submission gets at most 75 points in total over both secret/a and secret/b. For instance by scoring 66 points on secret/a and 5 points on secret/b. (It’s a claim about a single number.)

In principle, it makes sense to write:

my_heuristic.cpp:
  permitted_aggregate_score: [0,90] // total score is max 90
  permitted_testcase_scores: [0,10]  // individual score on each testcase is max 10

Since there are many different ways of defining scoring problems (in particular, in the Swedish Olympics, using both subtasks and a bespoke scoring output validator for the last subtask), it is useful to support both concepts.

So: testcase score (for a set of testcases) and aggregate score (of a set of testcases) are not the same. It is extremely unfortunate that the terminology in the specification is what it is (but my valiant attempt in August at clarifying this failed, just like my attempt at terminologically separating (testcase) verdict and (aggregate) verdict.)
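To illustrate the difference, here is a hypothetical Python sketch (not part of the proposal; in_range and the sum aggregation are assumptions matching the example above):

def in_range(x, rng):
    # rng is a single number or an ordered pair [lo, hi], as in #range.
    lo, hi = (rng, rng) if isinstance(rng, (int, float)) else rng
    return lo <= x <= hi

def permitted_testcase_scores_ok(scores, rng):
    # A claim about every individual testcase score in the set.
    return all(in_range(s, rng) for s in scores)

def permitted_aggregate_score_ok(scores, rng):
    # A claim about a single aggregated number (here: sum, per the testdata.yaml rule).
    return in_range(sum(scores), rng)

scores = [66, 5]  # secret/a and secret/b
assert not permitted_testcase_scores_ok(scores, [0, 50])  # 66 > 50: thore.py's expectation is violated
assert permitted_aggregate_score_ok(scores, [0, 75])      # 66 + 5 = 71 <= 75: ragnar.cpp's expectation holds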

@thorehusfeldt
Contributor Author

thorehusfeldt commented Dec 1, 2023

The definition of how verdicts are aggregated in the specification is this:

For pass-fail problems, the verdict of a submission is the first non-accepted verdict, where test cases are run in lexicographical order of their full file paths (note that sample comes before secret in this order).

I think this is close, except for

  1. what is the verdict of the empty set of results?, and
  2. it shouldn’t specify the order in which test cases are run
  3. I see no reason to restrict this to pass-fail problems.

Is there anything else wrong with it?

My suggestion to clarify both issues in 0.7.1 above is

// Aggregate verdict
// =================

// For a linearly ordered sequence of verdicts, the aggregate verdict is
// * the first occurrence of "WA", "TLE", or "RTE" if it exists
// * "AC" if the list is empty or only contains "AC"

The linear ordering would be given by lexicographic ordering of result.testcase_name.
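For illustration only, a minimal Python sketch of this rule (not part of the proposal), assuming results are given as (testcase_name, verdict) pairs:

def aggregate_verdict(results):
    # results: iterable of (testcase_name, verdict) pairs.
    # The linear order is the lexicographic order of testcase names.
    for _name, verdict in sorted(results):
        if verdict != "AC":
            return verdict
    return "AC"

assert aggregate_verdict([]) == "AC"
assert aggregate_verdict([("secret/2-b", "TLE"), ("secret/1-a", "WA")]) == "WA"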

@niemela
Member

niemela commented Dec 1, 2023

No, I want fine-grained targeting of testdata to always start with sample or secret. This is to avoid matching against data/invalid_inputs/foo.in, which should never be targeted by an expectation. This is a deliberate constraint to the syntax that aims to increase clarity of intent.

I definitely agree that invalid_inputs should not be targeted. That said, invalid_inputs does not contain test cases or test groups. How about allowing *, but only using the glob (or regexp?) to target test cases and groups? This is indeed what I thought we were already doing.

@niemela
Member

niemela commented Dec 1, 2023

The definition of how verdicts are aggregated in the specification is this:

For pass-fail problems, the verdict of a submission is the first non-accepted verdict, where test cases are run in lexicographical order of their full file paths (note that sample comes before secret in this order).

@thorehusfeldt
I think you are replying to my statement that "This is not how aggregate verdicts are currently defined for scoring problems"? I was only referring to scoring problems, the aggregation of pass-fail problem is pretty well specified and agrees with what you had said in this proposal.

  1. what is the verdict of the empty set of results?, and

I would say that it should be AC. It seems you agree.

2. it shouldn’t specify the order in which test cases are run

It should say that the result must be as if you ran the test cases in lexicographic order and take the first non-accepted verdict. Rather than rewording the spec to carefully avoid saying which order things should run, I would suggest adding something explicitly allowing "optimizations" (or whatever it should be called?) as long as the externally observable behavior is as specified.

3. I see no reason to restrict this to pass-fail problems.

What do you mean? The definition quoted clearly is restricted to pass-fail problems, and just removing "For pass-fail problems" gives us an incorrect description of how the verdict of scoring problems is determined. I agree that the description should not be limited to pass-fail problems.

My suggestion to clarify both issues in 0.7.1 above is

// Aggregate verdict
// =================

// For a linearly ordered sequence of verdicts, the aggregate verdict is
// * the first occurrence of "WA", "TLE", or "RTE" if it exists
// * "AC" if the list is empty or only contains "AC"

The linear ordering would be given by lexicographic ordering of result.testcase_name.

I still don't understand the purpose of this text? The canonical description of how verdicts are aggregated should be in the spec, and we shouldn't have a separate copy (or much worse, a different definition) in the expectations spec. If this text is just for completeness while discussing the proposal, I guess it's fine. I would rather just refer to the correct section of the spec though. (Currently https://www.kattis.com/problem-package-format/spec/2023-07-draft.html#grading).

As pointed out above, the description is in fact not correct for scoring problems.

Why is the ordering mentioned outside of the comment, and not in it?

@thorehusfeldt
Contributor Author

I guess then it is an error to specify permitted_aggregate_verdict for scoring problems, and an error to specify any kind of score-related expectation for pass-fail problems.

@thorehusfeldt thorehusfeldt changed the title Expectations 0.7.1 Expectations 0.7.2 Dec 1, 2023
@niemela
Member

niemela commented Dec 1, 2023

I guess then it is an error to specify permitted_aggregate_verdict for scoring problems

What is your guess based on? I don't think it should be.

and an error to specify any kind of score-related expectation for pass-fail problems.

Yes, that sounds reasonable. pass-fail problems don't have a score.

@niemela
Member

niemela commented Dec 1, 2023

Changed in 0.7.1: Removed false from the values for with_margin

Why was it removed? And what does that mean exactly?

@RagnarGrootKoerkamp
Collaborator

I guess then it is an error to specify permitted_aggregate_verdict for scoring problems

What is your guess based on? I don't think it should be.

The spec doesn't currently define verdicts for scoring problems. All it says is

For scoring problems, the concept of verdict is less important. In fact, some judging systems may choose to not even present a final verdict for a submission, but instead only present, for example, the score and the verdicts on the test cases the submission failed.

So either we don't allow permitted_aggregate_verdict for scoring problems because they aren't defined, or we define them. Can you propose something?

Changed in 0.7.1: Removed false from the values for with_margin

Why was it removed? And what does that mean exactly?

I think that at the end of the meeting we agreed that if you want to exclude a submission from the default with_margins: true for accepted and time_limit_exceeded, you have to move it to a different directory, and we do not allow 'overriding' a with_margins: true with a 'more specific' with_margins: false constraint.

Since with_margin: false is the default it's never needed. I suppose we could still allow it for clarity though.

@niemela
Member

niemela commented Dec 1, 2023

So either we don't allow permitted_aggregate_verdict for scoring problems because they aren't defined, or we define them.

Agreed. I think we should define them.

Can you propose something?

Ok, I will look at that.

Since with_margin: false is the default it's never needed.

I guess I don't understand what this structure we are defining is. How can we say that with_margin: false is the default when we have now defined false to not be a value that with_margin can have? It can only be true or missing now, right?

@niemela
Member

niemela commented Dec 1, 2023

I think that at the end of the meeting we agreed that if you want to exclude a submission from the default with_margins: true for accepted and time_limit_exceeded, you have to move it to a different directory, and we do not allow 'overriding' a with_margins: true with a 'more specific' with_margins: false constraint.

No, we agreed (and I feel "agreed" is a bit strong) that all submissions must be in a subdirectory of /submissions, that you could set used_for_timing on a directory and override it on a submission (because there is a well defined ordering there), but you can't both set and override it at the submission level (because there is no defined ordering there).

Some examples:

accepted:
   used_for_timing: true

accepted/slow.py:
   used_for_timing: false
# The above is ok, it sets it on the directory (this would typically be implied), and overrides it on the submission
---
other/sol1.py:
   used_for_timing: true
# This is ok, it only sets the value once
---
accepted/*.py:
   used_for_timing: false

accepted/fast.py:
   used_for_timing: true
# This is NOT ok, it sets the value twice on the same level (the submission)

In any case...

There were originally two main use cases we wanted to cover w.r.t. used_for_timing:

  1. I have an accepted submission (i.e. in accepted/) that I want to exclude from the normal time limit verification.
  2. I have a too-slow submission (i.e. in time_limit_exceeded/) that I want to exclude from the normal time limit verification.

Later we also touched on:

  3. I have a rejected submission (typically in wrong_answer/, but could be run_time_error/), but I want to include it ("as" an accepted solution) in the normal time limit verification.

...but I'm not sure if we actually think anybody would ever want this, or if it was more theoretical. We also touched on "the same but opposite", i.e. that wrong_answer/ would also default to used_for_timing: true. I don't think we want that, but it doesn't really change anything w.r.t. this.

There are 2 kinds of margins (lower and upper) and two kinds of exceptions (exclude and include), so the remaining very theoretical case would be:

  4. I have a submission that is not too slow (or at least not in time_limit_exceeded), but I want to include it ("as" a time_limit_exceeded solution) in the normal time limit verification.

Now, cases 1 & 2 are something that our users want to do often. There is clearly a need for this. Case 3 has never been requested, but I could maybe see how it could be useful? Case 4 has never been requested and I don't see how it would be useful or even reasonable.

All these cases, but certainly 1 & 2, can be done quite nicely with the semantics shown in the examples above.

What happens if we instead switch to the "you have to move it to a different directory" line of thinking (but keep the "all submissions must be in a subdirectory of /submissions")? For the common cases 1 & 2 one would now have to move these submissions to another directory (I believe this has gotten negative feedback in the past, though, so that's one strike I think). If we do, we could as well define semantics for two additional directories accepted_no_margin/ and time_limit_exceeded_no_margin/ (with, I hope, obvious meanings). Now you wouldn't even have to set any values in the YAML, and used_for_timing would only be used to describe these default directories, but would never be used by an actual user. At that point, maybe we should remove this concept completely from the expectations framework, and have it as a separate process step (which it kinda is)?

I don't think this last option is better, mostly because the feedback in the past has been lukewarm, but I certainly don't think it's terrible.

@niemela
Member

niemela commented Dec 1, 2023

Regarding abbreviations again. It seems we all agree that we should have some kind of shorthand.

Below are some kinds of things I would like to be able to do with a shorthand. I'll avoid using the currently defined abbreviations in the examples, they could clearly cover several of these needs, but not all. Don't get too bogged down on the syntax, but of course, feel free to comment on it.

other/sol1.py: AC 
# gets a verdict of AC (i.e. aggregated verdict)

other/sol2.py: WA
# gets a verdict of WA

other/sol3.py: 50
# gets a score of 50 (i.e. aggregated score)

other/sol3.py: 80-100
# gets a score in the 80-100 range

wrong_answer/sol4.py:
   sample: AC
# gets an aggregated verdict for the data group `sample`

other/sol5.py:
   secret/group1: 20
   secret/group2: 20
   secret/group3: 20
   secret/group4: 0-40
# gets a score of 20 on groups 1-3 and in the 0-40 range on group 4

accepted/kinda_slow.py: do_not_use_for_timing
# Don't use for timing. This one does not feel important, it's barely shorter than the non-shorthand version.

The most common and simple things that you can't do (well) with the older spec, and that we wanted to add support of with the expectations framework were the following, in order of how commonly it's been requested:

  • specify the expected score of a submission
  • specify the expected score of test data groups of a submission
  • exclude an accepted submission from the normal time limit verification
    • but mostly due to the submission actually being partially_accepted and you would actually want the accepted parts to be included and only the non-accepted parts to be excluded

These things should be easy to do. With the examples above they would be.

Thoughts?

@niemela
Member

niemela commented Dec 1, 2023

Quoting from #135, but I think it's still relevant

You can't do both submission.py: wrong answer and submission.py: sample: accepted at the same time.

You can; my proposal is supposed to exactly make this possible. For a realistic example, consider

thore.*: wrong answer
*.py:
  sample: accepted

The submission thore.py satisfies both rules, so both rules apply. In particular, the set of all testcase results must include WA, and the set of sample testcase results can only include AC. This is well defined (because all “my” rules have obvious compositionalities.)
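(For illustration only — a minimal Python sketch of this "every matching rule applies" reading, writing the patterns as prefix-anchored regexes; this is an assumption, the YAML above uses glob-like patterns, and the helper name is hypothetical:)

import re

def applicable_expectations(submission, registry):
    # A submission must satisfy every expectation whose pattern matches it.
    return [exp for pattern, exp in registry.items() if re.match(pattern, submission)]

registry = {r"thore\.": "wrong answer", r".*\.py": {"sample": "accepted"}}
assert applicable_expectations("thore.py", registry) == ["wrong answer", {"sample": "accepted"}]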

But this only works because there are no other files that also match those globs (and it's a bit hacky).

What you really would want to be able to say is

thore.py: wrong answer
thore.py:
   sample: accepted

But that is not legal YAML.

There is a workaround that almost always works (obviously you can create cases where it doesn't, but they are not reasonable), although it still feels hacky:

thore.py: wrong answer
thore.py*:
   sample: accepted

The same kind of things apply to the shorthands I suggested just above. E.g.:

thore.py: 80
thore.py:
   used_for_timing: false

But the longhand is not much longer so I think it's fine:

thore.py:
   score: 80
   used_for_timing: false

In fact, I just realise that with a long enough file name (including in this case, so not very long at all) the "longhand" is shorter. :)

@thorehusfeldt
Contributor Author

thorehusfeldt commented Dec 2, 2023

Minor comments

  1. I am mildly in favour of defining aggregate_verdict of a set of results for all problems (not just pass-fail problems), and doing so in exactly the way laid out in 0.7.1. (Symmetrically, I’m fine with defining score for every testcase result and defining aggregate_score for every set of results even for pass-fail problems.) I think such uniform definitions are cleaner than defining two different families of “whatever we mean by verdict/result/outcome/aggregate when a submission runs on a testcase”.
  2. I am agnostic about whether a problem of type scoring should be allowed to express an expectation for aggregate_verdict, and whether a problem of type pass-fail should be allowed to express a score-related expectation. (I lean toward no.)
  3. Yes, the decision of the problem package format to rely on YAML for its configurations leads to mild restrictions on syntax, such as the inability to specify two different expectations for the same submission in the same file using the same pattern. This is known. I see no reason to second-guess this decision.

@thorehusfeldt
Contributor Author

thorehusfeldt commented Dec 2, 2023

Back to timing

I have found two ways to express fine-grained timing-related constraints as part of the expectations framework and laid them out in proposals 0.5 and 0.6. There is a less satisfying (to me) third way that I explain in the current issue (called 0.7). To summarise:

  • 0.5: Make testcase verdicts more fine-grained by subdividing AC and TLE into two separate verdicts such as AC! and AC. This is clear, easy to decode for the human reader, easy to indicate, syntactically light-weight, introduces no new fields, and is utterly well-defined with respect to handling more than one expectation
  • 0.6: Introduce a new discretisation into four different timing categories "fast enough with margin", etc. This is clear, easy to decode for the human reader, somewhat more verbose to indicate, syntactically heavy, introduces two new fields, and is utterly well-defined with respect to handling more than one expectation
  • 0.7: Add a non-overridable boolean flag with_margin that constrains the maximum execution time of all results in a set of results. This is harder to define clearly, difficult to decode for the human reader, syntactically light-weight, adds only a single field, and can be made to be well-defined with respect to handling more than one expectation.

I have not seen a well-defined proposal for an overridable boolean flag that would make sense of the following:

accepted: accepted
accepted/thore:
  with_margin: true
accepted/*.py:
  with_margin: false
accepted/thore.py:
  secret: { with_margin: true }

The only suggestion so far is Arnar’s, which maybe is the following:

it is an error to specify expectations for a testcase and a submission where with_margin has been explicitly set to different values for that testcase

For this to make sense, it becomes important, e.g., whether the accepted defaults are viewed as explicitly set. I have very little hope that our little community will be able to agree on that kind of subtlety, or even reason clearly about it.

Hence my decision. The default is false, but one can set it to true. This is well-defined and maintains expressiveness (by moving submissions out of accepted.)

@thorehusfeldt
Contributor Author

thorehusfeldt commented Dec 2, 2023

Here are Fredrik’s desires expressed in the actual syntax of 0.7, with some comments.

other/sol1.py:
  permitted_testcase_verdicts: ["AC"]
# gets a verdict of AC (i.e. aggregated verdict)
# Thore: I would just create `other_ac` (as below) and put sol1.py there

other/sol2.py: wrong answer
# gets a verdict of WA

other/sol3.py: 50
# gets a score of 50 (i.e. aggregated score)
# Thore: Note that in 0.7 this means the running time is less than timelimit (without margin).

other/sol3.py: [80, 100]
# gets a score in the 80-100 range

wrong_answer/sol4.py:
   sample: accepted
# gets an aggregated verdict for the data group `sample`

other/sol5.py:
   secret/group1: 20
   secret/group2: 20
   secret/group3: 20
   secret/group4: [0, 40]
# gets a score of 20 on groups 1-3 and in the 0-40 range on group 4

other_ac/kinda_slow.py:
  permitted_testcase_verdicts: ["AC"]
# Don't use for timing. This one does not feel important, it's barely shorter than the non-shorthand version.

@thorehusfeldt
Contributor Author

thorehusfeldt commented Dec 2, 2023

The most common and simple things that you can't do (well) with the older spec, and that we wanted to add support of with the expectations framework were the following, in order of how commonly it's been requested: […]

No.

The most important thing for me and Johan, as expressed clearly in August, is to be able to express required and permitted testcase verdicts per testgroup. This is the entire core of the proposal, and has been so since August, in each of its iterations (of which we now have the 7th.)

The minimal expressiveness we want is this:

partially_accepted/quadratic.py:
  sample:
    permitted_testgroup_verdicts: ["AC"]
  secret/group1:
    permitted_testgroup_verdicts: ["AC"]
  secret/group2:
    permitted_testgroup_verdicts: ["TLE", "AC"]
    required_testgroup_verdicts: ["TLE"]
  secret/group3:
    permitted_testgroup_verdicts: ["TLE", "AC", "WA"]
    required_testgroup_verdicts: ["WA"]

(Everything else, in particular the abbreviations, is syntactic sugar.)

In particular, many authors of scoring problems think of subtasks as “this should kill the brute-force submission with TLE” and write testcases and submissions with that attitude.

The decision of whether subtask one should get 15 or 21 points is different from that. Points-per-subtask are defined in testdata.yaml, and this decision is typically changed during development. Many problem authors would not be interested in having to update, say, expectation.yaml after changing points values in problem.yaml when decisions about points values (number of subtasks and their relative weights) change. Mind you, there is a use case where score is usefully specified, namely in problems that have scoring custom output validators. So score remains useful. But it is not the primary thing authors of “subtasky” problems should want to specify.

Johan was really admirably clear about this in August, and his explanation was exactly what got me to completely change the expectations proposal to be about expressing required and permitted testcase verdicts per testgroup (instead of aggregate numeral scores or aggregate verdicts). This is the right way of doing it, as agreed on in August. Every single proposal about expectations since then – three months of exchanges – has been grounded in that mindset.

@niemela
Member

niemela commented Dec 2, 2023

1. I am mildly in favour of defining aggregate_verdict of a set of results for all problems (not just pass-fail problems), and doing so in exactly the way laid out in 0.7.1. (Symmetrically, I’m fine with defining score for every testcase result and defining aggregate_score for every set of results even for pass-fail problems.) I think such uniform definitions are cleaner than defining two different families of “whatever we mean by verdict/result/outcome/aggregate when a submission runs on a testcase”.

Are you

  1. saying that the definition of aggregate_verdict in the expectations framework should not match the verdict that would be returned when judging normally as defined in the (rest of the) spec, or
  2. that the (rest of the) spec should be changed to use this definition as well?

IMO the aggregate_verdict should mean exactly what the submission verdict would be when judged normally. If not, I don't see any use of it. permitted and required are more powerful anyway (strictly, I think?), so the only reason to have aggregate_verdict would be for this to match.

Using this definition for aggregate verdicts is a bit problematic for scoring problems. The relationship between verdict and score should be such that the score may only exist and be non-0 if the verdict is AC. We should specify clearly whether the score exists and is 0 for rejected verdicts, or if it simply isn't defined, but it doesn't change the general point here.

2. I am agnostic about whether a problem of type scoring should be allowed to express an expectation for aggregate_verdict, and whether a problem of type pass-fail should be allowed to express a score-related expectation. (I lean toward no.)

aggregate_verdict for scoring is "less useful", but not completely pointless. It does exist, and I think it makes sense to be able to assert on it. I don't see how allowing it would make things (much) more complicated, in fact that would make pass-fail and scoring more similar. In summary, I'm weakly in favor of allowing aggregate_verdict for scoring.

score for pass-fail OTOH is completely useless. As currently defined in the spec, it simply doesn't exist. That said, we could of course define it to exist, but then it would always be 0 for rejecting verdicts and some value (1?) for accepted. I am agnostic about allowing scoring for pass-fail. If it simplifies enough... why not?

3. Yes, the decision of the problem package format to rely on YAML for its configurations leads to mild restrictions on syntax, such as the inability to specify two different expectations for the same submission in the same file using the same pattern. This is known. I see no reason to second-guess this decision.

I don't think anybody was second guessing YAML, I agree that changing the file format is orthogonal to and out of scope of this discussion.

The question that (I think) you are answering was how to best express (in YAML) the gist of:

thore.py: wrong answer
thore.py:
   sample: accepted

Given that that is not valid YAML; not which other format to switch to where it would be legal.

Do you have any thoughts on that?

@niemela
Copy link
Member

niemela commented Dec 2, 2023

0.5: Make testcase verdicts more fine-grained by subdividing AC and TLE into two separate verdicts such as AC! and AC. This is clear, easy to decode for the human reader, easy to indicate, syntactically light-weight, introduces no new fields, and is utterly well-defined with respect to handling more than one expectation

As mentioned, I strongly prefer to keep to the set of verdicts defined elsewhere (i.e. CLICS). It does behave really well when applied to subsets of testcases though.

Fundamentally, I see the time limit verification as a separate process from the verdict, and I don't think these two concepts should be forced together.

0.6: Introduce a new discretisation into four different timing categories "fast enough with margin", etc. This is clear, easy to decode for the human reader, somewhat more verbose to indicate, syntactically heavy, introduces two new fields, and is utterly well-defined with respect to handling more than one expectation

This does fix the inconsistency with CLICS. Other than that, it is functionally identical to 0.5, meaning it will have the same benefits and drawbacks, except being less concise (and beautiful). That also means my fundamental issue above applies here as well.

0.7: Add a non-overridable boolean flag with_margin that constrains the maximum execution time of all results in a set of results. This is harder to define clearly, difficult to decode for the human reader, syntactically light-weight, adds only a single field, and can be made to be well-defined with respect to handling more than one expectation.

We had put some limits on how to override it, but it's not non-overridable. In fact, if it's truly non-overridable it does not solve the two main (and only?) issues this was designed to solve. Namely, "I want to exclude an accepted submission from the normal time limit verification" and "I want to exclude a time limit exceeded submission from the normal time limit verification".

I don't quite agree with the "difficult to decode for the human reader". I would say that that is rather the main benefit of this model. My reasoning is that having the ability to tag submissions with "exclude this from margin stuff" is very close to the main use cases.

I have not seen a well-defined proposal for an overridable boolean flag that would make sense of the following:

accepted: accepted
accepted/thore:
  with_margin: true
accepted/*.py:
  with_margin: false
accepted/thore.py:
  secret: { with_margin: true }

Towards the end of the call we had (or at least worked towards) a definition that said that more path segments override fewer path segments, and that setting different values with the same number of path segments is an error. We never talked about applying this to testcases, but that would make sense, so with this we get:

accepted/thore is true for all testcases.
accepted/thore.py is true for all testcases under secret and false for all others.
All other files matching accepted/*.py are false for all testcases.
All other files matching accepted/* are true for all testcases.
All other files are false for all testcases.

Is that what you intended it to mean?
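(A hedged Python sketch of that rule, assuming patterns are plain path prefixes rather than globs — the helper is hypothetical, not an agreed-on semantics:)

def used_for_timing(submission, rules, default=False):
    # rules: path prefix -> bool. The matching prefix with the most path
    # segments wins; conflicting values at the same depth are an error.
    matches = [(p.count("/"), v) for p, v in rules.items()
               if submission == p or submission.startswith(p + "/")]
    if not matches:
        return default
    depth = max(d for d, _ in matches)
    values = {v for d, v in matches if d == depth}
    if len(values) > 1:
        raise ValueError("used_for_timing set twice at the same level")
    return values.pop()

rules = {"accepted": True, "accepted/slow.py": False}
assert used_for_timing("accepted/slow.py", rules) is False  # submission-level override
assert used_for_timing("accepted/fast.py", rules) is True   # directory-level value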

For this to make sense, it becomes important, e.g., whether the accepted defaults are viewed as explicitly set. I have very little hope that our little community will be able to agree on that kind of subtlety, or even reason clearly about it.

My intent when suggesting this was definitely to view the default to be explicitly set. I have not heard anyone saying or implying something else.

Hence my decision. The default is false, but one can set it to true. This is well-defined and maintains expressiveness (by moving submissions out of accepted.)

Is this referring to:

Changed in 0.7.1: Removed false from the values for with_margin

Why was it removed? And what does that mean exactly?

I think we all agree that the default is false and it can be set to true. My question here was more technical. What does it mean that objects you define don't have false as a possible value?

@niemela
Member

niemela commented Dec 2, 2023

Replying to this comment from Thore

  • other/sol1.py: Agreed, this is the same, but quite a bit longer. OTOH, it was honestly a kinda unrealistic example as implied by your comment.
  • other/sol2.py: The spec currently says "first error", this defines "worst error", so it's not the same. Also (but strongly related) this is not short circuited. That's not a huge deal. It's probably not the same thing at all for a scoring problem.
  • other/sol3.py: I actually missed that this is already allowed. I agree with your note. The "80-100" vs "[80, 100]" is not an issue.
  • wrong_answer/sol4.py: That works, but note that if the example had "WA" instead of "AC" we would have the same subtle issue as for other/sol2.py.
  • other/sol5.py: Yes, that works.
  • other_ac/kinda_slow.py: Ok, that works, but forces users to move submissions out of accepted/ to not apply the margin. If we're ok with this (and I'm not saying that I'm not) then we almost don't need the used_for_timing flag at all. If we instead are willing to concede that some directories have magic not expressible in the expectations framework (namely everything related to "used_for_timing") we could maybe drop that? We would need to think about how to apply margin for some but not all testcases on a submission?

@niemela
Member

niemela commented Dec 2, 2023

The most common and simple things that you can't do (well) with the older spec, and that we wanted to add support of with the expectations framework were the following, in order of how commonly it's been requested: […]

No.

The most important thing for me and Johan, as expressed clearly in August, is [something else]

Those things are not mutually exclusive.

Are you saying that what I listed was:

  • not the most common and simple things that you can't do (well) with the older spec
  • not things we wanted to add support for
  • not things we wanted to add support for with the expectations framework
  • or something else?

I actually don't understand what your "No." means here.

I would say that "being able to express required and permitted testcase verdicts per testgroup" is a very powerful (and very useful) feature, but is neither "simple" nor "commonly requested". Note that not being "commonly requested" doesn't mean that few would want it. I think most would, but that they had not thought about it.

So, what you said can be "the most important thing" at the same time as what I listed was "The most common and simple things that you can't do (well) with the older spec, and that we wanted to add support of with the expectations framework".

Are we actually disagreeing on something here? If so, what?

@thorehusfeldt
Contributor Author

thorehusfeldt commented Dec 2, 2023

The definition of the aggregate_verdict of a set V of linearly ordered testcase verdicts should be what I say in the opening statement of this issue, under the headline Aggregate verdict.

Aggregate verdict
For a linearly ordered sequence of verdicts, the aggregate verdict is

  • the first occurrence of "WA", "TLE", or "RTE" if it exists
  • "AC" if the sequence is empty or only contains "AC"

This definition subsumes the case where V is all the testcases under data/s*, and when the linear order is given by the lexicographic order of the testcase paths. This special case is what you (FN) sometimes call “submission verdict when judged normally”. Let me stick to the name submission verdict for clarity.

There is indeed a definition of “submission verdict” in the August draft of the spec:

For pass-fail problems, the verdict of a submission is the first non-accepted verdict, where test cases are run in lexicographical order of their full file paths (note that sample comes before secret in this order).

This is only well-defined for nonempty V, and has an unfortunate formulation about running testcases. But I think these are just unintended mistakes. In fact, I am confident that the spec’s definition of submission verdict is intended to be consistent with the semantics given by my more general definition of aggregate verdict. (This may need to be generalised to other non-AC testcase verdicts, but, again, this is outside the scope of expectation 0.7. All I want to do here is to define expectations, and testcase verdicts like CE or MLE can never be expected.)

Note that aggregate verdicts are also defined for arbitrary subsets of the testcase results in data/s*, such as for subtasks in IOI-style problems, testgroups like sample or secret/huge, testcase patterns like sample/*-huge- (whether they are used in an expectation or as a -d regex flag to problemtools in various syntaxes). So the more general definition of aggregate verdict (rather than just submission verdict) is needed to make sense of even basic operations of the entire problem package. No such definition is in the current problem package format spec, which is why I included it in 0.7 for precision.

(Let me repeat that I am not even advocating to add aggregate verdict to the expectations framework, yet have done so to my best ability in 0.7. This task out of necessity includes a suggestion of a definition of what aggregate verdict should mean.)

@niemela
Member

niemela commented Dec 2, 2023

There is indeed a definition of “submission verdict” in the august draft of the spec:

For pass-fail problems, the verdict of a submission is the first non-accepted verdict, where test cases are run in lexicographical order of their full file paths (note that sample comes before secret in this order).

This is only well-defined for nonempty R, and has an unfortunate formulation about running testcases. But I think these are just unintended mistakes.

Unintended mistakes, plural. You say "unfortunate formulation about running testcases"; I assume that's one of the mistakes you're referring to (and strictly speaking I agree). What other mistake(s) are you referring to?

In fact, I am confident that the spec’s definition of submission verdict is intended to be consistent with the semantics given by my more general definition of aggregate verdict.

For pass-fail problems, I agree. I also think that it is consistent, i.e. the "submission verdict" of a pass-fail problem is the same as the "aggregate verdict" of all the testcase verdicts ordered alphabetically on testcase path.

Note that aggregate verdicts are also defined for arbitrary subsets of the testcases in data/s*, such as subtasks in IOI-style problems, testgroups such as sample or secret/huge, testcase patterns like -huge- (whether they are used in an expectation or as a -d regex flag to problemtools). So the more general definition of aggregate verdict (rather than just submission verdict) is needed to make sense of even basic operations of the entire problem package. No such definition is in the current problem package format spec, which is why I included it in 0.7 for precision.

(Let me repeat that I am not even advocating to add aggregate verdict to the expectations framework, yet have done so to my best ability in 0.7. This task out of necessity includes a suggestion of a definition of what aggregate verdict should mean.)

Why is a definition of "aggregate verdict" for arbitrary subsets needed "to make sense of even basic operations of the entire problem package"? Based on the fact that you are not advocating for having an "aggregate verdict", I'm assuming that you don't see a need for it? I wanted to add what we named "submission verdict" above (which I originally called "verdict" and we renamed to "aggregate verdict"), and I see a use for that.

I would argue that "aggregate verdicts" are not so relevant for arbitrary subsets; there the concepts of permitted and required are much more useful. An "aggregate verdict" could possibly be useful for nodes in the test data tree (i.e. groups), but if we allow expectations to target data groups, which is my preference, then "aggregate verdict" would not be needed in that case either. For clarity, by "target data groups" I mean that <testcase>: "/secret/group1" refers to the result of the group, and is different from <testcase>: "/secret/group1/*" that refers to the members of the group, probably(?) actual testcases.

@niemela
Member

niemela commented Dec 2, 2023

The most important thing for me and Johan, as expressed clearly in August, is to be able to express required and permitted testcase verdicts per testgroup. This is the entire core of the proposal, and has been so since August, in each of its iterations (of which we now have the 7th.)

I agree, and furthermore I believe there is consensus on this point.

To be clear, consensus that we want "to be able to express required and permitted testcase verdicts per testgroup", not that we don't want anything else. And to be even clearer, I'm not implying that you are saying that we want nothing else.

The minimal expressiveness we want is this:

partially_accepted/quadratic.py:
  sample:
    permitted_testgroup_verdicts: ["AC"]
  secret/group1:
    permitted_testgroup_verdicts: ["AC"]
  secret/group2:
    permitted_testgroup_verdicts: ["TLE", "AC"]
    required_testgroup_verdicts: ["TLE"]
  secret/group3:
    permitted_testgroup_verdicts: ["TLE", "AC", "WA"]
    required_testgroup_verdicts: ["WA"]

That sounds good. We have that in the current (and most/all previous) proposals. I also think there is consensus here.

(Everything else, in particular the abbreviations, is syntactic sugar.)

Agreed. My point when arguing "against" the abbreviations was to focus the syntactic sugar (because it does have some cost) on the "common and simple" cases. In any case, the details of the syntactic sugar are less important than the underlying logic (but they're not unimportant).

In particular, many authors of scoring problems think of subtasks as “this should kill the brute-force submission with TLE” and write testcases and submissions with that attitude.

I'm not entirely sure I understand exactly what you mean, but I think I agree.

The decision of whether subtask one should get 15 or 21 points is different from that. Points-per-subtask are defined in testdata.yaml, and this decision is typically changed during development.

Agreed.

Many problem authors would not be interested in having to update, say, expectation.yaml after changing points values in problem.yaml when decisions about points values (number of subtasks and their relative weights) change.

Agreed.

Mind you, there is a use case where score is usefully specified, namely in problems that have scoring custom output validators. So score remains useful. But it is not the primary thing authors of “subtasky” problems should want to specify.

That's a much more opinionated (and subjective) claim. I definitely think it's fine (and useful) to do verification by specifying the expected score for a set of submissions. This is how it's typically done at IOI.

Johan was really admirably clear about this in August, and his explanation was exactly what got me to completely change the expectations proposal to be about expressing required and permitted testcase verdicts per testgroup

And we seem to agree on all that?

(instead of aggregate numeral scores or aggregate verdicts).

Why does it have to be instead of and not in addition to?

This is the right way of doing it, as agreed on in August. Every single proposal about expectations since then – three months of exchanges – has been grounded in that mindset.

@niemela
Copy link
Member

niemela commented Dec 11, 2023

I'll try to summarize a little bit...

Things we agree on:

  1. We should be able to express requirements on testcase verdicts using (something like) permitted and required. This is the core of the proposals.
  2. These requirements should be expressed in some .yaml-file in /submissions.
  3. There should be some directories with pre-defined meanings, such as accepted. It would be really nice if they could be expressed within the framework.
  4. We should be able to express requirements on the "aggregated" result of judging using (something like) verdict and score.
  5. We should be able to express requirements on the judge message (output by an output validator) using (something like) message.
  6. We should be able to express requirements on testcases or groups. Definitely permitted and required, likely also verdict, score, and message.

Important things that we disagree on:

  1. How should the time limit verification procedure be worked into the framework? Some options are:
    a. Using some additional verdicts such as "AC-" or "AC!"
    b. Using boolean settings such as used_for_timing
    c. Using explicit overrides of the timing margins.
    d. Not at all, by giving some of the pre-defined directories semantics that are not expressible in the framework.
  2. What should be the semantics for rules applied to test cases and groups? Some options are:
    a. The rule is applied once for the set of matches. I.e. secret/*/small: required: WA would mean that at least some testcase (or group) called "small" in some group must result in a WA verdict. This interpretation is clearly better for permitted and required.
    b. The rule is applied once for every match. I.e. secret/*/small: required: WA would mean that every testcase (or group) called "small" inside some other group must result in a WA verdict. This interpretation is clearly kinda pointless for permitted and required since it makes them mean the same thing. (See the sketch after this list.)
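
A hypothetical sketch of the difference between the two interpretations, assuming fnmatch-style glob patterns (the results dictionary and the helper names are invented purely for illustration):

import fnmatch

# Invented testcase verdicts for one submission (illustration only).
results = {
    "secret/group1/small": "AC",
    "secret/group2/small": "WA",
}

def matching_verdicts(pattern):
    return [v for name, v in results.items() if fnmatch.fnmatch(name, pattern)]

# Interpretation (a): the requirement holds if *some* match has the verdict.
def required_a(pattern, verdict):
    return verdict in matching_verdicts(pattern)

# Interpretation (b): the requirement holds only if *every* match has the verdict.
def required_b(pattern, verdict):
    return all(v == verdict for v in matching_verdicts(pattern))

print(required_a("secret/*/small", "WA"))  # True: group2/small is WA
print(required_b("secret/*/small", "WA"))  # False: group1/small is AC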

Things that we don't quite agree on, but that are less important:

  1. Exactly which abbreviations and/or short forms should exist.
  2. How should we match files? Prefix, globs, or regexp.
  3. How should we match testcases? Prefix, globs, or regexp.

After summarizing this it definitely feels to me that we are very close. Did I miss something?

@thorehusfeldt
Copy link
Contributor Author

Note on regex/prefix matching

One clarification: the matching rule that I like is that

submission/testcase s matches a pattern p if re.match(p, s) holds (in Python).

This means that p is a regular expression that has to match at the beginning of s. (For a Python-free definition, the pattern is actually ^p.)

In particular, the submission accepted/th.py is matched by the submission patterns accepted, accepted/th.py and accepted/th. These are the common cases and cover >99% of all situations we’ll ever see. But it even allows more esoteric stuff like accepted/.*\.py (all Python files in accepted) or .*py$ (all Python files).
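
For illustration, a minimal sketch of that rule with the examples above (nothing beyond Python's re module is assumed):

import re

s = "accepted/th.py"
patterns = ["accepted", "accepted/th.py", "accepted/th",
            r"accepted/.*\.py", r".*py$"]
for p in patterns:
    # re.match anchors the pattern at the start of s, i.e. it behaves like ^p
    print(p, "->", bool(re.match(p, s)))
# All five patterns match "accepted/th.py".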

Also, the testcases in testgroup secret match the testcase pattern secret, and the testcases in testgroup secret/group1 match the testcase pattern secret/group1, which is again pleasant, clear, and intuitive. But I can still do esoteric things with patterns like .*-huge-. (I can express negative matches, like “apply this expectation to all test cases that don’t contain foo”, but that’s not an important use case; more like a nice benefit of a clean definition in an existing powerful framework.)

I find this notationally very lightweight, and extremely simple to communicate. (And implementation is already done, because re.match exists, and there’s excellent documentation for it. Let me repeat that I actually implemented the expectations framework back in August–September, and this is the rule I used. It’s really very straightforward.)

@RagnarGrootKoerkamp
Copy link
Collaborator

As a user I think this is pretty nice. I was going to write that as an implementer it feels limiting, since “all the regex dialects out there are slightly different”, but it turns out that at least for Python there is a quite nice list of things that are supported. And since we'll mostly do this in Python anyway, I'm OK with it.

https://docs.python.org/3/library/re.html#regular-expression-syntax

Some remarks:

  • a . matches anything, so accepted/th.py also matches accepted/thopy.cpp 🤔 (see the sketch after this list)
  • writing accepted/th\.py is slightly ugly
  • I think my preference would be to do full matches against files and directories. So accepted works, accepted/thore\.py, accepted/thore\..*, and accepted/.*\.py work, but accepted/thore does not.
  • I do feel like globbing does cover 99% of uses and is cleaner in the cases where it is sufficient. (Compare glob accepted/*.py and regex accepted/.*\.py)
  • In the end both will work and be fine. I still have a slight preference for globs but can easily be convinced for regexes with a good example ;)
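
A quick sketch of the . pitfall from the first bullet, and of the glob alternative, using the patterns and filenames from the bullets above:

import fnmatch
import re

# As a regex, the unescaped dot makes accepted/th.py match an unintended name:
print(bool(re.match("accepted/th.py", "accepted/thopy.cpp")))    # True (!)
print(bool(re.match(r"accepted/th\.py", "accepted/thopy.cpp")))  # False

# The glob form reads more naturally for the common case:
print(fnmatch.fnmatch("accepted/th.py", "accepted/*.py"))        # True
print(fnmatch.fnmatch("accepted/thopy.cpp", "accepted/*.py"))    # False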

@eldering
Copy link
Collaborator

One clarification: the matching rule that I like is that

submission/testcase s matches a pattern p if re.match(p, s) holds (in Python).

I don't like using a specific language's regex implementation as a specification, as it basically forces everyone to use that language (Python in this case) to implement it. I know there's no single uniform regex specification, but I think we should stick to some sufficiently generic subset of regex that is supported in essentially all languages, for example POSIX Extended Regular Expressions, or maybe Perl Regex.

But I also agree with Ragnar that simple globbing (possibly with ** for multi-path components) may be good enough.

@thorehusfeldt
Copy link
Contributor Author

thorehusfeldt commented Dec 13, 2023

I will not fight globbing, and I understand the very good point about accepted/th.py. I hadn’t thunk of that.

My main concern is that I am emotionally attached to:

accepted: accepted    # don’t want to write `accepted/*`
partially_accepted/thore.py:
  secret/group1: accepted # don’t want to write `secret/group1/*`
  secret/group2: wrong answer

Maybe the “and directories” in Ragnar’s post above is exactly what I’m looking for.

Let me try:

A testcase or submission matches a pattern glob if any of the parents of its path matches the pattern glob.

Here, it is understood that

  1. parents is here defined like PurePath.parents in Python, with the path itself also included (so foo/bar/baz.xyz has parents [foo, foo/bar, foo/bar/baz.xyz], but not foo/bar/baz or foo/b)
  2. path is secret/group1/huge-instance-04 (counting from data/, without suffix) for a testcase and accepted/th.py (counting from submissions/, with suffix) for a submission.
  3. the semantics of what it means for a string to “match a glob” also need to be given: can path separators be matched? do we need ? (or […]) and **? Better people than me need to define this or point to a standard that does.

I think I like this.
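
A minimal sketch of this rule, assuming fnmatch-style glob matching (point 3 above, in particular whether * may cross path separators, is deliberately left open here; fnmatch happens to let it cross):

from fnmatch import fnmatch
from pathlib import PurePosixPath

def matches(path: str, pattern: str) -> bool:
    # The path itself and each of its parents are candidates; the path
    # matches if any candidate matches the glob pattern.
    # PurePosixPath keeps forward slashes regardless of platform.
    p = PurePosixPath(path)
    candidates = [str(p)] + [str(parent) for parent in p.parents if str(parent) != "."]
    return any(fnmatch(candidate, pattern) for candidate in candidates)

print(matches("secret/group1/huge-instance-04", "secret/group1"))  # True
print(matches("accepted/th.py", "accepted"))                       # True
print(matches("accepted/th.py", "wrong_answer"))                   # False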

@RagnarGrootKoerkamp
Copy link
Collaborator

This sounds very good!

Python's glob has *, ?, [] and ** by default, so we should probably just have them.

It would be nice to support brace expansion as well, e.g. accepted/{thore.py,ragnar.cpp}, which is not supported by Python's glob. (You need to resolve the {} as a preprocessing step, manually or using a library, which is fine.)
I'm leaning towards adding this since it provides flexibility that's annoying to work around otherwise. (Either one would duplicate the rule, or rename files to have a similar word in them.)
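
For illustration, a non-nested brace expansion could be done as a preprocessing step along these lines (expand_braces is an invented helper; an existing library would do just as well):

import re

def expand_braces(pattern: str) -> list[str]:
    # Expand the first {a,b,...} group and recurse; nesting is not handled.
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    expanded = []
    for alternative in m.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded

print(expand_braces("accepted/{thore.py,ragnar.cpp}"))
# ['accepted/thore.py', 'accepted/ragnar.cpp']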

@thorehusfeldt
Copy link
Contributor Author

OK, I now have a minimal expectations framework (0.9.0) that seems to include only things we agree on, and that uses globbing. An early release was shared in the agenda for yesterday’s meeting.

The only interesting observation (after actually implementing it and validating data):

YAML doesn’t think * looks like a string. So with globbing, we need to add " such as here:

"accepted/*.py": accepted ## matches {accepted/th.py, accepted/ragnar.py}
"*/th.py" : accepted    ## matches {accepted/th.py, wrong_answer/th.py}.
---
brute_force.py:
  "*/*-small-*" : accepted ## matches secret/group3/032-small-disconnected

I think that’s perfectly fine.
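
For illustration, with PyYAML (a YAML 1.1 parser) a quoted key parses as a string, while an unquoted leading * is read as the start of an alias and the document fails to parse:

import yaml  # PyYAML

print(yaml.safe_load('"accepted/*.py": accepted'))
# {'accepted/*.py': 'accepted'}

try:
    yaml.safe_load("*/th.py: accepted")   # the leading * starts a YAML alias
except yaml.YAMLError as error:
    print("parse error:", type(error).__name__)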

I will today

  1. close this issue
  2. open expectations 0.9.0 (which also includes a roadmap to a richer specification) for comments as an issue here

@thorehusfeldt
Copy link
Contributor Author

Closing, evolved into #145

@niemela
Copy link
Member

niemela commented Dec 20, 2023

YAML doesn’t think * looks like a string. So with globbing, we need to add " such as here:

Only * or any string starting with *? The former doesn't seem like much of an issue.

@thorehusfeldt
Copy link
Contributor Author

thorehusfeldt commented Dec 20, 2023

The former doesn't seem like much of an issue.

That would depend on the (possibly undocumented) implementation of your YAML parser or your editor. The YAML specification itself is quite clear, but some tools are more lenient and make educated guesses about what “looks like a string”. (A situation that I find disastrous for a format used for specification, but that’s where we are. There are other famous examples with language: [no, da, en], where no looks like the boolean False to YAML while the others are strings. It’s a jungle.)

In any case, the definition of “what looks like a string to YAML” is outside of the expectations framework. In the above example, the tools that I tried do the right thing for accepted/*.py; but my examples should make no assumption about unspecified behaviour of third-party tools. Hence the ".
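
For example, with a YAML 1.1 parser such as PyYAML, no resolves to a boolean unless quoted (a YAML 1.2 parser would treat it as a string):

import yaml  # PyYAML implements YAML 1.1, where `no` is a boolean literal

print(yaml.safe_load("language: [no, da, en]"))
# {'language': [False, 'da', 'en']}

print(yaml.safe_load('language: ["no", da, en]'))
# {'language': ['no', 'da', 'en']}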

@niemela
Copy link
Member

niemela commented Dec 20, 2023

The former doesn't seem like much of an issue.

That would depend on the (possibly undocumented) implementation of your YAML parser or your editor. The YAML specification itself is quite clear, but some tools are more lenient and make educated guesses about what “looks like a string”. (A situation that I find disastrous for a format used for specification, but that’s where we are. There are other famous examples with language: no, which looks like the boolean False to some parsers, and like the string "no" – for Norwegian – to others. It’s a jungle.)

In any case, the definition of “what looks like a string to YAML” is outside of the expectations framework. In the above example, the tools that I tried do the right thing for accepted/*.py; but my examples should make no assumption about unspecified behaviour of third-party tools. Hence the ".

I don't understand your answer. It seems tangential, but I might be missing something?

My point was that if it was only the string * that needs to be quoted, that would not seem like much of an issue for our use here. Mostly because I don't think you would ever match rules to *.

If it's in fact any string starting with * (or even containing *) then it is of course more of an issue, but certainly not the end of the world. Furthermore, I don't see a reasonable way around it. Using something other than * for that purpose would be more annoying than having to use " whenever you want to use a *.

Which of the above are you saying that it is?

@thorehusfeldt
Copy link
Contributor Author

thorehusfeldt commented Dec 20, 2023

You are not missing anything; there is no issue I’m worried about. I used the formulation “interesting observation” above, not “end of the world”.

I implemented the syntax of the expectations framework (as you can see in #145), including an online validator of the current specification that you can play around with in your browser. (Link repeated: https://cuelang.org/play/?id=baug8IzTJVU#cue@export@cue)

You can add and remove " around some of the YAML strings. As you can see (try it!), the strings that start with * confuse the YAML parser used in my implementation, so the quotes are necessary for that particular parser (which is Go's, I believe).

The only observation is that when we previously sketched examples for desirable syntax like:

*/th.py: accepted

then we can’t have that. Instead, we should think of it as

"*/th.py": accepted
