-
Since we started the topic of skipping tests, let me ask another question in this thread. Why do we need to skip tests in the first place? Isn't information about the platform under test statically available before a diag is started? In other words, should it be the test executive who decides which steps of a diag should run on a particular platform and passes this knowledge to the diag as input parameters, rather than the diag itself making the decision at run time? Such a static approach can simplify our diags (and thus make them more portable) and prevent a scenario where a certain hardware item is not discovered by a diag, so an important test step gets skipped, whereas the missing hardware should actually cause a test failure. UPD:
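A minimal sketch of the static approach, assuming hypothetical executor-side pieces (`platform_inventory`, `launch_diag`, and the parameter names are illustrative, not part of any OCP reference library):

```python
import subprocess

# Static knowledge the test executive already has about the platform under test.
platform_inventory = {
    "platform": "example-server",
    "hdd_serials": ["SN-001", "SN-002"],
    "gpu_count": 0,
}

def launch_diag(binary: str, params: dict) -> int:
    """Run a diag with explicit expectations instead of letting it self-discover."""
    args = [binary] + [f"--{key}={value}" for key, value in params.items()]
    return subprocess.run(args, check=False).returncode

# The executive decides up front: no GPUs here, so the GPU diag is simply not
# launched; the HDD diag is told exactly which drives it must find.
if platform_inventory["hdd_serials"]:
    launch_diag("hdd_diag",
                {"expected_serials": ",".join(platform_inventory["hdd_serials"])})
```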
-
A specific artifact to indicate a skip is preferable, so for me, "skip_reason" is better than "message". That way we don't have to parse the message string. Your description brings out another issue, though: omitting the TestRunStart artifact means we lose some useful information in the logs, like the commandLine and parameters. Would it make sense to emit the TestRunStart artifact right upfront, before loading the DUT information? That way it's nicer to see a start artifact before the end artifact, and we also include useful information.
The test executive should ideally have enough information to skip diags on inapplicable platforms. But as standard handling of unexpected situations, the diag could also verify that it is running under the expected conditions, for instance that there is a sufficient amount of RAM for it to run.
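A rough illustration of emitting the start artifact before the DUT load and then checking a precondition such as available RAM; the `emit_artifact`/`load_dut_info` helpers and the exact JSON field names are assumptions, not the normative schema:

```python
import json
import os
import sys

def emit_artifact(artifact: dict) -> None:
    # OCPDiag output is a stream of JSON artifacts on stdout.
    print(json.dumps(artifact))

def load_dut_info() -> dict:
    # POSIX-only way to read total physical memory; stands in for real DUT discovery.
    return {"ram_bytes": os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")}

# Emit the start artifact first so commandLine (and parameters) are always
# recorded, even if we later decide not to proceed.
emit_artifact({"testRunStart": {"name": "ram_hungry_diag",
                                "commandLine": " ".join(sys.argv)}})

dut = load_dut_info()
if dut["ram_bytes"] < 8 * 2**30:  # assume this diag needs at least 8 GiB
    emit_artifact({"testRunEnd": {"status": "SKIP",
                                  "skipReason": "insufficient RAM on DUT"}})
    sys.exit(0)
```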
-
As a continuation of our online discussion today, let us talk about the wording used in the spec (p. 24):
The "diagnostic was skipped" part does not elaborate on legitimate use cases for the SKIP status. The "diagnostic did not run to normal completion" part sounds more like a description for the ERROR status.
Why would a diagnostic not find applicable hardware? Is it because the hardware is not functional? This is a perfect case for the FAIL status. Is it because the hardware is not supposed to be on the target platform? Then it's the user's mistake that they ran this diag on that platform. The user should see the NON_APPLICABLE+ERROR status and reconsider their action. Opinions? Edits: replaced FAIL with NON_APPLICABLE+ERROR in the last sentence. |
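A minimal sketch of the verdict logic argued for here, using the status names from this thread (which may not match the spec's exact spelling):

```python
def verdict(platform_supports_hdd: bool, discovered_hdds: int, bad_hdds: int) -> str:
    if not platform_supports_hdd:
        # Running this diag on a platform without the hardware was the
        # operator's mistake, not a pass or a clean skip.
        return "NON_APPLICABLE+ERROR"
    if discovered_hdds == 0 or bad_hdds > 0:
        # Hardware the platform should have is missing or unhealthy.
        return "FAIL"
    return "PASS"
```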
-
I think a fundamental design flaw is what causes diags to need to 'SKIP' execution: the common pattern used in many of the diags we are provided is to PASS unless the diag finds a reason to FAIL. This means that a hard-drive test that looks for bad sectors may be designed to search for HDDs in the system and, if none are found, simply SKIP. If any are found, they are tested, and as long as none of them has an unacceptable number of bad sectors, they will all PASS. This makes it really easy to create a test executor that doesn't have to worry about which system has hard-drives, or how many it has; just run this test on all systems and interpret the results, ignoring the SKIPs. The obvious problem is that a system with 11 hard-drives where only 10 of them are functional won't FAIL if the non-functional drive isn't detected.
The purpose of a test executor is to decide which tests to run, when, and what to do when a failure is detected. That does mean that the test executor needs to know whether a target system has a hard-drive, and to execute the relevant test with the appropriate parameters. In this scenario, a hard-drive test that does not find any hard-drives should result in a failure.
Moreover, tests need to be designed to fail unless the defined pass criteria have been satisfied. In the hard-drive test, I would want a spec that specifies the limit on bad sectors and identifies the devices that need to be checked (by a count, the serial numbers, or the device paths), in whatever way is appropriate.
I have too many examples of running diags that return a PASS result because some internal detection issue or a typo caused entire test paths to be silently skipped, resulting in test escapes. While I see that there may be a valid use case for a diag to SKIP, I have yet to find a good example that, if properly implemented, shouldn't have been a FAIL/ERROR for running a diag on unsupported hardware.
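For illustration, a sketch of the "fail unless the pass criteria are met" pattern, where the expected drives and the bad-sector limit come in as parameters; `check_hdds` and its inputs are hypothetical:

```python
from typing import Mapping

def check_hdds(expected_serials: list[str],
               max_bad_sectors: int,
               discovered: Mapping[str, int]) -> tuple[str, str]:
    """discovered maps serial number -> bad-sector count for every drive found."""
    missing = [s for s in expected_serials if s not in discovered]
    if missing:
        # A drive the spec says must exist was not found: FAIL, never a silent SKIP.
        return "FAIL", f"missing drives: {missing}"
    over_limit = [s for s in expected_serials if discovered[s] > max_bad_sectors]
    if over_limit:
        return "FAIL", f"bad-sector limit exceeded on: {over_limit}"
    return "PASS", f"all {len(expected_serials)} drives within limits"
```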
…On Thu, Nov 9, 2023 at 10:30 AM George Karagoulis ***@***.***> wrote:
IMO we addressed those points in the meeting. I think we agree that whether the diag skips or errors on a particular test step should ideally be a decision for the executor to make. Treating these conditions as errors by default is sane behavior IMO, but there should still be an option to skip particular steps for any reason (including, but not limited to, missing hardware or environmental anomalies), where this option is decided by the test executor rather than the diag itself.
Also note that there is a difference between skipping the entire test run and skipping an individual test step. It seems like a nice feature of the framework that it could allow running N test steps while allowing some of them to skip without impacting the overall test result. Without such a feature we would simply have to not run certain steps, and there would be no indication in the output as to why a step was not run. Having a "skip" status to represent this case seems logical to me, and IMO it isn't the main problem you're trying to solve anyway. IIUC, the issue you brought up is that the "skip" status introduces uncertainty into the result, but this can be resolved by the executor driving the decision to skip instead of the diag making the decision itself. Either way, the skip status seems useful to me. All we need to do is add a "reason" field to make it unit-testable; we don't need to get rid of it altogether (at least not as part of this smaller-scope effort).
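To make the executor-driven step skip from the quoted reply concrete, a rough sketch; the config format, the `run_step` helper, and the artifact field names are assumptions:

```python
import json

# Executor-provided configuration: which steps to skip, and why.
executor_config = {
    "skip_steps": {
        "gpu_stress": "platform has no GPUs",
        "nvme_wear": "step disabled for this fleet rollout",
    }
}

def run_step(name: str, body) -> None:
    reason = executor_config["skip_steps"].get(name)
    if reason is not None:
        # The skip decision came from the executor, so its reason is recorded
        # in the step artifact instead of being buried in a log line.
        print(json.dumps({"testStepEnd": {"name": name, "status": "SKIP",
                                          "reason": reason}}))
        return
    body()

run_step("gpu_stress", lambda: print("never reached on this platform"))
```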
-
When writing an OCPDiag, it can be common to decide to skip before emitting the TestRunStart artifact. This is because loading the DUT information happens prior to start, and that is usually when you know enough to decide to skip the test.
In the case of a skip, the only information you get is a TestRunEnd message with a SKIPPED status, but you can't tell why the skip happened without studying other artifacts, such as a log message, or inspecting mock calls. This is problematic for unit tests, because it means we have to use an indirect side channel to learn why a skip happened. Not having a reliable way to determine the skip reason increases maintenance costs across a large codebase, because every diag can do it a different way, and relying on a particular occurrence of a log string is brittle (e.g. I could break a lot of unit tests simply by introducing a new log artifact at the end).
As a potential solution, the TestRunEnd artifact could contain a string field that allows us to specify the skip reason. We could make this field specific to the SKIP status (e.g. call it "skip_reason" and only populate it when the run is skipped), or allow specifying a reason for any test run result (e.g. call it "reason"). I would consider this field optional, but it could be used by diags to express the reason behind a result in a more structured way that automated systems like unit tests can rely on.
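For concreteness, the two variants could look roughly like this on the output stream; field names and enum values are illustrative, not an accepted schema change:

```python
import json

# Variant 1: a field populated only for skipped runs.
print(json.dumps({"testRunEnd": {"status": "SKIP",
                                 "skip_reason": "DUT has no NVMe drives"}}))

# Variant 2: a general reason usable with any status, which a unit test can
# assert on directly instead of scraping log artifacts.
print(json.dumps({"testRunEnd": {"status": "COMPLETE", "result": "FAIL",
                                 "reason": "bad-sector limit exceeded on SN-002"}}))
```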