-
Since we started the topic of skipping tests, let me ask another question in this thread. Why do we need to skip tests in the first place? Isn't information about the platform under test statically available before a diag is started? In other words, should it be the test executive who decides which steps of a diag should run on a particular platform and passes this knowledge to the diag as input parameters, rather than the diag itself making the decision at run time? Such a static approach can simplify our diags (and thus make them more portable) and prevent a scenario where a certain hardware item is not discovered by a diag, so an important test step gets skipped, whereas the missing hardware should actually cause a test failure. UPD:
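A minimal sketch of the static approach, assuming hypothetical executor-side pieces (`platform_inventory`, `launch_diag`, and the parameter names are illustrative, not part of any OCP reference library):

```python
import subprocess

# Static knowledge the test executive already has about the platform under test.
platform_inventory = {
    "platform": "example-server",
    "hdd_serials": ["SN-001", "SN-002"],
    "gpu_count": 0,
}

def launch_diag(binary: str, params: dict) -> int:
    """Run a diag with explicit expectations instead of letting it self-discover."""
    args = [binary] + [f"--{key}={value}" for key, value in params.items()]
    return subprocess.run(args, check=False).returncode

# The executive decides up front: no GPUs here, so the GPU diag is simply not
# launched; the HDD diag is told exactly which drives it must find.
if platform_inventory["hdd_serials"]:
    launch_diag("hdd_diag",
                {"expected_serials": ",".join(platform_inventory["hdd_serials"])})
```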
-
A specific artifact to indicate a skip is preferable, so for me, "skip_reason" is better than "message". That way we don't have to parse the message string. Your description brings out another issue, though: omitting the TestRunStart artifact means we lose some useful information in the logs, like the commandLine and parameters. Would it make sense to emit the TestRunStart artifact right upfront, before loading the DUT information? That way it's nicer to see a start artifact before the end artifact, and we also include useful information.
The test executive should ideally have enough information to skip diags on inapplicable platforms. But as standard handling of unexpected situations, the diag could also verify that it is running under the expected conditions, for instance that there is a sufficient amount of RAM for it to run.
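A rough illustration of emitting the start artifact before the DUT load and then checking a precondition such as available RAM; the `emit_artifact`/`load_dut_info` helpers and the exact JSON field names are assumptions, not the normative schema:

```python
import json
import os
import sys

def emit_artifact(artifact: dict) -> None:
    # OCPDiag output is a stream of JSON artifacts on stdout.
    print(json.dumps(artifact))

def load_dut_info() -> dict:
    # POSIX-only way to read total physical memory; stands in for real DUT discovery.
    return {"ram_bytes": os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")}

# Emit the start artifact first so commandLine (and parameters) are always
# recorded, even if we later decide not to proceed.
emit_artifact({"testRunStart": {"name": "ram_hungry_diag",
                                "commandLine": " ".join(sys.argv)}})

dut = load_dut_info()
if dut["ram_bytes"] < 8 * 2**30:  # assume this diag needs at least 8 GiB
    emit_artifact({"testRunEnd": {"status": "SKIP",
                                  "skipReason": "insufficient RAM on DUT"}})
    sys.exit(0)
```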
-
As a continuation of our online discussion today, let us talk about the wording used in the spec (p. 24):
The "diagnostic was skipped" part does not elaborate on legitimate use cases for the SKIP status. The "diagnostic did not run to normal completion" part sounds more like a description for the ERROR status.
Why would a diagnostic not find applicable hardware? Is it because the hardware is not functional? This is a perfect case for the FAIL status. Is it because the hardware is not supposed to be on the target platform? Then it's the user's mistake that they ran this diag on that platform. The user should see the NON_APPLICABLE+ERROR status and reconsider their action. Opinions? Edits: replaced FAIL with NON_APPLICABLE+ERROR in the last sentence. |
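A minimal sketch of the verdict logic argued for here, using the status names from this thread (which may not match the spec's exact spelling):

```python
def verdict(platform_supports_hdd: bool, discovered_hdds: int, bad_hdds: int) -> str:
    if not platform_supports_hdd:
        # Running this diag on a platform without the hardware was the
        # operator's mistake, not a pass or a clean skip.
        return "NON_APPLICABLE+ERROR"
    if discovered_hdds == 0 or bad_hdds > 0:
        # Hardware the platform should have is missing or unhealthy.
        return "FAIL"
    return "PASS"
```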
-
I think a fundamental design flaw is what causes diags to need to 'SKIP' execution: the common pattern used in many of the diags we are provided is to PASS unless the diag finds a reason to FAIL. This means that a hard-drive test that looks for bad sectors may be designed to search for HDDs in the system and, if none are found, simply SKIP. If any are found, they are tested, and as long as none of them has an unacceptable number of bad sectors, they will all PASS. This makes it really easy to create a test executor that doesn't have to worry about which system has hard-drives, or how many it has; just run this test on all systems and interpret the results, ignoring the SKIPs. The obvious problem is that a system with 11 hard-drives where only 10 of them are functional won't FAIL if the non-functional drive isn't detected.
The purpose of a test executor is to decide which tests to run, when, and what to do when a failure is detected. That does mean that the test executor needs to know whether a target system has a hard-drive, and to execute the relevant test with the appropriate parameters. In this scenario, a hard-drive test that does not find any hard-drives should result in a failure.
Moreover, tests need to be designed to fail unless the defined pass criteria have been satisfied. In the hard-drive test, I would want a spec that specifies the limit on bad sectors and identifies the devices that need to be checked (by a count, the serial numbers, or the device paths), in whatever way is appropriate.
I have too many examples of running diags that return a PASS result because some internal detection issue or a typo caused entire test paths to be silently skipped, resulting in test escapes. While I see that there may be a valid use case for a diag to SKIP, I have yet to find a good example that, if properly implemented, shouldn't have been a FAIL/ERROR for running a diag on unsupported hardware.
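For illustration, a sketch of the "fail unless the pass criteria are met" pattern, where the expected drives and the bad-sector limit come in as parameters; `check_hdds` and its inputs are hypothetical:

```python
from typing import Mapping

def check_hdds(expected_serials: list[str],
               max_bad_sectors: int,
               discovered: Mapping[str, int]) -> tuple[str, str]:
    """discovered maps serial number -> bad-sector count for every drive found."""
    missing = [s for s in expected_serials if s not in discovered]
    if missing:
        # A drive the spec says must exist was not found: FAIL, never a silent SKIP.
        return "FAIL", f"missing drives: {missing}"
    over_limit = [s for s in expected_serials if discovered[s] > max_bad_sectors]
    if over_limit:
        return "FAIL", f"bad-sector limit exceeded on: {over_limit}"
    return "PASS", f"all {len(expected_serials)} drives within limits"
```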
…On Thu, Nov 9, 2023 at 10:30 AM George Karagoulis ***@***.***> wrote:
IMO we addressed those points in the meeting. I think we agree that whether the diag skips or errors on a particular test step should ideally be a decision for the executor to make. Treating these conditions as errors by default is sane behavior IMO, but there should still be an option to skip particular steps for any reason (including, but not limited to, missing hardware or environmental anomalies), where this option is decided by the test executor rather than the diag itself.
Also note that there is a difference between skipping the entire test run and skipping an individual test step. It seems like a nice feature of the framework that it could allow running N test steps while allowing some of them to skip without impacting the overall test result. Without such a feature we would simply have to not run certain steps, and there would be no indication in the output as to why a step was not run. Having a "skip" status to represent this case seems logical to me, and IMO it isn't the main problem you're trying to solve anyway. IIUC, the issue you brought up is that the "skip" status introduces uncertainty into the result, but this can be resolved by the executor driving the decision to skip instead of the diag making the decision itself. Either way, the skip status seems useful to me. All we need to do is add a "reason" field to make it unit-testable; we don't need to get rid of it altogether (at least not as part of this smaller-scope effort).
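To make the executor-driven step skip from the quoted reply concrete, a rough sketch; the config format, the `run_step` helper, and the artifact field names are assumptions:

```python
import json

# Executor-provided configuration: which steps to skip, and why.
executor_config = {
    "skip_steps": {
        "gpu_stress": "platform has no GPUs",
        "nvme_wear": "step disabled for this fleet rollout",
    }
}

def run_step(name: str, body) -> None:
    reason = executor_config["skip_steps"].get(name)
    if reason is not None:
        # The skip decision came from the executor, so its reason is recorded
        # in the step artifact instead of being buried in a log line.
        print(json.dumps({"testStepEnd": {"name": name, "status": "SKIP",
                                          "reason": reason}}))
        return
    body()

run_step("gpu_stress", lambda: print("never reached on this platform"))
```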
-
When writing an OCPDiag, it can be common to decide to skip before emitting the TestRunStart artifact. This is because loading the DUT information happens prior to start, and that is usually when you know enough to decide to skip the test.
In the case of a skip, the only information you get is a TestRunEnd message with a SKIPPED status, but you can't tell why the skip happened without studying other artifacts, such as a log message, or inspecting mock calls. This is problematic for unit tests, because it means we have to use an indirect side channel to learn why a skip happened. Not having a reliable way to determine the skip reason increases maintenance costs across a large codebase, because every diag can do it a different way, and relying on a particular occurrence of a log string is brittle (e.g. I could break a lot of unit tests simply by introducing a new log artifact at the end).
As a potential solution, the TestRunEnd artifact could contain a string field that allows us to specify the skip reason. We could make this field specific to the SKIP status (e.g. call it "skip_reason" and only populate it when the run is skipped), or allow specifying a reason for any test run result (e.g. call it "reason"). I would consider this field optional, but it could be used by diags to express the reason behind a result in a more structured way that automated systems like unit tests can rely on.
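For concreteness, the two variants could look roughly like this on the output stream; field names and enum values are illustrative, not an accepted schema change:

```python
import json

# Variant 1: a field populated only for skipped runs.
print(json.dumps({"testRunEnd": {"status": "SKIP",
                                 "skip_reason": "DUT has no NVMe drives"}}))

# Variant 2: a general reason usable with any status, which a unit test can
# assert on directly instead of scraping log artifacts.
print(json.dumps({"testRunEnd": {"status": "COMPLETE", "result": "FAIL",
                                 "reason": "bad-sector limit exceeded on SN-002"}}))
```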