
RFC: How to handle unsound flags #6480


Closed
NlightNFotis opened this issue Nov 25, 2021 · 8 comments
Labels: aws (Bugs or features of importance to AWS CBMC users), RFC (Request for comment), soundness (Soundness bug? Review and add "aws" if it is, or remove "soundness" if it isn't.)

Comments

@NlightNFotis
Contributor

Hi,

This is to open a discussion on the handling of unsound flags and arguments to cbmc
and assorted tools going forward.

To that effect I have drafted a PR (#6479) which issues a warning in cbmc and goto-instrument
if the user has enabled any of the known unsound flags, either directly or indirectly.

This may or may not be fine as a temporary solution (hence it being a draft PR), but long
term these options need different handling. At the moment I'm unsure whether the unsound options
become sound again if, say, some associated bugs (like the one in #6394) are resolved - actually, I'm unsure
about the extent of unsoundness that these options entail in general.

I'm open to modifying/closing that draft PR if further investigation (I'm planning to deal with #6394 as
a next step) shows that these issues are transient and really just a manifestation of one bug, and
that fixing that bug restores soundness (as an example).

But there's a lot of uncertainty surrounding this from my end, which is why I'd like to solicit
some advice from the community. There has already been some discussion about this in #6397, but
this RFC is about converging on an acceptable solution for the long-term handling of issues like these.

I'd appreciate your input 👍🏻

@martin-cs
Collaborator

Thanks for starting this discussion!

I am sure you will be shocked to find out that I have opinions on this subject and that my contribution to the discussion is going to be more questions than answers!

  1. I think to really have an informed discussion on this, we have to ask what unsound flags are, how flags become marked as unsound, how (if, when, etc.) they lose that mark, and most of all how you want the user to process and use this information.
  2. I think the user interaction is key. Are you printing a disclaimer to warn them? In which case printing is good. Are you making sure that their workflow breaks in a really obvious way, in which case maybe a new exit code or
    #define CPROVER_EXIT_USAGE_ERROR 1
    ? (A rough sketch of that idea follows this list.) Do you want to guard against people getting "unsound" results and missing the disclaimer by simply not giving results at all? How do you want this to change over time? If I have scripts / a workflow set up and running in CI and I upgrade to the latest CBMC and one of the options I use has become unsound, then what?
  3. Note that we already have a system for marking flags -- in the help text some are listed as (experimental). You are no doubt aware of how well this works.
  4. When teaching verification and communicating what we do, I have tended to avoid the terms 'sound' and 'complete', as the literature is context-dependent on which way around they are used ("sound for proof" vs. "sound for refutation") because (as far as I can tell) there is a cultural bias towards algorithms being "sound but incomplete" vs. "unsound but complete".
  5. This is why I have tended towards over-approximate and under-approximate. An over-approximate algorithm / option is one where you can trust the SAFE / UNSAT results; an under-approximate one is one where you can trust the FAILED / SAT results. There also needs to be a "mixed approximation" option which is both.
  6. So maybe rather than "unsound" we should have "over", "under" and "mixed". That way it is clearly communicated to the user what they can expect / trust.
  7. Another dimension is why it is over/under/mixed: is it, A. by design (the perfect implementation would still have this issue), B. by the current implementation (even if we perfected it, there are still things it doesn't consider: heap issues, threads, etc.), or C. known buggy, but if we fixed those bugs then it wouldn't be?
  8. This may seem like hair splitting but it answers the question about the "lifecycle" of the marking. If you are marking C. then it is removed when some KNOWNBUG test cases or issues are resolved (how about "things only get marked if you commit the KNOWNBUG test case"). If you are marking B. then it is removed when some issue / development project is resolved. If it is A. then it probably needs to interact with
    /// The status of a property
    .
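
A minimal, self-contained sketch of the "distinct exit code" idea from point 2 above. The helper name and the option list are hypothetical, and the opt-in flag name is borrowed from the later discussion notes; this only illustrates the shape such a check could take, not CBMC's actual option handling:

    // Sketch: refuse to run when a known problematic option is enabled
    // without an explicit opt-in, and exit with a distinct, obvious code.
    #include <iostream>
    #include <set>
    #include <string>

    #define CPROVER_EXIT_USAGE_ERROR 1 // the existing usage-error code quoted above

    int check_problematic_options(const std::set<std::string> &enabled_options)
    {
      // Hypothetical list; the real set would live next to the option parser.
      const std::set<std::string> problematic = {"--full-slice", "--partial-loops"};
      const bool opted_in = enabled_options.count("--I-know-the-risks") != 0;

      for(const auto &opt : enabled_options)
      {
        if(problematic.count(opt) != 0 && !opted_in)
        {
          std::cerr << "error: " << opt << " may produce unreliable results; "
                    << "re-run with --I-know-the-risks to proceed anyway\n";
          return CPROVER_EXIT_USAGE_ERROR; // breaks the workflow in an obvious way
        }
      }
      return 0; // safe to continue
    }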

To drag this back to something more concrete, how would you classify the following flags and why: --bounds-check, --no-assumptions, --malloc-may-fail, --full-slice, --havoc-undefined-functions, --unwinding-assertions and --partial-loops ?

@TGWDB
Contributor

TGWDB commented Nov 30, 2021

I'll chime in with a small and incomplete addition to the discussion.

Martin has many good points, and perhaps the discussion is mis-focused on the idea of "unsound", although this is also the hard challenge that initiated the discussion. So I'm going to slightly side-step this point and try to focus on a smaller and more concrete area to address immediately.

CBMC is not bug free, so there are going to be some soundness problems in CBMC; however, we treat most of these as bugs and do not consider them a core concern. At the same time, there are many options and features which range from having no impact on the analysis results to being experimental and speculative with known unsound/incomplete behaviours. Considering this context, I think the best scope for this RFC and follow-up is to be able to clearly mark which features/options are "experimental" and also provide clear feedback to users.

Assuming this is the right path, then we come to two main questions:

  1. Which features/options should be marked experimental? At the moment some are, and some are not (that maybe should be). For example, #6394 (reachability-slice changes verification from FAILED to SUCCESSFUL) might indicate that reachability-slice should be marked experimental (or maybe this is just a bug).
  2. Once we know which features/options are experimental, how should this be indicated to the user? I think Fotis' PR begins to address this in a lightweight way. There is also discussion (see #6397, Unsound command line options should be documented and warnings given when used) about other ways to handle this. However, as has been raised, some of these may break existing uses of CBMC.

So to direct the discussion towards a short-term solution that addresses the key points, it would be good to know whether #6479 does indeed correctly warn on all "experimental" features/options. If this is the complete list (other than those already marked), then we can move on to 2 and converge on a uniform way of notifying a user.

@jimgrundy
Collaborator

I feel bound to have an opinion as I feel like I may have been the one to provoke this.

Both under and over approximations matter to me, depending on the question I am asking:

  • If an analysis tells me that an assertion isn’t hittable, but it really is, then that’s a problem.
    • The other way around is an annoyance, perhaps a serious one, but I can live with it if it isn’t too common.
  • If an analysis tells me that a cover is hittable, but it really isn’t, then that’s a problem.
    • The other way around is an annoyance, perhaps a serious one, but I can live with it if it isn’t too common.

I’ll continue under the understanding that I’m speaking only of things that I would call a “problem”, and not bothering with the mere “annoyances”.

I think of the tool as having a “core” and a number of “optional extras”. You always have to have the core, but you can choose to run one of the “optional extras”. With regards to the core, I believe there is an expectation that it functions correctly; there may be bugs that result in “problems”, but once noticed they will be fixed as a matter of priority.

When it comes to the optional extras I propose we consider some of them to be “production” and some to be “experimental”.

We could move an “experimental” feature to “production” in the following circumstances:

  • It’s had some sort of design review.
  • It doesn’t have any outstanding “problem” bugs, and hasn’t had any new ones filed against it for some time. (Maybe each feature should have its own GitHub label so we can track.)
  • It has an owner (person or team) who will undertake to address the issue if a bug is subsequently found.

A “production” feature could move back to “experimental” under the following circumstances.

  • A “problem” bug is filed against it and the owner has either left the project or can’t address the issue in a timely way.

I’d like to propose the following (a rough sketch of the resulting dispatch follows the list):

  • We complicate things a little by introducing a 3rd category of feature which, for want of a better word, I will call “paroled”.
    • All optional features that are currently thought likely to be problematic get put in this “paroled” status.
  • Suppose we have a feature that is normally invoked by the command-line argument “--foo”; we also add a new command-line argument “--experimental-foo”, which invokes the same feature.
  • If you invoke feature “foo” with the flag “--foo“ then the following happens:
    • If foo has “production” status it runs as normal.
    • If foo has “paroled” status it runs as normal, but you also get a warning telling you that this feature may be demoted to “experimental” with the next major release as there are concerns about the accuracy of its results.
    • If foo has “experimental” status the tool exits with a distinct error code and a message telling you that you have tried to use an analysis that is not trusted and that if you really want to use that analysis you should use the “--experimental-foo” option.
  • If you invoke the feature “foo” with the flag “--experimental-foo” then the following happens:
    • If foo has “production” status it runs as normal, but you also get a polite message informing you that “foo” is no longer experimental and you should invoke it with the “--foo” flag.
    • If foo has “paroled” status then it runs as normal.
    • If foo has “experimental” status then it runs as normal.
  • With the next major release of CBMC we demote any features that are still on “parole” to “experimental” status.
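
A rough sketch of how the --foo / --experimental-foo dispatch proposed above could look. The feature_statust enum, the function name and the exit code are hypothetical illustrations of the described behaviour, not existing CBMC code:

    // Sketch of the production / paroled / experimental dispatch proposed above.
    #include <iostream>
    #include <string>

    enum class feature_statust { PRODUCTION, PAROLED, EXPERIMENTAL };

    // Returns 0 if the feature should run; a distinct non-zero code otherwise.
    int dispatch_feature(
      const std::string &feature,   // e.g. "foo"
      bool invoked_as_experimental, // true if --experimental-foo was given
      feature_statust status)
    {
      if(!invoked_as_experimental)
      {
        if(status == feature_statust::PAROLED)
          std::cerr << "warning: --" << feature << " may be demoted to experimental "
                    << "in the next major release (accuracy concerns)\n";
        else if(status == feature_statust::EXPERIMENTAL)
        {
          std::cerr << "error: --" << feature << " is not trusted; use --experimental-"
                    << feature << " if you really want to run it\n";
          return 6; // hypothetical distinct exit code
        }
      }
      else if(status == feature_statust::PRODUCTION)
        std::cout << "note: --" << feature << " is no longer experimental; "
                  << "invoke it with --" << feature << "\n";

      return 0; // run the feature as normal
    }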

@jimgrundy
Collaborator

Now that we've all had a chance to comment on this, let's meet to discuss how we wish to proceed. I have scheduled a meeting for 15:00 UTC on Monday 2022-01-24. If you would like an invite to the meeting, drop me an email at [email protected]

@NlightNFotis
Contributor Author

Hello everyone,

These are the notes we've collected from the live discussion.

Please do let me know if there are errors of omission or I'm misrepresenting something that has been said.

Thanks for your time,

Notes from the Discussion on RFC-6480 - Unsound CLI options handling

  • We're aiming to solve the problem of a user using a feature which was "problematic", where the
    (out-of-band) response was to avoid using that feature because it's "unsound".
    • It's not acceptable for a user to be using a tool with no indication that a specific
      feature (or combination of features) is known to be problematic.
    • It would be ideal if there's an explicit opt-in from the user in the cases where problematic
      features are used, if only to act as a guide for the user's expectations.
  • The user experience that we expect is based on these two axes:
    • The user needs to be loudly warned that an enabled feature flag could produce compromised results
      for some reason.
    • The user needs to opt in to that somehow (perhaps by using an additional --experimental flag).
  • To that end, we agree to start with warning and explicit labels.
    • A flag/option to exit if "unsound"/experimental features are used could then be added in a minor
      version upgrade.
    • In the future we could switch to allowing only "sound"/good features unless an explicit flag
      (--I-know-the-risks) is given, but that might require a major version bump.
  • It was also agreed that the words "sound"/"unsound" don't really convey the intended semantics
    clearly.
    • Instead, we want to focus on two different behaviours that we want to avoid:
      • A spurious counterexample to an assertion, which effectively reports the assertion as refuted
        when it is not.
      • A wrong proof of an assertion.
  • It was also suggested that in terms of "unsound" behaviour, core CBMC has (almost?) none, and should
    have none.
    • However goto-instrument is quite experimental in nature.
    • There are also some areas (for example, x86-asm modeling) which are not complete.

The action plan is:

  • Short term - Add documentation on what we know can produce problematic output for any reason.
  • Short term - Add warnings on all of these features when they are used (this also works as developer
    documentation inside the code).
  • Long term - Move CBMC's reporting to an n-valued output: Proved/Falsified/Unknown/Unreachable,
    etc. (a rough sketch of one possible shape follows).
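
A rough sketch of one possible shape for the long-term n-valued output; this is purely illustrative and is not CBMC's existing property-status code:

    // Sketch: n-valued property verdicts, plus a note on what each may be
    // trusted for depending on over-/under-approximating options (see above).
    #include <string>

    enum class property_resultt
    {
      PROVED,      // no violation possible within the analysed model
      FALSIFIED,   // a concrete counterexample was found
      UNKNOWN,     // neither established, e.g. an unwinding limit was hit
      UNREACHABLE  // the property cannot be reached under the current harness/slicing
    };

    inline std::string describe(property_resultt r)
    {
      switch(r)
      {
      case property_resultt::PROVED:
        return "PROVED (trust only if no under-approximating options were used)";
      case property_resultt::FALSIFIED:
        return "FALSIFIED (trust only if no over-approximating options were used)";
      case property_resultt::UNKNOWN:
        return "UNKNOWN";
      case property_resultt::UNREACHABLE:
        return "UNREACHABLE";
      }
      return "UNKNOWN";
    }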

@martin-cs
Collaborator

Thanks for writing this up @NlightNFotis, I think this is a good summary of the discussion. The only thing I might add is that this is fundamentally about communicating and managing user expectations; it is a "soft" problem, not a "tech" problem. There are two aspects to this: acquiring / life-cycling / managing the information from developers about what to expect, and communicating it to users.

@martin-cs
Collaborator

Also

/// The status of a property
is @peterschrammel's previous work on n-valued output.

@feliperodri changed the title from "RFC: How to handle unsound flags." to "RFC: How to handle unsound flags" on Oct 6, 2022
@feliperodri added the aws (Bugs or features of importance to AWS CBMC users) label on Oct 6, 2022
@feliperodri added the soundness (Soundness bug? Review and add "aws" if it is, or remove "soundness" if it isn't.) label and removed the aws-medium label on Feb 13, 2023
@tautschnig self-assigned this on Mar 9, 2023
@tautschnig
Collaborator

I believe we have addressed the short-term actions by merging #6479 (which adds both documentation and warnings). Longer-term work (making CBMC's results n-valued) remains to be done, but this also overlaps with reporting coverage, which is being tracked elsewhere. To everyone: closing this for the time being, but feel free to reopen if you believe we need to keep tracking some activity in this issue.
