Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panics in the SDK #310

Open
PiotrSikora opened this issue Mar 27, 2025 · 2 comments
Open

Panics in the SDK #310

PiotrSikora opened this issue Mar 27, 2025 · 2 comments

Comments

@PiotrSikora
Copy link
Member

@alexsnaps said in #282 (comment):

This issue with these panic is that currently envoy doesn't play nicely with them... Tho things are getting better, it can still result in unexpected behavior. So while I understand the rational behind wanting to panic!, given this is wasm and panic are aborting execution, this leaves the caller with no way to gracefully handle the error. Which by itself is problematic in my perspective, but together with some environment then failing further (like starving workers), is really not acceptable, as I imagine we can all agree on.

So, should there be a way to provide the user with means to be "fully responsible"? it could be "as simple" as allowing the user to call into functions that would expose all failures, while leaving the trait et al default implementation panic? Or maybe add a "SDK panic error handler" that defaults to the current behavior, but would let the user override that behavior? Thinking out loud here, but while I understand ideally "it would all always work according to spec", its hard to guarantee it will, more so with an on-going specification.

This has come up in the past when discussing one of my PRs with @PiotrSikora . We are now running into issues in production that don't allow us to tolerate the SDK panicking. Looking into a viable path forward that would check everyone's goals, if such a thing is possible obviously. Thanks!

@PiotrSikora
Copy link
Member Author

This issue with these panic is that currently envoy doesn't play nicely with them... Tho things are getting better, it can still result in unexpected behavior. So while I understand the rational behind wanting to panic!, given this is wasm and panic are aborting execution, this leaves the caller with no way to gracefully handle the error. Which by itself is problematic in my perspective, but together with some environment then failing further (like starving workers), is really not acceptable, as I imagine we can all agree on.

panics are only used in the code paths that cannot happen (unless there is a bug in the SDK or Envoy, but those should be fixed in those projects, and not in the plugins).

However, it sounds like you're saying that those panics are a regular occurrences, but I don't see too many reports about them from you (other than #237 that got fixed a while ago), so I'm a little confused about the frequency and how it's really affecting you in production.

Also, how do you plan to handle those cases that should never happen gracefully in your plugin?

So, should there be a way to provide the user with means to be "fully responsible"? it could be "as simple" as allowing the user to call into functions that would expose all failures, while leaving the trait et al default implementation panic?

The hostcalls are public interfaces, so that's already possible (although, it wouldn't override invalid / duplicate context_id panics in the dispatcher).

Or maybe add a "SDK panic error handler" that defaults to the current behavior, but would let the user override that behavior?

That should be already possible with panic::set_hook. Did you run any issues with that?

This has come up in the past when discussing one of my PRs with @PiotrSikora . We are now running into issues in production that don't allow us to tolerate the SDK panicking.

Could you elaborate on this? Is it a policy against panics in the codebase or panics at runtime?

@alexsnaps
Copy link
Contributor

Also, how do you plan to handle those cases that should never happen gracefully in your plugin?

Things can go wrong, and sadly will. From our plugin/solution's perspective they are actually many reasons things go wrong, excluding the SDK entirely here. Because of that we let users define what the behavior should be when "things don't go according to plan". Now we've hit weird cases that have been e.g. linked to "bad requests", not necessarily the SDK at all... might be UB afaict.

As an example here, what if this isn't proper utf-8? Then there are other cases where the host, like in #237 isn't compliant, which... yes! Should be fixed, but then also means we can't provide our users with a working solution until it is fixed merged and released to them (where that last step can be somewhat cumbersome).

That's where I'm coming from: embrace that things can go wrong, including because of "bad hosts", yet leave some possibility for the caller to deal with it. Hence the proposal of doing this "layered", e.g. the default trait impl "hide" the Result::Err from you and "hope for the best", while some other layer has explicit Results.

The hostcalls are public interfaces, so that's already possible (although, it wouldn't override invalid / duplicate context_id panics in the dispatcher).

Well, that means duplicating a lot of what the SDK already does too then, right? e.g. the bad utf-8 encoding again.

Could you elaborate on this? Is it a policy against panics in the codebase or panics at runtime?

Sorta the perfect storm really, envoy starving workers as the wasm modules panics... And while newer versions cope with that (or have so far), we're all humans, we write code and all humans writing code will eventually do mistakes... again. Murphy's Law I guess.

That should be already possible with panic::set_hook. Did you run any issues with that?

The issue there is no unwind, so yeah it is probably feasible to be a little more graceful at least and otherwise fallback to other means of reclaiming memory et al. But feels like a nuclear option, don't you think?

Anyways, I'm just trying to come at it from a slightly different angle, while trying to preserve your intent of making bugs/invalid behavior really obvious. The question I try to answer is: how does one deal with such an issue until a fix becomes available to them? Enabling users to handle the Err seemed like a valid approach to me... and I still fail to see how it is bad to be honest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants