-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: .Net: Bug: Missing "status" field in Python dynamic session plugin responses causes LLM misinterpretation of execution failures #9902
Comments
@markwallace-microsoft / others: this isn't a bug in the Python codebase specifically, rather a bug in the In all the dynamic sessions plugin implementations, the template to interpolate the code execution API response back to the LLM is identical. Hence they're all affected. |
Thanks for filing the issue and for working on the fix, @malandis. We'll have the .Net team review your PR soon! |
@moonbox3 sounds good. Once we align on the problem/fix, I can port to other languages/integrations 👍 |
Looks like you've provided the fix. When I first created the Python plugin, I don't remember there being a "status" property in the result -- I wonder if it was added recently. I've made the similar fix in #9904. Thanks for your help! |
### Motivation and Context The Python sessions plugin (ACA) includes a `status` now in their result once code is executed. This looks to have been added recently as we weren't including it in the original return string. This PR adds that. <!-- Thank you for your contribution to the semantic-kernel repo! Please help reviewers and future users, providing the following information: 1. Why is this change required? 2. What problem does it solve? 3. What scenario does it contribute to? 4. If it fixes an open issue, please link to the issue here. --> ### Description Add the `status` key value to the return string. - Fixes Python bug for #9902 <!-- Describe your changes, the overall approach, the underlying design. These notes will help understanding how your code works. Thanks! --> ### Contribution Checklist <!-- Before submitting this PR, please make sure: --> - [X] The code builds clean without any errors or warnings - [X] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations - [X] All unit tests pass, and I have added new tests where possible - [X] I didn't break anyone 😄
### Motivation and Context Addresses issue #9902 for .NET. Include the "status" field in the response string returned by Python dynamic sessions. Fixes a usability issue where the LLM assumes success despite execution failures. ### Description As per #9902, including the `status` property from the response in the plugin result, we ensure that the LLM has explicit information about whether the execution succeeded or failed, preventing misinterpretation of stderr or other response elements. This helps avoid hallucinated follow-ups or unnecessary retries by providing clear success/failure indicators. ### Contribution Checklist <!-- Before submitting this PR, please make sure: --> - [x] The code builds clean without any errors or warnings - [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations - [x] All unit tests pass, and I have added new tests where possible - [x] I didn't break anyone 😄 --------- Co-authored-by: Evan Mattson <[email protected]>
### Motivation and Context Addresses issue #9902 for .NET. Include the "status" field in the response string returned by Python dynamic sessions. Fixes a usability issue where the LLM assumes success despite execution failures. ### Description As per #9902, including the `status` property from the response in the plugin result, we ensure that the LLM has explicit information about whether the execution succeeded or failed, preventing misinterpretation of stderr or other response elements. This helps avoid hallucinated follow-ups or unnecessary retries by providing clear success/failure indicators. ### Contribution Checklist <!-- Before submitting this PR, please make sure: --> - [x] The code builds clean without any errors or warnings - [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations - [x] All unit tests pass, and I have added new tests where possible - [x] I didn't break anyone 😄 --------- Co-authored-by: Evan Mattson <[email protected]>
Closing this issue as both .Net and Python PRs are in. |
Describe the bug
When using the Python dynamic sessions plugin as a tool for LLMs, the execution response from the API includes an explicit "status" field, but this is not included by the plugin when building the string response to the LLM.
Specifically: while the HTTP POST request status code is OK, code execution response contains an explicit
status
property in the response body that can be"Failure"
(vs"Success"
).Because this is missing from the plugin result, the LLM must infer success or failure based on the presence of stderr content. This can lead the LLM to misinterpret certain exceptional scenarios as ongoing or partial success instead of failure. As a result, the LLM may hallucinate follow-up steps, enter unnecessary retry loops, or otherwise produce incorrect responses.
To Reproduce
This is a bit involved since we must set up a tool-calling agent and specifically trigger a code execution failure. That said this scenario reliably produces it for me.
Set up a Python dynamic session and upload a CSV file to the session’s container. For example, create a
sample-data.csv
file with random data:Prompt the LLM through the tool to execute Python code that will intentionally fail, for example by including a deprecated library or code snippet that won’t run:
Observe the tool’s execution result. The response includes:
Note that the LLM misinterprets this scenario as ongoing success, repeatedly attempting to continue execution or retry.
Expected behavior
When the interpreter fails, the returned response should include the
status
field from the response body. With this explicit status, the LLM recognizes the error immediately and will not enter a loop of retries or produce hallucinated follow-ups. Instead, it can gracefully handle the failure scenario.Screenshots
N/A
Platform
Additional context
The text was updated successfully, but these errors were encountered: