
Python: Handling Rate Limits and Potential Code Interpreter Limitations in Azure Assistant Agent #10287

Open

anu43 opened this issue Jan 24, 2025 · 2 comments
Labels: agents, python (Pull requests for the Python Semantic Kernel)

anu43 commented Jan 24, 2025

We're encountering challenges when attempting to run more complex ML/DL algorithms on the Titanic dataset using an Azure Assistant Agent. It's unclear whether this is due to code interpreter limitations or our implementation.

Current Behavior:

  • Basic analyses and initial ML model training work successfully.
  • We encounter a rate limit error when attempting to improve model accuracy beyond 85%.

Error Message:

semantic_kernel.exceptions.agent_exceptions.AgentInvokeException: Run failed with status: `failed` for agent `data-scientist` and thread `thread_xxxxxxxxxx` with error: Rate limit is exceeded. Try again in 22 seconds.

Relevant Code Snippet:

from semantic_kernel import Kernel
from semantic_kernel.agents.open_ai import AzureAssistantAgent
from semantic_kernel.contents import (
    AuthorRole,
    ChatMessageContent,
    StreamingFileReferenceContent,
)

# DS_SYS_PROMPT, DATA_PATH, download_response_image, and _clean_up_resources
# are defined elsewhere in our script.
agent = await AzureAssistantAgent.create(
    kernel=Kernel(),
    service_id="agent",
    name="data-scientist",
    instructions=DS_SYS_PROMPT,
    enable_code_interpreter=True,
    code_interpreter_filenames=[DATA_PATH],
)

print("Creating thread... ", end="")
thread_id = await agent.create_thread()
print(thread_id)

try:
    is_complete: bool = False
    file_ids: list[str] = []
    while not is_complete:
        user_input = input("\nUser:> ")
        if not user_input:
            continue

        if user_input.lower() == "exit":
            is_complete = True
            break  # avoid sending "exit" to the agent as a message

        await agent.add_chat_message(
            thread_id=thread_id,
            message=ChatMessageContent(role=AuthorRole.USER, content=user_input),
        )
        is_code: bool = False
        async for response in agent.invoke(thread_id=thread_id):
            if is_code != response.metadata.get("code"):
                print()
                is_code = not is_code

            print(f"{response.content}", end="")

            file_ids.extend(
                [
                    item.file_id
                    for item in response.items
                    if isinstance(item, StreamingFileReferenceContent)
                ]
            )

        print()

        await download_response_image(agent, file_ids)
        file_ids.clear()

finally:
    # Clean up agents
    print("Cleaning up resources...")
    if agent is not None:
        await _clean_up_resources(agent=agent, thread_id=thread_id)

Questions:

  1. Is this a limitation of the code interpreter, or could it be related to our implementation?
  2. Are there best practices for optimizing code execution within the Azure Assistant Agent to avoid rate limits?
  3. How can we implement a wait mechanism to respect the rate limit (e.g., waiting 22 seconds before retrying)? A rough sketch of what we have in mind is included after this list.
  4. Are there any built-in retry mechanisms or rate limit handling features in the Azure Assistant Agent that we should be using?
  5. Should more complex ML tasks be broken down into smaller, sequential requests to the agent?
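
For context, a minimal sketch of the wait-and-retry wrapper we are imagining (a hypothetical helper we wrote, assuming the exception message keeps the "Try again in N seconds" wording shown above):

import asyncio
import re

from semantic_kernel.exceptions.agent_exceptions import AgentInvokeException


async def invoke_with_retry(agent, thread_id, max_retries: int = 3):
    """Re-run agent.invoke when a run fails with a rate-limit error,
    sleeping for the number of seconds suggested in the error message."""
    for attempt in range(max_retries + 1):
        try:
            async for response in agent.invoke(thread_id=thread_id):
                yield response
            return
        except AgentInvokeException as exc:
            match = re.search(r"Try again in (\d+) seconds", str(exc))
            if match is None or attempt == max_retries:
                raise
            # A failed run may have already streamed partial output; this
            # naive retry simply starts a fresh run on the same thread.
            await asyncio.sleep(int(match.group(1)) + 1)

Is something along these lines reasonable, or is there a built-in alternative we should be using instead?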

Desired Outcome:
We aim to understand the source of this limitation and find ways to handle rate limits effectively, allowing us to perform more complex ML tasks without errors. Additionally, we seek guidance on best practices for working with the Azure Assistant Agent for computationally intensive tasks.

Any insights, suggestions, or examples of addressing these issues would be greatly appreciated.

markwallace-microsoft added the python (Pull requests for the Python Semantic Kernel) and triage labels on Jan 24, 2025
moonbox3 self-assigned this on Jan 25, 2025
moonbox3 added the agents label and removed the triage label on Jan 25, 2025
moonbox3 (Contributor) commented

Hi @anu43, you can provide overrides for the RunPollingOptions used by the AzureAssistantAgent. The run polling options consist of:

@experimental_class
class RunPollingOptions(KernelBaseModel):
    """Configuration and defaults associated with polling behavior for Assistant API requests."""

    default_polling_interval: timedelta = Field(default=timedelta(milliseconds=250))
    default_polling_backoff: timedelta = Field(default=timedelta(seconds=1))
    default_polling_backoff_threshold: int = Field(default=2)
    default_message_synchronization_delay: timedelta = Field(default=timedelta(milliseconds=250))
    run_polling_interval: timedelta = Field(default=timedelta(milliseconds=250))
    run_polling_backoff: timedelta = Field(default=timedelta(seconds=1))
    run_polling_backoff_threshold: int = Field(default=2)
    message_synchronization_delay: timedelta = Field(default=timedelta(milliseconds=250))
    run_polling_timeout: timedelta = Field(default=timedelta(minutes=1))  # New timeout attribute

See the class definition in semantic_kernel/agents/open_ai/run_polling_options.py.

You could do something like:

from semantic_kernel.agents.open_ai.run_polling_options import RunPollingOptions
from datetime import timedelta

polling_options = RunPollingOptions(run_polling_interval=timedelta(seconds=5)) # or something based on your RPM

# Create the agent configuration
agent = await AzureAssistantAgent.create(
    kernel=kernel,
    service_id=service_id,
    name=AGENT_NAME,
    instructions=AGENT_INSTRUCTIONS,
    ...,
    polling_options=polling_options,
)
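
As a rough rule of thumb (back-of-the-envelope math on my part, not official guidance): a deployment capped at 60 RPM can absorb at most one request per second across polling and completion calls combined, so a polling interval of a second or more leaves headroom for the actual run requests.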

The attributes you'll want to pay attention to are:

run_polling_interval, run_polling_backoff, and run_polling_backoff_threshold

We apply these as follows:

def get_polling_interval(self, iteration_count: int) -> timedelta:
    """Get the polling interval for the given iteration count."""
    return (
        self.run_polling_backoff
        if iteration_count > self.run_polling_backoff_threshold
        else self.run_polling_interval
    )
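
To make the schedule concrete, here is a quick illustration (not library code) of the wait before each status check under the default values above (250 ms interval, 1 s backoff, threshold of 2):

from datetime import timedelta

# Mirrors get_polling_interval with the default RunPollingOptions values.
interval = timedelta(milliseconds=250)
backoff = timedelta(seconds=1)
threshold = 2

for iteration_count in range(1, 6):
    wait = backoff if iteration_count > threshold else interval
    print(f"iteration {iteration_count}: wait {wait.total_seconds()}s")

# iteration 1: wait 0.25s
# iteration 2: wait 0.25s
# iteration 3: wait 1.0s
# iteration 4: wait 1.0s
# iteration 5: wait 1.0s

So raising run_polling_interval and run_polling_backoff directly reduces how many status requests count against your RPM while a run is executing.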

Additionally, in your AI Foundry Portal, you can adjust the RPM/TPM for your model deployment. Could you check whether you can increase your RPM?

moonbox3 (Contributor) commented

I should add: yes, we can do better at handling rate limits for the caller -- a feature we should explore in the future. But hopefully my suggestion above can help mitigate your current 429s.
