Planning as code - where does it end? #174

printfhere · 2025-01-13T14:01:43Z

Hello everyone,

Something has been bothering me since I started hacking with smolagents, and I'm curious to hear the community's thoughts on it.

The smolagents approach of using code instead of JSON for planning enables better composability and multiple tool calls per step. This effectively blurs the line between planning and execution, as a single "step" can contain multiple tool calls within one code block.

Should we push this philosophy to its logical conclusion and have agents plan entire workflows as a single code block when possible (considering we have the tools to do so)?

For example:

Query: If the temperature in Paris is lower than the temperature in New York, give me 3 museums to visit in Paris, otherwise give me 3 parks in New York.

Using smolagents' default tooling:

# Current sequential approach
web_search("weather in paris")
web_search("weather in new york")
# Wait for next planning step...

But if we had enough tools:

# Full workflow in one plan
nyc_weather = web_search("weather in new york")
paris_weather = web_search("weather in paris")
nyc_temp = parse_temperature(nyc_weather, output_format="float")
paris_temp = parse_temperature(paris_weather, output_format="float")
if paris_temp < nyc_temp:
    # Continue logic...
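
To make the single-block plan concrete, here is a minimal runnable sketch of the idea. Everything in it is an assumption for illustration: `web_search`, `parse_temperature`, and `find_places` are hypothetical stubs with canned data, not smolagents tools, and the temperatures are made up.

```python
import re

# Canned search results (made up); a real tool would query the web.
_FAKE_RESULTS = {
    "weather in paris": "Paris: 4.5 °C, overcast",
    "weather in new york": "New York: 9.0 °C, clear",
}


def web_search(query: str) -> str:
    """Hypothetical search tool: returns a canned result string."""
    return _FAKE_RESULTS[query.lower()]


def parse_temperature(text: str, output_format: str = "float") -> float:
    """Hypothetical parsing tool: extracts the first number followed by °C."""
    if output_format != "float":
        raise ValueError("this sketch only supports output_format='float'")
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*°C", text)
    if match is None:
        raise ValueError(f"no temperature found in {text!r}")
    return float(match.group(1))


def find_places(kind: str, city: str, limit: int = 3) -> list[str]:
    """Hypothetical lookup tool: returns placeholder names."""
    return [f"{city} {kind} #{i}" for i in range(1, limit + 1)]


def plan() -> list[str]:
    """The whole workflow as one code block, branch included."""
    nyc_temp = parse_temperature(web_search("weather in new york"), output_format="float")
    paris_temp = parse_temperature(web_search("weather in paris"), output_format="float")
    if paris_temp < nyc_temp:
        return find_places("museum", "Paris")
    return find_places("park", "New York")


result = plan()
print(result)
```

The branch only fits in one block because `parse_temperature` turns free text into a typed value. Without such a tool, the model would have to read the search result itself, which forces a step boundary.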

This raises several questions (at least for me):

How do we balance the benefits of code-based composition against the need for dynamic adaptation and error recovery? In other words, how do we define a step?
If we had enough tools to do anything, should there be a way to cap the number of tool calls made in a single step?
What is the purpose of keeping an outer agent loop if we move towards larger code blocks? Mainly error handling, recovery, and memory, I suppose? (I get that some use cases, like a web-browsing agent, would be difficult to plan as a one-shot workflow without essentially re-coding an agent.)
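
One way to picture the last two questions together is the sketch below: each step is a code block executed under a tool-call budget, and the outer loop's only job is to record the result or the error as an observation. This is not how smolagents implements steps. `StepBudget`, `run_step`, and the `echo` tool are hypothetical names, and the bare `exec` has no sandboxing.

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when a step makes more tool calls than allowed."""


class StepBudget:
    """Counts tool calls within one step and enforces a cap."""

    def __init__(self, max_tool_calls: int):
        self.max_tool_calls = max_tool_calls
        self.calls = 0

    def wrap(self, tool):
        def guarded(*args, **kwargs):
            if self.calls >= self.max_tool_calls:
                raise ToolBudgetExceeded(
                    f"more than {self.max_tool_calls} tool calls in one step"
                )
            self.calls += 1
            return tool(*args, **kwargs)

        return guarded


def run_step(code: str, tools: dict, max_tool_calls: int) -> dict:
    """Execute one code block and return an observation for the outer loop."""
    budget = StepBudget(max_tool_calls)
    namespace = {name: budget.wrap(fn) for name, fn in tools.items()}
    try:
        exec(code, namespace)  # illustration only: no sandboxing here
        return {"ok": True, "calls": budget.calls, "result": namespace.get("result")}
    except Exception as exc:
        return {"ok": False, "calls": budget.calls, "error": f"{type(exc).__name__}: {exc}"}


def echo(text: str) -> str:
    """Hypothetical tool used only to exercise the budget."""
    return text.upper()


tools = {"echo": echo}
memory = []  # what the outer loop would feed back to the model
memory.append(run_step("result = echo('a') + echo('b')", tools, max_tool_calls=2))
memory.append(run_step("result = [echo(c) for c in 'abc']", tools, max_tool_calls=2))
for observation in memory:
    print(observation)
```

The second block fails on its third call, and the error lands in `memory` instead of crashing the run, which is roughly the error-handling and memory role suggested above for the outer loop.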

I hope this makes sense. I'm not sure what exactly is bothering me, but I feel there is some abstraction to be found in all this.

JeremyBickel · 2025-01-13T23:11:40Z

I think the answer is simply that we have to judge this case by case. You already mentioned a web-browsing agent that needs several pauses before deciding what to do next, and there's a ready example in the VLM browser branch of this repo, where the LM is instructed to "Proceed in several steps rather than trying to do it all in one shot". It makes sense there, so the tool is molded to fit the circumstance. It seems to me that you're asking to put the constraint on the tool instead of on its application.

printfhere · 2025-01-14T11:48:48Z

Hi Jeremy, thank you for your input!

I was simply wondering if people had the same questions as I did about the boundaries of a step and if some interesting ways of seeing things would emerge.

As I understand it, your point of view would be to evaluate the agent with different settings for the "step prompting policy" and find the best.

Still, creating specialized tools (such as a tool that parses a result string into a fixed format) can lengthen a step, since the agent no longer needs the outer loop to extract the result of a previous step. I think that is more of a "who does what" problem.

sunpazed Jan 16, 2025

Interesting question – I was recently asking myself the same thing. To test this, I wrote a small agent to answer the following question:

Which city is currently the coldest? New York, Glasgow, or Shanghai? Will I need an umbrella in this city in the next few days? Respond in natural language.

I wrote a custom Tool to fetch the current weather, the forecast, and the precipitation via a REST API. Here's a video of how the agent tackled the problem:

smolagents-weather.mp4

The agent fetched the weather results first, and then based on the shape of the data, extracted the additional data it needed in the next step. I believe this is an optimal planning strategy for the agent, unless the schema of the response is well known upfront.
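
The two-step pattern described here can be sketched roughly as follows. This is not the agent from the video. `get_weather` is a hypothetical tool with made-up data, `describe_shape` is an illustrative helper, and the 1.0 mm umbrella threshold is an arbitrary assumption.

```python
def get_weather(city: str) -> dict:
    """Hypothetical weather tool with made-up data and an undocumented schema."""
    data = {
        "New York": {"current": {"temp_c": 3.0},
                     "forecast": [{"precip_mm": 0.0}, {"precip_mm": 0.2}]},
        "Glasgow": {"current": {"temp_c": 1.0},
                    "forecast": [{"precip_mm": 4.1}, {"precip_mm": 0.0}]},
        "Shanghai": {"current": {"temp_c": 8.0},
                     "forecast": [{"precip_mm": 0.0}, {"precip_mm": 0.0}]},
    }
    return data[city]


def describe_shape(value, depth: int = 0, max_depth: int = 3):
    """Summarise nested data as type names so the next step can plan against it."""
    if depth >= max_depth:
        return type(value).__name__
    if isinstance(value, dict):
        return {k: describe_shape(v, depth + 1, max_depth) for k, v in value.items()}
    if isinstance(value, list):
        return [describe_shape(value[0], depth + 1, max_depth)] if value else []
    return type(value).__name__


cities = ["New York", "Glasgow", "Shanghai"]

# Step 1: fetch everything, but only look at the shape of one response.
raw = {city: get_weather(city) for city in cities}
shape = describe_shape(raw["Glasgow"])
print(shape)

# Step 2: written after seeing `shape`, so the key paths are now known.
coldest = min(cities, key=lambda c: raw[c]["current"]["temp_c"])
needs_umbrella = any(day["precip_mm"] > 1.0 for day in raw[coldest]["forecast"])
print(coldest, needs_umbrella)
```

If the schema were documented in the tool description, step 2 could be written without step 1, which is the "well known upfront" case mentioned above.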

I've also seen this approach with a text-to-SQL agent I wrote, even when the table schema and description are well defined. For very complex questions, the agent will generate multiple smaller SQL queries, and then coalesce the results of previous steps in its final answer.
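
As a rough illustration of "several small queries, then coalesce", here is a sketch using an in-memory SQLite database. The table, the data, and the question are all made up for the example and do not come from the agent described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders (customer, amount) VALUES
        ('alice', 30.0), ('alice', 20.0), ('bob', 15.0), ('carol', 45.0);
    """
)

# Made-up question: "Who is the top customer, and what share of revenue is theirs?"

# Step 1: one small query for total revenue.
(total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()

# Step 2: another small query for the top customer.
top_customer, top_amount = conn.execute(
    "SELECT customer, SUM(amount) AS spent FROM orders "
    "GROUP BY customer ORDER BY spent DESC, customer ASC LIMIT 1"
).fetchone()

# Final step: coalesce the earlier results into a natural-language answer.
share = top_amount / total
answer = f"{top_customer} is the top customer with {share:.0%} of revenue."
print(answer)

conn.close()
```

The same answer could come from a single query with a subquery. The trade-off is that each small query gives the outer loop an intermediate observation it can check before moving on.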

fedorzh · 2025-02-03T10:54:27Z

I recently realized I am having the same issue. I started working on a simple scenario where a user might either ask the LLM for some generic information or use the SQL agent described in the tutorial.
If I were working with JSON-based agents, the functionality would, in my opinion, require a clear separation of concerns: a routing agent that understands when a request has to go to the SQL agent, the SQL-writing agent itself, and a JSON-filling agent that then calls the SQL execution. Moreover, LLMs can be forced to output correct JSON using constrained-output techniques. This is not the case for code agents.
First, with code agents, it is not clear to me whether routing should be done by the same code agent or by a different one. If by the same one, how do I guide it to make the correct choice, especially when I have more than two categories (that is, more than just "SQL call and processing" versus "plain text response")?
Then, a smaller problem: is the error rate in producing a correct line of Python to call the SQL execution low enough?
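
Code output cannot be constrained the way JSON can, but a generated snippet can at least be checked statically before it runs. The sketch below uses Python's `ast` module to reject snippets that fail to parse or that call anything outside an allowlist. The tool names `sql_engine` and `final_answer` are assumptions chosen for illustration, and this is a lint-style check, not a sandbox.

```python
import ast

ALLOWED_TOOLS = {"sql_engine", "final_answer"}  # assumed tool names


def check_generated_code(code: str, allowed: set[str] = ALLOWED_TOOLS) -> list[str]:
    """Return a list of problems; an empty list means the snippet passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append("imports are not allowed")
        elif isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name):
                # Rejects method or attribute calls such as obj.method(...)
                problems.append("only direct tool calls are allowed")
            elif node.func.id not in allowed:
                problems.append(f"unknown tool: {node.func.id}")
    return problems


good = 'rows = sql_engine("SELECT COUNT(*) FROM receipts")\nfinal_answer(rows)'
bad_tool = 'drop_table("receipts")'
bad_syntax = 'rows = sql_engine("SELECT 1"'

for snippet in (good, bad_tool, bad_syntax):
    print(check_generated_code(snippet))
```

A failed check could be fed back to the model as an observation, so a malformed call costs one retry rather than a runtime failure. Routing also becomes an ordinary branch: a plain-text reply is just a snippet that calls `final_answer` without calling `sql_engine`.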

SebastianLavertheDe · 2025-02-15T10:00:17Z

What I know about agents comes in two forms: one is workflows, and the other is smolagents. In a workflow, each node can be an agent or a tool, and the execution is manually orchestrated, meaning the inputs and outputs of the tools are wired together by hand, similar to ComfyUI or LangGraph.

The other form is like smolagents, where the agents are quite heavy and can contain many tools. These agents autonomously plan their execution logic and processes. Personally, I prefer the ComfyUI format. You have to understand that a complex ComfyUI workflow can have dozens or even over 100 nodes (you can think of nodes as tools), and the input and output relationships between nodes can be one-to-many or many-to-one. Currently, AI is unable to orchestrate so many tools at once.

The concept of smolagents is very advanced, allowing AI to autonomously plan execution logic and processes, especially in the form of Python code. However, given the current capabilities of AI, it still cannot orchestrate complex workflows.
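
For contrast with agent-planned code, a manually orchestrated workflow of the kind described above can be sketched as a small graph of tools executed in dependency order. This is not how ComfyUI or LangGraph work internally. The node functions and wiring are invented for illustration, and the scheduler is the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter


# Each node is a plain function (a "tool"); all of them are illustrative.
def load():
    return [3, 1, 2]


def sort_values(values):
    return sorted(values)


def total(values):
    return sum(values)


def report(ordered, total_value):
    return f"{ordered} sums to {total_value}"


NODES = {"load": load, "sort": sort_values, "total": total, "report": report}

# node -> upstream nodes whose outputs are passed as positional arguments, in order
EDGES = {
    "load": [],
    "sort": ["load"],             # one-to-many: load feeds both sort and total
    "total": ["load"],
    "report": ["sort", "total"],  # many-to-one: report consumes two outputs
}


def run_workflow(nodes: dict, edges: dict) -> dict:
    """Run every node once, after all of its upstream nodes have run."""
    outputs = {}
    for name in TopologicalSorter(edges).static_order():
        args = [outputs[dep] for dep in edges[name]]
        outputs[name] = nodes[name](*args)
    return outputs


outputs = run_workflow(NODES, EDGES)
print(outputs["report"])
```

Here a person fixes the wiring in `EDGES` ahead of time. In the code-agent approach, the model would have to produce the equivalent wiring itself at run time.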
