Planning as code - where does it end? #174
Replies: 4 comments 1 reply
-
I think the answer is simply that we have to gauge things on a per-use basis. You already mentioned a web browser agent that needs multiple pauses before deciding what to do next, and there's a ready example in the vlm browser branch of this repo, where the LM is instructed to "Proceed in several steps rather than trying to do it all in one shot". It makes sense there, so the tool is molded to fit the circumstance. It seems to me that you're asking to put the constraint on the tool instead of on its application.
-
Hi Jeremy, thank you for your input! I was simply wondering whether other people had the same questions as I did about the boundaries of a step, and whether some interesting ways of seeing things would emerge. As I understand it, your point of view would be to evaluate the agent with different settings for the "step prompting policy" and pick the best one. Still, creating specialized tools (such as tools that parse a result string into a predetermined format) can extend the length of a step, since the agent no longer needs the outer loop to extract the result of a previous step. To me that is more of a "who does what" problem.
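To make that last point a bit more concrete, here is a small sketch. `parse_temperature` is a hypothetical specialized tool (nothing that ships with smolagents, as far as I know) and the input strings are made up. Once a free-text result can be turned into a number inside the code block, the comparison no longer has to wait for the next iteration of the outer loop.

```python
import re


def parse_temperature(text: str) -> float:
    """Hypothetical tool: extract the first Celsius value from free text."""
    match = re.search(r"(-?\d+(?:\.\d+)?)\s*°?C\b", text)
    if match is None:
        raise ValueError(f"no temperature found in: {text!r}")
    return float(match.group(1))


# Made-up search results standing in for what a generic tool might return.
paris_text = "It is currently 12°C in Paris."
new_york_text = "New York: 18.5 °C, partly cloudy."

# With the parser available, the branch condition fits in the same step.
paris_is_colder = parse_temperature(paris_text) < parse_temperature(new_york_text)
print(paris_is_colder)
```

Without such a tool, the model itself would have to read the two strings and decide, which is exactly what the outer loop is for.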
-
I have recently realized I am having the same issue. I started working on a simple scenario where a user might either ask the LLM for some generic information or use a SQL agent like the one described in the tutorial.
-
What I know about agents comes in two forms: one is the workflow, and the other is smolagents. In a workflow, each node can be an agent or a tool, and the execution is orchestrated manually, meaning the inputs and outputs of the tools are wired together by hand, similar to comfyui or langgraph. The other form is like smolagents, where the agents are quite heavy and can contain many tools. These agents autonomously plan their execution logic and processes. Personally, I prefer the comfyui format. You have to understand that a complex comfyui workflow can have dozens or even over 100 nodes (you can think of nodes as tools), and the input and output relationships between nodes can be one-to-many or many-to-one. Currently, AI is unable to orchestrate that many tools at once. The concept behind smolagents is very advanced: it lets AI autonomously plan execution logic and processes, especially in the form of Python code. However, given the current capabilities of AI, it still cannot orchestrate complex workflows.
-
Hello everyone,
Something has been bothering me since I started hacking with smolagents, and I'm curious to hear the community's thoughts about it.
The smolagents approach of using code instead of JSON for planning enables better composability and multiple tool calls per step. This effectively blurs the line between planning and execution, as a single "step" can contain multiple tool calls within one code block.
Should we push this philosophy to its logical conclusion and have agents plan entire workflows as a single code block when possible (considering we have the tools to do so)?
For example:
Query: If the temperature in Paris is lower than the temperature in New York, give me 3 museums to visit in Paris, otherwise give me 3 parks in New York.
Using smolagents' default tooling:
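A minimal sketch of what this tends to look like, assuming the only tool available is a generic `web_search`. It is stubbed here with made-up canned strings and is not the real smolagents tool. Because the search returns free text, the model has to read the observation before it can branch, so the plan ends up split across steps.

```python
# Stub standing in for a generic search tool; the canned answers are made up.
def web_search(query: str) -> str:
    canned = {
        "current temperature in Paris": "It is currently 12 degrees Celsius in Paris.",
        "current temperature in New York": "New York: 18 degrees Celsius, partly cloudy.",
        "museums to visit in Paris": "Results: museum A, museum B, museum C",
    }
    return canned[query]


# Step 1: the agent can only gather raw text and hand it to the next step.
def step_1() -> str:
    paris = web_search("current temperature in Paris")
    new_york = web_search("current temperature in New York")
    return f"{paris}\n{new_york}"


# Between the two steps, the LLM reads the observation and decides which
# branch applies. The decision is hard-coded here to mimic its conclusion.
def step_2(observation: str) -> str:
    # The model concluded from the text that Paris is colder.
    return web_search("museums to visit in Paris")


observation = step_1()
answer = step_2(observation)
print(observation)
print(answer)
```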
But if we had enough tools:
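Since the point is the shape of the plan rather than any real API, here is a rough sketch in plain Python. The tools `get_temperature`, `find_museums` and `find_parks` are hypothetical stubs with placeholder return values, not smolagents built-ins; a real agent would call actual tools in their place.

```python
# Hypothetical tools: stubs with hard-coded placeholder data, for illustration only.
def get_temperature(city: str) -> float:
    """Return the current temperature in Celsius (stubbed values)."""
    fake_readings = {"Paris": 12.0, "New York": 18.0}
    return fake_readings[city]


def find_museums(city: str, limit: int) -> list[str]:
    """Return `limit` placeholder museum names for the city."""
    return [f"{city} museum {i}" for i in range(1, limit + 1)]


def find_parks(city: str, limit: int) -> list[str]:
    """Return `limit` placeholder park names for the city."""
    return [f"{city} park {i}" for i in range(1, limit + 1)]


def plan_as_single_block() -> list[str]:
    """The whole query as one code block: several tool calls, a single 'step'."""
    if get_temperature("Paris") < get_temperature("New York"):
        return find_museums("Paris", limit=3)
    return find_parks("New York", limit=3)


result = plan_as_single_block()
print(result)
```

The branch is resolved inside the code, so the model never needs to look at an intermediate observation.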
This raises several questions (at least for me):
How do we balance the benefits of code-based composition against the need for dynamic adaptation and error recovery? I.e., how do we define a step?
If we had enough tools to do anything, should there be a way to specify the max number of tool_calls made in a step?
What is the purpose of maintaining an outer agent loop if we move towards larger code blocks? Mainly error handling, recovery and memory, I suppose? (I get that some use cases, like a web browsing agent, would be difficult to plan as a one-shot workflow without re-coding an agent inside it.)
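One way to picture the last two questions is an outer loop that gives each step a fresh tool-call budget and records errors in memory so that a later attempt could react to them. This is only a sketch under my own assumptions: `ToolBudget`, `run_agent` and the `steps` callables (standing in for LLM-generated code blocks) are hypothetical names and not part of smolagents.

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when a step makes more tool calls than allowed."""


class ToolBudget:
    """Counts tool calls within a single step and enforces a maximum."""

    def __init__(self, max_calls: int):
        self.max_calls = max_calls
        self.calls = 0

    def wrap(self, tool):
        def limited(*args, **kwargs):
            if self.calls >= self.max_calls:
                raise ToolBudgetExceeded(
                    f"more than {self.max_calls} tool calls in one step"
                )
            self.calls += 1
            return tool(*args, **kwargs)

        return limited


def run_agent(steps, tools, max_tool_calls_per_step, max_retries=1):
    """Outer loop: run each step, keep observations and errors in memory."""
    memory = []
    for step in steps:
        for _attempt in range(max_retries + 1):
            budget = ToolBudget(max_tool_calls_per_step)  # fresh budget per attempt
            limited_tools = {name: budget.wrap(fn) for name, fn in tools.items()}
            try:
                observation = step(limited_tools, memory)
            except Exception as error:
                # A real agent would show this to the LLM so it can re-plan.
                memory.append(("error", repr(error)))
                continue
            memory.append(("observation", observation))
            break
    return memory


# Hypothetical tool and two hand-written "steps" standing in for generated code.
def get_temperature(city):
    return {"Paris": 12.0, "New York": 18.0}[city]


def compare(tools, memory):
    return tools["get_temperature"]("Paris") < tools["get_temperature"]("New York")


def greedy(tools, memory):
    return [tools["get_temperature"]("Paris") for _ in range(5)]


memory = run_agent(
    [compare, greedy],
    {"get_temperature": get_temperature},
    max_tool_calls_per_step=2,
)
print(memory)
```

Here `compare` stays within the budget, while `greedy` exceeds it on both attempts and only leaves error entries in memory. That is roughly what remains for the outer loop to do once the code blocks get larger.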
I hope this makes sense. I'm not sure what exactly is bugging me, but I feel that there's some abstraction to be found in all this.