Agent S In A Nutshell
Agent S addresses the following challenges in creating an agentic framework:
Domain Knowledge & Open-World Learning
- Agents must handle a wide variety of constantly changing applications and websites.
- They need specialised, up-to-date domain knowledge.
- The ability to continuously learn from open-world experiences is essential.
Complex Multi-Step Planning
- Desktop tasks often involve long sequences of interdependent actions.
- Agents need to generate plans with clear subgoals and track task progress over long horizons (a sketch of such a plan structure follows this list).
- This requires an understanding of task dependencies and proper execution sequencing.
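To make subgoals, dependencies, and long-horizon progress tracking concrete, here is a minimal sketch of how a plan of interdependent subtasks might be represented. The `Plan` and `Subtask` structures are hypothetical illustrations, not types from the Agent S codebase.

```python
from dataclasses import dataclass, field

# Hypothetical structures for illustration; not taken from the Agent S codebase.
@dataclass
class Subtask:
    description: str                                      # e.g. "Open the spreadsheet"
    depends_on: list[int] = field(default_factory=list)   # indices of prerequisite subtasks
    done: bool = False

@dataclass
class Plan:
    goal: str
    subtasks: list[Subtask]

    def next_subtask(self) -> Subtask | None:
        """Return the first unfinished subtask whose dependencies are all complete."""
        for st in self.subtasks:
            if not st.done and all(self.subtasks[i].done for i in st.depends_on):
                return st
        return None  # plan finished (or blocked)

plan = Plan(
    goal="Export a chart from a spreadsheet into a slide deck",
    subtasks=[
        Subtask("Open the spreadsheet"),
        Subtask("Create the chart", depends_on=[0]),
        Subtask("Copy the chart into the deck", depends_on=[1]),
    ],
)
print(plan.next_subtask().description)  # -> "Open the spreadsheet"
```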
Navigating Dynamic, Non-Uniform Interfaces
- Agents must process large volumes of visual and textual data while operating in a vast action space.
- They need to distinguish between relevant and irrelevant elements and respond accurately to visual feedback.
- GUI agents must interpret graphical cues correctly and adapt to dynamic interface changes.
- To address the challenge of solving long-horizon, complex desktop tasks, Agent S introduces Experience-Augmented Hierarchical Planning.
- This method enhances the agent's ability to leverage domain knowledge and plan more effectively.
- It improves the agent's performance on tasks that span multiple steps and intermediate goals (a sketch of the idea follows this list).
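As a rough sketch of what experience augmentation could look like, the snippet below retrieves the most similar past-task summary and feeds it to a planner. The `narrative_memory` store, the naive keyword-overlap retrieval, and the `llm` callable are all simplifying assumptions, not the actual Agent S retrieval pipeline.

```python
# Minimal sketch of experience-augmented planning. The memory store and llm()
# callable are hypothetical stand-ins, not the actual Agent S APIs.
def plan_with_experience(task: str, narrative_memory: dict[str, str], llm) -> list[str]:
    # Retrieve the most similar past task (here: naive keyword overlap;
    # a real system would use embedding similarity).
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))

    best = max(narrative_memory, key=lambda past: overlap(task, past), default=None)
    experience = narrative_memory.get(best, "No similar experience found.")

    prompt = (
        f"Task: {task}\n"
        f"Summary of a similar past task: {experience}\n"
        "Break the task into ordered subtasks, one per line."
    )
    return llm(prompt).splitlines()
```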
MLLM Agents
Multimodal Large Language Models (MLLMs) serve as the core reasoning framework for MLLM Agents, enabling them to process both language and visual information.
These agents combine various components such as memory, structured planning, tool usage, and the ability to act in external environments.
MLLM Agents are applied in domains like simulation environments, video games, and scientific research. They are also increasingly used in fields like Software Engineering, where Agent-Computer Interfaces (ACIs) improve their ability to understand complex systems and act efficiently within them.
This area of Agent-Computer Interfaces fascinates me the most.
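The observe-reason-act cycle these components form can be captured in a few lines. The `env` and `mllm` interfaces below are illustrative assumptions rather than a specific framework's API.

```python
# A schematic observe-reason-act loop for an MLLM agent; the environment and
# mllm() interfaces are illustrative assumptions, not a particular framework's API.
def run_agent(env, mllm, max_steps: int = 50) -> None:
    history: list[str] = []                  # lightweight episodic memory
    obs = env.reset()                        # e.g. a screenshot plus accessibility tree
    for _ in range(max_steps):
        action = mllm(observation=obs, history=history)   # multimodal reasoning step
        obs, done = env.step(action)         # act in the external environment
        history.append(f"{action} -> {'done' if done else 'continuing'}")
        if done:
            break
```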
GUI Agents
GUI Agents execute natural language instructions across both web and operating system environments.
Initially focused on web navigation tasks, their scope has expanded to operating systems, enabling them to handle OS-level tasks in benchmarks like OSWorld and WindowsAgentArena.
These agents are designed to navigate and control dynamic graphical interfaces, using methodologies such as behavioural cloning, in-context learning, and reinforcement learning.
Advanced features such as experience-augmented hierarchical planning enhance their performance in managing complex desktop tasks.
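One way such agents keep the vast GUI action space tractable is by exposing a small, discrete action vocabulary grounded to on-screen elements. The sketch below is a hypothetical example of that pattern built on the common `pyautogui` automation library; the action schema and element map are assumptions, not the Agent S interface.

```python
# Illustrative bounded action vocabulary for a GUI agent, in the spirit of an
# Agent-Computer Interface; the element ids and action schema are assumptions.
import pyautogui  # widely used GUI-automation library; any backend would do

def execute(action: dict, elements: dict[str, tuple[int, int]]) -> None:
    """Map a discrete agent action onto concrete mouse/keyboard events."""
    kind = action["type"]
    if kind == "click":
        x, y = elements[action["element_id"]]   # grounded screen coordinates
        pyautogui.click(x, y)
    elif kind == "type":
        pyautogui.typewrite(action["text"])
    elif kind == "hotkey":
        pyautogui.hotkey(*action["keys"])       # e.g. ["ctrl", "s"]
```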
Retrieval-Augmented Generation (RAG) for AI Agents
RAG improves the reliability of MLLM agents by retrieving external knowledge to enrich the input, resulting in more accurate, better-grounded outputs.
MLLM agents benefit from retrieving task exemplars, state-aware guidelines, and historical experiences.
In the Agent S framework, experience augmentation takes three forms:
- Hierarchical planning draws on both full-task and subtask experience.
- Full-task summaries serve as textual rewards for subtasks.
- Subtask experience is evaluated and stored for future reference.
This ensures that the agent can effectively learn and adapt over time.
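As a rough illustration of these three forms, the snippet below keeps a two-level memory of full-task summaries and evaluated subtask traces. The `ExperienceMemory` class and its methods are hypothetical, not taken from the Agent S implementation.

```python
# Sketch of how full-task (narrative) and subtask (episodic) experience might be
# stored and reused; the class and method names are hypothetical, not Agent S code.
class ExperienceMemory:
    def __init__(self) -> None:
        self.narrative: dict[str, str] = {}   # task -> full-task summary
        self.episodic: dict[str, str] = {}    # subtask -> evaluated subtask trace

    def save_task(self, task: str, summary: str) -> None:
        # Full-task summaries later act as textual "rewards" when judging subtasks.
        self.narrative[task] = summary

    def save_subtask(self, subtask: str, trace: str, succeeded: bool) -> None:
        # Only subtask experience judged successful is kept for future retrieval.
        if succeeded:
            self.episodic[subtask] = trace

    def recall(self, query: str) -> str | None:
        # Naive retrieval by exact key; a real system would use embedding search.
        return self.episodic.get(query) or self.narrative.get(query)
```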
The image below shows the Agent S framework. Given a task and an initial environment observation, the Agent S Manager performs experience-augmented planning, leveraging web knowledge and narrative memory to create subtasks.