Multimodal large language models (MLLMs) have revolutionized LLM-based agents by enabling them to interact directly with application user interfaces (UIs).
This capability extends the model’s scope from text-based responses to visually understanding and responding within a UI, significantly enhancing performance in complex tasks.
Now, LLMs can interpret and respond to images, buttons, and text inputs in applications, making them more adept at navigation and user assistance in real-time workflows.
This interaction improves the agent's ability to handle dynamic, multi-step processes that require both visual and contextual awareness, offering more robust solutions across industries like customer support, data management, and task automation.
AI agents often suffer from high latency and low reliability due to the extensive sequential UI interactions these workflows require.
AXIS: Agent eXploring API for Skill integration
Conventional AI agents often interact with a graphical user interface (GUI) in a human-like manner, interpreting screen layouts, elements, and sequences as a person would.
These LLM-based agents, typically built on fine-tuned visual language models, aim to navigate mobile and desktop tasks efficiently.
However, AXIS presents a new perspective: while human-like UI-based interactions help make these agents versatile, they can be time-intensive, especially for tasks that involve numerous, repeated steps across a UI.
This complexity arises because traditional UIs are inherently designed for human-computer interaction (HCI), not agent-based automation.
AXIS suggests that leveraging application APIs, rather than interacting with the GUI itself, offers a far more efficient solution.
For instance, where a traditional UI agent might change multiple document titles by navigating through UI steps for each title individually, an API could handle all titles simultaneously with a single call, streamlining the process.
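The difference in cost can be sketched in a few lines of code. Everything below is illustrative: the `DocsAPI` class, its methods, and the three-step open/rename/close UI sequence are hypothetical stand-ins for a real application, used only to contrast per-document UI navigation with a single batch API call.

```python
# Hypothetical sketch: renaming N document titles one UI sequence at a
# time vs. one batch API call. DocsAPI is a toy stand-in, not a real API.

class DocsAPI:
    """Toy application backend that counts round-trips."""

    def __init__(self, titles):
        self.titles = dict(titles)
        self.calls = 0  # each UI step or API call is one round-trip

    # --- UI-style path: one interaction sequence per document ---
    def ui_open(self, doc_id):
        self.calls += 1

    def ui_rename(self, doc_id, new_title):
        self.calls += 1
        self.titles[doc_id] = new_title

    def ui_close(self, doc_id):
        self.calls += 1

    # --- API-style path: one call handles every document ---
    def batch_rename(self, renames):
        self.calls += 1
        self.titles.update(renames)


def rename_via_ui(api, renames):
    """Mimic a UI agent: navigate into each document individually."""
    for doc_id, title in renames.items():
        api.ui_open(doc_id)
        api.ui_rename(doc_id, title)
        api.ui_close(doc_id)


renames = {f"doc{i}": f"Report {i}" for i in range(10)}

ui = DocsAPI({d: "untitled" for d in renames})
rename_via_ui(ui, renames)   # 30 interactions for 10 documents

api = DocsAPI({d: "untitled" for d in renames})
api.batch_rename(renames)    # 1 call for all 10 documents
```

Both paths end in the same application state; the API path simply collapses a linear number of UI steps into a single call, which is exactly the latency and reliability argument AXIS makes.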
AXIS aims to not only reduce redundant interactions and simplify complex tasks but also establish new design principles for UIs in the LLM era. This approach advocates for rethinking application design to prioritize seamless integration between AI agents and application functionalities, enabling a more direct, API-driven approach that complements both user and agent workflows.
In AXIS's exploration mode, the AI agent autonomously interacts with the application's interface to discover the functions and actions available to it.
The agent records these interactions, gathering data on how various parts of the UI respond to different actions.
This exploration helps the agent map out the application’s capabilities, essentially “learning” what’s possible within the app.
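The exploration loop described above can be sketched as follows. This is a minimal, hypothetical illustration: `FakeApp`, its element names, and the canned outcomes are invented for the example; a real implementation would drive a live UI and observe actual responses.

```python
# Minimal sketch of exploration: enumerate UI elements, try the actions
# each might afford, and record what the application does in response.

class FakeApp:
    """Toy application exposing a few UI elements (hypothetical)."""

    def elements(self):
        return ["save_button", "title_field", "share_menu"]

    def perform(self, element, action):
        # Canned observations standing in for real UI responses.
        outcomes = {
            ("save_button", "click"): "document saved",
            ("title_field", "type"): "title updated",
            ("share_menu", "click"): "share dialog opened",
        }
        return outcomes.get((element, action), "no visible effect")


def explore(app, actions=("click", "type")):
    """Try every action on every element and log what worked."""
    skill_map = {}
    for element in app.elements():
        for action in actions:
            result = app.perform(element, action)
            if result != "no visible effect":
                skill_map[(element, action)] = result
    return skill_map


skills = explore(FakeApp())
# skills maps workable (element, action) pairs to observed outcomes
```

The resulting `skill_map` is the agent's learned picture of what the application can do; later tasks can consult it instead of re-discovering the UI from scratch.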