AI Agent Evaluation Framework From Apple | by Cobus Greyling | Aug, 2024

15 Aug

I find the notion of a World State very interesting: certain ambient or environmental settings need to be accessed before certain actions become possible.

This World State echoes Apple's research on Ferret-UI and other work such as WebVoyager, where there is a world the agent needs to interact with. Currently this world consists of surfaces or screens, and the agent needs to navigate browser windows, mobile phone operating systems and more.

Milestones are key points that must be executed in order to fulfill the user intent. They can also be seen as potential points of failure, should a milestone prove impossible to execute.

In the example in the image above, the User intent is to send a message, while cellular service is turned off.

The Agent should first understand the User’s intent and prompt the User for the necessary arguments. After collecting all arguments with the help of the search_contacts tool, the Agent attempts to send the message, works out on failure that it needs to enable cellular service, and retries.

To evaluate this trajectory, we find the best match for all Milestones against Message Bus and World State in each turn while maintaining topological order.
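To make the matching idea concrete, here is a minimal sketch in Python. It is not Apple's implementation. It assumes the Milestones form a simple linear chain (a special case of a topological order), that each Milestone is a hypothetical function returning a similarity between 0 and 1 for a single turn, and that each turn can satisfy at most one Milestone. The tool names send_message and set_cellular_service are illustrative; only search_contacts comes from the example above.

```python
from typing import Callable, Dict, List

Turn = Dict[str, object]              # simplified snapshot of one turn
Milestone = Callable[[Turn], float]   # similarity in [0, 1] for that turn


def score_trajectory(milestones: List[Milestone], turns: List[Turn]) -> float:
    """Best average similarity when milestones must be matched in order."""
    m, n = len(milestones), len(turns)
    if m == 0:
        return 1.0
    # best[i][j]: best total for the first i milestones over the first j turns
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = best[i - 1][j - 1] + milestones[i - 1](turns[j - 1])
            # Either skip this turn, leave this milestone unmatched, or match.
            best[i][j] = max(best[i][j - 1], best[i - 1][j], match)
    return best[m][n] / m


# Hypothetical milestones for the "send a message" example.
def searched_contacts(turn: Turn) -> float:
    return 1.0 if turn.get("tool") == "search_contacts" else 0.0


def cellular_enabled(turn: Turn) -> float:
    ok = turn.get("tool") == "set_cellular_service" and turn.get("cellular") is True
    return 1.0 if ok else 0.0


def message_sent(turn: Turn) -> float:
    ok = turn.get("tool") == "send_message" and turn.get("ok") is True
    return 1.0 if ok else 0.0


MILESTONES = [searched_contacts, cellular_enabled, message_sent]

full_trajectory = [
    {"tool": "search_contacts"},
    {"tool": "send_message", "ok": False},           # fails: no cellular service
    {"tool": "set_cellular_service", "cellular": True},
    {"tool": "send_message", "ok": True},            # retry succeeds
]
full_score = score_trajectory(MILESTONES, full_trajectory)
```

A trajectory that stops after the failed send would only match the first Milestone, so the intermediate signal is still visible in the score rather than collapsing into a plain pass or fail.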

This is an excellent example of how, for an Agent to be truly autonomous, it needs to be in control of its environment.

Despite the paradigm shift towards a more simplified problem formulation, the stateful, conversational and interactive nature of task oriented dialog remains, and poses a significant challenge for systematic and accurate evaluation of tool-using LLMs.

Stateful

Apple sees state as not only the conversational dialog turns or dialog state, but also the state of the environment in which the agents live.

This includes implicit state dependencies between stateful tools, allowing the agent to track and alter the world state based on its world or common-sense knowledge, which is implicit from the user query.
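An implicit state dependency between stateful tools can be sketched roughly as follows. This is an illustration under my own assumptions, not code from the study: the WorldState fields, the tool names send_message and set_cellular_service, and the recovery helper are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorldState:
    """Hypothetical environment state shared by all tools."""
    cellular_enabled: bool = False
    sent_messages: List[str] = field(default_factory=list)


class ToolError(Exception):
    """Raised when a tool cannot run in the current world state."""


def set_cellular_service(world: WorldState, enabled: bool) -> None:
    world.cellular_enabled = enabled


def send_message(world: WorldState, recipient: str, text: str) -> None:
    # Implicit dependency: sending only works while cellular service is on.
    if not world.cellular_enabled:
        raise ToolError("cellular service is off")
    world.sent_messages.append(f"{recipient}: {text}")


def send_with_recovery(world: WorldState, recipient: str, text: str) -> List[str]:
    """Try to send; on failure, alter the world state and retry once."""
    calls = ["send_message"]
    try:
        send_message(world, recipient, text)
    except ToolError:
        calls.append("set_cellular_service")
        set_cellular_service(world, True)
        calls.append("send_message")
        send_message(world, recipient, text)
    return calls


world = WorldState()
calls = send_with_recovery(world, "Alice", "Running late")
```

Nothing in the user's request mentions cellular service; the dependency only surfaces when the tool fails, which is what makes it implicit.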

Something else I find interesting in this study is the notion of a Knowledge Boundary, which informs the user simulator what it should and should not know, providing partial access to the expected result and combating hallucination. This is analogous to in-domain and out-of-domain questions.

The study also introduces Milestones and Minefields, which define key events that must or must not happen in a trajectory. These allow any trajectory to be evaluated with rich intermediate and final execution signals.
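As a rough sketch of how a Minefield could interact with a Milestone score, assume that hitting any Minefield sets the trajectory's score to zero. That rule, the turn format and the wrong_recipient check below are my own simplifying assumptions for illustration, not details confirmed here.

```python
from typing import Callable, Dict, List

Turn = Dict[str, object]            # simplified snapshot of one turn
Minefield = Callable[[Turn], bool]  # True if the forbidden event happened


def apply_minefields(
    milestone_score: float, minefields: List[Minefield], turns: List[Turn]
) -> float:
    """Return 0.0 if any turn triggers a minefield, else the milestone score."""
    for turn in turns:
        if any(is_hit(turn) for is_hit in minefields):
            return 0.0
    return milestone_score


# Hypothetical minefield: the message must not go to anyone but Alice.
def wrong_recipient(turn: Turn) -> bool:
    return turn.get("tool") == "send_message" and turn.get("recipient") != "Alice"


safe_turns = [
    {"tool": "search_contacts"},
    {"tool": "send_message", "recipient": "Alice"},
]
unsafe_turns = [
    {"tool": "send_message", "recipient": "Bob"},
]

safe_score = apply_minefields(1.0, [wrong_recipient], safe_turns)
unsafe_score = apply_minefields(1.0, [wrong_recipient], unsafe_turns)
```

The design point is that Milestones reward what should happen, while Minefields veto a trajectory that does something it must never do, however many Milestones it reached.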

For the conversational user interface, there are two scenarios defined…

Single / Multiple Tool Call

The first scenario is one where there is a single conversation or dialog/user turn, with multiple tool-calling procedures in the background.

Hence the user issues a single request that is not demanding from an NLU or dialog state management perspective, but demands heavy lifting in the background.

Single / Multiple User Turn

In other scenarios there might only be a single tool call event or milestone, but multiple dialog turns are required to establish the user intent, disambiguate where necessary, collect relevant and required information from the user, etc.
