

As I have mentioned before, the architecture and implementation of text-based AI Agents (Agentic Applications) are converging on much the same principles.

The ๐˜ฏ๐˜ฆ๐˜น๐˜ต ๐˜ค๐˜ฉ๐˜ข๐˜ฑ๐˜ต๐˜ฆ๐˜ณ for ๐—”๐—œ ๐—”๐—ด๐—ฒ๐—ป๐˜๐˜€ is emergingโ€ฆ

And that is… AI agents capable of navigating mobile or browser screens, typically using bounding boxes to identify screen elements.

In other words, agents designed to interact with user interfaces (UIs) much as a human would.
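
To make the bounding-box idea concrete, here is a minimal sketch of how a screen element can be reduced to a labelled box that a language model can reason over as text. The field names and format are my own illustration, not taken from any specific paper:

```python
from dataclasses import dataclass

@dataclass
class ScreenElement:
    """A UI element reduced to a labelled bounding box."""
    label: int                          # numeric tag the agent refers to in its actions
    role: str                           # e.g. "button", "text_field"
    text: str                           # visible text or accessibility description
    bbox: tuple[int, int, int, int]     # (x_min, y_min, x_max, y_max) in screen pixels

    def to_prompt_line(self) -> str:
        # One line per element, so a whole screen becomes a short textual
        # list the model can read and act on ("tap element 3", etc.).
        x0, y0, x1, y1 = self.bbox
        return f"[{self.label}] {self.role} '{self.text}' at ({x0},{y0})-({x1},{y1})"

# Example: the send button on a messaging screen.
send = ScreenElement(label=3, role="button", text="Send", bbox=(920, 1610, 1040, 1700))
print(send.to_prompt_line())  # [3] button 'Send' at (920,1610)-(1040,1700)
```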

Yet another recent study shows how a mobile device can be navigated across multiple apps.

Its exploration module gathers element information through agent-driven or manual exploration and compiles it into a document.
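
A hedged sketch of what such an exploration module might look like; the structure, file name and function names below are my assumptions, not the paper's actual code. Each explored element is written into a document keyed by screen and element, so later runs can look a description up instead of re-exploring:

```python
import json
from pathlib import Path

def record_exploration(doc_path: Path, screen_id: str, element_id: str, observation: str) -> None:
    """Store what the agent learned about one element on one screen."""
    doc = json.loads(doc_path.read_text()) if doc_path.exists() else {}
    doc.setdefault(screen_id, {})[element_id] = observation
    doc_path.write_text(json.dumps(doc, indent=2))

# Agent-driven exploration: after tapping an element and observing the effect,
# the agent summarises it and saves the summary to the shared document.
record_exploration(
    Path("element_docs.json"),  # hypothetical document location
    screen_id="messages_home",
    element_id="compose_button",
    observation="Opens a new-message screen with a recipient field and text box.",
)
```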

During the deployment phase, the RAG (Retrieval-Augmented Generation) system retrieves and updates this document in real time, enabling swift task execution.
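At deployment time the flow is presumably the inverse: retrieve the stored descriptions of whatever elements are currently on screen and inject them into the agent's prompt. A minimal sketch, reusing the hypothetical document format from above:

```python
import json
from pathlib import Path

def retrieve_element_docs(doc_path: Path, screen_id: str, visible_elements: list[str]) -> str:
    """Pull stored notes for the elements currently on screen, for the agent's prompt."""
    doc = json.loads(doc_path.read_text()) if doc_path.exists() else {}
    screen_doc = doc.get(screen_id, {})
    lines = [f"- {el}: {screen_doc[el]}" for el in visible_elements if el in screen_doc]
    return "\n".join(lines) if lines else "(no prior knowledge of this screen)"

context = retrieve_element_docs(Path("element_docs.json"), "messages_home", ["compose_button"])
print(context)  # - compose_button: Opens a new-message screen with ...
```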

The bottom of the graphic below shows a cross-app task being executed. I still need to dig into this study, but I remain very impressed with Ferret-UI from Apple and WebVoyager (the LangChain implementation).

Considering the images below, AppAgent V2 follows the well-known, well-defined agent loop of observation, thought, action and summary.
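
In rough Python, that loop looks something like the following. The helper functions are stubs standing in for whatever model calls and device controls an actual implementation would use; none of this is AppAgent V2's own code:

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    argument: str = ""

# Stubs standing in for the real model calls and device controls.
def capture_screen() -> str:
    return "[3] button 'Send' at (920,1610)-(1040,1700)"   # labelled elements, as text

def decide(task: str, observation: str, history: list[str]) -> tuple[str, Action]:
    # A real agent would prompt an LLM here; this stub simply finishes.
    return ("Task appears complete.", Action("finish"))

def execute(action: Action) -> None:
    print(f"executing: {action.name} {action.argument}")

def run_agent(task: str, max_steps: int = 20) -> list[str]:
    """Observation -> thought -> action -> summary, repeated until done."""
    history: list[str] = []
    for _ in range(max_steps):
        observation = capture_screen()                        # observation
        thought, action = decide(task, observation, history)  # thought + chosen action
        execute(action)                                       # action
        history.append(f"{thought} -> {action.name}")         # summary of the step
        if action.name == "finish":
            break
    return history

print(run_agent("Send 'hello' to Alice"))
```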



