As I have mentioned, the architecture and implementation of text-based 𝗔𝗜 𝗔𝗴𝗲𝗻𝘁𝘀 (𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗔𝗽𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀) are converging on very much the same principles.
The 𝘯𝘦𝘹𝘵 𝘤𝘩𝘢𝘱𝘵𝘦𝘳 for 𝗔𝗜 𝗔𝗴𝗲𝗻𝘁𝘀 is emerging…
And that is… 𝗔𝗜 𝗮𝗴𝗲𝗻𝘁𝘀 which are capable of navigating mobile or browser screens, particularly using bounding boxes to define screen elements.
Hence, agents designed to interact with user interfaces (UIs) in much the way a human would.
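To make that concrete, here is a minimal sketch (my own illustration, not code from any of the studies mentioned) of how a screen can be handed to an agent as labelled bounding boxes, with a tap action resolved to the centre of the chosen box:

```python
from dataclasses import dataclass

@dataclass
class ScreenElement:
    """A UI element as the agent sees it: a label plus a bounding box."""
    element_id: int
    description: str                  # e.g. "Search field"
    bbox: tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels

def tap_target(element: ScreenElement) -> tuple[int, int]:
    """Resolve a 'tap this element' action to the centre of its bounding box."""
    x_min, y_min, x_max, y_max = element.bbox
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)

# Hypothetical parse of one screen: the agent reasons over the labels,
# and an executor converts the chosen element into a concrete tap.
screen = [
    ScreenElement(1, "Search field", (40, 120, 680, 190)),
    ScreenElement(2, "Settings icon", (620, 40, 700, 110)),
]
print(tap_target(screen[0]))  # -> (360, 155)
```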
Yet another recent study, AppAgent V2, shows how a mobile device can be navigated across multiple apps.
The exploration module gathers element information through agent-driven or manual exploration, compiling it into a document.
During the deployment phase, the RAG (Retrieval-Augmented Generation) system retrieves and updates this document in real time, enabling swift task execution.
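As a rough sketch of that two-phase flow (the keyword-overlap retrieval below is a stand-in for the embedding-based retriever a real RAG system would use, and all names here are my own assumptions, not AppAgent V2's actual API):

```python
# Exploration writes element notes into a document store; deployment
# retrieves the most relevant notes for the current task and can keep
# updating them after each interaction.
doc_store: dict[str, str] = {}  # element key -> accumulated description

def record_element(key: str, note: str) -> None:
    """Exploration phase: append what was learned about a UI element."""
    doc_store[key] = (doc_store.get(key, "") + " " + note).strip()

def retrieve(task: str, k: int = 2) -> list[str]:
    """Deployment phase: naive keyword-overlap scoring standing in
    for a proper embedding-based retriever."""
    words = set(task.lower().split())
    scored = sorted(
        doc_store.items(),
        key=lambda kv: len(words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [f"{key}: {text}" for key, text in scored[:k]]

record_element("gmail.compose_button", "opens a new email draft")
record_element("maps.search_bar", "accepts a destination address")
print(retrieve("compose and send an email"))
```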
The bottom of the graphic below shows a cross-app task being executed. I still need to dig into this study, but I remain very impressed with Ferret-UI from Apple and WebVoyager (the LangChain implementation).
Considering the images below, AppAgent V2 follows a well-known, well-defined agent approach of observation, thought, action and summary.
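That loop is straightforward to express. Below is a schematic sketch of the observation → thought → action → summary cycle; the model call and device hooks are trivial stand-ins of my own so the control flow runs as-is:

```python
# Schematic observation -> thought -> action -> summary loop.
# call_llm, capture_screen and execute are placeholder stubs, not the
# paper's interfaces; a real agent would wire in a model and a device.

def call_llm(prompt: str) -> str:
    return "FINISH"  # stand-in for a real model call

def capture_screen() -> str:
    return "[1] Search field  [2] Settings icon"  # stand-in screen parse

def execute(action: str) -> None:
    print(f"executing: {action}")  # stand-in for tap / type / swipe

def run_agent(task: str, max_steps: int = 10) -> str:
    history = ""  # running summary carried across steps
    for _ in range(max_steps):
        observation = capture_screen()      # observation
        thought = call_llm(                 # thought
            f"Task: {task}\nSo far: {history}\nScreen: {observation}\n"
            "Reason about the next step."
        )
        action = call_llm(f"Thought: {thought}\nPick one action or FINISH.")
        if action.strip() == "FINISH":      # task judged complete
            break
        execute(action)                     # action
        history = call_llm(                 # summary: compress progress
            f"Previous: {history}\nDid: {action}\nSummarise progress."
        )
    return history

run_agent("open settings")
```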