Recent studies have explored the construction of text-based web browsing environments and how to instruct large language model agents to perform web navigation.
A more recent line of work focuses on building multimodal web agents that leverage the environment rendered by the browser through screenshots, thus mimicking human web browsing behaviour.
WebVoyager is a multimodal web agent designed to autonomously accomplish web tasks online end-to-end, without any intermediate human intervention.
WebVoyager processes the user query by making observations from screenshots and from the textual content of interactive web elements, formulates a thought about which action to take (clicking, typing, scrolling, and so on), and then executes that action on the website.
The sequence of events the agent follows, based on annotated screenshots captured during web navigation, is shown below.
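To make the observe-think-act cycle concrete, here is a minimal sketch that captures a screenshot with Selenium and asks a vision-capable model for the next thought and action. This is not WebVoyager's actual code: the system prompt, the model name, and the action vocabulary in the prompt are illustrative assumptions, and executing the returned action is deferred to the dispatcher sketch later in this article.

```python
# A minimal observe-think sketch, assuming the OpenAI Python client and Selenium.
# Prompt wording, model name, and action tokens are illustrative placeholders.
from openai import OpenAI
from selenium import webdriver

client = OpenAI()            # assumes OPENAI_API_KEY is set in the environment
driver = webdriver.Chrome()

SYSTEM_PROMPT = (
    "You are a web browsing agent. Given a task and an annotated screenshot, "
    "reply with a Thought and a single Action such as Click [N], Type [N]; text, "
    "Scroll up/down, Wait, GoBack, Google, or ANSWER; <answer>."
)

def observe() -> str:
    """Capture the current page as a base64-encoded PNG screenshot."""
    return driver.get_screenshot_as_base64()

def think(task: str, screenshot_b64: str) -> str:
    """Send the task plus the screenshot to a vision-capable chat model."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model; illustrative choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": f"Task: {task}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content

def step(task: str) -> str:
    """One observe-think step; the reply contains the thought and the chosen action."""
    return think(task, observe())
```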
Similar to how humans browse the web, this agent uses visual information from the web (screenshots) as its primary input.
This approach allows the agent to bypass the complexity of processing HTML DOM trees or accessibility trees, which can produce overly verbose text and hinder the agent’s decision-making.
Very similar to the approach Apple took with Ferret-UI, the researchers overlay bounding boxes on the interactive elements of the websites to better guide the agent’s action prediction.
This method does not require an object detection module; instead it uses GPT-4V-ACT, a JavaScript tool that extracts interactive elements based on web element types and overlays bounding boxes with numerical labels on the respective regions. GPT-4V-ACT is efficient because it is rule-based and does not rely on any object detection model.
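A rough approximation of that rule-based labelling idea can be written in a few lines of Python and Selenium. This is not the GPT-4V-ACT tool itself, which is a standalone JavaScript script; the CSS selector, styling, and helper name below are assumptions chosen for illustration.

```python
# A hedged sketch of rule-based labelling: collect common interactive element
# types and draw a numbered box over each one via injected JavaScript.
from selenium import webdriver
from selenium.webdriver.common.by import By

# Common interactive element types; the real tool's rules may differ.
INTERACTIVE_SELECTOR = "a, button, input, textarea, select, [role='button']"

# JavaScript injected per element to draw a numbered, click-transparent box.
DRAW_BOX_JS = """
const el = arguments[0], label = arguments[1];
const r = el.getBoundingClientRect();
const box = document.createElement('div');
box.style.cssText = `position:fixed; left:${r.left}px; top:${r.top}px;` +
                    `width:${r.width}px; height:${r.height}px;` +
                    `border:2px solid red; z-index:2147483647; pointer-events:none;`;
const tag = document.createElement('span');
tag.textContent = label;
tag.style.cssText = 'position:absolute; top:-1.4em; left:0; color:red;' +
                    'background:white; font:12px sans-serif;';
box.appendChild(tag);
document.body.appendChild(box);
"""

def label_interactive_elements(driver: webdriver.Chrome) -> dict[int, object]:
    """Overlay numbered boxes on visible interactive elements.

    Returns a mapping from numerical label to Selenium element, so a later
    action such as Click [7] can be resolved back to the element it refers to.
    """
    elements = driver.find_elements(By.CSS_SELECTOR, INTERACTIVE_SELECTOR)
    labelled = {}
    for i, el in enumerate(e for e in elements if e.is_displayed()):
        driver.execute_script(DRAW_BOX_JS, el, str(i))
        labelled[i] = el
    return labelled
```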
The action space for WebVoyager is designed to closely mimic human web browsing behaviour. This is achieved by implementing the most commonly used mouse and keyboard actions, enabling the agent to navigate effectively.
Because the interactive elements in each screenshot carry numerical labels, the agent can respond in a concise action format that precisely identifies the target element and the action to execute; a dispatcher sketch follows the list below.
The primary actions include:
1. Click: Clicking on a webpage element, such as a link or button.
2. Input: Selecting a text box, clearing any existing content, and entering new content.
3. Scroll: Moving the webpage vertically.
4. Wait: Pausing to allow webpages to load.
5. Back: Returning to the previous page.
6. Jump to Search Engine: Redirecting to a search engine when stuck on a website without finding an answer.
7. Answer: Concluding the iteration by providing an answer that meets the task requirements.
These actions enable the agent to interact with web pages efficiently, simulating a human-like browsing experience.
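As a companion to the loop sketch earlier, here is one way the agent's reply could be parsed and dispatched to Selenium calls that realise the actions above. The action strings, the regular expressions, and the helper names are illustrative assumptions rather than WebVoyager's implementation.

```python
# A hedged sketch of parsing the model's reply and executing the chosen action.
import re
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def execute(reply: str, driver: webdriver.Chrome, labelled: dict) -> str | None:
    """Parse the 'Action: ...' part of the model's reply and perform it.

    `labelled` maps numerical labels to Selenium elements (see the labelling
    sketch above). Returns the final answer string if the agent concluded.
    """
    match = re.search(r"Action:\s*(.+)", reply, re.S)
    if match is None:
        return None
    action = match.group(1).strip()

    if m := re.match(r"Click \[(\d+)\]", action):
        labelled[int(m.group(1))].click()
    elif m := re.match(r"Type \[(\d+)\];\s*(.+)", action):
        box = labelled[int(m.group(1))]
        box.clear()                              # clear existing content first
        box.send_keys(m.group(2), Keys.ENTER)    # then type the new content
    elif action.startswith("Scroll"):
        offset = -600 if "up" in action.lower() else 600
        driver.execute_script("window.scrollBy(0, arguments[0]);", offset)
    elif action.startswith("Wait"):
        time.sleep(5)                            # give the page time to load
    elif action.startswith("GoBack"):
        driver.back()                            # return to the previous page
    elif action.startswith("Google"):
        driver.get("https://www.google.com")     # jump to a search engine
    elif action.startswith("ANSWER"):
        return action                            # the answer ends the iteration
    return None
```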