Flow Engineering
Prompt Engineering alone was not enough; prompts needed to be re-usable. Hence prompt templates were introduced, where key data fields are populated at inference time. This was followed by prompts being chained together to create longer flows and more complex applications.
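To make the idea concrete, here is a minimal sketch of a prompt template whose data fields are filled at inference; the field names and wording are illustrative, not taken from any particular framework.

```python
# A minimal prompt template: placeholder fields are populated with
# runtime data just before the model call. Names are illustrative.
TEMPLATE = (
    "You are a support assistant for {product}.\n"
    "Customer question: {question}\n"
    "Answer in {language}, citing the relevant policy."
)

def render_prompt(product: str, question: str, language: str = "English") -> str:
    """Fill the template's data fields at inference time."""
    return TEMPLATE.format(product=product, question=question, language=language)

prompt = render_prompt("Acme Router X2", "How do I reset the device?")
print(prompt)
```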
Chaining was then supplemented with highly contextual information at inference, giving rise to an approach that leverages In-Context Learning (ICL) via Retrieval-Augmented Generation (RAG).
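The toy sketch below shows the RAG pattern: retrieve the passages most relevant to a query and place them in the prompt as in-context grounding. The word-overlap scoring is a stand-in for the vector similarity search a real system would use, and the documents are invented for illustration.

```python
# A toy illustration of RAG: retrieve the most relevant passages and
# inject them into the prompt as context. Real systems score relevance
# with vector embeddings rather than naive word overlap.
DOCUMENTS = [
    "The X2 router is reset by holding the rear button for ten seconds.",
    "Firmware updates for the X2 are published every quarter.",
    "The X2 ships with a twelve month limited warranty.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query; return the top k."""
    query_words = set(query.lower().split())
    scored = sorted(
        DOCUMENTS,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str) -> str:
    """Place retrieved passages in the prompt as in-context grounding."""
    context = "\n".join(f"- {c}" for c in retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_prompt("How do I reset the X2 router?"))
```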
The next step in this evolution is Agentic Applications (AI Agents), where a certain level of agency (autonomy) is given to the application. LlamaIndex combined advanced RAG capabilities with an agent approach, coining the term Agentic RAG.
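What follows is a sketch of the Agentic RAG pattern using LlamaIndex-style components. Exact import paths and class names vary across LlamaIndex versions, so treat this as an illustration of the pattern rather than a pinned, working recipe.

```python
# Agentic RAG sketch: retrieval is exposed as a tool, and an agent
# decides when (and how often) to call it, rather than the pipeline
# retrieving exactly once per query. Imports assume a recent LlamaIndex.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Build a retrieval index over local documents.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Expose the RAG query engine as a tool the agent may choose to invoke.
rag_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="docs_search",
    description="Answers questions about the local document set.",
)

# The ReAct agent plans, calls the retrieval tool as needed, and
# synthesises a final answer; this is what gives the RAG flow agency.
agent = ReActAgent.from_tools([rag_tool], verbose=True)
response = agent.chat("Summarise the warranty terms in the documents.")
print(response)
```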
For Agentic Applications to have an increased level of agency, more modalities need to be introduced. MindSearch can explore the web via a text interface, while OmniParser, Ferret-UI and WebVoyager enable agentic applications to interpret a graphical interface and navigate the GUI.
The image above, from Microsoft, shows OmniParser, which follows an approach similar to Apple's Ferret-UI and to WebVoyager. Screen elements are detected, mapped with bounding boxes and named. From here a natural language layer can be created between a UI and any conversational AI system.
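Below is a minimal sketch of that natural language layer, assuming a parser has already produced named elements with bounding boxes. The field names are illustrative and are not OmniParser's actual output schema.

```python
# Sketch of the "natural language layer": detected screen elements
# (names plus bounding boxes, as OmniParser-style parsers emit) are
# serialised into text that a conversational AI system can reason over.
from dataclasses import dataclass

@dataclass
class ScreenElement:
    name: str                       # semantic label, e.g. "Submit"
    kind: str                       # element type: button, text field...
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels

def describe_screen(elements: list[ScreenElement]) -> str:
    """Render parsed UI elements as a textual observation for the model."""
    lines = [
        f"[{i}] {e.kind} '{e.name}' at {e.box}"
        for i, e in enumerate(elements)
    ]
    return "Visible UI elements:\n" + "\n".join(lines)

screen = [
    ScreenElement("Search", "text field", (120, 40, 680, 80)),
    ScreenElement("Submit", "button", (700, 40, 780, 80)),
]
print(describe_screen(screen))
```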
MindSearch is premised on the problem that complex requests often cannot be accurately and completely retrieved by a search engine in a single query.
The information that needs to be integrated to answer a question is typically spread over multiple web pages, interleaved with significant noise.
In addition, a large number of web pages with long contents can quickly exceed the maximum context length of an LLM.
The WebPlanner models the human mind's multi-step information seeking as a dynamic graph construction process.
It decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search results from WebSearcher, using either GPT-4o or InternLM2.5-7B as the underlying model.
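The sketch below is a schematic of this loop, not MindSearch's actual code: the decomposition and search steps are stubbed where the LLM (GPT-4o or InternLM2.5-7B) and WebSearcher would sit, and the graph grows as answers arrive.

```python
# Schematic of WebPlanner's dynamic graph construction: the query is
# decomposed into atomic sub-questions (nodes), each is answered via
# search, and follow-up nodes can be appended as results come in.
import networkx as nx

def decompose(question: str) -> list[str]:
    # Placeholder: an LLM would generate these atomic sub-questions.
    return [f"{question} - background", f"{question} - latest findings"]

def web_search(sub_question: str) -> str:
    # Placeholder: WebSearcher would query the web and summarise pages.
    return f"summary for: {sub_question}"

graph = nx.DiGraph()
root = "How do agentic RAG systems handle multi-hop questions?"
graph.add_node(root, answer=None)

frontier = []
for sq in decompose(root):
    graph.add_node(sq, answer=None)
    graph.add_edge(root, sq)
    frontier.append(sq)

# Progressively answer sub-questions; newly discovered follow-ups could
# be appended to the frontier here, dynamically extending the graph.
while frontier:
    node = frontier.pop(0)
    graph.nodes[node]["answer"] = web_search(node)

print(nx.get_node_attributes(graph, "answer"))
```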