These symbols can represent spatial realities (for example, a chair next to a table), allowing the model to reason logically about tasks and answer questions based on these spatial observations. This helps AI make decisions or plan actions in real-world contexts.
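To make this concrete, here is a minimal Python sketch of how spatial observations could be stored as symbolic triples and queried. The objects, relations, and helper functions are illustrative assumptions, not part of any specific system:

```python
# A minimal sketch: spatial relations as (subject, relation, object) triples,
# queried to answer simple questions about a scene. All names are illustrative.
facts = {
    ("chair", "next_to", "table"),
    ("cup", "on", "table"),
    ("lamp", "above", "table"),
}

def holds(subject: str, relation: str, obj: str) -> bool:
    """Check whether a spatial relation was observed in the scene."""
    return (subject, relation, obj) in facts

def neighbours(obj: str) -> set[str]:
    """Everything observed next to the given object, in either direction."""
    return {s for (s, r, o) in facts if r == "next_to" and o == obj} | \
           {o for (s, r, o) in facts if r == "next_to" and s == obj}

print(holds("cup", "on", "table"))   # True
print(neighbours("table"))           # {'chair'}
```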
Human reasoning can be understood as a cooperation between the intuitive & associative, and the deliberative & logical. ~ Source
As the image below illustrates, conversational AI systems have traditionally followed System 2 approaches, characterised by deliberate and logical reasoning.
These systems relied on intent detection and structured flows to determine action sequences. With the rise of Generative AI and Large Language Models (LLMs), there’s a shift toward System 1 solutions, which are more intuitive and associative.
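As a rough illustration of that traditional System 2 pipeline, the sketch below pairs a toy intent detector with hand-built flows. The intent names, keywords, and action sequences are all invented for this example:

```python
# A simplified sketch of the intent-and-flow pipeline: a hand-built catalogue
# maps detected intents to fixed action sequences. Everything here is invented
# for illustration.
INTENT_FLOWS = {
    "book_flight": ["collect_destination", "collect_date", "confirm_booking"],
    "check_balance": ["authenticate_user", "fetch_balance", "read_balance"],
}

def detect_intent(utterance: str) -> str:
    """Toy keyword matcher standing in for a trained intent classifier."""
    if "flight" in utterance.lower():
        return "book_flight"
    if "balance" in utterance.lower():
        return "check_balance"
    return "fallback"

def next_actions(utterance: str) -> list[str]:
    return INTENT_FLOWS.get(detect_intent(utterance), ["hand_off_to_agent"])

print(next_actions("I want to book a flight to Lisbon"))
# ['collect_destination', 'collect_date', 'confirm_booking']
```

The rigidity of this design is exactly what the System 1 shift relaxes: instead of every path being enumerated in advance, an LLM associates the utterance with a response directly.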
A possible approach to activity reasoning is to build a symbolic system of symbols and rules that connects visual elements in a way that mimics human reasoning.
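One way to picture such a system is as a set of rules that fire when all of their premise symbols have been observed in a scene. The symbols and rules in this sketch are invented examples rather than any published vocabulary:

```python
# A sketch of rule-based activity reasoning: a rule fires when every one of
# its premise symbols was observed, yielding an activity label.
observed = {"person", "cup", "hand_near_mouth"}

rules = [
    ({"person", "cup", "hand_near_mouth"}, "drinking"),
    ({"person", "book", "eyes_on_book"}, "reading"),
]

def infer_activities(symbols: set[str]) -> list[str]:
    # A rule fires only if all its premises are a subset of what was seen.
    return [activity for premises, activity in rules if premises <= symbols]

print(infer_activities(observed))  # ['drinking']
```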
Previous attempts, though useful, faced challenges due to handcrafted symbols and limited rules derived from visual annotations. This limited their ability to generalise to complex activities.
To address these issues, a new symbolic system is proposed with two key properties: broad-coverage symbols and rational rules. Instead of relying on expensive manual annotations, LLMs are leveraged to approximate these properties.
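The sketch below hints at how an LLM could be asked to propose symbols and rules instead of hand-annotating them. Here `call_llm` is a stub standing in for whatever chat-completion client is used, and the prompt and JSON output format are assumptions, not the paper's actual setup:

```python
import json

def call_llm(prompt: str) -> str:
    """Stub for a real chat-completion call; returns a canned reply so the
    sketch runs end to end. Swap in your own LLM client here."""
    return '[{"premises": ["person", "cup", "hand_near_mouth"], "activity": "drinking"}]'

def propose_rules(activity: str) -> list[dict]:
    # Ask the LLM which visual symbols indicate the activity, as structured JSON.
    prompt = (
        f"List the visual symbols (objects, body parts, relations) that "
        f"indicate the activity '{activity}'. Answer as JSON: "
        '[{"premises": [...], "activity": "..."}]'
    )
    return json.loads(call_llm(prompt))

print(propose_rules("drinking"))
```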
Given an image, symbols are extracted from visual content, and fuzzy logic is applied to deduce activity semantics based on rules, enhancing reasoning capabilities.
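A simple way to realise this fuzzy deduction is to give each detected symbol a confidence in [0, 1] and score each rule by the minimum confidence of its premises, a common choice of fuzzy AND. The confidence values below are made up for illustration:

```python
# Fuzzy deduction over detected symbols: each symbol carries a confidence,
# each rule scores as the minimum of its premises, best-supported activity wins.
symbol_conf = {"person": 0.95, "cup": 0.80, "hand_near_mouth": 0.60, "book": 0.10}

fuzzy_rules = [
    (["person", "cup", "hand_near_mouth"], "drinking"),
    (["person", "book"], "reading"),
]

def rule_strength(premises: list[str]) -> float:
    # min acts as fuzzy AND: a rule is only as strong as its weakest premise
    return min(symbol_conf.get(p, 0.0) for p in premises)

scores = {activity: rule_strength(premises) for premises, activity in fuzzy_rules}
print(scores)                       # {'drinking': 0.6, 'reading': 0.1}
print(max(scores, key=scores.get))  # drinking
```

Taking the minimum means a rule is only as strong as its weakest piece of evidence, which matches the intuition that one missing visual cue should suppress an activity hypothesis rather than be averaged away.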
This shift exemplifies how intuitive, associative reasoning enabled by LLMs is pushing the boundaries of AI agent systems in tasks like activity recognition.
With just a quick glance at an image, we humans naturally translate visual inputs into symbols or concepts.
This allows us to use common-sense reasoning to understand and imagine the broader context beyond the visible scene — similar to how we infer the existence of gravity without directly seeing it.