
LangChain Just Launched LangGraph Cloud | by Cobus Greyling | Jul, 2024


LangGraph is a fairly recent addition to the ever-expanding LangChain ecosystem. With the launch of LangGraph Cloud, a managed, hosted service is now available for deploying and hosting LangGraph applications.

The LangChain ecosystem is unfolding at a rapid pace, combining Open Source Software (OSS) with commercial software. The commercial software includes LangSmith and LangGraph Cloud.


We are all starting to realise that Agentic Applications will become a standard in the near future. The advantages of Agents are numerous; to name a few:

  1. Agents can handle complex, ambiguous and more implicit user queries in an automated fashion.
  2. Underpinning agents is the capability to create a chain of events on the fly based on the task assigned by the user.
  3. Agents make use of an LLM which acts as the backbone of the agent.
  4. When the agent receives a user query, the agent decomposes the task into sub-tasks, which are then executed in a sequential fashion.
  5. One or more tools are made available to the Agent, which can be employed as the agent deems fit. The agent decides which tool to use based on a tool description that forms part of each tool (see the sketch after this list).
  6. A tool is a unit of capability which includes tasks like web search, mathematics, API calls and more.
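To make the tool abstraction concrete, here is a minimal sketch of a tool declaration, assuming the langchain_core package is installed; the web_search function and its body are hypothetical placeholders. The docstring doubles as the tool description the agent uses to decide when to call it.

# A minimal, hypothetical tool definition (sketch, not an official example).
from langchain_core.tools import tool

@tool
def web_search(query: str) -> str:
    """Search the web and return a short summary of the top results."""
    # Placeholder: a real implementation would call a search API here.
    return f"Top results for: {query}"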

Impediments and apprehension to Agent adoption have included:

  1. LLM inference cost. The backbone LLM is queried multiple times in the course of a single query; should an agent have a large number of users, inference costs can skyrocket.
  2. Controllability, inspectability, observability and more granular control are much needed. There is a fear in the market that agents are too autonomous.
  3. Agents broke the glass ceiling of chatbots, but by a little too much, and some measure of control is now required.
  4. For more complex agents, decreasing latency requires running tasks in parallel and streaming not only LLM responses but also agent responses as they become available.

LangGraph is framework-agnostic, with each node functioning as a regular Python function.

It extends the core Runnable API (a shared interface for streaming, async, and batch calls) to facilitate:

  1. Seamless state management across multiple conversation turns or tool calls
  2. Flexible routing between nodes based on dynamic criteria
  3. Smooth transitions between LLMs and human intervention
  4. Persistence for long-running, multi-session applications
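To make this concrete, below is a minimal sketch of a two-node LangGraph application, assuming the langgraph package is installed; the node names, state fields, and placeholder logic are illustrative, not taken from the announcement.

# A minimal, illustrative LangGraph graph: state flows through two plain
# Python functions, and the compiled graph exposes the Runnable API
# (invoke, stream, batch).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    answer: str

def call_model(state: State) -> dict:
    # Placeholder: a real node would invoke an LLM here.
    return {"answer": f"Draft answer to: {state['question']}"}

def review(state: State) -> dict:
    # Placeholder: a human-in-the-loop interrupt could be attached here.
    return {}

graph = StateGraph(State)
graph.add_node("model", call_model)
graph.add_node("review", review)
graph.set_entry_point("model")
graph.add_edge("model", "review")
graph.add_edge("review", END)
app = graph.compile()
print(app.invoke({"question": "What is LangGraph Cloud?", "answer": ""}))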

The basic personal workflow is shown below: a user develops their LangGraph application within their IDE of choice, then pushes the code to GitHub.

LangGraph Cloud can then access the GitHub repository and deploy the application. From LangGraph Cloud, applications can be tested, traces can be run, interrupts can be added, and more.





Examining Public and Private Control of Media Organs in Hungary and Italy – European Law Blog


Blogpost 33/2024

The state of media pluralism around the world stands at one of its most transformative points in modern history. The development of new technologies and the impact of social media platforms have radically reshaped society. Governments around the world have responded in kind. According to Freedom House, governments have shifted from open, laissez-faire internet exchange to ‘greater government intervention in the digital sphere.’ In 2023, global internet freedom had declined for the 13th consecutive year. Many point to the European Union as a bastion for ‘third way’ media co-regulation—balancing China’s authoritarian grip on expression and the United States’ unrestricted accommodations for free speech. Whereas one might view the European Union as a leader in media pluralism with appropriate safeguards for personal privacy, several Member State national governments stand in direct violation of such values. By April 2024, the Liberties Media Freedom Report declared that media freedom and pluralism stand ‘perilously close to the breaking point’ within the European Union. The European Union has produced legislation—specifically the General Data Protection Regulation (GDPR), the Digital Services Act (DSA), and the European Media Freedom Act (EMFA)—to try to address degrading media freedom within the EU community. This article examines how said legislation—specifically the EMFA—does not sufficiently secure media pluralism guarantees in two Member State case studies, Hungarian public media and Italian private media. With the European Union historically perceived as a ‘beacon of openness and liberal democracy,’ Member State derogations from media pluralism present hypocritical complicating factors for such international standards of liberal democratic governance.

 

Codifying EU Media Law

As enshrined in EU law, media pluralism and media freedom stand among the EU’s core principles and as fundamental rights of all EU citizens. Importantly, Article 11 of the EU Charter of Fundamental Rights states:

  1. Everyone has the right to freedom of expression. This right shall include freedom to hold opinions and to receive and impart information and ideas without interference by public authority and regardless of frontiers.
  2. The freedom and pluralism of the media shall be respected.

To this end, three major media protection packages have made their debut on the EU institutional stage. Implemented in 2018, the EU General Data Protection Regulation (GDPR) Regulation (EU) 2016/679 serves as unparalleled ‘third-way’ legislation intended to protect the personal data of EU citizens while still bolstering necessary information-related services such as journalistic free expression via Article 85. The Digital Services Act (DSA) Regulation (EU) 2022/2065 represents a novel avenue for confronting levels of hate speech, terrorist propaganda, and disinformation that have plagued major social media platforms in recent years; the DSA would require tech companies to enact policies aggressively combating illicit content or face billions of euros in fines. And most recently, in 2024, the European Media Freedom Act (EMFA) Regulation (EU) 2024/1083 formulates strict protections for journalistic practices and seeks transparency in public media funding and editorial independence.

Such legislation from the EU institutions displays a concerted effort to preserve media freedom at the supranational level. However, such practices do not reflect the ‘on-the-ground’ situation at the Member State level, nor will these laws serve as a panacea for long-standing, entrenched, and anti-competitive media freedom violations in various EU Member States. Two Member State case studies—Hungary and Italy—expose the gaps in the attempted remedies of these media packages, specifically the EMFA. To date, the EMFA represents the European Union’s foremost legislation on ensuring the integrity, independence, and durability of media freedom and media organizations. While the EMFA provisions would work to create a comprehensive future framework for media operations in a theoretical silo, this legislation arrives too late given the current state of affairs within the EU. As such, this piece will examine the disconnect between several of the more apparently robust Articles of the EMFA—Articles 4, 5, 6, 8-13, 22, and 25—and the media freedom environments in Hungarian public media and Italian private media. Whereas these measures might serve generally as substantive approaches to the reinforcement of media pluralism, they ultimately fail to address the deeply rooted and anti-competitive nature of leading Hungarian and Italian media organs.

 

Exerting Control over Hungarian Public Media

Hungary’s ruling Fidesz Party has openly and legally curtailed independent media since 2010 with an illiberal structure that will persist despite the aforementioned EU legislation. The illiberal structure’s public media constriction in Hungary functions through entirely legal and open parliamentary procedures to control and restrict media content. In 2011, Fidesz established the Media Authority and Media Council in Cardinal Act CLXXXV and CIV. The Media Authority serves as an umbrella media regulatory commission made up of three central branches: the President, the Media Council, and the Office of the Media Council. The media laws require official registration with the Media Authority before commencing media services, stipulate morality clauses and unbiased content, impose sanctions upwards of €720,000, and consolidate all public broadcasting and advertising under one organization—the Media Services and Support Trust Fund (MTVA). As the Council of Europe noted, the President of the Media Authority ‘holds extensive and concentrated powers for nine years over all regulatory, senior staffing, financing and content matters across all media sectors.’ Despite the EMFA’s proposed intention of ‘avoid[ing] the risk of… undue political influence on the media,’ [EMFA Recital 73 of the Preamble] the Regulation will not effect any material change in this highly concentrated, ruling-party aligned state organization.

Technically, the appointment process of the President of the Media Authority entirely aligns with Article 5(2) EMFA requiring ‘transparent and non-discriminatory procedures’ for appointments to management boards of public media service providers. The Hungarian government points to the fact that a constitutionally-codified confirmation vote of a two-thirds majority in Parliament would attribute popular, universal consensus to Media Authority appointees. However, these claims only provide a rhetorical veneer of nonpartisan composition. A gerrymandered two-thirds parliamentary Fidesz supermajority accommodates a streamlined confirmation process for pro-Fidesz political appointees. As such, the Media Authority regulatory commission is singularly composed of allies of the Hungarian ruling party who cannot—and would not—be recalled from their positions—another point of alignment with Article 5(2) EMFA. The first President of the Media Authority, Annamária Szalai, was a Fidesz MP. The second President—Mónika Karas—served as the defense attorney for two Fidesz-aligned media outlets. The third and current President—András Koltay—has carried a lead position in Mathias Corvinus Collegium, the Fidesz-affiliated think tank and educational institution.

With the European Union attempting to outline some basic standards for media pluralism, many of its responses have come far too late, particularly in the Hungarian case. In assessing the novel EU legal mechanisms for media pluralism, one does not see possible redress from the European supranational level. While the EMFA seeks transparency in appointment processes, it does not carry any mechanism for fully ensuring nonpartisan government-appointees in regulatory bodies—nor could it given appointees are determined at the Member State level. One study found that by 2017, nearly 90% of all Hungarian media was already ‘directly or indirectly controlled by Fidesz.’ At the Prague European Summit 2024, European Commissioner for Values and Transparency Věra Jourová indicated that while the EMFA makes significant strides for establishing protections of editorial independence in public media and media ownership transparency, Hungarian media state capture is ultimately at the whim of the national government and fundamentally irreversible from the European level. Commissioner Jourová is correct in this assessment particularly given that much of the EMFA approaches media institutions with a ‘freedom from interference’ negative liberty approach [EMFA Recitals 15, 18 and 19].

The well-entrenched, intricate, and legalistic implementation of the Hungarian Media Authority will continue unaffected by the EMFA. Article 4(2) EMFA outlines the need for Member State self-restraint in intervening in editorial decisions in media organs and regulatory authorities to preserve editorial independence. This guideline falls entirely flat à la hongrois; the now-purged editorial boards of Hungarian media providers are composed of decision-makers who voluntarily align with the government position. As previously mentioned, Article 5(2) EMFA mandates transparent, open, and non-discriminatory appointment processes for the heads of public media providers. The procedure for appointing a new President of the Media Authority is entirely transparent and outlined in Hungarian law; however, the appointee him or herself has consistently come from a pro-Fidesz background in the media. Articles 8-13 EMFA shape the role of the newly-established European Board for Media Services. While an entirely respectable mandate, the Board however would be composed of respective Member State national regulatory authorities, effectively legitimizing the Hungarian Media Authority in European-level decision-making. Finally, Article 6(1) EMFA seeks to clarify and publicize the ownership structure of private media. In Hungary, it is no secret that close Orbán allies own key outlets: Andrew Vajna owns TV 2—the most-watched television channel in Hungary in 2022—and Lőrinc Mészáros owns Hungary’s largest print media company, Mediaworks. Their outsized power over private media will not change with simple audience knowledge of the ownership of these companies. Already as of 2020, 74% of Hungarian voters believed that Hungarian media has a strong political bias and 66% believed it was ‘disconcerting that the media are increasingly concentrated in Fidesz’s hands.’ Even with the changes of the EMFA entering into force on 8 August 2025, Hungarian state capture of media capably evades EU media pluralism guarantees.

 

Establishing Conflicts of Interest in Italian Private Media

To turn to the Italian case as it pertains to the EMFA, the concern over privately-owned, party-affiliated media dominating the advertising markets prompts major conflict of interest considerations. A number of party-aligned television channels controlled by one individual have dominated the media advertising market share in Italy over the past three decades—former Prime Minister Silvio Berlusconi and his Mediaset conglomerate. The top six most-viewed television channels from 2008 to 2017 were divided between the state-run RAI and the private Mediaset company—with RAI channels maintaining a plurality of viewers. However, because of legal limits on advertising spend in public channels, Mediaset has consistently captured disproportionate advertising market share. For example, in 2009, RAI and Mediaset respectively maintained 39.2% and 38.8% of the total television audience, but Mediaset held 63.7% of advertising spend to RAI’s 25.5% the same year. European Commissioner for Values and Transparency Věra Jourová noted at the Prague European Summit 2024 that one of the EMFA’s goals is to establish transparency concerning party-affiliated media channels and to promote fair competition in the media markets. And yet the problem arises in the Italian case where a private, partisan media outlet already controls a dominant market share and the EMFA regulatory efforts are only specific to public advertising spend.

The effort to assess fair competition in media markets manifests in Article 22 EMFA, and transparent public spending on media platforms is codified in Article 25 EMFA. Article 22 EMFA establishes a reporting mechanism regarding media market concentrations. Article 25 EMFA seeks proportionate, transparent, and objective measures for determining public-advertising spend on media platforms. With Article 22 EMFA, it is difficult to see a ‘through-line’ between a report on highly-concentrated media outlets and the actual remediation of said monopolizing force. Article 25 EMFA would successfully combat arbitrary Member State funding for a media company which might result in illegitimately awarded public monies. But while this provision would stymie willful ruling-party media clientelism, it is unable to address private advertising spend, which can serve as a source of indirect conflict of interest lobbying. In the Berlusconi case where he actually owned the media outlets, one study found that firms shifted their allocated advertising spend to Mediaset during Berlusconi’s respective tenures as Prime Minister, boosting Mediaset profits by 25% across his years in office. Mediaset’s growth in advertising market share between 1993 and 2011 was marked by major increases at the start of his third and fourth governments. While Mediaset saw a 25% increase in profits during the period of various Berlusconi governments from 1994 to 2011, RAI’s profits decreased by 9% despite viewership remaining relatively consistent. The EMFA provisions do not provide any recourse for addressing such conflicts of interest or monopolizing tendencies in privately-owned media companies and resultant discretionary firm-by-firm advertising spend. And with Mediaset functioning from a majority position in the media advertising market—the company managed on average 55% of television advertising revenue from 2019 to 2022—the possibility of retrofitting fair competition procedures is unlikely. As such, Article 22 EMFA’s competition guidelines are toothless and Article 25 EMFA is too narrowly tailored in the Italian case, considering the reality that Berlusconi’s Mediaset already controls both a strong television viewership and an even stronger advertising stake. While the proportionality and transparency measures are respectable from behind a ‘veil of ignorance,’ the Berlusconi media empire has already positioned itself as the controlling stake in advertising revenue, and private firms can continue to operate via indirect conflict of interest lobbying beyond the confines of EMFA regulation.

 

Concluding Comments

The reality is that changes in the media landscape take place at the national level; the EU’s EMFA regulation can only do so much to secure Member State-specific media pluralism—particularly if editorial offices and ownership structures for these media organs have already been usurped. Even more concerning is the fact that these methods for state or partisan capture of media outlets serve as entirely replicable models for other nations—carrying grave connotations for the future of liberal democratic governance in constitutional democracies in the EU and around the world. In Hungary, Orbán’s efforts to control independent media and propagate his political agenda have irreversibly violated principles of media pluralism which—as the European Court of Human Rights once noted—stands as the ‘cornerstone of [a] democratic and pluralist society’ (Manole and Others v. Moldova, para 54). In Italy, Berlusconi’s media conglomerate Mediaset found avenues to solidify advertising control and financially benefit from firm advertising spend during his time as Prime Minister. While the EMFA prompts some important regulatory changes for the future state of media pluralism, it falls short of fully addressing the current state of Hungarian public media and Italian private media ecosystems. Such a topic provides context to the worldwide retreat of media pluralism, internet freedom, and free speech in liberal democratic societies; the backsliding of media pluralism—and liberal democratic principles writ large—is not confined to strictly authoritarian regimes but instead osmotically permeates throughout previously entrenched liberal democracies.







The History of Convolutional Neural Networks for Image Classification (1989 – Today) | by Avishek Biswas | Jun, 2024


A visual tour of the greatest innovations in Deep Learning and Computer Vision.

Before CNNs, the standard way to train a neural network to classify images was to flatten the image into a list of pixels and pass it through a feed-forward neural network to output the image’s class. The problem with flattening is that the essential spatial information in the image is discarded.

In 1989, Yann LeCun and team introduced Convolutional Neural Networks — the backbone of Computer Vision research for the last 15 years! Unlike feedforward networks, CNNs preserve the 2D nature of images and are capable of processing information spatially!

In this article, we are going to go through the history of CNNs specifically for Image Classification tasks — starting from those early research years in the 90’s to the golden era of the mid-2010s, when many of the most ingenious Deep Learning architectures were conceived, and finally discuss the latest trends in CNN research as CNNs compete with attention-based vision transformers.

Check out the YouTube video that explains all the concepts in this article visually with animations. Unless otherwise specified, all the images and illustrations used in this article were created by me while making the video version.

The papers we will be discussing today!

At the heart of a CNN is the convolution operation. We scan a small filter (kernel) across the image and calculate the dot product of the filter with the image at each overlapping location. The resulting output is called a feature map, and it captures how much and where the filter pattern is present in the image.

How Convolution works — The kernel slides over the input image and calculates the overlap (dot-product) at each location — outputting a feature map in the end!
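In code, the operation reduces to a pair of nested loops; here is a minimal NumPy sketch (no padding, stride 1) for illustration.

# Slide the kernel over the image and take a dot product at each location.
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # overlap at (i, j)
    return feature_map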

In a convolution layer, we train multiple filters that extract different feature maps from the input image. When we stack multiple convolutional layers in sequence with some non-linearity, we get a convolutional neural network (CNN).

So each convolution layer simultaneously does 2 things —
1. spatial filtering with the convolution operation between images and kernels, and
2. combining the multiple input channels to output a new set of channels.

90 percent of the research in CNNs has been to modify or to improve just these two things.

The two main things CNNs do

The 1989 Paper

This 1989 paper taught us how to train non-linear CNNs from scratch using backpropagation. The network takes 16×16 grayscale images of handwritten digits as input and passes them through two convolutional layers with 12 filters of size 5×5. The filters move with a stride of 2 during scanning; strided convolution is useful for downsampling the input image. After the conv layers, the output maps are flattened and passed through two fully connected layers to output the probabilities for the 10 digits. Using the softmax cross-entropy loss, the network is optimized to predict the correct labels for the handwritten digits. After each layer, the tanh nonlinearity is applied, allowing the learned feature maps to be more complex and expressive. With just 9,760 parameters, this was a very small network compared to today’s networks, which contain hundreds of millions of parameters.

The OG CNN architecture from 1989
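For readers who think in code, here is a rough PyTorch sketch in the spirit of the description above. The original network used sparse connectivity between feature maps and different training details, so treat this as an approximation rather than a faithful reimplementation; the 30-unit hidden layer follows the paper.

# An approximate, modern re-creation of the 1989 architecture (sketch).
import torch
import torch.nn as nn

class Net1989(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2),   # 16x16 -> 8x8
            nn.Tanh(),
            nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2),  # 8x8 -> 4x4
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12 * 4 * 4, 30),
            nn.Tanh(),
            nn.Linear(30, 10),  # 10 digit classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = Net1989()(torch.randn(1, 1, 16, 16))  # shape: (1, 10)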

Inductive Bias

Inductive Bias is a concept in Machine Learning where we deliberately introduce specific rules and limitations into the learning process to steer our models away from unconstrained generality and toward solutions that follow our human-like understanding.

When humans classify images, we also do spatial filtering to look for common patterns to form multiple representations and then combine them together to form our predictions. The CNN architecture is designed to replicate just that. In feedforward networks, each pixel is treated like its own isolated feature, as each neuron in the layers connects with all the pixels; in CNNs there is far more parameter sharing because the same filter scans the entire image (a quick parameter count below makes this concrete). Inductive biases also make CNNs less data-hungry: they get local pattern recognition for free thanks to the network design, whereas feedforward networks must spend their training cycles learning it from scratch.
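A quick back-of-the-envelope comparison makes the parameter-sharing point concrete; the sizes below match the 16×16 input discussed earlier and are purely illustrative.

# One shared 5x5 conv filter vs. a dense layer mapping a flattened
# 16x16 image to an equally sized output.
conv_filter_params = 5 * 5 + 1                       # 26 (weights + bias)
fc_layer_params = (16 * 16) * (16 * 16) + (16 * 16)  # 65,792
print(conv_filter_params, fc_layer_params)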





A Crash Course of Planning for Perception Engineers in Autonomous Driving | by Patrick Langechuan Liu | Jun, 2024


The fundamentals of planning and decision-making

AlphaGo, ChatGPT and FSD (image credit Elena Popova, Karthik Sridasyam and Jonathan Kemper on Unsplash)

A classical modular autonomous driving system typically consists of perception, prediction, planning, and control. Until around 2023, AI (artificial intelligence) or ML (machine learning) primarily enhanced perception in most mass-production autonomous driving systems, with its influence diminishing in downstream components. In stark contrast to the low integration of AI in the planning stack, end-to-end perception systems (such as the BEV, or birds-eye-view perception pipeline) have been deployed in mass production vehicles.

Classical modular design of an autonomous driving stack, 2023 and prior (Chart created by author)

There are multiple reasons for this. A classical stack based on a human-crafted framework is more explainable and can be iterated faster to fix field test issues (within hours) compared to machine learning-driven features (which could take days or weeks). However, it does not make sense to let readily available human driving data sit idle. Moreover, increasing computing power is more scalable than expanding the engineering team.

Fortunately, there has been a strong trend in both academia and industry to change this situation. First, downstream modules are becoming increasingly data-driven and may also be integrated via different interfaces, such as the one proposed in CVPR 2023’s best paper, UniAD. Moreover, driven by the ever-growing wave of Generative AI, a single unified vision-language-action (VLA) model shows great potential for handling complex robotics tasks (RT-2 in academia, TeslaBot and 1X in industry) and autonomous driving (GAIA-1, DriveVLM in academia, and Wayve AI driver, Tesla FSD in industry). This brings the toolsets of AI and data-driven development from the perception stack to the planning stack.

This blog post aims to introduce the problem settings, existing methodologies, and challenges of the planning stack, in the form of a crash course for perception engineers. As a perception engineer, I finally had some time over the past couple of weeks to systematically learn the classical planning stack, and I would like to share what I learned. I will also share my thoughts on how AI can help from the perspective of an AI practitioner.

The intended audience for this post is AI practitioners who work in the field of autonomous driving, in particular, perception engineers.

The article is a bit long (11,100 words); the table of contents below will help those who want to do quick Ctrl+F searches by keyword.

Table of Contents (ToC)

Why learn planning?
What is planning?
The problem formulation
The Glossary of Planning
Behavior Planning
Frenet vs Cartesian systems
Classical tools-the troika of planning
Searching
Sampling
Optimization
Industry practices of planning
Path-speed decoupled planning
Joint spatiotemporal planning
Decision making
What and why?
MDP and POMDP
Value iteration and Policy iteration
AlphaGo and MCTS-when nets meet trees
MPDM (and successors) in autonomous driving
Industry practices of decision making
Trees
No trees
Self-Reflections
Why NN in planning?
What about e2e NN planners?
Can we do without prediction?
Can we do with just nets but no trees?
Can we use LLMs to make decisions?
The trend of evolution

This brings us to an interesting question: why learn planning, especially the classical stack, in the era of AI?

From a problem-solving perspective, understanding your customers’ challenges better will enable you, as a perception engineer, to serve your downstream customers more effectively, even if your main focus remains on perception work.

Machine learning is a tool, not a solution. The most efficient way to solve problems is to combine new tools with domain knowledge, especially those with solid mathematical formulations. Domain knowledge-inspired learning methods are likely to be more data-efficient. As planning transitions from rule-based to ML-based systems, even with early prototypes and products of end-to-end systems hitting the road, there is a need for engineers who can deeply understand both the fundamentals of planning and machine learning. Despite these changes, classical and learning methods will likely continue to coexist for a considerable period, perhaps shifting from an 8:2 to a 2:8 ratio. It is almost essential for engineers working in this field to understand both worlds.

From a value-driven development perspective, understanding the limitations of classical methods is crucial. This insight allows you to effectively utilize new ML tools to design a system that addresses current issues and delivers immediate impact.

Additionally, planning is a critical part of all autonomous agents, not just in autonomous driving. Understanding what planning is and how it works will enable more ML talents to work on this exciting topic and contribute to the development of truly autonomous agents, whether they are cars or other forms of automation.

The problem formulation

As the “brain” of autonomous vehicles, the planning system is crucial for the safe and efficient driving of vehicles. The goal of the planner is to generate trajectories that are safe, comfortable, and efficiently progressing towards the goal. In other words, safety, comfort, and efficiency are the three key objectives for planning.

As input to the planning systems, all perception outputs are required, including static road structures, dynamic road agents, free space generated by occupancy networks, and traffic wait conditions. The planning system must also ensure vehicle comfort by monitoring acceleration and jerk for smooth trajectories, while considering interaction and traffic courtesy.

The planning systems generate trajectories in the format of a sequence of waypoints for the ego vehicle’s low-level controller to track. Specifically, these waypoints represent the future positions of the ego vehicle at a series of fixed time stamps. For example, each point might be 0.4 seconds apart, covering an 8-second planning horizon, resulting in a total of 20 waypoints.
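In code, such an output might look like the following; this is a hypothetical, minimal representation that matches the 0.4-second spacing and 8-second horizon of the example above.

# A minimal, hypothetical trajectory representation: 20 future poses
# at fixed 0.4 s intervals over an 8 s horizon.
from dataclasses import dataclass

@dataclass
class Waypoint:
    x: float  # meters, in a local Cartesian frame
    y: float  # meters
    t: float  # seconds from the current time

DT, HORIZON = 0.4, 8.0
trajectory = [Waypoint(x=0.0, y=0.0, t=(i + 1) * DT)
              for i in range(int(HORIZON / DT))]
assert len(trajectory) == 20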

A classical planning stack roughly consists of global route planning, local behavior planning, and local trajectory planning. Global route planning provides a road-level path from the start point to the end point on a global map. Local behavior planning decides on a semantic driving action type (e.g., car following, nudging, side passing, yielding, and overtaking) for the next several seconds. Based on the decided behavior type from the behavior planning module, local trajectory planning generates a short-term trajectory. The global route planning is typically provided by a map service once navigation is set and is beyond the scope of this post. We will focus on behavior planning and trajectory planning from now on.

Behavior planning and trajectory generation can work explicitly in tandem or be combined into a single process. In explicit methods, behavior planning and trajectory generation are distinct processes operating within a hierarchical framework at different frequencies, with behavior planning at 1–5 Hz and trajectory planning at 10–20 Hz. Although this is highly efficient most of the time, adapting it to different scenarios may require significant modifications and fine-tuning. More advanced planning systems combine the two into a single optimization problem. This approach ensures feasibility and optimality without any compromise.

Classification of planning design approaches (source: Fluid Dynamics Planner)

The Glossary of Planning

You might have noticed that the terminology used in the above section and the image do not completely match. There is no standard terminology that everyone uses. Across both academia and industry, it is not uncommon for engineers to use different names to refer to the same concept and the same name to refer to different concepts. This indicates that planning in autonomous driving is still under active development and has not fully converged.

Here, I list the notation used in this post and briefly explain other notions present in the literature.

  • Planning: A top-level concept, parallel to control, that generates trajectory waypoints. Together, planning and control are jointly referred to as PnC (planning and control).
  • Control: A top-level concept that takes in trajectory waypoints and generates high-frequency steering, throttle, and brake commands for actuators to execute. Control is relatively well-established compared to other areas and is beyond the scope of this post, despite the common notion of PnC.
  • Prediction: A top-level concept that predicts the future trajectories of traffic agents other than the ego vehicle. Prediction can be considered a lightweight planner for other agents and is also called motion prediction.
  • Behavior Planning: A module that produces high-level semantic actions (e.g., lane change, overtake) and typically generates a coarse trajectory. It is also known as task planning or decision making, particularly in the context of interactions.
  • Motion Planning: A module that takes in semantic actions and produces smooth, feasible trajectory waypoints for the duration of the planning horizon for control to execute. It is also referred to as trajectory planning.
  • Trajectory Planning: Another term for motion planning.
  • Decision Making: Behavior planning with a focus on interactions. Without ego-agent interaction, it is simply referred to as behavior planning. It is also known as tactical decision making.
  • Route Planning: Finds the preferred route over road networks, also known as mission planning.
  • Model-Based Approach: In planning, this refers to manually crafted frameworks used in the classical planning stack, as opposed to neural network models. Model-based methods contrast with learning-based methods.
  • Multimodality: In the context of planning, this typically refers to multiple intentions. This contrasts with multimodality in the context of multimodal sensor inputs to perception or multimodal large language models (such as VLM or VLA).
  • Reference Line: A local (several hundred meters) and coarse path based on global routing information and the current state of the ego vehicle.
  • Frenet Coordinates: A coordinate system based on a reference line. Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. See below for a more detailed introduction.
  • Trajectory: A 3D spatiotemporal curve, in the form of (x, y, t) in Cartesian coordinates or (s, l, t) in Frenet coordinates. A trajectory is composed of both path and speed.
  • Path: A 2D spatial curve, in the form of (x, y) in Cartesian coordinates or (s, l) in Frenet coordinates.
  • Semantic Action: A high-level abstraction of action (e.g., car following, nudge, side pass, yield, overtake) with clear human intention. Also referred to as intention, policy, maneuver, or primitive motion.
  • Action: A term with no fixed meaning. It can refer to the output of control (high-frequency steering, throttle, and brake commands for actuators to execute) or the output of planning (trajectory waypoints). Semantic action refers to the output of behavior planning.

Different literature may use varying notations and concepts; these variations illustrate the diversity in terminology and the evolving nature of the field.

Behavior Planning

As a machine learning engineer, you may notice that the behavior planning module is a heavily manually crafted intermediate module. There is no consensus on the exact form and content of its output. Concretely, the output of behavior planning can be a reference path or object labeling on ego maneuvers (e.g., pass from the left or right-hand side, pass or yield). The term “semantic action” has no strict definition and no fixed methods.

The decoupling of behavior planning and motion planning increases efficiency in solving the extremely high-dimensional action space of autonomous vehicles. The actions of an autonomous vehicle need to be reasoned at typically 10 Hz or more (time resolution in waypoints), and most of these actions are relatively straightforward, like going straight. After decoupling, the behavior planning layer only needs to reason about future scenarios at a relatively coarse resolution, while the motion planning layer operates in the local solution space based on the decision made by behavior planning. Another benefit of behavior planning is converting non-convex optimization to convex optimization, which we will discuss further below.

Frenet vs Cartesian systems

The Frenet coordinate system is a widely adopted system that merits its own introduction section. The Frenet frame simplifies trajectory planning by independently managing lateral and longitudinal movements relative to a reference path. The s coordinate represents longitudinal displacement (distance along the road), while the l (or d) coordinate represents lateral displacement (side position relative to the reference path).

Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. This transformation converts non-linear road boundary constraints on curvy roads into linear ones, significantly simplifying the subsequent optimization problems. Additionally, humans perceive longitudinal and lateral movements differently, and the Frenet frame allows for separate and more flexible optimization of these movements.

Schematics on the conversion from Cartesian frame to Frenet frame (source: Cartesian Planner)
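A minimal sketch of the Cartesian-to-Frenet projection is shown below. It assumes a densely sampled polyline reference line and uses a nearest-vertex approximation; production implementations interpolate within segments and handle the endpoints carefully.

# Project a Cartesian point onto a polyline reference line to get (s, l).
import numpy as np

def cartesian_to_frenet(point, ref_line):
    # ref_line: (N, 2) array of points sampled along the reference path.
    seg = np.diff(ref_line, axis=0)                      # segment vectors
    seg_len = np.linalg.norm(seg, axis=1)
    cum_s = np.concatenate([[0.0], np.cumsum(seg_len)])  # arc length at vertices
    i = int(np.argmin(np.linalg.norm(ref_line - point, axis=1)))
    k = min(i, len(seg) - 1)
    tangent = seg[k] / seg_len[k]
    normal = np.array([-tangent[1], tangent[0]])         # left of travel direction
    s = float(cum_s[i])                                  # longitudinal coordinate
    l = float(np.dot(point - ref_line[i], normal))       # signed lateral offset
    return s, l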

The Frenet coordinate system requires a clean, structured road graph with low curvature lanes. In practice, it is preferred for structured roads with small curvature, such as highways or city expressways. However, the issues with the Frenet coordinate system are amplified with increasing reference line curvature, so it should be used cautiously on structured roads with high curvature, like city intersections with guide lines.

For unstructured roads, such as ports, mining areas, parking lots, or intersections without guidelines, the more flexible Cartesian coordinate system is recommended. The Cartesian system is better suited for these environments because it can handle higher curvature and less structured scenarios more effectively.

Planning in autonomous driving involves computing a trajectory from an initial high-dimensional state (including position, time, velocity, acceleration, and jerk) to a target subspace, ensuring all constraints are satisfied. Searching, sampling, and optimization are the three most widely used tools for planning.

Searching

Classical graph-search methods are popular in planning and are used in route/mission planning on structured roads or directly in motion planning to find the best path in unstructured environments (such as parking or urban intersections, especially mapless scenarios). There is a clear evolution path, from Dijkstra’s algorithm to A* (A-star), and further to hybrid A*.

Dijkstra’s algorithm explores all possible paths to find the shortest one, making it a blind (uninformed) search algorithm. It is a systematic method that guarantees the optimal path, but it is inefficient to deploy. As shown in the chart below, it explores almost all directions. Essentially, Dijkstra’s algorithm is a breadth-first search (BFS) weighted by movement costs. To improve efficiency, we can use information about the location of the target to trim down the search space.

Visualization of Dijkstra’s algorithm and A-star search (Source: PathFinding.js, example inspired by RedBlobGames)

The A* algorithm uses heuristics to prioritize paths that appear to be leading closer to the goal, making it more efficient. It combines the cost so far (Dijkstra) with the cost to go (heuristics, essentially greedy best-first). A* only guarantees the shortest path if the heuristic is admissible and consistent. If the heuristic is poor, A* can perform worse than the Dijkstra baseline and may degenerate into a greedy best-first search.
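A compact sketch of A* on a 4-connected occupancy grid is shown below, with unit step costs and a Manhattan-distance heuristic (admissible and consistent in this setting); setting the heuristic to zero recovers Dijkstra's algorithm.

# A* grid search: 0 = free cell, 1 = obstacle; h = 0 gives Dijkstra.
import heapq

def astar(grid, start, goal):
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # cost to go
    frontier = [(h(start), 0, start, None)]  # (f = g + h, g, node, parent)
    parent, best_g = {}, {start: 0}
    while frontier:
        _, g, node, par = heapq.heappop(frontier)
        if node in parent:                   # already expanded at lower cost
            continue
        parent[node] = par
        if node == goal:                     # walk back to recover the path
            path = [node]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        r, c = node
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < best_g.get(nxt, float("inf"))):
                best_g[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, node))
    return None                              # no path exists

print(astar([[0, 0, 0], [1, 1, 0], [0, 0, 0]], (0, 0), (2, 0)))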

In the specific application of autonomous driving, the hybrid A* algorithm further improves A* by considering vehicle kinematics. A path produced by vanilla A* may not satisfy kinematic constraints and therefore cannot be tracked accurately (for example, the steering angle is typically limited to around 40 degrees). While A* operates in grid space for both state and action, hybrid A* separates them, keeping the state on the grid but allowing continuous actions that comply with vehicle kinematics.

Analytical expansion (shot to goal) is another key innovation proposed by hybrid A*. A natural enhancement to A* is to connect the most recently explored node to the goal using a non-colliding straight line. If this is possible, we have found the solution. In hybrid A*, this straight line is replaced by Dubins and Reeds-Shepp (RS) curves, which comply with vehicle kinematics. This early-stopping method strikes a balance between optimality and feasibility, emphasizing feasibility on the farther side of the path, toward the goal.

Hybrid A* is used heavily in parking scenarios and mapless urban intersections. Here is a very nice video showcasing how it works in a parking scenario.

Hybrid A-star algorithm with analytical expansion (source: the 2010 IJRR Hybrid A-star paper and 2012 Udacity class )

Sampling

Another popular method of planning is sampling. The well-known Monte Carlo method is a random sampling method. In essence, sampling involves selecting many candidates randomly or according to a prior, and then selecting the best one according to a defined cost. For sampling-based methods, the fast evaluation of many options is critical, as it directly impacts the real-time performance of the autonomous driving system.

Large Language Models (LLMs) essentially provide samples, and there needs to be an evaluator with a defined cost that aligns with human preferences. This evaluation process ensures that the selected output meets the desired criteria and quality standards.

Sampling can occur in a parameterized solution space if we already know the analytical solution to a given problem or subproblem. For example, we typically want to minimize, among other criteria, the time integral of the square of jerk — the third derivative of position p(t), written as p with three dots over it, where each dot denotes one order of derivative with respect to time.

Minimizing squared jerk for driving comfort (source: Werling et al, ICRA 2010)

It can be mathematically proven that quintic (5th order) polynomials provide the jerk-optimal connection between two states in a position-velocity-acceleration space, even when additional cost terms are considered. By sampling in this parameter space of quintic polynomials, we can find the one with the minimum cost to get the approximate solution. The cost takes into account factors such as speed, acceleration, jerk limit, and collision checks. This approach essentially solves the optimization problem through sampling.

Sampling of lateral movement time profiles (source: Werling et al, ICRA 2010)
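A minimal sketch of the quintic fit: with position, velocity, and acceleration fixed at both ends, the six coefficients of x(t) = c0 + c1·t + … + c5·t^5 solve a small linear system. The boundary values below are arbitrary examples; a sampling planner would sweep many end states and keep the lowest-cost fit.

# Fit a quintic polynomial between two (position, velocity, acceleration)
# boundary states over duration T.
import numpy as np

def quintic_coeffs(x0, v0, a0, xT, vT, aT, T):
    A = np.array([
        [1, 0, 0,    0,       0,        0],
        [0, 1, 0,    0,       0,        0],
        [0, 0, 2,    0,       0,        0],
        [1, T, T**2, T**3,    T**4,     T**5],
        [0, 1, 2*T,  3*T**2,  4*T**3,   5*T**4],
        [0, 0, 2,    6*T,     12*T**2,  20*T**3],
    ])
    b = np.array([x0, v0, a0, xT, vT, aT])
    return np.linalg.solve(A, b)  # c0 .. c5

coeffs = quintic_coeffs(x0=0, v0=10, a0=0, xT=30, vT=12, aT=0, T=3.0)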

Sampling-based methods have inspired numerous ML papers, including CoverNet, Lift-Splat-Shoot, NMP, and MP3. These methods replace mathematically sound quintic polynomials with human driving behavior, utilizing a large database. The evaluation of trajectories can be easily parallelized, which further supports the use of sampling-based methods. This approach effectively leverages a vast amount of expert demonstrations to mimic human-like driving behavior, while avoiding random sampling of acceleration and steering profiles.

Sampling from human-driving data for data-driven planning methods (source: NMP, CoverNet and Lift-splat-shoot)

Optimization

Optimization finds the best solution to a problem by maximizing or minimizing a specific objective function under given constraints. In neural network training, a similar principle is followed using gradient descent and backpropagation to adjust the network’s weights. However, in optimization tasks outside of neural networks, models are usually less complex, and more effective methods than gradient descent are often employed. For example, while gradient descent can be applied to Quadratic Programming, it is generally not the most efficient method.

In autonomous driving, the planning cost to optimize typically considers dynamic objects for obstacle avoidance, static road structures for following lanes, navigation information to ensure the correct route, and ego status to evaluate smoothness.

Optimization can be categorized into convex and non-convex types. The key distinction is that in a convex optimization scenario, there is only one global optimum, which is also the local optimum. This characteristic makes it unaffected by the initial solution to the optimization problems. For non-convex optimization, the initial solution matters a lot, as illustrated in the chart below.

Convex vs non-convex optimization (source: Stanford course materials)

Since planning involves highly non-convex optimization with many local optima, it heavily depends on the initial solution. Additionally, convex optimization typically runs much faster and is therefore preferred for onboard real-time applications such as autonomous driving. A typical approach is to use convex optimization in conjunction with other methods to outline a convex solution space first. This is the mathematical foundation behind separating behavior planning and motion planning, where finding a good initial solution is the role of behavior planning.

Take obstacle avoidance as a concrete example, which typically introduces non-convex problems. If we know the nudging direction, then it becomes a convex optimization problem, with the obstacle position acting as a lower or upper bound constraint for the optimization problem. If we don’t know the nudging direction, we need to decide first which direction to nudge, making the problem a convex one for motion planning to solve. This nudging direction decision falls under behavior planning.
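A toy sketch of this idea, assuming the cvxpy package is installed: once the behavior layer fixes the decision to pass on the left, the obstacle becomes a simple lower-bound constraint on the lateral offset, and path smoothing is a convex QP. All numbers are made up.

import cvxpy as cp

n = 50                      # stations along the reference line
l = cp.Variable(n)          # lateral offset (meters) at each station
smoothness = cp.sum_squares(cp.diff(l, 2))  # penalize bending
centering = cp.sum_squares(l)               # stay near the reference line
# "Nudge left" turns the obstacle into a lower bound at stations 20..29:
constraints = [l[20:30] >= 1.5, l[0] == 0, l[-1] == 0]
cp.Problem(cp.Minimize(smoothness + 0.1 * centering), constraints).solve()
print(l.value[18:32].round(2))  # the path bulges left around the obstacle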

Of course, we can do direct optimization of non-convex optimization problems with tools such as projected gradient descent, alternating minimization, particle swarm optimization (PSO), and genetic algorithms. However, this is beyond the scope of this post.

A convex path planning problem vs a non-convex one (chart made by author)
The solution process of the convex vs non-convex path planning problem (chart made by author)

How do we make such decisions? We can use the aforementioned search or sampling methods to address non-convex problems. Sampling-based methods scatter many options across the parameter space, effectively handling non-convex issues similarly to searching.

You may also question why deciding which direction to nudge from is enough to guarantee the problem space is convex. To explain this, we need to discuss topology. In path space, similar feasible paths can transform continuously into each other without obstacle interference. These similar paths, grouped as “homotopy classes” in the formal language of topology, can all be explored using a single initial solution homotopic to them. All these paths form a driving corridor, illustrated as the red or green shaded area in the image above. For a 3D spatiotemporal case, please refer to the QCraft tech blog.

We can utilize the Generalized Voronoi diagram to enumerate all homotopy classes, which roughly corresponds to the different decision paths available to us. However, this topic delves into advanced mathematical concepts that are beyond the scope of this blog post.

The key to solving optimization problems efficiently lies in the capabilities of the optimization solver. Typically, a solver requires approximately 10 milliseconds to plan a trajectory. If we can boost this efficiency by tenfold, it can significantly impact algorithm design. This exact improvement was highlighted during Tesla AI Day 2022. A similar enhancement has occurred in perception systems, transitioning from 2D perception to Bird’s Eye View (BEV) as available computing power scaled up tenfold. With a more efficient optimizer, more options can be calculated and evaluated, thereby reducing the importance of the decision-making process. However, engineering an efficient optimization solver demands substantial engineering resources.

Every time compute scales up by 10x, algorithms will evolve to the next generation.
— The unverified law of algorithm evolution

A key differentiator in various planning systems is whether they are spatiotemporally decoupled. Concretely, spatiotemporally decoupled methods plan in spatial dimensions first to generate a path, and then plan the speed profile along this path. This approach is also known as path-speed decoupling.

Path-speed decoupling is often referred to as lateral-longitudinal (lat-long) decoupling, where lateral (lat) planning corresponds to path planning and longitudinal (long) planning corresponds to speed planning. This terminology seems to originate from the Frenet coordinate system, introduced earlier.

Decoupled solutions are easier to implement and can solve about 95% of issues. In contrast, coupled solutions have a higher theoretical performance ceiling but are more challenging to implement. They involve more parameters to tune and require a more principled approach to parameter tuning.

The comparison of decoupled and joint planning (source: made by the author, inspired by Qcraft)
Pros and cons of decoupled vs joint spatiotemporal planning (chart made by author)

Path-speed decoupled planning

We can take Baidu Apollo EM planner as an example of a system that uses path-speed decoupled planning.

The EM planner significantly reduces computational complexity by transforming a three-dimensional station-lateral-speed problem into two two-dimensional problems: station-lateral and station-speed. At the core of Apollo’s EM planner is an iterative Expectation-Maximization (EM) step, consisting of path optimization and speed optimization. Each step is divided into an E-step (projection and formulation in a 2D state space) and an M-step (optimization in the 2D state space). The E-step involves projecting the 3D problem into either a Frenet SL frame or an ST speed tracking frame.

The EM iteration in Apollo EM planner (source: Baidu Apollo EM planner )

The M-step (maximization step) in both path and speed optimization involves solving non-convex optimization problems. For path optimization, this means deciding whether to nudge an object on the left or right side, while for speed optimization, it involves deciding whether to overtake or yield to a dynamic object crossing the path. The Apollo EM planner addresses these non-convex optimization challenges using a two-step process: Dynamic Programming (DP) followed by Quadratic Programming (QP).

DP uses a sampling or searching algorithm to generate a rough initial solution, effectively pruning the non-convex space into a convex space. QP then takes the coarse DP results as input and optimizes them within the convex space provided by DP. In essence, DP focuses on feasibility, and QP refines the solution to achieve optimality within the convex constraints.

In our defined terminology, Path DP corresponds to lateral BP, Path QP to lateral MP, Speed DP to longitudinal BP, and Speed QP to longitudinal MP. Thus, the process involves conducting BP (behavior planning) followed by MP (motion planning) in both the path and speed steps.

A full autonomous driving stack with path-speed decoupled planning (chart made by author)

Joint spatiotemporal planning

Although decoupled planning can resolve 95% of cases in autonomous driving, the remaining 5% involve challenging dynamic interactions where a decoupled solution often results in suboptimal trajectories. In these complex scenarios, demonstrating intelligence is crucial, making it a very hot topic in the field.

For example, in narrow-space passing, the optimal behavior might be to either decelerate to yield or accelerate to pass. Such behaviors are not achievable within the decoupled solution space and require joint optimization. Joint optimization allows for a more integrated approach, considering both path and speed simultaneously to handle intricate dynamic interactions effectively.

A full autonomous driving stack with joint spatiotemporal planning (chart made by author)

However, there are significant challenges in joint spatiotemporal planning. Firstly, solving the non-convex problem directly in a higher-dimensional state space is more challenging and time-consuming than using a decoupled solution. Secondly, considering interactions in spatiotemporal joint planning is even more complex. We will cover this topic in more detail later when we discuss decision-making.

Here we introduce two solving methods: brute force search and constructing a spatiotemporal corridor for optimization.

Brute force search occurs directly in 3D spatiotemporal space (2D in space and 1D in time), and can be performed in either XYT (Cartesian) or SLT (Frenet) coordinates. We will take SLT as an example. SLT space is long and flat, similar to an energy bar: it is elongated in the S dimension and flat in the ST face. For brute force search, we can use hybrid A-star, with the cost being a combination of progress cost and cost to go. During optimization, we must conform to search constraints that prevent reversing in both the s and t dimensions.

Overtake by lane change in spatiotemporal lattice (source: Spatiotemporal optimization with A*)

Another method is constructing a spatiotemporal corridor, essentially a curve with the footprint of a car winding through a 3D spatiotemporal state space (SLT, for example). The SSC (spatiotemporal semantic corridor, RAL 2019) encodes requirements given by semantic elements into a semantic corridor, generating a safe trajectory accordingly. The semantic corridor consists of a series of mutually connected, collision-free cubes with dynamical constraints posed by the semantic elements in the spatiotemporal domain. Within each cube, the problem becomes a convex optimization that can be solved using Quadratic Programming (QP).

SSC still requires a BP (behavior planning) module to provide a coarse driving trajectory. Complex semantic elements of the environment are projected into the spatiotemporal domain with respect to the reference lane. EPSILON (TRO 2021) showcases a system where SSC serves as the motion planner working in tandem with a behavior planner. In the next section, we will discuss behavior planning, especially focusing on interaction. In this context, behavior planning is usually referred to as decision making.

An illustration of the spatiotemporal corridor (source: SSC)

What and why?

Decision making in autonomous driving is essentially behavior planning, but with a focus on interaction with other traffic agents. The assumption is that other agents are mostly rational and will respond to our behavior in a predictable manner, which we can describe as “noisily rational.”

People may question the necessity of decision making when advanced planning tools are available. However, two key aspects — uncertainty and interaction — introduce a probabilistic nature to the environment, primarily due to the presence of dynamic objects. Interaction is the most challenging part of autonomous driving, distinguishing it from general robotics. Autonomous vehicles must not only navigate but also anticipate and react to the behavior of other agents, making robust decision-making essential for safety and efficiency.

In a deterministic (purely geometric) world without interaction, decision making would be unnecessary, and planning through searching, sampling, and optimization would suffice. Brute force searching in the 3D XYT space could serve as a general solution.

In most classical autonomous driving stacks, a prediction-then-plan approach is adopted, assuming zero-order interaction between the ego vehicle and other vehicles. This approach treats prediction outputs as deterministic, requiring the ego vehicle to react accordingly. This leads to overly conservative behavior, exemplified by the “freezing robot” problem. In such cases, prediction fills the entire spatiotemporal space, preventing actions like lane changes in crowded conditions — something humans manage more effectively.

To handle stochastic strategies, Markov Decision Processes (MDP) or Partially Observable Markov Decision Processes (POMDP) frameworks are essential. These approaches shift the focus from geometry to probability, addressing chaotic uncertainty. By assuming that traffic agents behave rationally or at least noisily rationally, decision making can help create a safe driving corridor in the otherwise chaotic spatiotemporal space.

Among the three overarching goals of planning — safety, comfort, and efficiency — decision making primarily enhances efficiency. Conservative actions can maximize safety and comfort, but effective negotiation with other road agents, achievable through decision making, is essential for optimal efficiency. Effective decision making also displays intelligence.

MDP and POMDP

We will first introduce Markov Decision Processes (MDP) and Partially Observable Markov Decision Processes (POMDP), followed by their systematic solutions, such as value iteration and policy iteration.

A Markov Process (MP) is a type of stochastic process that deals with dynamic random phenomena, as opposed to static probability. In a Markov Process, the future state depends only on the current state, making the current state sufficient for prediction. In autonomous driving, a single raw snapshot rarely satisfies this property on its own, so we typically expand the state to include a short history window (for example, the last second of data) so that the Markov assumption becomes reasonable.

A Markov Decision Process (MDP) extends a Markov Process to include decision-making by introducing action. MDPs model decision-making where outcomes are partly random and partly controlled by the decision maker or agent. An MDP can be modeled with five factors:

  1. State (S): The state of the environment.
  2. Action (A): The actions the agent can take to affect the environment.
  3. Reward (R): The reward the environment provides to the agent as a result of the action.
  4. Transition Probability (P): The probability of transitioning from the old state to a new state upon the agent’s action.
  5. Gamma (γ): A discount factor for future rewards.

This is also the common framework used by reinforcement learning (RL), which is essentially an MDP. The goal of MDP or RL is to maximize the cumulative reward received in the long run. This requires the agent to make good decisions given a state from the environment, according to a policy.

A policy, π, is a mapping from each state, s ∈ S, and action, a ∈ A(s), to the probability π(a|s) of taking action a when in state s. MDP or RL studies the problem of how to derive the optimal policy.
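A minimal sketch of these five factors as plain Python structures, with a toy two-lane scenario; all states, probabilities, and rewards below are invented for illustration.

```python
# Toy MDP: S, A, P, R, gamma as plain Python structures.
states = ["lane_keep", "lane_change", "collision"]
actions = ["stay", "change"]
gamma = 0.9   # discount factor for future rewards

# P[s][a] -> list of (next_state, probability); R[s][a] -> immediate reward.
P = {
    "lane_keep":   {"stay": [("lane_keep", 1.0)],
                    "change": [("lane_change", 0.9), ("collision", 0.1)]},
    "lane_change": {"stay": [("lane_change", 1.0)],
                    "change": [("lane_keep", 0.9), ("collision", 0.1)]},
    "collision":   {"stay": [("collision", 1.0)], "change": [("collision", 1.0)]},
}
R = {
    "lane_keep":   {"stay": 1.0, "change": 0.5},
    "lane_change": {"stay": 2.0, "change": 0.5},   # the faster lane pays more
    "collision":   {"stay": -100.0, "change": -100.0},
}

# A deterministic policy (a special case of pi(a|s)) maps each state to an action.
policy = {"lane_keep": "change", "lane_change": "stay", "collision": "stay"}
```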

The agent-environment interface in MDP and RL (source: Reinforcement Learning: An Introduction)

A Partially Observable Markov Decision Process (POMDP) adds an extra layer of complexity by recognizing that states cannot be directly observed but rather inferred through observations. In a POMDP, the agent maintains a belief — a probability distribution over possible states — to estimate the state of the environment. Autonomous driving scenarios are better represented by POMDPs due to their inherent uncertainties and the partial observability of the environment. An MDP can be considered a special case of a POMDP where the observation perfectly reveals the state.

MDP vs POMDP (source: POMDPs as stochastic contingent planning)

POMDPs can actively collect information, leading to actions that gather necessary data, demonstrating the intelligent behavior of these models. This capability is particularly valuable in scenarios like waiting at intersections, where gathering information about other vehicles’ intentions and the state of the traffic light is crucial for making safe and efficient decisions.

Value iteration and policy iteration are systematic methods for solving MDP or POMDP problems. While these methods are not commonly used in real-world applications due to their complexity, understanding them provides insight into exact solutions and how they can be simplified in practice, such as using MCTS in AlphaGo or MPDM in autonomous driving.

To find the best policy in an MDP, we must assess the potential or expected reward from a state, or more specifically, from an action taken in that state. This expected reward includes not just the immediate reward but also all future rewards, formally known as the return or cumulative discounted reward. (For a deeper understanding, refer to “Reinforcement Learning: An Introduction,” often considered the definitive guide on the subject.)

The value function (V) characterizes the quality of states by summing the expected returns. The action-value function (Q) assesses the quality of actions for a given state. Both functions are defined according to a given policy. The Bellman Optimality Equation states that an optimal policy will choose the action that maximizes the immediate reward plus the expected future rewards from the resulting new states. In simple terms, the Bellman Optimality Equation advises considering both the immediate reward and the future consequences of an action. For example, when switching jobs, consider not only the immediate pay raise (R) but also the future value (S’) the new position offers.

Bellman’s equation of optimality (chart made by author)
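For reference, the Bellman Optimality Equation the chart depicts is commonly written as follows, using the S, A, R, P, γ notation introduced above (exact notation varies by textbook):

```latex
V^*(s) = \max_{a \in A} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V^*(s')\bigr]
\qquad
Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\bigr]
```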

It is relatively straightforward to extract the optimal policy from the Bellman Optimality Equation once the optimal value function is available. But how do we find this optimal value function? This is where value iteration comes to the rescue.

Extract best policy from optimal values (chart made by author)

Value iteration finds the best policy by repeatedly updating the value of each state until it stabilizes. This process is derived by turning the Bellman Optimality Equation into an update rule. Essentially, we use the optimal future picture to guide the iteration toward it. In plain language, “fake it until you make it!”

Update value functions under the guidance of Bellman’s Equation (chart made by author)

Value iteration is guaranteed to converge for finite state spaces, regardless of the initial values assigned to the states (for a detailed proof, please refer to the Bible of RL). If the discount factor gamma is set to 0, meaning we only consider immediate rewards, the value iteration will converge after just one iteration. A smaller gamma leads to faster convergence because the horizon of consideration is shorter, though it may not always be the best option for solving concrete problems. Balancing the discount factor is a key aspect of engineering practice.

One might ask how this works if all states are initialized to zero. The immediate reward in the Bellman Equation is crucial for bringing in additional information and breaking the initial symmetry. Think about the states that immediately lead to the goal state; their value propagates through the state space like a virus. In plain language, it’s about making small wins, frequently.
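Here is a compact sketch of value iteration, written to consume the toy MDP structures defined earlier; the convergence tolerance is arbitrary.

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-6):
    """Turn the Bellman Optimality Equation into an update rule and
    sweep over all states until the values stop changing."""
    V = {s: 0.0 for s in states}   # symmetry is broken by immediate rewards
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def greedy_policy(V, states, actions, P, R, gamma):
    """Extract the optimal policy once the optimal values are available."""
    return {
        s: max(actions,
               key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
        for s in states
    }

# With the toy MDP above: V = value_iteration(states, actions, P, R, gamma)
```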

Value and policy functions interact until they converge to optimum together (source: Reinforcement Learning: An Introduction)

However, value iteration also suffers from inefficiency. It requires taking the optimal action at each iteration by considering all possible actions, similar to Dijkstra’s algorithm. While it demonstrates feasibility as a basic approach, it is typically not practical for real-world applications.

The contrast of Bellman Equation and Bellman Optimality Equation (chart made by author)

Policy iteration improves on this by taking actions according to the current policy and updating it based on the Bellman Equation (not the Bellman Optimality Equation). Policy iteration decouples policy evaluation from policy improvement, making it a much faster solution. Each step is taken based on a given policy instead of exploring all possible actions to find the one that maximizes the objective. Although each iteration of policy iteration can be more computationally intensive due to the policy evaluation step, it generally results in a faster convergence overall.

In simple terms, if you can only fully evaluate the consequence of one action, it’s better to use your own judgment and do your best with the current information available.
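A matching sketch of policy iteration over the same toy MDP structures, showing the decoupling of policy evaluation (Bellman Equation, fixed policy) from policy improvement (greedy step); initialization and tolerance are again arbitrary.

```python
def policy_iteration(states, actions, P, R, gamma, tol=1e-6):
    policy = {s: actions[0] for s in states}
    while True:
        # Policy evaluation: iterate the Bellman Equation for the current policy.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        new_policy = {
            s: max(actions,
                   key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
            for s in states
        }
        if new_policy == policy:   # stable policy: converged
            return policy, V
        policy = new_policy
```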

AlphaGo and MCTS — when nets meet trees

We have all heard the unbelievable story of AlphaGo beating the best human player in 2016. AlphaGo formulates the gameplay of Go as an MDP and solves it with Monte Carlo Tree Search (MCTS). But why not use value iteration or policy iteration?

Value iteration and policy iteration are systematic, iterative methods that solve MDP problems. However, even with improved policy iteration, it still requires performing time-consuming operations to update the value of every state. A standard 19×19 Go board has roughly 2×10^170 possible states. This vast number of states makes it intractable to solve with traditional value iteration or policy iteration techniques.

AlphaGo and its successors use a Monte Carlo tree search (MCTS) algorithm to find their moves, guided by a value network and a policy network, trained on both human and computer play. Let’s take a look at vanilla MCTS first.

The four steps of MCTS by AlphaGo, combining both value network and policy network (source: AlphaGo, Nature 2016)

Monte Carlo Tree Search (MCTS) is a method for policy estimation that focuses on decision-making from the current state. One iteration involves a four-step process: selection, expansion, simulation (or evaluation), and backup.

  1. Selection: The algorithm follows the most promising path based on previous simulations until it reaches a leaf node, a position not yet fully explored.
  2. Expansion: One or more child nodes are added to represent possible next moves from the leaf node.
  3. Simulation (Evaluation): The algorithm plays out a random game from the new node until the end, known as a “rollout.” This assesses the potential outcome from the expanded node by simulating random moves until a terminal state is reached.
  4. Backup: The algorithm updates the values of the nodes on the path taken based on the game’s result. If the outcome is a win, the value of the nodes increases; if it is a loss, the value decreases. This process propagates the result of the rollout back up the tree, refining the policy based on simulated outcomes.

After a given number of iterations, MCTS provides the percentage frequency with which immediate actions were selected from the root during simulations. During inference, the action with the most visits is selected. Here is an interactive illustration of MCTS with the game of tic-tac-toe for simplicity.
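A skeletal, generic implementation of these four steps (vanilla MCTS with a UCB selection rule; the toy "count to 5" game and the exploration constant are invented for illustration) might look like this:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                 # action -> child Node
        self.visits, self.total = 0, 0.0

def ucb(child, parent_visits, c=1.4):
    """Upper-confidence bound: exploit average return, explore rare nodes."""
    return child.total / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state, legal, step, terminal, reward, n_iter=2000):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. Selection: follow UCB down while the node is fully expanded.
        while not terminal(node.state) and len(node.children) == len(legal(node.state)):
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
        # 2. Expansion: add one untried child.
        if not terminal(node.state):
            a = random.choice([a for a in legal(node.state) if a not in node.children])
            node.children[a] = Node(step(node.state, a), parent=node)
            node = node.children[a]
        # 3. Simulation: random rollout until a terminal state is reached.
        s = node.state
        while not terminal(s):
            s = step(s, random.choice(legal(s)))
        r = reward(s)
        # 4. Backup: propagate the rollout result back up the path.
        while node:
            node.visits += 1
            node.total += r
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)  # most visited

# Toy game: start at 0, add 1 or 2 per move; landing exactly on 5 wins.
best = mcts(0, legal=lambda s: [1, 2], step=lambda s, a: s + a,
            terminal=lambda s: s >= 5, reward=lambda s: 1.0 if s == 5 else 0.0)
print(best)
```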

MCTS in AlphaGo is enhanced by two neural networks. The value network evaluates the winning rate from a given state (board configuration), and the policy network evaluates the action distribution for all possible moves. These neural networks improve MCTS by reducing the effective depth and breadth of the search tree. The policy network helps in sampling actions, focusing the search on promising moves, while the value network provides a more accurate evaluation of positions, reducing the need for extensive rollouts. This combination allows AlphaGo to perform efficient and effective searches in the vast state space of Go.

The policy network and value network of AlphaGo (source: AlphaGo, Nature 2016)

In the expansion step, the policy network samples the most likely positions, effectively pruning the breadth of the search space. In the evaluation step, the value network provides an instinctive scoring of the position, while a faster, lightweight policy network performs rollouts until the game ends to collect rewards. MCTS then uses a weighted sum of the evaluations from both networks to make the final assessment.
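In the AlphaGo paper's notation, this weighted sum at a leaf node s_L combines the value network output v_θ(s_L) and the rollout outcome z_L via a mixing parameter λ (the paper reports λ = 0.5 performing best):

```latex
V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L
```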

Note that a single evaluation of the value network approaches the accuracy of Monte Carlo rollouts using the RL policy network but with 15,000 times less computation. This mirrors the fast-slow system design, akin to intuition versus reasoning, or System 1 versus System 2 as described by Nobel laureate Daniel Kahneman. Similar designs can be observed in more recent works, such as DriveVLM.

To be exact, AlphaGo incorporates two slow-fast systems at different levels. On the macro level, the policy network selects moves while the faster rollout policy network evaluates these moves. On the micro level, the faster rollout policy network can be approximated by a value network that directly predicts the winning rate of board positions.

What can we learn from AlphaGo for autonomous driving? AlphaGo demonstrates the importance of extracting an excellent policy using a robust world model (simulation). Similarly, autonomous driving requires a highly accurate simulation to effectively leverage algorithms similar to those used by AlphaGo. This approach underscores the value of combining strong policy networks with detailed, precise simulations to enhance decision-making and optimize performance in complex, dynamic environments.

In the game of Go, all states are immediately available to both players, making it a perfect information game where observation equals state. This allows the game to be characterized by an MDP process. In contrast, autonomous driving is a POMDP process, as the states can only be estimated through observation.

POMDPs connect perception and planning in a principled way. The typical solution for a POMDP is similar to that for an MDP, with a limited lookahead. However, the main challenges lie in the curse of dimensionality (explosion in state space) and the complex interactions with other agents. To make real-time progress tractable, domain-specific assumptions are typically made to simplify the POMDP problem.

MPDM (and the two follow-ups, and the white paper) is one pioneering study in this direction. MPDM reduces the POMDP to a closed-loop forward simulation of a finite, discrete set of semantic-level policies, rather than evaluating every possible control input for every vehicle. This approach addresses the curse of dimensionality by focusing on a manageable number of meaningful policies, allowing for effective real-time decision-making in autonomous driving scenarios.

Semantic actions help control the curse of dimensionality (source: EPSILON)

The assumptions of MPDM are twofold. First, much of the decision-making by human drivers involves discrete high-level semantic actions (e.g., slowing, accelerating, lane-changing, stopping). These actions are referred to as policies in this context. The second implicit assumption concerns other agents: other vehicles will make reasonably safe decisions. Once a vehicle’s policy is decided, its action (trajectory) is determined.

The framework of MPDM (chart created by author)

MPDM first selects one policy for the ego vehicle from many options (hence the “multi-policy” in its name) and selects one policy for each nearby agent based on their respective predictions. It then performs forward simulation (similar to a fast rollout in MCTS). The best interaction scenario after evaluation is then passed on to motion planning, such as the Spatiotemporal Semantic Corridor (SSC) mentioned in the joint spatiotemporal planning section.

MPDM enables intelligent and human-like behavior, such as actively cutting into dense traffic flow even when no sufficient gap is present. This is not possible with a predict-then-plan pipeline, which does not explicitly consider interactions. The prediction module in MPDM is tightly integrated with the behavior planning module through forward simulation.

MPDM assumes a single policy throughout the decision horizon (10 seconds). Essentially, MPDM adopts an MCTS approach with one layer deep and super wide, considering all possible agent predictions. This leaves room for improvement, inspiring many follow-up works such as EUDM, EPSILON, and MARC. For example, EUDM considers more flexible ego policies and assigns a policy tree with a depth of four, with each policy covering a time duration of 2 seconds over an 8-second decision horizon. To compensate for the extra computation induced by the increased tree depth, EUDM performs more efficient width pruning by guided branching, identifying critical scenarios and key vehicles. This approach explores a more balanced policy tree.

The forward simulation in MPDM and EUDM uses very simplistic driver models (IDM for longitudinal simulation and Pure Pursuit for lateral simulation). MPDM points out that high fidelity realism matters less than the closed-loop nature itself, as long as policy-level decisions are not affected by low-level action execution inaccuracies.
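Below is a toy sketch of this closed-loop rollout idea: standard IDM for longitudinal control inside an MPDM-style policy evaluation loop. The policies, parameters, scenario, and scoring are all invented, and lateral control (Pure Pursuit) is omitted to keep the sketch short.

```python
import math

def idm_accel(v, v_lead, gap, v0=15.0, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4):
    """Intelligent Driver Model: longitudinal acceleration from speed and gap."""
    s_star = max(0.0, s0 + v * T + v * (v - v_lead) / (2 * math.sqrt(a_max * b)))
    return a_max * (1 - (v / v0) ** delta - (s_star / max(gap, 0.1)) ** 2)

def rollout(ego_policy, lead_speed, horizon=10.0, dt=0.2):
    """Closed-loop forward simulation of one (ego policy, agent policy) pair.
    1D along the lane: 'lane_keep' follows the lead vehicle, 'lane_change'
    is crudely modeled as free flow with no lead (lateral motion omitted)."""
    ego_s, ego_v, lead_s = 0.0, 10.0, 30.0
    cost = 0.0
    for _ in range(int(horizon / dt)):
        if ego_policy == "lane_keep":
            a = idm_accel(ego_v, lead_speed, lead_s - ego_s)
        else:                                  # lane_change: no lead ahead
            a = idm_accel(ego_v, ego_v, 1e6)   # free-flow term only
            cost += 0.05                       # small fixed lane-change cost
        ego_v = max(0.0, ego_v + a * dt)
        ego_s += ego_v * dt
        lead_s += lead_speed * dt
        cost += 0.1 * a * a - 0.01 * ego_v     # comfort vs progress trade-off
        if ego_policy == "lane_keep" and 0.0 < lead_s - ego_s < 2.0:
            cost += 1e3                        # near-collision penalty
    return cost

# Multi-policy: simulate each ego policy against each predicted agent policy,
# then pick the ego policy with the best worst-case outcome.
ego_policies = ["lane_keep", "lane_change"]
lead_speed_hypotheses = [6.0, 12.0]            # two predicted agent policies
best = min(ego_policies,
           key=lambda p: max(rollout(p, v) for v in lead_speed_hypotheses))
print(best)
```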

The conceptual diagram of decision making, where prediction, BP and MP integrate tightly (chart created by author)

Contingency planning in the context of autonomous driving involves generating multiple potential trajectories to account for various possible future scenarios. A key motivating example is that experienced drivers anticipate multiple future scenarios and always plan for a safe backup plan. This anticipatory approach leads to a smoother driving experience, even when cars perform sudden cut-ins into the ego lane.

A critical aspect of contingency planning is deferring the decision bifurcation point. This means delaying the point at which different potential trajectories diverge, allowing the ego vehicle more time to gather information and respond to different outcomes. By doing so, the vehicle can make more informed decisions, resulting in smoother and more confident driving behaviors, similar to those of an experienced driver.

Risk-aware contingency planning (source: MARC, RAL 2023)

One possible drawback of MPDM and all its follow-up works is their reliance on simple policies designed for highway-like structured environments, such as lane keeping and lane changing. This reliance may limit the capability of forward simulation to handle complex interactions. To address this, following the example of MPDM, the key to making POMDPs more effective is to simplify the action and state space through the growth of a high-level policy tree. It might be possible to create a more flexible policy tree, for example, by enumerating spatiotemporal relative position tags to all relative objects and then performing guided branching.

Decision-making remains a hot topic in current research. Even classical optimization methods have not been fully explored yet. Machine learning methods could shine and have a disruptive impact, especially with the advent of Large Language Models (LLMs), empowered by techniques like Chain of Thought (CoT) or Monte Carlo Tree Search (MCTS).

Trees

Trees are systematic ways to perform decision-making. Tesla AI Day 2021 and 2022 showcased their decision-making capabilities, heavily influenced by AlphaGo and the subsequent MuZero, to address highly complex interactions.

At a high level, Tesla’s approach follows behavior planning (decision making) followed by motion planning. It searches for a convex corridor first and then feeds it into continuous optimization, using spatiotemporal joint planning. This approach effectively addresses scenarios such as narrow passing, a typical bottleneck for path-speed decoupled planning.

Neural network heuristics guided MCTS (source: Tesla AI Day 2021)

Tesla also adopts a hybrid system that combines data-driven and physics-based checks. Starting with defined goals, Tesla’s system generates seed trajectories and evaluates key scenarios. It then branches out to create more scenario variants, such as asserting or yielding to a traffic agent. Such an interaction search over the policy tree is showcased in the presentations of the years 2021 and 2022.

One highlight of Tesla’s use of machine learning is the acceleration of tree search via trajectory optimization. For each node, Tesla uses physics-based optimization and a neural planner, achieving a 10 ms vs. 100 µs time frame — resulting in a 10x to 100x improvement. The neural network is trained with expert demonstrations and offline optimizers.

Trajectory scoring is performed by combining classical physics-based checks (such as collision checks and comfort analysis) with neural network evaluators that predict intervention likelihood and rate human-likeness. This scoring helps prune the search space, focusing computation on the most promising outcomes.

While many argue that machine learning should be applied to high-level decision-making, Tesla uses ML fundamentally to accelerate optimization and, consequently, tree search.

The Monte Carlo Tree Search (MCTS) method appears to be an ultimate tool for decision-making. Interestingly, those studying Large Language Models (LLMs) are trying to incorporate MCTS into LLMs, while those working on autonomous driving are attempting to replace MCTS with LLMs.

As of roughly two years ago, Tesla’s technology followed this approach. However, since March 2024, Tesla’s Full Self-Driving (FSD) has switched to a more end-to-end approach, significantly different from their earlier methods.

We can still consider interactions without explicitly growing trees. Ad-hoc logic can be implemented to perform one-order interaction between prediction and planning. Even one-order interaction can already generate good behavior, as demonstrated by TuSimple. MPDM, in its original form, is essentially one-order interaction, but executed in a more principled and extendable way.

Multi-order interaction between prediction and planning (source: TuSimple AI Day, in Chinese, translated by author)

TuSimple has also demonstrated the capability to perform contingency planning, similar to the approach proposed in MARC (though MARC can also accommodate a customized risk preference).

Contingency planning (source: TuSimple AI Day, in Chinese, translated by author)

After learning the basic building blocks of classical planning systems, including behavior planning, motion planning, and the principled way to handle interaction through decision-making, I have been reflecting on potential bottlenecks in the system and how machine learning (ML) and neural networks (NN) may help. I am documenting my thought process here for future reference and for others who may have similar questions. Note that the information in this section may contain personal biases and speculations.

Let’s look at the problem from three different perspectives: in the existing modular pipeline, as an end-to-end (e2e) NN planner, or as an e2e autonomous driving system.

Going back to the drawing board, let’s review the problem formulation of a planning system in autonomous driving. The goal is to obtain a trajectory that ensures safety, comfort, and efficiency in a highly uncertain and interactive environment, all while adhering to real-time engineering constraints onboard the vehicle. These factors are summarized as goals, environments, and constraints in the chart below.

The potentials of NN in planning (chart made by author)

Uncertainty in autonomous driving can refer to uncertainty in perception (observation) and predicting long-term agent behaviors into the future. Planning systems must also handle the uncertainty in future trajectory predictions of other agents. As discussed earlier, a principled decision-making system is an effective way to manage this.

Additionally, a typically overlooked aspect is that planning must tolerate uncertain, imperfect, and sometimes incomplete perception results, especially in the current age of vision-centric and HD map-less driving. Having a Standard Definition (SD) map onboard as a prior helps alleviate this uncertainty, but it still poses significant challenges to a heavily handcrafted planner system. This perception uncertainty was considered a solved problem by Level 4 (L4) autonomous driving companies through the heavy use of Lidar and HD maps. However, it has resurfaced as the industry moves toward mass production autonomous driving solutions without these two crutches. An NN planner is more robust and can handle largely imperfect and incomplete perception results, which is key to mass production vision-centric and HD-mapless Advanced Driver Assistance Systems (ADAS).

Interaction should be treated with a principled decision-making system such as Monte Carlo Tree Search (MCTS) or a simplified version of MPDM. The main challenge is dealing with the curse of dimensionality (combinatorial explosion) by growing a balanced policy tree with smart pruning through domain knowledge of autonomous driving. MPDM and its variants, in both academia and industry (e.g., Tesla), provide good examples of how to grow this tree in a balanced way.

NNs can also enhance the real-time performance of planners by speeding up motion planning optimization. This can shift the compute load from CPU to GPU, achieving orders of magnitude speedup. A tenfold increase in optimization speed can fundamentally impact high-level algorithm design, such as MCTS.

Trajectories also need to be more human-like. Human-likeness and takeover predictors can be trained with the vast amount of human driving data available. It is more scalable to increase the compute pool than to maintain a growing army of engineering talent.

The NN-based planning stack can leverage human-driving data more effectively (Chart created by author)

An end-to-end (e2e) neural network (NN) planner still constitutes a modular autonomous driving (AD) design, accepting structured perception results (and potentially latent features) as its input. This approach combines prediction, decision, and planning into a single network. Companies such as DeepRoute (2022) and Huawei (2024) claim to utilize this method. Note that relevant raw sensor inputs, such as navigation and ego vehicle information, are omitted here.

A full autonomous driving stack with an e2e planner (chart made by author)

This e2e planner can be further developed into an end-to-end autonomous driving system that combines both perception and planning. This is what Wayve’s LINGO-2 (2024) and Tesla’s FSDv12 (2024) claim to achieve.

The benefits of this approach are twofold. First, it addresses perception issues. There are many aspects of driving that we cannot easily model explicitly with commonly used perception interfaces. For example, it is quite challenging to handcraft a driving system to nudge around a puddle of water or slow down for dips or potholes. While passing intermediate perception features might help, it may not fundamentally resolve the issue.

Additionally, emergent behavior will likely help resolve corner cases more systematically. The intelligent handling of edge cases, such as the examples above, may result from the emergent behavior of large models.

A full autonomous driving stack with a one-model e2e driver (chart made by author)

My speculation is that, in its ultimate form, the end-to-end (e2e) driver would be a large vision and action-native multimodal model enhanced by Monte Carlo Tree Search (MCTS), assuming no computational constraints.

A world model in autonomous driving, as of 2024 consensus, is typically a multimodal model covering at least vision and action modes (or a VA model). While language can be beneficial for accelerating training, adding controllability, and providing explainability, it is not essential. In its fully developed form, a world model would be a VLA (vision-language-action) model.

There are at least two approaches to developing a world model:

  1. Video-Native Model: Train a model to predict future video frames, conditioned on or outputting accompanying actions, as demonstrated by models like GAIA-1.
  2. Multimodality Adaptors: Start with a pretrained Large Language Model (LLM) and add multimodality adaptors, as seen in models like Lingo-2, RT2, or ApolloFM. These multimodal LLMs are not native to vision or action but require significantly less training resources.

A world model can produce a policy itself through the action output, allowing it to drive the vehicle directly. Alternatively, MCTS can query the world model and use its policy outputs to guide the search. This World Model-MCTS approach, while much more computationally intensive, could have a higher ceiling in handling corner cases due to its explicit reasoning logic.

Can we do without prediction?

Most current motion prediction modules represent the future trajectories of agents other than the ego vehicle as one or multiple discrete trajectories. It remains a question whether this prediction-planning interface is sufficient or necessary.

In a classical modular pipeline, prediction is still needed. However, a predict-then-plan pipeline definitely caps the upper limit of autonomous driving systems, as discussed in the decision-making section. A more critical question is how to integrate this prediction module more effectively into the overall autonomous driving stack. Prediction should aid decision-making, and a queryable prediction module within an overall decision-making framework, such as MPDM and its variants, is preferred. There are no severe issues with concrete trajectory predictions as long as they are integrated correctly, such as through policy tree rollouts.

Another issue with prediction is that open-loop Key Performance Indicators (KPIs), such as Average Displacement Error (ADE) and Final Displacement Error (FDE), are not effective metrics as they fail to reflect the impact on planning. Instead, metrics like recall and precision at the intent level should be considered.
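For reference, these open-loop metrics are trivial to compute, which is part of both their appeal and their weakness; a numpy sketch, assuming trajectories of shape (T, 2):

```python
import numpy as np

def ade_fde(pred, gt):
    """Average / Final Displacement Error between a predicted and a
    ground-truth trajectory, each of shape (T, 2), in meters."""
    dists = np.linalg.norm(pred - gt, axis=-1)   # per-timestep L2 distance
    return dists.mean(), dists[-1]

pred = np.array([[0, 0], [1.0, 0.1], [2.0, 0.4]])
gt   = np.array([[0, 0], [1.0, 0.0], [2.0, 0.0]])
print(ade_fde(pred, gt))   # low ADE/FDE, yet says nothing about planning impact
```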

In an end-to-end system, an explicit prediction module may not be necessary, but implicit supervision — along with other domain knowledge from a classical stack — can definitely help or at least boost the data efficiency of the learning system. Evaluating the prediction behavior, whether explicit or implicit, will also be helpful in debugging such an e2e system.

Conclusions first: for an assistant, neural networks (nets) can achieve very high, even superhuman, performance. For agents, I believe that using a tree structure is still beneficial (though not necessarily a must).

First of all, trees can boost nets. Trees enhance the performance of a given network, whether it’s NN-based or not. In AlphaGo, even with a policy network trained via supervised learning and reinforcement learning, the overall performance was still inferior to the MCTS-based AlphaGo, which integrates the policy network as one component.

Second, nets can distill trees. In AlphaGo, MCTS used both a value network and the reward from a fast rollout policy network to evaluate a node (state or board position) in the tree. The AlphaGo paper also mentioned that while a value function alone could be used, combining the results of the two yielded the best results. The value network essentially distilled the knowledge from the policy rollout by directly learning the state-value pair. This is akin to how humans distill the logical thinking of the slow System 2 into the fast, intuitive responses of System 1. Daniel Kahneman, in his book “Thinking, Fast and Slow,” describes how a chess master can quickly recognize patterns and make rapid decisions after years of practice, whereas a novice would require significant effort to achieve similar results. Similarly, the value network in AlphaGo was trained to provide a fast evaluation of a given board position.

Grandmaster-Level Chess Without Search (source: DeepMind, 2024)

Recent papers explore the upper limits of this fast system with neural networks. The “chess without search” paper demonstrates that with sufficient data (prepared through tree search using a conventional algorithm), it is possible to achieve grandmaster-level proficiency. There is a clear “scaling law” related to data size and model size, indicating that as the amount of data and the complexity of the model increase, so does the proficiency of the system.

So here we are with a power duo: trees boost nets, and nets distill trees. This positive feedback loop is essentially what AlphaZero uses to bootstrap itself to reach superhuman performance in multiple games.

The same principles apply to the development of large language models (LLMs). For games, since we have clearly defined rewards as wins or losses, we can use forward rollout to determine the value of a certain action or state. For LLMs, the rewards are not as clear-cut as in the game of Go, so we rely on human preferences to rate the models via reinforcement learning with human feedback (RLHF). However, with models like ChatGPT already trained, we can use supervised fine-tuning (SFT), which is essentially imitation learning, to distill smaller yet still powerful models without RLHF.

Returning to the original question, nets can achieve extremely high performance with large quantities of high-quality data. This could be good enough for an assistant, depending on the tolerance for errors, but it may not be sufficient for an autonomous agent. For systems targeting driving assistance (ADAS), nets via imitation learning may be adequate.

Trees can significantly boost the performance of nets with an explicit reasoning loop, making them perhaps more suitable for fully autonomous agents. The extent of the tree or reasoning loop depends on the return on investment of engineering resources. For example, even one order of interaction can provide substantial benefits, as demonstrated in TuSimple AI Day.

From the summary below of the hottest representatives of AI systems, we can see that LLMs are not designed to perform decision-making. In essence, LLMs are trained to complete documents, and even SFT-aligned LLM assistants treat dialogues as a special type of document (completing a dialogue record).

Representative AI products as of 2024 (chart made by author)

I do not fully agree with recent claims that LLMs are slow systems (System 2). They are unnecessarily slow in inference due to hardware constraints, but in their vanilla form, LLMs are fast systems as they cannot perform counterfactual checks. Prompting techniques such as Chain of Thought (CoT) or Tree of Thoughts (ToT) are actually simplified forms of MCTS, making LLMs function more like slower systems.

There is extensive research trying to integrate full-blown MCTS with LLMs. Specifically, LLM-MCTS (NeurIPS 2023) treats the LLM as a commonsense “world model” and uses LLM-induced policy actions as a heuristic to guide the search. LLM-MCTS outperforms both MCTS alone and policies induced by LLMs by a wide margin for complex, novel tasks. The highly speculated Q-star from OpenAI seems to follow the same approach of boosting LLMs with MCTS, as the name suggests.

Below is a rough evolution of the planning stack in autonomous driving. It is rough as the listed solutions are not necessarily more advanced than the ones above, and their debut may not follow the exact chronological order. Nonetheless, we can observe general trends. Note that the listed representative solutions from the industry are based on my interpretation of various press releases and could be subject to error.

One trend is the movement towards a more end-to-end design with more modules consolidated into one. We see the stack evolve from path-speed decoupled planning to joint spatiotemporal planning, and from a predict-then-plan system to a joint prediction and planning system. Another trend is the increasing incorporation of machine learning-based components, especially in the last three stages. These two trends converge towards an end-to-end NN planner (without perception) or even an end-to-end NN driver (with perception).

A rough history of evolution of planning (Chart made by author)
  • ML as a Tool: Machine learning is a tool, not a standalone solution. It can assist with planning even in current modular designs.
  • Full Formulation: Start with a full problem formulation, then make reasonable assumptions to balance performance and resources. This helps create a clear direction for a future-proof system design and allows for improvements as resources increase. Recall the transition from POMDP’s formulation to engineering solutions like AlphaGo’s MCTS and MPDM.
  • Adapting Algorithms: Theoretically beautiful algorithms (e.g., Dijkstra and Value Iteration) are great for understanding concepts but need adaptation for practical engineering (Value Iteration to MCTS as Dijkstra’s algorithm to Hybrid A-star).
  • Deterministic vs. Stochastic: Planning excels in resolving deterministic (not necessarily static) scenes. Decision-making in stochastic scenes is the most challenging task toward full autonomy.
  • Contingency Planning: This can help merge multiple futures into a common action. It’s beneficial to be aggressive to the degree that you can always resort to a backup plan.
  • End-to-end Models: Whether an end-to-end model can solve full autonomy remains unclear. It may still need classical methods like MCTS. Neural networks can handle assistants, while trees can manage agents.



Source link

28Jun

RAG Survey & Available Research Overview


RAG Survey & Available Research

A Survey on Retrieval-Augmented Text Generation for Large Language Models

Recap On RAG

Retrieval-Augmented Generation (RAG) combines retrieval methods with In-Context Learning (ICL) & Natural Language Generation (NLG) to overcome the static knowledge limitations of large language models (LLMs) by integrating dynamic external information.

This approach primarily targets the text domain and offers a cost-effective solution to mitigate the generation of plausible but incorrect responses by LLMs, thus improving their accuracy and reliability through the use of real-world data.

RAG is categorised into four key stages:
Pre-retrieval,
Retrieval,
Post-retrieval, &
Generation.

The survey also introduces evaluation methods for RAG, addressing the challenges faced and suggesting future research directions. By offering a structured framework and categorisation, the study aims to consolidate existing research on RAG, clarify its technological foundations, and emphasise its potential to expand the adaptability and applications of LLMs.

The paper highlights how RAG can dynamically integrate up-to-date information to enhance the performance of LLMs, making them more reliable and effective in generating accurate responses, thereby broadening their practical uses in various domains.

The image below shows a basic RAG workflow, together with the sub-components linked to the four RAG steps.

The image below contains a list of all the existing research for each of the RAG components.






Source link

28Jun

Regulating the Virtual World as a new State – European Law Blog


By Annelieke Mooij and Jip Tushuizen

Blogpost 32/2024

The European Commission has recently published an initiative that aims to regulate virtual worlds and Web 4.0 which is structured around the objectives of the Digital Decade policy programme. Virtual reality (VR) is a relatively old concept that was introduced primarily through gaming environments but given a new meaning through the introduction of the “Metaverse”. The Metaverse allows users to enter an immersive virtual reality that offers relaxation, education or an office environment. The wide variety of virtual realities that are part of the Metaverse brings expected use to new levels. It is estimated that, by 2026, 25% of the global population will spend at least one hour a day in the Metaverse for the purposes of either work, shopping, education or entertainment. Unlike current online stores or movie platforms, the Metaverse will provide a 3D immersive environment where users can interact with other users. Companies like Apple, Google, Roblox and Microsoft have made significant investments, with the total market size expected to hit 800 billion US dollars by 2030, potentially contributing 2.8% to global GDP in the tenth year after its creation.

Interaction with other users has been proven to produce positive, but also very negative, virtual experiences, sometimes even amounting to virtual rape. Victims have stated that whilst the act was virtual, the emotional damage was physical. VR technology has improved since 1993, when the first virtual rape occurred. It can now be realistic enough to confuse the human body into treating it as reality, impacting both our conscious and subconscious emotional state. Immersive environments further have a significant impact on users’ vision of the world. For example, gamers who are continuously confronted with oversexualized female avatars are more likely to tolerate sexual harassment and to support rape myths.

Regulating the Metaverse hence does not seem an unnecessary luxury. This post will argue that the current regulatory approach under the Digital Services Act is insufficient. Whilst new regulation is highly desirable, it should not extend to provide de facto statehood to Metaverse providers.

 

Regulatory choices in the EU

The European Commission is currently working on a new legislative proposal to regulate virtual realities. While the initiative is still in its infancy, it concretely puts forward four pillars. The most important from a regulatory perspective is the third pillar: government. The Commission is not clear on how it intends to regulate the virtual worlds, but it refers to the applicability of the Digital Services Act (DSA). The DSA’s approach is primarily focused on the transparency of the terms & conditions and complaint procedures, but it does not regulate content. It determines applicability of fundamental rights (see e.g. Art. 1(1)) but fails to provide concrete elaboration. It further considers that content flagged as ‘illegal’ should be appropriately taken care of, but only refers to Union law and national law of Member States for the exact definition of what exactly constitutes ‘illegal content’ (see e.g. Arts. 16 and 3(h)). Harmful content is furthermore excluded from this regime.

The counter-model to the DSA’s regulatory approach, so far not considered by the Commission in its Initiative, would be an emphasis on content regulation, whereby providers have to allow all speech without discrimination. Speech could only be limited when it is prohibited by law. This type of approach severely limits the freedom to conduct a business (Art. 16 CFREU) as all virtual realities are de facto regulated as public spaces. Nevertheless, this approach is considered to contribute to a safe digital environment. It would, however, entail assigning legal duties and limits on private legal persons that closely resemble those of a State. A legal person would have to monitor and effectively enforce the fundamental rights of its users. In this monitoring, the provider arguably becomes an extension of the State’s police. Similarly, virtual worlds can install their own dispute resolution proceedings. Increasing regulatory responsibilities for the Metaverse providers could reach a point where they are de facto mini-States. Whilst this approach may increase digital safety, it raises the question of whether we could and should think of virtual realities as the new State.

 

Human rights and the Metaverse

Earlier generations of the internet were expected to produce substantial societal benefits by facilitating more efficient communication infrastructures. However, the destructive force of the internet has arguably turned out greater than initially anticipated with its ability to foster strong polarization, spread misinformation and reinforce pre-existing patterns of oppression. An example of the latter can be found on Facebook, with the platform notoriously punishing black women’s comments speaking out against racism and sexism. In fact, Facebook’s algorithms have targeted hate speech directed at white persons disproportionately compared to hate speech directed at any other societal group. Platform policies seeking to protect marginalized communities hence actually reinforce marginalization. Algorithms further generally consider the white male as the default, which resurfaced when Amazon had to discontinue using an AI hiring tool which rendered resumes containing variations of the word “women’s” as less desirable.

With further development of newer generations of the internet facilitating entirely virtual spaces, the foregoing issues will be aggravated exponentially if regulation is insufficient or misguided. In fact, it has already been established that users of existing virtual spaces struggle with reporting mechanisms. Users describe that it is often difficult to identify the speaker, that usernames are not easily traceable and that it is relatively difficult for a new user to figure out how to report harassment. The definition of “online harassment” is further highly subjective. Harassment within a virtual space is experienced much more intensely by some identities than others; besides, full embodiment and presence within a virtual space facilitate a far more intense experience. It logically follows that users choose to customize their avatar in a way that reflects an identity that is subjected to the least amount of harassment, rather than have their avatar reflect their own physical identity. As a person of colour has pointed out: “Since I can choose to get treated like a black person or not get treated like a black person—I’m probably going to choose not to get treated like a black person.”

Where one identity is deemed more “favourable” than the other, it logically follows that Metaverse spaces risk being overrepresented by identities rendered more “favourable” compared to others. Not only does this inherently communicate a narrative of desirability, it also projects a remarkably one-sided view of the world. Such a one-sided projection of reality unarguably runs the risk of seriously enhancing existing patterns of oppression towards minority groups both virtually and physically.

 

Human rights obligations of States vs companies

The modern conceptualization of Statehood is defined by the Westphalian system, identifying State sovereignty and the principle of territorial integrity as the foundations for the international legal system since 1648. Consequently, international human rights law is traditionally premised on the assumption that the sovereign State as the quintessential bearer of international obligations is responsible for the protection of fundamental rights within its territory. This logic firstly insinuates a hierarchy between the “oppressive” sovereign on the one hand and the citizen requiring protection from this oppression on the other. Secondly, this Westphalian logic is premised on the notion that the sovereign State is the exclusive actor within a legal system that is capable of wielding oppressive power against an individual.

Crucially, corporations are not, or at least not directly, subjected to international human rights obligations as it is the State that is burdened with this responsibility. Currently, companies merely face the moral responsibility to conduct a process of assessing, preventing and mitigating existing and potential adverse human rights impacts of operations across their supply chain. However, this process of human rights due diligence is derived from a soft law mechanism which does not produce legally binding obligations. Whilst the EU legislator has recently adopted a legally binding framework, emphasis remains on the avoidance of contribution to human rights violations rather than a responsibility to actively safeguard human rights protection across business operations.

 

The oppressive corporation

The traditional idea of the State monopoly on power and coercion has been proven to hold less relevance for today’s realities, with surveillance tasks increasingly becoming fragmented across various public and private actors. In fact, the idea of assigning State-like regulatory duties to private companies is far from modern, with former colonial companies like the Dutch and English East and West India companies being granted sovereign powers ranging from the right to form colonies to the right to use force. Interpreting the concept of ‘power’ in a broader sense, namely the ability to create or destroy wealth within a system, it follows that this trend undeniably mirrors today’s realities, with corporations representing 69 out of the top 100 largest economic entities globally in 2015.

With citizens increasingly practicing their daily needs and responsibilities in the Metaverse, the question to what extent this virtual world then factually still differs from life in a nation State is not far-fetched. Metaverse operators, predominantly represented by white or Asian non-queer men, can decide who gets to enter their virtual space and what type of behavior is deemed desirable. While the DSA mentions the applicability of fundamental rights to the regulation of online platforms, it is still questionable how this precisely plays out in practice. For example, the question arises whether operators can exclude certain identities from their virtual space without a valid cause. Upon entry, a user is obliged to accept the rules and guidelines of the platform. If the user disagrees, it is still uncertain to what extent these guidelines could effectively be challenged in a court. Users are left with the option of either agreeing and signing away their rights or disagreeing with subsequent exclusion from the platform. Such corporate policies are therefore capable of imposing restrictions on the user’s fundamental rights protection that undeniably resemble the intrusive character of regulatory decisions taken by the nation State.

 

Corporate sovereignty?

Accordingly, the corporate creator of the virtual space increasingly assumes the factual position of a regulatory actor with consequences that reach considerably further than previously seen. It takes on an authoritative role that inherently insinuates a hierarchy towards its users which mirrors the hierarchical position of the State against its citizens. Mark Zuckerberg has already indicated that he considers Facebook as a government with the policies it is developing and the number of users it has gathered. The company even announced the introduction of its own digital currency: the Libra.

Apart from a government, a recognized State under international law possesses a permanent population, a defined territory and the capacity to enter into relations with other States. The population of a Metaverse consists of its users, with the distinct virtual space providing for a defined territory these users can inhabit. Some argue that the sovereignty of the company is based on data rather than territory, rendering the boundaries of this sovereignty rather fluid. Metaverse companies could further enter into agreements of interoperability with other companies which determine the conditions based on which users and their data could ‘travel’ from one virtual space to the other. Yet, the extent to which these criteria apply to companies remains highly debatable. Indeed, corporate actors are not authorized to exercise physical coercion against citizens or collect taxes. While the latter issue could reasonably be refuted by the argument that the collection of data largely equates to the collection of taxes due to their monetizable character, or by selling data storage plans based on the amount of virtual goods a user wishes to store, the argument remains that corporate sovereignty inherently takes on a different form than State sovereignty. This becomes more apparent when considering that States and companies inherently project different narratives onto their target audience, with the former employing a vocabulary of citizenry while the latter considers its subordinates as ‘customers’ with the subsequent prioritization of commodification over human autonomy.

Nevertheless, the foregoing proves that Metaverse operators are factually exercising regulatory actions that mirror those of a State. Scholars draw an analogy with the financial principle of ‘same activity, same regulation’, prioritizing a logic of assigning regulatory duties based on an actor’s conduct rather than their status. In the context of a Metaverse, the overwhelming majority of power and factual control over the virtual space is likely assigned to one or a few dominant actors. Evidently, the extent to which the sovereign State can then still exercise factual control over this space that is entirely detached from State borders is severely limited. Subsequently, the ability of the regulatory approach taken under the DSA to effectively regulate such Metaverse spaces is highly questionable.

 

Conclusion

The development of Metaverse spaces undeniably creates promising societal benefits. Yet, as seen with the regulation of Web 2.0, the stakes for the Commission’s Web 4.0 initiative are exceptionally high. It is crucial to be ahead of the developments in order to prevent power balances between States and private corporations from shifting drastically. If the issue of human rights protection continues to be overlooked by the initiative, the possibility of an all-powerful Metaverse operator arising, or possibly even a “Virtual Wild West”, becomes increasingly realistic. While legislative efforts provide promising frameworks, further elaboration on the human rights duties of companies is crucial to facilitate a responsible transition into the virtual space. While it is largely indisputable that rendering Metaverse platforms entirely sovereign States is undesirable and unrealistic, it is essential to assign responsibilities that mirror the factual position and regulatory actions of operators. Yet, the EU legislator will have no easy task in determining to what extent such duties should be assigned to providers and what form these duties should take.



Source link

27Jun

Classification Loss Functions: Intuition and Applications | by Ryan D’Cunha | Jun, 2024


A simpler way to understand derivations of loss functions for classification and when/how to apply them in PyTorch

Source: GPT4o Generated

Whether you are new to exploring neural networks or a seasoned pro, this should be a beneficial read to gain more intuition about loss functions. As someone testing many different loss functions during model training, I would get tripped up on small details between functions. I spent hours researching an intuitive depiction of loss functions from textbooks, research papers, and videos. I wanted to share not only the derivations that helped me grasp the concepts, but common pitfalls and use cases for classification in PyTorch.

Before we get started, we need to define some basic terms I will be using.

  • Training dataset: {xᵢ, yᵢ}
  • Loss function: L[φ]
  • Model prediction output f[xᵢ, φ] with parameters φ
  • Conditional probability: Pr(y|x)
  • Parametric distribution: Pr(y|ω) with ω representing network parameters for distribution over y

Let’s first go back to the basics. A common thought is that neural networks compute a scalar output from the model f[xᵢ, φ]. However, most neural networks these days are trained to predict the parameters of a distribution over y (as opposed to predicting the value of y directly).

In reality, a network will output a conditional probability distribution Pr(y|x) over possible outputs y. In other words, every input data point will lead to a probability distribution generated for each output. The network wants to learn the parameters for the probability distribution and then use the parameters and distribution to predict the output.

The traditional definition of a loss function is a function that compares target and predicted outputs. But we just said a network’s raw output is a distribution instead of a scalar output, so how is this possible?

Thinking about this from the view we just defined, a loss function pushes each yᵢ to have a higher probability in the distribution Pr(yᵢ|xᵢ). The key part to remember is that our distribution is being used to predict the true output based on parameters from our model output. Instead of using our input xᵢ for the distribution, we can think of a parametric distribution Pr(y|ω) where ω represents probability distribution parameters. We are still considering the input, but there will be a different ωᵢ = f[xᵢ, φ] for each xᵢ.

Note: To clarify a confusing concept, φ represents the model parameters and ω represents the probability distribution parameters

Going back to the traditional definition of a loss function, we need to get an output we can use from the model. From our probability distribution, it seems logical to take the φ that produces the greatest probability for each xᵢ. Thus, we need the overall φ that produces the greatest probability across all I training points (all derivations are adapted from Understanding Deep Learning [1]):

Maximizing parameters from output model probability distributions [1]

We multiply the probabilities generated from each distribution to find the φ that produces the maximum probability (called maximum likelihood). In order to do this, we must assume the data is independent and identically distributed. But now we run into a problem: what if the probabilities are very small? The product will approach 0 (similar to a vanishing gradient issue), and floating-point arithmetic may underflow and be unable to represent such small numbers.

To fix this, we bring in a logarithmic function! Utilizing the properties of logs, we can add our log-probabilities together instead of multiplying the probabilities. Because the logarithm is a monotonically increasing function, the φ that maximizes the log of the objective is the same φ that maximizes the original objective.

Using logarithms to add probabilities [1]

The last thing we need to get the traditional negative log-likelihood is to turn the maximization into a minimization. Simply multiply by −1 and take the minimizing argument (think through some graphical examples to convince yourself of this):

Negative Log-Likelihood [1]
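Putting the three steps together in the article’s notation (this is the standard form of the derivation from [1]):

φ̂ = argmax_φ ∏ᵢ Pr(yᵢ | f[xᵢ, φ])
  = argmax_φ Σᵢ log Pr(yᵢ | f[xᵢ, φ])
  = argmin_φ [ −Σᵢ log Pr(yᵢ | f[xᵢ, φ]) ]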

Just by visualizing the model output as a probability distribution, attempting to maximize φ that creates the max probability, and applying a log, we have derived negative log-likelihood loss! This can be applied to many tasks by choosing a logical probability distribution. Common classification examples are shown below.

If you are wondering how a scalar output is generated from the model during inference, it is simply the output with the highest probability under the distribution:

Generating an output from inference [1]
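In symbols, the inference rule is simply ŷ = argmax over y of Pr(y | f[x, φ̂]), where φ̂ denotes the trained parameters.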

Note: This is just a derivation of negative log-likelihood. In practice, there will most likely be regularization present in the loss function too.

Up to this point, we have derived negative log-likelihood. It is important to know, but it can be found in most textbooks or online resources. Now, let’s apply this to classification to understand its application.

Side note: If you are interested in seeing this applied to regression, Understanding Deep Learning [1] has great examples with univariate regression and a Gaussian Distribution to derive Mean Squared Error

Binary Classification

The goal of binary classification is to assign an input x to one of two class labels y ∈ {0, 1}. We are going to use the Bernoulli distribution as our probability distribution of choice.

Mathematical Representation of Bernoulli Distribution. Image by Author

This is just a fancy way of expressing the probability that the output is true, but the equation is necessary to derive our loss function. We need the model f[x, φ] to output p to generate the predicted output probability. However, before we can plug p into the Bernoulli distribution, we need it to lie between 0 and 1 (so it is a valid probability). The function of choice for this is the sigmoid: σ(z)

Source: https://en.wikipedia.org/wiki/Sigmoid_function
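For reference, the standard logistic sigmoid is σ(z) = 1 / (1 + e⁻ᶻ), which maps any real z into (0, 1).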

A sigmoid will compress the output p to between 0 and 1. Therefore our input to Bernoulli will be p = σ(f[x, φ]). This makes our probability distribution:

New Probability Distribution with Sigmoid and Bernoulli. Image by Author

Going back to negative log-likelihood, we get the following:

Binary Cross Entropy. Image by Author
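Written out, substituting the sigmoid-parameterised Bernoulli into the negative log-likelihood gives the standard form:

L[φ] = −Σᵢ [ yᵢ log σ(f[xᵢ, φ]) + (1 − yᵢ) log(1 − σ(f[xᵢ, φ])) ]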

Look familiar? This is the binary cross entropy (BCE) loss function! The main intuition with this is understanding why a sigmoid is used. We have a scalar output and it needs to be scaled to between 0 and 1. There are other functions capable of this, but the sigmoid is the most commonly used.

BCE in PyTorch

When implementing BCE in PyTorch, there are a few tricks to watch out for. There are two different BCE functions in PyTorch: BCELoss() and BCEWithLogitsLoss(). A common mistake (that I have made) is incorrectly swapping the use cases.

BCELoss(): This torch function expects inputs that are ALREADY PROBABILITIES. The sigmoid must be applied to the model output before the loss is computed.

BCEWithLogitsLoss(): This torch function expects raw model outputs (logits). There is NO NEED TO APPLY A SIGMOID before the loss; the sigmoid is applied internally, in a more numerically stable way. At inference time, you still apply torch.sigmoid() to the logits to obtain probabilities.

This is especially important for transfer learning: even if you know the model was trained with BCE, make sure to use the variant that matches the model’s output. If not, you may accidentally apply a sigmoid twice, causing the network to not learn…

Once a probability is calculated using either function, it needs to be interpreted during inference. The probability is the model’s prediction of the likelihood of being true (class label of 1). Thresholding is needed to determine the cutoff probability of a true label. p = 0.5 is commonly used, but it’s important to test out and optimize different threshold probabilities. A good idea is to plot a histogram of output probabilities to see the confidence of outputs before deciding on a threshold.
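As a minimal sketch of how the two variants line up in practice (the logits and labels below are made-up values purely for illustration):

import torch
import torch.nn as nn

logits = torch.tensor([0.8, -1.2, 2.5])   # raw model outputs (no sigmoid)
labels = torch.tensor([1.0, 0.0, 1.0])    # ground-truth binary labels

# Option 1: BCEWithLogitsLoss takes raw logits; the sigmoid is applied
# internally (and in a numerically stable way).
loss_a = nn.BCEWithLogitsLoss()(logits, labels)

# Option 2: BCELoss takes probabilities, so the sigmoid must be applied first.
loss_b = nn.BCELoss()(torch.sigmoid(logits), labels)

print(loss_a, loss_b)  # identical values (up to floating-point error)

# Inference: convert logits to probabilities, then threshold.
probs = torch.sigmoid(logits)
preds = (probs > 0.5).float()  # 0.5 is a starting point; tune per task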

Multiclass Classification

The goal of multiclass classification is to assign an input x to one of K > 2 class labels y ∈ {1, 2, …, K}. We are going to use the categorical distribution as our probability distribution of choice.

Categorical Distribution. Image by Author

This is just assigning a probability to each class for a given input, and all probabilities must sum to 1. We need the model f[x, φ] to output p to generate the predicted output probabilities. As in binary classification, the raw outputs first need to be turned into valid probabilities, but now a new constraint arises: they must also sum to 1. A sigmoid will no longer work: it scales each class score to a probability individually, but there is no guarantee that all the probabilities will sum to 1. This may not be immediately apparent, but an example is shown:

Sigmoid does not generate probability distribution in multiclass classification. Image by Author

We need a function that can ensure both constraints. For this, a softmax is chosen. A softmax is an extension of a sigmoid, but it will ensure all the probabilities sum to 1.

Softmax Function. Image by Author
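For reference, the standard softmax is Sₖ(z) = exp(zₖ) / Σⱼ exp(zⱼ), so every output is positive and the outputs sum to 1.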

This means the probability distribution is a softmax applied to the model output. The likelihood of label k is: Pr(y = k|x) = Sₖ(f[x, φ]).
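A quick sketch (with made-up class scores) makes the difference concrete: sigmoid outputs are each valid probabilities but need not sum to 1, while softmax outputs form a valid distribution:

import torch

z = torch.tensor([2.0, 1.0, 0.5])  # made-up scores for a 3-class problem

sig = torch.sigmoid(z)
print(sig, sig.sum())        # each value is in (0, 1), but the sum is not 1 (here ≈ 2.23)

soft = torch.softmax(z, dim=0)
print(soft, soft.sum())      # a valid distribution: non-negative, sums to 1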

To derive the loss function for multiclass classification, we can plug the softmax and model output into the negative log-likelihood loss:

Multiclass Cross Entropy. Image by Author

This is the derivation for multiclass cross entropy. It is important to remember that the only term contributing to the loss is the probability of the true class. If you have seen cross entropy before, you are probably more familiar with a function of p(x) and q(x). This is identical to the cross entropy loss equation shown, where p(x) = 1 for the true class and 0 for all other classes, and q(x) is the softmax of the model output. An alternative derivation of cross entropy starts from KL divergence: treating one term as a Dirac delta on the true outputs and the other as the softmaxed model output leads to the same loss function.
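In symbols, the general cross entropy is H(p, q) = −Σₓ p(x) log q(x); with a one-hot p, every term vanishes except −log q(true class), which recovers the loss above.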

Cross Entropy in PyTorch

Unlike binary cross entropy, there is only one loss function for cross entropy in PyTorch. nn.CrossEntropyLoss expects the raw, unnormalized model outputs (logits) and applies the log-softmax internally, so do not apply a softmax yourself before the loss. Inference is performed by taking the class with the largest model output, which is the same as taking the class with the highest softmax probability.
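A minimal sketch (with made-up logits and targets) of the correct usage:

import torch
import torch.nn as nn

# 2 samples, 3 classes; raw model outputs with no softmax applied.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
targets = torch.tensor([0, 1])  # true class indices

loss = nn.CrossEntropyLoss()(logits, targets)  # log-softmax applied internally

# Inference: argmax over logits (same result as argmax over softmax).
preds = logits.argmax(dim=1)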

These were two well-studied classification examples. For a more complex task, it may take some time to decide on a loss function and probability distribution. There are many charts matching probability distributions with intended tasks, but there is always room to explore.

For certain tasks, it may be helpful to combine loss functions. A common use case is a classification task where it may be helpful to combine a [binary] cross entropy loss with a modified Dice coefficient loss. Most of the time, the loss functions are added together and scaled by some hyperparameter to control each individual function’s contribution to the loss. A sketch of this pattern is shown below.
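Here is one possible formulation as a sketch; the soft-Dice term and the weighting hyperparameter lam below are common choices, not a prescribed recipe:

import torch
import torch.nn as nn

def soft_dice_loss(probs, targets, eps=1e-6):
    # A common soft-Dice formulation: 1 - 2|P∩T| / (|P| + |T|).
    intersection = (probs * targets).sum()
    return 1 - (2 * intersection + eps) / (probs.sum() + targets.sum() + eps)

def combined_loss(logits, targets, lam=0.5):
    # lam controls the Dice term's contribution relative to BCE.
    bce = nn.BCEWithLogitsLoss()(logits, targets)
    dice = soft_dice_loss(torch.sigmoid(logits), targets)
    return bce + lam * dice

# Usage with made-up values:
logits = torch.tensor([0.8, -1.2, 2.5])
targets = torch.tensor([1.0, 0.0, 1.0])
print(combined_loss(logits, targets))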



Source link

26Jun

Predicting AI’s Impact on Work


GovAI research blog posts represent the views of their authors, rather than the views of the organisation.

Introduction

In the coming years, AI will impact many people’s jobs.

AI systems like ChatGPT and Claude can already perform a small but growing number of tasks. This includes, for example, drafting emails, producing illustrations, and writing simple computer programs.

Increasingly capable AI systems will produce increasingly significant economic effects. Automation will boost economic growth, but it will also disrupt labour markets: new jobs will be created and others will be lost. At a minimum, there will be immediate harm to workers who are forced to look for new work. Depending on how different groups are affected, inequality could rise or fall.

While automation is not a new phenomenon, some economists have suggested, controversially, that AI’s impacts might be more disruptive than those of previous labour-saving technologies. First, AI-driven automation could potentially happen faster. Second, at least in the long run, AI might more heavily substitute for human labour. The second concern means that, beyond some level of automation, the typical person’s ability to find well-paid work could actually begin to decline. However, there is still no consensus on what to expect.

If policymakers could more clearly foresee AI’s economic impacts, then they could more readily develop policies to mitigate harms and accelerate benefits. To this end, some researchers have begun to develop automation evaluations: forward-looking assessments of AI’s potential to automate work, as well as automation’s downstream impacts on labour markets.

All existing approaches to automation evaluations have major limitations. Researchers are not yet in a position to make reliable predictions.

Fortunately, though, there is a great deal of room to produce more informative evaluations. This post will discuss a number of promising directions for further work, such as adopting more empirically-grounded methods for estimating automation potential and leveraging the “staged release” of AI systems to study early real-world effects. As the uncertain economic impact of AI looms increasingly large, advancing the science of automation evaluations should be a policy priority.

The Importance of Automation Evaluations

Policymakers will need to address the economic impacts of AI to some extent. This may mean crafting policies to support high-growth industries, limit harm to displaced workers (for example, by supporting retraining efforts), or redistribute concentrated gains.

Reliable automation evaluations would give policymakers more time to plan and craft effective policies, by offering them foresight into AI’s economic impacts.1 Without this foresight, policymakers could find themselves scrambling to respond to impacts after-the-fact. A purely reactive approach could be particularly inadequate if AI’s impacts unfold unusually quickly.2

Current Automation Evaluation Methods

The potential automation impacts of AI can be evaluated in several ways, though existing approaches have important limitations.

In this post, I review two prominent methods: estimating “occupational exposure” to AI using task descriptions and measuring the task-level productivity impacts that workers get from using AI systems.3 The former helps give a broad but imprecise overview of potential labour market impacts across the economy. The latter provides more precise, but narrowly focused evidence on the effect of AI on particular occupations and tasks.

After describing these two approaches, along with their respective advantages and limitations, I will discuss limitations that are relevant to both methods. I will then outline a number of research directions that could help to mitigate these limitations.

Estimating Occupational Exposure to AI Using Task Descriptions

What is “occupational exposure”?

One way to evaluate potential automation impacts is to estimate occupational exposure to AI. While definitions vary across studies, a task is generally considered “exposed” to AI if AI can be meaningfully helpful for completing it. One job is then typically said to be “more exposed” to AI than another if a larger proportion of the tasks that make up this job are exposed.

For example, because AI has proven to be useful for tasks involving writing, and less useful for physical tasks, secretaries are likely to be more exposed to today’s AI systems than roofers.

Exposure is a flexible concept that has been operationalised in a range of ways. For example, in a recent Science paper, my co-authors and I define a task as “exposed” to AI systems like ChatGPT if these systems could double the productivity of the worker performing the task.

How is exposure currently estimated?

Exposure estimates are not typically based on empirical observations of AI being applied to tasks.

Instead, these estimates are produced by drawing on existing datasets — such as the United States Bureau of Labor Statistics’ O*NET database — that describe a wide range of worker tasks. These descriptions may then be given to AI experts, who are asked to apply their knowledge of AI to judge whether the task is exposed. Alternatively, experts may develop a grading rubric that classifies tasks as exposed or unexposed based on whether they have particular traits (e.g. whether the tasks are described as “physical” and “routine”). There are also a number of other possible variations on these approaches.4

Drawing on existing task descriptions is appealing, because it avoids the need to collect new data or perform costly experiments. As a result, these studies are often able to report exposure estimates for a large number of jobs across the economy. This can help identify broad patterns and macro-level findings that would not be clear if only a handful of occupations were considered.

How accurate are current exposure estimates?

To some extent, these exposure studies achieve breadth by sacrificing accuracy. We cannot expect perfect accuracy from estimates that are based only on descriptions of tasks.

The methodologies applied in these studies, particularly the earliest studies, have attracted a number of critiques. It has also been noted that different exposure studies produce conflicting estimates, which implies that at least some estimates must be substantially inaccurate.

On the other hand, some patterns have emerged across recent exposure studies. Arguably, some of them are also beginning to be empirically validated. For example, recent studies have consistently found that higher-wage work is more exposed to AI in the US. This finding matches a pattern in AI adoption data in the US: on average, industries with higher AI adoption rates also have higher average wages.

Ultimately, we do not yet know exactly how accurate description-based exposure estimates are or can become.

Measuring Task-Level Worker Productivity Impacts

A common alternative approach to automation evaluation is to measure the task-level productivity impacts of AI systems on workers. Using this method, researchers randomly assign access to an AI system to some workers but not to others. They then measure how much more or less productively workers with access to the AI system can perform various tasks compared to workers without access. 

Unlike description-based exposure estimates, these worker productivity impact estimates are based on empirical observation. They also attempt to provide information about exactly how useful AI is for a given task, rather than simply classifying a task as “exposed” or “not exposed.”

For example, one study reported a 40% time savings and 18% boost in quality on professional writing tasks. Another reported a 55.8% time savings for software developers working on a coding task using GitHub Copilot.5

Ultimately, these experiments can offer more reliable and fine-grained information about how useful AI is for performing a specific task. The chief limitation of these experiments, however, is that they can be costly to design and run, particularly if they are implemented in real-world work settings. As a result, unlike description-based exposure estimates, they are typically applied to individual occupations and small sets of tasks. They are therefore more limited in scope and do not provide the broad economy-wide insights that can be captured by occupational exposure studies.

Limitations of Existing Methods

These two automation evaluation methods both have distinct strengths and weaknesses. Description-based exposure studies sacrifice accuracy for breadth, while worker productivity studies offer greater accuracy but on a smaller scale. Researchers deciding between these methods therefore need to consider their priorities in light of a significant trade-off between accuracy and breadth.

While the two methods have distinct limitations, there is a third category of limitations that applies to both methods. Because of these limitations, neither method can be used to directly predict the impact of AI on real-world variables such as wages, employment, or growth.

In particular, neither approach can effectively predict or account for:

  • Barriers to AI adoption
  • Changes in demand for workers’ outputs
  • The complexity of real-world jobs
  • Future AI progress
  • New tasks and new ways of producing the same outputs

To accurately predict real-world impacts, further evidence and analysis are ultimately needed.

Neither method accounts for barriers to AI adoption or for changes in demand for workers’ outputs

Ultimately, these methods can only predict whether AI has the potential to affect an occupation in some significant way. They do not tell us that AI actually will have a significant impact. They also do not tell us whether the impact will be positive or negative for workers.

For example, if we learn that AI can boost productivity in some occupation, this does not mean that it actually will be widely adopted by that occupation within any given time frame. There may be important barriers that delay adoption, such as the need for employee training, process adjustments, or costly capital investments.

Even if AI does boost productivity within an occupation, the implications for wages and employment are not necessarily clear. For example, if copy editors become more productive, this may allow them to earn more money by completing more assignments. However, it may also cause them to earn less, since the amount they earn per assignment could decline as the overall supply of copy-editing grows. The net effect will depend, in part, on how much demand for copy-editing rises as prices fall. However, neither exposure studies nor worker productivity impact studies tell us anything about impacts on the demand for copy-editing services.

Neither method fully accounts for the complexity of real-world jobs

For the most part, both of the automation evaluation methods I have discussed treat jobs as collections of well-defined, isolated tasks.6 The evaluations consider how useful AI is for individual tasks, with the hope that researchers can easily make inferences about how AI will affect occupations that contain those tasks.

However, this approach overlooks the nuanced reality of jobs. Many occupations actually involve a complex web of interrelated tasks, interpersonal interactions, and contextual decisions. 

For example, even if an AI system can perform most of the individual tasks that are considered to be part of a particular worker’s job, this does not necessarily imply that the job can be automated to the point that the work is completely replaced by the technology. Furthermore, researchers cannot reliably judge how AI will impact a worker’s overall productivity if they do not understand how their workflow or role will shift in response to the adoption of AI.

Neither method accounts for future AI progress

The economic impact of AI will partly depend on how existing AI capabilities are applied throughout the economy. However, the further ahead we look, the more these impacts will depend on what new AI capabilities are developed.

Empirical worker productivity studies can only measure the impact of existing AI systems. Exposure studies typically ask analysts to judge how exposed various tasks are to existing AI capabilities. It will inevitably be harder to estimate exposure to future capabilities, when we do not yet know what these capabilities will be.

Neither method accounts for new tasks and new ways of producing the same outputs

The introduction of new technologies like the electric lightbulb or digital camera did not automate work by performing the same tasks that workers had previously performed in order to light a gas lamp or develop a photograph in a darkroom. Instead, these technologies completely changed the set of tasks that a worker needed to perform in order to produce the same or better-quality output (e.g. a brightly lit street lamp or a photograph).

These historical examples suggest that we cannot necessarily assume that a job will remain immune to significant changes just because AI is not helpful for performing the tasks it currently involves.

When considered on a broad, historical scale, technological progress does not only allow existing tasks to be performed more efficiently. It also makes entirely new tasks possible. Existing approaches to automation evaluations do little to help predict or understand the implications of these new tasks.

Towards Improved Evaluations

Below, I discuss a few ways in which evaluations could be improved to overcome some of these trade-offs and limitations. These are: 

  • Running large-sample worker productivity studies
  • Piloting evaluations of AI performance on worker tasks
  • Modelling additional economic variables
  • Measuring automation impacts in real-world settings (perhaps leveraging the “staged release” of new AI systems)

Running large-sample worker productivity studies

One way to overcome the trade-off between breadth and accuracy in automation evaluations would be to simply invest in much larger-scale worker productivity studies.

For example, a large-scale study (or set of studies) could attempt to empirically measure productivity impacts across a representative sample of economically relevant tasks and occupations. If the sample is large enough, it could offer the same kinds of insights about economy-wide patterns and trends that exposure studies aim to offer — but with greater accuracy and precision.

While the costs involved would be significant, it is possible that the insights produced would warrant these costs.

Piloting evaluations of AI performance on worker tasks

Another, potentially more scalable, approach to achieving both breadth and empirical grounding could be to pilot evaluations of AI performance on a wide variety of worker tasks. These evaluations would involve having AI systems perform tasks that workers currently perform and then assessing how helpful the technology has been in terms of reducing task time or improving the quality of task outputs. 

In practice, this approach would involve treating automation evaluations the same way researchers treat evaluations on performance-based benchmarks for other AI capabilities that are not directly tied to work. The goal of this approach would be to directly assess what AI systems can do, rather than assessing (as in the case of worker productivity studies) what workers can do with AI systems.

As an illustrative example, an evaluation focused on the specific task of writing professional emails could judge how well a model performs at drafting a representative sample of professional emails (e.g. a polite rejection email, a scheduling email, a workshop invitation). Evaluation scores could then be either considered directly or converted into binary judgements about whether or not a task is “exposed.” They could even be used to judge whether or not a system is technically capable of reliably automating a task.
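To make this concrete, here is a minimal, hypothetical sketch of what such an evaluation harness might look like; the prompts, rubric criteria, and the 0.7 exposure threshold are illustrative assumptions rather than any established benchmark:

# All prompts, rubric criteria, and the threshold below are invented
# for illustration; they are not a real evaluation standard.
EMAIL_PROMPTS = [
    "Write a polite rejection email to a job applicant.",
    "Write an email scheduling a meeting for next Tuesday.",
    "Write an invitation email for an internal workshop.",
]
RUBRIC = ["appropriate tone", "complete content", "ready to send without edits"]

def evaluate_task(generate, score):
    # `generate` calls some AI system; `score` rates one output against one
    # rubric criterion on a 0-1 scale (e.g. human raters or a judge model).
    results = []
    for prompt in EMAIL_PROMPTS:
        output = generate(prompt)
        results.append(sum(score(output, c) for c in RUBRIC) / len(RUBRIC))
    mean_score = sum(results) / len(results)
    return {"mean_score": mean_score, "exposed": mean_score > 0.7}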

Of course, there are significant technical challenges associated with implementing this approach. These include:

  • Translating task descriptions into clear prompts to an AI system7
  • Developing and validating efficient and reliable methods for rating AI system performance on a variety of tasks8,9

Despite these challenges, running initial pilots to develop and apply these evaluations could be a worthwhile experiment. If the early pilots are promising, then the approach could be scaled to examine a broader and more representative set of occupational tasks. 

Having the technical infrastructure to run evaluations of AI performance on worker tasks could become increasingly important as the automation capabilities of new systems advance.

Modelling additional economic variables

Beyond the breadth/accuracy trade-off, a shared limitation of the methods I have discussed so far is that neither can account for additional economic variables (such as the elasticity of demand for a worker’s outputs) that will help to determine real-world automation impacts.

One path forward here seems to be for researchers to attempt to estimate some of these additional variables and integrate those estimates into economic models.

It is not clear how far researchers can follow this path, since many of the relevant variables are themselves difficult to estimate. However, there is some early work that moves in this direction. For instance, a recent pioneering paper from a team at MIT sets out to go “beyond exposure” to evaluate which tasks it is currently cost-effective to automate with computer vision.

Measuring automation impacts in real-world settings

Another important limitation of the methods I have discussed is that they study tasks in isolation and are not capable of addressing the complexity of real-world jobs. For example, estimates of task-level productivity impacts do not allow us to infer how a worker’s overall productivity will change.

Experiments that vary AI access across teams of workers within real-world firms, over long periods of time, could potentially allow us to overcome this limitation. Researchers could observe how AI increases the overall productivity and performance of both workers and teams. In addition, these studies could potentially allow us to begin to observe how AI affects demand for labour.

It would be costly to run these experiments and would require complicated negotiations with individual firms. Fortunately, however, there may be another approach to studying real-world impacts that would be more feasible.

Specifically, researchers could leverage the staged release of AI systems. Many AI companies already deploy frontier AI systems through “staged release” processes, which often involve giving different actors access to their systems at different points in time. With the cooperation of AI companies and other firms, researchers could take advantage of the variation in adoption during staged releases to estimate the effect of AI on productivity and labour demand in the real world.10,11 Because some companies will get access earlier than others, staged releases enable comparisons between companies with access and those without access.

Conclusion

Automation evaluations could prove to be a critical tool for policymakers as they work to minimise the societal harms and maximise the economic benefits of powerful AI systems. Predicting how those systems will affect the labour market is a challenge, and current methods for evaluation have limitations. It is important for policymakers and researchers to be mindful of these limitations and invest in post-deployment monitoring of impacts as well. However, researchers can also improve their predictions by running large-sample worker productivity studies, piloting evaluations of AI performance on worker tasks, modelling additional economic variables, and measuring automation impacts in real-world settings.

The author of this piece would like to thank Ben Garfinkel, Stephen Clare, John Halstead, Iyngkarran Kumar, Daniel Rock, Peter Wills, Anton Korinek, Leonie Koessler, Markus Anderljung, Ben Bucknall, and Alan Chan for helpful comments, conversations, and feedback on this work.

Sam Manning can be contacted at

sa*********@go********.ai

Source link

26Jun

FlowMind Is An Automatic Workflow Generator | by Cobus Greyling | Jun, 2024


RAG & API Retrieval, Partitioning & Extraction

FlowMind aims to address hallucination by providing contextual reference data at inference time, analogous to RAG. The API step likewise retrieves, partitions and extracts relevant XML-like blocks. These blocks are very much comparable to chunks.

FlowMind is also challenged by the problems of selecting the top retrieved blocks/chunks and truncating blocks which are too long.

Embeddings are also used in FlowMind to search according to semantic similarity.

So FlowMind can be considered JPMorganChase’s proprietary RAG solution, and it evidently meets their data privacy and governance requirements. What I find curious is that, while the market in general has settled on certain terminology and developed a shared understanding, JPMorganChase breaks from these terms and introduces its own lexicon. Nevertheless, FlowMind is very much comparable to RAG in general.

It is evident that, through this implementation, JPMorganChase has full control over their stack at a very granular level. The process and flow Python functions created by FlowMind most probably fit into their current ecosystem.

FlowMind can also be leveraged by skilled users to generate flows from a description, which can then be re-used.

The aim of FlowMind is to remedy hallucination in Large Language Models (LLMs) while ensuring there is no direct link between the LLM and proprietary code or data.

FlowMind creates flows or pipelines on the fly, a process the paper refers to as robotic process automation. There is a human-in-the-loop element, which can also be seen as a single dialog turn, allowing users to interact with and refine the generated workflows.

Application Programming Interfaces (APIs) are used for grounding the LLMs, serving as a contextual reference to guide their reasoning. This is followed by code generation, code execution, and ultimately delivering the final answer.

Stage 1: FlowMind begins by following a structured lecture plan (a prompt template) to create a lecture prompt. This prompt educates the Large Language Model (LLM) about the context and the available APIs, preparing it to write code.
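As a hypothetical illustration of what assembling such a lecture prompt could look like (the API names and template below are invented for the example, not FlowMind’s actual prompts):

# Invented API descriptions and template, purely for illustration.
APIS = [
    ("get_positions(client_id)", "Return the client's current portfolio positions."),
    ("get_prices(symbols)", "Return the latest market prices for the given symbols."),
]

def build_lecture_prompt(task_context, apis=APIS):
    # Assemble a "lecture" that teaches the LLM the context and the APIs
    # it may use before asking it to write workflow code.
    api_docs = "\n".join(f"- {sig}: {desc}" for sig, desc in apis)
    return (
        f"Context: {task_context}\n"
        f"You may call only the following APIs:\n{api_docs}\n"
        "Write a Python workflow function that uses these APIs to answer "
        "the user's request."
    )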



Source link
