10Jul

Our Human Creativity Is Becoming More Uniform Due To ChatGPT | by Cobus Greyling | Jul, 2024


Our ideas, solutions and artistic expressions are becoming less original & diverse.

One of the primary use cases for ChatGPT is to help us become more creative, or to generate new and unique ideas.

This recent study considers how, instead of ChatGPT making us more creative, it leads to similar ideas across disparate users. It also leads us to approach and experience the creative process differently.

In a study with 36 participants, the researchers found that users of ChatGPT produced less semantically distinct ideas compared to alternative creativity support tools (CSTs).

Additionally, ChatGPT users generated more detailed ideas but felt less responsible for them.

The challenge is that a large number of people are using highly centralised, data-driven AI systems (such as ChatGPT) for our creative ideas and content. This leads to decreased diversity in the results of our creative processes, amongst other things.

Below, the users on the right, making use of ChatGPT, produce much more homogeneous ideas at a group level, while the users on the left, making use of more traditional creativity support tools, produce more diverse ideas.



Source link

09Jul

Doping: A Technique to Test Outlier Detectors | by W Brett Kennedy | Jul, 2024


Using well-crafted synthetic data to compare and evaluate outlier detectors

This article continues my series on outlier detection, following articles on Counts Outlier Detector and Frequent Patterns Outlier Factor, and provides another excerpt from my book Outlier Detection in Python.

In this article, we look at the issue of testing and evaluating outlier detectors, a notoriously difficult problem, and present one solution, sometimes referred to as doping. Using doping, real data rows are modified (usually) randomly, but in such a way as to ensure they are likely an outlier in some regard and, as such, should be detected by an outlier detector. We’re then able to evaluate detectors by assessing how well they are able to detect the doped records.

In this article, we look specifically at tabular data, but the same idea may be applied to other modalities as well, including text, image, audio, network data, and so on.

Likely, if you’re familiar with outlier detection, you’re also familiar, at least to some degree, with predictive models for regression and classification problems. With these types of problems, we have labelled data, and so it’s relatively simple to evaluate each option when tuning a model (selecting the best pre-processing, features, hyper-parameters, and so on); and it’s also relatively easy to estimate a model’s accuracy (how it will perform on unseen data): we simply use a train-validation-test split, or better, use cross validation. As the data is labelled, we can see directly how the model performs on labelled test data.

But, with outlier detection, there is no labelled data and the problem is significantly more difficult; we have no objective way to determine if the records scored highest by the outlier detector are, in fact, the most statistically unusual within the dataset.

With clustering, as another example, we also have no labels for the data, but it is at least possible to measure the quality of the clustering: we can determine how internally consistent the clusters are and how different the clusters are from each other. Using some distance metric (such as Manhattan or Euclidean distances), we can measure how close records within a cluster are to each other and how far apart clusters are from each other.

So, given a set of possible clusterings, it’s possible to define a sensible metric (such as the Silhouette score) and determine which is the preferred clustering, at least with respect to that metric. That is, much like prediction problems, we can calculate a score for each clustering, and select the clustering that appears to work best.
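As a small illustration (not from the article), scikit-learn's silhouette_score can be used to compare candidate clusterings in exactly this way:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four natural clusters
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Score several candidate clusterings; the highest Silhouette score is preferred
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))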

With outlier detection, though, we have nothing analogous to this we can use. Any system that seeks to quantify how anomalous a record is, or that seeks to determine, given two records, which is the more anomalous of the two, is effectively an outlier detection algorithm in itself.

For example, we could use entropy as our outlier detection method, and can then examine the entropy of the full dataset as well as the entropy of the dataset after removing any records identified as strong outliers. This is, in a sense, valid; entropy is a useful measure of the presence of outliers. But we cannot assume entropy is the definitive definition of outliers in this dataset; one of the fundamental qualities of outlier detection is that there is no definitive definition of outliers.

In general, if we have any way to try to evaluate the outliers detected by an outlier detection system (or, as in the previous example, the dataset with and without the identified outliers), this is effectively an outlier detection system in itself, and it becomes circular to use this to evaluate the outliers found.

Consequently, it’s quite difficult to evaluate outlier detection systems and there’s effectively no good way to do so, at least using the real data that’s available.

We can, though, create synthetic test data (in such a way that we can assume the synthetically-created data are predominantly outliers). Given this, we can determine the extent to which outlier detectors tend to score the synthetic records more highly than the real records.

There are a number of ways to create synthetic data we cover in the book, but for this article, we focus on one method, doping.

Doping data records refers to taking existing data records and modifying them slightly, typically changing the values in just one, or a small number, of cells per record.

If the data being examined is, for example, a table related to the financial performance of a company comprised of franchise locations, we may have a row for each franchise, and our goal may be to identify the most anomalous of these. Let’s say we have features including:

  • Age of the franchise
  • Number of years with the current owner
  • Number of sales last year
  • Total dollar value of sales last year

As well as some number of other features.

A typical record may have values for these four features such as: 20 years old, 5 years with the current owner, 10,000 unique sales in the last year, for a total of $500,000 in sales in the last year.

We could create a doped version of this record by adjusting a value to a rare value, for example, setting the age of the franchise to 100 years. This can be done, and will provide a quick smoke test of the detectors being tested — likely any detector will be able to identify this as anomalous (assuming a value of 100 is rare), though we may be able to eliminate some detectors that are not able to detect this sort of modified record reliably.

We would not necessarily remove from consideration the type of outlier detector (e.g. kNN, Entropy, or Isolation Forest) itself, but the combination of type of outlier detector, pre-processing, hyperparameters, and other properties of the detector. We may find, for example, that kNN detectors with certain hyperparameters work well, while those with other hyperparameters do not (at least for the types of doped records we test with).

Usually, though, most testing will be done creating more subtle outliers. In this example, we could change the dollar value of total sales from 500,000 to 100,000, which may still be a typical value, but the combination of 10,000 unique sales with $100,000 in total sales is likely unusual for this dataset. That is, much of the time with doping, we are creating records that have unusual combinations of values, though unusual single values are sometimes created as well.

When changing a value in a record, it’s not known specifically how the row will become an outlier (assuming it does), but we can assume most tables have associations between the features. Changing the dollar value to 100,000 in this example, may (as well as creating an unusual combination of number of sales and dollar value of sales) quite likely create an unusual combination given the age of the franchise or the number of years with the current owner.

With some tables, however, there are no associations between the features, or there are only few and weak associations. This is rare, but can occur. With this type of data, there is no concept of unusual combinations of values, only unusual single values. Although rare, this is actually a simpler case to work with: it’s easier to detect outliers (we simply check for single unusual values), and it’s easier to evaluate the detectors (we simply check how well we are able to detect unusual single values). For the remainder of this article, though, we will assume there are some associations between the features and that most anomalies would be unusual combinations of values.

Most outlier detectors (with a small number of exceptions) have separate training and prediction steps. In this way, most are similar to predictive models. During the training step, the training data is assessed and the normal patterns within the data (for example, the normal distances between records, the frequent item sets, the clusters, the linear relationships between features, etc.) are identified. Then, during the prediction step, a test set of data (which may be the same data used for training, or may be separate data) is compared against the patterns found during training, and each row is assigned an outlier score (or, in some cases, a binary label).

Given this, there are two main ways we can work with doped data:

  1. Including doped records in the training data

We may include some small number of doped records in the training data and then use this data for testing as well. This tests our ability to detect outliers in the currently-available data. This is a common task in outlier detection: given a set of data, we often wish to find the outliers in this dataset (though may wish to find outliers in subsequent data as well — records that are anomalous relative to the norms for this training data).

Doing this, we can test with only a small number of doped records, as we do not wish to significantly affect the overall distributions of the data. We then check if we are able to identify these as outliers. One key test is to include both the original and the doped version of the doped records in the training data in order to determine if the detectors score the doped versions significantly higher than the original versions of the same records.

We also, though, wish to check that the doped records are generally scored among the highest (with the understanding that some original, unmodified records may legitimately be more anomalous than the doped records, and that some doped records may not be anomalous).

Given that we can test only with a small number of doped records, this process may be repeated many times.

The doped data is used, however, only for evaluating the detectors in this way. When creating the final model(s) for production, we will train on only the original (real) data.

If we are able to reliably detect the doped records in the data, we can be reasonably confident that we are able to identify other outliers within the same data, at least outliers along the lines of the doped records (but not necessarily outliers that are substantially more subtle — hence we wish to include tests with reasonably subtle doped records).
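A minimal sketch of this first approach (my own, not code from the article), assuming a numeric dataframe df with a default RangeIndex, such as the pre-processed abalone frame used later in this article:

import numpy as np
import pandas as pd
from pyod.models.iforest import IForest

rng = np.random.default_rng(0)

# Dope a small number of rows: flip one value per row to the other side of the median
doped_idx = rng.choice(df.index, size=20, replace=False)
doped_rows = df.loc[doped_idx].copy()
for i in doped_rows.index:
    col = rng.choice(df.columns)
    med = df[col].median()
    q = rng.random() / 2 if doped_rows.loc[i, col] > med else 0.5 + rng.random() / 2
    doped_rows.loc[i, col] = df[col].quantile(q)

# Train on the original data plus the doped copies
train_df = pd.concat([df, doped_rows], ignore_index=True)
clf = IForest().fit(train_df)
scores = clf.decision_scores_

# Compare each doped copy against its unmodified original (positions match labels
# because df uses a default RangeIndex)
orig_scores = scores[doped_idx]
doped_scores = scores[len(df):]
print((doped_scores > orig_scores).mean())  # fraction where doping raised the score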

2. Including doped records only in the testing data

It is also possible to train using only the real data (which we can assume is largely non-outliers) and then test with both the real and the doped data. This allows us to train on relatively clean data (some records in the real data will be outliers, but the majority will be typical, and there is no contamination due to doped records).

It also allows us to test with the actual outlier detector(s) that may, potentially, be put in production (depending how well they perform with the doped data — both compared to the other detectors we test, and compared to our sense of how well a detector should perform at minimum).

This tests our ability to detect outliers in future data. This is another common scenario with outlier detection: where we have one dataset that can be assumed to be reasonably clean (either free of outliers, or containing only a small, typical set of outliers, and without any extreme outliers) and we wish to compare future data to this.

Training with real data only and testing with both real and doped, we may test with any volume of doped data we wish, as the doped data is used only for testing and not for training. This allows us to create a large, and consequently, more reliable test dataset.

There are a number of ways to create doped data, including several covered in Outlier Detection in Python, each with its own strengths and weaknesses. For simplicity, in this article we cover just one option, where the data is modified in a fairly random manner: where the cell(s) modified are selected randomly, and the new values that replace the original values are created randomly.

Doing this, it is possible for some doped records to not be truly anomalous, but in most cases, assigning random values will upset one or more associations between the features. We can assume the doped records are largely anomalous, though, depending on how they are created, possibly only slightly so.

Here we go through an example, taking a real dataset, modifying it, and testing to see how well the modifications are detected.

In this example, we use a dataset available on OpenML called abalone (https://www.openml.org/search?type=data&sort=runs&id=42726&status=active, available under public license).

Although other preprocessing may be done, for this example, we one-hot encode the categorical features and use RobustScaler to scale the numeric features.

We test with three outlier detectors, Isolation Forest, LOF, and ECOD, all available in the popular PyOD library (which must be pip installed to execute).

We also use an Isolation Forest to clean the data (remove any strong outliers) before any training or testing. This step is not necessary, but is often useful with outlier detection.

This is an example of the second of the two approaches described above, where we train on the original data and test with both the original and doped data.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import RobustScaler
import matplotlib.pyplot as plt
import seaborn as sns
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.ecod import ECOD

# Collect the data
data = fetch_openml('abalone', version=1)
df = pd.DataFrame(data.data, columns=data.feature_names)
df = pd.get_dummies(df)
df = pd.DataFrame(RobustScaler().fit_transform(df), columns=df.columns)

# Use an Isolation Forest to clean the data
clf = IForest()
clf.fit(df)
if_scores = clf.decision_scores_
top_if_scores = np.argsort(if_scores)[::-1][:10]
clean_df = df.loc[[x for x in df.index if x not in top_if_scores]].copy()

# Create a set of doped records
doped_df = df.copy()
for i in doped_df.index:
    col_name = np.random.choice(df.columns)
    med_val = clean_df[col_name].median()
    if doped_df.loc[i, col_name] > med_val:
        doped_df.loc[i, col_name] = \
            clean_df[col_name].quantile(np.random.random()/2)
    else:
        doped_df.loc[i, col_name] = \
            clean_df[col_name].quantile(0.5 + np.random.random()/2)

# Define a method to test a specified detector.
def test_detector(clf, title, df, clean_df, doped_df, ax):
    clf.fit(clean_df)
    df = df.copy()
    doped_df = doped_df.copy()
    df['Scores'] = clf.decision_function(df)
    df['Source'] = 'Real'
    doped_df['Scores'] = clf.decision_function(doped_df)
    doped_df['Source'] = 'Doped'
    test_df = pd.concat([df, doped_df])
    sns.boxplot(data=test_df, orient='h', x='Scores', y='Source', ax=ax)
    ax.set_title(title)

# Plot each detector in terms of how well they score doped records
# higher than the original records
fig, ax = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(10, 3))
test_detector(IForest(), "IForest", df, clean_df, doped_df, ax[0])
test_detector(LOF(), "LOF", df, clean_df, doped_df, ax[1])
test_detector(ECOD(), "ECOD", df, clean_df, doped_df, ax[2])
plt.tight_layout()
plt.show()

Here, to create the doped records, we copy the full set of original records, so we will have an equal number of doped and original records. For each doped record, we select one feature randomly to modify. If the original value is above the median, we create a random value below the median; if the original is below the median, we create a random value above it.

In this example, we see that IF does score the doped records higher, but not significantly so. LOF does an excellent job distinguishing the doped records, at least for this form of doping. ECOD is a detector that detects only unusually small or unusually large single values and does not test for unusual combinations. As the doping used in this example does not create extreme values, only unusual combinations, ECOD is unable to distinguish the doped from the original records.

This example uses boxplots to compare the detectors, but normally we would use an objective score, very often the AUROC (Area Under a Receiver Operator Curve) score to evaluate each detector. We would also typically test many combinations of model type, pre-processing, and parameters.
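As a rough sketch (not from the original article), the AUROC for a single detector could be computed as follows, reusing df, clean_df and doped_df from the code above and treating the doped rows as the positive class:

from sklearn.metrics import roc_auc_score

# Fit on the cleaned data, then score both the real and the doped records
clf = IForest()
clf.fit(clean_df)
scores = np.concatenate([clf.decision_function(df), clf.decision_function(doped_df)])
labels = np.concatenate([np.zeros(len(df)), np.ones(len(doped_df))])  # 1 = doped
print(f"IForest AUROC: {roc_auc_score(labels, scores):.3f}")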

The above method will tend to create doped records that violate the normal associations between features, but other doping techniques may be used to make this more likely. For example, considering first categorical columns, we may select a new value such that both:

  1. The new value is different from the original value
  2. The new value is different from the value that would be predicted from the other values in the row. To achieve this, we can create a predictive model that predicts the current value of this column, for example a Random Forest Classifier.

With numeric data, we can achieve the equivalent by dividing each numeric feature into four quartiles (or some number of quantiles, but at least three). For each new value in a numeric feature, we then select a value such that both:

  1. The new value is in a different quartile than the original
  2. The new value is in a different quartile than what would be predicted given the other values in the row.

For example, if the original value is in Q1 and the predicted value is in Q2, then we can select a value randomly in either Q3 or Q4. The new value will, then, most likely go against the normal relationships among the features.
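Here is a minimal sketch (my own, not from the article or book) of the numeric variant: a hypothetical dope_numeric_feature() helper that picks a replacement value from a quartile different from both the original value's quartile and the quartile of the value predicted from the other features, reusing clean_df from the earlier code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def dope_numeric_feature(clean_df, col_name, row_idx, rng):
    # Quartile edges for the target feature, estimated from the clean data
    _, bin_edges = pd.qcut(clean_df[col_name], q=4, retbins=True, duplicates='drop')

    def to_quartile(value):
        # Map a value to its quartile index (0..3) using the bin edges
        return int(np.clip(np.searchsorted(bin_edges, value, side='right') - 1,
                           0, len(bin_edges) - 2))

    # Predict the feature from the other features
    other_cols = [c for c in clean_df.columns if c != col_name]
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(clean_df[other_cols], clean_df[col_name])

    orig_val = clean_df.loc[row_idx, col_name]
    pred_val = model.predict(clean_df.loc[[row_idx], other_cols])[0]
    excluded = {to_quartile(orig_val), to_quartile(pred_val)}

    # Sample a replacement from a quartile that differs from both
    allowed = [q for q in range(len(bin_edges) - 1) if q not in excluded]
    target_q = rng.choice(allowed)
    candidates = clean_df[col_name][clean_df[col_name].apply(to_quartile) == target_q]
    return candidates.sample(1, random_state=0).iloc[0]

# Example: dope one row of a continuous feature (here, the column with the most unique values)
rng = np.random.default_rng(0)
col = clean_df.nunique().idxmax()
new_value = dope_numeric_feature(clean_df, col, clean_df.index[0], rng)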

There is no definitive way to say how anomalous a record is once doped. However, we can assume that on average the more features modified, and the more they are modified, the more anomalous the doped records will be. We can take advantage of this to create not a single test suite, but multiple test suites, which allows us to evaluate the outlier detectors much more accurately.

For example, we can create a set of doped records that are very obvious (multiple features are modified in each record, each to a value significantly different from the original value), a set of doped records that are very subtle (only a single feature is modified, not significantly from the original value), and many levels of difficulty in between. This can help differentiate the detectors well.

So, we can create a suite of test sets, where each test set has a (roughly estimated) level of difficulty based on the number of features modified and the degree they’re modified. We can also have different sets that modify different features, given that outliers in some features may be more relevant, or may be easier or more difficult to detect.
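As a rough sketch (assumed, not from the article), test suites at different difficulty levels can be generated by varying how many features are modified per record, reusing clean_df and the quantile-flipping idea from the earlier example:

import numpy as np

def create_doped_set(clean_df, n_records, n_features_modified, rng):
    doped = clean_df.sample(n_records, random_state=1).copy()
    for i in doped.index:
        cols = rng.choice(clean_df.columns, size=n_features_modified, replace=False)
        for col in cols:
            med = clean_df[col].median()
            # Push the value to the opposite side of the median
            if doped.loc[i, col] > med:
                doped.loc[i, col] = clean_df[col].quantile(rng.random() / 2)
            else:
                doped.loc[i, col] = clean_df[col].quantile(0.5 + rng.random() / 2)
    return doped

rng = np.random.default_rng(0)
test_suites = {
    "subtle": create_doped_set(clean_df, 100, n_features_modified=1, rng=rng),
    "medium": create_doped_set(clean_df, 100, n_features_modified=2, rng=rng),
    "obvious": create_doped_set(clean_df, 100, n_features_modified=4, rng=rng),
}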

It is, though, important that any doping performed represents the type of outliers that would be of interest if they did appear in real data. Ideally, the set of doped records also covers well the range of what you would be interested in detecting.

If these conditions are met, and multiple test sets are created, this is very powerful for selecting the best-performing detectors and estimating their performance on future data. We cannot predict how many outliers will be detected or what levels of false positives and false negatives you will see — these depend greatly on the data you will encounter, which in an outlier detection context is very difficult to predict. But, we can have a decent sense of the types of outliers you are likely to detect and to not.

Possibly more importantly, we are also well situated to create an effective ensemble of outlier detectors. In outlier detection, ensembles are typically necessary for most projects. Given that some detectors will catch some types of outliers and miss others, while other detectors will catch and miss other types, we can usually only reliably catch the range of outliers we’re interested in using multiple detectors.

Creating ensembles is a large and involved area in itself, and is different than ensembling with predictive models. But, for this article, we can indicate that having an understanding of what types of outliers each detector is able to detect gives us a sense of which detectors are redundant and which can detect outliers most others are not able to.

It is difficult to assess how well any given outlier detector detects outliers in the current data, and even harder to assess how well it may do on future (unseen) data. It is also very difficult, given two or more outlier detectors, to assess which would do better, again on both the current and on future data.

There are, though, a number of ways we can estimate these using synthetic data. In this article, we went over, at least quickly (skipping a lot of the nuances, but covering the main ideas), one approach based on doping real records and evaluating how well we’re able to score these more highly than the original data. Although not perfect, these methods can be invaluable and there is very often no other practical alternative with outlier detection.

All images are from the author.



Source link

08Jul

Evaluating The Quality Of RAG & Long-Context LLM Output | by Cobus Greyling | Jul, 2024


Salesforce propose to leverage the task of summarisation as a testbed for evaluating long-context models and RAG systems.

Summarisation requires reasoning over a long context and a careful understanding of the relative importance of content.

The Problem Identified:

Prior work on summarisation evaluation, particularly in evaluating the relevance of summaries, has focused on single-document summarisation or tasks in which the input content is on the order of 1,000–2,000 tokens.

Longer conversational and multi-document news summarisation is still often limited to around 10k tokens.

A major problem in summarisation evaluation is the reliance on low-quality reference summaries and automatic metrics that poorly correlate with human judgments.

Traditional evaluations compare candidate summaries to gold-standard references, assuming higher overlap indicates better quality. This approach is unreliable, especially for long-context settings where high-quality references are expensive to obtain. Even the best automatic metrics for content coverage often fail to correlate well with human judgments.

To address these issues, Salesforce use synthetic data generation.

Considering the image below, the approach from Salesforce involves creating a large corpus of documents (“Haystack”) on a given topic, ensuring certain signals repeat across documents.

By controlling which insights appear in which documents, Salesforce can automatically determine the relevant insights for a search query. The SummHay task requires systems to summarise these insights and cite their sources. Summaries are evaluated based on coverage of expected insights and accuracy in citing source documents.



Source link

07Jul

Understanding and Implementing Medprompt | by Anand Subramanian | Jul, 2024


We now perform choice shuffling ensembling by shuffling the order of answer choices for each test question, creating multiple variants of the same question. The LLM is then prompted with these variants, along with the corresponding few-shot exemplars, to generate reasoning steps and an answer for each variant. Finally, we perform a majority vote over the predictions from all variants and select the final prediction.

The code related to this implementation can be found at this github repo link.

We use the MedQA [6] dataset for implementing and evaluating Medprompt. We first define helper functions for parsing the jsonl files.

import json

def write_jsonl_file(file_path, dict_list):
    """
    Write a list of dictionaries to a JSON Lines file.

    Args:
    - file_path (str): The path to the file where the data will be written.
    - dict_list (list): A list of dictionaries to write to the file.
    """
    with open(file_path, 'w') as file:
        for dictionary in dict_list:
            json_line = json.dumps(dictionary)
            file.write(json_line + '\n')

def read_jsonl_file(file_path):
    """
    Parses a JSONL (JSON Lines) file and returns a list of dictionaries.

    Args:
        file_path (str): The path to the JSONL file to be read.

    Returns:
        list of dict: A list where each element is a dictionary representing
        a JSON object from the file.
    """
    jsonl_lines = []
    with open(file_path, 'r', encoding="utf-8") as file:
        for line in file:
            json_object = json.loads(line)
            jsonl_lines.append(json_object)

    return jsonl_lines

Implementing Self-Generated CoT

For our implementation, we utilize the training set from MedQA. We implement a zero-shot CoT prompt and process all the training questions. We use GPT-4o in our implementation. For each question, we generate the CoT and the corresponding answer. We define a prompt which is based on the template provided in the Medprompt paper.

system_prompt = """You are an expert medical professional. You are provided with a medical question with multiple answer choices.
Your goal is to think through the question carefully and explain your reasoning step by step before selecting the final answer.
Respond only with the reasoning steps and answer as specified below.
Below is the format for each question and answer:

Input:
## Question: {{question}}
{{answer_choices}}

Output:
## Answer
(model generated chain of thought explanation)
Therefore, the answer is [final model answer (e.g. A,B,C,D)]"""

def build_few_shot_prompt(system_prompt, question, examples, include_cot=True):
    """
    Builds the few-shot prompt from the retrieved training examples.

    Args:
        system_prompt (str): Task instruction for the LLM.
        question (dict): The test question for which to create a query,
            formatted as required by `create_query`.
        examples (list of dict): Retrieved training examples, each with its
            question, CoT reasoning and answer.
        include_cot (bool): Whether to include the CoT reasoning in the
            exemplar answers.

    Returns:
        list of dict: A list of messages, including a system message defining
        the task, the few-shot exemplars, and a user message with the input question.
    """
    # create_query and format_answer are helper functions defined in the
    # accompanying repository.
    messages = [{"role": "system", "content": system_prompt}]

    for elem in examples:
        messages.append({"role": "user", "content": create_query(elem)})
        if include_cot:
            messages.append({"role": "assistant", "content": format_answer(elem["cot"], elem["answer_idx"])})
        else:
            answer_string = f"""## Answer\nTherefore, the answer is {elem["answer_idx"]}"""
            messages.append({"role": "assistant", "content": answer_string})

    messages.append({"role": "user", "content": create_query(question)})
    return messages

from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def get_response(messages, model_name, temperature=0.0, max_tokens=10):
    """
    Obtains the responses/answers of the model through the chat-completions API.

    Args:
        messages (list of dict): The built messages provided to the API.
        model_name (str): Name of the model to access through the API
        temperature (float): A value between 0 and 1 that controls the randomness of the output.
            A temperature value of 0 ideally makes the model pick the most likely token, making the outputs deterministic.
        max_tokens (int): Maximum number of tokens that the model should generate

    Returns:
        str: The response message content from the model.
    """
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content

We also define helper functions for parsing the reasoning and the final answer option from the LLM response.

import re

def matches_ans_option(s):
    """
    Checks if the string starts with the specific pattern 'Therefore, the answer is [A-Z]'.

    Args:
        s (str): The string to be checked.

    Returns:
        bool: True if the string matches the pattern, False otherwise.
    """
    return bool(re.match(r'^Therefore, the answer is [A-Z]', s))

def extract_ans_option(s):
    """
    Extracts the answer option (a single capital letter) from the start of the string.

    Args:
        s (str): The string containing the answer pattern.

    Returns:
        str or None: The captured answer option if the pattern is found, otherwise None.
    """
    match = re.search(r'^Therefore, the answer is ([A-Z])', s)
    if match:
        return match.group(1)  # Returns the captured alphabet
    return None

def matches_answer_start(s):
    """
    Checks if the string starts with the markdown header '## Answer'.

    Args:
        s (str): The string to be checked.

    Returns:
        bool: True if the string starts with '## Answer', False otherwise.
    """
    return s.startswith("## Answer")

def validate_response(s):
    """
    Validates that a multi-line string response starts with '## Answer' and ends with the answer pattern.

    Args:
        s (str): The multi-line string response to be validated.

    Returns:
        bool: True if the response is valid, False otherwise.
    """
    file_content = s.split("\n")

    return matches_ans_option(file_content[-1]) and matches_answer_start(s)

def parse_answer(response):
    """
    Parses a response that starts with '## Answer', extracting the reasoning and the answer choice.

    Args:
        response (str): The multi-line string response containing the answer and reasoning.

    Returns:
        tuple: A tuple containing the extracted CoT reasoning and the answer choice.
    """
    split_response = response.split("\n")
    assert split_response[0] == "## Answer"
    cot_reasoning = "\n".join(split_response[1:-1]).strip()
    ans_choice = extract_ans_option(split_response[-1])
    return cot_reasoning, ans_choice

We now process the questions in the training set of MedQA. We obtain CoT responses and answers for all questions and store them to a folder.

import os
from tqdm import tqdm

train_data = read_jsonl_file("data/phrases_no_exclude_train.jsonl")

cot_responses = []
os.makedirs("cot_responses", exist_ok=True)
existing_files = os.listdir("cot_responses/")

for idx, item in enumerate(tqdm(train_data)):
    if str(idx) + ".txt" in existing_files:
        continue

    # build_zero_shot_prompt is defined in the accompanying repository
    prompt = build_zero_shot_prompt(system_prompt, item)
    try:
        response = get_response(prompt, model_name="gpt-4o", max_tokens=500)
        cot_responses.append(response)
        with open(os.path.join("cot_responses", str(idx) + ".txt"), "w", encoding="utf-8") as f:
            f.write(response)
    except Exception as e:
        print(str(e))
        cot_responses.append("")

We now iterate across all the generated responses to check if they are valid and adhere to the prediction format defined in the prompt. We discard responses that do not conform to the required format. After that, we check the predicted answers against the ground truth for each question and only retain questions for which the predicted answers match the ground truth.

questions_dict = []
for idx, question in enumerate(tqdm(train_data)):
    file = open(os.path.join("cot_responses/", str(idx) + ".txt"), encoding="utf-8").read()
    if not validate_response(file):
        continue

    cot, pred_ans = parse_answer(file)

    dict_elem = {}
    dict_elem["idx"] = idx
    dict_elem["question"] = question["question"]
    dict_elem["answer"] = question["answer"]
    dict_elem["options"] = question["options"]
    dict_elem["cot"] = cot
    dict_elem["pred_ans"] = pred_ans
    questions_dict.append(dict_elem)

filtered_questions_dict = []
for item in tqdm(questions_dict):
    pred_ans = item["options"][item["pred_ans"]]
    if pred_ans == item["answer"]:
        filtered_questions_dict.append(item)

Implementing the KNN model

Having processed the training set and obtained the CoT response for all these questions, we now embed all questions using the text-embedding-ada-002 model from OpenAI.

def get_embedding(text, model="text-embedding-ada-002"):
    return client.embeddings.create(input=[text], model=model).data[0].embedding

for item in tqdm(filtered_questions_dict):
    item["embedding"] = get_embedding(item["question"])
    inv_options_map = {v: k for k, v in item["options"].items()}
    item["answer_idx"] = inv_options_map[item["answer"]]

We now train a KNN model using these question embeddings. This acts as a retriever at inference time, as it helps us to retrieve similar datapoints from the training set that are most similar to the question from the test set.

import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.array([d["embedding"] for d in filtered_questions_dict])
indices = list(range(len(filtered_questions_dict)))

knn = NearestNeighbors(n_neighbors=5, algorithm='auto', metric='cosine').fit(embeddings)

Implementing the Dynamic Few-Shot and Choice Shuffling Ensemble Logic

We can now run inference. We subsample 500 questions from the MedQA test set for our evaluation. For each question, we retrieve the 5 most similar questions from the train set using the KNN module, along with their respective CoT reasoning steps and predicted answers. We construct a few-shot prompt using these examples.

For each question, we also shuffle the order of the options 5 times to create different variants. We then utilize the constructed few-shot prompt to get the predicted answer for each of the variants with shuffled options.

import random

def shuffle_option_labels(answer_options):
    """
    Shuffles the options of the question.

    Parameters:
        answer_options (dict): A dictionary with the options.

    Returns:
        dict: A new dictionary with the shuffled options.
    """
    options = list(answer_options.values())
    random.shuffle(options)
    labels = [chr(i) for i in range(ord('A'), ord('A') + len(options))]
    shuffled_options_dict = {label: option for label, option in zip(labels, options)}

    return shuffled_options_dict

test_samples = read_jsonl_file("final_processed_test_set_responses_medprompt.jsonl")

for question in tqdm(test_samples, colour="green"):
    question_variants = []
    prompt_variants = []
    cot_responses = []
    question_embedding = get_embedding(question["question"])
    distances, top_k_indices = knn.kneighbors([question_embedding], n_neighbors=5)
    top_k_dicts = [filtered_questions_dict[i] for i in top_k_indices[0]]
    question["outputs"] = []

    # Create 5 variants of the question with shuffled answer options
    for idx in range(5):
        question_copy = question.copy()
        shuffled_options = shuffle_option_labels(question["options"])
        inv_map = {v: k for k, v in shuffled_options.items()}

        question_copy["options"] = shuffled_options
        question_copy["answer_idx"] = inv_map[question_copy["answer"]]
        question_variants.append(question_copy)
        prompt = build_few_shot_prompt(system_prompt, question_copy, top_k_dicts)
        prompt_variants.append(prompt)

    # Obtain a CoT response for each shuffled variant
    for prompt in tqdm(prompt_variants):
        response = get_response(prompt, model_name="gpt-4o", max_tokens=500)
        cot_responses.append(response)

    # Parse each response; fall back to empty strings if the format is invalid
    for question_sample, answer in zip(question_variants, cot_responses):
        if validate_response(answer):
            cot, pred_ans = parse_answer(answer)
        else:
            cot = ""
            pred_ans = ""

        question["outputs"].append(
            {"question": question_sample["question"],
             "options": question_sample["options"],
             "cot": cot,
             "pred_ans": question_sample["options"].get(pred_ans, "")})

We now evaluate the results of Medprompt over the test set. For each question, we have five predictions generated through the ensemble logic. We take the mode, or most frequently occurring prediction, for each question as the final prediction and evaluate the performance. Two edge cases are possible here:

  1. Two different answer options are predicted two times each, with no clear winner.
  2. There is an error with the response generated, meaning that we don’t have a predicted answer option.

For both of these edge cases, we consider the question to be wrongly answered by the LLM.

from collections import Counter

def find_mode_string_list(string_list):
    """
    Finds the most frequently occurring strings.

    Parameters:
        string_list (list of str): A list of strings.

    Returns:
        list of str or None: A list containing the most frequent string(s) from the input list.
        Returns None if the input list is empty.
    """
    if not string_list:
        return None

    string_counts = Counter(string_list)
    max_freq = max(string_counts.values())
    mode_strings = [string for string, count in string_counts.items() if count == max_freq]
    return mode_strings

ctr = 0
for item in test_samples:
    pred_ans = [x["pred_ans"] for x in item["outputs"]]
    freq_ans = find_mode_string_list(pred_ans)

    if len(freq_ans) > 1:
        final_prediction = ""
    else:
        final_prediction = freq_ans[0]

    if final_prediction == item["answer"]:
        ctr += 1

print(ctr / len(test_samples))

We evaluate the performance of Medprompt with GPT-4o in terms of accuracy on the MedQA test subset. Additionally, we benchmark the performance of Zero-shot prompting, Random Few-Shot prompting, and Random Few-Shot with CoT prompting.

Results of our evaluation (Image by Author)

We observe that Medprompt and Random Few-Shot CoT prompting outperform the Zero-Shot and Few-Shot prompting baselines. However, surprisingly, we notice that Random Few-Shot CoT outperforms our Medprompt implementation. This could be due to a couple of reasons:

  1. The original Medprompt paper benchmarked the performance of GPT-4. We observe that GPT-4o outperforms GPT-4T and GPT-4 on various text benchmarks significantly (https://openai.com/index/hello-gpt-4o/), indicating that Medprompt could have a lesser effect on a stronger model like GPT-4o.
  2. We restrict our evaluation to 500 questions subsampled from MedQA. The Medprompt paper evaluates other Medical MCQA datasets and the full version of MedQA. Evaluating GPT-4o on the complete versions of the datasets could give a better picture of the overall performance.

Medprompt is an interesting framework for creating sophisticated prompting pipelines, particularly for adapting a generalist LLM to a specific domain without the need for fine-tuning. It also highlights the considerations involved in deciding between prompting and fine-tuning for various use cases. Exploring how far prompting can be pushed to enhance LLM performance is important, as it offers a resource and cost-efficient alternative to fine-tuning.

[1] Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., … & Horvitz, E. (2023). Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452. (https://arxiv.org/abs/2311.16452)

[2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. (https://openreview.net/pdf?id=_VjQlMeSB_J)

[3] Gekhman, Z., Yona, G., Aharoni, R., Eyal, M., Feder, A., Reichart, R., & Herzig, J. (2024). Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?. arXiv preprint arXiv:2405.05904. (https://arxiv.org/abs/2405.05904)

[4] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., … & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180. (https://www.nature.com/articles/s41586-023-06291-2)

[5] Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., … & Natarajan, V. (2023). Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617. (https://arxiv.org/abs/2305.09617)

[6] Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), 6421. (https://arxiv.org/abs/2009.13081) (Original dataset is released under a MIT License)



Source link

06Jul

LLM Disruption in Chatbot Development Frameworks | by Cobus Greyling | Jul, 2024


Large Language Models (LLMs) have introduced more human-like and contextually aware interactions, allowing developers to build sophisticated chatbots with minimal effort. This innovation reduces the need for extensive rule-based programming and enables rapid deployment across various applications. However, there are challenges…

The image above outlines the various elements and features constituting a Large Language Model (LLM).

The challenge lies in accessing each of these features at the appropriate time, ensuring stability, predictability, and, to a certain extent, reproducibility.

Many organisations and technology providers are navigating the transition from traditional chatbots to incorporating Large Language Models with varying levels of success.

✨ Traditional Chatbot IDEs

Traditional chatbots typically consist of four basic elements:

  1. Intent Detection (NLU)
  2. Entity Extraction (NLU)
  3. Response Messages (Message Abstraction Layer)
  4. Dialog-turn/Conversation State Management (Dialog Flow Control)

Recently, numerous attempts have been made to reimagine this structure, aiming to loosen the rigidity of hard-coded and fixed architectural components.

✨ Natural Language Understanding (NLU)

The NLU engine is the only “AI” component of the chatbot, responsible for detecting intents and entities from the input.

It includes a GUI for defining training data and managing the model.

The typical advantages of NLU engines are:

  • Numerous open-source models.
  • Small footprint and not resource-intensive, making local and edge installations feasible.
  • No-code UIs.
  • Extensive corpus of named entities due to long-term usage.
  • Predefined entities and training data for specific verticals, such as banking, help desks, HR, etc.
  • Rapid model training, with the ability to train models multiple times per day in a production environment.
  • Initial introduction of LLMs for generating NLU training data.
  • Use of LLMs to generate training data for NLU models based on existing conversations and sample data.

✨ Conversation Flow & Dialog Management

The dialog flow and logic are designed and built within a no-code to low-code GUI.

The flow and logic follow a predefined path with specific logic points.

The conversation progresses based on input data matching certain criteria at these logic gates.

Efforts have been made to introduce flexibility to the flow, aiming to add some semblance of intelligence.

✨ Message Abstraction Layer

The message abstraction layer holds predefined bot responses for each dialog turn. These responses are fixed, with templates sometimes used to insert data and create personalised messages.

Managing these messages becomes challenging as the chatbot application grows, and the static nature of the messages can lead to a significant total number.

Introducing multilingual chatbots adds considerable complexity. Whenever the tone or persona of the chatbot needs to change, all of these messages must be revisited and updated.

This is also one of the areas where LLMs were first introduced to leverage the power of Natural Language Generation (NLG).

✨ Out-Of-Domain

Out-Of-Domain questions are handled by knowledge bases & semantic similarity searches.

Knowledge bases were primarily used for Q&A, and the solutions made use of semantic search. In many regards, this could be considered an early version of RAG (Retrieval-Augmented Generation).
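A minimal sketch (not from the article) of this style of out-of-domain handling, using semantic similarity search over a small knowledge base with the sentence-transformers package (an assumed dependency):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A toy knowledge base of canned answers
knowledge_base = [
    "Our support desk is open weekdays from 08:00 to 17:00.",
    "Password resets can be requested from the account settings page.",
    "Refunds are processed within five business days.",
]
kb_embeddings = model.encode(knowledge_base, convert_to_tensor=True)

def answer_out_of_domain(user_query, threshold=0.5):
    query_embedding = model.encode(user_query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, kb_embeddings)[0]
    best = int(scores.argmax())
    # Fall back to a default message if nothing in the KB is similar enough
    if float(scores[best]) >= threshold:
        return knowledge_base[best]
    return "Sorry, I can't help with that."

print(answer_out_of_domain("How do I reset my password?"))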

In conclusion, the integration of Large Language Models (LLMs) into chatbot development frameworks marks a significant leap forward in creating more human-like and contextually aware interactions.

By reducing the reliance on rigid, rule-based programming, LLMs enable developers to build sophisticated chatbots with greater ease and speed.

However, this transition is not without its challenges.

Accessing and effectively utilising the various features of LLMs while ensuring stability and predictability remains a critical concern.

Organisations and technology providers are actively navigating these complexities as they embrace LLMs, each with varying degrees of success.

As innovations in Natural Language Understanding (NLU) and Natural Language Generation (NLG) continue to evolve, the future promises even more seamless and intelligent chatbot interactions, reshaping how we interact with technology in diverse applications.

👉🏼 Follow me on LinkedIn for updates on Large Language Models

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

LinkedIn



Source link

05Jul

LLM Alignment: Reward-Based vs Reward-Free Methods | by Anish Dubey | Jul, 2024


Optimization methods for LLM alignment

Language models have demonstrated remarkable abilities in producing a wide range of compelling text based on prompts provided by users. However, defining what constitutes “good” text is challenging, as it often depends on personal preferences and the specific context. For instance, in storytelling, creativity is key; in crafting informative content, accuracy and reliability are crucial; and when generating code, ensuring it runs correctly is essential. Hence the “LLM alignment problem,” which refers to the challenge of ensuring that large language models (LLMs) act in ways that are consistent with human values, intentions, and preferences.

Designing a loss function that captures the diverse qualities we value in text — like creativity, accuracy, or executability — is highly complex and often impractical. Concepts like these are not differentiable, cannot be back-propagated, and so cannot be trained for with simple next-token generation.

Imagine if we could harness human feedback to evaluate the quality of generated text or, even better, use that feedback as a guiding loss function to improve the model’s performance. This concept is at the heart of Reinforcement Learning from Human Feedback (RLHF). By applying reinforcement learning techniques, RLHF allows us to fine-tune language models based on direct human feedback, aligning the models more closely with nuanced human values and expectations. This approach has opened up new possibilities for training language models that are not only more responsive but also more aligned with the complexity of human preferences.

Below, we will first look at RLHF via reward-based methods, and then at reward-free methods.

Let’s go through Reinforcement Learning from Human Feedback (RLHF). It consists of three main stages:

  1. Supervised fine tuning
  2. Reward modeling phase
  3. RL fine-tuning phase

Supervised fine tuning

RLHF starts from a pre-trained model which has already been fine-tuned on a high-quality dataset. Its objective is simple: given an input (prompt), it produces an output. The ultimate objective here is to further fine-tune this model to produce output according to human preference. Hence, let’s call this the base model for reference. Currently, this is a vanilla base model which is not aware of any human preference.

Reward Modelling Phase

Reward model innovation: This is where the innovation begins on how reward models are incorporated into RLHF. The idea behind the reward model is that a new LLM, which can be the same as the above-mentioned base model, will have the ability to generate a human preference score. The reason it is similar to a large language model is that this model also needs to understand language semantics before it can rate whether an output is human preferred or not. Since the reward is scalar, we add a linear layer on top of the LLM to generate a scalar score in terms of human preference.

Data collection phase: This builds on the supervised fine-tuning stage, where the base model is asked to generate two outputs for a given input. Example: for an input x, two outputs, y1 and y2, are generated by the base model. These outputs are shown to human raters, and the human preference is recorded for each output.

Training phase: Once the data sample is collected from the data collection phase, the reward model is trained with a prompt along the lines of: “Given the following input: [input], the LLM generated the output: [output]. Can you rate the performance of the output?”. The model outputs a reward r, and we already know which output was preferred from the data collection phase. This can be back-propagated with the loss function and the model can be trained. Below is the objective loss function which the model optimises through back-propagation:

Equation from this paper: https://arxiv.org/pdf/2305.18290
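Written out in the notation below (following the cited paper), the reward-model objective is a Bradley–Terry style negative log-likelihood over preference pairs:

L_R(rΦ, Ɗ) = −E(x, yw, yl)∼Ɗ [ log σ( rΦ(x, yw) − rΦ(x, yl) ) ]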

Notation:

  • rΦ(x, y): a reward model parameterized by Φ which estimates the reward. Parameterized means we don’t know the actual values and these need to be optimized from the above equation. This is the reward LLM itself. Mostly, the LLM parameters are frozen here and only a few parameters are left to change. The most important layer is the linear layer added at the top, which does most of the learning to rate the score of an output.
  • Ɗ: A dataset of triplets (x, yw, yl) where x: input, yw: the winner output and yl: the loser output
  • σ: the sigmoid function which maps the difference in reward to a probability (0–1)
  • E(x, yw, yl)∼Ɗ means the loss is averaged (an expectation) over triplets x, yw, yl sampled from Ɗ

Example scenario: Imagine you’re training a reward model to evaluate responses. You have pairs of responses to a given prompt, and human feedback tells you which response is better. For context, x(“What is the capital of France?”), you have yw(“The capital of France is Paris.”) as winner and yl(“The capital of France is Berlin.” ) as loser. The reward model should eventually learn to give higher reward for “The capital of France is Paris.” output when compared to “The capital of France is Berlin.” output if “What is the capital of France?” input is given.

RL fine-tuning phase

Reinforcement learning idea: Now that the base model and reward model are trained, the idea is to leverage the reward model score and update the base model parameters to reflect human preference. Since the reward model outputs a scalar score and is not differentiable, we cannot use simple back-propagation to update the base model parameters. Hence, we need other techniques to update the base model. This is where reinforcement learning comes in, helping the base model change its parameters through the reward model score. This is done through PPO (proximal policy optimization). Understanding the core architecture of PPO is not required to grasp this concept, so we will not cover it here, but on a high level the idea is that PPO can use a scalar score to update base model parameters. Now let’s understand how the base and reward models are combined to make the base model learn human preference.

RL fine-tuning idea: In reinforcement learning, we have actions, states, and rewards. The idea is to come up with a policy for the actions an agent can take in the state space which maximizes the reward. This becomes quite complicated, but in a simplified sense, π is the policy, which is our base LLM model. Πref denotes the base model and ΠӨ denotes a different, optimal LLM which we are trying to produce. We need to find ΠӨ (the base model’s neural network weights will be fine-tuned) which gives human-preferred output. It’s just that we don’t know ΠӨ, and the idea is to find this optimal model.

RL training and feedback loop phase: An input x is given to 2 policy models, Πref (baseline model) and ΠӨ (optimal model which we are trying to generate). Initially both models are kept the same. Input x to two models individually will give two outputs correspondingly. The output from ΠӨ model is also fed to reward model (input: x, output: y; as discussed above) and asked to output the reward score which is rΦ(x, y). Now we have 3 things, output from the baseline model, output from the optimal model, and a reward score from the optimal model. There are 2 things we are optimizing here, one is to maximize the reward because eventually we want the model to be as close as human preference and another is to minimize the divergence from baseline model. Maximizing the reward is easy since it is already a scalar quantity but how do we minimize the divergence of baseline and optimal model. Here we use “Kullback–Leibler divergence” which estimates the difference between 2 continuous probability distributions. Let’s take a deeper look into the objective loss function

Equation from this paper: https://arxiv.org/pdf/2305.18290
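Written out (following the cited paper), the RL fine-tuning objective maximises the reward while penalising divergence from the baseline model:

max over ΠӨ of:  E x∼Ɗ, y∼ΠӨ(y|x) [ rΦ(x, y) ] − β · D_KL( ΠӨ(y|x) || Πref(y|x) )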

Notation:

  • rΦ(x, y): a scalar value for an input x and output y (from optimal model). To be explicit, output from the optimal model is fed into the reward model.
  • Dkl (ΠӨ (y | x) || Πref (y | x)): This computes the Kullback–Leibler divergence between 2 probability distributions. Each token from each model is a probability distribution. KL estimates how far the distribution is from each other.
  • β : Hyperparameter which is used to determine how important it is to have optimal model close to baseline model.

Example scenario: Imagine you are asking (“What is the capital of France?”), Πref (baseline model) says: “The capital of France is Berlin.” and ΠӨ (optimal model) “There are 3 capitals, Paris, Versailles, and Lyon, but Paris is considered as the official capital”. Now rΦ(“x: What is the capital…”, “y: There are 3 capital..”) should give low score as it is less human-preferred and Kullback–Leibler divergence of (ΠӨ (y | x) || Πref (y | x)) should be high as well since the probability distribution space differs for both individual output. Hence the loss will be high from both terms. We do not want the model to only optimize for reward but also stay closer to the baseline model and hence both the terms are used to optimize the reward. In the next iteration with learning let’s say, ΠӨ (optimal model) says “The capital of France is Delhi”, in this case model learned to stay closer to Πref (baseline model) and output the format closer to baseline model but the reward component will still be lower. Hopefully, in the third iteration ΠӨ (optimal model) should be able to learn and output “The capital of France is Paris” with higher reward and model output aligning closely with baseline model.

The diagram below helps illustrate the logic. I also highly recommend going through the RLHF post from Hugging Face.

Image by author, inspired by https://huggingface.co/blog/rlhf

With RLHF using a reward-based method in mind, let’s move to the reward-free method. According to the paper: “our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences”. Very complicated to understand, but let’s try to break this down in simple phases in the next section.

Reward-free method’s key idea: In RLHF, a separate new reward model is trained which is expensive and costly to maintain. Is there any mechanism to avoid training a new reward model and use the existing base model to achieve a new optimal model? This is exactly what reward-free method does i.e. it avoids training a new reward model and in turn changes the equation in such a way that there is no reward model term in the loss function of DPO (Direct preference optimization). One way to think about this is that we need to reach optimal model policy(ΠӨ) from base model (Πref). It can be reached either through optimizing the reward function space which helps build a proxy to reach optimal model policy or directly learning a mapping function from reward to policy and in turn optimize for policy itself. This is exactly what the authors have tried by removing the reward function component in loss function and substitute it directly by model policy parameter. This is what the author meant when they say “leverage an analytical mapping from reward function to optimal policies …. into a loss function over policies”. This is the core innovation of the paper.

DPO training and feedback loop phase: Using Πref (the baseline model), an input x is given and the model is asked to produce two outputs (y1 and y2). The x, y1 and y2 are shown to human raters to decide the winning yw and losing yl. An offline dataset is collected with the triplet information (x, yw, yl). With this information, we know what the winning (human-preferred) and losing (human-not-preferred) answers are. Now, the same input x is given to two policies (models), Πref (baseline model) and ΠӨ (optimal model). Initially both models are kept the same for training purposes. Feeding input x to the two models individually gives two outputs correspondingly. We compute how far each output is from the winning and losing answers for both the reference and optimal model through “Kullback–Leibler divergence”. Let’s take a deeper look into the objective loss function.

L_DPO(Πθ; Πref) = −E_(x, yw, yl)∼D [ log σ( β · log(Πθ(yw | x) / Πref(yw | x)) − β · log(Πθ(yl | x) / Πref(yl | x)) ) ]

Equation from https://arxiv.org/pdf/2305.18290
  • Πθ(yw | x) -> given the input x, the probability the optimal model assigns to the winning output yw. The same quantity is computed for every combination: Πref(yw | x), Πref(yl | x), Πθ(yw | x) and Πθ(yl | x). The log-ratios of these probabilities (optimal model over baseline model) are scalar values that measure how far the optimal model has moved from the baseline, and they are what enter the loss above; a minimal code sketch of this loss appears after this list.
  • β: a hyperparameter that determines how strongly the optimal model is kept close to the baseline model.
  • Naturally, the question comes down to which one is better: RLHF through the reward-based method using PPO, or the reward-free method using DPO. There is no definitive answer. A recent paper, “Is DPO Superior to PPO for LLM Alignment?” (paper link), compares the two and concludes that PPO is generally better than DPO and that DPO suffers more heavily from out-of-distribution data. “Out-of-distribution” data means the human-preference data differs from the data the baseline model was trained on, which can happen when the base model is trained on one dataset while the preference data is collected on another.
  • Overall, the jury is still out on which one is better, while we have seen companies like OpenAI, Anthropic, and Meta leverage both RLHF via PPO and DPO as tools for LLM alignment.
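To make the loss concrete, here is a minimal PyTorch-style sketch of the DPO objective above, assuming the per-sequence log-probabilities log Πθ(y | x) and log Πref(y | x) have already been computed for the winning and losing answers. The function name and argument layout are illustrative, not taken from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Minimal DPO loss sketch.

    Each argument is a (batch,) tensor of summed log-probabilities of the
    winning (w) or losing (l) answer under the policy (Pi_theta) or the
    frozen reference model (Pi_ref). beta controls how strongly the policy
    is kept close to the reference.
    """
    # Implicit rewards: how much more likely each answer is under the policy than the reference.
    chosen_rewards = beta * (policy_logp_w - ref_logp_w)
    rejected_rewards = beta * (policy_logp_l - ref_logp_l)
    # Maximize the margin between the chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs:
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()).item())
```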



Source link

04Jul

TinyStories Is A Synthetic DataSet Created With GPT-4 & Used To Train Phi-3 | by Cobus Greyling | Jul, 2024


The Small Language Model from Microsoft, called Phi-3, was trained using a novel dataset called TinyStories.

Microsoft used the following recipe to create synthetic training data for the Phi-3 language model:

  1. Microsoft researchers created a discrete dataset based on 3,000 words, comprising roughly equal numbers of nouns, verbs, and adjectives.
  2. They then instructed an LLM to create children’s stories using one noun, one verb, and one adjective from the list.
  3. This prompt was repeated millions of times over several days, generating millions of tiny children’s stories (a minimal sketch of the recipe follows this list).
  4. The TinyStories dataset was created to combine all the qualitative elements of natural language, such as grammar, vocabulary, facts, and reasoning.
  5. The main challenge in using large language models for producing training data is generating a dataset that is sufficiently diverse.
  6. This method also forces the LLM to not be too repetitive in the content generated.
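A minimal, hypothetical sketch of that recipe is shown below; the word lists and the prompt wording are illustrative placeholders, not the actual vocabulary or prompts Microsoft used.

```python
import random

# Illustrative stand-ins for the ~3,000-word vocabulary described above.
NOUNS = ["dog", "garden", "boat", "lamp"]
VERBS = ["jump", "whisper", "build", "share"]
ADJECTIVES = ["tiny", "brave", "shiny", "quiet"]

def make_story_prompt() -> str:
    """Build one story-generation prompt in the spirit of the TinyStories recipe."""
    noun, verb, adjective = (random.choice(w) for w in (NOUNS, VERBS, ADJECTIVES))
    return (
        "Write a short story for young children using simple words. "
        f"The story must contain the noun '{noun}', the verb '{verb}' "
        f"and the adjective '{adjective}'."
    )

# Repeating this construction (and sending each prompt to an LLM) millions of
# times yields a large and varied collection of tiny stories.
print(make_story_prompt())
```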

The Small Language Model (SLM) Phi-3 was trained on synthetic data generated by GPT-3.5 and GPT-4. Training data created by large language models can often be too repetitive and lack diversity in verbs, nouns, and adjectives.

The dataset needed to include all the qualitative elements of natural language, such as grammar, vocabulary, facts, and reasoning, but it was designed to be smaller, less diverse, and more restricted in content.

The concept of creating a framework or data topology for the LLM to generate synthetic training data is intriguing.

The study indicates that training generative models on TinyStories can typically be completed in less than a day on a single GPU, while still exhibiting behaviours similar to those observed in larger models.

Instead of relying solely on raw web data, the creators of Phi-3 sought high-quality data. Microsoft researchers created a discrete dataset based on 3,000 words, comprising roughly equal numbers of nouns, verbs, and adjectives.

They then instructed a large language model to create children’s stories using one noun, one verb, and one adjective from the list — a prompt repeated millions of times over several days, generating millions of tiny children’s stories.

Small language models are designed to excel at simpler tasks, making them more accessible and easier to use for organisations with limited resources. They can also be fine-tuned more easily to meet specific needs.



Source link

01Jul

LangChain Just Launched LangGraph Cloud | by Cobus Greyling | Jul, 2024


LangGraph is a fairly recent addition to the ever-expanding LangChain ecosystem. With the launch of LangGraph Cloud, a managed, hosted service is introduced for deploying and hosting LangGraph applications.

The LangChain ecosystem is unfolding at a rapid pace, with a combination of Open Source Software (OSS) and Commercial software. The Commercial software includes LangSmith & LangGraph Cloud.


We are all starting to realise that Agentic Applications will become a standard in the near future. The advantages of Agents are numerous…but to name a few:

  1. Agents can handle complex, ambiguous and more implicit user queries in an automated fashion.
  2. Underpinning agents is the capability to create a chain of events on the fly based on the task assigned by the user.
  3. Agents make use of an LLM which acts as the backbone of the agent.
  4. When the agent receives a user query, the agent decomposes the task into sub-tasks, which are then executed in a sequential fashion.
  5. One or more tools are made available to the agent, which it can employ as it deems fit. The agent decides which tool to use based on the tool description that forms part of each tool.
  6. A tool is a unit of capability which includes tasks like web search, mathematics, API calls and more.

Impediments to and apprehension about agent adoption include:

  1. LLM inference cost. The backbone LLM is queried multiple times during the course of a single query; should an agent have a large number of users, inference costs can skyrocket.
  2. Controllability, inspectability, observability and more granular control are much needed. In the market there is a fear that agents are too autonomous.
  3. Agents broke the glass ceiling of chatbots, but by a little too much; and some measure of control is now required.
  4. For more complex agents, to decrease latency, there is a requirement to run tasks in parallel, and to stream not only LLM responses but also agent responses as they become available.

LangGraph is framework-agnostic, with each node functioning as a regular Python function (a minimal sketch follows the list below).

It extends the core Runnable API (a shared interface for streaming, async, and batch calls) to facilitate:

  1. Seamless state management across multiple conversation turns or tool calls.
  2. Flexible routing between nodes based on dynamic criteria
  3. Smooth transitions between LLMs and human intervention
  4. Persistence for long-running, multi-session applications
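As a rough illustration of what a node and graph definition look like, here is a minimal sketch using the open-source langgraph package; the exact API surface may differ between versions, so treat the class and method names as indicative rather than definitive.

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph  # assumes the langgraph package is installed


class ChatState(TypedDict):
    question: str
    answer: str


def answer_node(state: ChatState) -> dict:
    # Each node is a plain Python function: it receives the current state
    # and returns a partial state update.
    return {"answer": f"You asked: {state['question']}"}


builder = StateGraph(ChatState)
builder.add_node("answer", answer_node)
builder.set_entry_point("answer")
builder.add_edge("answer", END)

graph = builder.compile()
print(graph.invoke({"question": "What is LangGraph?", "answer": ""}))
```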

The basic personal workflow is shown below: a user develops their LangGraph application within their IDE of choice and then pushes the code to GitHub.

From LangGraph Cloud, the GitHub code can be accessed and deployed. Within LangGraph Cloud, applications can be tested, traces can be run, interrupts can be added, and more.



Source link

30Jun

The History of Convolutional Neural Networks for Image Classification (1989 – Today) | by Avishek Biswas | Jun, 2024


A visual tour of the greatest innovations in Deep Learning and Computer Vision.

Before CNNs, the standard way to train a neural network to classify images was to flatten the image into a list of pixels and pass it through a feed-forward neural network to output the image’s class. The problem with flattening the image is that the essential spatial information is discarded.

In 1989, Yann LeCun and team introduced Convolutional Neural Networks — the backbone of Computer Vision research for the last 15 years! Unlike feedforward networks, CNNs preserve the 2D nature of images and are capable of processing information spatially!

In this article, we are going to go through the history of CNNs specifically for Image Classification tasks — starting from those early research years in the ’90s, to the golden era of the mid-2010s when many of the most ingenious Deep Learning architectures were conceived, and finally the latest trends in CNN research as CNNs compete with attention and vision transformers.

Check out the YouTube video that explains all the concepts in this article visually with animations. Unless otherwise specified, all the images and illustrations used in this article were created by me while making the video version.

The papers we will be discussing today!

At the heart of a CNN is the convolution operation. We scan the filter across the image and calculate the dot product of the filter with the image at each overlapping location. This resulting output is called a feature map and it captures how much and where the filter pattern is present in the image.

How Convolution works — The kernel slides over the input image and calculates the overlap (dot-product) at each location — outputting a feature map in the end!
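For readers who prefer code, here is a small NumPy sketch of the operation described above: sliding a single filter over a grayscale image and recording the dot product at each location (the sizes are arbitrary examples).

```python
import numpy as np

def conv2d_single(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Valid (no padding) 2D convolution of one grayscale image with one kernel."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride : i * stride + kh, j * stride : j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # dot product of filter and patch
    return feature_map

image = np.random.rand(16, 16)   # e.g. a 16x16 grayscale digit
kernel = np.random.rand(5, 5)    # one 5x5 filter
print(conv2d_single(image, kernel, stride=2).shape)  # -> (6, 6) feature map
```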

In a convolution layer, we train multiple filters that extract different feature maps from the input image. When we stack multiple convolutional layers in sequence with some non-linearity, we get a convolutional neural network (CNN).

So each convolution layer simultaneously does 2 things —
1. spatial filtering with the convolution operation between images and kernels, and
2. combining the multiple input channels and outputting a new set of channels.

90 percent of the research in CNNs has been to modify or to improve just these two things.

The two main things CNN do

The 1989 Paper

This 1989 paper taught us how to train non-linear CNNs from scratch using backpropagation. The inputs are 16×16 grayscale images of handwritten digits, which pass through two convolutional layers with 12 filters of size 5×5. The filters move with a stride of 2 during scanning; strided convolution is useful for downsampling the input image. After the conv layers, the output maps are flattened and passed through two fully connected layers to output the probabilities for the 10 digits. Using the softmax cross-entropy loss, the network is optimized to predict the correct labels for the handwritten digits. A tanh nonlinearity follows each layer, allowing the learned feature maps to be more complex and expressive. With just 9,760 parameters, this was a very small network compared to today’s networks, which contain hundreds of millions of parameters.

The OG CNN architecture from 1989
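A rough PyTorch sketch of an architecture in the spirit of the description above is given below. The original 1989 network used constrained connectivity between feature maps, so the exact wiring and parameter count differ slightly from this fully connected approximation; the 30-unit hidden layer is chosen here so the total lands near the reported ~9,760 parameters.

```python
import torch
import torch.nn as nn

class LeNet1989Style(nn.Module):
    """Approximation of the 1989 digit classifier: two strided 5x5 conv layers
    with tanh, followed by two fully connected layers over 10 classes."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 12, kernel_size=5, stride=2, padding=2), nn.Tanh(),   # 16x16 -> 8x8
            nn.Conv2d(12, 12, kernel_size=5, stride=2, padding=2), nn.Tanh(),  # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12 * 4 * 4, 30), nn.Tanh(),
            nn.Linear(30, 10),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = LeNet1989Style()
logits = model(torch.randn(1, 1, 16, 16))   # a batch with one 16x16 grayscale digit
print(logits.shape, sum(p.numel() for p in model.parameters()))  # torch.Size([1, 10]), ~10k params
```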

Inductive Bias

Inductive Bias is a concept in Machine Learning where we deliberately introduce specific rules and limitations into the learning process to steer our models away from overly general, unconstrained solutions and toward solutions that follow our human-like understanding of the problem.

When humans classify images, we also do spatial filtering to look for common patterns, form multiple representations, and then combine them to make our predictions. The CNN architecture is designed to replicate just that. In feedforward networks, each pixel is treated as its own isolated feature, since every neuron in a layer connects to all the pixels; in CNNs there is much more parameter sharing because the same filter scans the entire image. Inductive biases also make CNNs less data-hungry: they get local pattern recognition for free thanks to the network design, whereas feedforward networks must spend their training cycles learning it from scratch.



Source link

30Jun

A Crash Course of Planning for Perception Engineers in Autonomous Driving | by Patrick Langechuan Liu | Jun, 2024


The fundamentals of planning and decision-making

AlphaGo, ChatGPT and FSD (image credit Elena Popova, Karthik Sridasyam and Jonathan Kemper on Unsplash)

A classical modular autonomous driving system typically consists of perception, prediction, planning, and control. Until around 2023, AI (artificial intelligence) or ML (machine learning) primarily enhanced perception in most mass-production autonomous driving systems, with its influence diminishing in downstream components. In stark contrast to the low integration of AI in the planning stack, end-to-end perception systems (such as the BEV, or birds-eye-view perception pipeline) have been deployed in mass production vehicles.

Classical modular design of an autonomous driving stack, 2023 and prior (Chart created by author)

There are multiple reasons for this. A classical stack based on a human-crafted framework is more explainable and can be iterated faster to fix field test issues (within hours) compared to machine learning-driven features (which could take days or weeks). However, it does not make sense to let readily available human driving data sit idle. Moreover, increasing computing power is more scalable than expanding the engineering team.

Fortunately, there has been a strong trend in both academia and industry to change this situation. First, downstream modules are becoming increasingly data-driven and may also be integrated via different interfaces, such as the one proposed in CVPR 2023’s best paper, UniAD. Moreover, driven by the ever-growing wave of Generative AI, a single unified vision-language-action (VLA) model shows great potential for handling complex robotics tasks (RT-2 in academia, TeslaBot and 1X in industry) and autonomous driving (GAIA-1, DriveVLM in academia, and Wayve AI driver, Tesla FSD in industry). This brings the toolsets of AI and data-driven development from the perception stack to the planning stack.

This blog post aims to introduce the problem settings, existing methodologies, and challenges of the planning stack, in the form of a crash course for perception engineers. As a perception engineer, I finally had some time over the past couple of weeks to systematically learn the classical planning stack, and I would like to share what I learned. I will also share my thoughts on how AI can help from the perspective of an AI practitioner.

The intended audience for this post is AI practitioners who work in the field of autonomous driving, in particular, perception engineers.

The article is a bit long (11100 words), and the table of contents below will most likely help those who want to do quick ctrl+F searches with the keywords.

Table of Contents (ToC)

Why learn planning?
What is planning?
The problem formulation
The Glossary of Planning
Behavior Planning
Frenet vs Cartesian systems
Classical tools-the troika of planning
Searching
Sampling
Optimization
Industry practices of planning
Path-speed decoupled planning
Joint spatiotemporal planning
Decision making
What and why?
MDP and POMDP
Value iteration and Policy iteration
AlphaGo and MCTS-when nets meet trees
MPDM (and successors) in autonomous driving
Industry practices of decision making
Trees
No trees
Self-Reflections
Why NN in planning?
What about e2e NN planners?
Can we do without prediction?
Can we do with just nets but no trees?
Can we use LLMs to make decisions?
The trend of evolution

This brings us to an interesting question: why learn planning, especially the classical stack, in the era of AI?

From a problem-solving perspective, understanding your customers’ challenges better will enable you, as a perception engineer, to serve your downstream customers more effectively, even if your main focus remains on perception work.

Machine learning is a tool, not a solution. The most efficient way to solve problems is to combine new tools with domain knowledge, especially those with solid mathematical formulations. Domain knowledge-inspired learning methods are likely to be more data-efficient. As planning transitions from rule-based to ML-based systems, even with early prototypes and products of end-to-end systems hitting the road, there is a need for engineers who can deeply understand both the fundamentals of planning and machine learning. Despite these changes, classical and learning methods will likely continue to coexist for a considerable period, perhaps shifting from an 8:2 to a 2:8 ratio. It is almost essential for engineers working in this field to understand both worlds.

From a value-driven development perspective, understanding the limitations of classical methods is crucial. This insight allows you to effectively utilize new ML tools to design a system that addresses current issues and delivers immediate impact.

Additionally, planning is a critical part of all autonomous agents, not just in autonomous driving. Understanding what planning is and how it works will enable more ML talents to work on this exciting topic and contribute to the development of truly autonomous agents, whether they are cars or other forms of automation.

The problem formulation

As the “brain” of autonomous vehicles, the planning system is crucial for the safe and efficient driving of vehicles. The goal of the planner is to generate trajectories that are safe, comfortable, and efficiently progressing towards the goal. In other words, safety, comfort, and efficiency are the three key objectives for planning.

As input to the planning systems, all perception outputs are required, including static road structures, dynamic road agents, free space generated by occupancy networks, and traffic wait conditions. The planning system must also ensure vehicle comfort by monitoring acceleration and jerk for smooth trajectories, while considering interaction and traffic courtesy.

The planning systems generate trajectories in the format of a sequence of waypoints for the ego vehicle’s low-level controller to track. Specifically, these waypoints represent the future positions of the ego vehicle at a series of fixed time stamps. For example, each point might be 0.4 seconds apart, covering an 8-second planning horizon, resulting in a total of 20 waypoints.
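As a trivial sketch of that output format, the trajectory can be represented as a list of timestamped waypoints; the field names below are illustrative, not a standard interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Waypoint:
    x: float   # future ego position [m]
    y: float
    t: float   # time offset from the current planning cycle [s]

def empty_trajectory(dt: float = 0.4, horizon: float = 8.0) -> List[Waypoint]:
    """One waypoint every dt seconds over the planning horizon: 8.0 / 0.4 = 20 points."""
    n = int(round(horizon / dt))
    return [Waypoint(0.0, 0.0, (i + 1) * dt) for i in range(n)]

assert len(empty_trajectory()) == 20
```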

A classical planning stack roughly consists of global route planning, local behavior planning, and local trajectory planning. Global route planning provides a road-level path from the start point to the end point on a global map. Local behavior planning decides on a semantic driving action type (e.g., car following, nudging, side passing, yielding, and overtaking) for the next several seconds. Based on the decided behavior type from the behavior planning module, local trajectory planning generates a short-term trajectory. The global route planning is typically provided by a map service once navigation is set and is beyond the scope of this post. We will focus on behavior planning and trajectory planning from now on.

Behavior planning and trajectory generation can work explicitly in tandem or be combined into a single process. In explicit methods, behavior planning and trajectory generation are distinct processes operating within a hierarchical framework, working at different frequencies, with behavior planning at 1–5 Hz and trajectory planning at 10–20 Hz. Despite being highly efficient most of the time, adapting to different scenarios may require significant modifications and fine-tuning. More advanced planning systems combine the two into a single optimization problem. This approach ensures feasibility and optimality without any compromise.

Classification of planning design approaches (source: Fluid Dynamics Planner)

The Glossary of Planning

You might have noticed that the terminology used in the above section and the image do not completely match. There is no standard terminology that everyone uses. Across both academia and industry, it is not uncommon for engineers to use different names to refer to the same concept and the same name to refer to different concepts. This indicates that planning in autonomous driving is still under active development and has not fully converged.

Here, I list the notation used in this post and briefly explain other notions present in the literature.

  • Planning: A top-level concept, parallel to control, that generates trajectory waypoints. Together, planning and control are jointly referred to as PnC (planning and control).
  • Control: A top-level concept that takes in trajectory waypoints and generates high-frequency steering, throttle, and brake commands for actuators to execute. Control is relatively well-established compared to other areas and is beyond the scope of this post, despite the common notion of PnC.
  • Prediction: A top-level concept that predicts the future trajectories of traffic agents other than the ego vehicle. Prediction can be considered a lightweight planner for other agents and is also called motion prediction.
  • Behavior Planning: A module that produces high-level semantic actions (e.g., lane change, overtake) and typically generates a coarse trajectory. It is also known as task planning or decision making, particularly in the context of interactions.
  • Motion Planning: A module that takes in semantic actions and produces smooth, feasible trajectory waypoints for the duration of the planning horizon for control to execute. It is also referred to as trajectory planning.
  • Trajectory Planning: Another term for motion planning.
  • Decision Making: Behavior planning with a focus on interactions. Without ego-agent interaction, it is simply referred to as behavior planning. It is also known as tactical decision making.
  • Route Planning: Finds the preferred route over road networks, also known as mission planning.
  • Model-Based Approach: In planning, this refers to manually crafted frameworks used in the classical planning stack, as opposed to neural network models. Model-based methods contrast with learning-based methods.
  • Multimodality: In the context of planning, this typically refers to multiple intentions. This contrasts with multimodality in the context of multimodal sensor inputs to perception or multimodal large language models (such as VLM or VLA).
  • Reference Line: A local (several hundred meters) and coarse path based on global routing information and the current state of the ego vehicle.
  • Frenet Coordinates: A coordinate system based on a reference line. Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. See below for a more detailed introduction.
  • Trajectory: A 3D spatiotemporal curve, in the form of (x, y, t) in Cartesian coordinates or (s, l, t) in Frenet coordinates. A trajectory is composed of both path and speed.
  • Path: A 2D spatial curve, in the form of (x, y) in Cartesian coordinates or (s, l) in Frenet coordinates.
  • Semantic Action: A high-level abstraction of action (e.g., car following, nudge, side pass, yield, overtake) with clear human intention. Also referred to as intention, policy, maneuver, or primitive motion.
  • Action: A term with no fixed meaning. It can refer to the output of control (high-frequency steering, throttle, and brake commands for actuators to execute) or the output of planning (trajectory waypoints). Semantic action refers to the output of behavior planning.

Different literature may use various notations and concepts; these variations illustrate the diversity in terminology and the evolving nature of the field.

Behavior Planning

As a machine learning engineer, you may notice that the behavior planning module is a heavily manually crafted intermediate module. There is no consensus on the exact form and content of its output. Concretely, the output of behavior planning can be a reference path or object labeling on ego maneuvers (e.g., pass from the left or right-hand side, pass or yield). The term “semantic action” has no strict definition and no fixed methods.

The decoupling of behavior planning and motion planning increases efficiency in solving the extremely high-dimensional action space of autonomous vehicles. The actions of an autonomous vehicle need to be reasoned at typically 10 Hz or more (time resolution in waypoints), and most of these actions are relatively straightforward, like going straight. After decoupling, the behavior planning layer only needs to reason about future scenarios at a relatively coarse resolution, while the motion planning layer operates in the local solution space based on the decision made by behavior planning. Another benefit of behavior planning is converting non-convex optimization to convex optimization, which we will discuss further below.

Frenet vs Cartesian systems

The Frenet coordinate system is a widely adopted system that merits its own introduction section. The Frenet frame simplifies trajectory planning by independently managing lateral and longitudinal movements relative to a reference path. The s coordinate represents longitudinal displacement (distance along the road), while the l (or d) coordinate represents lateral displacement (side position relative to the reference path).

Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. This transformation converts non-linear road boundary constraints on curvy roads into linear ones, significantly simplifying the subsequent optimization problems. Additionally, humans perceive longitudinal and lateral movements differently, and the Frenet frame allows for separate and more flexible optimization of these movements.

Schematics on the conversion from Cartesian frame to Frenet frame (source: Cartesian Planner)
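To make the Cartesian-to-Frenet idea more tangible, here is a simplified sketch that projects a point onto a polyline reference line and returns its (s, l) coordinates; a production implementation would handle curvature, heading, and continuity far more carefully.

```python
import numpy as np

def cartesian_to_frenet(point, ref_line):
    """Project a 2D point onto a polyline reference line and return (s, l).

    s: arc length along the reference line to the projection point.
    l: signed lateral offset (positive to the left of the driving direction).
    """
    ref_line = np.asarray(ref_line, dtype=float)
    p = np.asarray(point, dtype=float)
    best = (np.inf, 0.0, 0.0)          # (distance, s, l)
    s_accum = 0.0
    for start, end in zip(ref_line[:-1], ref_line[1:]):
        vec = end - start
        length = np.linalg.norm(vec)
        t = np.clip(np.dot(p - start, vec) / length**2, 0.0, 1.0)
        proj = start + t * vec
        offset = p - proj
        dist = np.linalg.norm(offset)
        if dist < best[0]:
            # Sign from the 2D cross product: left of the path direction is positive.
            side = 1.0 if vec[0] * offset[1] - vec[1] * offset[0] >= 0 else -1.0
            best = (dist, s_accum + t * length, side * dist)
        s_accum += length
    return best[1], best[2]

ref = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0)]      # a gently curving reference line
print(cartesian_to_frenet((12.0, 2.0), ref))      # -> roughly (12.7, 0.9)
```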

The Frenet coordinate system requires a clean, structured road graph with low curvature lanes. In practice, it is preferred for structured roads with small curvature, such as highways or city expressways. However, the issues with the Frenet coordinate system are amplified with increasing reference line curvature, so it should be used cautiously on structured roads with high curvature, like city intersections with guide lines.

For unstructured roads, such as ports, mining areas, parking lots, or intersections without guidelines, the more flexible Cartesian coordinate system is recommended. The Cartesian system is better suited for these environments because it can handle higher curvature and less structured scenarios more effectively.

Planning in autonomous driving involves computing a trajectory from an initial high-dimensional state (including position, time, velocity, acceleration, and jerk) to a target subspace, ensuring all constraints are satisfied. Searching, sampling, and optimization are the three most widely used tools for planning.

Searching

Classical graph-search methods are popular in planning and are used in route/mission planning on structured roads or directly in motion planning to find the best path in unstructured environments (such as parking or urban intersections, especially mapless scenarios). There is a clear evolution path, from Dijkstra’s algorithm to A* (A-star), and further to hybrid A*.

Dijkstra’s algorithm explores all possible paths to find the shortest one, making it a blind (uninformed) search algorithm. It is a systematic method that guarantees the optimal path, but it is inefficient to deploy. As shown in the chart below, it explores almost all directions. Essentially, Dijkstra’s algorithm is a breadth-first search (BFS) weighted by movement costs. To improve efficiency, we can use information about the location of the target to trim down the search space.

Visualization of Dijkstra’s algorithm and A-star search (Source: PathFinding.js, example inspired by RedBlobGames)

The A* algorithm uses heuristics to prioritize paths that appear to be leading closer to the goal, making it more efficient. It combines the cost so far (Dijkstra) with the cost to go (heuristics, essentially greedy best-first). A* only guarantees the shortest path if the heuristic is admissible and consistent. If the heuristic is poor, A* can perform worse than the Dijkstra baseline and may degenerate into a greedy best-first search.
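A compact grid-world sketch makes the "cost so far plus cost to go" idea concrete; with the heuristic set to zero this reduces to Dijkstra's algorithm. This is a toy illustration on an occupancy grid, not the planner used in any particular stack.

```python
import heapq

def a_star(grid, start, goal):
    """A* on a 2D occupancy grid (0 = free, 1 = blocked) with 4-connected moves.

    Uses the Manhattan distance as an admissible heuristic; replacing h with 0
    recovers Dijkstra's algorithm.
    """
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_set = [(h(start), 0, start, None)]            # (f = g + h, g, cell, parent)
    came_from, g_cost = {}, {start: 0}
    while open_set:
        _, g, cell, parent = heapq.heappop(open_set)
        if cell in came_from:
            continue                                    # already expanded with a lower cost
        came_from[cell] = parent
        if cell == goal:                                # reconstruct the path back to start
            path = [cell]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nbr = (cell[0] + dr, cell[1] + dc)
            if (0 <= nbr[0] < len(grid) and 0 <= nbr[1] < len(grid[0])
                    and grid[nbr[0]][nbr[1]] == 0):
                ng = g + 1
                if ng < g_cost.get(nbr, float("inf")):
                    g_cost[nbr] = ng
                    heapq.heappush(open_set, (ng + h(nbr), ng, nbr, cell))
    return None                                         # no path found

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))
```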

In the specific application of autonomous driving, the hybrid A* algorithm further improves A* by considering vehicle kinematics. A* may not satisfy kinematic constraints and cannot be tracked accurately (e.g., the steering angle is typically within 40 degrees). While A* operates in grid space for both state and action, hybrid A* separates them, maintaining the state in the grid but allowing continuous action according to kinematics.

Analytical expansion (shot to goal) is another key innovation proposed by hybrid A*. A natural enhancement to A* is to connect the most recently explored node to the goal using a non-colliding straight line. If this is possible, we have found the solution. In hybrid A*, this straight line is replaced by Dubins and Reeds-Shepp (RS) curves, which comply with vehicle kinematics. This early-stopping method strikes a balance between optimality and feasibility, focusing more on feasibility for the remaining stretch toward the goal.

Hybrid A* is used heavily in parking scenarios and mapless urban intersections. Here is a very nice video showcasing how it works in a parking scenario.

Hybrid A-star algorithm with analytical expansion (source: the 2010 IJRR Hybrid A-star paper and 2012 Udacity class )

Sampling

Another popular method of planning is sampling. The well-known Monte Carlo method is a random sampling method. In essence, sampling involves selecting many candidates randomly or according to a prior, and then selecting the best one according to a defined cost. For sampling-based methods, the fast evaluation of many options is critical, as it directly impacts the real-time performance of the autonomous driving system.

Large Language Models (LLMs) essentially provide samples, and there needs to be an evaluator with a defined cost that aligns with human preferences. This evaluation process ensures that the selected output meets the desired criteria and quality standards.

Sampling can occur in a parameterized solution space if we already know the analytical solution to a given problem or subproblem. For example, we typically want to minimize the time integral of the squared jerk (the third derivative of position p(t), indicated by the triple dots over p, where each dot represents one order of derivative with respect to time), among other criteria.

Minimizing squared jerk for driving comfort (source: Werling et al, ICRA 2010)

It can be mathematically proven that quintic (5th order) polynomials provide the jerk-optimal connection between two states in a position-velocity-acceleration space, even when additional cost terms are considered. By sampling in this parameter space of quintic polynomials, we can find the one with the minimum cost to get the approximate solution. The cost takes into account factors such as speed, acceleration, jerk limit, and collision checks. This approach essentially solves the optimization problem through sampling.

Sampling of lateral movement time profiles (source: Werling et al, ICRA 2010)
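Below is a toy sketch of this kind of sampling for the lateral dimension: fit quintic polynomials between a start state and a set of sampled end states and durations, score each candidate by squared jerk plus a simple deviation penalty, and keep the cheapest. The cost weights and sampled values are arbitrary illustrations; real planners add collision checks and speed/acceleration limits.

```python
import numpy as np

def quintic_coeffs(p0, v0, a0, p1, v1, a1, T):
    """Coefficients of x(t) = c0 + c1*t + ... + c5*t^5 matching start and end
    position, velocity and acceleration over duration T."""
    A = np.array([[T**3,   T**4,    T**5],
                  [3*T**2, 4*T**3,  5*T**4],
                  [6*T,    12*T**2, 20*T**3]])
    b = np.array([p1 - (p0 + v0*T + 0.5*a0*T**2),
                  v1 - (v0 + a0*T),
                  a1 - a0])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([p0, v0, 0.5*a0, c3, c4, c5])

def jerk_cost(coeffs, T, n=100):
    """Approximate integral of squared jerk (third derivative) over [0, T]."""
    t = np.linspace(0.0, T, n)
    jerk = 6*coeffs[3] + 24*coeffs[4]*t + 60*coeffs[5]*t**2
    return float(np.mean(jerk**2) * T)

l_target = 1.0                                # desired lateral offset, e.g. a nudge around an obstacle
candidates = []
for l_end in np.linspace(0.0, 2.0, 9):        # sampled lateral end offsets [m]
    for T in (2.0, 3.0, 4.0):                 # sampled maneuver durations [s]
        c = quintic_coeffs(0.0, 0.0, 0.0, l_end, 0.0, 0.0, T)
        cost = jerk_cost(c, T) + 5.0 * (l_end - l_target)**2   # toy weights
        candidates.append((cost, l_end, T))
best_cost, best_l, best_T = min(candidates)
print(f"best candidate: lateral offset {best_l:+.2f} m over {best_T:.1f} s")
```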

Sampling-based methods have inspired numerous ML papers, including CoverNet, Lift-Splat-Shoot, NMP, and MP3. These methods replace mathematically sound quintic polynomials with human driving behavior, utilizing a large database. The evaluation of trajectories can be easily parallelized, which further supports the use of sampling-based methods. This approach effectively leverages a vast amount of expert demonstrations to mimic human-like driving behavior, while avoiding random sampling of acceleration and steering profiles.

Sampling from human-driving data for data-driven planning methods (source: NMP, CoverNet and Lift-splat-shoot)

Optimization

Optimization finds the best solution to a problem by maximizing or minimizing a specific objective function under given constraints. In neural network training, a similar principle is followed using gradient descent and backpropagation to adjust the network’s weights. However, in optimization tasks outside of neural networks, models are usually less complex, and more effective methods than gradient descent are often employed. For example, while gradient descent can be applied to Quadratic Programming, it is generally not the most efficient method.

In autonomous driving, the planning cost to optimize typically considers dynamic objects for obstacle avoidance, static road structures for following lanes, navigation information to ensure the correct route, and ego status to evaluate smoothness.

Optimization can be categorized into convex and non-convex types. The key distinction is that in a convex optimization scenario, there is only one global optimum, which is also the local optimum. This characteristic makes it unaffected by the initial solution to the optimization problems. For non-convex optimization, the initial solution matters a lot, as illustrated in the chart below.

Convex vs non-convex optimization (source: Stanford course materials)

Since planning involves highly non-convex optimization with many local optima, it heavily depends on the initial solution. Additionally, convex optimization typically runs much faster and is therefore preferred for onboard real-time applications such as autonomous driving. A typical approach is to use convex optimization in conjunction with other methods to outline a convex solution space first. This is the mathematical foundation behind separating behavior planning and motion planning, where finding a good initial solution is the role of behavior planning.

Take obstacle avoidance as a concrete example, which typically introduces non-convex problems. If we know the nudging direction, then it becomes a convex optimization problem, with the obstacle position acting as a lower or upper bound constraint for the optimization problem. If we don’t know the nudging direction, we need to decide first which direction to nudge, making the problem a convex one for motion planning to solve. This nudging direction decision falls under behavior planning.

Of course, we can do direct optimization of non-convex optimization problems with tools such as projected gradient descent, alternating minimization, particle swarm optimization (PSO), and genetic algorithms. However, this is beyond the scope of this post.

A convex path planning problem vs a non-convex one (chart made by author)
The solution process of the convex vs non-convex path planning problem (chart made by author)

How do we make such decisions? We can use the aforementioned search or sampling methods to address non-convex problems. Sampling-based methods scatter many options across the parameter space, effectively handling non-convex issues similarly to searching.

You may also question why deciding which direction to nudge from is enough to guarantee the problem space is convex. To explain this, we need to discuss topology. In path space, similar feasible paths can transform continuously into each other without obstacle interference. These similar paths, grouped as “homotopy classes” in the formal language of topology, can all be explored using a single initial solution homotopic to them. All these paths form a driving corridor, illustrated as the red or green shaded area in the image above. For a 3D spatiotemporal case, please refer to the QCraft tech blog.

We can utilize the Generalized Voronoi diagram to enumerate all homotopy classes, which roughly corresponds to the different decision paths available to us. However, this topic delves into advanced mathematical concepts that are beyond the scope of this blog post.

The key to solving optimization problems efficiently lies in the capabilities of the optimization solver. Typically, a solver requires approximately 10 milliseconds to plan a trajectory. If we can boost this efficiency by tenfold, it can significantly impact algorithm design. This exact improvement was highlighted during Tesla AI Day 2022. A similar enhancement has occurred in perception systems, transitioning from 2D perception to Bird’s Eye View (BEV) as available computing power scaled up tenfold. With a more efficient optimizer, more options can be calculated and evaluated, thereby reducing the importance of the decision-making process. However, engineering an efficient optimization solver demands substantial engineering resources.

Every time compute scales up by 10x, algorithms will evolve to the next generation.
— The unverified law of algorithm evolution

A key differentiator in various planning systems is whether they are spatiotemporally decoupled. Concretely, spatiotemporally decoupled methods plan in spatial dimensions first to generate a path, and then plan the speed profile along this path. This approach is also known as path-speed decoupling.

Path-speed decoupling is often referred to as lateral-longitudinal (lat-long) decoupling, where lateral (lat) planning corresponds to path planning and longitudinal (long) planning corresponds to speed planning. This terminology seems to originate from the Frenet coordinate system, which we introduced earlier.

Decoupled solutions are easier to implement and can solve about 95% of issues. In contrast, coupled solutions have a higher theoretical performance ceiling but are more challenging to implement. They involve more parameters to tune and require a more principled approach to parameter tuning.

The comparison of decoupled and joint planning (source: made by the author, inspired by Qcraft)
Pros and cons of decoupled vs joint spatiotemporal planning (chart made by author)

Path-speed decoupled planning

We can take Baidu Apollo EM planner as an example of a system that uses path-speed decoupled planning.

The EM planner significantly reduces computational complexity by transforming a three-dimensional station-lateral-speed problem into two two-dimensional problems: station-lateral and station-speed. At the core of Apollo’s EM planner is an iterative Expectation-Maximization (EM) step, consisting of path optimization and speed optimization. Each step is divided into an E-step (projection and formulation in a 2D state space) and an M-step (optimization in the 2D state space). The E-step involves projecting the 3D problem into either a Frenet SL frame or an ST speed tracking frame.

The EM iteration in Apollo EM planner (source: Baidu Apollo EM planner )

The M-step (maximization step) in both path and speed optimization involves solving non-convex optimization problems. For path optimization, this means deciding whether to nudge an object on the left or right side, while for speed optimization, it involves deciding whether to overtake or yield to a dynamic object crossing the path. The Apollo EM planner addresses these non-convex optimization challenges using a two-step process: Dynamic Programming (DP) followed by Quadratic Programming (QP).

DP uses a sampling or searching algorithm to generate a rough initial solution, effectively pruning the non-convex space into a convex space. QP then takes the coarse DP results as input and optimizes them within the convex space provided by DP. In essence, DP focuses on feasibility, and QP refines the solution to achieve optimality within the convex constraints.

In our defined terminology, Path DP corresponds to lateral BP, Path QP to lateral MP, Speed DP to longitudinal BP, and Speed QP to longitudinal MP. Thus, the process involves conducting BP (behavior planning) followed by MP (motion planning) in both the path and speed steps.

A full autonomous driving stack with path-speed decoupled planning (chart made by author)

Joint spatiotemporal planning

Although decoupled planning can resolve 95% of cases in autonomous driving, the remaining 5% involve challenging dynamic interactions where a decoupled solution often results in suboptimal trajectories. In these complex scenarios, demonstrating intelligence is crucial, making it a very hot topic in the field.

For example, in narrow-space passing, the optimal behavior might be to either decelerate to yield or accelerate to pass. Such behaviors are not achievable within the decoupled solution space and require joint optimization. Joint optimization allows for a more integrated approach, considering both path and speed simultaneously to handle intricate dynamic interactions effectively.

A full autonomous driving stack with joint spatiotemporal planning (chart made by author)

However, there are significant challenges in joint spatiotemporal planning. Firstly, solving the non-convex problem directly in a higher-dimensional state space is more challenging and time-consuming than using a decoupled solution. Secondly, considering interactions in spatiotemporal joint planning is even more complex. We will cover this topic in more detail later when we discuss decision-making.

Here we introduce two solving methods: brute force search and constructing a spatiotemporal corridor for optimization.

Brute force search occurs directly in 3D spatiotemporal space (2D in space and 1D in time), and can be performed in either XYT (Cartesian) or SLT (Frenet) coordinates. We will take SLT as an example. SLT space is long and flat, similar to an energy bar. It is elongated in the L dimension and flat in the ST face. For brute force search, we can use hybrid A-star, with the cost being a combination of progress cost and cost to go. During optimization, we must conform to search constraints that prevent reversing in both the s and t dimensions.

Overtake by lane change in spatiotemporal lattice (source: Spatiotemporal optimization with A*)

Another method is constructing a spatiotemporal corridor, essentially a curve with the footprint of a car winding through a 3D spatiotemporal state space (SLT, for example). The SSC (spatiotemporal semantic corridor, RAL 2019) encodes requirements given by semantic elements into a semantic corridor, generating a safe trajectory accordingly. The semantic corridor consists of a series of mutually connected, collision-free cubes with dynamical constraints posed by the semantic elements in the spatiotemporal domain. Within each cube, the problem becomes a convex optimization problem that can be solved using Quadratic Programming (QP).

SSC still requires a BP (behavior planning) module to provide a coarse driving trajectory. Complex semantic elements of the environment are projected into the spatiotemporal domain with respect to the reference lane. EPSILON (TRO 2021) showcases a system where SSC serves as the motion planner working in tandem with a behavior planner. In the next section, we will discuss behavior planning, focusing especially on interaction. In this context, behavior planning is usually referred to as decision making.

An illustration of the spatiotemporal corridor (source: SSC)

What and why?

Decision making in autonomous driving is essentially behavior planning, but with a focus on interaction with other traffic agents. The assumption is that other agents are mostly rational and will respond to our behavior in a predictable manner, which we can describe as “noisily rational.”

People may question the necessity of decision making when advanced planning tools are available. However, two key aspects — uncertainty and interaction — introduce a probabilistic nature to the environment, primarily due to the presence of dynamic objects. Interaction is the most challenging part of autonomous driving, distinguishing it from general robotics. Autonomous vehicles must not only navigate but also anticipate and react to the behavior of other agents, making robust decision-making essential for safety and efficiency.

In a deterministic (purely geometric) world without interaction, decision making would be unnecessary, and planning through searching, sampling, and optimization would suffice. Brute force searching in the 3D XYT space could serve as a general solution.

In most classical autonomous driving stacks, a prediction-then-plan approach is adopted, assuming zero-order interaction between the ego vehicle and other vehicles. This approach treats prediction outputs as deterministic, requiring the ego vehicle to react accordingly. This leads to overly conservative behavior, exemplified by the “freezing robot” problem. In such cases, prediction fills the entire spatiotemporal space, preventing actions like lane changes in crowded conditions — something humans manage more effectively.

To handle stochastic strategies, Markov Decision Processes (MDP) or Partially Observable Markov Decision Processes (POMDP) frameworks are essential. These approaches shift the focus from geometry to probability, addressing chaotic uncertainty. By assuming that traffic agents behave rationally or at least noisily rationally, decision making can help create a safe driving corridor in the otherwise chaotic spatiotemporal space.

Among the three overarching goals of planning — safety, comfort, and efficiency — decision making primarily enhances efficiency. Conservative actions can maximize safety and comfort, but effective negotiation with other road agents, achievable through decision making, is essential for optimal efficiency. Effective decision making also displays intelligence.

MDP and POMDP

We will first introduce Markov Decision Processes (MDP) and Partially Observable Markov Decision Processes (POMDP), followed by their systematic solutions, such as value iteration and policy iteration.

A Markov Process (MP) is a type of stochastic process that deals with dynamic random phenomena, unlike static probability. In a Markov Process, the future state depends only on the current state, making the current state sufficient for prediction. For autonomous driving, the relevant state may only need to include roughly the last second of data; expanding the state definition in this way allows the Markov assumption to hold with a short history window.

A Markov Decision Process (MDP) extends a Markov Process to include decision-making by introducing action. MDPs model decision-making where outcomes are partly random and partly controlled by the decision maker or agent. An MDP can be modeled with five factors:

  1. State (S): The state of the environment.
  2. Action (A): The actions the agent can take to affect the environment.
  3. Reward (R): The reward the environment provides to the agent as a result of the action.
  4. Transition Probability (P): The probability of transitioning from the old state to a new state upon the agent’s action.
  5. Gamma (γ): A discount factor for future rewards.

This is also the common framework used by reinforcement learning (RL), which is essentially an MDP. The goal of MDP or RL is to maximize the cumulative reward received in the long run. This requires the agent to make good decisions given a state from the environment, according to a policy.

A policy, π, is a mapping from each state, s ∈ S, and action, a ∈ A(s), to the probability π(a|s) of taking action a when in state s. MDP or RL studies the problem of how to derive the optimal policy.

The agent-environment interface in MDP and RL (source: Reinforcement Learning: An Introduction)

A Partially Observable Markov Decision Process (POMDP) adds an extra layer of complexity by recognizing that states cannot be directly observed but rather inferred through observations. In a POMDP, the agent maintains a belief — a probability distribution over possible states — to estimate the state of the environment. Autonomous driving scenarios are better represented by POMDPs due to their inherent uncertainties and the partial observability of the environment. An MDP can be considered a special case of a POMDP where the observation perfectly reveals the state.

MDP vs POMDP (source: POMDPs as stochastic contingent planning)

POMDPs can actively collect information, leading to actions that gather necessary data, demonstrating the intelligent behavior of these models. This capability is particularly valuable in scenarios like waiting at intersections, where gathering information about other vehicles’ intentions and the state of the traffic light is crucial for making safe and efficient decisions.

Value iteration and policy iteration are systematic methods for solving MDP or POMDP problems. While these methods are not commonly used in real-world applications due to their complexity, understanding them provides insight into exact solutions and how they can be simplified in practice, such as using MCTS in AlphaGo or MPDM in autonomous driving.

To find the best policy in an MDP, we must assess the potential or expected reward from a state, or more specifically, from an action taken in that state. This expected reward includes not just the immediate reward but also all future rewards, formally known as the return or cumulative discounted reward. (For a deeper understanding, refer to “Reinforcement Learning: An Introduction,” often considered the definitive guide on the subject.)

The value function (V) characterizes the quality of states by summing the expected returns. The action-value function (Q) assesses the quality of actions for a given state. Both functions are defined according to a given policy. The Bellman Optimality Equation states that an optimal policy will choose the action that maximizes the immediate reward plus the expected future rewards from the resulting new states. In simple terms, the Bellman Optimality Equation advises considering both the immediate reward and the future consequences of an action. For example, when switching jobs, consider not only the immediate pay raise (R) but also the future value (S’) the new position offers.

Bellman’s equation of optimality (chart made by author)

It is relatively straightforward to extract the optimal policy from the Bellman Optimality Equation once the optimal value function is available. But how do we find this optimal value function? This is where value iteration comes to the rescue.

Extract best policy from optimal values (chart made by author)

Value iteration finds the best policy by repeatedly updating the value of each state until it stabilizes. This process is derived by turning the Bellman Optimality Equation into an update rule. Essentially, we use the optimal future picture to guide the iteration toward it. In plain language, “fake it until you make it!”

Update value functions under the guidance of Bellman’s Equation (chart made by author)

Value iteration is guaranteed to converge for finite state spaces, regardless of the initial values assigned to the states (for a detailed proof, please refer to the Bible of RL). If the discount factor gamma is set to 0, meaning we only consider immediate rewards, the value iteration will converge after just one iteration. A smaller gamma leads to faster convergence because the horizon of consideration is shorter, though it may not always be the best option for solving concrete problems. Balancing the discount factor is a key aspect of engineering practice.

One might ask how this works if all states are initialized to zero. The immediate reward in the Bellman Equation is crucial for bringing in additional information and breaking the initial symmetry. Think about the states that immediately lead to the goal state; their value propagates through the state space like a virus. In plain language, it’s about making small wins, frequently.
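A tiny, self-contained sketch of value iteration on a toy one-dimensional grid world shows both effects described above: values propagate backward from the goal, and a smaller discount factor shortens the effective horizon. The environment itself is made up purely for illustration.

```python
import numpy as np

n_states, gamma = 5, 0.9          # states 0..4; state 4 is an absorbing goal
V = np.zeros(n_states)

def step(s, a):
    """Deterministic transition: move left (-1) or right (+1), reward 1 on reaching the goal."""
    if s == n_states - 1:                      # goal is absorbing, no further reward
        return s, 0.0
    s_next = int(np.clip(s + a, 0, n_states - 1))
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

for _ in range(100):                           # repeat the Bellman optimality update
    V_new = np.array([max(r + gamma * V[sn]
                          for sn, r in (step(s, a) for a in (-1, +1)))
                      for s in range(n_states)])
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new

print(np.round(V, 3))   # e.g. [0.729 0.81 0.9 1.0 0.0] -- values grow toward the goal
```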

Value and policy functions interact until they converge to optimum together (source: Reinforcement Learning: An Introduction)

However, value iteration also suffers from inefficiency. It requires taking the optimal action at each iteration by considering all possible actions, similar to Dijkstra’s algorithm. While it demonstrates feasibility as a basic approach, it is typically not practical for real-world applications.

The contrast of Bellman Equation and Bellman Optimality Equation (chart made by author)

Policy iteration improves on this by taking actions according to the current policy and updating it based on the Bellman Equation (not the Bellman Optimality Equation). Policy iteration decouples policy evaluation from policy improvement, making it a much faster solution. Each step is taken based on a given policy instead of exploring all possible actions to find the one that maximizes the objective. Although each iteration of policy iteration can be more computationally intensive due to the policy evaluation step, it generally results in a faster convergence overall.

In simple terms, if you can only fully evaluate the consequence of one action, it’s better to use your own judgment and do your best with the current information available.

AlphaGo and MCTS — when nets meet trees

We have all heard the unbelievable story of AlphaGo beating the best human player in 2016. AlphaGo formulates the gameplay of Go as an MDP and solves it with Monte Carlo Tree Search (MCTS). But why not use value iteration or policy iteration?

Value iteration and policy iteration are systematic, iterative methods that solve MDP problems. However, even with improved policy iteration, it still requires performing time-consuming operations to update the value of every state. A standard 19×19 Go board has roughly 2e170 possible states. This vast number of states makes it intractable to solve with traditional value iteration or policy iteration techniques.

AlphaGo and its successors use a Monte Carlo tree search (MCTS) algorithm to find their moves, guided by a value network and a policy network, trained on both human and computer play. Let’s take a look at vanilla MCTS first.

The four steps of MCTS by AlphaGo, combining both value network and policy network (source: AlphaGo, Nature 2016)

Monte Carlo Tree Search (MCTS) is a method for policy estimation that focuses on decision-making from the current state. One iteration involves a four-step process: selection, expansion, simulation (or evaluation), and backup.

  1. Selection: The algorithm follows the most promising path based on previous simulations until it reaches a leaf node, a position not yet fully explored.
  2. Expansion: One or more child nodes are added to represent possible next moves from the leaf node.
  3. Simulation (Evaluation): The algorithm plays out a random game from the new node until the end, known as a “rollout.” This assesses the potential outcome from the expanded node by simulating random moves until a terminal state is reached.
  4. Backup: The algorithm updates the values of the nodes on the path taken based on the game’s result. If the outcome is a win, the value of the nodes increases; if it is a loss, the value decreases. This process propagates the result of the rollout back up the tree, refining the policy based on simulated outcomes.

After a given number of iterations, MCTS provides the percentage frequency with which immediate actions were selected from the root during simulations. During inference, the action with the most visits is selected. Here is an interactive illustration of MCTS using the game of tic-tac-toe for simplicity.
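For reference, the selection step in vanilla MCTS commonly uses the UCB1/UCT rule sketched below, which trades off the average value of a child against how rarely it has been visited (AlphaGo uses a PUCT variant that additionally weighs the policy network's prior).

```python
import math

def uct_score(child_value_sum: float, child_visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCT score: exploit the child's average value, explore rarely visited children."""
    if child_visits == 0:
        return float("inf")                    # always try unvisited children first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# During selection, the child with the highest UCT score is followed at each level of the tree.
```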

MCTS in AlphaGo is enhanced by two neural networks. Value Network evaluates the winning rate from a given state (board configuration). Policy Network evaluates the action distribution for all possible moves. These neural networks improve MCTS by reducing the effective depth and breadth of the search tree. The policy network helps in sampling actions, focusing the search on promising moves, while the value network provides a more accurate evaluation of positions, reducing the need for extensive rollouts. This combination allows AlphaGo to perform efficient and effective searches in the vast state space of Go.

The policy network and value network of AlphaGo (source: AlphaGo, Nature 2016)

In the expansion step, the policy network samples the most likely positions, effectively pruning the breadth of the search space. In the evaluation step, the value network provides an instinctive scoring of the position, while a faster, lightweight policy network performs rollouts until the game ends to collect rewards. MCTS then uses a weighted sum of the evaluations from both networks to make the final assessment.

Note that a single evaluation of the value network approaches the accuracy of Monte Carlo rollouts using the RL policy network but with 15,000 times less computation. This mirrors the fast-slow system design, akin to intuition versus reasoning, or System 1 versus System 2 as described by Nobel laureate Daniel Kahneman. Similar designs can be observed in more recent works, such as DriveVLM.

To be exact, AlphaGo incorporates two slow-fast systems at different levels. On the macro level, the policy network selects moves while the faster rollout policy network evaluates these moves. On the micro level, the faster rollout policy network can be approximated by a value network that directly predicts the winning rate of board positions.

What can we learn from AlphaGo for autonomous driving? AlphaGo demonstrates the importance of extracting an excellent policy using a robust world model (simulation). Similarly, autonomous driving requires a highly accurate simulation to effectively leverage algorithms similar to those used by AlphaGo. This approach underscores the value of combining strong policy networks with detailed, precise simulations to enhance decision-making and optimize performance in complex, dynamic environments.

In the game of Go, all states are immediately available to both players, making it a perfect information game where observation equals state. This allows the game to be characterized by an MDP process. In contrast, autonomous driving is a POMDP process, as the states can only be estimated through observation.

POMDPs connect perception and planning in a principled way. The typical solution for a POMDP is similar to that for an MDP, with a limited lookahead. However, the main challenges lie in the curse of dimensionality (explosion in state space) and the complex interactions with other agents. To make real-time progress tractable, domain-specific assumptions are typically made to simplify the POMDP problem.
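To ground the terminology, below is a minimal discrete belief-update (Bayes filter) sketch, the core operation that distinguishes a POMDP from an MDP. The transition and observation models here are hypothetical placeholders; real driving stacks work with far richer state representations.

```python
import numpy as np

# Minimal discrete belief update (Bayes filter) for a POMDP, as a sketch only.
# T[a] is an (N, N) matrix with T[a][s, s'] = P(s' | s, a); O[a] is an (N, M)
# matrix with O[a][s', o] = P(o | s', a). Both are hypothetical placeholders.

def belief_update(belief, action, observation, T, O):
    # Prediction: propagate the belief through the transition model.
    predicted = T[action].T @ belief
    # Correction: weight each state by the likelihood of the received observation.
    corrected = O[action][:, observation] * predicted
    return corrected / corrected.sum()
```

Planning then evaluates candidate actions against this belief (a distribution over states) with a limited lookahead, rather than against a single, fully observed state as in an MDP.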

MPDM (together with its two follow-up works and the accompanying white paper) is a pioneering study in this direction. MPDM reduces the POMDP to a closed-loop forward simulation of a finite, discrete set of semantic-level policies, rather than evaluating every possible control input for every vehicle. This approach addresses the curse of dimensionality by focusing on a manageable number of meaningful policies, allowing for effective real-time decision-making in autonomous driving scenarios.

Semantic actions help control the curse of dimensionality (source: EPSILON)

The assumptions of MPDM are twofold. First, much of the decision-making by human drivers involves discrete high-level semantic actions (e.g., slowing, accelerating, lane-changing, stopping). These actions are referred to as policies in this context. The second implicit assumption concerns other agents: other vehicles will make reasonably safe decisions. Once a vehicle’s policy is decided, its action (trajectory) is determined.

The framework of MPDM (chart created by author)

MPDM first selects one policy for the ego vehicle from many options (hence the “multi-policy” in its name) and one policy for each nearby agent based on its respective prediction. It then performs forward simulation (similar to a fast rollout in MCTS). The best interaction scenario after evaluation is passed on to motion planning, such as the Spatiotemporal Semantic Corridor (SSC) mentioned in the joint spatiotemporal planning section.
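The core loop can be sketched as follows. This is a simplified, hypothetical rendering of the idea, not MPDM's implementation: the policy list, `predict_agent_policy`, `forward_simulate`, and `score` are stand-ins for the semantic policies, intent prediction, closed-loop rollout, and scenario evaluation described in the paper.

```python
# Sketch of an MPDM-style policy selection loop, not the original implementation.
# EGO_POLICIES, predict_agent_policy, forward_simulate, and score are hypothetical
# stand-ins for the semantic policies, intent prediction, closed-loop rollout,
# and scenario evaluation used in MPDM.

EGO_POLICIES = ["keep_lane", "lane_change_left", "lane_change_right", "yield", "accelerate"]

def select_ego_policy(ego_state, agents, predict_agent_policy, forward_simulate, score):
    # One semantic policy per nearby agent, based on its most likely intent.
    agent_policies = {agent.id: predict_agent_policy(agent) for agent in agents}
    best_policy, best_cost = None, float("inf")
    for ego_policy in EGO_POLICIES:
        # Closed-loop forward simulation of all vehicles over the decision horizon.
        scenario = forward_simulate(ego_state, ego_policy, agents, agent_policies, horizon_s=10.0)
        cost = score(scenario)  # safety, comfort, and progress of the interaction
        if cost < best_cost:
            best_policy, best_cost = ego_policy, cost
    # The winning policy's simulated trajectory is handed to motion planning.
    return best_policy
```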

MPDM enables intelligent and human-like behavior, such as actively cutting into dense traffic flow even when no sufficient gap is present. This is not possible with a predict-then-plan pipeline, which does not explicitly consider interactions. In MPDM, the prediction module is tightly integrated with the behavior planning module through forward simulation.

MPDM assumes a single policy throughout the decision horizon (10 seconds). Essentially, MPDM adopts an MCTS-like approach that is only one layer deep but very wide, considering all possible agent predictions. This leaves room for improvement and has inspired many follow-up works such as EUDM, EPSILON, and MARC. For example, EUDM considers more flexible ego policies and builds a policy tree with a depth of four, with each policy covering a time duration of 2 seconds over an 8-second decision horizon. To compensate for the extra computation induced by the increased tree depth, EUDM performs more efficient width pruning via guided branching, identifying critical scenarios and key vehicles. This approach explores a more balanced policy tree.

The forward simulation in MPDM and EUDM uses very simplistic driver models (IDM for longitudinal simulation and pure pursuit for lateral simulation). MPDM points out that high-fidelity realism matters less than the closed-loop nature itself, as long as policy-level decisions are not affected by low-level action execution inaccuracies.
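As a concrete example of such a simple driver model, here is a standard Intelligent Driver Model (IDM) step that could serve as the longitudinal part of a forward simulation. The parameter values are typical textbook defaults, not the ones used by MPDM or EUDM.

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v_desired=15.0,  # desired speed [m/s]
                     t_headway=1.5,   # desired time headway [s]
                     s0=2.0,          # minimum gap [m]
                     a_max=1.5,       # maximum acceleration [m/s^2]
                     b_comf=2.0):     # comfortable deceleration [m/s^2]
    """Standard Intelligent Driver Model (IDM) acceleration for car following."""
    dv = v - v_lead
    s_star = s0 + v * t_headway + v * dv / (2.0 * math.sqrt(a_max * b_comf))
    return a_max * (1.0 - (v / v_desired) ** 4 - (max(s_star, 0.0) / max(gap, 0.1)) ** 2)

def rollout_following(v0, v_lead, gap0, dt=0.1, horizon_s=10.0):
    """Roll the ego vehicle forward behind a constant-speed leader."""
    v, gap, t = v0, gap0, 0.0
    while t < horizon_s:
        a = idm_acceleration(v, v_lead, gap)
        v = max(v + a * dt, 0.0)
        gap += (v_lead - v) * dt
        t += dt
    return v, gap
```

Pure pursuit plays the analogous role laterally: a simple geometric controller that steers toward a lookahead point on the target path.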

The conceptual diagram of decision-making, where prediction, BP, and MP integrate tightly (chart created by author)

Contingency planning in the context of autonomous driving involves generating multiple potential trajectories to account for various possible future scenarios. A key motivating example is that experienced drivers anticipate multiple future scenarios and always plan for a safe backup plan. This anticipatory approach leads to a smoother driving experience, even when cars perform sudden cut-ins into the ego lane.

A critical aspect of contingency planning is deferring the decision bifurcation point. This means delaying the point at which different potential trajectories diverge, allowing the ego vehicle more time to gather information and respond to different outcomes. By doing so, the vehicle can make more informed decisions, resulting in smoother and more confident driving behaviors, similar to those of an experienced driver.
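A rough sketch of this idea under stated assumptions: pick the latest bifurcation time for which every anticipated scenario still admits a feasible contingency branch. The helper functions and data structures below are hypothetical, not taken from MARC or any specific system.

```python
# Hypothetical sketch of contingency planning with a deferred bifurcation point.
# plan_shared_segment, plan_branch, and is_feasible are assumed helper functions,
# and scenarios is a list of predicted futures (e.g., the cut-in happens or not).

def contingency_plan(ego_state, scenarios, candidate_branch_times,
                     plan_shared_segment, plan_branch, is_feasible):
    """Pick the latest bifurcation time for which every scenario has a feasible branch."""
    for t_branch in sorted(candidate_branch_times, reverse=True):
        # A single shared action segment is executed up to the branch time...
        trunk = plan_shared_segment(ego_state, scenarios, t_branch)
        if trunk is None:
            continue
        # ...after which one contingency branch is planned per scenario.
        branches = [plan_branch(trunk, scenario) for scenario in scenarios]
        if all(is_feasible(branch) for branch in branches):
            return trunk, branches  # defer the decision as long as safely possible
    return None  # fall back to a conservative plan committed to the worst case
```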

Risk-aware contingency planning (source: MARC, RAL 2023)

One possible drawback of MPDM and all its follow-up works is their reliance on simple policies designed for highway-like structured environments, such as lane keeping and lane changing. This may limit the capability of forward simulation to handle complex interactions. Following the example of MPDM, the key to making POMDPs more effective is to simplify the action and state space through the growth of a high-level policy tree. It might be possible to create a more flexible policy tree, for example, by assigning spatiotemporal relative position tags to all relevant objects and then performing guided branching.

Decision-making remains a hot topic in current research. Even classical optimization methods have not been fully explored yet. Machine learning methods could shine and have a disruptive impact, especially with the advent of Large Language Models (LLMs), empowered by techniques like Chain of Thought (CoT) or Monte Carlo Tree Search (MCTS).

Trees

Trees are a systematic way to perform decision-making. Tesla's AI Day presentations in 2021 and 2022 showcased its decision-making stack, heavily influenced by AlphaGo and the subsequent MuZero, to address highly complex interactions.

At a high level, Tesla's approach performs behavior planning (decision-making) followed by motion planning: it first searches for a convex corridor and then feeds it into continuous optimization, using spatiotemporal joint planning. This approach effectively addresses scenarios such as narrow passing, a typical bottleneck for path-speed decoupled planning.

Neural network heuristics guided MCTS (source: Tesla AI Day 2021)

Tesla also adopts a hybrid system that combines data-driven and physics-based checks. Starting with defined goals, Tesla’s system generates seed trajectories and evaluates key scenarios. It then branches out to create more scenario variants, such as asserting or yielding to a traffic agent. Such an interaction search over the policy tree is showcased in the presentations of the years 2021 and 2022.

One highlight of Tesla's use of machine learning is the acceleration of tree search via trajectory optimization. For each node, Tesla combines physics-based optimization with a neural planner, reducing the per-node planning time from roughly 10 ms to about 100 µs, a 10x to 100x improvement. The neural network is trained with expert demonstrations and offline optimizers.

Trajectory scoring is performed by combining classical physics-based checks (such as collision checks and comfort analysis) with neural network evaluators that predict intervention likelihood and rate human-likeness. This scoring helps prune the search space, focusing computation on the most promising outcomes.
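Below is a minimal sketch of how classical checks and learned evaluators could be combined into a single trajectory score. The function names and weights are illustrative assumptions, not Tesla's actual implementation.

```python
# Hypothetical sketch of hybrid trajectory scoring; the function names and
# weights are illustrative assumptions, not Tesla's actual implementation.

def score_trajectory(traj, collision_check, comfort_cost,
                     intervention_net, humanlike_net,
                     w_comfort=1.0, w_intervention=5.0, w_human=2.0):
    # Hard physics-based gate: discard any trajectory that collides.
    if collision_check(traj):
        return float("inf")
    # Soft costs: classical comfort terms plus learned evaluators that predict
    # intervention likelihood and rate human-likeness (higher is better, so subtract).
    return (w_comfort * comfort_cost(traj)
            + w_intervention * intervention_net(traj)
            - w_human * humanlike_net(traj))
```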

While many argue that machine learning should be applied to high-level decision-making, Tesla uses ML fundamentally to accelerate optimization and, consequently, tree search.

The Monte Carlo Tree Search (MCTS) method appears to be an ultimate tool for decision-making. Interestingly, those studying Large Language Models (LLMs) are trying to incorporate MCTS into LLMs, while those working on autonomous driving are attempting to replace MCTS with LLMs.

As of roughly two years ago, Tesla’s technology followed this approach. However, since March 2024, Tesla’s Full Self-Driving (FSD) has switched to a more end-to-end approach, significantly different from their earlier methods.

We can still consider interactions without explicitly growing trees. Ad-hoc logic can be implemented to perform one-order interaction between prediction and planning. Even one-order interaction can already generate good behavior, as demonstrated by TuSimple. MPDM, in its original form, is essentially one-order interaction, but executed in a more principled and extendable way.

Multi-order interaction between prediction and planning (source: TuSimple AI Day, in Chinese, translated by author)

TuSimple has also demonstrated the capability to perform contingency planning, similar to the approach proposed in MARC (though MARC can also accommodate a customized risk preference).

Contingency planning (source: TuSimple AI Day, in Chinese, translated by author)

After learning the basic building blocks of classical planning systems, including behavior planning, motion planning, and the principled way to handle interaction through decision-making, I have been reflecting on potential bottlenecks in the system and how machine learning (ML) and neural networks (NN) may help. I am documenting my thought process here for future reference and for others who may have similar questions. Note that the information in this section may contain personal biases and speculations.

Let’s look at the problem from three different perspectives: in the existing modular pipeline, as an end-to-end (e2e) NN planner, or as an e2e autonomous driving system.

Going back to the drawing board, let’s review the problem formulation of a planning system in autonomous driving. The goal is to obtain a trajectory that ensures safety, comfort, and efficiency in a highly uncertain and interactive environment, all while adhering to real-time engineering constraints onboard the vehicle. These factors are summarized as goals, environments, and constraints in the chart below.

The potentials of NN in planning (chart made by author)

Uncertainty in autonomous driving refers both to uncertainty in perception (observation) and to uncertainty in predicting the long-term behavior of other agents. Planning systems must handle the resulting uncertainty in the future trajectories of other agents; as discussed earlier, a principled decision-making system is an effective way to manage this.

Additionally, a typically overlooked aspect is that planning must tolerate uncertain, imperfect, and sometimes incomplete perception results, especially in the current age of vision-centric and HD-map-less driving. Having a Standard Definition (SD) map onboard as a prior helps alleviate this uncertainty, but it still poses significant challenges to a heavily handcrafted planner. This perception uncertainty was considered a solved problem by Level 4 (L4) autonomous driving companies through the heavy use of lidar and HD maps, but it has resurfaced as the industry moves toward mass-production autonomous driving solutions without these two crutches. An NN planner is more robust to imperfect and incomplete perception results, which is key to mass-production, vision-centric, HD-map-less Advanced Driver Assistance Systems (ADAS).

Interaction should be treated with a principled decision-making system such as Monte Carlo Tree Search (MCTS) or a simplified version of MPDM. The main challenge is dealing with the curse of dimensionality (combinatorial explosion) by growing a balanced policy tree with smart pruning through domain knowledge of autonomous driving. MPDM and its variants, in both academia and industry (e.g., Tesla), provide good examples of how to grow this tree in a balanced way.

NNs can also enhance the real-time performance of planners by speeding up motion planning optimization. This can shift the compute load from CPU to GPU, achieving orders of magnitude speedup. A tenfold increase in optimization speed can fundamentally impact high-level algorithm design, such as MCTS.

Trajectories also need to be more human-like. Human-likeness and takeover predictors can be trained with the vast amount of human driving data available; it is more scalable to grow the compute pool than to maintain a growing army of engineering talent.

The NN-based planning stack can leverage human-driving data more effectively (Chart created by author)

An end-to-end (e2e) neural network (NN) planner still constitutes a modular autonomous driving (AD) design: it accepts structured perception results (and potentially latent features) as input and combines prediction, decision-making, and planning into a single network. Companies such as DeepRoute (2022) and Huawei (2024) claim to use this method. Note that other relevant inputs, such as navigation and ego-vehicle information, are omitted here.

A full autonomous driving stack with an e2e planner (chart made by author)

This e2e planner can be further developed into an end-to-end autonomous driving system that combines both perception and planning. This is what Wayve’s LINGO-2 (2024) and Tesla’s FSDv12 (2024) claim to achieve.

The benefits of this approach are twofold. First, it addresses perception issues. There are many aspects of driving that we cannot easily model explicitly with commonly used perception interfaces. For example, it is quite challenging to handcraft a driving system to nudge around a puddle of water or slow down for dips or potholes. While passing intermediate perception features might help, it may not fundamentally resolve the issue.

Additionally, emergent behavior will likely help resolve corner cases more systematically. The intelligent handling of edge cases, such as the examples above, may result from the emergent behavior of large models.

A full autonomous driving stack with a one-model e2e driver (chart made by author)

My speculation is that, in its ultimate form, the end-to-end (e2e) driver would be a large vision and action-native multimodal model enhanced by Monte Carlo Tree Search (MCTS), assuming no computational constraints.

A world model in autonomous driving, per the 2024 consensus, is typically a multimodal model covering at least the vision and action modalities (a VA model). While language can be beneficial for accelerating training, adding controllability, and providing explainability, it is not essential. In its fully developed form, a world model would be a VLA (vision-language-action) model.

There are at least two approaches to developing a world model:

  1. Video-Native Model: Train a model to predict future video frames, conditioned on or outputting accompanying actions, as demonstrated by models like GAIA-1.
  2. Multimodality Adaptors: Start with a pretrained Large Language Model (LLM) and add multimodality adaptors, as seen in models like Lingo-2, RT2, or ApolloFM. These multimodal LLMs are not native to vision or action but require significantly fewer training resources.

A world model can produce a policy itself through the action output, allowing it to drive the vehicle directly. Alternatively, MCTS can query the world model and use its policy outputs to guide the search. This World Model-MCTS approach, while much more computationally intensive, could have a higher ceiling in handling corner cases due to its explicit reasoning logic.
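A minimal sketch of the second usage, MCTS querying a learned world model during node expansion, is shown below. The `world_model.propose_actions` and `world_model.predict` interfaces are assumptions for illustration, not the API of any specific system.

```python
# Hypothetical sketch of MCTS querying a learned world model at a tree node.
# world_model.propose_actions and world_model.predict are assumed interfaces,
# not an API from any specific system.

def expand_with_world_model(node, world_model, k_actions=5):
    # The world model proposes a handful of plausible actions (a policy prior)...
    actions, priors = world_model.propose_actions(node.state, k=k_actions)
    for action, prior in zip(actions, priors):
        # ...and predicts the next (latent) state plus a value estimate for it,
        # replacing explicit rollouts to a terminal state.
        next_state, value = world_model.predict(node.state, action)
        node.add_child(action=action, state=next_state, prior=prior, value=value)
```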

Can we do without prediction?

Most current motion prediction modules represent the future trajectories of agents other than the ego vehicle as one or multiple discrete trajectories. It remains a question whether this prediction-planning interface is sufficient or necessary.

In a classical modular pipeline, prediction is still needed. However, a predict-then-plan pipeline definitely caps the upper limit of autonomous driving systems, as discussed in the decision-making section. A more critical question is how to integrate the prediction module more effectively into the overall autonomous driving stack. Prediction should aid decision-making, and a queryable prediction module within an overall decision-making framework, such as MPDM and its variants, is preferred. There are no severe issues with concrete trajectory predictions as long as they are integrated correctly, for example through policy tree rollouts.

Another issue with prediction is that open-loop Key Performance Indicators (KPIs), such as Average Displacement Error (ADE) and Final Displacement Error (FDE), are not effective metrics as they fail to reflect the impact on planning. Instead, metrics like recall and precision at the intent level should be considered.
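For reference, these open-loop metrics are trivial to compute, which is part of why they remain popular despite failing to capture planning impact; a minimal NumPy version:

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and Final Displacement Error for trajectories of shape (T, 2)."""
    errors = np.linalg.norm(pred - gt, axis=-1)  # per-timestep L2 distance
    return errors.mean(), errors[-1]
```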

In an end-to-end system, an explicit prediction module may not be necessary, but implicit supervision — along with other domain knowledge from a classical stack — can definitely help or at least boost the data efficiency of the learning system. Evaluating the prediction behavior, whether explicit or implicit, will also be helpful in debugging such an e2e system.

Conclusions first: for an assistant, neural networks (nets) can achieve very high, even superhuman, performance. For agents, I believe that using a tree structure is still beneficial (though not necessarily a must).

First of all, trees can boost nets. Trees enhance the performance of a given policy, whether it is NN-based or not. In AlphaGo, even with a policy network trained via supervised learning and reinforcement learning, the overall performance was still inferior to the MCTS-based AlphaGo, which integrates the policy network as one component.

Second, nets can distill trees. In AlphaGo, MCTS used both a value network and the reward from a fast rollout policy network to evaluate a node (state or board position) in the tree. The AlphaGo paper also mentioned that while a value function alone could be used, combining the results of the two yielded the best results. The value network essentially distilled the knowledge from the policy rollout by directly learning the state-value pair. This is akin to how humans distill the logical thinking of the slow System 2 into the fast, intuitive responses of System 1. Daniel Kahneman, in his book “Thinking, Fast and Slow,” describes how a chess master can quickly recognize patterns and make rapid decisions after years of practice, whereas a novice would require significant effort to achieve similar results. Similarly, the value network in AlphaGo was trained to provide a fast evaluation of a given board position.

Grandmaster-Level Chess Without Search (source: DeepMind, 2024)

Recent papers explore the upper limits of this fast system with neural networks. The “chess without search” paper demonstrates that with sufficient data (prepared through tree search using a conventional algorithm), it is possible to achieve grandmaster-level proficiency. There is a clear “scaling law” related to data size and model size, indicating that as the amount of data and the complexity of the model increase, so does the proficiency of the system.

So here we are with a power duo: trees boost nets, and nets distill trees. This positive feedback loop is essentially what AlphaZero uses to bootstrap itself to reach superhuman performance in multiple games.

The same principles apply to the development of large language models (LLMs). For games, since we have clearly defined rewards as wins or losses, we can use forward rollout to determine the value of a certain action or state. For LLMs, the rewards are not as clear-cut as in the game of Go, so we rely on human preferences to rate the models via reinforcement learning with human feedback (RLHF). However, with models like ChatGPT already trained, we can use supervised fine-tuning (SFT), which is essentially imitation learning, to distill smaller yet still powerful models without RLHF.

Returning to the original question, nets can achieve extremely high performance with large quantities of high-quality data. This could be good enough for an assistant, depending on the tolerance for errors, but it may not be sufficient for an autonomous agent. For systems targeting driving assistance (ADAS), nets via imitation learning may be adequate.

Trees can significantly boost the performance of nets with an explicit reasoning loop, making them perhaps more suitable for fully autonomous agents. The extent of the tree or reasoning loop depends on the return on investment of engineering resources. For example, even one order of interaction can provide substantial benefits, as demonstrated in TuSimple AI Day.

From the summary below of the hottest representatives of AI systems, we can see that LLMs are not designed to perform decision-making. In essence, LLMs are trained to complete documents, and even SFT-aligned LLM assistants treat dialogues as a special type of document (completing a dialogue record).

Representative AI products as of 2024 (chart made by author)

I do not fully agree with recent claims that LLMs are slow systems (System 2). They are unnecessarily slow in inference due to hardware constraints, but in their vanilla form, LLMs are fast systems as they cannot perform counterfactual checks. Prompting techniques such as Chain of Thought (CoT) or Tree of Thoughts (ToT) are actually simplified forms of MCTS, making LLMs function more like slower systems.

There is extensive research on integrating full-blown MCTS with LLMs. Specifically, LLM-MCTS (NeurIPS 2023) treats the LLM as a commonsense “world model” and uses LLM-induced policy actions as a heuristic to guide the search. LLM-MCTS outperforms both MCTS alone and policies induced by LLMs by a wide margin on complex, novel tasks. The much-speculated-about Q-star from OpenAI appears to follow the same approach of boosting LLMs with MCTS, as the name suggests.

Below is a rough evolution of the planning stack in autonomous driving. It is rough as the listed solutions are not necessarily more advanced than the ones above, and their debut may not follow the exact chronological order. Nonetheless, we can observe general trends. Note that the listed representative solutions from the industry are based on my interpretation of various press releases and could be subject to error.

One trend is the movement towards a more end-to-end design with more modules consolidated into one. We see the stack evolve from path-speed decoupled planning to joint spatiotemporal planning, and from a predict-then-plan system to a joint prediction and planning system. Another trend is the increasing incorporation of machine learning-based components, especially in the last three stages. These two trends converge towards an end-to-end NN planner (without perception) or even an end-to-end NN driver (with perception).

A rough history of evolution of planning (Chart made by author)
  • ML as a Tool: Machine learning is a tool, not a standalone solution. It can assist with planning even in current modular designs.
  • Full Formulation: Start with a full problem formulation, then make reasonable assumptions to balance performance and resources. This helps create a clear direction for a future-proof system design and allows for improvements as resources increase. Recall the transition from POMDP’s formulation to engineering solutions like AlphaGo’s MCTS and MPDM.
  • Adapting Algorithms: Theoretically elegant algorithms (e.g., Dijkstra and Value Iteration) are great for understanding concepts but need adaptation for practical engineering (Value Iteration is to MCTS as Dijkstra's algorithm is to hybrid A-star).
  • Deterministic vs. Stochastic: Planning excels in resolving deterministic (not necessarily static) scenes. Decision-making in stochastic scenes is the most challenging task toward full autonomy.
  • Contingency Planning: This can help merge multiple futures into a common action. It is beneficial to be aggressive only to the degree that you can always fall back on a backup plan.
  • End-to-end Models: Whether an end-to-end model can solve full autonomy remains unclear. It may still need classical methods like MCTS. Neural networks can handle assistants, while trees can manage agents.



Source link
