14Nov

Writing LLMs in Rust: Looking for an Efficient Matrix Multiplication | by Stefano Bosisio | Nov, 2024


Starting from Karpathy's llm.c, I asked myself: “Could I write this in Rust?” Here are the lessons I learned and how I am writing llm.rust. In this first article, let's tackle the matrix multiplication problem.

Image by GoogleDeepMind on Unsplash

Matrix multiplication may be the most important operation in Machine Learning. I still remember when I was an engineering student, and in one of the first linear algebra lessons the teacher started to explain matrices, eigenvectors, bases and orthonormal bases. I was very confused, and it took me a little while to understand why we were bothering so much with matrices and basis sets, and what a good basis implies for our world. Since then, I have always found linear algebra fascinating and, from a pure computer science point of view, I have admired all those algorithms that try to be more and more efficient in handling matrices.

In particular, we know that the matrix-vector product is pretty simple, but things get more and more complicated when we move to matrix-matrix or tensor-tensor products. For this reason, many methodologies have been implemented to optimize matrix multiplication. For example, a long time ago I posted about DeepMind
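For reference, the baseline that all of these methodologies try to beat is the naive triple-loop algorithm, sketched below in plain Python purely as a language-agnostic illustration (the Rust version is the subject of the article itself).

def matmul_naive(a, b):
    n, k, m = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):                 # rows of A
        for j in range(m):             # columns of B
            for p in range(k):         # O(n^3) multiply-accumulate steps
                c[i][j] += a[i][p] * b[p][j]
    return c

print(matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19.0, 22.0], [43.0, 50.0]]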




13Nov

Summer Fellowship 2024 Wrap Up – What Did Our Fellows Work On?


Summer and Winter Fellowships provide an opportunity for early-career individuals and established professionals new to the field of AI governance to spend three months working on an AI governance research project, deepening their knowledge of the field, and forging connections with other researchers and practitioners. 

Our 2024 Summer Fellows came from a variety of disciplines and a range of prior experience – some fellows ventured into entirely new intellectual territory for their projects, while others used the time to extend their previous work.

We extend our sincere appreciation to all our supervisors for their dedicated mentorship and guidance this summer.

If you’re interested in applying for future fellowships, check out our Opportunities page. You can register your expression of interest here.




13Nov

Building Conversational AI Agents By Integrating Reasoning, Speaking & Acting With LLMs | by Cobus Greyling | Nov, 2024


1. When an agent seeks user guidance to refine its search strategy, it actively involves the user in defining the best approach, improving accuracy by ensuring its search aligns with user expectations.

2. This type of dialogue encourages collaboration, allowing users to clarify ambiguous instructions or adjust the search path as new insights arise.

3. Sharing status updates on task progress is essential for transparency, as it informs users of what the agent has completed and any challenges encountered.

4. Regular updates help users feel informed and give them an opportunity to provide additional instructions if the task requires it.

5. Soliciting user preferences is another valuable dialogue type, where the agent gathers input to shape task outcomes, ensuring decisions align closely with user needs.

6. This approach supports more personalised results, making the task execution feel interactive and responsive to individual preferences.

7. Together, these dialogue types create a flexible, two-way interaction that enhances the quality of task completion by combining automated assistance with user-specific insights.

8. Ultimately, these interactions improve alignment, trust, and satisfaction as the agent works to adapt and optimise its actions based on direct user input.




11Nov

My Medium Journey as a Data Scientist: 6 Months, 18 Articles, and 3,000 Followers | by Yu Dong | Nov, 2024


Real numbers, earnings, and data-driven growth strategy for Medium writers

I started writing data science and AI content on Medium in May 2024. This is my sixth month and I just hit a major milestone — 3,000 followers! I am very proud of my achievements.

In this article, I will share how this journey started, what I have been writing, and what I learned. Plus, as a data scientist, I always enjoy analyzing my own data. I collected a dataset of my Medium stats, including article views👀, reads📖, claps👏, earnings💵, etc. Join me as I break down my Medium experience using data and share my data-driven writing strategies.

Image created by DALL·E

How it all began

My writing habit dates back well before I started writing on Medium. I have been running my data science portfolio site since 2018, back when I started my first full-time job. I post articles there and occasionally share them on LinkedIn. It helps me connect with friends and colleagues in the data domain. Earlier this year, I posted an article about my experimentation with custom GPTs, and it reached nearly 10k impressions on LinkedIn. That's not bad at all, but it…




10Nov

AdaBoost Classifier, Explained: A Visual Guide with Code Examples | by Samy Baladram | Nov, 2024


ENSEMBLE LEARNING

Putting the weight where weak learners need it most

Everyone makes mistakes — even the simplest decision trees in machine learning. Instead of ignoring them, the AdaBoost (Adaptive Boosting) algorithm does something different: it learns (or adapts) from these mistakes to get better.

Unlike Random Forest, which makes many trees at once, AdaBoost starts with a single, simple tree and identifies the instances it misclassifies. It then builds new trees to fix those errors, learning from its mistakes and getting better with each step.

Here, we’ll illustrate exactly how AdaBoost makes its predictions, building strength by combining targeted weak learners just like a workout routine that turns focused exercises into full-body power.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

AdaBoost is an ensemble machine learning model that creates a sequence of weighted decision trees, typically using shallow trees (often just single-level “stumps”). Each tree is trained on the entire dataset, but with adaptive sample weights that give more importance to previously misclassified examples.

For classification tasks, AdaBoost combines the trees through a weighted voting system, where better-performing trees get more influence in the final decision.

The model’s strength comes from its adaptive learning process — while each simple tree might be a “weak learner” that performs only slightly better than random guessing, the weighted combination of trees creates a “strong learner” that progressively focuses on and corrects mistakes.

AdaBoost is part of the boosting family of algorithms because it builds trees one at a time. Each new tree tries to fix the mistakes made by the previous trees. It then uses a weighted vote to combine their answers and make its final prediction.

Throughout this article, we’ll focus on the classic golf dataset as an example for classification.

Columns: ‘Outlook’ (one-hot-encoded into 3 columns), ’Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ (True/False) and ‘Play’ (Yes/No, target feature)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create and prepare dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
                'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
                'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
                    72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
                    88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
                 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
                 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

# Prepare data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]

# Prepare features and target
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

Main Mechanism

Here’s how AdaBoost works:

  1. Initialize Weights: Assign equal weight to each training example.
  2. Iterative Learning: In each step, a simple decision tree is trained and its performance is checked. Misclassified examples get more weight, making them a priority for the next tree. Correctly classified examples stay the same, and all weights are adjusted to add up to 1.
  3. Build Weak Learners: Each new, simple tree targets the mistakes of the previous ones, creating a sequence of specialized weak learners.
  4. Final Prediction: Combine all trees through weighted voting, where each tree’s vote is based on its importance value, giving more influence to more accurate trees.
An AdaBoost Classifier makes predictions by using many simple decision trees (usually 50–100). Each tree, called a “stump,” focuses on one important feature, like temperature or humidity. The final prediction is made by combining all the trees’ votes, each weighted by how important that tree is (“alpha”).

Here, we’ll follow the SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function) algorithm, the standard approach in scikit-learn that handles both binary and multi-class classification.

1.1. Decide the weak learner to be used. A one-level decision tree (or “stump”) is the default choice.
1.2. Decide how many weak learners (in this case, the number of trees) you want to build (the default is 50 trees).

We begin with depth-1 decision trees (stumps) as our weak learners. Each stump makes just one split, and we’ll train 50 of them sequentially, adjusting weights along the way.

1.3. Start by giving each training example equal weight:
· Each sample gets weight = 1/N (N is total number of samples)
· All weights together sum to 1

All data points start with equal weights (0.0714), with the total weight adding up to 1. This ensures every example is equally important when training begins.

For the First Tree

2.1. Build a decision stump while considering sample weights

Before making the first split, the algorithm examines all data points with their weights to find the best splitting point. These weights influence how important each example is in making the split decision.

a. Calculate initial weighted Gini impurity for the root node

The algorithm calculates the Gini impurity score at the root node, but now considers the weights of all data points.

b. For each feature:
· Sort data by feature values (exactly like in Decision Tree classifier)

For each feature, the algorithm sorts the data and identifies potential split points, exactly like the standard Decision Tree.

· For each possible split point:
·· Split samples into left and right groups
·· Calculate weighted Gini impurity for both groups
·· Calculate weighted Gini impurity reduction for this split

The algorithm calculates weighted Gini impurity for each potential split and compares it to the parent node. For feature “sunny” with split point 0.5, this impurity reduction (0.066) shows how much this split improves the data separation.
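To make the mechanics concrete, here is a minimal NumPy sketch (an illustration under assumed names, not the article's actual implementation) of the weighted Gini impurity and the impurity reduction used to score a candidate split; y is the label array, w the sample weights, and left_mask a boolean mask selecting the left group.

import numpy as np

def weighted_gini(y, w):
    # Weighted class proportions instead of simple counts
    y, w = np.asarray(y), np.asarray(w, dtype=float)
    p = np.array([w[y == c].sum() for c in np.unique(y)]) / w.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini_reduction(y, w, left_mask):
    # Parent impurity minus the weight-averaged impurity of the two children
    y, w, left_mask = np.asarray(y), np.asarray(w, dtype=float), np.asarray(left_mask)
    wl, wr = w[left_mask].sum(), w[~left_mask].sum()
    children = (wl * weighted_gini(y[left_mask], w[left_mask])
                + wr * weighted_gini(y[~left_mask], w[~left_mask])) / (wl + wr)
    return weighted_gini(y, w) - children

# Toy usage with made-up labels and weights
y_toy = np.array([1, 1, 0, 0, 1])
w_toy = np.array([0.3, 0.1, 0.2, 0.2, 0.2])
left = np.array([True, True, False, False, False])
print(weighted_gini(y_toy, w_toy), weighted_gini_reduction(y_toy, w_toy, left))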

c. Pick the split that gives the largest Gini impurity reduction

After checking all possible splits across features, the column ‘overcast’ (with split point 0.5) gives the highest impurity reduction of 0.102. This means it’s the most effective way to separate the classes, making it the best choice for the first split.

d. Create a simple one-split tree using this decision

Using the best split point found, the algorithm divides the data into two groups, each keeping their original weights. This simple decision tree is purposely kept small and imperfect, making it just slightly better than random guessing.

2.2. Evaluate how good this tree is
a. Use the tree to predict the label of the training set.
b. Add up the weights of all misclassified samples to get error rate

The first weak learner makes predictions on the training data, and we check where it made mistakes (marked with X). The error rate of 0.357 shows this simple tree gets some predictions wrong, which is expected and will help guide the next steps of training.

c. Calculate tree importance (α) using:
α = learning_rate × log((1-error)/error)

Using the error rate, we calculate the tree’s influence score (α = 0.5878). Higher scores mean more accurate trees, and this tree earned moderate importance for its decent performance.

2.3. Update sample weights
a. Keep the original weights for correctly classified samples
b. Multiply the weights of misclassified samples by e^(α).
c. Divide each weight by the sum of all weights. This normalization ensures all weights still sum to 1 while maintaining their relative proportions.

Cases where the tree made mistakes (marked with X) get higher weights for the next round. After increasing these weights, all weights are normalized to sum to 1, ensuring misclassified examples get more attention in the next tree.
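The arithmetic behind steps 2.2 and 2.3 fits in a few lines. Below is a hedged sketch of one boosting round of binary SAMME; the toy arrays are illustrative stand-ins, not the article's actual training data.

import numpy as np

learning_rate = 1.0
y_true = np.array([1, 0, 1, 1, 0, 1, 0])             # toy labels
y_pred = np.array([1, 0, 0, 1, 1, 1, 0])             # the stump's predictions
w = np.full(len(y_true), 1 / len(y_true))            # step 1.3: equal starting weights

miss = y_pred != y_true                               # samples the stump got wrong
error = w[miss].sum() / w.sum()                       # weighted error rate
alpha = learning_rate * np.log((1 - error) / error)   # tree importance (α)

w = np.where(miss, w * np.exp(alpha), w)              # boost weights of misclassified samples
w = w / w.sum()                                       # renormalize so weights sum to 1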

For the Second Tree

2.1. Build a new stump, but now using the updated weights
a. Calculate new weighted Gini impurity for root node:
· Will be different because misclassified samples now have bigger weights
· Correctly classified samples now have smaller weights

Using the updated weights (where misclassified examples now have higher importance), the algorithm calculates the weighted Gini impurity at the root node. This begins the process of building the second decision tree.

b. For each feature:
· Same process as before, but the weights have changed
c. Pick the split with best weighted Gini impurity reduction
· Often completely different from the first tree’s split
· Focuses on samples the first tree got wrong

With updated weights, different split points show different effectiveness. Notice that “overcast” is no longer the best split — the algorithm now finds temperature (84.0) gives the highest impurity reduction, showing how weight changes affect split selection.

d. Create the second stump

Using temperature ≤ 84.0 as the split point, the algorithm assigns YES/NO to each leaf based on which class has more total weight in that group, not just by counting examples. This weighted voting helps correct the previous tree’s mistakes.

2.2. Evaluate this new tree
a. Calculate error rate with current weights
b. Calculate its importance (α) using the same formula as before
2.3. Update weights again — Same process: increase weights for mistakes then normalize.

The second tree achieves a lower error rate (0.222) and higher importance score (α = 1.253) than the first tree. Like before, misclassified examples get higher weights for the next round.

For the Third Tree onwards

Repeat Step 2.1–2.3 for all remaining trees.

The algorithm builds 50 simple decision trees sequentially, each with its own importance score (α). Each tree learns from previous mistakes by focusing on different aspects of the data, creating a strong combined model. Notice how some trees (like Tree 2) get higher importance scores when they perform better.

Step 3: Final Ensemble
3.1. Keep all trees and their importance scores

The 50 simple decision trees work together as a team, each with its own importance score (α). When making predictions, trees with higher α values (like Tree 2 with 1.253) have more influence on the final decision than trees with lower scores.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Train AdaBoost
np.random.seed(42)  # For reproducibility
clf = AdaBoostClassifier(algorithm='SAMME', n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Create visualizations for trees 1, 2, and 50
trees_to_show = [0, 1, 49]
feature_names = X_train.columns.tolist()
class_names = ['No', 'Yes']

# Set up the plot
fig, axes = plt.subplots(1, 3, figsize=(14, 4), dpi=300)
fig.suptitle('Decision Stumps from AdaBoost', fontsize=16)

# Plot each tree
for idx, tree_idx in enumerate(trees_to_show):
    plot_tree(clf.estimators_[tree_idx],
              feature_names=feature_names,
              class_names=class_names,
              filled=True,
              rounded=True,
              ax=axes[idx],
              fontsize=12)
    axes[idx].set_title(f'Tree {tree_idx + 1}', fontsize=12)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

Each node shows its ‘value’ parameter as [weight_NO, weight_YES], which represents the weighted proportion of each class at that node. These weights come from the sample weights we calculated during training.

Testing Step

For predicting:
a. Get each tree’s prediction
b. Multiply each by its importance score (α)
c. Add them all up
d. The class with higher total weight will be the final prediction

When predicting for new data, each tree makes its prediction and multiplies it by its importance score (α). The final decision comes from adding up all weighted votes — here, the NO class gets a higher total score (23.315 vs 15.440), so the model predicts NO for this unseen example.
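As a sanity check, the weighted vote can be reproduced by hand. The sketch below is an illustration, not library internals; it uses the fitted clf from the code above together with the standard estimators_, estimator_weights_ and classes_ attributes of a fitted AdaBoostClassifier.

import numpy as np

def manual_adaboost_predict(model, X):
    X = np.asarray(X)
    scores = np.zeros((len(X), len(model.classes_)))
    for stump, alpha in zip(model.estimators_, model.estimator_weights_):
        votes = stump.predict(X)                       # each stump votes for one class
        for k, cls in enumerate(model.classes_):
            scores[:, k] += alpha * (votes == cls)     # weight the vote by the tree's α
    return model.classes_[np.argmax(scores, axis=1)]   # class with the higher total weight

print(manual_adaboost_predict(clf, X_test))            # should match clf.predict(X_test) for SAMME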

Evaluation Step

After building all the trees, we can evaluate the test set.

By iteratively training and weighting weak learners to focus on misclassified examples, AdaBoost creates a strong classifier that achieves high accuracy — typically better than single decision trees or simpler models!
# Get predictions
y_pred = clf.predict(X_test)

# Create DataFrame with actual and predicted values
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})
print(results_df)  # Display results DataFrame

# Calculate and display accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

Here are the key parameters for AdaBoost, particularly in scikit-learn:

estimator: This is the base model that AdaBoost uses to build its final solution. The 3 most common weak learners are:
a. Decision Tree with depth 1 (Decision Stump): This is the default and most popular choice. Because it only has one split, it is considered a very weak learner that is just a bit better than random guessing, exactly what is needed for the boosting process.
b. Logistic Regression: Logistic regression (especially with a high penalty) can also be used here, even though it is not really a weak learner. It can be useful for data with a linear relationship.
c. Decision Trees with small depth (e.g., depth 2 or 3): These are slightly more complex than decision stumps. They’re still fairly simple, but can handle slightly more complex patterns than the decision stump.

AdaBoost’s base models can be simple decision stumps (depth=1), small trees (depth 2–3), or penalized linear models. Each type is kept simple to avoid overfitting while offering different ways to capture patterns.
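For illustration, here is how these three choices might be configured in scikit-learn; the parameter values below are assumptions for the sketch, not tuned recommendations.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# a. Decision stump (the default behaviour)
stump_ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                               n_estimators=50, algorithm='SAMME', random_state=42)

# b. Heavily penalized logistic regression (small C = strong regularization)
logit_ada = AdaBoostClassifier(estimator=LogisticRegression(C=0.01, max_iter=1000),
                               n_estimators=50, algorithm='SAMME', random_state=42)

# c. Slightly deeper trees
tree_ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=2),
                              n_estimators=50, algorithm='SAMME', random_state=42)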

n_estimators: The number of weak learners to combine, typically around 50–100. Using more than 100 rarely helps.

learning_rate: Controls how much each classifier affects the final result. Common starting values are 0.1, 0.5, or 1.0. Lower values (like 0.1) combined with a somewhat higher n_estimators usually work better.

Key differences from Random Forest

As both Random Forest and AdaBoost work with multiple trees, it is easy to confuse the parameters involved. The key difference is that Random Forest combines many trees independently (bagging) while AdaBoost builds trees one after another to fix mistakes (boosting). Here are some other details about their differences:

  1. No bootstrap parameter because AdaBoost uses all data but with changing weights
  2. No oob_score because AdaBoost doesn’t use bootstrap sampling
  3. learning_rate becomes crucial (not present in Random Forest)
  4. Tree depth is typically kept very shallow (usually just stumps) unlike Random Forest’s deeper trees
  5. The focus shifts from parallel independent trees to sequential dependent trees, making parameters like n_jobs less relevant

Pros:

  • Adaptive Learning: AdaBoost gets better by giving more weight to mistakes it made. Each new tree pays more attention to the hard cases it got wrong.
  • Resists Overfitting: Even though it keeps adding more trees one by one, AdaBoost usually doesn’t get too focused on training data. This is because it uses weighted voting, so no single tree can control the final answer too much.
  • Built-in Feature Selection: AdaBoost naturally finds which features matter most. Each simple tree picks the most useful feature for that round, which means it automatically selects important features as it trains.

Cons:

  • Sensitive to Noise: Because it gives more weight to mistakes, AdaBoost can have trouble with messy or wrong data. If some training examples have wrong labels, it might focus too much on these bad examples, making the whole model worse.
  • Must Be Sequential: Unlike Random Forest which can train many trees at once, AdaBoost must train one tree at a time because each new tree needs to know how the previous trees did. This makes it slower to train.
  • Learning Rate Sensitivity: While it has fewer settings to tune than Random Forest, the learning rate really affects how well it works. If it’s too high, it might learn the training data too exactly. If it’s too low, it needs many more trees to work well.

AdaBoost is a key boosting algorithm that many newer methods learned from. Its main idea — getting better by focusing on mistakes — has helped shape many modern machine learning tools. While other methods try to be perfect from the start, AdaBoost tries to show that sometimes the best way to solve a problem is to learn from your errors and keep improving.

AdaBoost works best on binary classification problems and when your data is clean. While Random Forest might be better for more general tasks (like predicting numbers) or messy data, AdaBoost can give really good results when used in the right way. The fact that people still use it after so many years shows just how well the core idea works!

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
                'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
                'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
                    72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
                    88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
                 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
                 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split features and target
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Train AdaBoost
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Base estimator: a decision stump
    n_estimators=50,       # Typically fewer trees than Random Forest
    learning_rate=1.0,     # Default learning rate
    algorithm='SAMME',     # The only currently available algorithm (will be removed in future scikit-learn updates)
    random_state=42
)
ada.fit(X_train, y_train)

# Predict and evaluate
y_pred = ada.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")




09Nov

Core AI For Any Rummy Variant. Step by Step guide to a Rummy AI | by Iheb Rachdi | Nov, 2024


Identifying and Collecting key Data

I explored several algorithms to optimize and reduce the search space for all possible combos. However, the fact that each card can appear twice increased the number of potential combos, making it challenging to track and validate each one. While competing on Codeforces, I encountered a problem that reminded me of the ‘island problem,’ which gave me new insight into approaching the hand evaluator system.

We can represent the hand as a 2D grid of size 4×13, where each column represents a rank from 1 to 13 and each row corresponds to one of the 4 suits. Each cell in this grid contains the count of that card in the hand, in our case 0, 1, or 2. This allows us to divide the hand into ‘islands,’ which are defined as groups of connected land cells with counts of 1 or 2, based on the following connectivity rules:

1. Two cells are considered connected if they share a side (left, right, above, or below) in the grid.

2. All cells within the same column are also connected if they each contain at least one card, even if they are not adjacent (above or below).
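A minimal Python sketch of this grid (the function and suit-row names are assumptions, not the author's code) that builds the counts from a hand string like the example below:

SUIT_ROW = {'H': 0, 'D': 1, 'C': 2, 'S': 3}          # one row per suit (order assumed)

def build_grid(hand):
    grid = [[0] * 13 for _ in range(4)]              # 4 suits x ranks 1..13
    for card in hand.split():                        # cards written as '<rank><suit>', e.g. '11C'
        rank, suit = int(card[:-1]), card[-1]
        grid[SUIT_ROW[suit]][rank - 1] += 1          # each cell ends up 0, 1 or 2
    return grid

grid = build_grid("11C 3H 4H 11D 3D 5H 9D 2H 6H 3C 4H 3D 4D 5H 12D 3C")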

Example of ‘hand A’: 11C 3H 4H 11D 3D 5H 9D 2H 6H 3C 4H 3D 4D 5H 12D 3C

Table representation of ‘hand A’

Our first task is to identify and label all distinct islands. Since each island is independent of the others, we can make our lives easier by mapping each island to a class type; let's name it _cardGraph. This class will be responsible for its island in terms of extracting, modifying, or deleting operations.

For clarity, let’s isolate one island and work on it in the upcoming sections, so it’s easier for you to follow. If it helps, you can think of each island as a connected graph, as Shown in the figure below:

in Left: Island Represented in the Table; in Right: Same Island in a Connected Graph Perspective

Now, if you take multiple island examples and try to extract the possible combos, you'll notice that some cards play a unique role in branching out to potential combinations. We'll call these types of cards control points, or Cpts for short, as they play an essential role in reducing the search space significantly, as you will see in the following steps.

Cpts: For a card to be considered a Cpts, it must be in a position where we have to make a choice about which meld (run or set) to append it to. If a card can naturally fit into multiple melds without forcing a choice (for example, a duplicate card with two candidate melds, where each copy can simply go to one of them), it won't be considered a Cpts.

In the case of our island example, the 3 of Hearts is identified as a Cpts. Below are all the melds that the 3 of Hearts could attach to, one at a time.

Our next step is to mark each card that qualifies as a Cpts. To do this, we'll create a 4×13 table of bytes; let's call it _flagMap. For memory efficiency, you can make this a shared table that each _cardGraph instance created from the hand references and uses. In this table, each card in an island will be assigned a bitstream at the corresponding index in _flagMap; this byte represents the card's potential placements in different runs or sets. If a card qualifies as a Cpts, it will also be stored in a stack (which we will need later) that we'll call _cptsStack. Here's a breakdown of the byte structure: the first bit indicates whether the card belongs to a run, the second bit indicates its placement in an additional run, the third bit represents whether it belongs to a set, and the fourth bit specifies whether it belongs to a second set.

Here's an example of a bitstream: 00000111. Here we have:

The first bit (1) means the card can belong to a run.

The second bit (1) means the card can belong to a second run.

The third bit (1) means the card belongs to a set.

The fourth bit (0) means the card doesn’t belong to a second set.

We might have a case where the configuration is 00000101 for a card with a single copy, meaning the card can belong to either a run or a set. Another configuration could be 00000011, meaning the card can belong to two different runs.

To identify a Cpts, simply count the ‘1’s in its bit representation. If this count exceeds the number of copies of that card in the hand, it's considered a Cpts. For instance, if a card appears twice (i.e., has two copies) and its bit representation is 00000101, it's not a Cpts. However, if the bit representation is 00000111, as in the example, then it qualifies as a Cpts.
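In code, the flag byte and the Cpts test described above can be sketched as follows (the constant names are illustrative, not taken from the author's implementation):

IN_RUN, IN_SECOND_RUN, IN_SET, IN_SECOND_SET = 0b0001, 0b0010, 0b0100, 0b1000

def is_cpts(flag_byte, copies):
    # A card is a control point when it has more candidate placements than copies
    return bin(flag_byte).count('1') > copies

print(is_cpts(0b00000101, 2))   # False: two copies, two placements
print(is_cpts(0b00000111, 2))   # True: matches the 3 of Hearts example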

In our island example, here's how the _flagMap table would look:

_FlagMap Representation of the ‘hand A’ Example

Once we’ve populated the _flagMap and identified the cpts, the next task is to decompose the island into horizontal and vertical lines. But why? Breaking down the card graph into these lines simplifies the process of identifying runs and sets, as it allows us to focus on contiguous sequences of cards that can be processed more efficiently. As you might guess, the vertical lines will represent the sets, while the horizontal lines will represent the runs.

Island decomposed into Horizontal and Vertical Lines

We’ll store each horizontal line in a list of a tuple type, where the first item represents the starting index of the line and the last item represents the end index (inclusive). For the vertical lines, it’s sufficient to simply store the column index in a list.

Tip: We can accomplish this task along with the bit representation step in a single loop, achieving O(n) complexity.

Generate Combos

Now, let's take a break and recap: we have identified the control points (Cpts) and stored them in the _cptsStack. We also decomposed the island into vertical and horizontal lines, and populated the _flagMap with each card's bit representation.

With our data in place, what remains is to use it to generate all possible valid combos of the island. But how do we do that? Here’s a simplified approach:

1. Assign Valid Placements for the Control Points (Cpts):
We take the bit representation of a cpts from _flagMap, which indicates all possible placements for that cpts. Then, we look at the number of copies of the cpts in the _cardGraph and adjust its bit representation to a current valid configuration. For example, if the cpts has a bit representation of 00001111 and 2 copies, we can generate all valid placements for it, which is C(4,2) = 6. Possible combinations would be 0011, 0101, 1100, 1010, 1001, and 0110.

2. Using DFS to Configure All Possible Combinations for Each Cpts:
We’ll use a depth-first search (DFS) to iterate over the valid placements for each cpts as shown in step 1. Each node in the DFS tree represents a possible placement for a given cpts, so each unique DFS path represents a valid combo configuration. For each “leaf” node (end of the DFS path), we proceed to the next step.

3. Generating Combos:
In this step, we iterate over the horizontal and vertical lines in the island to identify runs, sets, and a dump list. This is done in two passes for each line, as follows:

  • Pass 1: For a horizontal line, for example, we continuously append cards from [line start to line end] into a list to form a run. We stop adding if (card_bit_representation & 00000001) == 0. If the length of the run is greater than or equal to 3, we add it to the run combo; otherwise, each card goes into the dump list, and we continue trying to form another run until we reach the line end.
  • Pass 2: Repeat the process, this time testing each card's bits against a different pattern (00000010). This allows us to identify possible second runs.

The same approach applies to extracting sets, but we use bit operations with 00000100 and 00001000.

4. Register the Valid Combo and Move to the Next DFS Configuration:
After completing all runs, sets, and dumps for the current combo, we save the combo and then move on to the next DFS configuration to repeat the process. This way, we systematically explore all potential configurations for valid combos.
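Here is a small Python sketch of steps 1 and 3 (pass 1): enumerating the valid placement masks for a Cpts, and extracting runs from one horizontal line with a bit test. It is an illustration under the bit layout above, not the author's actual code.

from itertools import combinations

def placements(flag_byte, copies):
    set_bits = [i for i in range(8) if flag_byte >> i & 1]
    for chosen in combinations(set_bits, copies):           # C(k, copies) configurations
        yield sum(1 << i for i in chosen)

print(sorted(placements(0b00001111, 2)))                    # 6 masks, as in the C(4,2) example

def extract_runs(line_flags, run_bit=0b00000001):
    # line_flags: flag bytes of the consecutive cards on one horizontal line
    runs, dump, current = [], [], []
    for i, flags in enumerate(list(line_flags) + [0]):      # trailing 0 flushes the last run
        if flags & run_bit:
            current.append(i)                               # collect card positions for this run
            continue
        if len(current) >= 3:
            runs.append(current)                            # a valid run of 3+ cards
        else:
            dump.extend(current)                            # too short: cards go to the dump list
        current = []
    return runs, dump

print(extract_runs([0b0001, 0b0001, 0b0001, 0b0000, 0b0001]))   # ([[0, 1, 2]], [4])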

If you coded everything correctly and feed it our island example, “2H3H4H5H4H5H6H3C3C3D3D4D”, it should be decomposed as shown below. Notice that I've added some calculations to each generated combo so that we can get a sense of how the AI will act.

Console Output Showing the Generated Combo For the Island Example

In the next article, I’ll dive into the rest of the system, focusing on the dynamic modification of the hand and the AI strategy. If you’ve followed along so far, it won’t be hard to see how we can optimize adding and removing cards, as well as incorporate the two rules we set aside at the beginning. Stay tuned, and see you next time! “hopefully 😉”.

Unless otherwise noted, all images are created by the author using Lucidchart, GIMP, and Python




06Nov

An Introduction to VLMs: The Future of Computer Vision Models | by Ro Isachenko | Nov, 2024


Building a 28% more accurate multimodal image search engine with VLMs.

Until recently, AI models were narrow in scope and limited to understanding either language or specific images, but rarely both.

In this respect, general language models like GPTs were a HUGE leap since we went from specialized models to general yet much more powerful models.

But even as language models progressed, they remained separate from computer vision areas, each domain advancing in silos without bridging the gap. Imagine what would happen if you could only listen but not see, or vice versa.

My name is Roman Isachenko, and I’m part of the Computer Vision team at Yandex.

In this article, I’ll discuss visual language models (VLMs), which I believe are the future of compound AI systems.

I’ll explain the basics and training process for developing a multimodal neural network for image search and explore the design principles, challenges, and architecture that make it all possible.

Towards the end, I’ll also show you how we used an AI-powered search product to handle images and text and what changed with the introduction of a VLM.

Let’s begin!

What Are VLMs?

LLMs with billions or even hundreds of billions of parameters are no longer a novelty.

We see them everywhere!

The next key focus in LLM research has shifted towards developing multimodal models (omni-models) — models that can understand and process multiple data types.

Multimodal models (Image by Author)

As the name suggests, these models can handle more than just text. They can also analyze images, video, and audio.

But why are we doing this?

Jack of all trades, master of none, oftentimes better than master of one.

In recent years, we’ve seen a trend where general approaches dominate narrow ones.

Think about it.

Today’s language-driven ML models have become relatively advanced and general-purpose. One model can translate, summarize, identify speech tags, and much more.

General NLP model (Image by Author)

But earlier, these models used to be task-specific (we have them now as well, but fewer than before).

  • A dedicated model for translating.
  • A dedicated model for summarizing, etc.

In other words, today’s NLP models (LLMs, specifically) can serve multiple purposes that previously required developing highly specific solutions.

Second, this approach allows us to exponentially scale the data available for model training, which is crucial given the finite amount of text data. Earlier, however, one would need task-specific data:

  • A dedicated translation labeled dataset.
  • A dedicated summarization dataset, etc.

Third, we believe that training a multimodal model can enhance the performance of each data type, just like it does for humans.

For this article, we’ll simplify the “black box” concept to a scenario where the model receives an image and some text (which we call the “instruct”) as input and outputs only text (the response).

As a result, we end up with a much simpler process as shown below:

A simplified multimodal model (Image by Author)

We’ll discuss image-discriminative models that analyze and interpret what an image depicts.

Before delving into the technical details, consider the problems these models can solve.

A few examples are shown below:

Examples of tasks (Image by Author)
  • Top left image: We ask the model to describe the image. This is specified with text.
  • Top mid image: We ask the model to interpret the image.
  • Top right image: We ask the model to interpret the image and tell us what would happen if we followed the sign.
  • Bottom image: This is the most complicated example. We give the model some math problems.

From these examples, you can see that the range of tasks is vast and diverse.

VLMs are a new frontier in computer vision that can solve various fundamental CV-related tasks (classification, detection, description) in zero-shot and one-shot modes.

While VLMs may not excel in every standard task yet, they are advancing quickly.

Now, let’s understand how they work.

VLM Architecture

These models typically have three main components:

Simplified representation of VLM (Image by Author)
  1. LLM — a text model (YandexGPT, in our case) that doesn’t understand images.
  2. Image encoder — an image model (CNN or Vision Transformer) that doesn’t understand text.
  3. Adapter — a model that acts as a mediator to ensure that the LLM and image encoder get along well.

The pipeline is pretty straightforward:

  • Feed an image into the image encoder.
  • Transform the output of the image encoder into some representation using the adapter.
  • Integrate the adapter’s output into the LLM (more on that below).
  • While the image is processed, convert the text instruct into a sequence of tokens and feed them into the LLM.

More Information About Adapters

The adapter is the most exciting and important part of the model, as it precisely facilitates the communication/interaction between the LLM and the image encoder.

There are two types of adapters:

  • Prompt-based adapters
  • Cross-attention-based adapters

Prompt-based adapters were first proposed in BLIP-2 and LLaVa models.

The idea is simple and intuitive, as evident from the name itself.

We take the output of the image encoder (a vector, a sequence of vectors, or a tensor — depending on the architecture) and transform it into a sequence of vectors (tokens), which we feed into the LLM. You could take a simple MLP model with a couple of layers and use it as an adapter, and the results will likely be pretty good.
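As a rough illustration (not the architecture the author actually uses), a prompt-based adapter can be sketched in a few lines of NumPy: a small MLP maps image-encoder features into the LLM's embedding space, and the resulting "visual tokens" are simply concatenated with the embedded text tokens. All dimensions and names below are assumptions.

import numpy as np

d_img, d_llm, n_patches, n_text = 768, 1024, 196, 32         # assumed dimensions
rng = np.random.default_rng(0)

W1 = rng.normal(size=(d_img, d_llm)) * 0.01                  # two-layer MLP adapter
W2 = rng.normal(size=(d_llm, d_llm)) * 0.01

def adapter(image_features):                                  # (n_patches, d_img)
    hidden = np.maximum(image_features @ W1, 0.0)             # linear + ReLU
    return hidden @ W2                                        # (n_patches, d_llm) "visual tokens"

image_features = rng.normal(size=(n_patches, d_img))          # stand-in for the image encoder output
text_embeddings = rng.normal(size=(n_text, d_llm))            # stand-in for embedded instruct tokens
llm_input = np.concatenate([adapter(image_features), text_embeddings], axis=0)
print(llm_input.shape)                                        # (228, 1024): the sequence the LLM consumes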

Cross-attention-based adapters are a bit more sophisticated in this respect.

They were used in recent papers on Llama 3.2 and NVLM.

These adapters aim to transform the image encoder’s output to be used in the LLM’s cross-attention block as key/value matrices. Examples of such adapters include transformer architectures like perceiver resampler or Q‑former.

Prompt-based adapters (left) and Cross-attention-based adapters (right) (Image by Author)


Both approaches have pros and cons.

Currently, prompt-based adapters deliver better results but take away a large chunk of the LLM’s input context, which is important since LLMs have limited context length (for now).

Cross-attention-based adapters don’t take away from the LLM’s context but require a large number of parameters to achieve good quality.

VLM Training

With the architecture sorted out, let’s dive into training.

Firstly, note that VLMs aren’t trained from scratch (although we think it’s only a matter of time) but are built on pre-trained LLMs and image encoders.

Using these pre-trained models, we fine-tune our VLM on multimodal text and image data.

This process involves two steps:

  • Pre-training
  • Alignment: SFT + RL (optional)

Training procedure of VLMs (Image by Author)

Notice how these stages resemble LLM training?

This is because the two processes are similar in concept. Let’s take a brief look at these stages.

VLM Pre-training

Here’s what we want to achieve at this stage:

  • Link the text and image modalities together (remember that our model includes an adapter we haven’t trained before).
  • Load world knowledge into our model (images come with a lot of their own specifics; OCR skills, for one).

There are three types of data used in pre-training VLMs:

  • Interleaved Pre-training: This mirrors the LLM pre-training phase, where we teach the model to perform the next token prediction task by feeding it web documents. With VLM pre-training, we pick web documents with images and train the model to predict text. The key difference here is that a VLM considers both the text and the images on the page. Such data is easy to come by, so this type of pre-training isn’t hard to scale up. However, the data quality isn’t great, and boosting it proves to be a tough job.
Interleaved Pre-training dataset (Image by Author)

  • Image-Text Pairs Pre-training: We train the model to perform one specific task: captioning images. You need a large corpus of images with relevant descriptions to do that. This approach is more popular because many such corpora are used to train other models (text-to-image generation, image-to-text retrieval).

Image-Text Pairs Pre-training dataset (Image by Author)

  • Instruct-Based Pre-training: During inference, we'll feed the model images and text. Why not train the model this way from the start? This is precisely what instruct-based pre-training does: It trains the model on a massive dataset of image-instruct-answer triplets, even if the data isn't always perfect.

Instruct-Based Pre-training dataset (Image by Author)

How much data is needed to train a VLM model properly is a complex question. At this stage, the required dataset size can vary from a few million to several billion (thankfully, not a trillion!) samples.

Our team used instruct-based pre-training with a few million samples. However, we believe interleaved pre-training has great potential, and we’re actively working in that direction.

VLM Alignment

Once pre-training is complete, it’s time to start on alignment.

It comprises SFT training and an optional RL stage. Since we only have the SFT stage, I’ll focus on that.

Still, recent papers (like this and this) often include an RL stage on top of VLM, which uses the same methods as for LLMs (DPO and various modifications differing by the first letter in the method name).

Anyway, back to SFT.

Strictly speaking, this stage is similar to instruct-based pre-training.

The distinction lies in our focus on high-quality data with proper response structure, formatting, and strong reasoning capabilities.

This means that the model must be able to understand the image and make inferences about it. Ideally, it should respond equally well to text instructs without images, so we’ll also add high-quality text-only data to the mix.

Ultimately, this stage’s data typically ranges between hundreds of thousands to a few million examples. In our case, the number is somewhere in the six digits.

Quality Evaluation

Let’s discuss the methods for evaluating the quality of VLMs. We use two approaches:

  • Calculate metrics on open-source benchmarks.
  • Compare the models using side-by-side (SBS) evaluations, where an assessor compares two model responses and chooses the better one.

The first method allows us to measure surrogate metrics (like accuracy in classification tasks) on specific subsets of data.

However, since most benchmarks are in English, they can’t be used to compare models trained in other languages, like German, French, Russian, etc.

While translation can be used, the errors introduced by translation models make the results unreliable.

The second approach allows for a more in-depth analysis of the model but requires meticulous (and expensive) manual data annotation.

Our model is bilingual and can respond in both English and Russian. Thus, we can use English open-source benchmarks and run side-by-side comparisons.

We trust this method and invest a lot in it. Here’s what we ask our assessors to evaluate:

  • Grammar
  • Readability
  • Comprehensiveness
  • Relevance to the instruct
  • Errors (logical and factual)
  • Hallucinations

We strive to evaluate a complete and diverse subset of our model’s skills.

The following pie chart illustrates the distribution of tasks in our SbS evaluation bucket.

Distribution of tasks for quality evaluation (Image by Author)

This summarizes the overview of VLM fundamentals and how one can train a model and evaluate its quality.

Pipeline Architecture

This spring, we added multimodality to Neuro, an AI-powered search product, allowing users to ask questions using text and images.

Until recently, its underlying technology wasn’t truly multimodal.

Here’s what this pipeline looked like before.

Pipeline architecture (Image by Author)

This diagram seems complex, but it’s straightforward once you break it down into steps.

Here's what the process used to look like:

  1. The user submits an image and a text query.
  2. We send the image to our visual search engine, which would return a wealth of information about the image (tags, recognized text, information card).
  3. We formulate a text query using a rephraser (a fine-tuned LLM) with this information and the original query.
  4. With the rephrased text query, we use Yandex Search to retrieve relevant documents (or excerpts, which we call infocontext).
  5. Finally, with all this information (original query, visual search information, rephrased text query, and info context), we generate the final response using a generator model (another fine-tuned LLM).

Done!

As you can see, we used to rely on two unimodal LLMs and our visual search engine. This solution worked well on a small sample of queries but had limitations.

Below is an example (albeit slightly exaggerated) of how things could go wrong.

The problem with two unimodal LLMs (Image by Author)

Here, the rephraser receives the output of the visual search service and simply doesn’t understand the user’s original intent.

In turn, the LLM model, which knows nothing about the image, generates an incorrect search query, getting tags about the pug and the apple simultaneously.

To improve the quality of our multimodal response and allow users to ask more complex questions, we introduced a VLM into our architecture.

More specifically, we made two major modifications:

  1. We replaced the LLM rephraser with a VLM rephraser. Essentially, we started feeding the original image to the rephraser’s input on top of the text from the visual search engine.
  2. We added a separate VLM captioner to the pipeline. This model provides an image description, which we use as info context for the final generator.

You might wonder

Why not make the generator itself VLM-based?

That’s a good idea!

But there’s a catch.

Our generator training inherits from Neuro’s text model, which is frequently updated.

To update the pipeline faster and more conveniently, it was much easier for us to introduce a separate VLM block.

Plus, this setup works just as well, which is shown below:

Using VLM in AI-powered search (Image by Author)

Training the VLM rephraser and the VLM captioner are two separate tasks.

For both, we took the VLM mentioned earlier as a base and fine-tuned it for each specific task.

Fine-tuning these models required collecting separate training datasets comprising tens of thousands of samples.

We also had to make significant changes to our infrastructure to make the pipeline computationally efficient.

Gauging the Quality

Now for the grand question:

Did introducing a VLM to a fairly complex pipeline improve things?

In short, yes, it did!

We ran side-by-side tests to measure the new pipeline’s performance and compared our previous LLM framework with the new VLM one.

This evaluation is similar to the one discussed earlier for the core technology. However, in this case, we use a different set of images and queries more aligned with what users might ask.

Below is the approximate distribution of clusters in this bucket.

Cluster distribution (Image by Author)

Our offline side-by-side evaluation shows that we’ve substantially improved the quality of the final response.

The VLM pipeline noticeably increases the response quality and covers more user scenarios.

Accuracy of VLM vs LLM in Neuro (Image by Author)

We also wanted to test the results on a live audience to see if our users would notice the technical changes that we believe would improve the product experience.

So, we conducted an online split test, comparing our LLM pipeline to the new VLM pipeline. The preliminary results show the following change:

  • The number of instructs that include an image increased by 17%.
  • The number of sessions (the user entering multiple queries in a row) saw an uptick of 4.5%.

To reiterate what was said above, we firmly believe that VLMs are the future of computer vision models.

VLMs are already capable of solving many out-of-the-box problems. With a bit of fine-tuning, they can absolutely deliver state-of-the-art quality.

Thanks for reading!




06Nov

Language Models Emerging Technologies | by Cobus Greyling | Nov, 2024


What Trended in 2024 — Six Technologies Which Dominated Timelines

In 2024, we saw the technology focus shifting from the Chain of Thought (CoT) approach to Retrieval-Augmented Generation (RAG), reflecting the need for precise, contextual responses in generative AI.

Building on RAG, Agentic RAG emerged, adding autonomous capabilities for AI to dynamically retrieve, interpret, and act on data.

As these technologies advanced, attention grew around Small Language Models and Foundation Models, balancing task-specific efficiency with general-purpose adaptability.


This trajectory then accelerated toward AI Agents capable of more complex, interactive roles, paving the way for something I like to call Agentic X — a framework embedding agentic capabilities directly into applications, making them proactive, adaptive, and contextually aware in meeting user goals independently.




05Nov

The AI Summit Series: What Should Its Niche Be?


GovAI research blog posts represent the views of their authors, rather than the views of the organisation.

Introduction

The Paris AI Action Summit, to be held in February 2025, will be the third iteration of a new international summit series, following 2023’s AI Safety Summit in Bletchley and 2024’s AI Seoul Summit. The first two summits produced several striking outcomes, including: the first international declaration on the importance of ensuring the safety of advanced AI, voluntary safety commitments from leading AI companies, the commissioning of the first International Scientific Report on the Safety of Advanced AI, and the launch of several national AI safety institutes.

The path forward after the February 2025 summit, however, is unclear. There are not yet any public plans for the series after Paris. Many questions about its future remain unanswered.

One crucial question is: What should the summit series’ distinct focus (niche) be? The international AI governance landscape is already crowded, with many international forums competing for engagement. The long term success of the series will therefore depend on its carving out a niche, establishing a distinct identity and offering key stakeholders unique value for their engagement. Clarity on the specific need that the series will fulfil will also make it easier for participants to focus their attention — and easier for each summit’s organisers to craft agendas that progress coherently toward long term goals. A more clearly-defined niche would enable the series to more neatly feed into broader processes, potentially including those within the United Nations.

Across the first three summits, however, the niche of the series has become less clear. The scope has expanded from a specific focus on the safety of advanced AI systems to a broader set of topics. While this expansion has facilitated more holistic conversations, it has also created more overlap with other forums and limited continuity across summits.

Ultimately, I will argue, the most compelling niche for the summit series is: a multistakeholder forum that focuses mainly on “advanced AI” with an agenda that is closely aligned with the work of AI safety institutes. This specific niche or focus can be institutionalised by developing a strategic framework to guide future summits, ensuring long-term coherence and clearly distinguishing the series from other international forums.

The Need to Define a Niche

The scope of the summit series’ content has expanded significantly across the three summits. The initial summit (the “AI Safety Summit”) focused purely on the safety of advanced AI systems. The second summit (the “AI Seoul Summit”) expanded the initial summit’s focus on advanced AI safety by also introducing innovation and inclusivity as core themes. The third summit (the “AI Action Summit”) will include five expansive workstreams — covering “public interest AI”, “future of work”, “innovation and ecosystems”, “AI in trust” (including safety topics as a subset), and “global governance of AI” — and will cover both advanced AI and other kinds of AI systems. 

This trend towards expanding the breadth of topics offers some obvious benefits. It will bring additional consideration and discussion to several important issues, beyond what they receive in other forums. It may also foster more “holistic” discussions, which acknowledge the intersections between different topics in AI governance.

However, crucially, this trend towards breadth has begun to blur the series’ distinct identity among other international AI governance forums. With a more distinct identity, the summit will encourage and enable more sustainable and meaningful participation from key players in AI. Key players are already expected to participate in numerous high-level events annually, including the G7 summit, the G20 summit, the Global Partnership on AI – OECD summit, the UNESCO Ethics of AI summit, the annual meeting of the World Economic Forum, and the ITU AI for Good summit. The highest level of representation from participating countries and companies cannot attend every AI-related event and must inevitably prioritise. If there is not a clear and compelling story about what the AI Summit Series offers that is distinct from these existing summits, then high-level representatives simply will not prioritise it.

Furthermore, as the focus of summits grows broader, it will become more difficult to make progress within summits. The top representatives from different countries and companies will not, in practice, have the capacity to engage deeply with several different workstreams in a single summit; either their attention will be spread shallowly across several workstreams or some workstreams will be neglected.

Growing breadth will also make it harder for host countries — especially comparatively less well-resourced host countries — to build successfully on each other’s work. A clear and stable niche would make it easier for hosts to design agendas that build on past summits in order to reach consistent long-term goals, and to dedicate sufficient resources to all parts of these agendas. A clear niche would also prevent declarations and commitments on different topics from accumulating across summits, which could make it challenging to keep track of agreed-upon goals and initiatives.

Ultimately, the summit series will face several key challenges as it moves forward: 

  1. Offering a unique value proposition to participants by addressing specific gaps in the landscape of international AI governance initiatives
  2. Supporting productive conversations, by maintaining focus
  3. Enabling momentum and high execution quality, by maintaining consistency across events

Overcoming these challenges will require careful coordination with existing initiatives to ensure complementary contributions to international efforts. 

The organisers of the next summit should respond to these challenges and agree on a compelling unified vision for the series, formalised in a strategic framework. This framework would outline themes and objectives for the series, creating a natural continuation strategy and providing a clear mandate and structure. This will enable each summit to build on the progress of previous summits, while remaining flexible enough to adapt to new developments in AI.

Proposing a Distinct Niche for the Series

The most natural niche for the summit series is a multistakeholder forum that focuses mainly on advanced AI, with an agenda that is closely aligned with the work of AI safety institutes. This focus would clearly distinguish the series from existing initiatives and address the three key challenges listed above.

The following discussion considers each component of this niche individually.

Embracing a Multistakeholder Approach

Many other international AI governance forums only involve industry and civil society participants in limited ways. In contrast, the AI summit series has offered more significant roles to companies and to civil society organisations. For example, one of the key outcomes of the AI Seoul Summit was a set of voluntary safety commitments made by leading AI companies worldwide. These commitments are an example of a valuable outcome that simply could not have occurred in a more state-focused venue, such as the G7 or G20.

The AI summit series could also capitalise on its successful multistakeholder approach by — for example — giving equal representation to different categories of stakeholders on joint planning committees and working groups. This would introduce a collaborative approach to the summit process, from the initial concept stage to final recommendations. While governments would retain final decision-making authority, this model would provide a structured forum for industry and academic insights to directly inform policy development.

Focusing Primarily on Advanced AI

While many international forums address AI broadly, no major forum other than the AI Summit Series has focused primarily on “advanced AI” (defined as general-purpose AI systems that exceed or approximately match the capabilities of the most powerful systems available). This topic is also complex and policy-relevant enough to warrant focused attention within at least one forum.  Advanced AI may present relatively distinct risks and opportunities and warrant some distinct measures to manage these risks. For that reason, several high-profile regulatory efforts — including the US Executive Order on AI and the EU AI Act — identify advanced AI as a distinct regulatory category. There is clearly a growing demand for focused discussions on advanced AI, which other major forums are not yet providing.

Most of the successes of the first two summits pertain primarily to advanced AI. Even if future summits do have somewhat broader scopes, it should be a central priority to build upon the advanced-AI-focused successes of past summits.

Aligning with the Efforts of AI Safety Institutes

The creation of AI safety institutes can itself be considered a result of the AI Summit Series: the first institutes were announced ahead of the initial AI Safety Summit, and plans for an international network of AI safety institutes were then announced at the AI Seoul Summit.

This close connection to the emerging AI safety institute network is a distinct asset for the AI Summit Series, one which can inform strategies for future progress. First, the summit can draw on the expertise housed within AI safety institutes to support informed international discussions. Second, the summit can also serve as a coordination hub for national AI safety institutes. For example, it could help these institutes to establish a shared research agenda on pressing AI safety challenges, facilitate data sharing agreements, or develop common evaluation methods and benchmarks for AI systems. The summit could also support more ambitious projects that the institutes may contribute to or lead in the future, such as efforts to create a shared framework for rapid response to emerging AI risks or efforts to coordinate safety certification processes across participating countries more closely. By supporting coordination between safety institutes, while also drawing on and spreading their expertise internationally, the summit series can contribute to a more unified and effective global approach to AI safety.

Support Structures and Norms

For the summit series to maintain a clear niche while retaining the flexibility to evolve in response to a changing AI landscape, it will be useful to institute a number of structures and norms.

As discussed above, consistency across summits could be supported by the creation of a high-level strategic framework. This framework could outline key objectives and themes for the series on a multiyear timeframe and help organisers to prioritise when setting summit agendas. The strategic framework should be updated regularly to reflect evolving priorities and the evolving international governance landscape, but should not change substantially in most years.

The creation of joint working groups, which carry over from one summit to the next, could help to ensure consistency. The working groups would support continuity in the series’ themes, although there would also need to be room for groups to be added, removed, or merged over time.

It will be important to establish structures and norms to ensure that the series remains aligned with other international processes, even as these other processes evolve. The fact that other processes are constantly evolving means that the summit series’ niche will also need to evolve over time — and that work must be done to ensure it continues to complement them, rather than competing with them. It will be important, for example, to align with the Global Digital Compact, adopted by 193 countries at the United Nations as the first comprehensive framework for global AI governance.

There are three main ways to ensure alignment:

  1. Invite representatives from other initiatives to participate (as has been done in the past)
  2. Establish formal partnerships with other initiatives and create structured channels to contribute effectively
  3. Design summit agendas that explicitly build on and complement the work of other initiatives, in consultation with them

Together, these structures and norms could help to ensure that the summit series continues to have a clear niche, while also continuing to evolve to meet new challenges and opportunities.

Conclusion

As the Paris AI Action Summit concludes the three-summit cycle initiated at Bletchley, the series’ future remains uncertain, yet full of potential. The international AI governance landscape is crowded, with numerous forums competing for attention and engagement. The landscape is likely to become more crowded over time, as reflected, for example, by the recently-adopted UN Global Digital Compact’s comprehensive plans for a global dialogue on AI governance. Stakeholders have limited capacity to engage with AI-related events and need to clearly understand the purpose of making time for this series. The series will struggle, therefore, if it does not make its niche clear.

We can learn from the successes of the past two summits, which yielded several concrete outcomes that influenced national agendas and company priorities. These successes suggest a particular niche: the summit series should be a multistakeholder forum dedicated to advanced AI, closely aligned with the efforts of AI safety institutes.

The challenge beyond Paris is to formalise this niche, developing a strategic framework with a long-term vision and concrete goals for the next 3-5 years. This approach would position the summit series as a cornerstone of international AI governance, capable of driving meaningful progress in a rapidly changing field.

Lucia Velasco

The views expressed in this article are those of the author and do not represent the views of their employer. The author would like to thank Ben Garfinkel for his valuable feedback.



Source link

05Nov

Anthropic ACI (AI Agent Computer Interface) | by Cobus Greyling | Nov, 2024


An AI Agent Computer Interface is a tool in an Agent’s toolbox which enables the agent to leverage a web browser as a human would.

This interface often supports seamless, context-aware exchanges, letting AI Agents handle complex tasks through intuitive commands and adaptive responses.

General problems with a web GUI are query execution time and errors in interpreting the screen. Human supervision can help a great deal in ensuring a smooth GUI agent journey.

What is an AI Agent (Agentic) Computer Interface?

An ACI is a piece of software which can receive compound and complex input from a user and answer it by making use of a computer interface, much in the same fashion that we as humans interact with a computer.

As you will see later in this article, the ACI acts as an agent tool in the context of the Anthropic example.

The interfaces should support natural, intuitive interactions with AI Agents to improve accessibility and usability, allowing users to engage effortlessly.

AI Agents should have context-sensitive capabilities, adapting responses based on past interactions and user needs for continuity and relevance.

Effective interfaces facilitate task automation, enabling agents to assist in complex workflows by taking over repetitive or straightforward actions.

Continuous user feedback integration enhances the agent’s ability to learn, adjust, and optimise performance over time.

The AI Agent has, as one of its available tools, a computer interface.

A new capability called computer use is now available in public beta, enabling developers to guide Claude in interacting with computers similarly to humans — navigating screens, clicking, and typing.

Claude 3.5 Sonnet is the first frontier AI model to support this functionality in a public beta, allowing for real-time experimentation and user feedback.

Though still in an early stage and occasionally prone to errors, this feature is expected to evolve quickly based on input from developers.
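For orientation, the sketch below shows roughly how the computer-use tool is declared when calling Claude through the Anthropic Python SDK. It is a minimal sketch based on the public beta announcement at the time of writing: the tool type, beta flag, and model identifier are assumptions drawn from that announcement and may change as the beta evolves.

```python
# Minimal sketch of a computer-use request via the Anthropic Python SDK.
# The tool type, beta flag, and model name are assumptions taken from the
# public beta announcement and should be verified against current docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",  # the computer interface tool
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        }
    ],
    messages=[
        {"role": "user", "content": "Open the browser and check the weather in London."}
    ],
    betas=["computer-use-2024-10-22"],
)

# The model responds with tool_use blocks (e.g. take a screenshot, click, type);
# the surrounding agent loop executes each action and returns a tool_result message.
for block in response.content:
    print(block.type)
```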

I think it is important to note that many models support vision, and that vision-enabled models from OpenAI and others have been used in frameworks to deliver AI Agents which interface with computers.

The most notable, for me at least, is the LangChain implementation of WebVoyager.

Hence it is important to note that this is a computer use framework made available by Anthropic. Providing frameworks through which value is delivered has been an approach followed by many model providers, as it makes their offerings more compelling.

I made use of the docker container locally on my MacBook…

Once the container is running, see the Accessing the demo app section below for instructions on how to connect to the interface.

Once the container is running, open your browser to http://localhost:8080 to access the combined interface that includes both the agent chat and desktop view.

The container stores settings like the API key and custom system prompt in ~/.anthropic/. Mount this directory to persist these settings between container runs.

Alternative access points:

Below is the script I made use of to initiate the docker container…
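The author’s original script is not reproduced here. As a rough illustration only, a typical invocation of the public computer-use demo container looks something like the following; the image tag, port mappings, and mount path are assumptions taken from the public quickstart and should be checked against the linked GitHub repository.

```bash
# Illustrative only, not the author's original script.
# The image tag, ports, and mount path are assumptions from the public quickstart.
export ANTHROPIC_API_KEY=your_api_key_here

# The -v mount persists the API key and custom system prompt between runs (see ~/.anthropic above);
# the ports expose VNC (5900), the Streamlit chat (8501), noVNC (6080), and the combined UI (8080).
docker run \
    -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
    -v $HOME/.anthropic:/home/computeruse/.anthropic \
    -p 5900:5900 -p 8501:8501 -p 6080:6080 -p 8080:8080 \
    -it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
```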

Find the GitHub quick start here.



Source link
