14Nov

Writing LLMs in Rust: Looking for an Efficient Matrix Multiplication | by Stefano Bosisio | Nov, 2024


Starting from Karpathy's llm.c, I asked myself, “Could I write this in Rust?” Here are the lessons I learned and how I am writing llm.rust. In this first article, let's tackle the matrix multiplication problem.

Image by GoogleDeepMind on Unsplash

Matrix multiplication may be the most important operation in machine learning. I still remember when I was an engineering student and, in one of the first linear algebra lessons, the teacher started to explain matrices, eigenvectors, and bases, including orthonormal bases. I was very confused; it took me a little while to understand why we were bothering so much about matrices and basis sets, and what a good basis implies for our world. Since then, I have always found linear algebra fascinating and, from a pure computer science point of view, I have admired all those algorithms that try to be ever more efficient in handling matrices.

In particular, we know that the matrix-vector product is pretty simple, but things get more complicated when we move to matrix-matrix or tensor-tensor products. For this reason, many methodologies have been implemented to optimize matrix multiplication. For example, a long time ago I posted about DeepMind



Source link

13Nov

Building Conversational AI Agents By Integrating Reasoning, Speaking & Acting With LLMs | by Cobus Greyling | Nov, 2024


1. When an agent seeks user guidance to refine its search strategy, it actively involves the user in defining the best approach, improving accuracy by ensuring its search aligns with user expectations.

2. This type of dialogue encourages collaboration, allowing users to clarify ambiguous instructions or adjust the search path as new insights arise.

3. Sharing status updates on task progress is essential for transparency, as it informs users of what the agent has completed and any challenges encountered.

4. Regular updates help users feel informed and give them an opportunity to provide additional instructions if the task requires it.

5. Soliciting user preferences is another valuable dialogue type, where the agent gathers input to shape task outcomes, ensuring decisions align closely with user needs.

6. This approach supports more personalised results, making the task execution feel interactive and responsive to individual preferences.

7. Together, these dialogue types create a flexible, two-way interaction that enhances the quality of task completion by combining automated assistance with user-specific insights.

8. Ultimately, these interactions improve alignment, trust, and satisfaction as the agent works to adapt and optimise its actions based on direct user input.



Source link

11Nov

My Medium Journey as a Data Scientist: 6 Months, 18 Articles, and 3,000 Followers | by Yu Dong | Nov, 2024


Real numbers, earnings, and data-driven growth strategy for Medium writers

I started writing data science and AI content on Medium in May 2024. This is my sixth month and I just hit a major milestone — 3,000 followers! I am very proud of my achievements.

In this article, I will share how this journey started, what I have been writing, and what I learned. Plus, as a data scientist, I always enjoy analyzing my own data. I collected a dataset of my Medium stats, including article views👀, reads📖, claps👏, earnings💵, etc. Join me as I break down my Medium experience using data and share my data-driven writing strategies.

Image created by DALL·E

How it all began

My writing habit dates back well before I started writing on Medium. I have been running my data science portfolio site since 2018, back when I started my first full-time job. I post articles there and occasionally share them on LinkedIn, which helps me connect with friends and colleagues in the data domain. Earlier this year, I posted an article about my experimentation with custom GPTs, and it reached nearly 10k impressions on LinkedIn. This is not bad at all, but it…



Source link

10Nov

AdaBoost Classifier, Explained: A Visual Guide with Code Examples | by Samy Baladram | Nov, 2024


ENSEMBLE LEARNING

Putting the weight where weak learners need it most

Everyone makes mistakes — even the simplest decision trees in machine learning. Instead of ignoring them, the AdaBoost (Adaptive Boosting) algorithm does something different: it learns (or adapts) from these mistakes to get better.

Unlike Random Forest, which makes many trees at once, AdaBoost starts with a single, simple tree and identifies the instances it misclassifies. It then builds new trees to fix those errors, learning from its mistakes and getting better with each step.

Here, we’ll illustrate exactly how AdaBoost makes its predictions, building strength by combining targeted weak learners just like a workout routine that turns focused exercises into full-body power.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

AdaBoost is an ensemble machine learning model that creates a sequence of weighted decision trees, typically using shallow trees (often just single-level “stumps”). Each tree is trained on the entire dataset, but with adaptive sample weights that give more importance to previously misclassified examples.

For classification tasks, AdaBoost combines the trees through a weighted voting system, where better-performing trees get more influence in the final decision.

The model’s strength comes from its adaptive learning process — while each simple tree might be a “weak learner” that performs only slightly better than random guessing, the weighted combination of trees creates a “strong learner” that progressively focuses on and corrects mistakes.

AdaBoost is part of the boosting family of algorithms because it builds trees one at a time. Each new tree tries to fix the mistakes made by the previous trees. It then uses a weighted vote to combine their answers and make its final prediction.

Throughout this article, we’ll focus on the classic golf dataset as an example for classification.

Columns: ‘Outlook’ (one-hot encoded into 3 columns), ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ (True/False) and ‘Play’ (Yes/No, target feature)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create and prepare dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
                'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
                'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
                    72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
                    88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
                 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
                 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}

# Prepare data
df = pd.DataFrame(dataset_dict)
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]

# Prepare features and target
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

Main Mechanism

Here’s how AdaBoost works:

  1. Initialize Weights: Assign equal weight to each training example.
  2. Iterative Learning: In each step, a simple decision tree is trained and its performance is checked. Misclassified examples get more weight, making them a priority for the next tree. Correctly classified examples stay the same, and all weights are adjusted to add up to 1.
  3. Build Weak Learners: Each new, simple tree targets the mistakes of the previous ones, creating a sequence of specialized weak learners.
  4. Final Prediction: Combine all trees through weighted voting, where each tree’s vote is based on its importance value, giving more influence to more accurate trees.
An AdaBoost Classifier makes predictions by using many simple decision trees (usually 50–100). Each tree, called a “stump,” focuses on one important feature, like temperature or humidity. The final prediction is made by combining all the trees’ votes, each weighted by how important that tree is (“alpha”).

Here, we’ll follow the SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function) algorithm, the standard approach in scikit-learn that handles both binary and multi-class classification.

1.1. Decide the weak learner to be used. A one-level decision tree (or “stump”) is the default choice.
1.2. Decide how many weak learners (in this case, the number of trees) you want to build (the default is 50 trees).

We begin with depth-1 decision trees (stumps) as our weak learners. Each stump makes just one split, and we’ll train 50 of them sequentially, adjusting weights along the way.

1.3. Start by giving each training example equal weight:
· Each sample gets weight = 1/N (N is total number of samples)
· All weights together sum to 1

All data points start with equal weights (0.0714), with the total weight adding up to 1. This ensures every example is equally important when training begins.
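
As a quick numeric check of this step, here is a tiny sketch using the 14 training samples produced by the 50/50 split above:
import numpy as np

N = 14                                   # len(X_train) after the 50/50 split
sample_weights = np.full(N, 1 / N)
print(round(sample_weights[0], 4))       # 0.0714
print(round(sample_weights.sum(), 4))    # 1.0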

For the First Tree

2.1. Build a decision stump while considering sample weights

Before making the first split, the algorithm examines all data points with their weights to find the best splitting point. These weights influence how important each example is in making the split decision.

a. Calculate initial weighted Gini impurity for the root node

The algorithm calculates the Gini impurity score at the root node, but now considers the weights of all data points.

b. For each feature:
· Sort data by feature values (exactly like in Decision Tree classifier)

For each feature, the algorithm sorts the data and identifies potential split points, exactly like the standard Decision Tree.

· For each possible split point:
·· Split samples into left and right groups
·· Calculate weighted Gini impurity for both groups
·· Calculate weighted Gini impurity reduction for this split

The algorithm calculates weighted Gini impurity for each potential split and compares it to the parent node. For feature “sunny” with split point 0.5, this impurity reduction (0.066) shows how much this split improves the data separation.

c. Pick the split that gives the largest Gini impurity reduction

After checking all possible splits across features, the column ‘overcast’ (with split point 0.5) gives the highest impurity reduction of 0.102. This means it’s the most effective way to separate the classes, making it the best choice for the first split.

d. Create a simple one-split tree using this decision

Using the best split point found, the algorithm divides the data into two groups, each keeping their original weights. This simple decision tree is purposely kept small and imperfect, making it just slightly better than random guessing.

2.2. Evaluate how good this tree is
a. Use the tree to predict the label of the training set.
b. Add up the weights of all misclassified samples to get error rate

The first weak learner makes predictions on the training data, and we check where it made mistakes (marked with X). The error rate of 0.357 shows this simple tree gets some predictions wrong, which is expected and will help guide the next steps of training.

c. Calculate tree importance (α) using:
α = learning_rate × log((1-error)/error)

Using the error rate, we calculate the tree’s influence score (α = 0.5878). Higher scores mean more accurate trees, and this tree earned moderate importance for its decent performance.
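
This score can be reproduced directly from the formula above; here is a small sketch (the 0.357 error rate corresponds to 5 of the 14 equally weighted samples being misclassified):
import numpy as np

learning_rate = 1.0                      # scikit-learn's default
error = 5 / 14                           # the 0.357 error rate reported above
alpha = learning_rate * np.log((1 - error) / error)
print(round(alpha, 4))                   # 0.5878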

2.3. Update sample weights
a. Keep the original weights for correctly classified samples
b. Multiply the weights of misclassified samples by e^(α).
c. Divide each weight by the sum of all weights. This normalization ensures all weights still sum to 1 while maintaining their relative proportions.

Cases where the tree made mistakes (marked with X) get higher weights for the next round. After increasing these weights, all weights are normalized to sum to 1, ensuring misclassified examples get more attention in the next tree.
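
Here is a small sketch of this update with the numbers above (14 samples, 5 mistakes, α = 0.5878); which samples are misclassified is a placeholder choice for illustration:
import numpy as np

w = np.full(14, 1 / 14)                  # weights right after initialization
misclassified = np.zeros(14, dtype=bool)
misclassified[:5] = True                 # placeholder: pretend the first 5 were wrong

alpha = 0.5878
w[misclassified] *= np.exp(alpha)        # boost the weights of the mistakes
w /= w.sum()                             # renormalize so the weights sum to 1

print(round(w[misclassified][0], 3))     # 0.1   (was 0.0714)
print(round(w[~misclassified][0], 3))    # 0.056 (was 0.0714)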

For the Second Tree

2.1. Build a new stump, but now using the updated weights
a. Calculate new weighted Gini impurity for root node:
· Will be different because misclassified samples now have bigger weights
· Correctly classified samples now have smaller weights

Using the updated weights (where misclassified examples now have higher importance), the algorithm calculates the weighted Gini impurity at the root node. This begins the process of building the second decision tree.

b. For each feature:
· Same process as before, but the weights have changed
c. Pick the split with best weighted Gini impurity reduction
· Often completely different from the first tree’s split
· Focuses on samples the first tree got wrong

With updated weights, different split points show different effectiveness. Notice that “overcast” is no longer the best split — the algorithm now finds temperature (84.0) gives the highest impurity reduction, showing how weight changes affect split selection.

d. Create the second stump

Using temperature ≤ 84.0 as the split point, the algorithm assigns YES/NO to each leaf based on which class has more total weight in that group, not just by counting examples. This weighted voting helps correct the previous tree’s mistakes.

2.2. Evaluate this new tree
a. Calculate error rate with current weights
b. Calculate its importance (α) using the same formula as before
2.3. Update weights again — Same process: increase weights for mistakes then normalize.

The second tree achieves a lower error rate (0.222) and higher importance score (α = 1.253) than the first tree. Like before, misclassified examples get higher weights for the next round.

For the Third Tree onwards

Repeat Step 2.1–2.3 for all remaining trees.

The algorithm builds 50 simple decision trees sequentially, each with its own importance score (α). Each tree learns from previous mistakes by focusing on different aspects of the data, creating a strong combined model. Notice how some trees (like Tree 2) get higher importance scores when they perform better.

Step 3: Final Ensemble
3.1. Keep all trees and their importance scores

The 50 simple decision trees work together as a team, each with its own importance score (α). When making predictions, trees with higher α values (like Tree 2 with 1.253) have more influence on the final decision than trees with lower scores.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Train AdaBoost
np.random.seed(42)  # For reproducibility
clf = AdaBoostClassifier(algorithm='SAMME', n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Create visualizations for trees 1, 2, and 50
trees_to_show = [0, 1, 49]
feature_names = X_train.columns.tolist()
class_names = ['No', 'Yes']

# Set up the plot
fig, axes = plt.subplots(1, 3, figsize=(14, 4), dpi=300)
fig.suptitle('Decision Stumps from AdaBoost', fontsize=16)

# Plot each tree
for idx, tree_idx in enumerate(trees_to_show):
    plot_tree(clf.estimators_[tree_idx],
              feature_names=feature_names,
              class_names=class_names,
              filled=True,
              rounded=True,
              ax=axes[idx],
              fontsize=12)  # Increased font size
    axes[idx].set_title(f'Tree {tree_idx + 1}', fontsize=12)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

Each node shows its ‘value’ parameter as [weight_NO, weight_YES], which represents the weighted proportion of each class at that node. These weights come from the sample weights we calculated during training.
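
If you want to inspect the per-tree scores directly, scikit-learn exposes them on the fitted model; a small sketch continuing from the clf trained above (the printed values will depend on your run):
# Per-tree importance (alpha) and weighted error rate, in training order
print(clf.estimator_weights_[:3])        # the first three alphas
print(clf.estimator_errors_[:3])         # the matching weighted error rates
print(len(clf.estimators_))              # 50 stumps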

Testing Step

For predicting:
a. Get each tree’s prediction
b. Multiply each by its importance score (α)
c. Add them all up
d. The class with higher total weight will be the final prediction

When predicting for new data, each tree makes its prediction and multiplies it by its importance score (α). The final decision comes from adding up all weighted votes — here, the NO class gets a higher total score (23.315 vs 15.440), so the model predicts NO for this unseen example.
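
The weighted vote itself is just a per-class sum of alphas; here is a tiny sketch with made-up per-tree predictions and importance scores (across all 50 trees, the article's example sums to 23.315 for NO versus 15.440 for YES):
import numpy as np

alphas = np.array([0.588, 1.253, 0.774])  # importance scores (illustrative)
tree_preds = np.array([0, 0, 1])          # 0 = NO, 1 = YES for one new sample

totals = np.zeros(2)
for alpha, pred in zip(alphas, tree_preds):
    totals[pred] += alpha                 # each tree adds its alpha to its class

print(totals, "->", ["NO", "YES"][int(np.argmax(totals))])  # [1.841 0.774] -> NO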

Evaluation Step

After building all the trees, we can evaluate the test set.

By iteratively training and weighting weak learners to focus on misclassified examples, AdaBoost creates a strong classifier that achieves high accuracy — typically better than single decision trees or simpler models!
# Get predictions
y_pred = clf.predict(X_test)

# Create DataFrame with actual and predicted values
results_df = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})
print(results_df) # Display results DataFrame

# Calculate and display accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

Here are the key parameters for AdaBoost, particularly in scikit-learn:

estimator: This is the base model that AdaBoost uses to build its final solution. The three most common weak learners are:
a. Decision Tree with depth 1 (Decision Stump): This is the default and most popular choice. Because it only has one split, it is considered a very weak learner that is just a bit better than random guessing, exactly what is needed for the boosting process.
b. Logistic Regression: Logistic regression (especially with a strong penalty) can also be used here, even though it is not really a weak learner. It can be useful for data with linear relationships.
c. Decision Trees with small depth (e.g., depth 2 or 3): These are slightly more complex than decision stumps. They’re still fairly simple, but can handle slightly more complex patterns than the decision stump.

AdaBoost’s base models can be simple decision stumps (depth=1), small trees (depth 2–3), or penalized linear models. Each type is kept simple to avoid overfitting while offering different ways to capture patterns.

n_estimators: The number of weak learners to combine, typically around 50–100. Using more than 100 rarely helps.

learning_rate: Controls how much each classifier affects the final result. Common starting values are 0.1, 0.5, or 1.0. Lower values (like 0.1) combined with a somewhat larger n_estimators usually work better.
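
As a hedged illustration of these settings (starting points for experimentation, not tuned values), the three base-estimator choices above can be configured like this:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# a. Default decision stump
stump_ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50, learning_rate=1.0, algorithm='SAMME', random_state=42)

# b. Heavily penalized logistic regression as the base model
logreg_ada = AdaBoostClassifier(
    estimator=LogisticRegression(C=0.01, max_iter=1000),
    n_estimators=50, learning_rate=0.5, algorithm='SAMME', random_state=42)

# c. Slightly deeper (but still shallow) trees, with a lower learning rate and more trees
small_tree_ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2),
    n_estimators=100, learning_rate=0.1, algorithm='SAMME', random_state=42)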

Key differences from Random Forest

As both Random Forest and AdaBoost work with multiple trees, it is easy to confuse the parameters involved. The key difference is that Random Forest combines many trees independently (bagging) while AdaBoost builds trees one after another to fix mistakes (boosting). Here are some other details about their differences:

  1. No bootstrap parameter because AdaBoost uses all data but with changing weights
  2. No oob_score because AdaBoost doesn’t use bootstrap sampling
  3. learning_rate becomes crucial (not present in Random Forest)
  4. Tree depth is typically kept very shallow (usually just stumps) unlike Random Forest’s deeper trees
  5. The focus shifts from parallel independent trees to sequential dependent trees, making parameters like n_jobs less relevant

Pros:

  • Adaptive Learning: AdaBoost gets better by giving more weight to mistakes it made. Each new tree pays more attention to the hard cases it got wrong.
  • Resists Overfitting: Even though it keeps adding more trees one by one, AdaBoost usually doesn’t get too focused on training data. This is because it uses weighted voting, so no single tree can control the final answer too much.
  • Built-in Feature Selection: AdaBoost naturally finds which features matter most. Each simple tree picks the most useful feature for that round, which means it automatically selects important features as it trains.

Cons:

  • Sensitive to Noise: Because it gives more weight to mistakes, AdaBoost can have trouble with messy or wrong data. If some training examples have wrong labels, it might focus too much on these bad examples, making the whole model worse.
  • Must Be Sequential: Unlike Random Forest which can train many trees at once, AdaBoost must train one tree at a time because each new tree needs to know how the previous trees did. This makes it slower to train.
  • Learning Rate Sensitivity: While it has fewer settings to tune than Random Forest, the learning rate really affects how well it works. If it’s too high, it might learn the training data too exactly. If it’s too low, it needs many more trees to work well.

AdaBoost is a key boosting algorithm that many newer methods learned from. Its main idea — getting better by focusing on mistakes — has helped shape many modern machine learning tools. While other methods try to be perfect from the start, AdaBoost tries to show that sometimes the best way to solve a problem is to learn from your errors and keep improving.

AdaBoost also works best in binary classification problems and when your data is clean. While Random Forest might be better for more general tasks (like predicting numbers) or messy data, AdaBoost can give really good results when used in the right way. The fact that people still use it after so many years shows just how well the core idea works!

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
                'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy',
                'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast',
                'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0,
                    72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0,
                    88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0,
                 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0,
                 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True,
             True, False, True, True, False, False, True, False, True, True, False,
             True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes',
             'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
             'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split features and target
X, y = df.drop('Play', axis=1), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Train AdaBoost
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Base estimator: a decision stump
    n_estimators=50,        # Typically fewer trees than Random Forest
    learning_rate=1.0,      # Default learning rate
    algorithm='SAMME',      # The only currently available algorithm (will be removed in future scikit-learn updates)
    random_state=42
)
ada.fit(X_train, y_train)

# Predict and evaluate
y_pred = ada.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")



Source link

09Nov

Core AI For Any Rummy Variant. Step by Step guide to a Rummy AI | by Iheb Rachdi | Nov, 2024


Identifying and Collecting key Data

I explored several algorithms to optimize and reduce the search space for all possible combos. However, the fact that each card can appear twice increased the number of potential combos, making it challenging to track and validate each one. While competing on Codeforces, I encountered a problem that reminded me of the ‘island problem,’ which gave me new insight into approaching the hand evaluator system.

We can represent the hand as a 2D grid of size 4×13, where each column represents a rank from 1 to 13 and each row corresponds to one of the 4 suits. Each cell in this grid contains the count of that card in the hand (in our case 0, 1, or 2). This allows us to divide the hand into ‘islands,’ defined as groups of connected land cells with counts of 1 or 2, based on the following connectivity rules:

1. Two cells are considered connected if they share a side (left, right, above, or below) in the grid.

2. All cells within the same column are also connected if they each contain at least one card, even if they are not adjacent (above or below).

Example of ‘hand A’: 11C 3H 4H 11D 3D 5H 9D 2H 6H 3C 4H 3D 4D 5H 12D 3C

Table representation of ‘hand A’
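
As a minimal sketch of this representation (the suit-row order and the rank-then-suit string format below are my own assumptions, not the author's code), the 4×13 count grid for 'hand A' can be built like this:
import numpy as np

SUITS = {'H': 0, 'D': 1, 'C': 2, 'S': 3}   # assumed row order: hearts, diamonds, clubs, spades

def hand_to_grid(hand: str) -> np.ndarray:
    grid = np.zeros((4, 13), dtype=int)
    for card in hand.split():
        rank, suit = int(card[:-1]), card[-1]
        grid[SUITS[suit], rank - 1] += 1   # each cell ends up 0, 1, or 2
    return grid

hand_a = "11C 3H 4H 11D 3D 5H 9D 2H 6H 3C 4H 3D 4D 5H 12D 3C"
print(hand_to_grid(hand_a))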

Our first task is to identify and label all distinct islands. Since each island is independent of the others, we can make our lives easier by mapping each island to a class type; let's name it _cardGraph. This class will be responsible for that island in terms of extracting, modifying, or deleting operations.

For clarity, let's isolate one island and work on it in the upcoming sections, so it's easier for you to follow. If it helps, you can think of each island as a connected graph, as shown in the figure below:

in Left: Island Represented in the Table; in Right: Same Island in a Connected Graph Perspective

Now, if you take multiple island examples and try to extract the possible combos, you'll notice that some cards play a unique role in branching out to potential combinations. We'll call this type of card a control point, or Cpts for short, as they play an essential role in reducing the search space significantly, as you will see in the following steps.

Cpts: For a card to be considered a Cpts, it must be in a position where we have to make a choice about which meld (run or set) to append it to. If a card can naturally fit into multiple melds without forcing a choice (for example, a duplicate card with two meld options, where each copy can be appended to its own meld), it won't be considered a Cpts.

In the case of our island example, the 3 of Hearts is identified as a Cpts. Below are all the melds that the 3 of Hearts could attach to, one at a time.

Our next step is to mark each card that qualifies as a Cpts. To do this, we'll create a 4×13 table of bytes; let's call it _flagMap. For memory efficiency, you can make this a shared table that each _cardGraph instance created from the hand can reference and use. In this table, each card in an island will be assigned a bitstream at the corresponding index in _flagMap; this byte represents its potential placements in different runs or sets. If a card qualifies as a Cpts, it will be stored in a stack (which we will need later), which we'll call _cptsStack. Here's a breakdown of the byte structure: the first bit indicates whether the card belongs to a run, the second bit indicates its placement in an additional run, the third bit represents whether it belongs to a set, and the fourth bit specifies if it belongs to a second set.

Here's an example of a bitstream, 00000111. Here we have:

The first bit (1) means the card can belong to a run.

The second bit (1) means the card can belong to a second run.

The third bit (1) means the card belongs to a set.

The fourth bit (0) means the card doesn’t belong to a second set.

We might have a case where the configuration is 00000101 for a card with a single copy, meaning the card can belong to either a run or a set. Another configuration could be 00000011, meaning the card can belong to two different runs.

To identify a Cpts, simply count the '1's in its bit representation. If this count exceeds the total number of copies of that card in the hand, it's considered a Cpts. For instance, if a card appears twice (i.e., has two copies) and its bit representation is 00000101, it's not a Cpts. However, if the bit representation is 00000111, as in the example above, then it qualifies as a Cpts.

In our island example, here's how the _flagMap table would look:

_FlagMap Representation of the ‘hand A’ Example
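
The test itself is tiny; here is a sketch of it (the function name is illustrative, not from the author's code):
def is_cpts(flag_byte: int, copies: int) -> bool:
    # A card is a control point when it has more candidate placements (set bits)
    # than copies in the hand, forcing a choice between melds.
    return bin(flag_byte).count("1") > copies

print(is_cpts(0b00000101, 2))   # False: two placements, two copies
print(is_cpts(0b00000111, 2))   # True:  three placements, only two copies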

Once we’ve populated the _flagMap and identified the cpts, the next task is to decompose the island into horizontal and vertical lines. But why? Breaking down the card graph into these lines simplifies the process of identifying runs and sets, as it allows us to focus on contiguous sequences of cards that can be processed more efficiently. As you might guess, the vertical lines will represent the sets, while the horizontal lines will represent the runs.

Island decomposed into Horizontal and Vertical Lines

We'll store each horizontal line as a tuple in a list, where the first item is the line's starting index and the second item is its end index (inclusive). For the vertical lines, it's sufficient to simply store the column index in a list.

Tip: We can accomplish this task along with the bit representation step in a single loop, achieving O(n) complexity.

Generate Combos

Now, let's take a break and recap: we have identified the control points (Cpts) and stored them in the _cptsStack. We also decomposed the island into vertical and horizontal lines, and populated the _flagMap with each card's bit representation.

With our data in place, what remains is to use it to generate all possible valid combos of the island. But how do we do that? Here’s a simplified approach:

1. Assign Valid Placements for the Control Points (Cpts):
We take the bit representation of a Cpts from _flagMap, which indicates all possible placements for that Cpts. Then, we look at the number of copies of the Cpts in the _cardGraph and adjust its bit representation to a currently valid configuration. For example, if the Cpts has a bit representation of 00001111 and 2 copies, we can generate all valid placements for it, which is C(4, 2) = 6. The possible combinations would be 0011, 0101, 1100, 1010, 1001, and 0110 (see the sketch after this list).

2. Using DFS to Configure All Possible Combinations for Each Cpts:
We’ll use a depth-first search (DFS) to iterate over the valid placements for each cpts as shown in step 1. Each node in the DFS tree represents a possible placement for a given cpts, so each unique DFS path represents a valid combo configuration. For each “leaf” node (end of the DFS path), we proceed to the next step.

3. Generating Combos:
In this step, we iterate over the horizontal and vertical lines in the island to identify runs, sets, and a dump list. This is done in two passes for each line, as follows:

  • Pass 1: For a horizontal line, for example, we continuously append cards from [line start to line end] into a list to form a run. We stop adding if (card_bit_representation & 00000001) == 0. If the length of the run is greater than or equal to 3, we add it to the run combo; otherwise, each card goes into the dump list, and we continue trying to form another run until we reach the line end.
  • Pass 2: Repeat the process, this time masking against the second-run bit (00000010). This allows us to identify possible second runs.

The same approach applies to extracting sets, but we use bit operations with 00000100 and 00001000.

4. Register the Valid Combo and Move to the Next DFS Configuration:
After completing all runs, sets, and dumps for the current combo, we save the combo and then move on to the next DFS configuration to repeat the process. This way, we systematically explore all potential configurations for valid combos.
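
Below is a small, hedged sketch of two of the pieces above, with illustrative names that are not taken from the author's code base: enumerating the valid placements of a Cpts from its flag byte (step 1), and the pass-1 scan of a horizontal line that keeps stretches of at least three flagged cards as runs and sends the rest to the dump list (step 3):
from itertools import combinations

# Step 1: valid placements of a Cpts across the slots allowed by its flag byte.
# Bits 0..3 = first run, second run, first set, second set, as described above.
def cpts_placements(flag_byte: int, copies: int):
    slots = [i for i in range(4) if flag_byte & (1 << i)]
    for chosen in combinations(slots, copies):
        yield sum(1 << i for i in chosen)

print([f"{p:04b}" for p in cpts_placements(0b1111, 2)])   # 6 placements: C(4, 2)

# Step 3, pass 1: scan one horizontal line, collecting runs of length >= 3 whose
# flags have the requested bit set; shorter stretches go to the dump list.
def scan_line(flags_on_line, mask=0b0001, min_len=3):
    runs, dump, current = [], [], []
    for col, flag in enumerate(flags_on_line):
        if flag & mask:
            current.append(col)
            continue
        if len(current) >= min_len:
            runs.append(current)
        else:
            dump.extend(current)
        current = []
    if len(current) >= min_len:
        runs.append(current)
    else:
        dump.extend(current)
    return runs, dump

# Five consecutive cards flagged for the first run, then one that is not:
print(scan_line([0b0001, 0b0001, 0b0001, 0b0001, 0b0001, 0b0100]))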

If you've coded everything correctly and feed it our island example, "2H3H4H5H4H5H6H3C3C3D3D4D", it should be decomposed as shown below. Notice that I've added some calculations to each generated combo so that we can get a sense of how the AI will act.

Console Output Showing the Generated Combo For the Island Example

In the next article, I’ll dive into the rest of the system, focusing on the dynamic modification of the hand and the AI strategy. If you’ve followed along so far, it won’t be hard to see how we can optimize adding and removing cards, as well as incorporate the two rules we set aside at the beginning. Stay tuned, and see you next time! “hopefully 😉”.

Unless otherwise noted, all images are created by the author using Lucidchart, GIMP, and Python.



Source link

06Nov

An Introduction to VLMs: The Future of Computer Vision Models | by Ro Isachenko | Nov, 2024


Building a 28% more accurate multimodal image search engine with VLMs.

Until recently, AI models were narrow in scope and limited to understanding either language or specific images, but rarely both.

In this respect, general language models like GPTs were a HUGE leap since we went from specialized models to general yet much more powerful models.

But even as language models progressed, they remained separate from computer vision areas, each domain advancing in silos without bridging the gap. Imagine what would happen if you could only listen but not see, or vice versa.

My name is Roman Isachenko, and I’m part of the Computer Vision team at Yandex.

In this article, I’ll discuss visual language models (VLMs), which I believe are the future of compound AI systems.

I’ll explain the basics and training process for developing a multimodal neural network for image search and explore the design principles, challenges, and architecture that make it all possible.

Towards the end, I’ll also show you how we used an AI-powered search product to handle images and text and what changed with the introduction of a VLM.

Let’s begin!

What Are VLMs?

LLMs with billions or even hundreds of billions of parameters are no longer a novelty.

We see them everywhere!

The next key focus in LLM research has been more inclined towards developing multimodal models (omni-models) — models that can understand and process multiple data types.

Multimodal models (Image by Author)

As the name suggests, these models can handle more than just text. They can also analyze images, video, and audio.

But why are we doing this?

Jack of all trades, master of none, oftentimes better than master of one.

In recent years, we’ve seen a trend where general approaches dominate narrow ones.

Think about it.

Today’s language-driven ML models have become relatively advanced and general-purpose. One model can translate, summarize, identify speech tags, and much more.

General NLP model (Image by Author)

But earlier, these models used to be task-specific (we have them now as well, but fewer than before).

  • A dedicated model for translating.
  • A dedicated model for summarizing, etc.

In other words, today’s NLP models (LLMs, specifically) can serve multiple purposes that previously required developing highly specific solutions.

Second, this approach allows us to exponentially scale the data available for model training, which is crucial given the finite amount of text data. Earlier, however, one would need task-specific data:

  • A dedicated translation labeled dataset.
  • A dedicated summarization dataset, etc.

Third, we believe that training a multimodal model can enhance the performance of each data type, just like it does for humans.

For this article, we’ll simplify the “black box” concept to a scenario where the model receives an image and some text (which we call the “instruct”) as input and outputs only text (the response).

As a result, we end up with a much simpler process as shown below:

A simplified multimodal model (Image by Author)

We’ll discuss image-discriminative models that analyze and interpret what an image depicts.

Before delving into the technical details, consider the problems these models can solve.

A few examples are shown below:

Examples of tasks (Image by Author)
  • Top left image: We ask the model to describe the image. This is specified with text.
  • Top mid image: We ask the model to interpret the image.
  • Top right image: We ask the model to interpret the image and tell us what would happen if we followed the sign.
  • Bottom image: This is the most complicated example. We give the model some math problems.

From these examples, you can see that the range of tasks is vast and diverse.

VLMs are a new frontier in computer vision that can solve various fundamental CV-related tasks (classification, detection, description) in zero-shot and one-shot modes.

While VLMs may not excel in every standard task yet, they are advancing quickly.

Now, let’s understand how they work.

VLM Architecture

These models typically have three main components:

Simplified representation of VLM (Image by Author)
  1. LLM — a text model (YandexGPT, in our case) that doesn’t understand images.
  2. Image encoder — an image model (CNN or Vision Transformer) that doesn’t understand text.
  3. Adapter — a model that acts as a mediator to ensure that the LLM and image encoder get along well.

The pipeline is pretty straightforward:

  • Feed an image into the image encoder.
  • Transform the output of the image encoder into some representation using the adapter.
  • Integrate the adapter’s output into the LLM (more on that below).
  • While the image is processed, convert the text instruct into a sequence of tokens and feed them into the LLM.

More Information About Adapters

The adapter is the most exciting and important part of the model, as it precisely facilitates the communication/interaction between the LLM and the image encoder.

There are two types of adapters:

  • Prompt-based adapters
  • Cross-attention-based adapters

Prompt-based adapters were first proposed in BLIP-2 and LLaVa models.

The idea is simple and intuitive, as evident from the name itself.

We take the output of the image encoder (a vector, a sequence of vectors, or a tensor — depending on the architecture) and transform it into a sequence of vectors (tokens), which we feed into the LLM. You could take a simple MLP model with a couple of layers and use it as an adapter, and the results will likely be pretty good.
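
To make the prompt-based idea concrete, here is a minimal sketch in plain numpy of such an MLP adapter: it turns one image-encoder vector into a short sequence of LLM-sized "visual tokens" that are simply prepended to the text-token embeddings. All dimensions, layer sizes, and names are illustrative assumptions, not the architecture the team actually uses.
import numpy as np

D_IMG, D_LLM, N_TOKENS = 1024, 4096, 32     # assumed sizes, for illustration only
rng = np.random.default_rng(0)

# Two-layer MLP adapter: image embedding -> N_TOKENS pseudo-token embeddings
W1 = rng.normal(scale=0.02, size=(D_IMG, 2048))
W2 = rng.normal(scale=0.02, size=(2048, N_TOKENS * D_LLM))

def prompt_adapter(image_embedding: np.ndarray) -> np.ndarray:
    h = np.maximum(image_embedding @ W1, 0.0)             # ReLU hidden layer
    return (h @ W2).reshape(N_TOKENS, D_LLM)              # a sequence of "visual tokens"

image_embedding = rng.normal(size=(D_IMG,))               # stand-in for the image-encoder output
text_embeddings = rng.normal(size=(12, D_LLM))            # stand-in for the embedded instruct
llm_input = np.concatenate([prompt_adapter(image_embedding), text_embeddings], axis=0)
print(llm_input.shape)                                    # (44, 4096), fed to the LLM as one long prompt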

Cross-attention-based adapters are a bit more sophisticated in this respect.

They were used in recent papers on Llama 3.2 and NVLM.

These adapters aim to transform the image encoder’s output to be used in the LLM’s cross-attention block as key/value matrices. Examples of such adapters include transformer architectures like perceiver resampler or Q‑former.

Prompt-based adapters (left) and Cross-attention-based adapters (right) (Image by Author)

Both approaches have pros and cons.

Currently, prompt-based adapters deliver better results but take away a large chunk of the LLM’s input context, which is important since LLMs have limited context length (for now).

Cross-attention-based adapters don’t take away from the LLM’s context but require a large number of parameters to achieve good quality.

VLM Training

With the architecture sorted out, let’s dive into training.

Firstly, note that VLMs aren’t trained from scratch (although we think it’s only a matter of time) but are built on pre-trained LLMs and image encoders.

Using these pre-trained models, we fine-tune our VLM on multimodal text and image data.

This process involves two steps:

  • Pre-training
  • Alignment: SFT + RL (optional)

Training procedure of VLMs (Image by Author)

Notice how these stages resemble LLM training?

This is because the two processes are similar in concept. Let’s take a brief look at these stages.

VLM Pre-training

Here’s what we want to achieve at this stage:

  • Link the text and image modalities together (remember that our model includes an adapter we haven’t trained before).
  • Load world knowledge into our model (images contain a lot of specific information; OCR skills, for one).

There are three types of data used in pre-training VLMs:

  • Interleaved Pre-training: This mirrors the LLM pre-training phase, where we teach the model to perform the next token prediction task by feeding it web documents. With VLM pre-training, we pick web documents with images and train the model to predict text. The key difference here is that a VLM considers both the text and the images on the page. Such data is easy to come by, so this type of pre-training isn’t hard to scale up. However, the data quality isn’t great, and boosting it proves to be a tough job.
Interleaved Pre-training dataset (Image by Author)

Image-Text Pairs Pre-training: We train the model to perform one specific task: captioning images. You need a large corpus of images with relevant descriptions to do that. This approach is more popular because many such corpora are used to train other models (text-to-image generation, image-to-text retrieval).

Image-Text Pairs Pre-training dataset (Image by Author)

Instruct-Based Pre-training: During inference, we’ll feed the model images and text. Why not train the model this way from the start? This is precisely what instruct-based pre-training does: It trains the model on a massive dataset of image-instruct-answer triplets, even if the data isn’t always perfect.

Instruct-Based Pre-training dataset (Image by Author)

How much data is needed to train a VLM model properly is a complex question. At this stage, the required dataset size can vary from a few million to several billion (thankfully, not a trillion!) samples.

Our team used instruct-based pre-training with a few million samples. However, we believe interleaved pre-training has great potential, and we’re actively working in that direction.

VLM Alignment

Once pre-training is complete, it’s time to start on alignment.

It comprises SFT training and an optional RL stage. Since we only have the SFT stage, I’ll focus on that.

Still, recent papers (like this and this) often include an RL stage on top of VLM, which uses the same methods as for LLMs (DPO and various modifications differing by the first letter in the method name).

Anyway, back to SFT.

Strictly speaking, this stage is similar to instruct-based pre-training.

The distinction lies in our focus on high-quality data with proper response structure, formatting, and strong reasoning capabilities.

This means that the model must be able to understand the image and make inferences about it. Ideally, it should respond equally well to text instructs without images, so we’ll also add high-quality text-only data to the mix.

Ultimately, this stage's data typically ranges from hundreds of thousands to a few million examples. In our case, the number is somewhere in the six digits.

Quality Evaluation

Let’s discuss the methods for evaluating the quality of VLMs. We use two approaches:

  • Calculate metrics on open-source benchmarks.
  • Compare the models using side-by-side (SBS) evaluations, where an assessor compares two model responses and chooses the better one.

The first method allows us to measure surrogate metrics (like accuracy in classification tasks) on specific subsets of data.

However, since most benchmarks are in English, they can’t be used to compare models trained in other languages, like German, French, Russian, etc.

While translation can be used, the errors introduced by translation models make the results unreliable.

The second approach allows for a more in-depth analysis of the model but requires meticulous (and expensive) manual data annotation.

Our model is bilingual and can respond in both English and Russian. Thus, we can use English open-source benchmarks and run side-by-side comparisons.

We trust this method and invest a lot in it. Here’s what we ask our assessors to evaluate:

  • Grammar
  • Readability
  • Comprehensiveness
  • Relevance to the instruct
  • Errors (logical and factual)
  • Hallucinations

We strive to evaluate a complete and diverse subset of our model’s skills.

The following pie chart illustrates the distribution of tasks in our SbS evaluation bucket.

Distribution of tasks for quality evaluation (Image by Author)

This summarizes the overview of VLM fundamentals and how one can train a model and evaluate its quality.

Pipeline Architecture

This spring, we added multimodality to Neuro, an AI-powered search product, allowing users to ask questions using text and images.

Until recently, its underlying technology wasn’t truly multimodal.

Here’s what this pipeline looked like before.

Pipeline architecture (Image by Author)

This diagram seems complex, but it’s straightforward once you break it down into steps.

Here's what the process used to look like:

  1. The user submits an image and a text query.
  2. We send the image to our visual search engine, which would return a wealth of information about the image (tags, recognized text, information card).
  3. We formulate a text query using a rephraser (a fine-tuned LLM) with this information and the original query.
  4. With the rephrased text query, we use Yandex Search to retrieve relevant documents (or excerpts, which we call infocontext).
  5. Finally, with all this information (original query, visual search information, rephrased text query, and info context), we generate the final response using a generator model (another fine-tuned LLM).

Done!

As you can see, we used to rely on two unimodal LLMs and our visual search engine. This solution worked well on a small sample of queries but had limitations.

Below is an example (albeit slightly exaggerated) of how things could go wrong.

The problem with two unimodal LLMs (Image by Author)

Here, the rephraser receives the output of the visual search service and simply doesn’t understand the user’s original intent.

In turn, the LLM model, which knows nothing about the image, generates an incorrect search query, getting tags about the pug and the apple simultaneously.

To improve the quality of our multimodal response and allow users to ask more complex questions, we introduced a VLM into our architecture.

More specifically, we made two major modifications:

  1. We replaced the LLM rephraser with a VLM rephraser. Essentially, we started feeding the original image to the rephraser’s input on top of the text from the visual search engine.
  2. We added a separate VLM captioner to the pipeline. This model provides an image description, which we use as info context for the final generator.

You might wonder

Why not make the generator itself VLM-based?

That’s a good idea!

But there’s a catch.

Our generator training inherits from Neuro’s text model, which is frequently updated.

To update the pipeline faster and more conveniently, it was much easier for us to introduce a separate VLM block.

Plus, this setup works just as well, which is shown below:

Using VLM in AI-powered search (Image by Author)

Training the VLM rephraser and the VLM captioner are two separate tasks.

For this, we took the VLM mentioned earlier and fine-tuned it for these specific tasks.

Fine-tuning these models required collecting separate training datasets comprising tens of thousands of samples.

We also had to make significant changes to our infrastructure to make the pipeline computationally efficient.

Gauging the Quality

Now for the grand question:

Did introducing a VLM to a fairly complex pipeline improve things?

In short, yes, it did!

We ran side-by-side tests to measure the new pipeline’s performance and compared our previous LLM framework with the new VLM one.

This evaluation is similar to the one discussed earlier for the core technology. However, in this case, we use a different set of images and queries more aligned with what users might ask.

Below is the approximate distribution of clusters in this bucket.

Cluster distribution (Image by Author)

Our offline side-by-side evaluation shows that we’ve substantially improved the quality of the final response.

The VLM pipeline noticeably increases the response quality and covers more user scenarios.

Accuracy of VLM vs LLM in Neuro (Image by Author)

We also wanted to test the results on a live audience to see if our users would notice the technical changes that we believe would improve the product experience.

So, we conducted an online split test, comparing our LLM pipeline to the new VLM pipeline. The preliminary results show the following change:

  • The number of instructs that include an image increased by 17%.
  • The number of sessions (the user entering multiple queries in a row) saw an uptick of 4.5%.

To reiterate what was said above, we firmly believe that VLMs are the future of computer vision models.

VLMs are already capable of solving many out-of-the-box problems. With a bit of fine-tuning, they can absolutely deliver state-of-the-art quality.

Thanks for reading!



Source link

06Nov

Language Models Emerging Technologies | by Cobus Greyling | Nov, 2024


What Trended in 2024 — Six Technologies Which Dominated Timelines

In 2024, we saw the technology focus shifting from the Chain of Thought (CoT) approach to Retrieval-Augmented Generation (RAG), reflecting the need for precise, contextual responses in generative AI.

Building on RAG, Agentic RAG emerged, adding autonomous capabilities for AI to dynamically retrieve, interpret, and act on data.

As these technologies advanced, attention grew around Small Language Models and Foundation Models, balancing task-specific efficiency with general-purpose adaptability.

This trajectory then accelerated toward AI Agents capable of more complex, interactive roles, paving the way for something I like to call Agentic X — a framework embedding agentic capabilities directly into applications, making them proactive, adaptive, and contextually aware in meeting user goals independently.



Source link

05Nov

Anthropic ACI (AI Agent Computer Interface) | by Cobus Greyling | Nov, 2024


An AI Agent Computer Interface is a tool in an Agent’s toolbox which enables the agent to leverage a web browser as a human would.

This interface often supports seamless, context-aware exchanges, letting AI Agents handle complex tasks through intuitive commands and adaptive responses.

General problems with a web GUI are query execution time and errors in interpreting the screen. Human supervision is something which can really help a lot in ensuring a smooth GUI-agent journey.

What is an AI Agent (Agentic) Computer Interface?

An ACI is a piece of software which can receive compound and complex input from a user, and answer the question by making use of a computer interface, very much in the same fashion we as humans interact with a computer.

As you will see later in this article, the ACI acts as an agent tool in the context of the Anthropic example.

The interfaces should support natural, intuitive interactions with AI Agents to improve accessibility and usability, allowing users to engage effortlessly.

AI Agents should have context-sensitive capabilities, adapting responses based on past interactions and user needs for continuity and relevance.

Effective interfaces facilitate task automation, enabling agents to assist in complex workflows by taking over repetitive or straightforward actions.

Continuous user feedback integration enhances the agent’s ability to learn, adjust, and optimise performance over time.

The AI Agent has, as one of the tools available to it, a computer interface.

A new capability called computer use is now available in public beta, enabling developers to guide Claude in interacting with computers similarly to humans — navigating screens, clicking, and typing.

Claude 3.5 Sonnet is the first frontier AI model to support this functionality in a public beta, allowing for real-time experimentation and user feedback.

Though still in an early stage and occasionally prone to errors, this feature is expected to evolve quickly based on input from developers.

I think it is important to note that many models support vision, and that vision-enabled models from OpenAI and others have been used in frameworks to deliver AI Agents which interface with computers.

The most notable, for me at least, is the LangChain implementation of WebVoyager.

Hence it is important to note that this is a computer-use interface framework made available by Anthropic. Providing frameworks through which value is delivered has been an approach followed by many model providers, as it makes their offerings more compelling.

I made use of the docker container locally on my MacBook…

Once the container is running, see the Accessing the demo app section below for instructions on how to connect to the interface.

Once the container is running, open your browser to http://localhost:8080 to access the combined interface that includes both the agent chat and desktop view.

The container stores settings like the API key and custom system prompt in ~/.anthropic/. Mount this directory to persist these settings between container runs.

Alternative access points:

Below is the script I made use of to initiate the docker container…

Find the GitHub quick start here.



Source link

04Nov

When Machines Think Ahead: The Rise of Strategic AI | by Hans Christian Ekne | Nov, 2024


Image generated by the author using Canva Magic Studio

Games have provided an amazing proving ground for developing strategic AI. The closed nature of games makes it easier to train models and develop solution techniques than in open ended systems. Games are clearly defined; the players are known and so are the payoffs. One of the biggest and earliest milestones was Deep Blue, the machine that beat the world champion in chess.

Early Milestones: Deep Blue

Deep Blue was a chess-playing supercomputer developed by IBM in the 1990s. As stated in the prologue, it made history in May 1997 by defeating the reigning world chess champion, Garry Kasparov, in a six-game match. Deep Blue utilized specialized hardware and algorithms capable of evaluating 200 million chess positions per second. It combined brute-force search techniques with heuristic evaluation functions, enabling it to search deeper into potential move sequences than any previous system. What made Deep Blue special was its ability to process vast numbers of positions quickly, effectively handling the combinatorial complexity of chess and marking a significant milestone in artificial intelligence.

However, as Garry Kasparov notes in his interview with Lex Fridman¹, Deep Blue was more of a brute-force machine than anything else, so it's perhaps hard to qualify it as any type of intelligence. The core of the search is basically just trial and error. And speaking of errors, it makes significantly fewer errors than humans, which, according to Kasparov, is one of the features that made it hard to beat.

Advancements in Complex Games: AlphaGo

19 years after the Deep Blue victory in chess, a team from Google’s DeepMind produced another model that would contribute to a special moment in the history of AI. In 2016, AlphaGo became the first AI model to defeat a world champion go player, Lee Sedol.

Go is a very old board game with origins in Asia, known for its deep complexity and vast number of possible positions, far exceeding those in chess. AlphaGo combined deep neural networks with Monte Carlo tree search, allowing it to evaluate positions and plan moves effectively. The more time AlphaGo was given at inference, the better it performed.

The AI trained on a dataset of human expert games and improved further through self-play. What made AlphaGo special was its ability to handle the complexity of Go, utilizing advanced machine learning techniques to achieve superhuman performance in a domain previously thought to be resistant to AI mastery.

One could argue AlphaGo exhibits more intelligence than Deep Blue, given its exceptional ability to deeply evaluate board states and select moves. Move 37 from its 2016 game against Lee Sedol is a classic example. For those acquainted with Go, it was a shoulder hit at the fifth line and initially baffled commentators, including Lee Sedol himself. But as would later become clear, the move was a brilliant play and showcased how AlphaGo would explore strategies that human players might overlook and disregard.

Combining Chess and Go: AlphaZero

One year later, Google DeepMind made headlines again. This time, they took many of the learnings from AlphaGo and created AlphaZero, which was more of a general-purpose AI system that mastered chess, as well as Go and shogi. The researchers were able to build the AI solely through self-play and reinforcement learning without prior human knowledge or data. Unlike traditional chess engines that rely on handcrafted evaluation functions and extensive opening libraries, AlphaZero used deep neural networks and a novel algorithm combining Monte Carlo tree search with self-learning.

The system started with only the basic rules and learned optimal strategies by playing millions of games against itself. What made AlphaZero special was its ability to discover creative and efficient strategies, showcasing a new paradigm in AI that leverages self-learning over human-engineered knowledge.
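
The "start from the rules and improve by playing against yourself" loop can be illustrated with something far simpler than AlphaZero's network-guided search. The sketch below uses a tiny tabular self-play learner (Monte Carlo value updates with epsilon-greedy exploration) on the same kind of toy Nim game; it shares only the self-play principle with AlphaZero, not its architecture or algorithm.

```python
# Tabular self-play on toy Nim: both sides share one value table and improve
# only by playing themselves, starting from nothing but the rules.
# A drastically simplified stand-in for AlphaZero's self-play loop.
import random
from collections import defaultdict

def moves(state):
    return [(i, n) for i, pile in enumerate(state) for n in range(1, pile + 1)]

def play(state, move):
    i, n = move
    piles = list(state)
    piles[i] -= n
    return tuple(piles)

Q = defaultdict(float)                 # Q[(state, move)]: value for the player making the move
learning_rate, epsilon = 0.1, 0.2

def greedy(state):
    return max(moves(state), key=lambda m: Q[(state, m)])

for episode in range(50_000):          # both "players" use the same table: pure self-play
    state, trajectory = (3, 4), []
    while moves(state):
        m = random.choice(moves(state)) if random.random() < epsilon else greedy(state)
        trajectory.append((state, m))
        state = play(state, m)
    reward = 1.0                        # whoever made the final move took the last stone and won
    for s, m in reversed(trajectory):   # credit moves with the outcome, alternating perspective
        Q[(s, m)] += learning_rate * (reward - Q[(s, m)])
        reward = -reward

print(greedy((3, 4)))                  # after self-play, typically a winning move such as (1, 1)
```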

Integrating Speed and Strategy: StarCraft II

Continuing its domination in the AI space, the Google DeepMind team shifted its focus to the highly popular computer game StarCraft II. In 2019 they developed an AI called AlphaStar² which achieved Grandmaster-level play and ranked higher than 99.8% of human players on the competitive leaderboard.

StarCraft II is a real-time strategy game that posed several novel challenges for the team at DeepMind. The goal of the game is to defeat the opposing player or players by gathering resources, constructing buildings and amassing armies. The main challenges arise from the enormous action space that needs to be considered, the real-time decision making, partial observability due to the fog of war, and the need for long-term strategic planning, as some games can last for hours.

By building on techniques developed for previous AIs, like reinforcement learning through self-play and deep neural networks, the team was able to create a unique game-playing agent. First, they trained a neural network with supervised learning on human games. Then, they used that network to seed another algorithm that could play against itself in a multi-agent framework. The DeepMind team created a virtual league where agents could explore strategies against each other and where the dominant strategies would be rewarded. Ultimately, they combined the strategies from the league into a super strategy that could be effective against many different opponents and play styles. In their own words³:

The final AlphaStar agent consists of the components of the Nash distribution of the league — in other words, the most effective mixture of strategies that have been discovered — that run on a single desktop GPU.
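
One way to make the idea of a "Nash distribution of the league" concrete is to imagine an empirical payoff matrix between league agents and compute the least exploitable mixture over them. The sketch below does that with fictitious play on an invented three-agent, rock-paper-scissors-like matrix; AlphaStar's real league contained hundreds of learned agents and a far richer training setup, so treat this purely as an illustration of the final mixing step.

```python
# Fictitious play over an invented league payoff matrix. payoff[i][j] is the
# expected score of agent i against agent j; the cyclic matrix (A beats B,
# B beats C, C beats A) is made up for the example, not AlphaStar data.
import numpy as np

payoff = np.array([
    [ 0.0,  1.0, -1.0],
    [-1.0,  0.0,  1.0],
    [ 1.0, -1.0,  0.0],
])

counts = np.ones(len(payoff))          # how often each agent has been the best response so far
for _ in range(10_000):
    mixture = counts / counts.sum()    # empirical mixture of past best responses
    best_response = np.argmax(payoff @ mixture)
    counts[best_response] += 1

print(np.round(counts / counts.sum(), 3))   # approaches the least exploitable mixture, here ~[1/3, 1/3, 1/3]
```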

Deep Dive into Pluribus and Poker

I love playing poker, and when I was living and studying in Trondheim, we used to have a weekly cash game which could get quite intense! One of the last milestones to be reached by strategic AI was in the game of poker, specifically in one of its most popular forms, 6-player no-limit Texas hold'em. The game uses a standard 52-card deck, and play proceeds through the following stages:

  1. The Preflop: All players are given 2 cards (hole cards) which only they themselves know the value of.
  2. The Flop: 3 cards are drawn and laid face up so that all players can see them.
  3. The Turn: Another card is drawn and laid face up.
  4. The River: A final 5th card is drawn and laid face up.

The players can use the cards on the table and the two cards in their hand to assemble a 5-card poker hand. In each round, players take turns placing bets, and the hand can end at any stage if one player places a bet that no one else is willing to call.

Though reasonably simple to learn (one only needs to know the hierarchy of the various poker hands), the game proved very difficult to solve with AI, despite ongoing efforts spanning several decades.

There are multiple factors contributing to the difficulty of solving poker. Firstly, there is hidden information: you don't know which cards the other players hold. Secondly, it is a multiplayer game, and each extra player increases the number of possible interactions and strategies exponentially. Thirdly, the no-limit betting rules allow for a complex betting structure in which a player can suddenly decide to bet their entire stack. Fourth, the combinations of hole cards, community cards, and betting sequences create an enormous game tree. On top of all that come the stochastic nature of the cards, the potential for bluffing, and the need for opponent modelling.

It was only in 2019 that two researchers, Noam Brown and Tuomas Sandholm, finally cracked the code. In a paper published in Science, they describe a novel poker AI, Pluribus, that managed to beat the best players in the world in 6-player no-limit Texas hold'em.⁴ They conducted two different experiments, each consisting of 10,000 poker hands, and both clearly showed the dominance of Pluribus.

In the first experiment, Pluribus played against 5 human opponents, achieving an average win rate of 48 mbb/game with a standard deviation of 25 mbb/game. (mbb/game stands for milli-big-blinds per game: the number of thousandths of a big blind won per hand played, or equivalently how many big blinds are won per 1,000 hands.) 48 mbb/game is considered a very high win rate, especially among elite poker players, and implies that Pluribus is stronger than its human opponents.
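
To put that number in concrete terms: at hypothetical $50/$100 blinds, 48 mbb/game works out to 0.048 big blinds, or about $4.80, won per hand on average, which over the 10,000 hands of the experiment would amount to roughly $48,000.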

In the second experiment, the researchers had 5 copies of Pluribus play against 1 human. They set up the experiment so that 2 different humans would each play 5,000 hands against the 5 machines. Pluribus ended up beating the humans by an average of 32 mbb/game with a standard error of 15 mbb/game, again showing its strategic superiority.

The dominance of Pluribus is quite amazing, especially given all the complexities the researchers had to overcome. Brown and Sandholm came up with several smart strategies that helped Pluribus to become superhuman and computationally much more efficient than previous top poker AIs. Some of their techniques include:

  1. The use of two different algorithms for evaluating moves. They first used a so-called "blueprint strategy", created by having the program play against itself using a method called Monte Carlo counterfactual regret minimization (the regret-matching update at the core of this method is sketched after this list). This blueprint strategy was used in the first round of betting, but in subsequent betting rounds Pluribus would conduct a real-time search to find a better, more granular strategy.
  2. To make the real-time search computationally efficient, they used a depth-limited search. Strategies were only evaluated a couple of moves ahead, and each opponent was assumed to continue with one of just four strategies: the original blueprint strategy, a blueprint biased towards folding, one biased towards calling, and one biased towards raising.
  3. They also used various abstraction techniques to reduce the number of possible game states. For example, because a 9-high straight is fundamentally similar to an 8-high straight, the two can be treated in much the same way.
  4. Pluribus would discretize the continuous betting space into a limited set of buckets, making it easier to consider and evaluate various betting sizes.
  5. Finally, Pluribus balanced its strategy: for any given hand it was playing, it would also consider the other hands it could plausibly hold in that situation and how it would play them, so that its final play was balanced and thus harder to counter.
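
To give a flavour of the regret-minimization machinery mentioned in the first point, here is regret matching, the update rule at the heart of counterfactual regret minimization, applied to the simplest possible setting: two self-play agents in rock-paper-scissors. Pluribus runs a Monte Carlo variant of this over an abstracted poker game that is astronomically larger, so this shows only the core mechanic, not the real system.

```python
# Regret matching in rock-paper-scissors: play each action in proportion to
# its positive cumulative regret; the average strategy converges towards
# equilibrium. Only the core update of CFR, not Pluribus itself.
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def current_strategy(regrets):
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1 / 3] * 3

cumulative_regret = [[0.0] * 3, [0.0] * 3]   # one regret table per self-play agent
strategy_sum = [[0.0] * 3, [0.0] * 3]        # running sums give the average strategy

for _ in range(100_000):
    strategies = [current_strategy(r) for r in cumulative_regret]
    picks = [random.choices(range(3), weights=s)[0] for s in strategies]
    for p in (0, 1):
        me, opp = picks[p], picks[1 - p]
        realized = payoff(ACTIONS[me], ACTIONS[opp])
        for a in range(3):   # regret: how much better action a would have done against this opponent action
            cumulative_regret[p][a] += payoff(ACTIONS[a], ACTIONS[opp]) - realized
        for a in range(3):
            strategy_sum[p][a] += strategies[p][a]

average = [s / sum(strategy_sum[0]) for s in strategy_sum[0]]
print([round(x, 3) for x in average])   # approaches the equilibrium [0.333, 0.333, 0.333]
```

Note that it is the averaged strategy over all iterations, not the latest one, that converges towards equilibrium in regret matching.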

There are quite a few interesting observations to draw from Pluribus, but perhaps the most interesting is that it doesn’t vary its play against different opponents, but instead has developed a robust strategy that is effective against a wide variety of players. Since a lot of poker players think they have to adjust their play to various situations and people, Pluribus shows us that this is not needed and probably not even optimal, given how it beat all the humans it played against.

In our short foray into game theory, we noted that if you play the NE strategy in two-player zero-sum games you are guaranteed not to lose in expectation. However, for a multiplayer game like 6-player poker there is no such guarantee. Noam Brown speculates⁵ that it is perhaps the adversarial nature of a game like poker that still makes a NE strategy a suitable approach. Conversely, in a game like Risk, where players can cooperate more, pursuing a NE strategy is not guaranteed to work: if you are playing a Risk game with 6 people, there is nothing you can do if your 5 opponents decide to gang up on you and eliminate you.

Evaluating the Trend in Strategic AI

Summarizing the history of strategic AI in games, we see a clear trend emerging. The games are slowly but surely becoming closer to the real-world strategic situations that humans find themselves in on an everyday basis.

Firstly, we are moving from two-player to multiplayer settings, as can be seen in the progression from early successes in two-player games to multiplayer games like 6-player poker. Secondly, we are seeing increasing mastery of games with hidden information. Thirdly, we are seeing increasing mastery of games with more stochastic elements.

Hidden information, multiplayer settings and stochastic events are the norm rather than the exception in strategic interactions among humans, so mastering these complexities is key in achieving a more general superhuman strategic AI that can navigate in the real world.



Source link

03Nov

Beyond Skills: Unlocking the Full Potential of Data Scientists. | by Eric Colson | Oct, 2024


Image created through DALL-E / OpenAI by author.

Unlock the hidden value of data scientists by empowering them beyond technical tasks to drive innovation and strategic insights.

[This piece is cross-posted from O’Reilly Radar here]

Modern organizations regard data as a strategic asset that drives efficiency, enhances decision making, and creates new value for customers. Across the organization — product management, marketing, operations, finance, and more — teams are overflowing with ideas on how data can elevate the business. To bring these ideas to life, companies are eagerly hiring data scientists for their technical skills (Python, statistics, machine learning, SQL, etc.).

Despite this enthusiasm, many companies are significantly underutilizing their data scientists. Organizations remain narrowly focused on employing data scientists to execute preexisting ideas, overlooking the broader value they bring. Beyond their skills, data scientists possess a unique perspective that allows them to come up with innovative business ideas of their own — ideas that are novel, strategic, or differentiating and are unlikely to come from anyone but a data scientist.

Sadly, many companies behave in ways that suggest they are uninterested in the ideas of data scientists. Instead, they treat data scientists as a resource to be used for their skills alone. Functional teams provide requirements documents with fully specified plans: “Here’s how you are to build this new system for us. Thank you for your partnership.” No context is provided, and no input is sought — other than an estimate for delivery. Data scientists are further inundated with ad hoc requests for tactical analyses or operational dashboards¹. The backlog of requests grows so large that the work queue is managed through Jira-style ticketing systems, which strip the requests of any business context (e.g., “get me the top products purchased by VIP customers”). One request begets another², creating a Sisyphean endeavor that leaves no time for data scientists to think for themselves. And then there’s the myriad of opaque requests for data pulls: “Please get me this data so I can analyze it.” This is marginalizing — like asking Steph Curry to pass the ball so you can take the shot. It’s not a partnership; it’s a subordination that reduces data science to a mere support function, executing ideas from other teams. While executing tasks may produce some value, it won’t tap into the full potential of what data scientists truly have to offer.

The untapped potential of data scientists lies not in their ability to execute requirements or requests but in their ideas for transforming a business. By “ideas” I mean new capabilities or strategies that can move the business in better or new directions — leading to increased³ revenue, profit, or customer retention while simultaneously providing a sustainable competitive advantage (i.e., capabilities or strategies that are difficult for competitors to replicate). These ideas often take the form of machine learning algorithms that can automate decisions within a production system⁴. For example, a data scientist might develop an algorithm to better manage inventory by optimally balancing overage and underage costs. Or they might create a model that detects hidden customer preferences, enabling more effective personalization. If these sound like business ideas, that’s because they are — but they’re not likely to come from business teams. Ideas like these typically emerge from data scientists, whose unique cognitive repertoires and observations in the data make them well-suited to uncovering such opportunities.

A cognitive repertoire is the range of tools, strategies, and approaches an individual can draw upon for thinking, problem-solving, or processing information (Page 2017). These repertoires are shaped by our backgrounds — education, experience, training, and so on. Members of a given functional team often have similar repertoires due to their shared backgrounds. For example, marketers are taught frameworks like SWOT analysis and ROAS, while finance professionals learn models such as ROIC and Black-Scholes.

Data scientists have a distinctive cognitive repertoire. While their academic backgrounds may vary — ranging from statistics to computer science to computational neuroscience — they typically share a quantitative tool kit. This includes frameworks for widely applicable problems, often with accessible names like the “newsvendor model,” the “traveling salesman problem,” the “birthday problem,” and many others. Their tool kit also includes knowledge of machine learning algorithms⁵ like neural networks, clustering, and principal components, which are used to find empirical solutions to complex problems. Additionally, they include heuristics such as big O notation, the central limit theorem, and significance thresholds. All of these constructs can be expressed in a common mathematical language, making them easily transferable across different domains, including business — perhaps especially business.

The repertoires of data scientists are particularly relevant to business innovation since, in many industries⁶, the conditions for learning from data are nearly ideal in that they have high-frequency events, a clear objective function⁷, and timely and unambiguous feedback. Retailers have millions of transactions that produce revenue. A streaming service sees millions of viewing events that signal customer interest. And so on — millions or billions of events with clear signals that are revealed quickly. These are the units of induction that form the basis for learning, especially when aided by machines. The data science repertoire, with its unique frameworks, machine learning algorithms, and heuristics, is remarkably geared for extracting knowledge from large volumes of event data.

Ideas are born when cognitive repertoires connect with business context. A data scientist, while attending a business meeting, will regularly experience pangs of inspiration. Her eyebrows raise from behind her laptop as an operations manager describes an inventory perishability problem, lobbing the phrase “We need to buy enough, but not too much.” “Newsvendor model,” the data scientist whispers to herself. A product manager asks, “How is this process going to scale as the number of products increases?” The data scientist involuntarily scribbles “O(N²)” on her notepad, which is big O notation to indicate that the process will scale superlinearly. And when a marketer brings up the topic of customer segmentation, bemoaning, “There are so many customer attributes. How do we know which ones are most important?,” the data scientist sends a text to cancel her evening plans. Instead, tonight she will eagerly try running principal components analysis on the customer data⁸.
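
For readers who have not met it, the newsvendor model really is a one-line answer to "buy enough, but not too much": order up to the demand quantile given by the ratio of underage cost to underage plus overage cost. A minimal sketch, with invented costs and a normal-demand assumption, using SciPy:

```python
# Minimal newsvendor sketch: balance the cost of buying too little (lost
# margin) against buying too much (spoilage). All numbers are invented.
from scipy.stats import norm

underage = 4.0          # profit lost per unit of unmet demand
overage = 1.0           # cost per unsold, perished unit
critical_ratio = underage / (underage + overage)        # here 0.8

mean_demand, sd_demand = 500, 80
optimal_order = norm.ppf(critical_ratio, loc=mean_demand, scale=sd_demand)
print(round(optimal_order))    # stock to the 80th percentile of demand, ~567 units
```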

No one was asking for ideas. This was merely a tactical meeting with the goal of reviewing the state of the business. Yet the data scientist is practically goaded into ideating. “Oh, oh. I got this one,” she says to herself. Ideation can even be hard to suppress. Yet many companies unintentionally seem to suppress that creativity. In reality our data scientist probably wouldn’t have been invited to that meeting. Data scientists are not typically invited to operating meetings. Nor are they typically invited to ideation meetings, which are often limited to the business teams. Instead, the meeting group will assign the data scientist Jira tickets of tasks to execute. Without the context, the tasks will fail to inspire ideas. The cognitive repertoire of the data scientist goes unleveraged — a missed opportunity to be sure.

Beyond their cognitive repertoires, data scientists bring another key advantage that makes their ideas uniquely valuable. Because they are so deeply immersed in the data, data scientists discover unforeseen patterns and insights that inspire novel business ideas. They are novel in the sense that no one would have thought of them — not product managers, executives, marketers — not even a data scientist for that matter. There are many ideas that cannot be conceived of but rather are revealed by observation in the data.

Company data repositories (data warehouses, data lakes, and the like) contain a primordial soup of insights lying fallow in the information. As they do their work, data scientists often stumble upon intriguing patterns — an odd-shaped distribution, an unintuitive relationship, and so forth. The surprise finding piques their curiosity, and they explore further.

Imagine a data scientist doing her work, executing on an ad hoc request. She is asked to compile a list of the top products purchased by a particular customer segment. To her surprise, the products bought by the various segments are hardly different at all. Most products are bought at about the same rate by all segments. Weird. The segments are based on profile descriptions that customers opted into, and for years the company had assumed them to be meaningful groupings useful for managing products. “There must be a better way to segment customers,” she thinks. She explores further, launching an informal, impromptu analysis. No one is asking her to do this, but she can’t help herself. Rather than relying on the labels customers use to describe themselves, she focuses on their actual behavior: what products they click on, view, like, or dislike. Through a combination of quantitative techniques — matrix factorization and principal component analysis — she comes up with a way to place customers into a multidimensional space. Clusters of customers adjacent to one another in this space form meaningful groupings that better reflect customer preferences. The approach also provides a way to place products into the same space, allowing for distance calculations between products and customers. This can be used to recommend products, plan inventory, target marketing campaigns, and many other business applications. All of this is inspired from the surprising observation that the tried-and-true customer segments did little to explain customer behavior. Solutions like this have to be driven by observation since, absent the data saying otherwise, no one would have thought to inquire about a better way to group customers.
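
A stripped-down version of that behavior-based approach might look like the following sketch: factorize a customer-by-product interaction matrix so that customers and products share a latent space, cluster customers in that space, and rank products by proximity. The library calls are standard scikit-learn; the data and parameter choices are synthetic and purely illustrative, not a prescription for the analysis described above.

```python
# Behavior-based segmentation sketch: shared latent space for customers and
# products via matrix factorization, then clustering and distance-based
# recommendations. The interaction data here is randomly generated.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
interactions = rng.poisson(0.3, size=(1000, 200))    # 1,000 customers x 200 products

svd = TruncatedSVD(n_components=10, random_state=0)
customer_vecs = svd.fit_transform(interactions)      # customers in the latent space
product_vecs = svd.components_.T                     # products in the same space

segments = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(customer_vecs)

# distance from one customer to every product -> a simple recommendation ranking
customer = customer_vecs[0]
scores = product_vecs @ customer
print(segments[:10], np.argsort(-scores)[:5])        # segment labels and top-5 products
```

The factorization also exposes how much structure each latent dimension explains (for example via the explained variance ratio), which is exactly the kind of objective, data-driven ranking of dimensions described next.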

As a side note, the principal component algorithm that the data scientists used belongs to a class of algorithms called “unsupervised learning,” which further exemplifies the concept of observation-driven insights. Unlike “supervised learning,” in which the user instructs the algorithm what to look for, an unsupervised learning algorithm lets the data describe how it is structured. It is evidence based; it quantifies and ranks each dimension, providing an objective measure of relative importance. The data does the talking. Too often we try to direct the data to yield to our human-conceived categorization schemes, which are familiar and convenient to us, evoking visceral and stereotypical archetypes. It’s satisfying and intuitive but often flimsy and fails to hold up in practice.

Examples like this are not rare. When immersed in the data, it’s hard for the data scientists not to come upon unexpected findings. And when they do, it’s even harder for them to resist further exploration — curiosity is a powerful motivator. Of course, she exercised her cognitive repertoire to do the work, but the entire analysis was inspired by observation of the data. For the company, such distractions are a blessing, not a curse. I’ve seen this sort of undirected research lead to better inventory management practices, better pricing structures, new merchandising strategies, improved user experience designs, and many other capabilities — none of which were asked for but instead were discovered by observation in the data.

Isn’t discovering new insights the data scientist’s job? Yes — that’s exactly the point of this article. The problem arises when data scientists are valued only for their technical skills. Viewing them solely as a support team limits them to answering specific questions, preventing deeper exploration of insights in the data. The pressure to respond to immediate requests often causes them to overlook anomalies, unintuitive results, and other potential discoveries. If a data scientist were to suggest some exploratory research based on observations, the response is almost always, “No, just focus on the Jira queue.” Even if they spend their own time — nights and weekends — researching a data pattern that leads to a promising business idea, it may still face resistance simply because it wasn’t planned or on the roadmap. Roadmaps tend to be rigid, dismissing new opportunities, even valuable ones. In some organizations, data scientists may pay a price for exploring new ideas. Data scientists are often judged by how well they serve functional teams, responding to their requests and fulfilling short-term needs. There is little incentive to explore new ideas when doing so detracts from a performance review. In reality, data scientists frequently find new insights in spite of their jobs, not because of them.

These two things — their cognitive repertoires and observations from the data — make the ideas that come from data scientists uniquely valuable. This is not to suggest that their ideas are necessarily better than those from the business teams. Rather, their ideas are different from those of the business teams. And being different has its own set of benefits.

Having a seemingly good business idea doesn’t guarantee that the idea will have a positive impact. Evidence suggests that most ideas will fail. When properly measured for causality⁹, the vast majority of business ideas either fail to show any impact at all or actually hurt metrics. (See some statistics here.) Given the poor success rates, innovative companies construct portfolios of ideas in the hopes that at least a few successes will allow them to reach their goals. Still savvier companies use experimentation¹⁰ (A/B testing) to try their ideas on small samples of customers, allowing them to assess the impact before deciding to roll them out more broadly.

This portfolio approach, combined with experimentation, benefits from both the quantity and diversity of ideas¹¹. It’s similar to diversifying a portfolio of stocks. Increasing the number of ideas in the portfolio increases exposure to a positive outcome — an idea that makes a material positive impact on the company. Of course, as you add ideas, you also increase the risk of bad outcomes — ideas that do nothing or even have a negative impact. However, many ideas are reversible — the “two-way door” that Amazon’s Jeff Bezos speaks of (Haden 2018). Ideas that don’t produce the expected results can be pruned after being tested on a small sample of customers, greatly mitigating the impact, while successful ideas can be rolled out to all relevant customers, greatly amplifying the impact.

So, adding ideas to the portfolio increases exposure to upside without a lot of downside — the more, the better¹². However, there is an assumption that the ideas are independent (uncorrelated). If all the ideas are similar, then they may all succeed or fail together. This is where diversity comes in. Ideas from different groups will leverage divergent cognitive repertoires and different sets of information. This makes them different and less likely to be correlated with each other, producing more varied outcomes. For stocks, the return on a diverse portfolio will be the average of the returns for the individual stocks. However, for ideas, since experimentation lets you mitigate the bad ones and amplify the good ones, the return of the portfolio can be closer to the return of the best idea (Page 2017).
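
The asymmetry between pruning losers and amplifying winners can be made concrete with a tiny simulation. Every number below (effect sizes, noise, the share of duds) is invented purely for illustration:

```python
# Toy portfolio-of-ideas simulation: most ideas do nothing or hurt, each is
# tried on a small sample, and only the apparent winners are rolled out.
import numpy as np

rng = np.random.default_rng(42)
n_ideas = 50
true_lift = rng.normal(loc=-0.2, scale=1.0, size=n_ideas)   # most ideas are duds
test_noise = rng.normal(scale=0.5, size=n_ideas)            # small-sample measurement error
measured_lift = true_lift + test_noise

rolled_out = measured_lift > 0                               # two-way door: keep only apparent winners
portfolio_return = true_lift[rolled_out].sum()

print(f"ideas launched: {rolled_out.sum()} of {n_ideas}")
print(f"portfolio lift: {portfolio_return:.1f} "
      f"(vs. {true_lift.sum():.1f} if everything had been rolled out)")
```

In a run like this, the handful of genuinely good ideas carry the portfolio while the testing step keeps the many duds from dragging it down, which is the asymmetry the two-way-door argument relies on.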

In addition to building a portfolio of diverse ideas, a single idea can be significantly strengthened through collaboration between data scientists and business teams¹³. When they work together, their combined repertoires fill in each other’s blind spots (Page 2017)¹⁴. By merging the unique expertise and insights from multiple teams, ideas become more robust, much like how diverse groups tend to excel in trivia competitions. However, organizations must ensure that true collaboration happens at the ideation stage rather than dividing responsibilities such that business teams focus solely on generating ideas and data scientists are relegated to execution.

Data scientists are much more than a skilled resource for executing existing ideas; they are a wellspring of novel, innovative thinking. Their ideas are uniquely valuable because (1) their cognitive repertoires are highly relevant to businesses with the right conditions for learning, (2) their observations in the data can lead to novel insights, and (3) their ideas differ from those of business teams, adding diversity to the company’s portfolio of ideas.

However, organizational pressures often prevent data scientists from fully contributing their ideas. Overwhelmed with skill-based tasks and deprived of business context, they are incentivized to merely fulfill the requests of their partners. This pattern exhausts the team’s capacity for execution while leaving their cognitive repertoires and insights largely untapped.

Here are some suggestions that organizations can follow to better leverage data scientists and shift their roles from mere executors to active contributors of ideas:

  • Give them context, not tasks. Providing data scientists with tasks or fully specified requirements documents will get them to do work, but it won’t elicit their ideas. Instead, give them context. If an opportunity is already identified, describe it broadly through open dialogue, allowing them to frame the problem and propose solutions. Invite data scientists to operational meetings where they can absorb context, which may inspire new ideas for opportunities that haven’t yet been considered.
  • Create slack for exploration. Companies often completely overwhelm data scientists with tasks. It may seem paradoxical, but keeping resources 100% utilized is very inefficient¹⁵. Without time for exploration and unexpected learning, data science teams can’t reach their full potential. Protect some of their time for independent research and exploration, using tactics like Google’s 20% time or similar approaches.
  • Eliminate the task management queue. Task queues create a transactional, execution-focused relationship with the data science team. Priorities, if assigned top-down, should be given in the form of general, unframed opportunities that need real conversations to provide context, goals, scope, and organizational implications. Priorities might also emerge from within the data science team, requiring support from functional partners, with the data science team providing the necessary context. We don’t assign Jira tickets to product or marketing teams, and data science should be no different.
  • Hold data scientists accountable for real business impact. Measure data scientists by their impact on business outcomes, not just by how well they support other teams. This gives them the agency to prioritize high-impact ideas, regardless of the source. Additionally, tying performance to measurable business impact¹⁶ clarifies the opportunity cost of low-value ad hoc requests¹⁷.
  • Hire for adaptability and broad skill sets. Look for data scientists who thrive in ambiguous, evolving environments where clear roles and responsibilities may not always be defined. Prioritize candidates with a strong desire for business impact¹⁸, who see their skills as tools to drive outcomes, and who excel at identifying new opportunities aligned with broad company goals. Hiring for diverse skill sets enables data scientists to build end-to-end systems, minimizing the need for handoffs and reducing coordination costs — especially critical during the early stages of innovation when iteration and learning are most important¹⁹.
  • Hire functional leaders with growth mindsets. In new environments, avoid leaders who rely too heavily on what worked in more mature settings. Instead, seek leaders who are passionate about learning and who value collaboration, leveraging diverse perspectives and information sources to fuel innovation.

These suggestions require an organization with the right culture and values. The culture needs to embrace experimentation to measure the impact of ideas and to recognize that many will fail. It needs to value learning as an explicit goal and understand that, for some industries, the vast majority of knowledge has yet to be discovered. It must be comfortable relinquishing the clarity of command-and-control in exchange for innovation. While this is easier to achieve in a startup, these suggestions can guide mature organizations toward evolving with experience and confidence. Shifting an organization’s focus from execution to learning is a challenging task, but the rewards can be immense or even crucial for survival. For most modern firms, success will depend on their ability to harness human potential for learning and ideation — not just execution (Edmondson 2012). The untapped potential of data scientists lies not in their ability to execute existing ideas but in the new and innovative ideas no one has yet imagined.



Source link
