25Nov

How Did We Go From Large Language Models to Large Behaviour Models? | by Cobus Greyling | Nov, 2024


Large Content & Behaviour Models (LBM / LCBM) integrate language and vision to simulate and understand both content (such as text, images, or videos) and behaviours (such as user interactions or preferences).

An LCBM connects a language model with a visual encoder and is fine-tuned using datasets linking content to human behaviours.

For example, it analyses YouTube videos and their engagement metrics or Twitter posts and their interactions. LCBMs aim to predict behaviours based on content and vice versa, demonstrating both domain adaptation and generalisation.

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.




25Nov

Perform outlier detection more effectively using subsets of features | by W Brett Kennedy | Nov, 2024


Identify relevant subspaces: subsets of features that allow you to most effectively perform outlier detection on tabular data

This article is part of a series related to the challenges, and the techniques that may be used, to best identify outliers in data, including articles related to using PCA, Distance Metric Learning, Shared Nearest Neighbors, Frequent Patterns Outlier Factor, Counts Outlier Detector (a multi-dimensional histogram-based method), and doping. This article also contains an excerpt from my book, Outlier Detection in Python.

Here we look at techniques that, rather than building a single outlier detector examining all features within a dataset, create a series of smaller outlier detectors, each working with a subset of the features (referred to as subspaces).

When performing outlier detection on tabular data, we’re looking for the records in the data that are the most unusual — either relative to the other records in the same dataset, or relative to previous data.

There are a number of challenges associated with finding the most meaningful outliers, particularly that there is no definition of statistically unusual that definitively specifies which anomalies in the data should be considered the strongest. As well, the outliers that are most relevant (and not necessarily the most statistically unusual) for your purposes will be specific to your project, and may evolve over time.

There are also a number of technical challenges that appear in outlier detection. Among these are the difficulties that occur where data has many features. As covered in previous articles related to Counts Outlier Detector and Shared Nearest Neighbors, where we have many features, we often face an issue known as the curse of dimensionality.

This has a number of implications for outlier detection, including that it makes distance metrics unreliable. Many outlier detection algorithms rely on calculating the distances between records — in order to identify as outliers the records that are similar to unusually few other records, and that are unusually different from most other records — that is, records that are close to few other records and far from most other records.

For example, if we have a table with 40 features, each record in the data may be viewed as a point in 40-dimensional space, and its outlierness can be evaluated by the distances from it to the other points in this space. This, then, requires a way to measure the distance between records. A variety of measures are used, with Euclidean distances being quite common (assuming the data is numeric, or is converted to numeric values). So, the outlierness of each record is often measured based on the Euclidean distance between it and the other records in the dataset.

These distance calculations can, though, break down where we are working with many features and, in fact, issues with distance metrics may appear even with only ten or twenty features, and very often with about thirty or forty or more.

We should note, though, that issues with large numbers of features do not appear with all outlier detectors. For example, they do not tend to be significant when working with univariate tests (tests such as z-score or interquartile range tests, which consider each feature one at a time, independently of the other features; described in more detail in A Simple Example Using PCA for Outlier Detection) or when using categorical outlier detectors such as FPOF.

However, the majority of outlier detectors commonly used are numeric multi-variate outlier detectors: detectors that assume all features are numeric, and that generally work on all features at once. For example, LOF (Local Outlier Factor) and KNN (k-Nearest Neighbors) are two of the most widely used detectors, and both evaluate the outlierness of each record based on its distances (in the high-dimensional space the data points live in) to the other records.

Consider the plots below. These present a dataset with six features, shown as three 2d scatter plots. The dataset includes two points that can reasonably be considered outliers, P1 and P2.

Looking, for now, at P1, it is far from the other points, at least in feature A. That is, considering just feature A, P1 can easily be flagged as an outlier. However, most detectors will consider the distance of each point to the other points using all six dimensions, which, unfortunately, means P1 may not necessarily stand out as an outlier, due to the nature of distance calculations in high-dimensional spaces. P1 is fairly typical in the other five features, and so its distance to the other points, in 6d space, may be fairly normal.

Nevertheless, we can see that this general approach to outlier detection — where we examine the distances from each record to the other records — is quite reasonable: P1 and P2 are outliers because they are far (at least in some dimensions) from the other points.

As KNN and LOF are very commonly used detectors, we’ll look at them a little closer here, and then look specifically at using subspaces with these algorithms.

With the KNN outlier detector, we pick a value for k, which determines how many neighbors each record is compared to. Let’s say we pick 10 (in practice, this would be a fairly typical value).

For each record, we then measure the distance to its 10 nearest neighbors, which provides a good sense of how isolated and remote each point is. We then need to create a single outlier score (i.e., a single number) for each record based on these 10 distances. For this, we generally then take either the mean or the maximum of these distances.

Let’s assume we take the maximum (using the mean, median, or other function works similarly, though each has its nuances). If a record has an unusually large distance to its 10th nearest neighbor, this means there are at most 9 records that are reasonably close to it (and possibly fewer), and that it is otherwise unusually far from most other points, so can be considered an outlier.
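To make this concrete, the following is a minimal sketch of this kind of KNN score on synthetic data, using scikit-learn’s NearestNeighbors and taking the distance to the 10th nearest neighbor as the score (this is for illustration only, and is not the implementation any particular library uses):

import numpy as np
from sklearn.neighbors import NearestNeighbors

np.random.seed(0)
X = np.random.randn(500, 6)          # 500 records, 6 numeric features
X[0] = [8, 0, 0, 0, 0, 0]            # make one record unusual in one feature

k = 10
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 as each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
knn_scores = dists[:, -1]            # distance to the 10th true neighbor

print(np.argsort(knn_scores)[-5:])   # indexes of the 5 highest-scoring records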

With the LOF outlier detector, we use a similar approach, though it works a bit differently. We also look at the distance of each point to its k nearest neighbors, but then compare this to the distances of these k neighbors to their k nearest neighbors. So LOF measures the outlierness of each point relative to the other points in their neighborhoods.

That is, while KNN uses a global standard to determine what are unusually large distances to their neighbors, LOF uses a local standard to determine what are unusually large distances.

The details of the LOF algorithm are actually a bit more involved, and the implications of the specific differences in these two algorithms (and the many variations of these algorithms) are covered in more detail in Outlier Detection in Python.
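As a rough companion sketch, again on synthetic data (and not necessarily how any given library implements LOF internally), scikit-learn’s LocalOutlierFactor can be used in much the same way; higher scores below indicate records whose neighborhoods are unusually sparse relative to their neighbors’ own neighborhoods.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

np.random.seed(0)
X = np.random.randn(500, 6)
X[0] = [8, 0, 0, 0, 0, 0]                   # make one record unusual in one feature

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)                 # -1 for flagged outliers, 1 otherwise
lof_scores = -lof.negative_outlier_factor_  # higher means more outlier-like

print(np.argsort(lof_scores)[-5:])          # the 5 most outlier-like records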

These are interesting considerations in themselves, but the main point for here is that KNN and LOF both evaluate records based on their distances to their closest neighbors, and that these distance metrics can work sub-optimally (or even completely break down) when using many features at once, a problem that is greatly reduced by working with small numbers of features (subspaces) at a time.

The idea of using subspaces is useful even where the detector used does not rely on distance metrics, but where detectors based on distance calculations are used, some of the benefits of using subspaces can be a bit clearer. And, using distances in ways similar to KNN and LOF is quite common among detectors. As well as KNN and LOF, for example, Radius, ODIN, INFLO, and LoOP detectors, as well as detectors based on sampling, and detectors based on clustering, all use distances.

However, issues with the curse of dimensionality can occur with other detectors as well. For example, ABOD (Angle-based Outlier Detector) uses the angles between records to evaluate the outlierness of each record, as opposed to the distances. But, the idea is similar, and using subspaces can also be helpful when working with ABOD.

As well, other benefits of subspaces I’ll go through below apply equally to many detectors, whether using distance calculations or not. Still, the curse of dimensionality is a serious concern in outlier detection: where detectors use distance calculations (or similar measures, such as angle calculations), and there are many features, these distance calculations can break down. In the plots above, P1 and P2 may be detected well considering only six dimensions, and quite possibly if using 10 or 20 features, but if there were, say, 100 dimensions, the distances between all points would actually end up about the same, and P1 and P2 would not stand out at all as unusual.

Outside of the issues related to working with very large numbers of features, our attempts to identify the most unusual records in a dataset can be undermined even when working with fairly small numbers of features.

While very large numbers of features can make the distances calculated between records meaningless, even moderate numbers of features can make records that are unusual in just one or two features more difficult to identify.

Consider again the scatter plot shown earlier, repeated here. Point P1 is an outlier in feature A (though not in the other five features). Point P2 is unusual in features C and D, but not in the other four features. However, when considering the Euclidean distances of these points to the other points in 6-dimensional space, they may not reliably stand out as outliers. The same would be true using Manhattan and most other distance metrics as well.

The left pane shows point P1 in a 2D dataspace. The point is unusual considering feature A, but less so if using Euclidean distances in the full 6D dataspace, or even the 2D dataspace shown in this plot. This is an example where using additional features can be counterproductive. In the middle pane, we see another point, point P2, which is an outlier in the C–D subspace but not in the A-B or E–F subspaces. We need only features C and D to identify this outlier, and again including other features will simply make P2 more difficult to identify.

P1, for example, even in the 2d space shown in the left-most plot, is not unusually far from most other points. It’s unusual that there are no other points near it (which KNN and LOF will detect), but the distance from P1 to the other points in this 2d space is not unusual: it’s similar to the distances between most other pairs of points.

Using a KNN algorithm, we would likely be able to detect this, at least if k is set fairly low, for example, to 5 or 10 — most records have their 5th (and their 10th) nearest neighbors much closer than P1 does. Though, when including all six features in the calculations, this is much less clear than when viewing just feature A, or just the left-most plot, with just features A and B.

Point P2 stands out well as an outlier when considering just features C and D. Using a KNN detector with a k value of, say, 5, we can identify its 5 nearest neighbors, and the distances to these would be larger than is typical for points in this dataset.

Using an LOF detector, again with a k value of, say, 5, we can compare the distances from P1 or P2 to their 5 nearest neighbors with the distances from those neighbors to their own 5 nearest neighbors, and here as well, the distance from P1 or P2 to their 5 nearest neighbors would be found to be unusually large.

At least this is straightforward when considering only Features A and B, or Features C and D, but again, when considering the full 6-d space, they become more difficult to identify as outliers.

While many outlier detectors may still be able to identify P1 and P2 even with six, or a small number more, dimensions, it is clearly easier and more reliable to use fewer features. To detect P1, we really only need to consider feature A; and to identify P2, we really only need to consider features C and D. Including other features in the process simply makes this more difficult.

This is actually a common theme with outlier detection. We often have many features in the datasets we work with, and each can be useful. For example, if we have a table with 50 features, it may be that all 50 features are relevant: either a rare value in any of these features would be interesting, or a rare combination of values in two or more features, for each of these 50 features, would be interesting. It would be, then, worth keeping all 50 features for analysis.

But, to identify any one anomaly, we generally need only a small number of features. In fact, it’s very rare for a record to be unusual in all features. And it’s very rare for a record to have an anomaly based on a rare combination of many features (see Counts Outlier Detector for more explanation of this).

Any given outlier will likely have a rare value in one or two features, or a rare combination of values in a pair, or a set of perhaps three or four features. Only these features are necessary to identify the anomalies in that row, even though the other features may be necessary to detect the anomalies in other rows.

To address these issues, an important technique in outlier detection is using subspaces. The term subspaces simply refers to subsets of the features. In the example above, if we use the subspaces: A-B, C-D, E-F, A-E, B-C, B-D-F, and A-B-E, then we have seven subspaces (five 2d subspaces and two 3d subspaces). Creating these, we would run one (or more) detectors on each subspace, so would run at least seven detectors on each record.
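As a minimal sketch of the idea (using made-up data and a simple kth-nearest-neighbor score as the detector; a real system would more likely use a library detector, and would normally put the per-subspace scores on a common scale before combining them, as discussed later):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, k=10):
    # Simple KNN-style outlier score: distance to the k-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    return dists[:, -1]

np.random.seed(0)
data = np.random.randn(500, 6)           # stand-in data; columns 0..5 are features A..F

# The seven subspaces described above: A-B, C-D, E-F, A-E, B-C, B-D-F, A-B-E
subspaces = [[0, 1], [2, 3], [4, 5], [0, 4], [1, 2], [1, 3, 5], [0, 1, 4]]
all_scores = np.array([knn_scores(data[:, cols]) for cols in subspaces])

# One final score per record: its maximum score over the seven detectors
final_scores = all_scores.max(axis=0)
print(np.argsort(final_scores)[-5:])     # the 5 most outlier-like records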

Realistically, subspaces become more useful where we have many more features than six, and generally even the subspaces themselves will have more than six features, not just two or three, but viewing this simple case, for now, with a small number of small subspaces is fairly easy to understand.

Using these subspaces, we can more reliably find P1 and P2 as outliers. P1 would likely be scored high by the detector running on features A-B, the detector running on features A-E, and the detector running on features A-B-E. P2 would likely be detected by the detector running on features C-D, and possibly the detector running on B-C.

However, we have to be careful: using only these seven subspaces, as opposed to a single 6d space covering all features, would miss any rare combinations of, for example, A and D, or C and E. These may or may not be detected using a detector covering all six features, but definitely could not be detected using a suite of detectors that simply never examine these combinations of features.

Using subspaces does have some large benefits, but does have some risk of missing relevant outliers. We’ll cover some techniques to generate subspaces below that mitigate this issue, but it can be useful to still run one or more outlier detectors on the full dataspace as well. In general, with outlier detection, we’re rarely able to find the full set of outliers we’re interested in unless we apply many techniques. As important as the use of subspaces can be, it is still often useful to use a variety of techniques, which may include running some detectors on the full data.

Similarly, with each subspace, we may execute multiple detectors. For example, we may use both a KNN and LOF detector, as well as Radius, ABOD, and possibly a number of other detectors — again, using multiple techniques allows us to better cover the range of outliers we wish to detect.

We’ve seen, then, a couple of motivations for working with subspaces: we can mitigate the curse of dimensionality, and we can reduce the risk that anomalies based on a small number of features go undetected because those features are lost among many others.

As well as handling situations like this, there are a number of other advantages to using subspaces with outlier detection. These include:

  • Accuracy due to the effects of using ensembles — Using multiple subspaces allows us to create ensembles (collections of outlier detectors), which allows us to combine the results of many detectors. In general, using ensembles of detectors provides greater accuracy than using a single detector. This is similar (though with some real differences too) to the way ensembles of predictors tend to be stronger for classification and regression problems than a single predictor. Here, using subspaces, each record is examined multiple times, which provides a more stable evaluation of each record than any single detector would.
  • Interpretability — The results can be more interpretable, and interpretability is often a key concern in outlier detection. Very often in outlier detection, we’re flagging unusual records with the idea that they may be a concern, or a point of interest, in some way, and often they will be manually examined. Knowing why they are unusual is necessary to be able to do this efficiently and effectively. Manually assessing outliers that are flagged by detectors that examined many features can be especially difficult; on the other hand, outliers flagged by detectors using only a small number of features can be much more manageable to assess.
  • Faster systems — Using fewer features allows us to create faster (and less memory-intensive) detectors. This can speed up both fitting and inference, particularly when working with detectors whose execution time is non-linear in the number of features (many detectors are, for example, quadratic in execution time based on the number of features). Depending on the detectors, using, say, 20 detectors, each covering 8 features, may actually execute faster than a single detector covering 100 features.
  • Execution in parallel — Given that we use many small detectors instead of one large detector, it’s possible to execute both the fitting and the predicting steps in parallel, allowing for faster execution where there are the hardware resources to support this.
  • Ease of tuning over time — Using many simple detectors creates a system that’s easier to tune over time. Very often with outlier detection, we’re simply evaluating a single dataset and wish just to identify the outliers in this. But it’s also very common to execute outlier detection systems on a long-running basis, for example, monitoring industrial processes, website activity, financial transactions, the data being input to machine learning systems or other software applications, the output of these systems, and so on. In these cases, we generally wish to improve the outlier detection system over time, allowing us to focus better on the more relevant outliers. Having a suite of simple detectors, each based on a small number of features, makes this much more manageable. It allows us to, over time, increase the weight of the more useful detectors and decrease the weight of the less useful detectors.

As indicated, we will need, for each dataset evaluated, to determine the appropriate subspaces. It can, though, be difficult to find the relevant set of subspaces, or at least to find the optimal set of subspaces. That is, assuming we are interested in finding any unusual combinations of values, it can be difficult to know which sets of features will contain the most relevant of the unusual combinations.

As an example, if a dataset has 100 features, we may train 10 models, each covering 10 features. We may use, say, the first 10 features for the first detector, the second set of 10 features for the second, and so on. If the first two features have some rows with anomalous combinations of values, we will detect this. But if there are anomalous combinations related to the first feature and any of the 90 features not covered by the same model, we will miss these.

We can improve the odds of putting relevant features together by using many more subspaces, but it can be difficult to ensure all sets of features that should be together are actually together at least once, particularly where there are relevant outliers in the data that are based on three, four, or more features — which must appear together in at least one subspace to be detected. For example, in a table of staff expenses, you may wish to identify expenses for rare combinations of Department, Expense Type, and Amount. If so, these three features must appear together in at least one subspace.

So, we have the questions of how many features should be in each subspace, which features should go together, and how many subspaces to create.

There are a very large number of combinations to consider. If there are 20 features, there are 2²⁰ possible subspaces, which is just over a million. If there are 30 features, there are over a billion. If we decide ahead of time how many features will be in each subspace, the number of combinations decreases, but is still very large. If there are 20 features and we wish to use subspaces with 8 features each, there are 20 choose 8, or 125,970 combinations. If there are 30 features and we wish for subspaces with 7 features each, there are 30 choose 7, or 2,035,800 combinations.

One approach we may wish to take is to keep the subspaces small, which allows for greater interpretability. The most interpretable option, using two features per subspace, also allows for simple visualization. However, if we have d features, we will need d*(d-1)/2 models to cover all combinations, which can be intractable. With 100 features, we would require 4,950 detectors. We usually need to use at least several features per detector, though not necessarily a large number.
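These counts are easy to verify directly:

from math import comb

print(2 ** 20)       # 1,048,576 possible subspaces with 20 features
print(2 ** 30)       # 1,073,741,824 possible subspaces with 30 features
print(comb(20, 8))   # 125,970 subspaces of 8 features from 20
print(comb(30, 7))   # 2,035,800 subspaces of 7 features from 30
print(comb(100, 2))  # 4,950 pairs of features from 100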

We wish to use enough detectors, and enough features per detector, that each pair of features appears together ideally at least once, and few enough features per detector that the detectors have largely different features from each other. For example, if each detector used 90 out of the 100 features, we’d cover all combinations of features well, but the subspaces would still be quite large (undoing much of the benefit of using subspaces), and all the subspaces will be quite similar to each other (undoing much of the benefit of creating ensembles).

While the number of features per subspace requires balancing these concerns, the number of subspaces created is a bit more straightforward: in terms of accuracy, using more subspaces is strictly better, but is computationally more expensive.

There are a few broad approaches to finding useful subspaces. I list these here quickly, then look at some in more detail below.

  • Based on domain knowledge — Here we consider which sets of features could potentially have combinations of values we would consider noteworthy.
  • Based on associations — Unusual combinations of values are only possible if a set of features are associated in some way. In prediction problems, we often wish to minimize the correlations between features, but with outlier detection, these are the features that are most useful to consider together. The features with the strongest associations will have the most meaningful outliers if there are exceptions to the normal patterns.
  • Based on finding very sparse regions — Records are typically considered as outliers if they are unlike most other records in the data, which implies they are located in sparse regions of the data. Therefore, useful subspaces can be found as those that contain large, nearly-empty regions.
  • Randomly — This is the method used by a technique shown later called FeatureBagging and, while it can be suboptimal, it avoids the expensive searches for associations and sparse regions, and can work reasonably well where many subspaces are used.
  • Exhaustive searches — This is the method employed by Counts Outlier Detector. This is limited to subspaces with small numbers of features, but the results are highly interpretable. It also avoids any computation, or biases, associated with selecting only a subset of the possible subspaces.
  • Using the features related to any known outliers — If we have a set of known outliers, if we can identify why they are outliers (the relevant features), and if we are in a situation where we do not wish to identify unknown outliers (only these specific outliers), then we can take advantage of this and identify the sets of features relevant for each known outlier, and construct models for the various sets of features required.

We’ll look at a few of these next in a little more detail.

Domain knowledge

Let’s take the example of a dataset, specifically an expenses table, shown below. If examining this table, we may be able to determine the types of outliers we would and would not be interested in. Unusual combinations of Account and Amount, as well as unusual combinations of Department and Account, may be of interest; whereas Date of Expense and Time would likely not be a useful combination. We can continue in this way, creating a small number of subspaces, each with likely two, three, or four features, which can allow for very efficient and interpretable outlier detection, flagging the most relevant outliers.

Expenses table
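In code, domain-knowledge subspaces can be as simple as hand-picked lists of column names (the names below are illustrative, following the examples in this section), with one or more detectors then fit to each subset of columns:

# Hand-picked subspaces for an expenses table (column names are illustrative)
subspaces = [
    ['Account', 'Amount'],
    ['Department', 'Account'],
    ['Department', 'Expense Type', 'Amount'],
]
# Each subspace would then be passed to one or more detectors, for example:
# for cols in subspaces:
#     detector.fit(expenses_df[cols])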

This can miss cases where we have an association in the data, though the association is not obvious. So, as well as taking advantage of domain knowledge, it may be worth searching the data for associations. We can discover relationships among the features, for example, by testing whether features can be predicted accurately from the other features using simple predictive models. Where we find such associations, they can be worth investigating.

Discovering these associations may be useful for some purposes, but may or may not be useful for the outlier detection process. If there is, for example, a relationship between accounts and the time of day, this may simply be due to the process people happen to typically use to submit their expenses; it may be that deviations from this are of interest, but more likely they are not.

Random feature subspaces

Creating subspaces randomly can be effective if there is no domain knowledge to draw on. This is fast and can create a set of subspaces that will tend to catch the strongest outliers, though it can miss some important outliers too.

The code below provides an example of one method to create a set of random subspaces. This example uses a set of eight features, named A through H, and creates a set of subspaces of these.

Each subspace starts by selecting the feature that is so far the least used (if there is a tie, one is selected randomly). A variable called ft_used_counts tracks this. It then adds features to this subspace one at a time, each step selecting the feature that has appeared in other subspaces the least often with the features so far in the subspace. A matrix called ft_pair_mtx tracks how many subspaces each pair of features has appeared in together so far. Doing this, we create a set of subspaces that pairs each pair of features together roughly equally often.

import pandas as pd
import numpy as np

def get_random_subspaces(features_arr, num_base_detectors,
                         num_feats_per_detector):
    num_feats = len(features_arr)
    feat_sets_arr = []
    ft_used_counts = np.zeros(num_feats)
    ft_pair_mtx = np.zeros((num_feats, num_feats))

    # Each loop generates one subspace, which is one set of features
    for _ in range(num_base_detectors):
        # Get the set of features with the minimum count
        min_count = ft_used_counts.min()
        idxs = np.where(ft_used_counts == min_count)[0]

        # Pick one of these randomly and add to the current set
        feat_set = [np.random.choice(idxs)]

        # Find the remaining set of features
        while len(feat_set) < num_feats_per_detector:
            # Select the feature that has so far appeared least often
            # with the features already in the subspace
            mtx_with_set = ft_pair_mtx[:, feat_set]
            sums = mtx_with_set.sum(axis=1)
            min_sum = sums.min()
            min_idxs = np.where(sums == min_sum)[0]
            new_feat = np.random.choice(min_idxs)
            feat_set.append(new_feat)
            feat_set = list(set(feat_set))

            # Update ft_pair_mtx to record the new pairings
            for c in feat_set:
                ft_pair_mtx[c][new_feat] += 1
                ft_pair_mtx[new_feat][c] += 1

        # Update ft_used_counts to record the features used
        for c in feat_set:
            ft_used_counts[c] += 1

        feat_sets_arr.append(feat_set)

    return feat_sets_arr

np.random.seed(0)
features_arr = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
num_base_detectors = 4
num_feats_per_detector = 5

feat_sets_arr = get_random_subspaces(features_arr,
                                     num_base_detectors,
                                     num_feats_per_detector)
for feat_set in feat_sets_arr:
    print([features_arr[x] for x in feat_set])

Normally we would create many more base detectors (each subspace often corresponds to one base detector, though we can also run multiple base detectors on each subspace) than we do in this example, but this uses just four to keep things simple. This will output the following subspaces:

['A', 'E', 'F', 'G', 'H']
['B', 'C', 'D', 'F', 'H']
['A', 'B', 'C', 'D', 'E']
['B', 'D', 'E', 'F', 'G']

The code here will create the subspaces such that all have the same number of features. There is also an advantage in having the subspaces cover different numbers of features, as this can introduce some more diversity (which is important when creating ensembles), but there is strong diversity in any case from using different features (so long as each uses a relatively small number of features, such that the subspaces are largely different features).

Having the same number of features has a couple benefits. It simplifies tuning the models, as many parameters used by outlier detectors depend on the number of features. If all subspaces have the same number of features, they can also use the same parameters.

It also simplifies combining the scores, as the detectors will be more comparable to each other. If using different numbers of features, this can produce scores that are on different scales, and not easily comparable. For example, with k-Nearest Neighbors (KNN), we expect greater distances between neighbors if there are more features.
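If the subspaces do end up with different numbers of features, one common workaround (a general ensembling practice, not something specific to the detectors covered here) is to put each detector’s scores on a comparable scale before combining them, for example with simple z-score standardization:

import numpy as np

def standardize(scores):
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + 1e-9)

# One array of raw scores per subspace detector (made-up values here)
np.random.seed(0)
scores_per_detector = [np.random.rand(100) * 5, np.random.rand(100) * 50]

scaled = np.array([standardize(s) for s in scores_per_detector])
final_scores = scaled.max(axis=0)   # combine the rescaled scores per record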

Feature subspaces based on correlations

Everything else equal, in creating the subspaces, it’s useful to keep associated features together as much as possible. In the code below, we provide an example of code to select subspaces based on correlations.

There are several ways to test for associations. We can create predictive models to attempt to predict each feature from each other single feature (this will capture even relatively complex relationships between features). With numeric features, the simplest method is likely to check for Spearman correlations, which will miss nonmonotonic relationships, but will detect most strong relationships. This is what is used in the code example below.

To execute the code, we first specify the number of subspaces desired and the number of features in each.

This executes by first finding all pairwise correlations between the features and storing this in a matrix. We then create the first subspace, starting by finding the largest correlation in the correlation matrix (this adds two features to this subspace) and then looping over the number of other features to be added to this subspace. For each, we take the largest correlation in the correlation matrix for any pair of features, such that one feature is currently in the subspace and one is not. Once this subspace has a sufficient number of features, we create the next subspace, taking the largest correlation remaining in the correlation matrix, and so on.

For this example, we use a real dataset, the baseball dataset from OpenML (available with a public license). The dataset turns out to contain some large correlations. The correlation, for example, between At bats and Runs is 0.94, indicating that any values that deviate significantly from this pattern would likely be outliers.

import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml

# Function to find the pair of features remaining in the matrix with the
# highest correlation
def get_highest_corr(corr_matrix):
    return np.unravel_index(
        np.argmax(corr_matrix.values, axis=None),
        corr_matrix.shape)

def get_correlated_subspaces(corr_matrix, num_base_detectors,
                             num_feats_per_detector):
    sets = []

    # Loop through each subspace to be created
    for _ in range(num_base_detectors):
        m1, m2 = get_highest_corr(corr_matrix)

        # Start each subspace as the two remaining features with
        # the highest correlation
        curr_set = [m1, m2]
        for _ in range(2, num_feats_per_detector):
            # Get the other remaining correlations, sorted largest first
            m = np.unravel_index(
                np.argsort(corr_matrix.values, axis=None),
                corr_matrix.shape)
            m0 = m[0][::-1]
            m1 = m[1][::-1]
            for i in range(len(m0)):
                d0 = m0[i]
                d1 = m1[i]
                # Add the pair if either feature is already in the subset
                if (d0 in curr_set) or (d1 in curr_set):
                    curr_set.append(d0)
                    # Remove duplicates
                    curr_set = list(set(curr_set))
                    if len(curr_set) < num_feats_per_detector:
                        curr_set.append(d1)
                        # Remove duplicates
                        curr_set = list(set(curr_set))
                if len(curr_set) >= num_feats_per_detector:
                    break

            # Update the correlation matrix, removing the features now used
            # in the current subspace
            for i in curr_set:
                i_idx = corr_matrix.index[i]
                for j in curr_set:
                    j_idx = corr_matrix.columns[j]
                    corr_matrix.loc[i_idx, j_idx] = 0
            if len(curr_set) >= num_feats_per_detector:
                break

        sets.append(curr_set)
    return sets

data = fetch_openml('baseball', version=1)
df = pd.DataFrame(data.data, columns=data.feature_names)

corr_matrix = abs(df.corr(method='spearman'))
corr_matrix = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
corr_matrix = corr_matrix.fillna(0)

feat_sets_arr = get_correlated_subspaces(corr_matrix, num_base_detectors=5,
                                         num_feats_per_detector=4)
for feat_set in feat_sets_arr:
    print([df.columns[x] for x in feat_set])

This produces:

['Games_played', 'At_bats', 'Runs', 'Hits']
['RBIs', 'At_bats', 'Hits', 'Doubles']
['RBIs', 'Games_played', 'Runs', 'Doubles']
['Walks', 'Runs', 'Games_played', 'Triples']
['RBIs', 'Strikeouts', 'Slugging_pct', 'Home_runs']

PyOD is likely the most comprehensive and well-used tool for outlier detection on numeric tabular data available in Python today. It includes a large number of detectors, ranging from very simple to very complex — including several deep learning-based methods.

Now that we have an idea of how subspaces work with outlier detection, we’ll look at two tools provided by PyOD that work with subspaces, called SOD and FeatureBagging. Both of these tools identify a set of subspaces, execute a detector on each subspace, and combine the results for a single score for each record.

Whether using subspaces or not, it’s necessary to determine what base detectors to use. If not using subspaces, we would select one or more detectors and run these on the full dataset. And, if we are using subspaces, we again select one or more detectors, here running these on each subspace. As indicated above, LOF and KNN can be reasonable choices, but PyOD provides a number of others as well that can work well if executed on each subspace, including, for example, Angle-based Outlier Detector (ABOD), models based on Gaussian Mixture Models (GMMs), Kernel Density Estimations (KDE), and several others. Other detectors, provided outside PyOD can work very effectively as well.

SOD was designed specifically to handle situations such as shown in the scatter plots above. SOD works, similar to KNN and LOF, by identifying a neighborhood of k neighbors for each point, known as the reference set. The reference set is found in a different way, though, using a method called shared nearest neighbors (SNN).

Shared nearest neighbors are described thoroughly in this article, but the general idea is that if two points are generated by the same mechanism, they will tend to not only be close, but also to have many of the same neighbors. And so, the similarity of any two records can be measured by the number of shared neighbors they have. Given this, neighborhoods can be identified by using not only the sets of points with the smallest Euclidean distances between them (as KNN and LOF do), but the points with the most shared neighbors. This tends to be robust even in high dimensions and even where there are many irrelevant features: the rank order of neighbors tends to remain meaningful even in these cases, and so the set of nearest neighbors can be reliably found even where specific distances cannot.
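The core of the SNN idea can be sketched in a few lines; this is only the shared-neighbor similarity calculation on synthetic data, not the full procedure SOD uses internally:

import numpy as np
from sklearn.neighbors import NearestNeighbors

np.random.seed(0)
X = np.random.randn(200, 10)

k = 15
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idxs = nn.kneighbors(X)
neighbor_sets = [set(row[1:]) for row in idxs]   # drop each point itself

def snn_similarity(i, j):
    # Similarity of records i and j: the number of k-nearest neighbors they share
    return len(neighbor_sets[i] & neighbor_sets[j])

print(snn_similarity(0, 1))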

Once we have the reference set, we use this to determine the subspace, which here is the set of features that explain the greatest amount of variance for the reference set. Once we identify these subspaces, SOD examines the distances of each point to the data center.

I provide a quick example using SOD below. This assumes pyod has been installed, which requires running:

pip install pyod

We’ll use, as an example, a synthetic dataset, which allows us to experiment with the data and model hyperparameters to get a better sense of the strengths and limitations of each detector. The code here provides an example of working with 35 features, where two features (features 8 and 9) are correlated and the other features are irrelevant. A single outlier is created as an unusual combination of the two correlated features.

SOD is able to identify the one known outlier as the top outlier. I set the contamination rate to 0.01 to specify that (given there are 100 records) only a single outlier should be returned. Testing this beyond 35 features, though, SOD scores this point much lower. This example specifies the size of the reference set to be 3; different results may be seen with different values.

import pandas as pd
import numpy as np
from pyod.models.sod import SOD

np.random.seed(0)
d = np.random.randn(100, 35)
d = pd.DataFrame(d)

# Ensure features 8 and 9 are correlated, while all others are irrelevant
d[9] = d[9] + d[8]

# Insert a single outlier
d.loc[99, 8] = 3.5
d.loc[99, 9] = -3.8

# Execute SOD, flagging only 1 outlier
clf = SOD(ref_set=3, contamination=0.01)
clf.fit(d)
d['SOD Scores'] = clf.labels_

We display four scatterplots below, showing four pairs of the 35 features. The known outlier is shown as a star in each of these. We can see features 8 and 9 (the two relevant features) in the second pane, and we can see the point is a clear outlier, though it is typical in all other dimensions.

Testing SOD with 35-dimensional data. One outlier was inserted into the data and can be seen clearly in the second pane for features 8 and 9. Although the point is typical otherwise, it is flagged as the top outlier by SOD. The third pane also includes feature 9, and we can see the point is somewhat unusual here, though no more so than many other points in other dimensions. The relationship in features 8 and 9 is the most relevant, and SOD appears to detect this

FeatureBagging was designed to solve the same problem as SOD, though takes a different approach to determining the subspaces. It creates the subspaces completely randomly (so slightly differently than the example above, which keeps a record of how often each pair of features are placed in a subspace together and attempts to balance this). It also subsamples the rows for each base detector, which provides a little more diversity between the detectors.

A specified number of base detectors are used (10 by default, though it is preferable to use more), each of which selects a random set of rows and features. For each, the maximum number of features that may be selected is specified as a parameter, defaulting to all. So, for each base detector, FeatureBagging:

  • Determines the number of features to use, up to the specified maximum.
  • Chooses this many features randomly.
  • Chooses a set of rows randomly. This is a bootstrap sample of the same size as the number of rows.
  • Creates an LOF detector (by default; other base detectors may be used) to evaluate the subspace.

Once this is complete, each row will have been scored by each base detector and the scores must then be combined into a single, final score for each row. PyOD’s FeatureBagging provides two options for this: using the maximum score and using the mean score.

As we saw in the scatter plots above, points can be strong outliers in some subspaces and not in others, and averaging in their scores from the subspaces where they are typical can water down their scores and defeat the benefit of using subspaces. In other forms of ensembling with outlier detection, using the mean can work well, but when working with multiple subspaces, using the maximum will typically be the better of the two options. Doing that, we give each record a score based on the subspace where it was most unusual. This isn’t perfect either, and there can be better options, but using the maximum is simple and is almost always preferable to the mean.
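A minimal usage sketch is below, on synthetic data; the parameter names are as per the PyOD version current at the time of writing, so check the documentation for your version. It creates 50 base LOF detectors and combines their scores by taking the maximum.

import numpy as np
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.lof import LOF

np.random.seed(0)
X = np.random.randn(500, 20)

# 50 base LOF detectors, each fit on its own random subspace, with the
# per-record scores combined by taking the maximum over the detectors
clf = FeatureBagging(base_estimator=LOF(), n_estimators=50, combination='max')
clf.fit(X)
scores = clf.decision_scores_   # one outlier score per record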

Any detector can be used within the subspaces. PyOD uses LOF by default, as did the original paper describing FeatureBagging. LOF is a strong detector and a sensible choice, though you may find better results with other base detectors.

In the original paper, subspaces are created randomly, each using between d/2 and d-1 features, where d is the total number of features. Some researchers have pointed out that the number of features used in the original paper is likely much larger than is appropriate.

If the full number of features is large, using over half the features at once will allow the curse of dimensionality to take effect. And using many features in each detector will result in the detectors being correlated with each other (for example, if all base detectors use 90% of the features, they will use roughly the same features and tend to score each record roughly the same), which can also remove much of the benefit of creating ensembles.

PyOD allows setting the number of features used in each subspace; it should typically be set fairly low, with a large number of base estimators created.

In this article we’ve looked at subspaces as a way to improve outlier detection in a number of ways, including reducing the curse of dimensionality, increasing interpretability, allowing parallel execution, allowing easier tuning over time, and so on. Each of these are important considerations, and using subspaces is often very helpful.

There are, though, often other approaches as well that can be used for these purposes, sometimes as alternatives, and sometimes in combination with the use of subspaces. For example, to improve interpretability, it’s important to, as much as possible, select model types that are inherently interpretable (for example univariate tests such as z-score tests, Counts Outlier Detector, or a detector provided by PyOD called ECOD).

Where the main interest is in reducing the curse of dimensionality, here again, it can be useful to look at model types that scale to many features well, for instance Isolation Forest or Counts Outlier Detector. It can also be useful to look at executing univariate tests, or applying PCA.

One thing to be aware of when constructing subspaces, if they are formed based on correlations, or on sparse regions, is that the relevant subspaces may change over time as the data changes. New associations may emerge between features and new sparse regions may form that will be useful for identifying outliers, though these will be missed if the subspaces are not recalculated from time to time. Finding the relevant subspaces in these ways can be quite effective, but they may need to be updated on some schedule, or where the data is known to have changed.

It’s common with outlier detection projects on tabular data for it to be worth looking at using subspaces, particularly where we have many features. Using subspaces is a relatively straightforward technique with a number of noteworthy advantages.

Where you face issues related to large data volumes, execution times, or memory limits, using PCA may also be a useful technique, and may work better in some cases than creating subspaces. Working with subspaces (and so with the original features, rather than the components created by PCA), though, can be substantially more interpretable, and interpretability is often quite important with outlier detection.

Subspaces can be used in combination with other techniques to improve outlier detection. As an example, using subspaces can be combined with other ways to create ensembles: it’s possible to create larger ensembles using both subspaces (where different detectors in the ensemble use different features) as well as different model types, different training rows, different pre-processing, and so on. This can provide some further benefits, though with some increase in computation as well.

All images by author




24Nov

Data Science in Marketing: Hands-on Propensity Modelling with Python | by Rebecca Vickery | Nov, 2024


All the code you need to predict the likelihood of a customer purchasing your product

Photo by Campaign Creators on Unsplash

Propensity models are a powerful application of machine learning in marketing. These models use historical examples of customer behaviour to make predictions about future behaviour. The predictions generated by the propensity model are commonly used to understand the likelihood of a customer purchasing a particular product or taking up a specific offer within a given time frame.

In essence, propensity models are examples of the machine learning technique known as classification. What makes propensity models unique is the problem statement they solve and how the output needs to be crafted for use in marketing.

The output of a propensity model is a probability score describing the predicted likelihood of the desired customer behaviour. This score can be used to create customer segments or rank customers for increased personalisation and targeting of new products or offers.
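As a minimal sketch of what this looks like in practice (entirely synthetic data and made-up feature names; any classifier with predict_proba would work), the propensity score is simply the predicted probability of the positive class, which is then used to rank customers:

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

np.random.seed(0)
X = pd.DataFrame({
    'days_since_last_purchase': np.random.randint(1, 365, 1000),
    'num_past_purchases': np.random.poisson(3, 1000),
    'avg_order_value': np.random.gamma(2.0, 30.0, 1000),
})
y = np.random.binomial(1, 0.15, 1000)   # 1 = purchased within the time frame

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

# The propensity score: predicted probability of purchase, used for ranking
propensity = model.predict_proba(X_test)[:, 1]
ranked = X_test.assign(propensity=propensity).sort_values('propensity', ascending=False)
print(ranked.head())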

In this article, I’ll provide an end-to-end practical tutorial describing how to build a propensity model ready for use by a marketing team.

This is the first in a series of hands-on Python tutorials I’ll be writing




23Nov

Building Sustainable Algorithms: Energy-Efficient Python Programming | by Ari Joury, PhD | Nov, 2024


6 techniques for reducing the computational cost of Python algorithms

You can get your Python performing better by using these techniques. Image generated with Leonardo AI

A junior software developer shall be forgiven for being happy when their code works. If that’s you, I do not judge you.

However, if you are ready to get to the next level of building software with Python, your code should not just run and pass some tests. It should also be written with the available computing resources — and the energy bill — in mind.

Every inefficient loop, poorly chosen data structure, or redundant computation burns more electricity than necessary. Unlike C, for example, where you must explicitly set aside memory for each new variable you create, Python will consume resources as it sees fit. This makes it extremely beginner-friendly, but also rather energy-intensive when used wrong.
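The article’s six techniques are not shown in this excerpt, but the snippet below illustrates the general point with two small, generic examples (a better-chosen data structure, and caching a redundant computation); the numbers are arbitrary:

import functools

# Membership tests: a list scans every element, a set uses a hash lookup
allowed_ids_list = list(range(1_000_000))
allowed_ids_set = set(allowed_ids_list)

def count_hits(queries, allowed):
    return sum(q in allowed for q in queries)

# count_hits(queries, allowed_ids_set) does the same work as
# count_hits(queries, allowed_ids_list) with far fewer CPU cycles

# Redundant computation: cache results that are requested repeatedly
@functools.lru_cache(maxsize=None)
def expensive_lookup(key):
    return sum(i * i for i in range(10_000))   # stand-in for real work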

Sloppy algorithms are not just bad for the performance of a code. They are bad for the planet, too. Software companies like Microsoft are struggling to keep their carbon emissions low because of all of the energy they consume for AI and other tasks. At the same time, sustainability is a growing concern. Sustainability-minded programmers are therefore becoming a valuable resource for many companies.




22Nov

The Evolution of AI Agents & Agentic Systems | by Cobus Greyling | Nov, 2024


Cognitive Architectures: Episodic Memory refers to the ability to store and recall specific events or experiences, like remembering a recent conversation.

Semantic Memory stores general knowledge about the world, such as facts and concepts.

While episodic memory is dynamic and context-dependent, semantic memory is more stable and involves understanding abstract, generalised information.

Action Space: The AI Agent operates within a dual-action framework. Internal actions involve processes like reasoning, planning, and updating its internal state, while external actions involve interactions with the environment, such as executing commands or providing outputs.

Decision-Making Procedure: The agent’s decision-making is organised as an interactive loop consisting of planning and execution. This iterative process allows the agent to analyse its environment, formulate a strategy, and act accordingly, refining its approach as new information becomes available.
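As a schematic illustration only (toy stand-ins, not any particular agent framework or the architecture described in the study), the planning-and-execution loop with the two memory types might be organised along these lines:

# Toy sketch: not a real agent framework, just the loop structure described above
class ToyEnvironment:
    def __init__(self):
        self.state = 0
    def execute(self, action):                 # external action
        self.state += action
        return {"event": f"applied {action}", "state": self.state}

episodic_memory = []                           # specific recent events
semantic_memory = {"goal_state": 3}            # stable, general knowledge

def agent_loop(env, max_steps=10):
    for _ in range(max_steps):
        # Internal action: plan using both memory types
        action = 1 if env.state < semantic_memory["goal_state"] else 0
        # External action: act on the environment and record the observation
        observation = env.execute(action)
        episodic_memory.append(observation)
        if env.state >= semantic_memory["goal_state"]:
            break

agent_loop(ToyEnvironment())
print(episodic_memory)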

These elements collectively define the operational framework of an AI agent, enabling adaptive and efficient behaviour in complex environments.

AI Agents operate within diverse environments that enable them to interact and execute tasks. Currently, these environments are primarily digital, including mobile operating systems, desktop operating systems, and other digital ecosystems.

In these settings, AI Agents can engage with games, APIs, websites, and general code execution, utilising these platforms as a foundation for task execution and knowledge application.

Digital environments provide an efficient and cost-effective alternative to physical interaction, serving as an easily supervised setting in which to develop and evaluate AI Agents.

Practical Examples

For example, in natural language processing (NLP) tasks, digital APIs — such as search engines, calculators, and translators — are often packaged as tools within the operating system, designed for specific purposes.

These tools can be considered specialised, single-use digital environments that allow agents to perform tasks requiring external knowledge or computation.

As AI Agents continue to evolve, their presence in digital environments will expand beyond static interaction, laying the groundwork for more complex systems.

The future of AI Agents lies in their physical embodiment, where they will operate in real-world environments. This transition will open up new possibilities for AI, enabling agents to interact physically with the world, navigate dynamic spaces, and take on roles in fields such as robotics.

The shift from purely digital environments to physical ones represents a significant step forward, as it will require AI Agents to integrate sensory data, physical actions, and context-aware decision-making, further enhancing their capabilities and applications.

What stands out in this study is its detailed analysis of the evolving frameworks built around LLMs to maximise their potential. It highlights how these structures are both internal and external, working in tandem to enhance their capabilities.

Internally, the focus is on reasoning, which forms the core of the model’s intelligence and decision-making processes. Externally, the journey began with data augmentation, enabling the integration of additional information.

Over time, these external frameworks have expanded to include direct interaction with the outside world, further extending the functionality and adaptability of LLMs.





20Nov

Anthropic’s Claude 3.5 Computer Use Framework (AI Agent) | by Cobus Greyling | Nov, 2024


Claude Computer Use utilises a reasoning-acting (ReAct) paradigm to generate reliable actions in the dynamic GUI environment.

Observing the environment before deciding on an action ensures that its responses align with the current GUI state.

Additionally, it demonstrates the ability to efficiently recognise when user requirements are met, enabling decisive actions while avoiding unnecessary steps.

Unlike traditional approaches that rely on continuous observation at every step, Claude Computer Use employs a selective observation strategy, monitoring the GUI state only when needed. This method reduces costs and enhances efficiency by eliminating redundant observations.

Claude Computer Use AI Agent relies exclusively on visual input from real-time screenshots to observe the user environment, without utilising any additional data sources.

These screenshots, captured during task execution, allow the model to mimic human interactions with desktop interfaces effectively.

This approach is essential for adapting to the ever-changing nature of GUI environments. By adopting a vision-only methodology, Claude Computer Use enables general computer operation without depending on software APIs, making it especially suitable for working with closed-source software.

Representative failure cases in the evaluation highlight instances where the model’s actions did not align with the intended user outcomes, exposing limitations in task comprehension or execution.

To analyse these failures systematically, errors are categorised into three distinct sources: Planning Error (PE), Action Error (AE) and Critic Error (CE).

These categories aid in identifying the root causes of each failure.

Planning Error (PE): These errors arise when the model generates an incorrect plan based on task queries, often due to misinterpreting task instructions or misunderstanding the current computer state. For instance, a planning error might occur during tasks like subscribing to a sports service.

Action Error (AE): These occur when the plan is correct, but the agent fails to execute the corresponding actions accurately. Such errors are typically related to challenges in understanding the interface, recognising spatial elements, or controlling GUI elements precisely. For example, errors can arise during tasks like inserting a sum equation over specific cells in a spreadsheet.

Critic Error (CE): Critic errors happen when the agent misjudges its own actions or the computer’s state, resulting in incorrect feedback about task completion. Examples include updating details on a resume template or inserting a numbering symbol.

These categorisations provide a structured approach to identifying and addressing the underlying causes of task failures.

This study highlights the framework’s potential as well as its limitations, particularly in planning, action execution, and critic feedback.

An out-of-the-box framework, Computer Use Out-of-the-Box, was introduced to simplify the deployment and benchmarking of such models in real-world scenarios. In an upcoming article I would like to dig into the framework.





18Nov

Revolutionising AI Agents With Computer Use Tools | by Cobus Greyling | Nov, 2024


Anthropic’s new framework for seamless computer use is a good example of what is to come.

The future of AI Agents isn’t just about better language models — it’s about integrating intelligent agents into your computing environment.

Anthropic has taken a bold step forward with its new framework for computer use, designed to work locally on your system while leveraging reasoning and task decomposition capabilities.

Breaking it down:
🌟 It’s Not Just a Language Model
– This framework isn’t a typical AI chatbot or model — it’s a powerful tool that lives within your local computing space.

– At its core is the AI Agent, running within a Docker container, that:
— Decomposes tasks into manageable steps.
— Reasons logically to solve problems effectively.
— Interfaces directly with your computer to get things done.

– Think of it as a local, reasoning-driven AI agent, where the ability to navigate your computer is just one of the tools in its arsenal.

🛠️ Three Tool Types
– The framework features a robust tool system to handle diverse tasks:
1. Computer Tools
2. Text Editor Tool
3. Bash Tool

🌍 How the Anthropic Model Fits In
While the framework operates locally, it integrates seamlessly with Anthropic’s state-of-the-art language model (which includes vision capabilities) over the internet.

This model provides:
– Deep language understanding for advanced reasoning.
– Vision processing for interpreting images and other visual data.
This modular approach ensures local control and performance while accessing cloud-powered intelligence only when necessary.

Why It Matters

Anthropic’s framework isn’t just about answering questions — it’s about giving you an AI Agent that understands your environment and helps you tackle complex tasks end-to-end.

This blend of local computing and cloud-powered reasoning opens up new possibilities for developers, researchers, and businesses looking to integrate AI into their workflows in a meaningful way.

💡 Imagine having an AI agent that doesn’t just think — it acts, navigates, and adapts within your computing space.




Source link

17Nov

Spoiler Alert: The Magic of RAG Does Not Come from AI | by Frank Wittkampf | Nov, 2024


Why retrieval, not generation, makes RAG systems magical

Quick POCs

Most quick proofs of concept (POCs) that allow a user to explore data with the help of conversational AI simply blow you away. It feels like pure magic when you can suddenly talk to your documents, your data, or your code base.

These POCs work wonders on small datasets with a limited number of documents. However, as with almost anything you bring to production, you quickly run into problems at scale. When you do a deep dive and inspect the answers the AI gives you, you notice:

  • Your agent doesn’t reply with complete information. It missed some important pieces of data
  • Your agent doesn’t reliably give the same answer
  • Your agent isn’t able to tell you how and where it got which information, making the answer significantly less useful

It turns out that the real magic in RAG does not happen in the generative AI step, but in the process of retrieval and composition. Once you dive in, it’s pretty obvious why…

* RAG = Retrieval Augmented Generation — Wikipedia Definition of RAG

RAG process — Illustration

A quick recap of how a simple RAG process works (a minimal code sketch follows the steps):

  1. It all starts with a query. The user asks a question, or some system is trying to answer one, e.g. “Does patient Walker have a broken leg?”
  2. A search is done with the query. Mostly you’d embed the query and do a similarity search, but you can also do a classic keyword search (e.g. Elasticsearch), a combination of both, or a straight lookup of information
  3. The search result is a set of documents (or document snippets, but let’s simply call them documents for now)
  4. The documents and the essence of the query are combined into some easily readable context so that the AI can work with it
  5. The AI interprets the question and the documents and generates an answer
  6. Ideally this answer is fact-checked, to see if the AI based its answer on the documents and/or whether it is appropriate for the audience
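
To make the steps above concrete, here is a minimal sketch of the loop in Python. The helpers embed(), vector_store.search(), llm.generate() and fact_check() are hypothetical placeholders for your own embedding model, search index, language model and verification step, not a specific library’s API.

def answer_query(query, vector_store, llm, top_k=5):
    # Steps 2-3: retrieve candidate documents via similarity search
    query_vector = embed(query)
    documents = vector_store.search(query_vector, top_k=top_k)

    # Step 4: compose an easily readable context for the model
    context = "\n\n".join(f"[{doc['source']}] {doc['content']}" for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # Step 5: the model interprets the question and documents and generates an answer
    answer = llm.generate(prompt)

    # Step 6: optionally fact-check the answer against the retrieved documents
    return answer if fact_check(answer, documents) else None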

The dirty little secret of the RAG process is that you have to provide the answer to the AI (before it even does anything), so that it is able to give you the reply you’re looking for.

In other words:

  • the work that the AI does (step 5) is to apply judgement and properly articulate the answer
  • the work that the engineer does (steps 3 and 4) is to find the answer and compose it such that the AI can digest it

Which is more important? The answer is, of course, it depends: if judgement is the critical element, then the AI model does all the magic. But for countless business use cases, finding and properly composing the pieces that make up the answer is the more important part.

The first set of problems to solve when running a RAG process are the data ingestion, splitting, chunking, and document interpretation issues. I’ve written about a few of these in prior articles, but am ignoring them here. For now, let’s assume you have properly solved your data ingestion and you have a lovely vector store or search index.

Typical challenges:

  • Duplication — Even the simplest production systems often have duplicate documents. More so when your system is large, you have extensive users or tenants, you connect to multiple data sources, or you deal with versioning, etc.
  • Near duplication — Documents which largely contain the same data, but with minor changes. There are two types of near duplication:
    — Meaningful — E.g. a small correction, or a minor addition, e.g. a date field with an update
    — Meaningless — E.g.: minor punctuation, syntax, or spacing differences, or just differences introduced by timing or intake processing
  • Volume — Some queries have a very large relevant response data set
  • Data freshness vs quality — Which snippets of the response data set have the highest-quality content for the AI to use, and which snippets are most relevant from a time (freshness) perspective?
  • Data variety — How do we ensure a variety of search results such that the AI is properly informed?
  • Query phrasing and ambiguity — The prompt that triggered the RAG flow, might not be phrased in such a way that it yields the optimal result, or might even be ambiguous
  • Response Personalization — The query might require a different response based on who asks it

This list goes on, but you get the gist.

Won’t a larger context window solve this? Short answer: no.

The cost and performance impact of using extremely large context windows shouldn’t be underestimated (you easily 10x or 100x your per-query cost), and that is before counting any follow-up interaction that the user/system has.

However, putting that aside, imagine the following situation.

We put Anne in a room with a piece of paper. The paper says: “patient Joe: complex foot fracture.” Now we ask Anne, does the patient have a foot fracture? Her answer is “yes, he does”.

Now we give Anne a hundred pages of medical history on Joe. Her answer becomes “well, depending on what time you are referring to, he had …”

Now we give Anne thousands of pages on all the patients in the clinic…

What you quickly notice, is that how we define the question (or the prompt in our case) starts to get very important. The larger the context window, the more nuance the query needs.

Additionally, as the context window grows, the universe of possible answers grows with it. This can be a positive thing, but in practice it invites lazy engineering behavior and is likely to reduce the capabilities of your application if not handled intelligently.

As you scale a RAG system from POC to production, here’s how to address typical data challenges with specific solutions. Each approach has been adjusted to suit production requirements and includes examples where useful.

Duplication

Duplication is inevitable in multi-source systems. By using fingerprinting (hashing content), document IDs, or semantic hashing, you can identify exact duplicates at ingestion and prevent redundant content. However, consolidating metadata across duplicates can also be valuable; this lets users know that certain content appears in multiple sources, which can add credibility or highlight repetition in the dataset.

import hashlib

# Fingerprinting for deduplication
def fingerprint(doc_content):
    return hashlib.md5(doc_content.encode()).hexdigest()

# Store fingerprints and filter duplicates, while consolidating metadata
fingerprints = {}
unique_docs = []
for doc in docs:
    fp = fingerprint(doc['content'])
    if fp not in fingerprints:
        fingerprints[fp] = [doc]
        unique_docs.append(doc)
    else:
        fingerprints[fp].append(doc)  # Consolidate sources of duplicate content

Near Duplication

Near-duplicate documents (similar but not identical) often contain important updates or small additions. Given that a minor change, like a status update, can carry critical information, freshness becomes crucial when filtering near duplicates. A practical approach is to use cosine similarity for initial detection, then retain the freshest version within each group of near-duplicates while flagging any meaningful updates.

from sklearn.cluster import DBSCAN

# Cluster document embeddings with DBSCAN (cosine distance) to find near duplicates
clustering = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit(doc_embeddings)

# Organize documents by cluster label; label -1 means the document has no near duplicate
clustered_docs = {}
filtered_docs = []
for idx, label in enumerate(clustering.labels_):
    if label == -1:
        filtered_docs.append(docs[idx])  # keep documents that are not near duplicates
        continue
    clustered_docs.setdefault(label, []).append(docs[idx])

# Within each cluster of near duplicates, retain only the freshest document
for cluster_docs in clustered_docs.values():
    # Choose the document with the most recent timestamp (or highest relevance)
    freshest_doc = max(cluster_docs, key=lambda d: d['timestamp'])
    filtered_docs.append(freshest_doc)

Volume

When a query returns a high volume of relevant documents, effective handling is key. One approach is a **layered strategy**:

  • Theme Extraction: Preprocess documents to extract specific themes or summaries.
  • Top-k Filtering: After synthesis, filter the summarized content based on relevance scores.
  • Relevance Scoring: Use similarity metrics (e.g., BM25 or cosine similarity) to prioritize the top documents before retrieval.

This approach reduces the workload by retrieving synthesized information that’s more manageable for the AI. Other strategies could involve batching documents by theme or pre-grouping summaries to further streamline retrieval.
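
As a rough sketch of the relevance-scoring and top-k part of this layered strategy, the snippet below scores candidate documents against the query embedding and keeps only the best k. The embed() helper and the per-document 'embedding' field are assumptions about your ingestion pipeline, not part of the article’s own code.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_by_relevance(query, candidate_docs, k=20):
    # Score every candidate against the query, then keep only the top k,
    # so the composition step works with a manageable set of documents.
    query_vec = embed(query).reshape(1, -1)  # hypothetical embedding helper
    doc_vecs = np.vstack([doc['embedding'] for doc in candidate_docs])
    scores = cosine_similarity(query_vec, doc_vecs)[0]
    ranked = sorted(zip(scores, candidate_docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:k]]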

Data Freshness vs. Quality

Balancing quality with freshness is essential, especially in fast-evolving datasets. Many scoring approaches are possible, but here’s a general tactic:

  • Composite Scoring: Calculate a quality score using factors like source reliability, content depth, and user engagement.
  • Recency Weighting: Adjust the score with a timestamp weight to emphasize freshness.
  • Filter by Threshold: Only documents meeting a combined quality and recency threshold proceed to retrieval.

Other strategies could involve scoring only high-quality sources or applying decay factors to older documents.
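
A minimal sketch of such a composite score, assuming each document carries hypothetical quality_score (0–1) and timezone-aware timestamp fields; the weights, half-life and threshold are arbitrary illustrations rather than recommendations.

from datetime import datetime, timezone

def composite_score(doc, quality_weight=0.7, recency_weight=0.3, half_life_days=30):
    # Recency decays exponentially: a document half_life_days old contributes 0.5.
    age_days = (datetime.now(timezone.utc) - doc['timestamp']).days
    recency = 0.5 ** (age_days / half_life_days)
    return quality_weight * doc['quality_score'] + recency_weight * recency

# Filter by threshold: only documents meeting the combined score proceed to retrieval
eligible_docs = [doc for doc in docs if composite_score(doc) >= 0.5]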

Data Variety

Ensuring diverse data sources in retrieval helps create a balanced response. Grouping documents by source (e.g., different databases, authors, or content types) and selecting top snippets from each source is one effective method. Other approaches include scoring by unique perspectives or applying diversity constraints to avoid over-reliance on any single document or perspective.

# Ensure variety by grouping and selecting top snippets per source

from itertools import groupby

k = 3  # Number of top snippets per source
docs = sorted(docs, key=lambda d: d['source'])  # groupby requires sorting by the grouping key

grouped_docs = {key: list(group)[:k] for key, group in groupby(docs, key=lambda d: d['source'])}
diverse_docs = [doc for group in grouped_docs.values() for doc in group]

Query Phrasing and Ambiguity

Ambiguous queries can lead to suboptimal retrieval results. Using the exact user prompt is mostly not the best way to retrieve the results they require. For example, there might have been an information exchange earlier in the chat which is relevant, or the user might have pasted a large amount of text with a question about it.

To ensure that you use a refined query, one approach is to have the RAG tool provided to the model ask it to rephrase the question into a more detailed search query, similar to how one might carefully craft a search query for Google. This approach improves alignment between the user’s intent and the RAG retrieval process. The phrasing below is suboptimal, but it provides the gist of it:

tools = [{
    "name": "search_our_database",
    "description": "Search our internal company database for relevant documents",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "A search query, like you would use for a Google search, in sentence form. Take care to provide any important nuance to the question."
            }
        },
        "required": ["query"]
    }
}]

Response Personalization

For tailored responses, integrate user-specific context directly into the RAG context composition. By adding a user-specific layer to the final context, you allow the AI to take into account individual preferences, permissions, or history without altering the core retrieval process.
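
One way to sketch this: prepend a user-specific layer to the composed context before it is handed to the model. The get_user_profile() lookup and its fields are hypothetical stand-ins for your own user store and permission model.

def personalize_context(user_id, retrieved_context):
    # Hypothetical lookup of the user's role, preferences and permissions
    profile = get_user_profile(user_id)
    user_layer = (
        f"User role: {profile['role']}. "
        f"Preferred detail level: {profile['detail_level']}. "
        "Only reference documents this user is permitted to see."
    )
    # The user-specific layer is added on top; the core retrieval result is unchanged.
    return f"{user_layer}\n\n{retrieved_context}"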



Source link

16Nov

Open the Artificial Brain: Sparse Autoencoders for LLM Inspection | by Salvatore Raieli | Nov, 2024


|LLM|INTERPRETABILITY|SPARSE AUTOENCODERS|XAI|

A deep dive into LLM visualization and interpretation using sparse autoencoders

Explore the inner workings of Large Language Models (LLMs) beyond standard benchmarks. This article defines fundamental units within LLMs, discusses tools for analyzing complex interactions among layers and parameters, and explains how to visualize what these models learn, offering insights to correct unintended behaviors.

All things are subject to interpretation; whichever interpretation prevails at a given time is a function of power and not truth. — Friedrich Nietzsche

As AI systems grow in scale, it is increasingly difficult and pressing to understand their mechanisms. Today, there are discussions about the reasoning capabilities of models, potential biases, hallucinations, and other risks and limitations of Large Language Models (LLMs).



Source link

15Nov

The Focus Is Shifting From AI Agents To AI Agent Tool Use | by Cobus Greyling | Nov, 2024


As models become utilities, tool-enabled frameworks and environments are emerging as key, with leading AI companies like OpenAI and Anthropic exploring AI Agents that use computer GUI navigation to accomplish complex tasks.

It was also recently announced that OpenAI is gearing up to release an AI Agent, Operator, which will perform tasks autonomously on a user’s computer, such as coding and booking travel, and which is expected as a research preview in January.

This release aligns with an industry-wide shift toward more capable Agentic Tools that manage multi-step workflows with minimal oversight.

Other major players are also launching agent tools capable of real-time computer navigation, reflecting a strategic move to enhance AI Agent capabilities through tool access rather than simply improving model power.



Source link
