10Jul

Teaching Small Language Models to Reason | by Cobus Greyling | Jul, 2024


Chain-Of-Thought Prompting at a foundational level is so successful that it gave rise to what some refer to as the Chain-Of-X phenomenon. Google Research explored how to generate a CoT data ontology for existing datasets using LLMs, and then how to fine-tune smaller Language Models on the CoT data.

As is widely known, Chain-Of-Thought prompting improves the reasoning capabilities of large language models.

Google asserts that reasoning capabilities only emerge in models with at least tens of billions of parameters. This research from Google explores transferring these capabilities to smaller models via knowledge distillation.

They fine-tuned a student model using the Chain-Of-Thought outputs from a larger teacher model.

Researchers from Google found that this method improves task performance in arithmetic, common sense, and symbolic reasoning datasets.

Chain of thought (CoT) prompting teaches Language Models (LMs) to decompose a reasoning task into a series of intermediate steps.

It is demonstrated that this prompting significantly increases the task accuracy of large language models (LLMs) across common sense, symbolic and mathematical reasoning datasets.

However, the reasoning capabilities of smaller LMs do not improve with CoT prompting, mostly producing illogical CoT. Notably, CoT prompting even reduces the accuracy of models with less than 10 billion parameters.

Research attributes this to abilities, such as semantic understanding and symbolic mapping, only emerging in larger-scale models.

Google Research proposes a two-step pipeline for CoT (Chain-Of-Thought) knowledge distillation.

Annotation with CoT Reasoning

  1. Use a teacher model, like PaLM 540B or GPT-3 175B, to annotate an existing supervised dataset with CoT reasoning.
  2. Perform few-shot prompting with 8 examples to generate CoTs, adapting prompts to provide the target answer after the question and before the example CoT. This helps correct small mistakes.
  3. Remove incorrect CoTs based on the target answer to ensure quality (a sketch of this annotation-and-filter step follows below).
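As a rough illustration of this annotation-and-filter step (not the paper's actual code), the sketch below assumes a hypothetical teacher_generate_cot function that calls the teacher model with the adapted 8-shot prompt, and a dataset whose examples carry question and answer fields:

def annotate_with_cot(dataset, teacher_generate_cot):
    """Annotate a supervised dataset with teacher CoTs and keep only the correct ones."""
    distilled = []
    for example in dataset:  # each example is assumed to have "question" and "answer" fields
        # The teacher sees the question *and* the target answer, so its rationale
        # is steered towards the known-correct conclusion.
        cot, predicted_answer = teacher_generate_cot(
            question=example["question"],
            target_answer=example["answer"],
        )
        # Filter: discard CoTs whose final answer does not match the target answer.
        if predicted_answer.strip() == example["answer"].strip():
            distilled.append({"question": example["question"],
                              "cot": cot,
                              "answer": example["answer"]})
    return distilled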

Fine-Tuning the Student Model

  1. Fine-Tune a student model using teacher forcing.
  2. Provide the question as input and the CoT and answer as the target.
  3. This training eliminates the need for prompting during fine-tuning (a sketch of the data formatting follows below).
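A minimal sketch of how the resulting fine-tuning pairs might be laid out, assuming a generic text-to-text format; the field names and the answer template are illustrative, not the paper's specification:

def build_finetuning_pairs(distilled):
    """Turn (question, CoT, answer) triples into input/target pairs for the student model."""
    pairs = []
    for ex in distilled:
        pairs.append({
            # The input is the question alone: no CoT prompt is needed at training time.
            "input": ex["question"],
            # The target is the full rationale followed by the final answer; the student
            # is trained with standard teacher forcing on these target tokens.
            "target": f'{ex["cot"]}\nTherefore, the answer is {ex["answer"]}.',
        })
    return pairs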

An overview of the proposed method is shown in the figure below:



Source link

10Jul

legitimate expectations, legal certainty and economic sanctions. – European Law Blog


Blogpost 35/2024

Disclosure: the author was a member of the Applicant’s counsel team

 

Introduction

This post concerns a question which ought to be of concern to all who practise in or study EU law:  does the EU administrative law acquis provide the Union’s courts with the tools they need to supervise the exercise of Union power across a range of competences which were simply not in contemplation at the time the acquis was developed?  There are two prompts for this post.

The first prompt is Joana Mendes’ recent (European Constitutional Law Review. 2022;18(4):706-736) and persuasive demonstration of how the current EU administrative law acquis grew up as a result of a “symbiosis of judicial and scholarly developments” in the pre-Maastricht era. The result was that, by the late 1980s there was a consensus that the subjugation of EU institutions to administrative law constraints (as then understood and theorised) had become “an essential aspect of the EC’s legitimacy”. Mendes argues (again persuasively) that this consensus and the principles which underlay it were the product of (amongst other things) the “institutional and legal reality” of what was then the European Community – i.e. “a functional polity whose interventionist institutional and decision-making structures were created for the establishment and functioning of a common market”. Mendes concludes by urging scholarly (and, perhaps, judicial) “self-reflection” as to whether this framework for analysis remains “fit for purpose” in an EU with competences far beyond what those pioneering scholars and jurists had conceived of.

The second prompt is the General Court’s recent decision in Case T-426/21 Nizar Assaad v Council ECLI:EU:T:2023:114. Here, the Court was asked to apply two core components of the administrative law acquis (the principles of legitimate expectation and legal certainty) in a context which would have been inconceivable to the Court at the time the underlying legal principles were developed – targeted economic sanctions introduced to further a foreign policy objective of the Union as a whole. The Assaad decision provides an opportunity for reflection of the type urged by Mendes and, it is argued, indicates that the Court is capable of standing back and interrogating the principles which underlay the early decisions establishing the EU administrative law framework, and how they ought to apply in the much changed context of Union activity in the Lisbon era.

 

Background to the Assaad case

The Applicant in the Nizar Assaad case was Mr Nizar Assaad, a dual citizen of Canada and Syria. Mr Assaad was a prominent businessman who resided in Syria until the uprising in 2011 when he left and relocated to Beirut and Dubai. As will become apparent, Mr Assaad was never involved in politics and had no connection to the Syrian regime. Mr Assaad’s business interests from 2000 onwards were largely outside Syria, and he had no business connections in Syria at all following the 2011 uprising. Rather, he had the ill-fortune to have a surname which bore (in English transliteration) a passing similarity to that of the Syrian president Bashar al-Assad.

The story begins in August 2011 when the Council added an individual identified as “Nizar Al-Assaad” as “entry 36” to the list of those subject to the EU’s Syrian sanctions regime, which is set out in Annex II to Regulation (EU) No 36/2012 concerning restrictive measures in view of the situation in Syria. The Applicant knew that entry 36 could not relate to him as he had not done any of the things suggested in the accompanying reasons, nor did he satisfy any of the listing criteria. However, since the Council had (it might be said, in dereliction of its duty to list individuals in compliance with the principle of legal certainty) given no identifying information, there was a real risk that third parties would conclude that he was the person listed at entry 36. Unsurprisingly, this was of the utmost concern to the Applicant, not least because he risked the severe reputational impact of third parties misapprehending that he was associated with President Assad’s regime. Furthermore, there was a risk that third parties would (wrongly) conclude that he was subject to the strictures of the sanctions regime, including the far-reaching consequences of a complete EU-wide freezing of all his assets and economic resources and of being prevented from entering or travelling through any EU Member State.

The Applicant’s representatives tried repeatedly to contact the Council with a view to clarification, but to no avail. The Applicant then brought an application for annulment in respect of entry 36, on the basis that he was self-evidently not the person referred to. The Council did not dispute this. Rather, the Council wrote to the Applicant confirming that “the targeted person is President Al-Assad’s cousin” and that the Applicant was “not the subject of the listing”, although he has a “similar name”. Entry 36 was clarified, and the General Court concluded that the annulment application was inadmissible as the Applicant was not the addressee of the measure: Assaad v Council (T‑550/11, not published, EU:T:2012:266).

There the story should have ended. Indeed, there was every indication that it would. For the subsequent decade, whenever there was any confusion as to who was identified in entry 36, the Council made clear that it was not the Applicant. Occasionally, this confusion was the result of administrative errors by the Council. While this was a matter of unneeded stress and inconvenience to the Applicant, the Council always responded by making clear that the Applicant was not the man referred to in entry 36.

Against that background (and at the risk of understatement), it was a matter of surprise to the Applicant when in February 2021 the Council wrote to him maintaining that, contrary to everything it had said to him, the Court, and the world at large over the previous decade, the Council had decided that he had in fact been the person listed since 2011. Furthermore, the Council asserted that it was “maintaining” his listing, and that it would be amending the published statement of reasons to make this clear.

 

The application for annulment

The Applicant immediately brought an application for annulment, the primary ground being that the Council had made a manifest error of assessment. The Applicant established that he was not a person to whom the Syrian sanctions regime could apply: he was not associated with the Syrian regime, did not have any ties (professional or personal) to either President Assad’s family or the Makhlouf family and did not have business interests in Syria at all (still less in a prominent capacity). The Court agreed, and annulled the listing on the basis that it could not be supported in fact (even given the very large margin that the Court accords to the Council in such matters).

The Court did not, however, let matters rest there. The Court went on to find that the Council’s conduct had been in breach of the applicant’s legitimate expectations and of the related principle of legal certainty. It is the Court’s approach to these issues which presents an opportunity for reflection of the kind urged by Mendes.

 

Assessment of the Court’s approach

As Mendes notes, the principles of legitimate expectation came to form part of the corpus of EU administrative law as a result of the “transplanting” into EU law of principles deriving from the domestic administrative law of member states. Following that transplant, the underlying EU legal principles of legitimate expectation were settled in a line of pre-Maastricht decisions which establish that, where a Union institution considers that it has adopted an “incorrect position”, it will be permitted to resile from that position within a reasonable period, but only where that would not frustrate the legitimate expectations of the individual concerned (or those of third parties) who had been led to rely on the lawfulness of their conduct. Where a Union institution “finds that a measure which it has just adopted is tainted by illegality” it will have a right to withdraw that measure only “within a reasonable period”. Even then “that right may be restricted by the need to fulfil the legitimate expectations of a beneficiary of the measure, who has been led to rely on the lawfulness thereof”: Case C-365/89 Cargill v Produktschap voor Margarine, Vetten en Oliën paragraph 18, citing Case 14/81 Alpha Steel v Commission.

All very well in circumstances where the contested act concerned steel quotas (Alpha Steel) or agricultural subsidies to a legal person (Cargill). But how does the principle apply where the Union contends that it was previously mistaken as to a matter as serious as whether the Applicant was a supporter or beneficiary of the Syrian regime who is to be treated as, in effect, persona non grata? Does one apply the same approach? Does one give the Council a greater freedom to correct what it contends are errors? Does one weigh the interests of the affected individual differently?

Returning to the Nizar Assaad case, the Council (for its part) denied that there was any retrospectivity at all. The Council’s argument was that because economic sanctions operated only prospectively, there could be no question of retrospectivity. In their telling, it was only if the contested measure could be said to have retrospective economic consequences that the principle would bite. One can see the logic of the Council’s position, having regard to the circumstances of the (pre-Maastricht) cases which established this principle.

The Court’s reasons, however, evince a sensitivity to the quite different context of the case before them, and in particular what one might call the human context of the contested measure. This is evident in the terms in which the Court rejected the Council’s restrictive approach, concluding that while it was “true that, in principle, the funds of a person or entity may be frozen only for the future”, this was not a principled answer to the Applicant’s claim. Accordingly the Court went on (at para 198) to hold that “confining the effects of the 2021 measures solely to the freezing of the applicant’s funds and economic resources, or to restrictions on admission to the territory of the Member States, wrongly disregards the effects which the adoption of those measures has had on the applicant’s overall legal situation and, in particular, on his reputation and integrity”. This was undoubtedly correct – as the Court went on to explain at para 200: “in establishing, by means of the 2021 measures, that the applicant’s name has been included on the lists at issue since the 2011 measures, the Council asserts that, since that date, the applicant has had links with the Syrian regime and has carried out the various acts which justified his name being entered on the lists at issue and retained since then. Such an assertion is sufficient to alter retroactively the applicant’s legal situation, quite beyond the freezing of his funds alone.”

The same sensitivity is evident in the Court’s treatment of the Council’s alternative submission, which was that any retrospectivity or frustration of the Applicant’s legitimate expectations could be justified by reference to the Council’s objectives. Again, the objectives relied upon (“consolidating and supporting human rights and international humanitarian law”) were of a nature far removed from the economic context in which the Court’s general principles were settled. The Court accepted that correction of errors in sanctioning measures could contribute to this aim, and that this was in the general interest (para 219). Nevertheless, the Court concluded that the Council “failed to have due regard for the applicant’s legitimate expectations by adopting restrictive measures with retroactive effect against him” (para 241). Here, again, the Court demonstrated an acute awareness of the human situation before it, reasoning (at para 246) that the Council’s error correction prerogative was “subject to limits, namely observance of the principle of the protection of legitimate expectations”, cautioning that “the compliance with which is all the more important” in the sanctions context “since the consequences for the legal situation of the persons and entities concerned by the restrictive measures are not insignificant”. The Court’s assessment, like the author’s above, might, perhaps be accused of understatement.

 

Conclusion

Standing back, the Court’s approach in the instant case is – it is suggested – an instance of the kind of self-reflection urged by Mendes. Faced with a situation far removed from that considered in the leading authorities, the Court stood back and interrogated what principles underlay those decisions, and how they ought to apply in the much changed context of the Union activity in issue in the particular case before it. To return to one of Mendes’ themes, such introspection (judicial and scholarly) is not only welcome, but also essential to the continued legitimacy of the EU legal order.



Source link

10Jul

Our Human Creativity Is Becoming More Uniform Due To ChatGPT | by Cobus Greyling | Jul, 2024


Our ideas, solutions and artistic expressions are becoming less original & diverse.

One of the primary use-cases for ChatGPT is to use it to become more creative, or to generate new and unique ideas.

This recent study considers how, instead of ChatGPT making us more creative, it leads to similar ideas across disparate users. It also leads us to approach and experience the creative process differently.

In a study with 36 participants, the researchers found that users of ChatGPT produced less semantically distinct ideas compared to alternative creativity support tools (CSTs).

Additionally, ChatGPT users generated more detailed ideas but felt less responsible for them.

The challenge is that a large number of us are using highly centralised, data-driven AI systems (such as ChatGPT) for our creative ideas and content. This leads, amongst other things, to decreased diversity in the results of our creative processes.

Below, the representation on the right shows users making use of ChatGPT, producing much more homogeneous ideas at a group level. The users on the left are making use of more traditional creativity support tools, producing more diverse ideas.



Source link

09Jul

Doping: A Technique to Test Outlier Detectors | by W Brett Kennedy | Jul, 2024


Using well-crafted synthetic data to compare and evaluate outlier detectors

This article continues my series on outlier detection, following articles on Counts Outlier Detector and Frequent Patterns Outlier Factor, and provides another excerpt from my book Outlier Detection in Python.

In this article, we look at the issue of testing and evaluating outlier detectors, a notoriously difficult problem, and present one solution, sometimes referred to as doping. Using doping, real data rows are modified (usually) randomly, but in such a way as to ensure they are likely an outlier in some regard and, as such, should be detected by an outlier detector. We’re then able to evaluate detectors by assessing how well they are able to detect the doped records.

In this article, we look specifically at tabular data, but the same idea may be applied to other modalities as well, including text, image, audio, network data, and so on.

Likely, if you’re familiar with outlier detection, you’re also familiar, at least to some degree, with predictive models for regression and classification problems. With these types of problems, we have labelled data, and so it’s relatively simple to evaluate each option when tuning a model (selecting the best pre-processing, features, hyper-parameters, and so on); and it’s also relatively easy to estimate a model’s accuracy (how it will perform on unseen data): we simply use a train-validation-test split, or better, use cross validation. As the data is labelled, we can see directly how the model performs on labelled test data.

But, with outlier detection, there is no labelled data and the problem is significantly more difficult; we have no objective way to determine if the records scored highest by the outlier detector are, in fact, the most statistically unusual within the dataset.

With clustering, as another example, we also have no labels for the data, but it is at least possible to measure the quality of the clustering: we can determine how internally consistent the clusters are and how different the clusters are from each other. Using some distance metric (such as Manhattan or Euclidean distances), we can measure how close records within a cluster are to each other and how far apart clusters are from each other.

So, given a set of possible clusterings, it’s possible to define a sensible metric (such as the Silhouette score) and determine which is the preferred clustering, at least with respect to that metric. That is, much like prediction problems, we can calculate a score for each clustering, and select the clustering that appears to work best.
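As a minimal illustration of this kind of comparison (using synthetic data and scikit-learn, purely for the sake of example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data standing in for any numeric feature matrix
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Score two candidate clusterings and keep the one with the higher silhouette score
labels_k3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels_k5 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print("k=3 silhouette:", silhouette_score(X, labels_k3))
print("k=5 silhouette:", silhouette_score(X, labels_k5))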

With outlier detection, though, we have nothing analogous to this we can use. Any system that seeks to quantify how anomalous a record is, or that seeks to determine, given two records, which is the more anomalous of the two, is effectively an outlier detection algorithm in itself.

For example, we could use entropy as our outlier detection method, and can then examine the entropy of the full dataset as well as the entropy of the dataset after removing any records identified as strong outliers. This is, in a sense, valid; entropy is a useful measure of the presence of outliers. But we cannot assume entropy is the definitive definition of outliers in this dataset; one of the fundamental qualities of outlier detection is that there is no definitive definition of outliers.

In general, if we have any way to try to evaluate the outliers detected by an outlier detection system (or, as in the previous example, the dataset with and without the identified outliers), this is effectively an outlier detection system in itself, and it becomes circular to use this to evaluate the outliers found.

Consequently, it’s quite difficult to evaluate outlier detection systems and there’s effectively no good way to do so, at least using the real data that’s available.

We can, though, create synthetic test data (in such a way that we can assume the synthetically-created data are predominantly outliers). Given this, we can determine the extent to which outlier detectors tend to score the synthetic records more highly than the real records.

There are a number of ways to create synthetic data we cover in the book, but for this article, we focus on one method, doping.

Doping data records refers to taking existing data records and modifying them slightly, typically changing the values in just one, or a small number, of cells per record.

If the data being examined is, for example, a table related to the financial performance of a company comprised of franchise locations, we may have a row for each franchise, and our goal may be to identify the most anomalous of these. Let’s say we have features including:

  • Age of the franchise
  • Number of years with the current owner
  • Number of sales last year
  • Total dollar value of sales last year

As well as some number of other features.

A typical record may have values for these four features such as: 20 years old, 5 years with the current owner, 10,000 unique sales in the last year, for a total of $500,000 in sales in the last year.

We could create a doped version of this record by adjusting a value to a rare value, for example, setting the age of the franchise to 100 years. This can be done, and will provide a quick smoke test of the detectors being tested — likely any detector will be able to identify this as anomalous (assuming a value of 100 is rare), though we may be able to eliminate some detectors that are not able to detect this sort of modified record reliably.
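A quick sketch of such a smoke test, assuming a hypothetical franchise_df DataFrame with an Age column (neither appears in the worked example later in this article):

# Copy a handful of records and push a single feature to an extreme value
smoke_doped = franchise_df.sample(5, random_state=0).copy()
smoke_doped["Age"] = 100
# Any detector worth keeping should rank these doped rows near the top of its scores;
# detectors that miss even these can be dropped from further testing.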

We would not necessarily remove from consideration the type of outlier detector (e.g. kNN, Entropy, or Isolation Forest) itself, but the combination of type of outlier detector, pre-processing, hyperparameters, and other properties of the detector. We may find, for example, that kNN detectors with certain hyperparameters work well, while those with other hyperparameters do not (at least for the types of doped records we test with).

Usually, though, most testing will be done creating more subtle outliers. In this example, we could change the dollar value of total sales from 500,000 to 100,000, which may still be a typical value, but the combination of 10,000 unique sales with $100,000 in total sales is likely unusual for this dataset. That is, much of the time with doping, we are creating records that have unusual combinations of values, though unusual single values are sometimes created as well.

When changing a value in a record, it’s not known specifically how the row will become an outlier (assuming it does), but we can assume most tables have associations between the features. Changing the dollar value to 100,000 in this example, may (as well as creating an unusual combination of number of sales and dollar value of sales) quite likely create an unusual combination given the age of the franchise or the number of years with the current owner.

With some tables, however, there are no associations between the features, or only a few weak ones. This is rare, but can occur. With this type of data, there is no concept of unusual combinations of values, only unusual single values. Although rare, this is actually a simpler case to work with: it’s easier to detect outliers (we simply check for single unusual values), and it’s easier to evaluate the detectors (we simply check how well we are able to detect unusual single values). For the remainder of this article, though, we will assume there are some associations between the features and that most anomalies would be unusual combinations of values.
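As an aside, before returning to the association-based setting, a minimal sketch of that simpler single-value check (robust z-scores are just one reasonable choice, not something prescribed here):

import numpy as np
import pandas as pd

def single_value_outliers(series: pd.Series, k: float = 3.5):
    """Flag values far from the median of a single feature, in robust z-score terms."""
    med = series.median()
    mad = (series - med).abs().median()
    mad = mad if mad > 0 else 1e-9   # guard against a zero median absolute deviation
    robust_z = 0.6745 * (series - med) / mad
    return series.index[np.abs(robust_z) > k]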

Most outlier detectors (with a small number of exceptions) have separate training and prediction steps. In this way, most are similar to predictive models. During the training step, the training data is assessed and the normal patterns within the data (for example, the normal distances between records, the frequent item sets, the clusters, the linear relationships between features, etc.) are identified. Then, during the prediction step, a test set of data (which may be the same data used for training, or may be separate data) is compared against the patterns found during training, and each row is assigned an outlier score (or, in some cases, a binary label).

Given this, there are two main ways we can work with doped data:

  1. Including doped records in the training data

We may include some small number of doped records in the training data and then use this data for testing as well. This tests our ability to detect outliers in the currently-available data. This is a common task in outlier detection: given a set of data, we often wish to find the outliers in this dataset (though may wish to find outliers in subsequent data as well — records that are anomalous relative to the norms for this training data).

Doing this, we can test with only a small number of doped records, as we do not wish to significantly affect the overall distributions of the data. We then check if we are able to identify these as outliers. One key test is to include both the original and the doped version of the doped records in the training data in order to determine if the detectors score the doped versions significantly higher than the original versions of the same records.

We also, though, wish to check that the doped records are generally scored among the highest (with the understanding that some original, unmodified records may legitimately be more anomalous than the doped records, and that some doped records may not be anomalous).

Given that we can test only with a small number of doped records, this process may be repeated many times.

The doped data is used, however, only for evaluating the detectors in this way. When creating the final model(s) for production, we will train on only the original (real) data.

If we are able to reliably detect the doped records in the data, we can be reasonably confident that we are able to identify other outliers within the same data, at least outliers along the lines of the doped records (but not necessarily outliers that are substantially more subtle — hence we wish to include tests with reasonably subtle doped records).

2. Including doped records only in the testing data

It is also possible to train using only the real data (which we can assume is largely non-outliers) and then test with both the real and the doped data. This allows us to train on relatively clean data (some records in the real data will be outliers, but the majority will be typical, and there is no contamination due to doped records).

It also allows us to test with the actual outlier detector(s) that may, potentially, be put in production (depending how well they perform with the doped data — both compared to the other detectors we test, and compared to our sense of how well a detector should perform at minimum).

This tests our ability to detect outliers in future data. This is another common scenario with outlier detection: where we have one dataset that can be assumed to be reasonably clean (either free of outliers, or containing only a small, typical set of outliers, and without any extreme outliers) and we wish to compare future data to this.

Training with real data only and testing with both real and doped, we may test with any volume of doped data we wish, as the doped data is used only for testing and not for training. This allows us to create a large, and consequently, more reliable test dataset.

There are a number of ways to create doped data, including several covered in Outlier Detection in Python, each with its own strengths and weaknesses. For simplicity, in this article we cover just one option, where the data is modified in a fairly random manner: where the cell(s) modified are selected randomly, and the new values that replace the original values are created randomly.

Doing this, it is possible for some doped records to not be truly anomalous, but in most cases, assigning random values will upset one or more associations between the features. We can assume the doped records are largely anomalous, though, depending how they are created, possibly only slightly so.

Here we go through an example, taking a real dataset, modifying it, and testing to see how well the modifications are detected.

In this example, we use a dataset available on OpenML called abalone (https://www.openml.org/search?type=data&sort=runs&id=42726&status=active, available under public license).

Although other preprocessing may be done, for this example, we one-hot encode the categorical features and use RobustScaler to scale the numeric features.

We test with three outlier detectors, Isolation Forest, LOF, and ECOD, all available in the popular PyOD library (which must be pip installed to execute).

We also use an Isolation Forest to clean the data (remove any strong outliers) before any training or testing. This step is not necessary, but is often useful with outlier detection.

This is an example of the second of the two approaches described above, where we train on the original data and test with both the original and doped data.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import RobustScaler
import matplotlib.pyplot as plt
import seaborn as sns
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.ecod import ECOD

# Collect the data
data = fetch_openml('abalone', version=1)
df = pd.DataFrame(data.data, columns=data.feature_names)
df = pd.get_dummies(df)
df = pd.DataFrame(RobustScaler().fit_transform(df), columns=df.columns)

# Use an Isolation Forest to clean the data
clf = IForest()
clf.fit(df)
if_scores = clf.decision_scores_
top_if_scores = np.argsort(if_scores)[::-1][:10]
clean_df = df.loc[[x for x in df.index if x not in top_if_scores]].copy()

# Create a set of doped records
doped_df = df.copy()
for i in doped_df.index:
    col_name = np.random.choice(df.columns)
    med_val = clean_df[col_name].median()
    if doped_df.loc[i, col_name] > med_val:
        doped_df.loc[i, col_name] = \
            clean_df[col_name].quantile(np.random.random()/2)
    else:
        doped_df.loc[i, col_name] = \
            clean_df[col_name].quantile(0.5 + np.random.random()/2)

# Define a method to test a specified detector.
def test_detector(clf, title, df, clean_df, doped_df, ax):
    clf.fit(clean_df)
    df = df.copy()
    doped_df = doped_df.copy()
    df['Scores'] = clf.decision_function(df)
    df['Source'] = 'Real'
    doped_df['Scores'] = clf.decision_function(doped_df)
    doped_df['Source'] = 'Doped'
    test_df = pd.concat([df, doped_df])
    sns.boxplot(data=test_df, orient='h', x='Scores', y='Source', ax=ax)
    ax.set_title(title)

# Plot each detector in terms of how well they score doped records
# higher than the original records
fig, ax = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(10, 3))
test_detector(IForest(), "IForest", df, clean_df, doped_df, ax[0])
test_detector(LOF(), "LOF", df, clean_df, doped_df, ax[1])
test_detector(ECOD(), "ECOD", df, clean_df, doped_df, ax[2])
plt.tight_layout()
plt.show()

Here, to create the doped records, we copy the full set of original records, so we have an equal number of doped and original records. For each doped record, we select one feature randomly to modify. If the original value is above the median, we create a random value below the median; if the original value is below the median, we create a random value above it.

In this example, we see that IF does score the doped records higher, but not significantly so. LOF does an excellent job distinguishing the doped records, at least for this form of doping. ECOD is a detector that detects only unusually small or unusually large single values and does not test for unusual combinations. As the doping used in this example does not create extreme values, only unusual combinations, ECOD is unable to distinguish the doped from the original records.

This example uses boxplots to compare the detectors, but normally we would use an objective score, very often the AUROC (Area Under the Receiver Operating Characteristic curve), to evaluate each detector. We would also typically test many combinations of model type, pre-processing, and parameters.
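For instance, a minimal sketch of the AUROC calculation, assuming real_scores and doped_scores hold a detector's outlier scores for the real and doped test records (for example, from decision_function()):

import numpy as np
from sklearn.metrics import roc_auc_score

# Label doped rows as the "positive" class and measure how well the
# detector's scores rank doped records above real ones
labels = np.concatenate([np.zeros(len(real_scores)),    # 0 = real
                         np.ones(len(doped_scores))])   # 1 = doped
scores = np.concatenate([real_scores, doped_scores])
print("AUROC:", roc_auc_score(labels, scores))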

The above method will tend to create doped records that violate the normal associations between features, but other doping techniques may be used to make this more likely. For example, considering first categorical columns, we may select a new value such that both:

  1. The new value is different from the original value
  2. The new value is different from the value that would be predicted from the other values in the row. To achieve this, we can create a predictive model that predicts the current value of this column, for example a Random Forest Classifier.

With numeric data, we can achieve the equivalent by dividing each numeric feature into four quartiles (or some number of quantiles, but at least three). For each new value in a numeric feature, we then select a value such that both:

  1. The new value is in a different quartile than the original
  2. The new value is in a different quartile than what would be predicted given the other values in the row.

For example, if the original value is in Q1 and the predicted value is in Q2, then we can select a value randomly in either Q3 or Q4. The new value will, then, most likely go against the normal relationships among the features.
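A rough sketch of this prediction-aware doping for one numeric column follows; the column name (TotalSales), the DataFrame, and the choice of a Random Forest regressor are all assumptions for illustration:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

col = "TotalSales"
X_other = df.drop(columns=[col])
predictor = RandomForestRegressor(random_state=0).fit(X_other, df[col])
preds = predictor.predict(X_other)

# Quartile boundaries of the original column
q_edges = df[col].quantile([0.25, 0.50, 0.75]).values

def quartile_of(value):
    """Return 0-3 for the quartile the value falls into."""
    return int(np.searchsorted(q_edges, value))

doped = df.copy()
for pos, i in enumerate(doped.index):
    q_orig = quartile_of(df.loc[i, col])
    q_pred = quartile_of(preds[pos])
    # Choose a quartile different from both the original and the predicted value,
    # then sample a replacement value from within that quartile
    candidates = [q for q in range(4) if q not in (q_orig, q_pred)]
    q_new = np.random.choice(candidates)
    low, high = q_new / 4, (q_new + 1) / 4
    doped.loc[i, col] = df[col].quantile(np.random.uniform(low, high))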

There is no definitive way to say how anomalous a record is once doped. However, we can assume that on average the more features modified, and the more they are modified, the more anomalous the doped records will be. We can take advantage of this to create not a single test suite, but multiple test suites, which allows us to evaluate the outlier detectors much more accurately.

For example, we can create a set of doped records that are very obvious (multiple features are modified in each record, each to a value significantly different from the original value), a set of doped records that are very subtle (only a single feature is modified, not significantly from the original value), and many levels of difficulty in between. This can help differentiate the detectors well.

So, we can create a suite of test sets, where each test set has a (roughly estimated) level of difficulty based on the number of features modified and the degree they’re modified. We can also have different sets that modify different features, given that outliers in some features may be more relevant, or may be easier or more difficult to detect.
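Reusing the doping logic from the earlier listing, one way to sketch such a suite (the difficulty labels are only rough, and df and clean_df refer to the frames created above):

import numpy as np

def dope_records(source_df, clean_df, n_features):
    """Create a doped copy of source_df, modifying n_features randomly chosen
    features per record, using the same above/below-median swap as earlier."""
    doped = source_df.copy()
    for i in doped.index:
        cols = np.random.choice(source_df.columns, size=n_features, replace=False)
        for col_name in cols:
            med_val = clean_df[col_name].median()
            if doped.loc[i, col_name] > med_val:
                doped.loc[i, col_name] = clean_df[col_name].quantile(np.random.random() / 2)
            else:
                doped.loc[i, col_name] = clean_df[col_name].quantile(0.5 + np.random.random() / 2)
    return doped

# More modified features per record roughly means more obvious outliers,
# so difficulty increases as we move towards a single modified feature
test_suites = {f"{k}_features_modified": dope_records(df, clean_df, k) for k in (3, 2, 1)}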

It is, though, important that any doping performed represents the type of outliers that would be of interest if they did appear in real data. Ideally, the set of doped records also covers well the range of what you would be interested in detecting.

If these conditions are met, and multiple test sets are created, this is very powerful for selecting the best-performing detectors and estimating their performance on future data. We cannot predict how many outliers will be detected or what levels of false positives and false negatives we will see — these depend greatly on the data we encounter, which in an outlier detection context is very difficult to predict. But we can have a decent sense of the types of outliers we are likely to detect and those we are likely to miss.

Possibly more importantly, we are also well situated to create an effective ensemble of outlier detectors. In outlier detection, ensembles are typically necessary for most projects. Given that some detectors will catch some types of outliers and miss others, while other detectors will catch and miss other types, we can usually only reliably catch the range of outliers we’re interested in using multiple detectors.

Creating ensembles is a large and involved area in itself, and is different than ensembling with predictive models. But, for this article, we can indicate that having an understanding of what types of outliers each detector is able to detect gives us a sense of which detectors are redundant and which can detect outliers most others are not able to.

It is difficult to assess how well any given outlier detector detects outliers in the current data, and even harder to assess how well it may do on future (unseen) data. It is also very difficult, given two or more outlier detectors, to assess which would do better, again on both the current and on future data.

There are, though, a number of ways we can estimate these using synthetic data. In this article, we went over, at least quickly (skipping a lot of the nuances, but covering the main ideas), one approach based on doping real records and evaluating how well we’re able to score these more highly than the original data. Although not perfect, these methods can be invaluable and there is very often no other practical alternative with outlier detection.

All images are from the author.



Source link

08Jul

Evaluating The Quality Of RAG & Long-Context LLM Output | by Cobus Greyling | Jul, 2024


Salesforce propose to leverage the task of summarisation as a testbed for evaluating long-context models and RAG systems.

Summarisation requires reasoning over a long context and a careful understanding of the relative importance of content.

The Problem Identified:

Prior work on summarisation evaluation, particularly in evaluating the relevance of summaries, has focused on single-document summarisation or tasks in which the input content is on the order of 1,000–2,000 tokens.

Longer conversational and multi-document news summarisation is still often limited to around 10k tokens.

A major problem in summarisation evaluation is the reliance on low-quality reference summaries and automatic metrics that poorly correlate with human judgments.

Traditional evaluations compare candidate summaries to gold-standard references, assuming higher overlap indicates better quality. This approach is unreliable, especially for long-context settings where high-quality references are expensive to obtain. Even the best automatic metrics for content coverage often fail to correlate well with human judgments.

To address these issues, Salesforce use synthetic data generation.

Considering the image below, the approach from Salesforce involves creating a large corpus of documents (“Haystack”) on a given topic, ensuring certain signals repeat across documents.

By controlling which insights appear in which documents, Salesforce can automatically determine the relevant insights for a search query. The SummHay task requires systems to summarise these insights and cite their sources. Summaries are evaluated based on coverage of expected insights and accuracy in citing source documents.
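As a purely illustrative sketch of that scoring idea (not Salesforce's actual evaluation code), one could compute a coverage score over the expected insights and a citation score over the expected source documents:

def score_summary(covered_insights, expected_insights, cited_docs, expected_docs):
    """Toy scoring of a system summary against a Haystack-style answer key.

    covered_insights: set of insight IDs judged to be covered by the summary
    expected_insights: set of insight IDs relevant to the query
    cited_docs / expected_docs: dicts mapping insight ID -> set of document IDs
    """
    coverage = len(covered_insights & expected_insights) / len(expected_insights)

    citation_f1s = []
    for insight in covered_insights & expected_insights:
        cited, expected = cited_docs.get(insight, set()), expected_docs[insight]
        correct = len(cited & expected)
        if not cited or correct == 0:
            citation_f1s.append(0.0)
            continue
        precision, recall = correct / len(cited), correct / len(expected)
        citation_f1s.append(2 * precision * recall / (precision + recall))
    citation = sum(citation_f1s) / len(citation_f1s) if citation_f1s else 0.0
    return coverage, citation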



Source link

07Jul

Understanding and Implementing Medprompt | by Anand Subramanian | Jul, 2024


We now perform choice shuffling ensembling by shuffling the order of answer choices for each test question, creating multiple variants of the same question. The LLM is then prompted with these variants, along with the corresponding few-shot exemplars, to generate reasoning steps and an answer for each variant. Finally, we perform a majority vote over the predictions from all variants and select the final prediction.

The code related to this implementation can be found at this github repo link.

We use the MedQA [6] dataset for implementing and evaluating Medprompt. We first define helper functions for parsing the jsonl files.

import json

def write_jsonl_file(file_path, dict_list):
    """
    Write a list of dictionaries to a JSON Lines file.

    Args:
        file_path (str): The path to the file where the data will be written.
        dict_list (list): A list of dictionaries to write to the file.
    """
    with open(file_path, 'w') as file:
        for dictionary in dict_list:
            json_line = json.dumps(dictionary)
            file.write(json_line + '\n')

def read_jsonl_file(file_path):
    """
    Parses a JSONL (JSON Lines) file and returns a list of dictionaries.

    Args:
        file_path (str): The path to the JSONL file to be read.

    Returns:
        list of dict: A list where each element is a dictionary representing
        a JSON object from the file.
    """
    jsonl_lines = []
    with open(file_path, 'r', encoding="utf-8") as file:
        for line in file:
            json_object = json.loads(line)
            jsonl_lines.append(json_object)

    return jsonl_lines

Implementing Self-Generated CoT

For our implementation, we utilize the training set from MedQA. We implement a zero-shot CoT prompt and process all the training questions. We use GPT-4o in our implementation. For each question, we generate the CoT and the corresponding answer. We define a prompt which is based on the template provided in the Medprompt paper.

system_prompt = """You are an expert medical professional. You are provided with a medical question with multiple answer choices.
Your goal is to think through the question carefully and explain your reasoning step by step before selecting the final answer.
Respond only with the reasoning steps and answer as specified below.
Below is the format for each question and answer:

Input:
## Question: {{question}}
{{answer_choices}}

Output:
## Answer
(model generated chain of thought explanation)
Therefore, the answer is [final model answer (e.g. A,B,C,D)]"""

def build_few_shot_prompt(system_prompt, question, examples, include_cot=True):
    """
    Builds the few-shot prompt.

    Args:
        system_prompt (str): Task Instruction for the LLM
        question (dict): The test question for which to create a query, formatted as
            required by `create_query`.
        examples (list of dict): Few-shot exemplars, each containing a question,
            its CoT reasoning and the answer index.
        include_cot (bool): Whether to include the CoT reasoning in the exemplar answers.

    Returns:
        list of dict: A list of messages, including a system message defining
        the task, user/assistant messages for the exemplars, and a user message
        with the input question.
    """
    messages = [{"role": "system", "content": system_prompt}]

    for elem in examples:
        messages.append({"role": "user", "content": create_query(elem)})
        if include_cot:
            messages.append({"role": "assistant", "content": format_answer(elem["cot"], elem["answer_idx"])})
        else:
            answer_string = f"""## Answer\nTherefore, the answer is {elem["answer_idx"]}"""
            messages.append({"role": "assistant", "content": answer_string})

    messages.append({"role": "user", "content": create_query(question)})
    return messages

def get_response(messages, model_name, temperature=0.0, max_tokens=10):
    """
    Obtains the responses/answers of the model through the chat-completions API.

    Args:
        messages (list of dict): The built messages provided to the API.
        model_name (str): Name of the model to access through the API
        temperature (float): A value between 0 and 1 that controls the randomness of the output.
            A temperature value of 0 ideally makes the model pick the most likely token, making the outputs deterministic.
        max_tokens (int): Maximum number of tokens that the model should generate

    Returns:
        str: The response message content from the model.
    """
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens
    )
    return response.choices[0].message.content

We also define helper functions for parsing the reasoning and the final answer option from the LLM response.

import re

def matches_ans_option(s):
    """
    Checks if the string starts with the specific pattern 'Therefore, the answer is [A-Z]'.

    Args:
        s (str): The string to be checked.

    Returns:
        bool: True if the string matches the pattern, False otherwise.
    """
    return bool(re.match(r'^Therefore, the answer is [A-Z]', s))

def extract_ans_option(s):
    """
    Extracts the answer option (a single capital letter) from the start of the string.

    Args:
        s (str): The string containing the answer pattern.

    Returns:
        str or None: The captured answer option if the pattern is found, otherwise None.
    """
    match = re.search(r'^Therefore, the answer is ([A-Z])', s)
    if match:
        return match.group(1)  # Returns the captured alphabet
    return None

def matches_answer_start(s):
    """
    Checks if the string starts with the markdown header '## Answer'.

    Args:
        s (str): The string to be checked.

    Returns:
        bool: True if the string starts with '## Answer', False otherwise.
    """
    return s.startswith("## Answer")

def validate_response(s):
    """
    Validates a multi-line string response that it starts with '## Answer' and ends with the answer pattern.

    Args:
        s (str): The multi-line string response to be validated.

    Returns:
        bool: True if the response is valid, False otherwise.
    """
    file_content = s.split("\n")

    return matches_ans_option(file_content[-1]) and matches_answer_start(s)

def parse_answer(response):
    """
    Parses a response that starts with '## Answer', extracting the reasoning and the answer choice.

    Args:
        response (str): The multi-line string response containing the answer and reasoning.

    Returns:
        tuple: A tuple containing the extracted CoT reasoning and the answer choice.
    """
    split_response = response.split("\n")
    assert split_response[0] == "## Answer"
    cot_reasoning = "\n".join(split_response[1:-1]).strip()
    ans_choice = extract_ans_option(split_response[-1])
    return cot_reasoning, ans_choice

We now process the questions in the training set of MedQA. We obtain CoT responses and answers for all questions and store them to a folder.

import os
from tqdm import tqdm

train_data = read_jsonl_file("data/phrases_no_exclude_train.jsonl")

cot_responses = []
# os.mkdir("cot_responses")
existing_files = os.listdir("cot_responses/")

# build_zero_shot_prompt (not shown in this excerpt) is assumed to build the prompt
# without few-shot exemplars, analogous to build_few_shot_prompt above.
for idx, item in enumerate(tqdm(train_data)):
    if str(idx) + ".txt" in existing_files:
        continue

    prompt = build_zero_shot_prompt(system_prompt, item)
    try:
        response = get_response(prompt, model_name="gpt-4o", max_tokens=500)
        cot_responses.append(response)
        with open(os.path.join("cot_responses", str(idx) + ".txt"), "w", encoding="utf-8") as f:
            f.write(response)
    except Exception as e:
        print(str(e))
        cot_responses.append("")

We now iterate across all the generated responses to check if they are valid and adhere to the prediction format defined in the prompt. We discard responses that do not conform to the required format. After that, we check the predicted answers against the ground truth for each question and only retain questions for which the predicted answers match the ground truth.

questions_dict = []
ctr = 0
for idx, question in enumerate(tqdm(train_data)):
    file = open(os.path.join("cot_responses/", str(idx) + ".txt"), encoding="utf-8").read()
    if not validate_response(file):
        continue

    cot, pred_ans = parse_answer(file)

    dict_elem = {}
    dict_elem["idx"] = idx
    dict_elem["question"] = question["question"]
    dict_elem["answer"] = question["answer"]
    dict_elem["options"] = question["options"]
    dict_elem["cot"] = cot
    dict_elem["pred_ans"] = pred_ans
    questions_dict.append(dict_elem)

filtered_questions_dict = []
for item in tqdm(questions_dict):
    pred_ans = item["options"][item["pred_ans"]]
    if pred_ans == item["answer"]:
        filtered_questions_dict.append(item)

Implementing the KNN model

Having processed the training set and obtained the CoT responses for all these questions, we now embed all questions using the text-embedding-ada-002 model from OpenAI.

def get_embedding(text, model="text-embedding-ada-002"):
    return client.embeddings.create(input=[text], model=model).data[0].embedding

for item in tqdm(filtered_questions_dict):
    item["embedding"] = get_embedding(item["question"])
    inv_options_map = {v: k for k, v in item["options"].items()}
    item["answer_idx"] = inv_options_map[item["answer"]]

We now train a KNN model using these question embeddings. This acts as a retriever at inference time, helping us retrieve the datapoints from the training set that are most similar to a given test question.

import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.array([d["embedding"] for d in filtered_questions_dict])
indices = list(range(len(filtered_questions_dict)))

knn = NearestNeighbors(n_neighbors=5, algorithm='auto', metric='cosine').fit(embeddings)

Implementing the Dynamic Few-Shot and Choice Shuffling Ensemble Logic

We can now run inference. We subsample 500 questions from the MedQA test set for our evaluation. For each question, we retrieve the 5 most similar questions from the train set using the KNN module, along with their respective CoT reasoning steps and predicted answers. We construct a few-shot prompt using these examples.

For each question, we also shuffle the order of the options 5 times to create different variants. We then utilize the constructed few-shot prompt to get the predicted answer for each of the variants with shuffled options.

import random

def shuffle_option_labels(answer_options):
    """
    Shuffles the options of the question.

    Parameters:
        answer_options (dict): A dictionary with the options.

    Returns:
        dict: A new dictionary with the shuffled options.
    """
    options = list(answer_options.values())
    random.shuffle(options)
    labels = [chr(i) for i in range(ord('A'), ord('A') + len(options))]
    shuffled_options_dict = {label: option for label, option in zip(labels, options)}

    return shuffled_options_dict

test_samples = read_jsonl_file("final_processed_test_set_responses_medprompt.jsonl")

for question in tqdm(test_samples, colour="green"):
    question_variants = []
    prompt_variants = []
    cot_responses = []
    question_embedding = get_embedding(question["question"])
    distances, top_k_indices = knn.kneighbors([question_embedding], n_neighbors=5)
    top_k_dicts = [filtered_questions_dict[i] for i in top_k_indices[0]]
    question["outputs"] = []

    # Create 5 variants of the question with shuffled answer options
    for idx in range(5):
        question_copy = question.copy()
        shuffled_options = shuffle_option_labels(question["options"])
        inv_map = {v: k for k, v in shuffled_options.items()}

        question_copy["options"] = shuffled_options
        question_copy["answer_idx"] = inv_map[question_copy["answer"]]
        question_variants.append(question_copy)
        prompt = build_few_shot_prompt(system_prompt, question_copy, top_k_dicts)
        prompt_variants.append(prompt)

    # Get a CoT response for each shuffled variant
    for prompt in tqdm(prompt_variants):
        response = get_response(prompt, model_name="gpt-4o", max_tokens=500)
        cot_responses.append(response)

    # Parse each response and map the predicted label back to the option text
    for question_sample, answer in zip(question_variants, cot_responses):
        if validate_response(answer):
            cot, pred_ans = parse_answer(answer)
        else:
            cot = ""
            pred_ans = ""

        question["outputs"].append({"question": question_sample["question"], "options": question_sample["options"], "cot": cot, "pred_ans": question_sample["options"].get(pred_ans, "")})

We now evaluate the results of Medprompt over the test set. For each question, we have five predictions generated through the ensemble logic. We take the mode, or most frequently occurring prediction, for each question as the final prediction and evaluate the performance. Two edge cases are possible here:

  1. Two different answer options are predicted two times each, with no clear winner.
  2. There is an error with the response generated, meaning that we don’t have a predicted answer option.

For both of these edge cases, we consider the question to be wrongly answered by the LLM.

from collections import Counter

def find_mode_string_list(string_list):
    """
    Finds the most frequently occurring strings.

    Parameters:
        string_list (list of str): A list of strings.

    Returns:
        list of str or None: A list containing the most frequent string(s) from the input list.
        Returns None if the input list is empty.
    """
    if not string_list:
        return None

    string_counts = Counter(string_list)
    max_freq = max(string_counts.values())
    mode_strings = [string for string, count in string_counts.items() if count == max_freq]
    return mode_strings

ctr = 0
for item in test_samples:
    pred_ans = [x["pred_ans"] for x in item["outputs"]]
    freq_ans = find_mode_string_list(pred_ans)

    if len(freq_ans) > 1:
        final_prediction = ""
    else:
        final_prediction = freq_ans[0]

    if final_prediction == item["answer"]:
        ctr += 1

print(ctr / len(test_samples))

We evaluate the performance of Medprompt with GPT-4o in terms of accuracy on the MedQA test subset. Additionally, we benchmark the performance of Zero-shot prompting, Random Few-Shot prompting, and Random Few-Shot with CoT prompting.

Results of our evaluation (Image by Author)

We observe that Medprompt and Random Few-Shot CoT prompting outperform the Zero-shot and Few-Shot prompting baselines. However, surprisingly, we notice that Random Few-Shot CoT outperforms Medprompt here. This could be due to a couple of reasons:

  1. The original Medprompt paper benchmarked the performance of GPT-4. We observe that GPT-4o outperforms GPT-4T and GPT-4 on various text benchmarks significantly (https://openai.com/index/hello-gpt-4o/), indicating that Medprompt could have a lesser effect on a stronger model like GPT-4o.
  2. We restrict our evaluation to 500 questions subsampled from MedQA. The Medprompt paper evaluates other Medical MCQA datasets and the full version of MedQA. Evaluating GPT-4o on the complete versions of the datasets could give a better picture of the overall performance.

Medprompt is an interesting framework for creating sophisticated prompting pipelines, particularly for adapting a generalist LLM to a specific domain without the need for fine-tuning. It also highlights the considerations involved in deciding between prompting and fine-tuning for various use cases. Exploring how far prompting can be pushed to enhance LLM performance is important, as it offers a resource and cost-efficient alternative to fine-tuning.

[1] Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., … & Horvitz, E. (2023). Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452. (https://arxiv.org/abs/2311.16452)

[2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837. (https://openreview.net/pdf?id=_VjQlMeSB_J)

[3] Gekhman, Z., Yona, G., Aharoni, R., Eyal, M., Feder, A., Reichart, R., & Herzig, J. (2024). Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?. arXiv preprint arXiv:2405.05904. (https://arxiv.org/abs/2405.05904)

[4] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., … & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180. (https://www.nature.com/articles/s41586-023-06291-2)

[5] Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., … & Natarajan, V. (2023). Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617. (https://arxiv.org/abs/2305.09617)

[6] Jin, D., Pan, E., Oufattole, N., Weng, W. H., Fang, H., & Szolovits, P. (2021). What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14), 6421. (https://arxiv.org/abs/2009.13081) (Original dataset is released under a MIT License)



Source link

06Jul

LLM Disruption in Chatbot Development Frameworks | by Cobus Greyling | Jul, 2024


Large Language Models (LLMs) have introduced more human-like and contextually aware interactions, allowing developers to build sophisticated chatbots with minimal effort. This innovation reduces the need for extensive rule-based programming and enables rapid deployment across various applications. However, there are challenges…

The image above outlines the various elements and features constituting a Large Language Model (LLM).

The challenge lies in accessing each of these features at the appropriate time, ensuring stability, predictability, and, to a certain extent, reproducibility.

Many organisations and technology providers are navigating the transition from traditional chatbots to incorporating Large Language Models with varying levels of success.

✨ Traditional Chatbot IDEs

Traditional chatbots typically consist of four basic elements (a toy sketch of how they fit together follows the list below):

  1. Intent Detection (NLU)
  2. Entity Extraction (NLU)
  3. Response Messages (Message Abstraction Layer)
  4. Dialog-turn/Conversation State Management (Dialog Flow Control)
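A toy sketch of how these four components traditionally fit together (the intents, patterns and responses below are invented for illustration and do not represent any particular framework):

import re

# 1. Intent detection (NLU): keyword matching stands in for a trained classifier
INTENT_KEYWORDS = {
    "greet": ["hello", "hi"],
    "check_balance": ["balance", "how much"],
}

# 3. Message abstraction layer: fixed response templates per dialog turn
RESPONSES = {
    "greet": "Hello! How can I help you today?",
    "check_balance": "The balance for account {account} will be sent to you shortly.",
    "fallback": "Sorry, I did not understand that.",
}

def detect_intent(text):
    text = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "fallback"

# 2. Entity extraction (NLU): a naive account-number pattern
def extract_entities(text):
    match = re.search(r"\b(\d{6,})\b", text)
    return {"account": match.group(1)} if match else {}

# 4. Dialog state management: a single dict tracking the conversation
state = {"last_intent": None}

def handle_turn(user_text):
    intent = detect_intent(user_text)
    entities = extract_entities(user_text)
    state["last_intent"] = intent
    template = RESPONSES.get(intent, RESPONSES["fallback"])
    try:
        return template.format(**entities)
    except KeyError:  # a required slot is missing: ask a follow-up question
        return "Which account number would you like to check?"

print(handle_turn("Hi there"))
print(handle_turn("How much is the balance on 12345678?"))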

Recently, numerous attempts have been made to reimagine this structure, aiming to loosen the rigidity of hard-coded and fixed architectural components.

✨ Natural Language Understanding (NLU)

The NLU engine is the only “AI” component of the chatbot, responsible for detecting intents and entities from the input.

It includes a GUI for defining training data and managing the model.

The typical advantages of NLU engines are:

  • Numerous open-source models.
  • Small footprint and not resource-intensive, making local and edge installations feasible.
  • No-code UIs.
  • Extensive corpus of named entities due to long-term usage.
  • Predefined entities and training data for specific verticals, such as banking, help desks, HR, etc.
  • Rapid model training, with the ability to train models multiple times per day in a production environment.
  • Initial introduction of LLMs to generate training data for NLU models, based on existing conversations and sample data (see the sketch after this list).
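As a sketch of that last point, an LLM can be asked to expand a handful of seed utterances into additional NLU training examples. The `call_llm` helper, the prompt wording and the JSON output format are assumptions for illustration.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat API."""
    raise NotImplementedError

def generate_nlu_examples(intent: str, seed_utterances: list[str], n: int = 20) -> list[str]:
    """Ask an LLM to expand a few seed utterances into extra NLU training examples."""
    prompt = (
        f"You are generating training data for the chatbot intent '{intent}'.\n"
        "Existing examples:\n- " + "\n- ".join(seed_utterances) + "\n"
        f"Write {n} new user utterances with the same intent, varied in wording and length. "
        "Return them as a JSON array of strings."
    )
    reply = call_llm(prompt)
    return json.loads(reply)  # assumes the model returned valid JSON

# generate_nlu_examples("check_balance",
#                       ["What's my balance?", "How much is in my account?"])
```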

✨ Conversation Flow & Dialog Management

The dialog flow and logic are designed and built within a no-code to low-code GUI.

The flow and logic follow a predefined path with specific logic points.

The conversation progresses based on input data matching certain criteria at these logic gates.

Efforts have been made to introduce flexibility to the flow, aiming to add some semblance of intelligence.

✨ Message Abstraction Layer

The message abstraction layer holds predefined bot responses for each dialog turn. These responses are fixed, with templates sometimes used to insert data and create personalised messages.

Managing these messages becomes challenging as the chatbot application grows, and because the messages are static, their total number can become very large.

Introducing multilingual chatbots adds considerable complexity. Whenever the tone or persona of the chatbot needs to change, all of these messages must be revisited and updated.

This is also one of the areas where LLMs were first introduced to leverage the power of Natural Language Generation (NLG).
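A minimal sketch of this idea: keep one canonical template per dialog turn and let an LLM adapt tone and language at render time, instead of hand-maintaining every variant. The `call_llm` helper, the template keys and the prompt wording are illustrative assumptions.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's chat API."""
    raise NotImplementedError

# Static message abstraction layer: one canonical message per dialog turn.
TEMPLATES = {
    "order_confirmed": "Your order {order_id} has been confirmed.",
    "delivery_delay": "Your delivery is delayed by {days} days. We apologise.",
}

def render(turn_key: str, persona: str = "friendly", language: str = "English", **slots) -> str:
    """Fill the template, then let the LLM adapt tone and language instead of
    maintaining a separate hand-written variant for every persona and locale."""
    base = TEMPLATES[turn_key].format(**slots)
    prompt = (
        f"Rewrite the following chatbot message in a {persona} tone, in {language}, "
        f"without changing its factual content:\n{base}"
    )
    return call_llm(prompt)

# render("delivery_delay", persona="formal", language="German", days=2)
```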

✨ Out-Of-Domain

Out-Of-Domain questions are handled by knowledge bases & semantic similarity searches.

Knowledge bases were primarily used for QnA, with the solutions relying on semantic search. In many regards, this could be considered an early version of Retrieval-Augmented Generation (RAG).
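A minimal sketch of this early-RAG pattern, assuming a hypothetical `embed` function: embed the knowledge-base questions once, then answer out-of-domain questions by cosine similarity against the stored entries, with a threshold deciding when nothing is close enough.

```python
def embed(text: str) -> list[float]:
    """Hypothetical embedding function; replace with a real embedding model."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Small QnA knowledge base; the entries are invented for illustration.
KB = [
    {"question": "What are your opening hours?",
     "answer": "We are open 9:00-17:00 on weekdays."},
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page."},
]

def build_index(kb):
    """Embed every stored question once, at startup."""
    for entry in kb:
        entry["embedding"] = embed(entry["question"])

def answer_out_of_domain(user_question: str, kb, threshold: float = 0.8):
    """Return the stored answer whose question is semantically closest,
    or None when nothing in the knowledge base is similar enough."""
    q_emb = embed(user_question)
    best = max(kb, key=lambda e: cosine(q_emb, e["embedding"]))
    best_score = cosine(q_emb, best["embedding"])
    return best["answer"] if best_score >= threshold else None

# build_index(KB)
# print(answer_out_of_domain("When are you open?", KB))
```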

In conclusion, the integration of Large Language Models (LLMs) into chatbot development frameworks marks a significant leap forward in creating more human-like and contextually aware interactions.

By reducing the reliance on rigid, rule-based programming, LLMs enable developers to build sophisticated chatbots with greater ease and speed.

However, this transition is not without its challenges.

Accessing and effectively utilising the various features of LLMs while ensuring stability and predictability remains a critical concern.

Organisations and technology providers are actively navigating these complexities as they embrace LLMs, each with varying degrees of success.

As innovations in Natural Language Understanding (NLU) and Natural Language Generation (NLG) continue to evolve, the future promises even more seamless and intelligent chatbot interactions, reshaping how we interact with technology in diverse applications.

👉🏼 Follow me on LinkedIn for updates on Large Language Models

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.




Source link

05Jul

LLM Alignment: Reward-Based vs Reward-Free Methods | by Anish Dubey | Jul, 2024


Optimization methods for LLM alignment

Language models have demonstrated remarkable abilities in producing a wide range of compelling text based on prompts provided by users. However, defining what constitutes “good” text is challenging, as it often depends on personal preferences and the specific context. For instance, in storytelling, creativity is key; in crafting informative content, accuracy and reliability are crucial; and when generating code, ensuring it runs correctly is essential. Hence the “LLM alignment problem,” which refers to the challenge of ensuring that large language models (LLMs) act in ways that are consistent with human values, intentions, and preferences.

Designing a loss function that captures the diverse qualities we value in text — like creativity, accuracy, or executability — is highly complex and often impractical. Concepts like these are not differentiable, so they cannot be back-propagated through and cannot be trained on with a simple next-token objective.

Imagine if we could harness human feedback to evaluate the quality of generated text or, even better, use that feedback as a guiding loss function to improve the model’s performance. This concept is at the heart of Reinforcement Learning from Human Feedback (RLHF). By applying reinforcement learning techniques, RLHF allows us to fine-tune language models based on direct human feedback, aligning the models more closely with nuanced human values and expectations. This approach has opened up new possibilities for training language models that are not only more responsive but also more aligned with the complexity of human preferences.

Below, we will first look at RLHF via reward-based methods and then at RLHF via reward-free methods.

Let’s go through Reinforcement Learning from Human Feedback (RLHF). It consists of three main stages:

  1. Supervised fine tuning
  2. Reward modeling phase
  3. RL fine-tuning phase

Supervised fine tuning

RLHF starts from a pre-trained model that has already been fine-tuned on a high-quality dataset. Its objective is simple: given an input (prompt), it produces an output. The ultimate goal is to further fine-tune this model to produce output according to human preference. Let’s call this the base model for reference. At this point, it is a vanilla base model that is not aware of any human preference.

Reward Modelling Phase

Reward model innovation: This is where the innovation of incorporating reward models into RLHF begins. The idea is that a new LLM, which can be initialised from the above-mentioned base model, learns to generate a human-preference score. It is built on a large language model because it also needs to understand language semantics before it can rate whether an output is human-preferred or not. Since the reward is a scalar, we add a linear layer on top of the LLM to produce a scalar preference score.

Data collection phase: Starting from the supervised fine-tuned base model, the model is asked to generate two outputs for a given input. Example: for an input x, two outputs y1 and y2 are generated by the base model. These outputs are shown to human raters, and the preference between them is recorded.

Training phase: Once the data samples are collected from the data collection phase, the reward model is trained with the following prompt: “Given the following input: , LLM generated output. Can you rate the performance of the output?”. The model outputs a reward r, and from the data collection phase we already know which output humans preferred. This can then be back-propagated through the loss function and the model can be trained. Below is the objective loss function the model optimises through back-propagation:

Equation from this paper: https://arxiv.org/pdf/2305.18290
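The referenced equation, which does not survive in this extract, is the pairwise reward-modelling loss from the cited paper; reconstructed here to match the notation below:

```latex
\mathcal{L}_R(r_\phi, \mathcal{D}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```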

Notation:

  • rΦ(x, y): the reward model parameterised by Φ, which estimates the reward. Parameterised means we don’t know the actual values and they need to be optimised via the above equation. This is the reward LLM itself. Typically most of the LLM parameters are frozen and only a few are left to change; the most important is the linear layer added on top, which does most of the learning to score the output.
  • Ɗ: A dataset of triplets (x, yw, yl) where x: input, yw: the winner output and yl: the loser output
  • σ: the sigmoid function which maps the difference in reward to a probability (0–1)
  • The expectation over (x, yw, yl) ~ Ɗ means x, yw and yl are all sampled from Ɗ

Example scenario: Imagine you’re training a reward model to evaluate responses. You have pairs of responses to a given prompt, and human feedback tells you which response is better. For context, x(“What is the capital of France?”), you have yw(“The capital of France is Paris.”) as winner and yl(“The capital of France is Berlin.” ) as loser. The reward model should eventually learn to give higher reward for “The capital of France is Paris.” output when compared to “The capital of France is Berlin.” output if “What is the capital of France?” input is given.
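A minimal PyTorch sketch of this reward-modelling step. The tiny embedding-plus-linear backbone stands in for a real (mostly frozen) LLM, and the random token ids stand in for tokenised (prompt, response) pairs; only the pairwise loss mirrors the objective above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy stand-in for an LLM backbone with a scalar head; in practice the
    backbone would be a pretrained transformer with most parameters frozen."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)  # the linear layer that maps to a scalar reward

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        h = self.embed(token_ids).mean(dim=1)  # crude pooling over the sequence
        return self.head(h).squeeze(-1)        # (batch,) scalar rewards

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: token ids for (prompt + winner) and (prompt + loser) sequences.
winner_ids = torch.randint(0, 1000, (8, 32))
loser_ids = torch.randint(0, 1000, (8, 32))

r_w = model(winner_ids)   # r_phi(x, y_w)
r_l = model(loser_ids)    # r_phi(x, y_l)
loss = -F.logsigmoid(r_w - r_l).mean()  # pairwise preference loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```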

RL fine-tuning phase

Reinforcement learning idea: Now that the base model and reward model are trained, the question is how to leverage the reward model’s score to update the base model’s parameters so that they reflect human preference. Since the reward model only outputs a scalar score, we cannot use simple back-propagation to update the base model’s parameters. We need another technique, and this is where reinforcement learning comes in: it lets the base model change its parameters based on the reward model’s score. This is done through PPO (Proximal Policy Optimization). Understanding the core architecture of PPO is not required to grasp this concept, so we will not cover it here; at a high level, the idea is that PPO can use a scalar score to update the base model’s parameters. Now let’s understand how the base and reward models are combined to make the base model learn human preference.

RL fine-tuning idea: In reinforcement learning, we have actions, a state space and rewards. The idea is to come up with a policy for the actions an agent can take in the space so as to maximise the reward. This can get quite complicated, but in a simplified sense, π is the policy, which here is simply our base LLM. Πref denotes the base (reference) model and ΠӨ denotes the optimal model we are trying to obtain. We need to find ΠӨ (the base model’s neural network weights will be fine-tuned) which gives human-preferred output. We just don’t know ΠӨ yet, and the goal is to find this optimal model.

RL training and feedback loop phase: An input x is given to two policy models, Πref (the baseline model) and ΠӨ (the optimal model we are trying to obtain). Initially both models are identical. Feeding x to each model individually gives two corresponding outputs. The output from the ΠӨ model is also fed to the reward model (input: x, output: y, as discussed above), which returns the reward score rΦ(x, y). Now we have three things: the output from the baseline model, the output from the optimal model, and a reward score for the optimal model’s output. We are optimising two things: maximising the reward, because we ultimately want the model to be as close as possible to human preference, and minimising the divergence from the baseline model. Maximising the reward is easy since it is already a scalar quantity, but how do we minimise the divergence between the baseline and optimal models? Here we use the Kullback–Leibler divergence, which estimates the difference between two continuous probability distributions. Let’s take a deeper look into the objective loss function.

Equation from this paper: https://arxiv.org/pdf/2305.18290
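The referenced equation, missing from this extract, is the KL-constrained reward-maximisation objective; reconstructed here to match the notation below:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}
\big[ r_\phi(x, y) \big]
\;-\;
\beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
```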

Notation:

  • rΦ(x, y): a scalar value for an input x and output y (from optimal model). To be explicit, output from the optimal model is fed into the reward model.
  • Dkl (ΠӨ (y | x) || Πref (y | x)): the Kullback–Leibler divergence between two probability distributions. Each model defines a probability distribution over tokens, and the KL term estimates how far these distributions are from each other.
  • β: a hyperparameter that determines how important it is to keep the optimal model close to the baseline model.

Example scenario: Imagine you ask (“What is the capital of France?”), Πref (baseline model) says “The capital of France is Berlin.” and ΠӨ (optimal model) says “There are 3 capitals, Paris, Versailles, and Lyon, but Paris is considered as the official capital”. Now rΦ(“x: What is the capital…”, “y: There are 3 capital..”) should give a low score as it is less human-preferred, and the Kullback–Leibler divergence of (ΠӨ (y | x) || Πref (y | x)) should be high as well, since the probability distributions of the two outputs differ. Hence the loss will be high from both terms. We do not want the model to optimise only for reward but also to stay close to the baseline model, hence both terms are used in the objective. In the next iteration, say ΠӨ (optimal model) outputs “The capital of France is Delhi”; in this case the model has learned to stay closer to Πref (baseline model) and to output a format closer to the baseline model, but the reward component will still be low. Hopefully, in the third iteration ΠӨ (optimal model) learns to output “The capital of France is Paris”, with a higher reward and a model output aligning closely with the baseline model.

The diagram below helps illustrate the logic. I also highly recommend going through the RLHF post from Hugging Face.

Image by author, inspired by https://huggingface.co/blog/rlhf

With RLHF using a reward-based method in mind, let’s move to the reward-free method. According to the paper: “our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences”. Very complicated to understand, but let’s try to break this down in simple phases in the next section.

Reward-free method’s key idea: In RLHF, a separate reward model is trained, which is expensive and costly to maintain. Is there a way to avoid training a new reward model and instead use the existing base model to reach a new optimal model? This is exactly what the reward-free method does: it avoids training a new reward model and instead rewrites the equation so that there is no reward-model term in the loss function of DPO (Direct Preference Optimization). One way to think about this is that we need to reach the optimal model policy (ΠӨ) from the base model (Πref). It can be reached either by optimising in reward-function space, which serves as a proxy on the way to the optimal policy, or by directly learning a mapping from reward to policy and optimising the policy itself. This is exactly what the authors did by removing the reward-function component from the loss function and substituting it directly with the model policy parameters. This is what the authors mean when they say “leverage an analytical mapping from reward function to optimal policies …. into a loss function over policies”. This is the core innovation of the paper.

DPO training and feedback loop phase: The baseline model Πref is given an input x and asked to produce two outputs (y1 and y2). Human raters look at x, y1 and y2 and decide on the winning output yw and the losing output yl. An offline dataset of triplets (x, yw, yl) is collected. With this information, we know which answers are winning (human preferred) and which are losing (not preferred). Now the same input x is given to the two policy models, Πref (baseline) and ΠӨ (optimal); initially both models are kept the same for training purposes. For both the reference and the optimal model, we then compute how probable the winning and losing answers are under each model, and compare these probabilities between the two models. Let’s take a deeper look into the objective loss function.

Equation from this paper: https://arxiv.org/pdf/2305.18290
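The referenced equation, missing from this extract, is the DPO loss from the cited paper; reconstructed here to match the notation below:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```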
  • ΠӨ (yw | x): the probability the optimal model assigns to the winning output yw given the input x; this is a scalar value. The loss is computed from the four combinations Πref (yw | x), Πref (yl | x), ΠӨ (yw | x) and ΠӨ (yl | x), comparing the log-ratio of optimal to reference probabilities for the winning output against the same log-ratio for the losing output.
  • β: a hyperparameter that determines how important it is to keep the optimal model close to the baseline model.
Image by author, inspired by https://huggingface.co/blog/rlhf
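A minimal PyTorch sketch of one DPO training step on toy models, matching the loss above. The tiny language model, the random token ids and the hyperparameters are illustrative assumptions; in practice both policies would be copies of a pretrained LLM, with the reference copy frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Toy causal LM stand-in; in practice both policy and reference
    would be the same pretrained transformer, with the reference frozen."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def sequence_logprob(self, token_ids):
        # Predict each token from the previous one and sum the log-probabilities.
        logits = self.lm_head(self.embed(token_ids[:, :-1]))     # (batch, seq-1, vocab)
        logps = F.log_softmax(logits, dim=-1)
        target = token_ids[:, 1:].unsqueeze(-1)                   # next-token targets
        return logps.gather(-1, target).squeeze(-1).sum(dim=-1)   # (batch,)

policy = TinyLM()                         # pi_theta, being optimised
reference = TinyLM()                      # pi_ref, kept frozen
reference.load_state_dict(policy.state_dict())
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
beta = 0.1

# Toy batch of (prompt + winner) and (prompt + loser) token ids.
winner_ids = torch.randint(0, 1000, (8, 32))
loser_ids = torch.randint(0, 1000, (8, 32))

# Log-ratios between policy and reference for winner and loser responses.
ratio_w = policy.sequence_logprob(winner_ids) - reference.sequence_logprob(winner_ids)
ratio_l = policy.sequence_logprob(loser_ids) - reference.sequence_logprob(loser_ids)
loss = -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```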
Naturally, the question comes down to which one is better: RLHF through a reward-based method using PPO, or the reward-free method using DPO. There is no definitive answer. A recent paper, “Is DPO Superior to PPO for LLM Alignment?” (paper link), concludes that PPO is generally better than DPO and that DPO suffers more heavily from out-of-distribution data. “Out-of-distribution” means the human preference data differs from the data the baseline model was trained on, which can happen when the base model is trained on one dataset while preference data is collected on another.

Overall, the jury is still out on which approach is better, while companies such as OpenAI, Anthropic and Meta leverage both RLHF via PPO and DPO as tools for LLM alignment.



Source link

05Jul

meaningful ban or paper tiger? – European Law Blog


Blogpost 34/2024

After years of anticipation, the final text of the Artificial Intelligence Act (‘the Act’) was approved by the Council on May 21st of this year. The landmark regulation, first of its kind, positions the EU at the forefront of the global effort to establish a comprehensive legal framework on artificial intelligence. The Act aims to safeguard fundamental rights and promote the development of safe and trustworthy AI by adopting a risk-based approach, mandating stricter scrutiny for higher-risk applications. At the highest level of risk, the Act contains a list of “prohibited uses” of artificial intelligence (Article 5) due to their potentially detrimental consequences for fundamental rights and Union values, including human dignity, freedom, and equality (see Recital 28). While the Act prohibits the use of specific instances of AI-based predictive policing, we should seriously consider whether the ban will have meaningful effects in practice, or may become a mere instrument of symbolic politics. Leaning towards the latter, this blog cautiously suggests that this concern reflects broader questions about the Act’s commitment to developing “human-centric” AI and whether it effectively encompasses all individuals within its protective scope.

Predictive policing is not defined in the Act, but a leading definition provided by Perry et al. is ‘the use of analytical techniques to identify promising targets’ to forecast criminal activity. As highlighted by Litska Strikwerda (Dutch only), this may involve identifying potential crime locations (predictive mapping), as well as assessing the likelihood that an individual will either become a victim of a crime or commit a crime (predictive identification). While predictive identification has significant potential as a crime prevention tool, it has faced substantial criticism, particularly concerning potential human rights implications. For example, the extensive data collection and processing involved in predictive identification raise serious concerns about data protection and privacy, including the correct legal basis for such data processing and the potential intrusion into individuals’ private lives. Additionally, the discriminatory nature of algorithms can exacerbate existing structural injustices and biases within the criminal justice system. Another issue is the presumption of innocence, given that predictive identification approaches criminality from an almost entirely opposite perspective, labelling individuals as potential criminals before they have engaged in any criminal conduct. Recital 42 of the Act cites this concern in justifying the prohibition on AI-based predictive identification.

Initially classified as a high-risk application of artificial intelligence under the Commission’s proposal, predictive identification is now designated as a prohibited use of artificial intelligence under Article 5(1)(d) of the Act. This post seeks to demonstrate the potential limitations of the ban’s effectiveness through a critical analysis of this provision. After providing a brief background on the ban, including the substantive lobbying by various human rights organisations after earlier versions of the Act failed to include predictive identification as a prohibited use, the provision and its implications will be analysed in depth. First, this post points out the potential for a “human in the loop” workaround due to the prohibition’s reference to “profiling”. Secondly, it will discuss how the Act’s general exemption clause for national security purposes contributes to a further weakening of the ban’s effectiveness.

 

The Ban in the Act

The practice of predictive identification had been under scrutiny for years before the final adoption of the AI Act. For example, following the experiments with “living labs” in the Netherlands, Amnesty International published an extensive report on the human rights consequences of predictive policing. The report highlights one experiment in particular, namely the “Sensing Project”, which involved collecting data about passing cars (such as license plate numbers and brands) to predict the occurrence of petty crimes such as pickpocketing and shoplifting. The idea was that certain indicators, such as the type of car, could help identify potential suspects. However, the system disproportionately targeted cars with Eastern European number plates, assigning them a higher risk score. This bias highlights the potentially discriminatory effects of predictive identification. Earlier that same year (2020), a Dutch lower court ruled that the fraud detection tool SyRI violated the right to private life under the ECHR, as it failed to fulfil the “necessary in a democratic society” condition under Article 8(2) ECHR. This tool, which used “foreign names” and “dual nationality” as possible risk indicators, was a key element in the notorious child benefits scandal in the Netherlands.

Despite widespread concerns, a ban on predictive policing was not included in the Commission’s initial proposal of the Act. Shortly after the publication of the proposal, several human rights organizations, including Fair Trials, started intensive lobbying for a ban on predictive identification to be included in the Act. Subsequently, the IMCO-LIBE report recommended prohibiting predictive identification under Article 5 of the Act, citing its potential to violate the presumption of innocence and human dignity, as well as its discriminatory potential. Lobbying efforts continued vigorously throughout the negotiations (see this signed statement of 100+ human rights organizations).

Eventually, the clause was incorporated in the Parliament’s resolution and is now part of the final version of the Act, reading as follows:

[ The following AI practices shall be prohibited: ] the placing on the market, the putting into service for this specific purpose, or the use of an AI system(s) for making risk assessments of natural persons in order to assess or predict the likelihood of a natural person committing a criminal offence, based solely on the profiling of a natural person or on assessing their personality traits and characteristics. [ … ] This prohibition shall not apply to AI systems used to support the human assessment of the involvement of a person in a criminal activity, which is already based on objective and verifiable facts directly linked to a criminal activity. (Article 5(1)(d)).

 

The ”Human in the Loop” Problem

The prohibition applies to instances of predictive identification based solely on profiling, or on the assessment of a natural person’s personality traits and/or characteristics. The specifics of these terms are unclear. For the definition of “profiling”, the Act (Article 3(52)) refers to the definition given in the GDPR, which defines it as any automated processing of personal data to evaluate personal aspects relating to a natural person (Article 4(4) GDPR).

The first question that arises here relates to the difference between profiling and the assessment of personality traits and characteristics. Inger Marie Sunde has highlighted this ambiguity, noting that profiling inherently involves evaluating personal characteristics. A difference between “profiling” and “assessing” may lie in the degree of human involvement. While profiling implies an (almost) entirely automated process with no meaningful human intervention, there is no clear indication on the level of human involvement required for “assessing”.

A deeper concern lies in the question as to what should be understood by “automated processing”. The test for a decision to qualify as solely automated, including profiling, is that there has been no meaningful human intervention in the decision-making process. However, the exact meaning of “meaningful” here has not been spelled out. For example, the CJEU in the SCHUFA Holding case confirmed automated credit scoring to be a solely automated decision (in the context of Article 22 GDPR), but did not elaborate on the details. While it is clear that the human role should be active and real, not symbolic and marginal (e.g. pressing a button), a large grey area remains (for more, see also here). In the context of predictive identification, this creates uncertainty as to the extent of the human involvement required, opening the door for a potential “human in the loop” defence. Law enforcement authorities could potentially circumvent the ban on predictive identification by demonstrating “meaningful” human involvement in the decision-making process. This problem is further aggravated by the lack of a clear threshold for the definition of “meaningful” in this context.

The second paragraph of the prohibition on predictive identification in the Act states that the prohibition does not apply to AI systems supporting human assessment of criminal involvement, provided this is based on “objective and verifiable facts directly linked to a criminal activity”. This could be understood as an instance of predictive identification where the human involvement is sufficiently “meaningful”. Nevertheless, there is room for improvement in terms of clarity. Additionally, this conception of predictive identification does not reflect its default operational mode – where AI generates predictions first, followed by human review or verification – but rather the opposite scenario.

In the event that an instance of predictive identification does not fit the definition of a prohibited use, this does not result in the entire practice being effectively free from restrictions. Other instances of predictive identification, not involving profiling or the assessment of an individual’s personality traits, may be classified as “high-risk” applications under the Act (See Article 6 in conjunction with Annex III 6(d)). This distinction between prohibited and high-risk practices may hinge on whether the AI system operates solely automatically, or includes meaningful human input. If the threshold for meaningful human intervention is not clearly defined, there is a risk that predictive identification systems with a degree of human involvement just beyond being “marginal and symbolic” might be classified as high-risk rather than prohibited. This is significant, as high-risk systems are simply subject to certain strict safety and transparency rules, rather than being outright prohibited.

In this regard, another issue that should be considered is the requirement of human-oversight. According to Article 14 of the Act, high-risk applications of AI should be subject to “human-oversight” to guarantee their safe use, ensuring that such systems are used responsibly and ethically. However, as is the case with the requirement of “meaningful human intervention”, the exact meaning of “human oversight” is also unclear (as explained thoroughly in an article by Johann Laux). As a consequence, even in instances where predictive identification does not classify as a prohibited use under Article 5(1)(d) of the Act, but is considered high-risk instead, uncertainty about the degree of human involvement required remains.

Finally, it should be noted that even if the AI only has a complementary role relative to the human, another problem exists. It pertains to the potential biases of the actual “human in the loop”. Recent studies suggest humans are more likely to agree with AI outcomes that align with their personal predispositions. This is a problem distinct from the inherent biases present in predictive identification systems (as demonstrated by, for example, the aforementioned cases of the “Sensing Project” and the Dutch childcare benefits scandal). Indeed, even the human in the loop “safeguard” may not offer the requisite counterbalance to the use of predictive identification systems.

 

General clause on national security purposes

Further, the Act includes a general exemption for AI systems used for national security purposes. As national security is beyond the EU’s competences (Article 4(2) TEU), the Act does not apply to potential uses of AI in the context of the national security of the Member States (Article 2 of the Act). It is uncertain to what extent this exception may influence the ban on predictive identification. National security purposes are not uniformly understood, although established case law has confirmed several instances, such as espionage and (incitement to- and approval of) terrorism to be included within its meaning (see this report by the FRA). Yet, given the degree of discretion granted to the Member States in this area, it is uncertain which instances of predictive identification might be excluded from the Act’s application.

Several NGOs focusing on human rights (particularly in the digital realm) have raised concerns about this potential loophole, arguing that the exemption under the Act is broader than permitted under European law. Article 19, an advocacy group for freedom of speech and information, has argued that such a broad exemption contradicts European law, stating that ‘the adopted text makes the national security a largely digital rights-free zone’. Similar concerns have been raised by Access Now. The fear is that Member States might invoke the national security exemption to justify the use of predictive identification techniques under the guise of safeguarding national security. This could undermine the effectiveness of the ban in practice, allowing for the continued use of such technologies despite their potential to infringe upon fundamental rights. For example, the use of predictive policing in counter-terrorism efforts could disproportionately target minority communities and individuals from non-Western backgrounds. Combined with the existing concerns about biases and the potential for discriminatory outcomes in the context of predictive identification, this is a serious ground for concern.

Rather than a blanket exemption, national security considerations should be addressed on a case-by-case basis. This approach finds support in the case law of the ECJ, including its ruling in La Quadrature du Net, where it reiterated that the exemption is not by definition synonymous with the absolute non-applicability of European law.

 

Conclusion

While at first sight the ban on predictive identification appears to be a significant win for fundamental rights, its effectiveness is notably weakened by the potential for a “human in the loop” defence and the national security exemption. The human in the loop defence may allow law enforcement authorities to engage in predictive identification if they assert human involvement, and the lack of a clear definition of “meaningful human intervention” limits the provision’s impact. Additionally, the exemption for AI systems offering mere assistance to human decision-making still allows for human biases to influence outcomes, and the lack of clarity regarding the standards of “human oversight” for high-risk applications is not promising either. The national security exemption further undermines the ban’s effectiveness. Given the broad and ambiguous nature of the exemption, there is significant scope for Member States to invoke it.

Combined, these loopholes risk reducing the ban on predictive policing to a symbolic gesture rather than a substantial protection of fundamental rights. In addition to the well-documented downsides of predictive identification, there is an inherent tension between these limitations in the ban, and the overarching goals of the AI Act, including its commitment to safeguard humanity and develop AI that benefits everyone (see for example Recitals 1 and 27 of the Act). Predictive identification may aim to enhance safety by mitigating the threat of potential crime, but it may very well fail to benefit those already marginalised, for example minority communities and individuals from non-Western backgrounds, who are at higher risk of being unfairly targeted, for example under the guise of counter-terrorism efforts. Addressing these issues requires clearer definitions, stricter guidelines on human involvement, and a nuanced approach to national security exceptions. Without such changes, the current ban on this instance of predictive policing risks becoming merely symbolic: a paper tiger failing to confront the real challenges and potential harms of the use of AI in law enforcement.



Source link

04Jul

TinyStories Is A Synthetic DataSet Created With GPT-4 & Used To Train Phi-3 | by Cobus Greyling | Jul, 2024


The Small Language Model from Microsoft, called Phi-3, was trained using a novel dataset called TinyStories.

Microsoft used the following recipe to create synthetic training data for the Phi-3 language model (a minimal sketch of the generation loop follows the list):

  1. Microsoft researchers created a discrete dataset based on 3,000 words, comprising roughly equal numbers of nouns, verbs, and adjectives.
  2. They then instructed an LLM to create children’s stories using one noun, one verb, and one adjective from the list.
  3. This prompt was repeated millions of times over several days, generating millions of tiny children’s stories.
  4. The TinyStories dataset was created to combine all the qualitative elements of natural language, such as grammar, vocabulary, facts, and reasoning.
  5. The main challenge in using large language models for producing training data is generating a dataset that is sufficiently diverse.
  6. This method also forces the LLM to not be too repetitive in the content generated.
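A minimal sketch of the generation loop described above. The word lists are tiny stand-ins for the ~3,000-word vocabulary, and `call_llm` is a hypothetical helper in place of the GPT-3.5/GPT-4 API calls the researchers used.

```python
import random

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with GPT-3.5/GPT-4 via your provider's API."""
    raise NotImplementedError

# Toy word lists standing in for the ~3,000-word vocabulary described above.
NOUNS = ["dog", "ball", "garden", "cookie"]
VERBS = ["jump", "find", "share", "build"]
ADJECTIVES = ["happy", "tiny", "brave", "shiny"]

def generate_tiny_story() -> str:
    noun = random.choice(NOUNS)
    verb = random.choice(VERBS)
    adjective = random.choice(ADJECTIVES)
    prompt = (
        "Write a short story for 3-4 year olds using simple words. "
        f"The story must use the noun '{noun}', the verb '{verb}' and the adjective '{adjective}', "
        "and should have correct grammar and a clear beginning, middle and end."
    )
    return call_llm(prompt)

# Repeating this call millions of times with different word combinations
# yields a diverse synthetic corpus in the spirit of TinyStories.
# stories = [generate_tiny_story() for _ in range(1000)]
```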

The Small Language Model (SLM) Phi-3 was trained on synthetic data generated by GPT-3.5 and GPT-4. Training data created by large language models can often be too repetitive and lack diversity in verbs, nouns, and adjectives.

The dataset needed to include all the qualitative elements of natural language, such as grammar, vocabulary, facts, and reasoning, but it was designed to be smaller, less diverse, and more restricted in content.

The concept of creating a framework or data topology for the LLM to generate synthetic training data is intriguing.

The study indicates that training generative models on TinyStories can typically be completed in less than a day on a single GPU, while still exhibiting behaviours similar to those observed in larger models.

Instead of relying solely on raw web data, the creators of Phi-3 sought high-quality data. Microsoft researchers created a discrete dataset based on 3,000 words, comprising roughly equal numbers of nouns, verbs, and adjectives.

They then instructed a large language model to create children’s stories using one noun, one verb, and one adjective from the list — a prompt repeated millions of times over several days, generating millions of tiny children’s stories.

Small language models are designed to excel at simpler tasks, making them more accessible and easier to use for organisations with limited resources. They can also be fine-tuned more easily to meet specific needs.



Source link
