12Aug

OpenAI Enhanced Their API With Robust Structured Output Capabilities | by Cobus Greyling | Aug, 2024


Previously, two options were available: JSON Mode and Function Calling…

Enabling OpenAI’s JSON mode doesn’t ensure that the output will adhere to a specific predefined JSON schema. It only guarantees that the JSON will be valid and parse without errors.

The challenge with OpenAI’s JSON Mode lies in the significant variability of the JSON output from one inference to the next, which makes it impossible to rely on a consistent, predefined JSON schema.

To clarify, the chat completion API itself doesn’t call any functions, but the model can generate JSON output that you can use in your code to trigger function calls.
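
To illustrate, a minimal sketch of that pattern might look like the following; the get_weather function and the tool-call shape shown here are illustrative assumptions based on the Chat Completions response format, not code from the article:

# Sketch: the model returns a tool call as JSON; application code parses the
# arguments and decides which local function to invoke.
import json

def get_weather(city):
    # hypothetical placeholder for the application's own logic
    return f"Sunny in {city}"

def dispatch(tool_call):
    # tool_call is one entry of choices[0]["message"]["tool_calls"]
    name = tool_call["function"]["name"]
    arguments = json.loads(tool_call["function"]["arguments"])
    if name == "get_weather":
        return get_weather(**arguments)
    raise ValueError(f"Unexpected tool: {name}")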

Last year OpenAI introduced JSON mode as a valuable tool for developers aiming to build reliable applications using their models.

Although JSON mode enhances the model’s ability to generate valid JSON outputs, as I have highlighted in previous articles, it does not ensure that the responses will adhere to a specific schema. This makes it more of an experimental feature than a production-ready one.

Now, OpenAI is introducing Structured Outputs in the API, a new feature designed to guarantee that model-generated outputs will precisely match the JSON Schemas provided by developers.

Structured Outputs are available in two forms: Function Calling and a new option for the response_format parameter.

The following Python code of a Function Calling example can be copied and pasted as-is into a notebook:

# Install the requests library if not already installed
!pip install requests

import requests
import json

# Define your OpenAI API key
api_key = ''

# Define the API endpoint
url = "https://api.openai.com/v1/chat/completions"

# Define the headers with the API key
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}

# Define the data for the API request
data = {
    "model": "gpt-4o-2024-08-06",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. The current date is August 6, 2024. You help users query for the data they are looking for by calling the query function."
        },
        {
            "role": "user",
            "content": "look up all my orders in may of last year that were fulfilled but not delivered on time"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "query",
                "description": "Execute a query.",
                "strict": True,
                "parameters": {
                    "type": "object",
                    "properties": {
                        "table_name": {
                            "type": "string",
                            "enum": ["orders"]
                        },
                        "columns": {
                            "type": "array",
                            "items": {
                                "type": "string",
                                "enum": [
                                    "id",
                                    "status",
                                    "expected_delivery_date",
                                    "delivered_at",
                                    "shipped_at",
                                    "ordered_at",
                                    "canceled_at"
                                ]
                            }
                        },
                        "conditions": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "column": {
                                        "type": "string"
                                    },
                                    "operator": {
                                        "type": "string",
                                        "enum": ["=", ">", "<", "<=", ">=", "!="]
                                    },
                                    "value": {
                                        "anyOf": [
                                            {
                                                "type": "string"
                                            },
                                            {
                                                "type": "number"
                                            },
                                            {
                                                "type": "object",
                                                "properties": {
                                                    "column_name": {
                                                        "type": "string"
                                                    }
                                                },
                                                "required": ["column_name"],
                                                "additionalProperties": False
                                            }
                                        ]
                                    }
                                },
                                "required": ["column", "operator", "value"],
                                "additionalProperties": False
                            }
                        },
                        "order_by": {
                            "type": "string",
                            "enum": ["asc", "desc"]
                        }
                    },
                    "required": ["table_name", "columns", "conditions", "order_by"],
                    "additionalProperties": False
                }
            }
        }
    ]
}

# Make the API request
response = requests.post(url, headers=headers, data=json.dumps(data))

# Print the response
print(response.status_code)
print(response.json())
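
The second form uses the response_format parameter rather than a tool definition. A minimal sketch reusing the url and headers above might look like this; the event schema and example message are my own illustrative assumptions, not taken from the article:

# Sketch of the response_format form of Structured Outputs (illustrative schema).
structured_data = {
    "model": "gpt-4o-2024-08-06",
    "messages": [
        {"role": "system", "content": "Extract the event details from the user's message."},
        {"role": "user", "content": "Team stand-up on Friday at 09:00 in Room 2."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "event",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "day": {"type": "string"},
                    "time": {"type": "string"},
                    "location": {"type": "string"}
                },
                "required": ["title", "day", "time", "location"],
                "additionalProperties": False
            }
        }
    }
}

structured_response = requests.post(url, headers=headers, data=json.dumps(structured_data))
print(structured_response.json()["choices"][0]["message"]["content"])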

JSON serves as a vital tool for structuring and exchanging data between AI agents and the functions they interact with, ensuring clear, consistent, and reliable communication across various systems and platforms.

✨✨ Follow me on LinkedIn for updates on Large Language Models

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

LinkedIn

https://openai.com/index/introducing-structured-outputs-in-the-api

https://platform.openai.com/docs/guides/structured-outputs



Source link

12Aug

How to Use Hybrid Search for Better LLM RAG Retrieval | by Dr. Leon Eversberg | Aug, 2024


Building an advanced local LLM RAG pipeline by combining dense embeddings with BM25

Code snippet from the hybrid search we are going to implement in this article. Image by author

The basic Retrieval-Augmented Generation (RAG) pipeline uses an encoder model to search for similar documents when given a query.

This is also called semantic search because the encoder transforms text into a high-dimensional vector representation (called an embedding) in which semantically similar texts are close together.

Before we had Large Language Models (LLMs) to create these vector embeddings, the BM25 algorithm was a very popular search algorithm. BM25 focuses on important keywords and looks for exact matches in the available documents. This approach is called keyword search.

If you want to take your RAG pipeline to the next level, you might want to try hybrid search. Hybrid search combines the benefits of keyword search and semantic search to improve search quality.

In this article, we will cover the theory and implement all three search approaches in Python.
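
As a preview, a minimal sketch of the hybrid step might combine min-max-normalised BM25 scores with dense-embedding similarities via a weighted sum; the toy corpus, the all-MiniLM-L6-v2 model, and the 0.5 weight below are illustrative assumptions, not necessarily what the article uses:

# Minimal hybrid-search sketch: weighted sum of BM25 and dense-embedding scores.
# Assumes the rank_bm25 and sentence-transformers packages are installed.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

documents = [
    "The cat sat on the mat.",
    "Transformers are a neural network architecture.",
    "BM25 is a classic keyword-based ranking function.",
]
query = "keyword search with BM25"

# Keyword search: BM25 over whitespace-tokenised documents
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
bm25_scores = np.array(bm25.get_scores(query.lower().split()))

# Semantic search: cosine similarity between dense embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
doc_emb = model.encode(documents, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
dense_scores = doc_emb @ query_emb

# Hybrid search: normalise both score vectors, then take a weighted sum
def min_max(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # weighting between keyword and semantic scores (assumed)
hybrid_scores = alpha * min_max(bm25_scores) + (1 - alpha) * min_max(dense_scores)
print(hybrid_scores.argsort()[::-1])  # document indices, best match first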

Table of Contents

· RAG Retrieval
Keyword Search With BM25
Semantic Search With Dense Embeddings
Semantic Search or Hybrid Search?
Hybrid Search
Putting It All Together
·…



Source link

10Aug

Comparing Sex Ratios: Revisiting a Famous Statistical Problem from the 1700s | by Ryan Burn | Aug, 2024


What can we say about the difference of two binomial distribution probabilities

18th century Paris and London [12]

Consider two independent binomial distributions with probabilities of successes p_1 and p_2. If we observe a_1 successes, b_1 failures from the first distribution and a_2 successes, b_2 failures from the second distribution, what can we say about the difference, p_1 – p_2?

Binomial model differences like this were first studied by Laplace in 1778. Laplace observed that the ratio of boys-to-girls born in London was notably larger than the ratio of boys-to-girls born in Paris, and he sought to determine whether the difference was significant.

Using what would now be called Bayesian inference together with a uniform prior, Laplace computed the posterior probability that the birth ratio in London was less than the birth ratio in Paris.
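
Under independent uniform priors, each posterior is a Beta distribution, so this probability can be approximated with a short Monte Carlo simulation; the sketch below uses placeholder counts rather than Laplace's actual birth data:

# Monte Carlo sketch: P(p_1 < p_2 | data) with independent Uniform(0, 1) priors,
# for which the posterior of p given a successes and b failures is Beta(a + 1, b + 1).
import numpy as np

rng = np.random.default_rng(0)

a1, b1 = 110, 100  # placeholder successes/failures for the first distribution
a2, b2 = 105, 100  # placeholder successes/failures for the second distribution

p1 = rng.beta(a1 + 1, b1 + 1, size=1_000_000)
p2 = rng.beta(a2 + 1, b2 + 1, size=1_000_000)

print((p1 < p2).mean())  # estimated posterior probability that p_1 < p_2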



Source link

09Aug

Structured Outputs and How to Use Them | by Armin Catovic | Aug, 2024


Building robustness and determinism in LLM applications

Image by the author

OpenAI recently announced support for Structured Outputs in its latest gpt-4o-2024-08-06 models. Structured outputs in relation to large language models (LLMs) are nothing new — developers have either used various prompt engineering techniques or third-party tools.

In this article we will explain what structured outputs are, how they work, and how you can apply them in your own LLM based applications. Although OpenAI’s announcement makes it quite easy to implement using their APIs (as we will demonstrate here), you may want to instead opt for the open source Outlines package (maintained by the lovely folks over at dottxt), since it can be applied to both the self-hosted open-weight models (e.g. Mistral and LLaMA), as well as the proprietary APIs (Disclaimer: due to this issue Outlines does not as of this writing support structured JSON generation via OpenAI APIs; but that will change soon!).

If RedPajama dataset is any indication, the overwhelming majority of pre-training data is human text. Therefore “natural language” is the native domain of LLMs — both in the input, as well as the output. When we build applications however, we would like to use machine-readable formal structures or schemas to encapsulate our data input/output. This way we build robustness and determinism into our applications.

Structured Outputs is a mechanism by which we enforce a pre-defined schema on the LLM output. This typically means that we enforce a JSON schema, however it is not limited to JSON only — we could in principle enforce XML, Markdown, or a completely custom-made schema. The benefits of Structured Outputs are two-fold:

  1. Simpler prompt design — we need not be overly verbose when specifying what the output should look like
  2. Deterministic names and types — we can guarantee to obtain, for example, an attribute age with a Number JSON type in the LLM response

For this example, we will use the first sentence from Sam Altman’s Wikipedia entry

Samuel Harris Altman (born April 22, 1985) is an American entrepreneur and investor best known as the CEO of OpenAI since 2019 (he was briefly fired and reinstated in November 2023).

…and we are going to use the latest GPT-4o checkpoint as a named-entity recognition (NER) system. We will enforce the following JSON schema:

json_schema = {
    "name": "NamedEntities",
    "schema": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "description": "List of entity names and their corresponding types",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "The actual name as specified in the text, e.g. a person's name, or the name of the country"
                        },
                        "type": {
                            "type": "string",
                            "description": "The entity type, such as 'Person' or 'Organization'",
                            "enum": ["Person", "Organization", "Location", "DateTime"]
                        }
                    },
                    "required": ["name", "type"],
                    "additionalProperties": False
                }
            }
        },
        "required": ["entities"],
        "additionalProperties": False
    },
    "strict": True
}

In essence, our LLM response should contain a NamedEntities object, which consists of an array of entities, each one containing a name and type. There are a few things to note here. We can, for example, enforce the Enum type, which is very useful in NER since we can constrain the output to a fixed set of entity types. We must specify all the fields in the required array; however, we can also emulate "optional" fields by setting the type to e.g. ["string", "null"].
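
As a small illustration of that last point, a hedged sketch of an optional field (my own addition, not part of the article's schema) could look like this:

# Sketch: an "optional" nickname field emulated with a nullable type, while
# still being listed under "required" as Structured Outputs expects.
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "nickname": {"type": ["string", "null"]}  # the model may return null here
    },
    "required": ["name", "nickname"],
    "additionalProperties": False
}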

We can now pass our schema, together with the data and the instructions, to the API. We need to populate the response_format argument with a dict where we set type to "json_schema" and then supply the corresponding schema.

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# The input text: the first sentence of Sam Altman's Wikipedia entry, quoted above
s = ("Samuel Harris Altman (born April 22, 1985) is an American entrepreneur "
     "and investor best known as the CEO of OpenAI since 2019 "
     "(he was briefly fired and reinstated in November 2023).")

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a Named Entity Recognition (NER) assistant.
            Your job is to identify and return all entity names and their
            types for a given piece of text. You are to strictly conform
            only to the following entity types: Person, Location, Organization
            and DateTime. If uncertain about entity type, please ignore it.
            Be careful of certain acronyms, such as role titles "CEO", "CTO",
            "VP", etc - these are to be ignored.""",
        },
        {
            "role": "user",
            "content": s
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    }
)

The output should look something like this:

{'entities': [{'name': 'Samuel Harris Altman', 'type': 'Person'},
              {'name': 'April 22, 1985', 'type': 'DateTime'},
              {'name': 'American', 'type': 'Location'},
              {'name': 'OpenAI', 'type': 'Organization'},
              {'name': '2019', 'type': 'DateTime'},
              {'name': 'November 2023', 'type': 'DateTime'}]}

The full source code used in this article is available here.

The magic is in the combination of constrained sampling and context-free grammar (CFG). We mentioned previously that the overwhelming majority of pre-training data is “natural language”. Statistically this means that for every decoding/sampling step, there is a non-negligible probability of sampling some arbitrary token from the learned vocabulary (and in modern LLMs, vocabularies typically stretch across 40,000+ tokens). However, when dealing with formal schemas, we would really like to rapidly eliminate all improbable tokens.

In the previous example, if we have already generated…

{   'entities': [   {'name': 'Samuel Harris Altman',

…then ideally we would like to place a very high logit bias on the token(s) that spell out 'type' in the next decoding step, and a very low probability on all the other tokens in the vocabulary.

This is in essence what happens. When we supply the schema, it gets converted into a formal grammar, or CFG, which serves to guide the logit bias values during the decoding step. CFG is one of those old-school computer science and natural language processing (NLP) mechanisms that is making a comeback. A very nice introduction to CFG was actually presented in this StackOverflow answer, but essentially it is a way of describing transformation rules for a collection of symbols.
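
As a toy illustration of that mechanism (not the actual OpenAI or Outlines implementation), constrained decoding can be pictured as masking the logits of every token the grammar does not currently allow before sampling:

# Toy sketch of constrained sampling: push the probability of every token the
# grammar forbids at this step to zero, then renormalise and pick a token.
import numpy as np

vocab = ["{", "}", "'entities'", "'name'", "'type'", ":", ",", "hello", "cat"]
logits = np.random.default_rng(0).normal(size=len(vocab))

# Suppose the grammar says the next token must start the 'type' key or close the object
allowed = {"'type'", "}"}
mask = np.array([0.0 if token in allowed else -np.inf for token in vocab])

probs = np.exp(logits + mask)
probs /= probs.sum()
next_token = vocab[int(np.argmax(probs))]
print(next_token)  # always one of the allowed tokens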

Structured Outputs are nothing new, but are certainly becoming top-of-mind with proprietary APIs and LLM services. They provide a bridge between the erratic and unpredictable “natural language” domain of LLMs, and the deterministic and structured domain of software engineering. Structured Outputs are essentially a must for anyone designing complex LLM applications where LLM outputs must be shared or “presented” across various components. While API-native support has finally arrived, builders should also consider using libraries such as Outlines, as they provide an LLM/API-agnostic way of dealing with structured output.



Source link

08Aug

How to Reduce Class Imbalance Bias in AI? (Explained with a Riddle) | by Diana Morales


Do you like riddles? Perfect! In this article I’ll use a riddle as a fun way to explain class imbalance bias in machine learning models

For International Women’s Day, Mindspace asked 22 people to solve the following riddle and recorded their responses:

A father is about to bring his son to a job interview, applying to work at a large stock trading company. The son is incredibly nervous… In the car during their drive over they hardly speak… Just when arriving at the parking lot of the company the son receives a phone call. He looks up at this father, who says: ‘’Go ahead, pick it up.’’ The caller is the CEO of the stock trading company, who says: ‘’Good luck son…you’ve got this.’’ The boy hangs up the phone and again looks at his father, who is still sitting next to him in the car.

How is this possible? No, really… take a minute and think about it. Alright! Final answer? ˙ɹǝɥʇoɯ s,uos ǝɥʇ sı OƎƆ ǝɥ⊥



Source link

06Aug

Executive Assistant to the Director


GovAI was founded to help humanity navigate the transition to a world with advanced AI. Our first research agenda, published in 2018, helped define and shape the nascent field of AI governance. Our team and affiliate community possess expertise in a wide variety of domains, including AI regulation, responsible development practices, compute governance, AI company corporate governance, US-China relations, and AI progress forecasting.

GovAI researchers have closely advised decision makers in government, industry, and civil society. Our researchers have also published in top peer-reviewed journals and conferences, including International Organization, NeurIPS, and Science. Our alumni have gone on to roles in government, in both the US and UK; top AI companies, including DeepMind, OpenAI, and Anthropic; top think tanks, including the Centre for Security and Emerging Technology and RAND; and top universities, including the University of Oxford and the University of Cambridge.

As Executive Assistant to GovAI’s Director, you will help him prioritise well among competing objectives and tasks, track work streams and processes so that he doesn’t have to, protect his attention and focus, and free up his time by taking over some routine tasks. 

Responsibilities will include: 

  • Developing and maintaining a good understanding of the Director’s priorities, and eventually helping him decide which of them to focus on.
  • Setting up systems to help the Director keep an overview of outstanding tasks and enable him to execute on the highest-leverage ones.
  • Proactively managing the Director’s internal and external communications. This involves soliciting input from the Director where appropriate, and communicating decision-relevant information with concision.
  • Managing the Director’s schedule and appointments, organising and supporting meetings, and planning and implementing travel itineraries. 
  • Interfacing with the rest of the organisation, and especially the Chief of Staff, to keep the Director apprised of the status and timelines of ongoing projects.

We’re selecting candidates who are:

  • Driven by a desire to produce consistently excellent work and achieve valuable results. Being able and motivated to achieve nearly 100% reliability will be especially important in this role.
  • Highly organised and able to keep on top of the wide range of tasks and workstreams the Director needs to engage with. Previous experience with executive assistance is a strong plus.
  • Thriving in a fast-paced and constantly changing environment. The demands on the Director’s time and attention are shifting frequently.
  • Discreet and highly trustworthy. This role will involve access to sensitive information, and protecting its confidentiality is a must.
  • Excellent at oral and written communication. This role will require clear and prompt communication with a wide range of stakeholders, both over email and in person.
  • Excited by the opportunity to use their careers to positively influence the lasting impact of artificial intelligence, in line with our organisation’s mission.

This position is full-time. Our offices are located in Oxford, UK, and we strongly prefer team members to be based here, but are open to individuals who would need to work remotely. We are able to sponsor visas. 

The Executive Assistant will be compensated in line with our salary principles. As such, the salary for this role will depend on the successful applicant’s experience, but we expect the range to be between £65,000 (~$84,000) and £85,000 (~$110,000) if based in Oxford. In rare cases where salary considerations would prevent a candidate from accepting an offer, there may also be some flexibility in compensation. 

Benefits associated with the role include health, dental, and vision insurance, a £5,000 annual wellbeing budget, an annual commuting budget, flexible work hours, extended parental leave, ergonomic equipment, a 10% pension contribution, and 33 days of paid vacation (including Bank Holidays). 

The application process consists of a written submission in the first round, a paid remote work test in the second round, an interview in the third round, and an on-site assessment day in the final round. We also conduct reference checks for all candidates we interview. Please apply using the form linked below.

GovAI is committed to fostering a culture of inclusion and we encourage individuals with underrepresented perspectives and backgrounds to apply. We especially encourage applications from women, gender minorities, people of colour, and people from regions other than North America and Western Europe who are excited about contributing to our mission. We are an equal opportunity employer and want to make it as easy as possible for everyone who joins our team to thrive in our workplace. 

If you would need a decision communicated by a particular date, need assistance with the application due to a disability, or have any other questions about applying, please email re*********@go********.ai.



Source link

06Aug

Visualizing Stochastic Regularization for Entity Embeddings | by Valerie Carey | Aug, 2024


A glimpse into how neural networks perceive categoricals and their hierarchies

Photo by Rachael Crowe on Unsplash

Industry data often contains non-numeric data with many possible values, for example zip codes, medical diagnosis codes, or preferred footwear brands. These high-cardinality categorical features contain useful information, but incorporating them into machine learning models is a bit of an art form.

I’ve been writing a series of blog posts on methods for these features. Last episode, I showed how perturbed training data (stochastic regularization) in neural network models can dramatically reduce overfitting and improve performance on unseen categorical codes [1].

In fact, model performance for unseen codes can approach that of known codes when hierarchical information is used with stochastic regularization!

Here, I use visualizations and SHAP values to “look under the hood” and gain some insights into how entity embeddings respond to stochastic regularization. The pictures are pretty, and it’s cool to see plots shift as data is changed. Plus, the visualizations suggest model improvements and can identify groups that might be of interest to analysts.

NAICS Codes



Source link

05Aug

Agent AI: Agentic Applications Are Software Systems With A Foundation Model AI Backbone & Defined Autonomy via Tools | by Cobus Greyling | Aug, 2024


Flow Engineering

Prompt Engineering alone was not enough and we had to find a way of re-using prompts; hence templates were introduced where key data fields could be populated at inference. This was followed by prompts being chained to create longer flows and more complex applications.

Chaining was supplemented with highly contextual information and inference, giving rise to an approach leveraging the In-Context Learning (ICL) via Retrieval Augmented Generation (RAG).

The next step in this evolution is Agentic Applications (AI Agents) where a certain level of agency (autonomy) is given to the application. LlamaIndex combined advanced RAG capabilities with an Agent approach to coin Agentic RAG.

For Agentic Applications to have an increased level of agency, more modalities need to be introduced. MindSearch can explore the web via a text interface, while OmniParser, Ferret-UI and WebVoyager enable agentic applications to interpret a graphical interface and navigate the GUI.

The image above is from Microsoft’s OmniParser, which follows a similar approach to Apple’s Ferret-UI and to WebVoyager. Screen elements are detected, mapped with bounding boxes and named. From there, a natural language layer can be created between a UI and any conversational AI system.

MindSearch is premised on the problem that complex requests often cannot be accurately and completely retrieved by a search engine in a single query.

Corresponding information which needs to be integrated into solving a problem or a question, is spread over multiple web pages along with significant noise.

Also, a large number of web pages with long contents may quickly exceed the maximum context length of LLMs.

The WebPlanner models the human mind of multi-step information seeking as a dynamic graph construction process.

It decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search result from WebSearcher, using either GPT-4o or InternLM2.5-7B models.



Source link

05Aug

Managing Risks from AI-Enabled Biological Tools


GovAI research blog posts represent the views of their authors, rather than the views of the organisation.

Introduction

Some experts have recently expressed concern that AI models could increase the risk of a deliberate bioweapon attack. Attention has primarily focused on the controversial hypothesis that  “large language models” (LLMs) — particularly chatbots such as ChatGPT and Claude — could help scientific novices learn how to identify, create, and release a pandemic-causing virus. 

The risks posed by AI-enabled biological tools have received less attention. However, some of these tools could pose major risks of their own. Rather than lowering skill barriers to misuse, some biological tools could advance the scientific frontier and make more deadly attacks possible.

Unfortunately, policies designed primarily with chatbots in mind may have limited effects in addressing risks from biological tools. For example, policymakers focused on chatbots have begun to consider or enact mandatory reporting requirements for models trained using very large amounts of compute. The justification for limiting these requirements to high-compute chatbots is that compute usage strongly predicts the capabilities and thus risks of chatbots. However, for AI-enabled biological tools, it is an open question how far compute is now, and will be in the future, a good proxy for model capability and risk. Therefore, reporting requirements based on “compute thresholds” may not be effective at identifying high-risk biological tools.

Moreover, while cutting-edge LLMs are produced by only a handful of technology companies, biological tools are produced by a fairly large number of research groups that are dispersed across multiple jurisdictions. This makes self-regulation less likely to succeed. It also increases challenges around regulatory compliance and may create a need for different regulatory strategies.

Finally, policymakers focused on risks from chatbots have advocated for “red-teaming” as a method of identifying risks: this means trying to elicit unacceptable outputs from chatbots, such as advice about how to build biological weapons. However, red-teaming is less likely to be a practical and safe approach to assessing the risks of biological tools. For example, it would likely be both too dangerous and too costly to try to use a biological tool to design a novel virus and then test its virulence. In general, trying to produce new scientific knowledge about biological weapons is more problematic than trying to reproduce existing knowledge.

It is not yet clear what the most effective policy approach to managing risks from biological tools will be. However, a natural first step would be for governments to conduct recurring literature-based risk assessments of biological tools. This would involve surveying the literature for published evidence of dangerous capabilities in existing tools. The results of these assessments could inform policymakers in choosing which further tests to conduct, as well as when and whether new policy actions are warranted.

Moreover, there are some actions policymakers could take to mitigate both existing biological risks and potential new risks created by biological tools. For example, mandatory screening of DNA synthesis orders could, in general, make it harder for malicious actors to gain access to bioweapons by placing orders to mail-order virus suppliers.

Biological tools

AI-enabled biological tools are AI systems that have been developed using biological data such as genetic sequences and trained using machine learning techniques in order to perform tasks that support biological research and engineering. Perhaps the most notable biological tools are design tools, which can help to design novel biological agents — including novel viruses. However, the category also includes a broader range of tools such as platforms that automate wet lab experiments. Some biological tools can perform a wide range of biological tasks, while others perform specific tasks, such as predicting the structure of proteins (like AlphaFold 2). 

Like chatbots, biological tools have made extraordinary progress in recent years. They are already contributing to progress in many areas of biomedicine, including vaccine development and cancer therapy. However, the models also introduce risks. Of particular concern are certain kinds of biological tools that could potentially allow malicious actors to design and develop pathogens that are more transmissible or more deadly than known pathogens. 

The landscape of biological tools is different to the landscape of AI chatbots in several ways.

Risks from biological tools

To understand potential risk from biological tools, it is useful to first consider the “baseline risk” of a bioweapon attack, and then consider how biological tools could increase this risk.

Baseline risk

Independent of progress in AI, the baseline risk of a bioweapon attack over the next ten years is concerning. Today, thousands of people are already able to create viruses from scratch without using AI tools. Moreover, accessibility is increasing exponentially: it now costs only $1,000 to create the virus that caused the 1918 influenza pandemic from scratch1. (Note, however, that 1918 pandemic influenza is unlikely to cause a pandemic today, due to pre-existing population immunity.) The price of gene synthesis declined by a factor of 1,000 from the year 2000 to 2016. 

Biological tools, therefore, are not necessary for a bioweapons attack. The concern is that some biological tools could exacerbate the risk. 

How could biological tools affect biological risk?

Many experts have expressed concern that some kinds of biological tools could help malicious actors to identify new pandemic-capable viruses. For example, certain tools could help an actor to predict a virus’ toxicity or predict whether it could evade a vaccine.

Most AI-enabled biological tools present limited risks. However, it is difficult to know the degree to which some tools could increase biological risks. While there has been research into how chatbots affect the risk of a bioweapons attack, there are still no published risk assessments of the risk posed by current or future biological tools. It is also likely that a large portion of work on this topic will remain unpublished or highly redacted, due to concerns about spreading knowledge that could be misused. These same concerns are why I have not provided more specific descriptions of how biological tools could be misused. 

Who can make use of biological tools?

The risks from biological tools will depend, in part, on how many actors can make use of them.

At present, scientific novices are unlikely to use them successfully. Barriers to use include both scientific knowledge and programming expertise.

However, these barriers could be overcome by general-purpose AI systems that can interface with biological tools. These models could interpret plain-language requests, choose and program the appropriate biological tool, translate technical results back into plain language, and provide a step-by-step laboratory protocol or even control automated laboratory equipment to produce the desired result. This is analogous to the way in which general-purpose AI systems, like ChatGPT, can already write and run simple computer programs in response to plain-language requests from users. (See this video from 310 copilot which outlines how such a system could work.)

These kinds of general-purpose AI systems are not yet mature enough to empower a novice2. They have shown some limited success, though, and in future they could ultimately allow novices to make use of biological tools too.

Even if the models cannot yet empower novices, they may still increase risks by increasing the capabilities of sophisticated actors. 

Policymakers have taken only limited steps to understand and govern risks from biological tools

So far, policymakers worldwide have taken only limited steps to understand and address potential risks from biological tools. (See Appendix A.)

In the US, the main step policymakers have taken is to require the developers of certain biological tools to report information about these tools if they are trained using an amount of compute that exceeds an established threshold. However, the relevant “compute threshold” is set high enough that it may currently apply to only a single existing model.

Meanwhile, in the UK, policymakers have not yet established any regulatory requirements targeting biological tools. However, UK policymakers have established a workstream within the AI Safety Institute to study risks from biological tools.

Even though the EU has recently passed a wide-ranging AI Act, the Act does not cover AI-enabled biological tools. 

Distinct governance challenges

There is currently a lively policy dialogue about how to assess and manage potential extreme risks from AI, including biological risks. 

This dialogue has mostly focused on risks from chatbots and other language models. In this context, common policy proposals include:

  • Supporting industry self-regulation by working closely with large technology companies to encourage them to voluntarily commit to responsible practices
  • Encouraging or requiring the developers of models to red team them (i.e. test their safety by directly trying to elicit unacceptable outputs) and then put in place safeguards if sufficiently dangerous capabilities are found
  • Creating standards or regulations that specifically apply to AI systems that exceed some compute threshold (since compute is highly predictive of the capabilities of language models)

However, because biological tools pose distinctive governance challenges, policy ideas that are developed with chatbots in mind often do not straightforwardly apply to biological tools.

Some distinctive challenges for the governance of biological tools are:

  1. Lower barriers to development: Leading biological tools are relatively cheap to develop, and developers are spread across numerous institutions and countries. Moreover, as the cost of compute declines and algorithms improve, the set of actors who have the resources necessary to develop biological tools will increase rapidly.

    This could make effective self-regulation difficult to sustain: a large and heterogeneous set of developers — rather than a fairly small and homogenous set of technology companies — would all need to voluntarily follow the same norms. This could make it harder to ensure compliance with regulations. Further, it increases the risk that developers would move jurisdiction in order to avoid regulation.

  2. Greater risks and costs of red-teaming: Red-teaming biological tools (i.e. testing their safety by trying to elicit and identify unacceptable outputs) may be both more dangerous and more difficult than red-teaming chatbots. For the most part, red-teaming efforts that focus on chatbots attempt to assess whether these chatbots can help non-experts gain access to or apply dangerous information that experts already possess. On the other hand, red-teaming efforts that focus on biological tools may need to assess whether the tools can produce new dangerous information (e.g. a design for a new deadly virus) that was previously unknown. Clearly, in this latter case, the potential for harm is greater. Furthermore, actually judging the validity of this new information could require biological experiments (e.g. synthesising and studying a potentially deadly new virus) that would be both difficult and highly risky.

    Therefore, policymakers will likely need to be more cautious about calling on the developers of biological tools to engage in red-teaming. Many developers of biological tools have themselves noted that evaluations “should be undertaken with precaution to avoid creating roadmaps for misuse”.

  3. Stronger open science norms: Unlike with cutting-edge chatbots, the vast majority of cutting edge biological tools are “open source”: they can be directly downloaded, modified, and run by users. This matters because safeguards against misuse can generally be easily and cheaply removed from any model that is open-sourced.

    The commitment to open sourcing in the field in part stems from strong open science norms in the biomedical sciences in general and among biological tool developers in particular. Even when models are not open-source, these norms may also lead well-intentioned scientists to publish papers that call attention to information produced by biological tools that could be misused.

    The upshot is that calls for biological tool developers to introduce safeguards against the misuse of their models could have comparatively little efficacy. If policymakers ultimately judge that certain safeguards should be put in place, they will likely need to rely especially strongly on regulatory requirements.
  4. Greater difficulty identifying high-risk models: In part because biological tools are more heterogeneous than language models, we should expect there to be a less reliable relationship between the level of compute used to develop a biological tool and the level of risk it poses. This means that “compute thresholds” may also be less reliable for distinguishing high-risk biological tools from low-risk ones.

    If standard-setting bodies or regulators want to apply certain requirements only to “high-risk” biological tools, they may need to go substantially beyond simple compute thresholds to define the high-risk category.

In addition to these distinctive challenges, the fact that biological tools are also dual-use — meaning that, like language models, they can be used for good or ill — will tend to make governance especially difficult. It will be difficult to reduce risks from misuse without also, to some extent, interfering with benign uses.

Policy options

Despite all of these challenges, there are steps that governments can take today to understand and begin to mitigate potential risks from biological tools.

Two promising initial steps could be to (1) carry out recurring domain-wide risk assessments of biological tools, and (2) mandate screening of all orders to gene synthesis companies.

Conducting literature-based risk assessments of biological tools

In order to manage the risks of biological tools, governments should first review the published literature on models that have already been released to assess the risks the models pose. This would allow governments to understand whether and how to take further action. 

As noted above, red-teaming models — or asking developers to red-team their own models — could involve unacceptable risks and unreasonable costs. Fortunately, as has also been noted by researchers at the Centre for Long-Term Resilience, conducting risk assessments that are based simply on reviews of published literature would be much safer and more practicable.

In slightly more detail, governments could proceed as follows. First, they could develop a list of potentially dangerous model capabilities and assessments of how much risk would be created by tools with these capabilities. Then, they could review the published literature from industry and academia to assess the extent to which already-released models appear to possess these capabilities. 

These literature-based risk assessments would not substantially increase biosecurity risks, because they would not seek to discover new information that has not already been published. (However, because they could still draw attention to dangerous information, they should not be shared publicly.) Performing these risk assessments also would not involve running costly experiments. 

Literature-based risk assessments would be an important first step in understanding and managing the risks of biological tools. They would provide governments with up-to-date information on the risks posed by current biological tools. This would in turn help governments to decide whether further actions are warranted. Possible responses to concerning findings could include: conducting more thorough assessments; supporting additional research on safeguards; issuing voluntary guidance to biological tool developers; or introducing new regulatory requirements. It is possible, for example, that there will eventually be a need for a regulatory regime that requires developers to seek government approval before publishing especially high-risk tools. Literature-based risk assessments could help governments know if and when more restrictive policy is needed.

Mandate gene synthesis screening

There are a range of policies that would reduce the specific risks from biological tools, as well as other biological risks that are not caused by AI. One such policy, supported both by many biosecurity experts and many biological tool developers, is gene synthesis screening. 

Many companies and scientists today can create synthetic versions of a virus from scratch once they have its DNA sequence. Some, though not all, of these gene synthesis companies screen orders from customers so that they will not synthesise dangerous pathogens, such as anthrax or smallpox. In the US, the White House recently developed best practice guidelines for DNA synthesis providers, which recommend that synthesis companies should implement comprehensive and verifiable screening of their orders. All federal research funding agencies will also require recipients of federal research funds to procure synthetic nucleic acids only from providers that implement these best practices. This is a valuable first step towards comprehensive DNA synthesis screening, but significant gaps remain. DNA screening should be mandatory for all virus research, including research funded by the private sector or foundations.

In the future, biological design tools may make it possible for malicious actors to bypass DNA screening by designing agents that do not resemble the function or sequence of any known pathogen. Screening requirements would then have to be improved in tandem with biological design tools in order to reduce this risk.

Conclusion

Rapid improvements in biological tools promise large benefits for public health, but also introduce novel risks. Despite that, there has so far been limited discussion of how to manage these emerging risks. Policy approaches that are well-suited to frontier AI chatbots are unlikely to be effective for mitigating risks from biological tools. A promising first step towards effective risk management of biological tools would be for governments to mandate screening of all gene synthesis orders and to carry out recurring risk assessments of released models that rely on the published literature.

Acknowledgements

Thank you to my colleagues at GovAI and to Jakob Grabaak, Friederike Grosse-Holz, David Manheim, Richard Moulange, Cassidy Nelson, Aidan O’Gara, Jaime Sevilla, and James Smith for their comments on earlier drafts. The views expressed and any mistakes are my own.

John Halstead can be contacted at jo***********@go********.ai

As noted above, policymakers have taken only limited steps to understand and address potential risks from biological tools. Here, in somewhat more detail, I will briefly review the most relevant policy developments in the US, UK, and EU.

United States

In the US, there are no regulations that require the developers of any biological tools to assess the risks posed by their tools – or to refrain from releasing them if they appear to be excessively dangerous.

However, a provision in the recent Executive Order on AI does place reporting requirements on the developers of certain biological tools. Specifically, these requirements apply to anyone developing an AI model that is trained “primarily on biological sequence data” and uses a sufficiently large amount of compute. The developers of these models must detail their development plans and risk-management strategies, including any efforts to identify dangerous capabilities and safeguard their models against theft.

In practice, currently, these reporting requirements apply to very few biological tools. At the time of writing, only one publicly known model — xTrimoPGLM-100B — appears to exceed the relevant compute threshold (10²³ “floating point operations”). However, the number of tools that the requirements apply to is also likely to grow over time.

United Kingdom

The UK has not yet introduced any regulatory requirements (including reporting requirements) that target biological tools. 

However, it has recently established an AI Safety Institute to support the scientific evaluation and management of a wide range of risks posed by AI. While the institute primarily focuses on general-purpose AI systems, it also currently supports work on risks from AI-enabled biological tools. It has not yet reported substantial information about this workstream, but may report more as the workstream matures.

European Union

The EU has recently passed an AI Act that will introduce a wide range of new regulatory requirements related to AI. The act places requirements and restrictions on (1) specific applications of AI systems, including “high-risk” AI systems; and (2) general-purpose AI systems. AI-enabled biological tools are not currently included in either category. This can only be changed by amending legislation, which involves the European Council, Parliament, and Commission. 

General-purpose models

General-purpose AI models are defined according to “the generality and the capability to competently perform a wide range of distinct tasks” (Article 3(63)). This is not yet precisely defined, but does not plausibly apply to AI-enabled biological tools. Some AI-enabled biological tools perform a wide range of biological tasks, but still perform a much narrower range of tasks than models like GPT-4 and Claude 3. Moreover, some potentially risky AI-enabled biological tools do not perform a wide range of biological tasks, and therefore cannot be defined as general-purpose. 

The act says that a general-purpose AI model shall be presumed to have “high impact capabilities” and thus “systemic risk” if (but not only if) it is trained on more than 10²⁵ FLOPs (Article 51(2), 51(1)(a))3. This does not apply to any existing AI-enabled biological tools (Maug et al 2024).

However, as noted in the main piece, in the future, it is likely that there will be combined LLM and biological tool models that are general-purpose in the sense meant by the AI Act, and will have advanced biological capabilities that may significantly increase biological risk. These models would be covered by the Act. 

High-risk applications

Article 6 of the EU AI Act defines “high-risk” AI systems as those that “pose a significant risk of harm to the health, safety or fundamental rights of natural persons”. Annex III of the Act sets out specific criteria which classify applications of AI as high-risk, including use in social scoring systems, making employment-related decisions, assessing people’s financial creditworthiness, and other criteria.

Increased biological weapons capabilities are not included in Annex III. The European Commission can class systems as high-risk or not high-risk, provided they increase risk in the areas covered by Annex III (per Article 6). The Commission cannot use delegated acts to add new areas of risk that are not already included in Annex III (per Article 7). However, the Commission can propose the addition of new risk areas to Annex III of the Act to the European Parliament and Council once a year (per Article 112).

Scientific R&D exemption?

The Act states that AI models developed and used for the sole purpose of scientific R&D do not fall within the scope of the act (Article 2(8)), and this could plausibly apply to AI-enabled biological tools. However, since most AI-enabled biological tools are open source, there is no way to know whether they are being used solely for the purpose of scientific R&D, or are also being misused for bioweapons development by malicious actors. So, it is not clear that this provision does exempt AI-enabled biological tools from coverage by the Act.

AI-enabled biological tools have scaled rapidly over the past decade. To predict how the capabilities of biological tools will evolve, it is helpful to consider the different drivers of performance improvements. We can think of these improvements as being driven by (1) increases in computing power (or “compute”); (2) by increases in data quantity and quality; and (3) by algorithmic progress. 

Here, I perform a preliminary analysis of how compute and data are likely to develop in the near-term. By my own assessment, the compute and data used in training is likely to increase rapidly in the near future and then increase more slowly thereafter. More research on this question would be valuable, as it would provide a clearer picture for policymakers of how risks are likely to develop in the future. 

However, it is important to note that for AI-enabled biological tools, the sheer amount of compute and data used in training may not be a good predictor of especially risky capabilities. There may be models trained with small amounts of compute and data that provide significant uplift at specific narrow tasks that in turn materially increase misuse risk. Therefore, although the increase in compute and data used in training may soon slow, this does not necessarily mean that the increase in risks will slow proportionately. 

Moreover, the analysis here focuses only on compute and data, but not algorithmic progress. More research on (a) the past effect of algorithmic improvements on the performance of biological tools, and on (b) likely future trends in algorithmic progress, would also be valuable. 

Compute trends

The amount of compute used to train biological sequence models has increased by an average of 8-10 times per year over the last six years. This is equivalent to compute doubling around every four months. xTrimoPGLM-100B, a 100 billion parameter protein language model, exceeds the White House Executive Order’s reporting threshold of 10²³ FLOP by a factor of six.
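
A quick sanity check of that equivalence, assuming nothing beyond the stated 8-10x annual growth:

# An 8-10x annual increase implies a doubling time of roughly 3.6-4.0 months.
import math
print(12 / math.log2(8), 12 / math.log2(10))  # ≈ 4.0 and ≈ 3.6 months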

xTrimoPGLM-100B had amortised hardware training costs of around $1 million4. If biological design tools keep scaling at current rates, then the hardware cost of training will be on the order of $400 million in 2026 and $2.7 billion in 20275. Such costs would require biological design tools to have very large commercial benefits or receive substantial government funding. For comparison, it cost $4.75 billion over ten years to build the Large Hadron Collider at CERN. This suggests that the rate of increase in compute will decline substantially in the next few years. Future research should analyse likely future investment in biological tools from the public and private sector.

Data trends

There is much less information on trends in the amount of data used to train biological design tools than for compute. On the (questionable) assumption that the relationship between data scaling and compute scaling is similar to AI language models, then this would imply that the data used to train biological design tools is increasing by around a factor 13 per year6.

A report by Epoch found that the public databases used to train different types of biological design tools are increasing in size: between 2022 and 2023 key databases grew by 31% for DNA sequences, 20% for metagenomic protein sequences, and 6.5% for protein structures. If we assume that growth continues at this rate, and that all of this data is equally useful for model training, on current scaling trends, models trained on public databases cannot continue to scale at current rates beyond the next few years.7 

However, this does not provide a realistic picture of the data constraints to scaling of biological tools for two reasons. First, the rate of the increase in data in public and private databases will likely increase in the future. Gene sequencing costs have declined by more than a factor of 100,000 over the last 20 years, and so are likely to decline by a few more orders of magnitude over the next few years. Various projects also propose a massive increase in sequencing. For example, the pilot Nucleic Acid Observatory proposal to sequence wastewater at all US ports of entry would increase the growth rate in metagenomic sequences by a factor of 100.8 So, there could be changes in the growth rate of available data in the future. 

Second, some proprietary databases are much larger than public databases. For example, Basecamp Research, a private company that develops biological databases and develops biological design tools, aims to create the world’s largest and most diverse protein database. Basecamp says that after only a few years of sampling, their proprietary databases have 4x more sequence diversity and 20x more genomic content than comparable public databases. Indeed, there will be stronger commercial incentives to improve biological databases given the capabilities of Biological Design Tools. Future research should analyse the implications of these large proprietary databases for future scaling of biological design tools. 

There are two caveats to this. First, there are diminishing returns to the increases in data in the relevant databases because many of the genomes are very genetically similar to one another. For example, there are around 1,000 human genomes and 260,000 e-coli genomes in the NCBI database, but these genomes are very similar to one another. So, the training data is less varied and therefore less useful for biological tools than, e.g. for LLMs. 

Finally, state-of-the-art protein language models that predict protein structure and function are pre-trained in a self-supervised way on tens of millions to billions of proteins.9 However, in order to predict rare or novel protein functions, which are especially concerning from a biological risk point of view, current models need to be fine tuned using labelled protein data. Historically, labelling of protein function had been imperfect, highly labour-intensive, and required synthesising decades of biological research. If labelling continues to be labour-intensive, the quantity of labelled data is unlikely to increase by an order of magnitude in the near future.10

However, this could also change in the future because AI tools may themselves be able to label proteins more accurately and efficiently. For example, Basecamp’s HiFin-NN AI model outperforms other annotation models and is able to rapidly annotate a large representative portion of the MGnify microbial protein database that was previously not annotated.



Source link
