26 June


GovAI research blog posts represent the views of their authors, rather than the views of the organisation.

Introduction

In the coming years, AI will impact many people’s jobs.

AI systems like ChatGPT and Claude can already perform a small but growing number of tasks. This includes, for example, drafting emails, producing illustrations, and writing simple computer programs.

Increasingly capable AI systems will produce increasingly significant economic effects. Automation will boost economic growth, but it will also disrupt labour markets: new jobs will be created and others will be lost. At a minimum, there will be immediate harm to workers who are forced to look for new work. Depending on how different groups are affected, inequality could rise or fall.

While automation is not a new phenomenon, some economists have suggested, controversially, that AI’s impacts might be more disruptive than those of previous labour-saving technologies. First, AI-driven automation could potentially happen faster. Second, at least in the long run, AI might more heavily substitute for human labour. The second concern means that, beyond some level of automation, the typical person’s ability to find well-paid work could actually begin to decline. However, there is still no consensus on what to expect.

If policymakers could more clearly foresee AI’s economic impacts, then they could more readily develop policies to mitigate harms and accelerate benefits. To this end, some researchers have begun to develop automation evaluations: forward-looking assessments of AI’s potential to automate work, as well as automation’s downstream impacts on labour markets.

All existing approaches to automation evaluations have major limitations. Researchers are not yet in a position to make reliable predictions.

Fortunately, though, there is a great deal of room to produce more informative evaluations. This post will discuss a number of promising directions for further work, such as adopting more empirically-grounded methods for estimating automation potential and leveraging the “staged release” of AI systems to study early real-world effects. As the uncertain economic impact of AI looms increasingly large, advancing the science of automation evaluations should be a policy priority.

The Importance of Automation Evaluations

Policymakers will need to address the economic impacts of AI to some extent. This may mean crafting policies to support high-growth industries, limit harm to displaced workers (for example, by supporting retraining efforts), or redistribute concentrated gains.

Reliable automation evaluations would give policymakers more time to plan and craft effective policies, by offering them foresight into AI’s economic impacts.1 Without this foresight, policymakers could find themselves scrambling to respond to impacts after the fact. A purely reactive approach could be particularly inadequate if AI’s impacts unfold unusually quickly.2

Current Automation Evaluation Methods

The potential automation impacts of AI can be evaluated in several ways, though existing approaches have important limitations.

In this post, I review two prominent methods: estimating “occupational exposure” to AI using task descriptions and measuring the task-level productivity impacts that workers experience when using AI systems.3 The former gives a broad but imprecise overview of potential labour market impacts across the economy. The latter provides more precise, but narrowly focused, evidence on the effect of AI on particular occupations and tasks.

After describing these two approaches, along with their respective advantages and limitations, I will discuss a further set of limitations that applies to both methods. I will then outline a number of research directions that could help to mitigate these limitations.

Estimating Occupational Exposure to AI Using Task Descriptions

What is “occupational exposure”?

One way to evaluate potential automation impacts is to estimate occupational exposure to AI. While definitions vary across studies, a task is generally considered “exposed” to AI if AI can be meaningfully helpful for completing it. One job is then typically said to be “more exposed” to AI than another if a larger proportion of the tasks that make up this job are exposed.

For example, because AI has proven to be useful for tasks involving writing, and less useful for physical tasks, secretaries are likely to be more exposed to today’s AI systems than roofers.

Exposure is a flexible concept that has been operationalised in a range of ways. For example, in a recent Science paper, my co-authors and I define a task as “exposed” to AI systems like ChatGPT if these systems could double the productivity of the worker performing the task.
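To make the aggregation step concrete, the sketch below shows how task-level exposure judgements are commonly rolled up into an occupation-level score (the share of a job’s tasks judged exposed). The tasks, judgements, and occupations are invented for illustration, not drawn from O*NET or any published study.

```python
# Illustrative sketch: aggregating task-level exposure judgements into an
# occupation-level exposure score. Tasks and labels are hypothetical.

occupations = {
    # 1 = task judged "exposed" to current AI systems, 0 = not exposed
    "secretary": {"draft correspondence": 1, "schedule meetings": 1, "greet visitors": 0},
    "roofer": {"install shingles": 0, "inspect roof damage": 0, "order materials": 1},
}

def exposure_share(task_judgements):
    """Occupation-level exposure: the share of a job's tasks judged exposed."""
    return sum(task_judgements.values()) / len(task_judgements)

for occupation, tasks in occupations.items():
    print(f"{occupation}: {exposure_share(tasks):.0%} of tasks exposed")
```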

How is exposure currently estimated?

Exposure estimates are not typically based on empirical observations of AI being applied to tasks.

Instead, these estimates are produced by drawing on existing datasets, such as the United States Department of Labor’s O*NET database, that describe a wide range of worker tasks. These descriptions may then be given to AI experts, who are asked to apply their knowledge of AI to judge whether each task is exposed. Alternatively, experts may develop a grading rubric that classifies tasks as exposed or unexposed based on whether they have particular traits (e.g. whether the tasks are described as “physical” and “routine”). There are also a number of other possible variations on these approaches.4
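As a deliberately simplistic illustration of the rubric-based variant, the sketch below classifies tasks as exposed or unexposed from coarse trait tags. The traits and the rule are hypothetical; published rubrics are considerably more detailed.

```python
# Hypothetical rubric: classify a task as "exposed" from coarse trait tags.
# This only shows the shape of the approach, not any published rubric.

def rubric_exposed(traits):
    # Example rule: non-physical tasks centred on information processing
    # are treated as exposed.
    return "physical" not in traits and "information" in traits

print(rubric_exposed({"information", "routine"}))  # True
print(rubric_exposed({"physical", "routine"}))     # False
```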

Drawing on existing task descriptions is appealing, because it avoids the need to collect new data or perform costly experiments. As a result, these studies are often able to report exposure estimates for a large number of jobs across the economy. This can help identify broad patterns and macro-level findings that would not be clear if only a handful of occupations were considered.

How accurate are current exposure estimates?

To some extent, these exposure studies achieve breadth by sacrificing accuracy. We cannot expect perfect accuracy from estimates that are based only on descriptions of tasks.

The methodologies applied in these studies, particularly the earliest studies, have attracted a number of critiques. It has also been noted that different exposure studies produce conflicting estimates, which implies that at least some estimates must be substantially inaccurate.

On the other hand, some patterns have emerged across recent exposure studies. Arguably, some of them are also beginning to be empirically validated. For example, recent studies have consistently found that higher-wage work in the US is more exposed to AI. This finding matches a pattern in US adoption data: on average, industries with higher AI adoption rates also have higher average wages.

Ultimately, we do not yet know exactly how accurate description-based exposure estimates are or can become.

Measuring Task-Level Worker Productivity Impacts

A common alternative approach to automation evaluation is to measure the task-level productivity impacts of AI systems on workers. Using this method, researchers randomly assign access to an AI system to some workers but not to others. They then measure how much more or less productively workers with access to the AI system can perform various tasks compared to workers without access. 

Unlike description-based exposure estimates, these worker productivity impact estimates are based on empirical observation. They also attempt to provide information about exactly how useful AI is for a given task, rather than simply classifying a task as “exposed” or “not exposed.”

For example, one study reported a 40% time saving and an 18% boost in quality on professional writing tasks. Another reported a 55.8% time saving for software developers working on a coding task using GitHub Copilot.5
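As a rough illustration of where headline figures like these come from, the sketch below performs the core treatment-versus-control comparison on invented completion times. Real studies also measure output quality and report proper statistical inference.

```python
# Illustrative treatment/control comparison from a task-level productivity
# experiment. Completion times (in minutes) are invented.

from statistics import mean

control_minutes = [52, 47, 61, 55, 49]   # workers without access to the AI system
treated_minutes = [31, 36, 28, 40, 33]   # workers randomly given access

time_saving = 1 - mean(treated_minutes) / mean(control_minutes)
print(f"Average time saving from AI access: {time_saving:.0%}")
```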

Ultimately, these experiments can offer more reliable and fine-grained information about how useful AI is for performing a specific task. The chief limitation of these experiments, however, is that they can be costly to design and run, particularly if they are implemented in real-world work settings. As a result, unlike description-based exposure estimates, they are typically applied to individual occupations and small sets of tasks. They are therefore more limited in scope and do not provide the broad economy-wide insights that can be captured by occupational exposure studies.

Limitations of Existing Methods

These two automation evaluation methods both have distinct strengths and weaknesses. Description-based exposure studies sacrifice accuracy for breadth, while worker productivity studies offer greater accuracy but on a smaller scale. Researchers deciding between these methods therefore need to consider their priorities in light of a significant trade-off between accuracy and breadth.

While the two methods have distinct limitations, there is a third category of limitations that applies to both methods. Because of these limitations, neither method can be used to directly predict the impact of AI on real-world variables such as wages, employment, or growth.

In particular, neither approach can effectively predict or account for:

  • Barriers to AI adoption
  • Changes in demand for workers’ outputs
  • The complexity of real-world jobs
  • Future AI progress
  • New tasks and new ways of producing the same outputs

To accurately predict real-world impacts, further evidence and analysis are ultimately needed.

Neither method accounts for barriers to AI adoption or for changes in demand for workers’ outputs

Ultimately, these methods can only predict whether AI has the potential to affect an occupation in some significant way. They do not tell us that AI actually will have a significant impact. They also do not tell us whether the impact will be positive or negative for workers.

For example, if we learn that AI can boost productivity in some occupation, this does not mean that it actually will be widely adopted by that occupation within any given time frame. There may be important barriers that delay adoption, such as the need for employee training, process adjustments, or costly capital investments.

Even if AI does boost productivity within an occupation, the implications for wages and employment are not necessarily clear. For example, if copy editors become more productive, this may allow them to earn more money by completing more assignments. However, it may also cause them to earn less, since the amount they earn per assignment could decline as the overall supply of copy-editing grows. The net effect will depend, in part, on how much demand for copy-editing rises as prices fall. However, neither exposure studies nor worker productivity impact studies tell us anything about impacts on the demand for copy-editing services.
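The arithmetic below makes this dependence concrete, as a sketch only: whether earnings rise or fall hinges on how far demand expands relative to the editor’s extra capacity and the fall in prices. All of the figures are invented.

```python
# Invented figures: how a copy editor's weekly earnings could rise or fall
# after a productivity boost, depending on how much demand grows as prices fall.

baseline_jobs = 10        # assignments completed per week before AI
baseline_price = 100.0    # earnings per assignment before AI
capacity_gain = 0.5       # AI lets the editor complete 50% more assignments
price_drop = 0.3          # per-assignment price falls 30% as supply expands

def weekly_earnings(demand_growth):
    # The editor completes as many assignments as demand supports,
    # up to her expanded capacity.
    jobs = min(baseline_jobs * (1 + capacity_gain),
               baseline_jobs * (1 + demand_growth))
    return jobs * baseline_price * (1 - price_drop)

print(weekly_earnings(demand_growth=0.6))  # strong demand response: 1050.0 (baseline was 1000)
print(weekly_earnings(demand_growth=0.1))  # weak demand response:    770.0 (baseline was 1000)
```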

Neither method fully accounts for the complexity of real-world jobs

For the most part, both of the automation evaluation methods I have discussed treat jobs as collections of well-defined, isolated tasks.6 The evaluations consider how useful AI is for individual tasks, with the hope that researchers can easily make inferences about how AI will affect occupations that contain those tasks.

However, this approach overlooks the nuanced reality of jobs. Many occupations actually involve a complex web of interrelated tasks, interpersonal interactions, and contextual decisions. 

For example, even if an AI system can perform most of the individual tasks that are considered to be part of a particular worker’s job, this does not necessarily imply that the job can be automated to the point that the work is completely replaced by the technology. Furthermore, researchers cannot reliably judge how AI will impact a worker’s overall productivity if they do not understand how the worker’s workflow or role will shift in response to the adoption of AI.

Neither method accounts for future AI progress

The economic impact of AI will partly depend on how existing AI capabilities are applied throughout the economy. However, the further ahead we look, the more these impacts will depend on what new AI capabilities are developed.

Empirical worker productivity studies can only measure the impact of existing AI systems. Exposure studies typically ask analysts to judge how exposed various tasks are to existing AI capabilities. It will inevitably be harder to estimate exposure to future capabilities, when we do not yet know what these capabilities will be.

Neither method accounts for new tasks and new ways of producing the same outputs

The introduction of new technologies like the electric lightbulb or digital camera did not automate work by performing the same tasks that workers had previously performed in order to light a gas lamp or develop a photograph in a darkroom. Instead, these technologies completely changed the set of tasks that a worker needed to perform in order to produce the same or better-quality output (e.g. a brightly lit street lamp or a photograph).

These historical examples suggest that we cannot necessarily assume that a job will remain immune to significant changes just because AI is not helpful for performing the tasks it currently involves.

When considered on a broad, historical scale, technological progress does not only allow existing tasks to be performed more efficiently. It also makes entirely new tasks possible. Existing approaches to automation evaluations do little to help predict or understand the implications of these new tasks.

Towards Improved Evaluations

Below, I discuss a few ways in which evaluations could be improved to overcome some of these trade-offs and limitations. These are: 

  • Running large-sample worker productivity studies
  • Piloting evaluations of AI performance on worker tasks
  • Modelling additional economic variables
  • Measuring automation impacts in real-world settings (perhaps leveraging the “staged release” of new AI systems)

Running large-sample worker productivity studies

One way to overcome the trade-off between breadth and accuracy in automation evaluations would be to simply invest in much larger-scale worker productivity studies.

For example, a large-scale study (or set of studies) could attempt to empirically measure productivity impacts across a representative sample of economically relevant tasks and occupations. If the sample is large enough, it could offer the same kinds of insights about economy-wide patterns and trends that exposure studies aim to offer — but with greater accuracy and precision.

While the costs involved would be significant, it is possible that the insights produced would warrant these costs.

Piloting evaluations of AI performance on worker tasks

Another, potentially more scalable, approach to achieving both breadth and empirical grounding could be to pilot evaluations of AI performance on a wide variety of worker tasks. These evaluations would involve having AI systems perform tasks that workers currently perform and then assessing how helpful the technology has been in terms of reducing task time or improving the quality of task outputs. 

In practice, this approach would involve treating automation evaluations the same way researchers treat evaluations on performance-based benchmarks for other AI capabilities that are not directly tied to work. The goal of this approach would be to directly assess what AI systems can do, rather than assessing (as in the case of worker productivity studies) what workers can do with AI systems.

As an illustrative example, an evaluation focused on the specific task of writing professional emails could judge how well a model performs at drafting a representative sample of professional emails (e.g. a polite rejection email, a scheduling email, a workshop invitation). Evaluation scores could then be either considered directly or converted into binary judgements about whether or not a task is “exposed.” They could even be used to judge whether or not a system is technically capable of reliably automating a task.
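The sketch below illustrates the final step of such an evaluation: converting graded performance scores on sampled task instances into an exposure judgement. The scores and threshold are invented, and the hard part (reliably producing those scores in the first place) is assumed away, as the challenges listed below make clear.

```python
# Invented scores: converting graded AI performance on sampled task instances
# into a binary "exposed" judgement. Producing reliable scores is the hard part
# and is assumed away here.

from statistics import mean

email_task_scores = {
    "polite rejection email": 0.82,
    "scheduling email": 0.91,
    "workshop invitation": 0.77,
}

EXPOSURE_THRESHOLD = 0.7   # hypothetical cut-off for "meaningfully helpful"

task_score = mean(email_task_scores.values())
print(f"Mean score {task_score:.2f} -> exposed: {task_score >= EXPOSURE_THRESHOLD}")
```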

Of course, there are significant technical challenges associated with implementing this approach. These include:

  • Translating task descriptions into clear prompts to an AI system7
  • Developing and validating efficient and reliable methods for rating AI system performance on a variety of tasks8,9

Despite these challenges, running initial pilots to develop and apply these evaluations could be a worthwhile experiment. If the early pilots are promising, then the approach could be scaled to examine a broader and more representative set of occupational tasks. 

Having the technical infrastructure to run evaluations of AI performance on worker tasks could become increasingly important as the automation capabilities of new systems advance.

Modelling additional economic variables

Beyond the breadth/accuracy trade-off, a shared limitation of the methods I have discussed so far is that neither can account for additional economic variables (such as the elasticity of demand for a worker’s outputs) that will help to determine real-world automation impacts.

One path forward here seems to be for researchers to attempt to estimate some of these additional variables and integrate those estimates into economic models.

It is not clear how far researchers can follow this path, since many of the relevant variables are themselves difficult to estimate. However, there is some early work that moves in this direction. For instance, a recent pioneering paper from a team at MIT sets out to go “beyond exposure” to evaluate which tasks it is currently cost-effective to automate with computer vision.
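To give a flavour of what moving “beyond exposure” involves, the stylised check below compares the labour cost an AI system could displace with the cost of deploying it. This is not the cited paper’s model; all figures are invented.

```python
# Stylised cost-effectiveness check, NOT the cited paper's model.
# All figures are invented for illustration.

annual_wage_bill_for_task = 120_000.0   # wages attributable to the task each year
ai_system_annual_cost = 90_000.0        # development, integration and running costs
ai_task_coverage = 0.8                  # share of the task the system can actually perform

displaced_labour_cost = annual_wage_bill_for_task * ai_task_coverage
cost_effective = displaced_labour_cost > ai_system_annual_cost
print(f"Displaced labour cost: {displaced_labour_cost:,.0f}; "
      f"cost-effective to automate now: {cost_effective}")
```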

Measuring automation impacts in real-world settings

Another important limitation of the methods I have discussed is that they study tasks in isolation and are not capable of addressing the complexity of real-world jobs. For example, estimates of task-level productivity impacts do not allow us to infer how a worker’s overall productivity will change.

Experiments that vary AI access across teams of workers within real-world firms, over long periods of time, could potentially allow us to overcome this limitation. Researchers could observe how AI increases the overall productivity and performance of both workers and teams. In addition, these studies could potentially allow us to begin to observe how AI affects demand for labour.

These experiments would be costly to run and would require complicated negotiations with individual firms. Fortunately, however, there may be another approach to studying real-world impacts that would be more feasible.

Specifically, researchers could leverage the staged release of AI systems. Many AI companies already deploy frontier AI systems through “staged release” processes, which often involve giving different actors access to their systems at different points in time. With the cooperation of AI companies and other firms, researchers could take advantage of the variation in adoption during staged releases to estimate the effect of AI on productivity and labour demand in the real world.10,11 Because some companies will get access earlier than others, staged releases enable comparisons between companies with access and those without access.
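One natural way to analyse this kind of variation is a difference-in-differences comparison between firms that received access early and those that received it later. The sketch below uses invented productivity figures and is only meant to show the shape of the comparison, not a full econometric design.

```python
# Invented figures: difference-in-differences comparison exploiting a staged release.
# A real study would use firm-level panel data and a proper econometric specification.

early_access = {"before": 100.0, "after": 112.0}   # mean productivity index, firms with early access
late_access  = {"before": 101.0, "after": 104.0}   # firms not yet given access

did_estimate = ((early_access["after"] - early_access["before"])
                - (late_access["after"] - late_access["before"]))
print(f"Estimated effect of access on productivity: {did_estimate:.1f} index points")
```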

Conclusion

Automation evaluations could prove to be a critical tool for policymakers as they work to minimise the societal harms and maximise the economic benefits of powerful AI systems. Predicting how those systems will affect the labour market is a challenge, and current methods for evaluation have limitations. It is important for policymakers and researchers to be mindful of these limitations and invest in post-deployment monitoring of impacts as well. However, researchers can also improve their predictions by running large-sample worker productivity studies, piloting evaluations of AI performance on worker tasks, modelling additional economic variables, and measuring automation impacts in real-world settings.

The author would like to thank Ben Garfinkel, Stephen Clare, John Halstead, Iyngkarran Kumar, Daniel Rock, Peter Wills, Anton Korinek, Leonie Koessler, Markus Anderljung, Ben Bucknall, and Alan Chan for helpful conversations, comments, and feedback on this work.

Sam Manning can be contacted at sa*********@go********.ai
