Evaluating Predictions of Model Behaviour
A critical AI safety goal is understanding how new AI systems will behave in the real world. We can assess our understanding by trying to predict the results of model evaluations before running them.
Alan Chan
GovAI research blog posts represent the views of their authors, rather than the views of the organisation.
Introduction
Some existing AI systems have the potential to cause harm, for example through the misuse of their capabilities, through reliability issues, or through systemic bias. As AI systems become more capable, the scale of potential harm could increase. In order to make responsible decisions about whether and how to deploy new AI systems, it is important to be able to predict how they may behave when they are put into use in the real world.
One approach to predicting how models will behave in the real world is to run model evaluations. Model evaluations are tests for specific model capabilities (such as the ability to offer useful instructions on building weapons) and model tendencies (such as a tendency to exhibit gender bias when rating job applications). Although model evaluations can identify some harmful behaviours, it can be unclear how much information they provide about a model's real-world behaviour. The real world is often different from what can be captured in a model evaluation. In particular, once a model is deployed, it will be exposed to a much wider range of circumstances (e.g. user requests) than it can be exposed to in the lab.
To address this problem, I suggest implementing prediction evaluations to assess an actor’s ability to predict how model evaluation results will translate to a broader range of situations. In a prediction evaluation, an initial set of model evaluations is run on a model. An actor — such as the model evaluation team within an AI company — then attempts to predict the results of a separate set of model evaluations, based on the initial results. Prediction evaluations could fit into AI governance by helping to calibrate trust in model evaluations. For example, a developer could use prediction evaluations internally to gauge whether further investigation of a model’s safety properties is warranted.
More work is required to understand whether, how, and when to implement prediction evaluations. Actors that currently engage in model evaluations could experiment with prediction evaluations to make progress on this work.
Prediction evaluations can assess how well we understand model generalisation
Deciding when it is safe to deploy a new AI system is a crucial challenge. Model evaluations – tests conducted on models to assess them for potentially harmful capabilities or propensities – can inform these decisions.1 However, models will inevitably face a much wider range of conditions in the real world than they face during evaluations. For example, users often find new prompts (which evaluators never tested) that cause language models such as GPT-4 and Claude to behave in unexpected or unintended ways.2
We therefore need to understand how model evaluation results generalise: that is, how much information model evaluations provide about how a model will behave once deployed.3 Without an understanding of generalisation, model evaluation results may lead decision-makers to mistakenly deploy models that cause much more real-world harm than anticipated.4
We propose implementing prediction evaluations5 to assess an actor’s understanding of how model evaluation results will generalise. In a prediction evaluation, an initial set of model evaluations is run on a model and provided to an actor. The actor then predicts how the model will behave on a distinct set of evaluations (test evaluations), subject to limits on what the actor knows (e.g. about details of the test evaluations) and what it can do while formulating its prediction (e.g. whether it can run the model). Finally, a judge grades the actor’s prediction against the results of running the test evaluations. The more highly the actor scores, the more likely it is to have a strong understanding of how its model evaluation results will generalise to the real world.6
Figure 1 depicts the relationship between predictions, prediction evaluations, model evaluations, and understanding of generalisation.
Figure 1: Prediction evaluations indirectly assess the level of understanding that an actor has about how its model evaluations generalise to the real world. The basic theory is: If an actor cannot predict how its model will perform when exposed to an additional set of “test evaluations”, then the actor also probably cannot predict how its model will behave in the real world.
Prediction evaluations could support AI governance in a number of ways. A developer could use the results of internally run prediction evaluations to calibrate their trust in their own model evaluations. If a model displays unexpectedly high capability levels in some contexts, for example, the developer may want to investigate further and ensure that their safety mitigations are sufficient.
A regulator could also use the results of (potentially externally run) prediction evaluations to inform an array of safety interventions. For example, consider the context of a hypothetical licensing regime for models, in which developers must receive regulatory approval before releasing certain high-risk models. If a model developer performs poorly on prediction evaluations, their claims about the safety of a model may be less credible. A regulator could take into account this information when deciding whether to permit deployment of the model. If the developer’s predictions are poor, then the regulator could require it to evaluate its model more thoroughly.
How to run a prediction evaluation
In the appendix to this post, we provide more detail about how to run a prediction evaluation. Here, we provide a brief overview. First, the administrator of the prediction evaluation should select the model evaluations. Second, the administrator should prevent the actor from running the test evaluations when making the prediction. Finally, the administrator needs to establish standards for good prediction performance.
An example of running a prediction evaluation
Our example here focuses on a regulator in the context of a hypothetical licensing regime, in which developers of certain high-risk models require regulatory approval before these models can be deployed. Other potential examples to explore in future work could include a developer running prediction evaluations internally, a regulator running prediction evaluations on itself to assess its own understanding, or some actor running prediction evaluations on a model user (e.g. a company that uses models at a large scale).
Suppose that a developer submits a model and its evaluations to a regulator for approval. The regulator could administer a prediction evaluation to the developer through a process similar to the following:
- Based on the initial model evaluations that the developer submitted, the regulator builds a set of test evaluations. The test evaluations could include a wider variety of inputs than the initial model evaluations, but still feature the same category of task.
- The regulator puts the developer in a controlled, monitored environment, such that the developer cannot run the test evaluations on the model.
- The regulator provides the developer with a detailed description of the test set evaluations.
- For each test evaluation, the regulator asks the developer to predict whether the model will succeed at the task (the developer provides a “yes” or “no” answer).
- The developer provides a prediction to the regulator.7
- The regulator compares the prediction with the actual behaviour of the model on the test evaluations (a simple scoring sketch follows below).8
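To make the final comparison step concrete, here is a minimal scoring sketch in Python. It compares a set of yes/no predictions against the observed test-evaluation outcomes and estimates how likely such a score would be under random guessing. The task identifiers and results are hypothetical placeholders, not part of any real evaluation suite.

```python
# A minimal scoring sketch (hypothetical task IDs and results).
from math import comb

predictions = {"task-01": "yes", "task-02": "no", "task-03": "yes",
               "task-04": "no", "task-05": "yes", "task-06": "no"}
observed    = {"task-01": "yes", "task-02": "yes", "task-03": "yes",
               "task-04": "no", "task-05": "no", "task-06": "no"}

n = len(predictions)
hits = sum(predictions[t] == observed[t] for t in predictions)
accuracy = hits / n

# One-sided binomial calculation: the probability of getting at least this many
# correct answers by guessing at random (50% chance per yes/no question).
p_random_guessing = sum(comb(n, k) for k in range(hits, n + 1)) / 2**n

print(f"accuracy = {accuracy:.2f}")
print(f"probability of doing at least this well by chance = {p_random_guessing:.3f}")
```

In practice, the test set would contain many more evaluations than this, and how the actor’s predictions should be reported is itself one of the open questions discussed below.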
Consider a case in which the developer does not perform much better than chance on the prediction evaluation (i.e. achieves close to the 50% accuracy expected from random guessing on yes/no questions). Such performance would be evidence of a poor understanding of how the model’s behaviour generalises. As a result, greater caution from the regulator may be justified. The regulator’s response to the poor performance could vary in severity depending on the potential harm the model could cause. Some options include:
- Requesting more extensive model evaluations before deployment
- Subjecting deployment of the model to additional requirements, such as more stringent monitoring
- Blocking deployment or further training until specified conditions are met, such as good performance on subsequent prediction evaluations
Further research is required to understand whether and when any of these options would be appropriate, and what other options exist.
Limitations and open questions
There is still a great deal of uncertainty about whether it is worthwhile to run prediction evaluations. For example, suppose that a developer has run an initial set of model evaluations but still is not confident about how well these model evaluations will generalise to the real world. A comparatively straightforward strategy for becoming more confident would be to simply run a wider range of model evaluations, without making any explicit predictions. If these additional model evaluations also suggest that the model is safe, then perhaps the developer would be justified in believing that its model will ultimately behave safely in the real world, even if some of the specific results were surprising.
Furthermore, prediction accuracy may not vary enough — between the actors who are making the predictions or between the models that the predictions concern — for it to be worthwhile to assess prediction accuracy in individual cases. For example, it may be that people generally cannot predict the results of model evaluations reliably at all. Although this general result would be useful to know, it would also reduce the value of continuing to perform prediction evaluations in individual cases.
There are also various practical questions that will need to be answered before prediction evaluations can be run and used to inform decisions. These open questions include:
1. How feasible is it to predict behaviour on model evaluations without running the model — and how does feasibility change with information or action limits on the actor?
2. How should we limit what the actor knows and can do in a prediction evaluation?
3. How should the initial and test evaluations be chosen?
4. How should the results of a prediction evaluation be reported? For example, should the actor provide different predictions corresponding to different amounts of compute used?
If prediction evaluations are ultimately to be built into a broader AI governance regime, then a number of additional questions arise.
5. Who should administer prediction evaluations?
6. Which actors should undergo prediction evaluations?
7. How can prediction evaluations incentivise improvements in understanding?
8. What is the role of prediction evaluations in an overall evaluation process?
Fortunately, there are immediate opportunities to make progress on these questions. For instance, to tackle questions 1-4, actors who already develop and run model evaluations could also run prediction evaluations internally. In such low-stakes experiments, it should be relatively easy to vary the amount of time, information, or compute allowed for making predictions and to experiment with different reporting procedures.9
Conclusion
To make informed development and deployment decisions, decision-makers need to be able to predict how AI systems will behave in the real world. Model evaluations can help to inform these predictions by showing how AI systems behave in particular circumstances.
Unfortunately, it is often unclear how the results of model evaluations generalise to the real world. For example, a model may behave well in the circumstances tested by a particular model evaluation, but then behave poorly in other circumstances it encounters in the real world.
Prediction evaluations may help to address this problem, by testing how well an actor can predict how model evaluations will generalise to some additional circumstances. Scoring well on a prediction evaluation is evidence that the actor is capable of using the model evaluations to make informed decisions.
However, further work is needed to understand whether, how, and when to use prediction evaluations.
The author of this piece would like to thank the following people for helpful comments on this work: Ross Gruetzemacher, Toby Shevlane, Gabe Mukobi, Yawen Duan, David Krueger, Anton Korinek, Malcolm Murray, Jan Brauner, Lennart Heim, Emma Bluemke, Jide Alaga, Noemi Dreksler, Patrick Levermore, and Lujain Ibrahim. Thanks especially to Ben Garfinkel, Stephen Clare, and Markus Anderljung for extensive discussions and feedback.
Alan Chan can be contacted at alan.chan@governance.ai
Appendix
Running a prediction evaluation
This section describes each step in a prediction evaluation in more detail.
Selecting the model evaluations
The first step is choosing the initial and test set evaluations.
Since the science of model evaluations is still developing, it is not obvious which specific evaluations should be used for prediction evaluations. One hypothesis is that they should target specific use cases, such as ways to misuse models for cyberattacks. Such targeting may be desirable because an understanding of generalisation in one use case may not transfer to another; if so, it is especially important to study generalisation directly in high-stakes use cases. On the other hand, it may be easier to work in simpler, but not necessarily realistic, environments. Such environments may provide clearer insights into generalisation,10 but the insights may not be immediately relevant to any deployment setting.
When constructing test evaluations that differ from the initial evaluations, one should try to account for the range of conditions the model might face in the real world. For example, test evaluations may use a more diverse range of inputs to the model. When evaluating whether the model can complete complex tasks, it may also be important to vary how the environment responds to the model’s actions. One could vary the tools (e.g. access to web search) available to models in the initial and test evaluations to simulate how users may augment models with different tools following deployment.11 Initial and test evaluations could even assess the completion of different tasks. For instance, we may be interested in a model’s ability to assist in the creation of chemical weapons: test evaluations could focus on a different set of chemical weapons than the initial evaluations.
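As one concrete illustration of this kind of split, the sketch below derives test evaluations from an initial evaluation set by widening the range of inputs and tool configurations while keeping the task category fixed. The EvalItem structure, the paraphrasing helper, and the tool names are hypothetical placeholders rather than any existing evaluation framework.

```python
# A hypothetical sketch of deriving test evaluations from initial evaluations
# by varying inputs and available tools within the same task category.
from dataclasses import dataclass, replace
from itertools import product

@dataclass(frozen=True)
class EvalItem:
    task: str            # task category, shared between initial and test sets
    prompt: str          # input given to the model
    tools: tuple = ()    # tools available to the model during the evaluation

initial_evals = [
    EvalItem(task="misuse-uplift", prompt="[initial prompt 1]"),
    EvalItem(task="misuse-uplift", prompt="[initial prompt 2]"),
]

def paraphrased_variants(prompt: str) -> list[str]:
    # Placeholder: in practice these could be human-written or model-generated
    # rewordings that preserve the underlying request.
    return [f"{prompt} (reworded variant A)", f"{prompt} (reworded variant B)"]

tool_settings = [(), ("web_search",), ("web_search", "code_interpreter")]

# Test evaluations keep the task category but cover a wider input and tool space.
test_evals = [
    replace(item, prompt=variant, tools=tools)
    for item in initial_evals
    for variant, tools in product(paraphrased_variants(item.prompt), tool_settings)
]

print(f"{len(initial_evals)} initial evaluations -> {len(test_evals)} test evaluations")
```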
Preventing the actor from running the test evaluations
For a prediction evaluation to provide good evidence of understanding of generalisation, we need to prevent the actor from simply running the test evaluations and reporting the results. The incentive to cheat in this way may be strongest when a regulator subjects a model developer to a prediction evaluation. Prediction evaluations that a developer runs internally may not suffer from this problem as significantly.
Ideally, we want the actor to predict accurately only because they have gained some insight12 into generalisation. To achieve this goal, we can limit the actor’s knowledge and possible actions. Future research could investigate the effectiveness of the following possibilities:
- Limiting the actor’s knowledge:
  - Only providing the actor a high-level description of the test set evaluations
  - Describing to the actor some but not all of the test set evaluations
  - Limiting the number of initial evaluations provided to the actor
- Limiting the actor’s possible actions:
  - Limiting the extent to which the actor can interact with the model
  - Limiting the amount of time an actor has to provide predictions
  - Limiting the amount of computing power (compute) the actor can use to make the prediction
Crucially, the limitations should not be so strict that it becomes impossible to make accurate predictions even for well-understood models.
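As an illustration, the limits above could be recorded as an explicit configuration for each prediction evaluation, which would also make it easier to report what the actor knew and could do when the prediction was made. The field names below are illustrative choices, not a proposed standard.

```python
# A hypothetical configuration recording the limits placed on the actor.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictionEvalLimits:
    test_set_description: str = "high-level"     # "high-level", "partial", or "full"
    initial_evals_shared: Optional[int] = None   # None = all initial results shared
    model_queries_allowed: int = 0               # how many times the actor may query the model
    time_limit_hours: Optional[float] = None     # None = no time limit
    compute_budget_flop: Optional[float] = None  # None = no compute limit

# Example: a fairly strict setting for a regulator-administered evaluation.
strict_limits = PredictionEvalLimits(test_set_description="high-level",
                                     initial_evals_shared=50,
                                     model_queries_allowed=0,
                                     time_limit_hours=8)
```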
Assessing prediction evaluations
The final step is to assess the accuracy of the predictions and make decisions based on the results. A key issue is defining what counts as good prediction performance. It may prove difficult to make accurate predictions even after extensive efforts to understand a model’s behaviour. In addition, different standards may be appropriate for different applications of prediction evaluations (e.g. a model developer internally running prediction evaluations vs. a regulator administering prediction evaluations to a model developer).
A potential standard could require the actor to achieve higher prediction accuracy than some reference class. For example, consider an independent control group whose members have no detailed knowledge of the model beyond basic information such as training compute and model size. An actor that predicts worse than this control group likely does not have a strong understanding of how model evaluation results generalise.
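The sketch below illustrates one simple version of this reference-class comparison, using made-up accuracy figures for the actor and for a hypothetical control group.

```python
# Hypothetical accuracies on the same prediction evaluation.
control_group_accuracies = [0.55, 0.61, 0.52, 0.58, 0.63]   # control-group members
actor_accuracy = 0.57

control_mean = sum(control_group_accuracies) / len(control_group_accuracies)
beaten = sum(actor_accuracy > a for a in control_group_accuracies)

print(f"actor: {actor_accuracy:.2f}, control-group mean: {control_mean:.2f}")
print(f"actor beats {beaten} of {len(control_group_accuracies)} control-group members")
# If the actor does no better than the control group, its detailed knowledge of the
# model has not translated into a better ability to predict how results generalise.
```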
In the context of a decision about model deployment, the direction in which a prediction is inaccurate may be a key consideration. Underestimating a model’s capabilities (or, equivalently, overestimating its degree of safety) may be more costly than overestimating its capabilities (underestimating its degree of safety), because deploying a model whose capabilities have been underestimated could result in greater societal harm.
A regulator could penalise underestimation more heavily, but in so doing may create strong incentives to overestimate a model’s capabilities. Ideally, prediction evaluations should incentivise efforts to gain understanding. One potential solution could be to assess the work that actors produce to justify their predictions, in addition to the predictions themselves: estimates based on faulty or vague reasoning could be judged inferior to the same estimates supported by sound reasoning. Alternatively, the regulator could try to identify and penalise consistent overestimation across a number of different prediction evaluations.
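As one hedged illustration of such a scoring choice, the sketch below penalises underestimation of capabilities more heavily than overestimation. The penalty weights are arbitrary, and, as noted above, any rule of this kind would likely need to be paired with scrutiny of the actor’s reasoning so that it does not simply incentivise overestimation.

```python
# A hypothetical asymmetric scoring rule for yes/no capability predictions.
UNDERESTIMATE_PENALTY = 3.0   # predicted "no" (cannot do the task), model succeeded
OVERESTIMATE_PENALTY = 1.0    # predicted "yes" (can do the task), model failed

def penalty(predicted: str, observed: str) -> float:
    if predicted == observed:
        return 0.0
    return UNDERESTIMATE_PENALTY if observed == "yes" else OVERESTIMATE_PENALTY

predictions = {"task-01": "no", "task-02": "yes", "task-03": "no"}
observed    = {"task-01": "yes", "task-02": "yes", "task-03": "no"}   # hypothetical results

total_penalty = sum(penalty(predictions[t], observed[t]) for t in predictions)
print(f"total penalty = {total_penalty}")   # lower is better
```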
Footnotes
1 - Evaluations can happen both before the initial deployment of a model and over the course of deployment, such as when the model is updated.
2 - Even when extensive effort is spent on evaluating a model before deployment, problems can still occur shortly after deployment. For example, GPT-4 was jailbroken shortly after release. As well, the discovery of chain-of-thought prompting after the release of GPT-3 led to significant, unexpected performance improvements on numerous tasks.
3 - To understand generalisation, one must know both (1) the range of possible conditions in the real world that the model will encounter and (2) how the model’s behaviour under those possible conditions would differ from the model evaluation results. We focus on the second problem in this work. For more on the first problem, see “Rethinking Model Evaluation as Narrowing the Socio-Technical Gap” and “Sociotechnical Safety Evaluation of Generative AI Systems”.
4 - A natural question might be, “Why can we not simply evaluate models on all environments of interest?” Unfortunately, there is likely too much variation among possible real-world conditions to allow for exhaustive evaluations. Ideally, a scientific understanding of generalisation could enable accurate predictions for model behaviour without exhaustive experiments.
5 - The blog post “Towards understanding-based safety evaluations” briefly discusses this idea and uses the term prediction-based evaluations.
6 - Making good predictions also depends upon understanding the evaluation setup, especially since model evaluation results can be extremely sensitive to the particular setup (e.g. see “What's going on with the Open LLM Leaderboard?”).
7 - Potential examples of techniques developers may use in making their predictions include scaling laws and explanations.
8 - Either the regulator or the model developer could run the evaluations. If the developer is running evaluations, care must be taken to ensure that the developer reports results accurately, rather than modifying the results to be more closely aligned with their predictions.
9 - Progress is also possible on entries in the list of open questions given above. Regarding question 5, it may be important to understand whether regulators have the expertise and resources to administer prediction evaluations, given that doing so also requires running model evaluations. For question 6, other actors to explore could include deployers, users, and regulators. On question 7, one important sub-question is whether penalties for poor prediction performance can cause a model developer to reallocate resources to improving understanding. As for question 8, one interesting question is what the optimal balance of prediction evaluations to model evaluations should be, given a fixed budget.
10 - E.g. the paper “The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'” uses a simple, controlled environment to show that language models trained on statements “A is B” tend not to learn “B is A”.
11 - E.g. see AutoGPT and “Toolformer: Language Models Can Teach Themselves to Use Tools”.
12 - E.g. they have developed some scientific theory.