Preventing AI Misuse: Current Techniques

There are a range of methods for preventing the misuse of AI models. None are wholly reliable.


Ben Clifford

This post aims to give policymakers an overview of available methods for preventing the misuse of general-purpose AI systems, as well as the limitations of these methods.

GovAI research blog posts represent the views of their authors, rather than the views of the organisation.

Introduction

As general-purpose AI models have become more capable, their potential for misuse has grown as well. Without safety measures in place, for example, many models can now be used to create realistic fake media. Some experts worry that future models could facilitate more extreme harm, including cyberattacks, voter manipulation, or even attacks with biological weapons.

Governments and companies increasingly recognise the need to evaluate models for dangerous capabilities. However, identifying dangerous capabilities is only a first step toward preventing misuse. 

Developers must ultimately find ways to remove dangerous capabilities or otherwise limit the ability of users to apply them.1 To this end, developers currently implement a number of misuse-prevention techniques. These include fine-tuning, content filters, rejection sampling, system prompts, dataset filtering, and monitoring-based restrictions.

However, existing misuse-prevention techniques are far from wholly reliable. It is particularly easy to circumvent them when models are “open-sourced”. Even when models are accessed through online interfaces, however, the techniques can still be fairly easily circumvented through “jailbreaking” techniques. External researchers have demonstrated, for example, that jailbroken systems can perform tasks such as crafting targeted phishing emails.

If future models do eventually develop sufficiently dangerous capabilities (such as assisting in the planning and execution of a biological attack), then it may not be responsible to release them until misuse-prevention techniques are more reliable. Therefore, improving misuse-prevention techniques could reduce both social harms and future delays to innovation.

Misuse-prevention techniques

In this section, I will briefly survey available techniques for preventing misuse. I will then turn to discussing their limitations in more depth.

Fine-tuning

Companies developing general-purpose AI systems — like OpenAI and Anthropic — primarily reduce harmful outputs through a process known as “fine-tuning”. This involves training an existing AI model further, using additional data, in order to refine its capabilities or tendencies. 

This additional data could include curated datasets of question/answer pairs, such as examples of how the model should respond to irresponsible requests. Fine-tuning data may also be generated through “reinforcement learning from human feedback” (RLHF), which involves humans scoring the appropriateness of the models’ responses, or through “reinforcement learning from AI feedback” (RLAIF). Through fine-tuning, a model can be conditioned to avoid dangerous behaviours or decline dangerous requests. For instance, GPT-4 was fine-tuned to reject requests for prohibited content, such as instructions for cyberattacks.

As with other misuse-prevention techniques, however, there are trade-offs with this approach: fine-tuning can also make models less helpful, by leading them to also refuse some benign requests.

Filters

Filters can be applied to both user inputs and model outputs. For inputs, filters are designed to detect and block users’ requests to produce dangerous content. For outputs, filters are designed to detect and block dangerous content itself.

Developers have various methods to train these filters to recognise harmful content. One option is using labelled datasets to categorise content as either harmful or benign. Another option is creating a new dataset by having humans score the harmfulness of the outputs a model produces or the outputs users send to it. Language models themselves can also be used to evaluate both inputs and outputs.

Rejection sampling

Rejection sampling involves generating multiple outputs from the same AI model, automatically scoring them based on a metric like “potential harm”, and then only presenting the highest scoring output to the user. OpenAI used this approach with WebGPT, the precursor to ChatGPT, to make the model more helpful and accurate.

System prompts

System prompts are natural language instructions to the model that are pre-loaded into user interactions and are usually hidden from the user. Research indicates incorporating directives like "ignore harmful requests" into prompts can reduce harmful outputs

Dataset filtering

Before training an AI model, developers can remove harmful data to ensure the model doesn't learn from it. For example, OpenAI removed sexual and violent content from the dataset they trained DALLE-3 on. They make use of the same ‘classifiers’ developed for filters to identify training data that should be removed. A distinct downside of this approach, however, is that it can only be applied before a model is trained.

Monitoring-based restrictions

The techniques above take a purely preventative approach. Monitoring-based restrictions instead allow developers to respond to initial cases of misuse – or misuse attempts – by limiting users’ abilities to engage in further misuse.

AI service providers can leverage automated tools like input and output classifiers to screen for misuse. For instance, if a user repeatedly makes requests that are blocked by an input filter, the user may be flagged as high-risk. They might then be sent a warning, have their access temporarily reduced, or even receive a ban. Monitoring can also help reveal new vulnerabilities and patterns of misuse, allowing policies and other safeguards to be iteratively improved.

One potential problem with monitoring is that it can compromise privacy. That said, there are ways to preserve privacy. For instance, reports about high-level patterns of misuse can be generated without company staff needing to examine individuals’ data.

How effective are the mitigations?

Unfortunately, these mitigations are currently ineffective for open-source models (i.e. models that users are allowed to download and run on their own hardware). Safeguards like filters, system prompts, and rejection sampling mechanisms can often be disabled by changing a few lines of code. Fine-tuning is also relatively trivial to reverse; one study also showed that fine-tuning for various Llama models could be undone for under $200. Additionally, once a model is open-sourced, there is no way to reliably oversee how it is being modified or used post-release. Although open-source models have a range of socially valuable advantages, they are also particularly prone to be misused.2

Putting behind an “application programming interface” (API) can help prevent the removal of safeguards. However, even when developers put their models behind an API, the safeguards can reduce, but not eliminate, misuse risks. OpenAI successfully reduced rule-breaking responses by 82% in GPT-4, but risks still remain. Anthropic similarly states that, despite interventions, their systems are not "perfectly safe".

One challenge in preventing misuse is the phenomenon of “jailbreaks”, where users bypass AI system’s safety measures to generate prohibited content. Successful jailbreaks have employed tactics such as switching to a different language to evade content filters or framing requests for dangerous information within hypothetical scenarios like a script for a play.

A stylised example of a jailbreak from Low-Resource Languages Jailbreak GPT-4

Despite efforts from AI developers, it is still relatively trivial to jailbreak models. For instance, GPT-4 was jailbroken within 24 hours of release, despite the fact that OpenAI spent 6 months investing in safety measures. Developers are actively working to patch identified vulnerabilities, but users have continuedtofind ways around restrictions, as documented on sites like JailbreakChat.

Implications

It seems likely that the benefits of current models outweigh their harms. This is in part thanks to the mitigation techniques discussed here, but also in large part due to the model's limited capabilities. For instance, the propensity of GPT-4 to make factual mistakes puts a cap on its ability to perform cyberattacks. In some cases, existing safeguards may not even be worth the costs they create for users (e.g. for instance by accidentally filtering benign content).

However, the unreliability of current misuse-prevention techniques may pose significant challenges in the future. If future models become sufficiently powerful, then even a single case of misuse, like a user building a bioweapon, could eventually be catastrophic. One possible implication is that models with sufficiently dangerous capabilities should not be released until better misuse-prevention techniques are developed.

If this implication is right, then better misuse-prevention techniques could prevent future roadblocks to innovation in addition to lowering societal risks.

Conclusion

There are a range of techniques developers can use to limit the misuse of their models. However, these are not consistently effective. If future models do develop sufficiently dangerous capabilities, then the reliability of misuse-prevention techniques could ultimately be a bottleneck to releasing them.

The author of this piece can be contacted at ben.clifford@governance.ai. Thanks to the following people for their feedback: Alan Chan, Ben Garfinkel, Cullen O’Keefe, Emma Bluemke, Julian Hazell, Markus Anderljung, Matt van der Merwe, Patrick Levermore, Toby Shevlane.

Footnotes

1 This post focuses on interventions that prevent an act of misuse from taking place. There are other techniques, such as watermarking, that instead mitigate the harms caused by misuse. Watermarking involves subtly altering the outputs of AI systems to embed identifiers that can later verify the content was AI-generated. This method can be used to combat misinformation and spam, by helping people identify it as such. Watermarking technology is most promising for mediums such as images and music. For AI-written text, adding imperceptible identifiers that are robust to removal is more technically challenging. Although researchers have created working prototypes for text watermarking, including efforts by OpenAI, they still have limited effectiveness and widespread deployment has not yet occurred.

2 See “Open-Sourcing Highly Capable Foundation Models” for a discussion of the distinctive benefits and risks associated with open-source models.

Further reading