Removing Hazardous Knowledge from AIs

Project summary

This project aims to remove hazardous CBRN and cyber knowledge from AI models. First, researchers will develop CBRN and cyber knowledge evaluations that measure precursors to dangerous behavior but are not themselves info-hazardous. Next, they will develop unlearning techniques to remove this knowledge. Finally, the research will be communicated to NIST to inform its standards for dual-use foundation models.

What are this project's goals and how will they be achieved?

This project’s goal is to remove hazardous CBRN and cyber knowledge from AI models. This project includes:

Developing datasets of CBRN and cyber knowledge that contain precursors to dangerous behavior but are not themselves info-hazardous. (For example, knowledge of reverse genetics isn’t hazardous in itself but is required to do more hazardous things.)
Developing unlearning techniques to remove this precursor knowledge.

To develop the dataset, researchers will be working with a large group of academics, consultants, and companies, including cybersecurity researchers from Carnegie Mellon University and biosecurity experts from MIT (e.g., Kevin Esvelt’s lab).

To develop the unlearning techniques, researchers will experiment with many different methods. Each method needs to (1) remove the relevant precursors to hazardous knowledge and (2) preserve general domain knowledge that is not hazardous.

How will this funding be used?

The funding will be used to pay consultants and contractors for dataset collection.

Who is on the team and what's their track record on similar projects?

Alex Pan, one of the research leads, is a PhD student at UC Berkeley in Jacob Steinhardt’s lab (https://aypan17.github.io/). He has published two first-authored papers at top-tier ML conferences (ICML and ICLR) on reward misspecification and measuring the safety and ethical behavior of LLM agents.

The other research lead is Nat Li, a third-year undergraduate at UC Berkeley who has previously co-authored two ML papers (1, 2). I will also help advise this project and have a long track record of empirical AI safety research (3).

What are the most likely causes and outcomes if this project fails? (premortem)

One of the main risks is that the consultants and contractors cannot be directed effectively. If they produce general bio/chem knowledge rather than knowledge that is specifically a precursor to hazardous capabilities, the resulting dataset won’t be useful for unlearning.

What other funding is this person or project getting?

None.

Main points in favor of this grant

Removing hazardous capabilities from models would greatly help reduce AI x-risk from malicious use and unilateral actors. Alex is a researcher with a strong track record who is interested in AI safety and has done previous AI safety research. The timing is right: the recent Executive Order tasked NIST with helping develop standards and regulations on “dual-use foundation models,” so research done now has a much higher likelihood of helping shape regulation.

Donor's main reservations

This is a relatively complex project with many moving parts. It’s crucial that the project is executed well on a relatively short timeline.

Process for deciding amount

The researchers estimated that this is the total amount needed for the dataset. I have reviewed and approved the budget.

Conflicts of interest

I will be helping advise this project.