This project forms the first phase of a larger effort to develop technical control mechanisms that ensure agentic AI systems operate within human-specified constraints.
I (Francesca) recently founded Wiser Human, an AI Safety org, through the Catalyze incubator at LISA (an AI Safety office in London). Our mission is to combine our experience in risk management with technical AI control and scalable oversight techniques to drive practical, rapid progress in AI control mechanisms in a world where timelines to advanced AI are short.
We believe the first step in achieving this is to identify the pathways most likely to lead to harm in emerging agentic AI systems, evaluate how effective existing safeguards are in mitigating these risks, and identify which new control mechanisms and protocols we need to develop most urgently. This project addresses these challenges.
This project aims to develop a technical framework for human control mechanisms in agentic AI systems by creating a structured, repeatable methodology for risk and threat modeling. Our approach will produce both a methodology and a technical mechanism to apply it in practice, which takes an agentic AI system use case description as input and outputs the following (an illustrative sketch of this interface appears below the list):
Threat models based upon the agent’s level of autonomy, action surfaces, tool access, affordances, capabilities, propensities, and deployment scale.
Existing safeguards that could be applied to address these threat models (e.g., OWASP prevention and mitigation strategies), with estimates of their likely effectiveness, weaknesses, and potential for subversion.
Critical safeguard gaps, identifying areas where new control mechanisms are required (serving as candidates for blue teams to develop control protocols via control evaluations, as introduced by Redwood Research).
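To make the intended inputs and outputs concrete, here is a minimal sketch in Python. All class and function names below (UseCaseDescription, ThreatModel, SafeguardAssessment, AssessmentReport, assess) are hypothetical placeholders used for illustration only; they are not our finished design.

```python
# Illustrative sketch only: the data structures and assess() signature are
# hypothetical, intended to show the shape of the planned input/output.
from dataclasses import dataclass, field


@dataclass
class UseCaseDescription:
    """Input: a description of the agentic AI system under assessment."""
    name: str
    autonomy_level: str          # e.g. "human-in-the-loop", "fully autonomous"
    action_surfaces: list[str]   # e.g. ["email", "code execution", "payments"]
    tools: list[str]             # tool access and affordances
    deployment_scale: str        # e.g. "internal pilot", "public, high volume"


@dataclass
class ThreatModel:
    """Output 1: a threat derived from the agent's capabilities and affordances."""
    description: str
    pathway_to_harm: str


@dataclass
class SafeguardAssessment:
    """Output 2: an existing safeguard mapped to a threat, with an effectiveness estimate."""
    safeguard: str               # e.g. an OWASP prevention/mitigation strategy
    addresses: ThreatModel
    estimated_effectiveness: str # e.g. "high", "partial", "easily subverted"


@dataclass
class AssessmentReport:
    """Combined output, including gaps where no adequate safeguard exists."""
    threats: list[ThreatModel] = field(default_factory=list)
    safeguards: list[SafeguardAssessment] = field(default_factory=list)
    safeguard_gaps: list[ThreatModel] = field(default_factory=list)


def assess(use_case: UseCaseDescription) -> AssessmentReport:
    """Placeholder for the planned mechanism: use case in, structured report out."""
    raise NotImplementedError
```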
Our aim is for this project to accelerate the AI control research agenda by making it faster for AI Safety researchers to understand the most critical threats to agentic AI use cases, where gaps exist in existing safeguards, and which existing safeguards would likely be most vulnerable to subversion.
We will publish our findings, release open-source code, and engage with the broader AI safety community to drive adoption.
This funding will directly support:
Technical R&D: Designing, developing, and testing the risk and threat modeling methodology and the mechanism to run it against agentic AI use cases.
Empirical validation: Running case studies on real-world agentic AI use cases to test the methodology’s effectiveness.
Publication and dissemination: Producing a research paper, a LessWrong blog post, and making the codebase openly available to AI safety researchers.
This funding will also lay the foundation for future work, enabling us and other AI Safety researchers to conduct targeted control evaluations and develop new control mechanisms that are resistant to agent subversion.
Francesca has over 10 years of experience in operational and technology risk management, including serving as Head of Operational Risk at Tandem, a digital bank. She holds a BSc in Artificial Intelligence and an MSc in Human-Centered Computer Systems (University of Sussex, UK). Francesca co-authored research on Safety Cases for Frontier AI with Arcadia Impact, published in December 2024. She is interested in the threat model of losing control to AI: she has published a project on ‘A Monte Carlo Simulation for estimating the risk of loss of control to AI’, and last summer worked on a game, built using the ArcWeave platform, that lets people explore choices and power structures in a world shaped by advanced Artificial Intelligence technologies.
Seb has 15+ years of software engineering experience, including developing risk monitoring systems for financial institutions such as JP Morgan and HSBC. Additionally, he has 4 years of product management experience at Silicon Valley startups, combining technical expertise with strategic execution. He has led teams of 20+ software engineers on complex systems development.
Seb and Francesca previously co-founded a fintech company, which participated in the Techstars and Barclays Accelerator programs.
We believe the most likely causes of this project failing are:
It proves infeasible to develop a methodology that is effective across all agentic AI use cases, i.e., one that fully accounts for specific deployment scenarios such as multi-agent systems or highly open-ended environments where an agent operates with significant autonomy across a broad range of domains.
Lack of adoption, if AI safety researchers and the broader community do not engage with or apply our methodology.
If the project fails we expect the outcomes to be:
Gaps in existing control mechanisms for agentic AI systems are more likely to go unnoticed, increasing systemic AI risks, particularly in high-autonomy and multi-agent deployment scenarios.
Slower iterative testing and development of AI control protocols, plus a reduced ability to prioritise the most critical areas where safeguards are lacking.
5,000 from a private donor, which has allowed us to start initial work on the design of the methodology.
Achieving our minimum funding allows us to develop the threat modelling methodology but limits our ability to develop a technical means of running it. Full funding allows us to deliver the full set of deliverables outlined above.
Romain Deléglise
5 days ago
I hope you'll have enough money to launch your project, it looks promising!