My paper has been accepted at ICLR 2025, and I am seeking travel funding to present it there.
We began working on this paper during the MATS 4.1 extension program and continued afterwards, funded by our own savings. I therefore have no institution or grant that could cover conference attendance.
The paper's topic is technical AI safety and mechanistic interpretability for LLMs. More specifically, sparse autoencoders (SAEs) are a recent method for extracting human-interpretable features from LLM representations. They could be useful for discovering deception or other undesired behavior, and for improving model robustness, controllability, and debugging. However, realistic ground-truth evaluations of SAEs have been lacking.
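For readers unfamiliar with the method: an SAE maps a model's internal activations into an overcomplete, sparse feature basis and reconstructs them. The sketch below is a minimal illustration, not the paper's implementation; all weights, dimensions, and the loss coefficient are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 16, 64  # feature basis is overcomplete (64 > 16)

# Randomly initialized encoder/decoder weights (a real SAE trains these)
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU -> sparse, nonnegative features
    x_hat = f @ W_dec + b_dec               # reconstruction of the activations
    return x_hat, f

x = rng.normal(size=(8, d_model))           # stand-in for a batch of LLM activations
x_hat, f = sae_forward(x)

# Typical training objective: reconstruction MSE plus an L1 sparsity penalty
loss = np.mean((x - x_hat) ** 2) + 1e-3 * np.abs(f).sum(axis=-1).mean()
```

The L1 penalty pushes most feature activations to zero, which is what makes individual features candidates for human interpretation.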
In the paper, we propose and test a new method for evaluating SAEs, with the following findings that may be valuable for further research in technical AI safety:
- SAE features with clear interpretations are often poor for control, so researchers should be more cautious when interpreting their results.
- SAEs trained on real data perform much worse than SAEs trained on toy or task-specific data, though some improvements in SAE architecture alleviate these issues.
- We characterize other SAE problems, such as feature splitting, feature magnitudes, and occlusion.
Taken together, these results improve our ability to realistically evaluate SAE-based interpretability methods, preventing potential misinterpretations of model features and thereby contributing to safer and more robust AI systems.
Preparing the accompanying poster and presenting it at ICLR 2025.
Presenting the paper to other AI safety researchers who use SAEs in their work, raising awareness of SAE pitfalls, and showing them how to use the evaluations our paper introduces.
As a secondary goal, I am currently looking for opportunities to continue working in technical AI safety, i.e., positions at research labs, non-profits, or academic labs. Attending the conference would enable me to network, explore collaborations, and engage directly with other researchers.
$900: ICLR 2025 attendance fee
$1,050: basic-economy flights, Frankfurt to Singapore
$900: hotel
I am applying for this grant alone. The paper's other first author is Alex Makelov, and the project was mentored by Neel Nanda.
I have previously presented at other conferences with strong engagement, for example at ICLR 2024 and SfN. The arXiv preprint of this paper has 28 citations on Google Scholar.
Without funding, I could not present the paper at ICLR 2025, reducing the visibility of our proposed evaluation methods and of our critical insights into SAE limitations. Consequently, AI safety and mechanistic interpretability researchers might continue to rely on inadequately evaluated SAE features, potentially overstating their reliability and interpretability.
I haven't raised money in the last 12 months.