Recent work (https://tinyurl.com/avgpt2xl) has shown that language models can be “steered” toward text completions resembling humans in differing mental states, simply by adding vectors to the model’s neural activations. Other recent work (e.g. https://tinyurl.com/latentlin) has shown that the latent representations of different models can be bridged by a simple linear mapping. Our hypothesis in this experiment is that (some aspects of) human brain states can likewise be bridged to the latent representations of language models by simple mappings. This could contribute to prosaic AI alignment in several ways: (1) generative models could be steered to exhibit the specific brain states of specific people, better representing their attitudes and opinions; (2) reward models could be trained to reproduce humanlike brain states during evaluation, making them more generalizable out-of-distribution; (3) scientific understanding of analogies between LLM behavior patterns and human behavior patterns could be improved.
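The activation-addition idea underlying the steering result can be sketched in a few lines. This is a toy numpy illustration, not the paper's actual code: the dimensions, the coefficient, and the contrast-prompt construction of the steering vector are all illustrative stand-ins.

```python
import numpy as np

def apply_steering(hidden, steer_vec, coeff=4.0):
    """Add a scaled steering vector to every token position's activations.

    `hidden` is a (tokens, features) array standing in for one layer's
    residual-stream activations; `coeff` is an illustrative scale.
    """
    return hidden + coeff * steer_vec

rng = np.random.default_rng(0)
d_model = 16
hidden = rng.normal(size=(5, d_model))  # stand-in for a layer's activations

# Steering vector: difference between activations of two contrasting
# prompts (here just random stand-ins for those activations)
act_prompt_a = rng.normal(size=d_model)
act_prompt_b = rng.normal(size=d_model)
steer = act_prompt_a - act_prompt_b

steered = apply_steering(hidden, steer)
```

In the real setting the modified activations are fed onward through the remaining layers, changing the completion without any fine-tuning.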
Some of the specific steps:
Design the fMRI data-collection protocol
Implement the data-collection protocol (in particular, the display and keyboard elements)
Recruit human subjects
Connect with a suitable fMRI center and get the experiment approved (IRB process)
Administer the human-subject data-collection
Design the ML experiments (fMRI feature extraction pipeline, particular architecture modifications, loss function, validation metrics)
Implement the ML experiments (the dataset may be large enough to require cloud resources)
Write the technical report/paper
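As a sketch of the core ML experiment in the steps above, the "simple mapping" could be fit by ridge regression from fMRI feature vectors to LLM hidden states, with R² as one validation metric. The data here is synthetic and the dimensions are assumptions for illustration, not the planned pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_fmri, d_llm = 200, 50, 32

# Synthetic stand-ins: paired fMRI feature vectors and LLM hidden states
X = rng.normal(size=(n, d_fmri))
W_true = rng.normal(size=(d_fmri, d_llm))
Y = X @ W_true + 0.1 * rng.normal(size=(n, d_llm))

# Closed-form ridge regression: W = (X'X + lam*I)^-1 X'Y
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(d_fmri), X.T @ Y)

# Goodness-of-fit (on the training split here, for illustration only;
# the real experiment would evaluate on held-out scans)
pred = X @ W
r2 = 1 - ((Y - pred) ** 2).sum() / ((Y - Y.mean(0)) ** 2).sum()
```

If even a linear map of this form recovers meaningful structure, that would support the bridging hypothesis; nonlinear mappings are a natural fallback.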
Impact:
Advancing the science of direct and meaningful connections between human minds and prosaic AI
This is one potential pathway toward more generalizable AI value alignment: ultimately modeling the process by which humans make value judgments more causally and mechanistically, rather than merely modeling its behavioral statistical features on a finite training distribution
Salary
$108,000 salaries: 6 months for 1 researcher ($10k/month) + 3 months for 1 ML engineer ($16k/month)
$900 fMRI ops contractor (30 h at $30/h)
$900 participant (volunteer) compensation (25 participants, 1 h, $30/h)
$50,000 tax on the salaries (assuming ~45% total overhead, regardless of specific tax optimizations)
Equipment
$4,800 compute costs (1 A100 GPU for 6 months)
$16,500 = 25 h of fMRI scanner time at $660/hour. We estimate needing 20-25 h at minimum; the more hours we can get, the better.
$50 rubber-based “Virtually Indestructible Keyboard” for MRI compatibility (only available used)
$2,000 MRI-compatible screen for use inside the machine, and/or travel to an fMRI facility where one is already installed
$3,000 research laptop for on-site use during recordings
One-off Misc
$15,600 office costs ($1,400/person/month at FAR Labs, 6 months, 2 people)
$1,776 proportional visa costs for 1 researcher over this period
20% buffer
Total: $244k
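As a quick arithmetic check, the line items above plus the 20% buffer sum consistently with the stated total (amounts copied from the budget lines):

```python
# Line items from the budget above, in USD
items = {
    "salaries": 108_000,
    "fMRI ops contractor": 900,
    "participant compensation": 900,
    "salary tax/overhead": 50_000,
    "compute (A100, 6 months)": 4_800,
    "fMRI scanner time": 16_500,
    "MRI-safe keyboard": 50,
    "MRI-compatible screen": 2_000,
    "research laptop": 3_000,
    "office costs": 15_600,
    "visa costs": 1_776,
}
subtotal = sum(items.values())  # 203,526
total = subtotal * 1.20         # with 20% buffer, approx. $244k
```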
David “davidad” Dalrymple:
Suggested this experiment before seeing the original activation-engineering results
Coauthor of Physical Principles for Scalable Neural Recording (with Ed Boyden, George Church, Konrad Kording, Adam Marblestone, et al.)
Advisor to this Nature Methods paper on 3D neuroimaging (in Acknowledgments): https://www.nature.com/articles/nmeth.2964
Advisor to Brain Preservation Foundation https://www.brainpreservation.org/team/david-dalrymple/
Studied systems neuroscience in the Biophysics PhD program at Harvard
Main claim to fame: youngest MIT graduate student (obtained master’s at age 16)
Author of An Open Agency Architecture for Safe Transformative AI (see also this subsequent exposition).
That is a completely different approach, relying on formal verification for safety rather than prosaic alignment; nonetheless, davidad believes there are some prosaic directions (such as this one) that deserve more attention and effort.
Lisa Thiergart:
Co-author on original activation engineering paper (soon will also be on arxiv) https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Co-author on adding vector to steer a maze-solving agent https://www.lesswrong.com/posts/gRp6FAWcQiCWkouN5/maze-solving-agents-add-a-top-right-vector-make-the-agent-go
SERI MATS scholar
Previous experiences: https://www.linkedin.com/in/lisathiergart/
Neurotech / alignment-relevant experience:
6 months on Team Shard mentored by Alex Turner, various mechanistic interpretability projects including maze and natural abstraction
4 months working as Research Scientist for BCI startup
3 months upskilling at Entrepreneur First focused on Alignment and Neurotech domain exploration
Ran workshop on neurotech for alignment affiliated with foresight
8 months at the CORE Robotics lab: specialist project on BCI control of robotics; experience with EEG recording, experiment execution with participants, and experimental design. IRB certified.
The most obvious objection is that AIs don't make value judgments the way humans do, making this a waste of time. It still seems well worth trying, though.
Probably some from Foresight, since they are applying and we are in discussions with them. They don't want to spend much time actively seeking grants, since it is very time-consuming.
Adrian Regenfuss
7 months ago
I've referenced this proposal a bunch of times in conversation, and find it pretty cool.
Adrian Regenfuss
7 months ago
@Adrian-Regenfuss I would be even more enthusiastic if there were plans to also train LLMs on human brain signals, since the activations look to me to be too inflexible to bridge potentially-extremely alien cognition to human cognition. But that's a much higher ask.
Sophia Pung
10 months ago
Hey Lisa and David!
I’m reaching out regarding a project that you might be interested in.
Previously I applied on Manifold for a grant to program a phone to track GPS on a solar-powered tuk-tuk with Solar4Africa, and over the past three months I've been working with a team to develop a novel electroencephalogram.
Our team is called Monolith BCI (on Twitter), and we are creating a novel PCB design and ML model to process electrical signals generated by large clumps of neurons in the brain via EEG. Henry is working on the ML model (Bramble) as well as many other parts of the project; Cheru and JC designed the PCB with zero prior PCB experience. I'm creating a research paper and a dataset (FLUX, the Framework for Learning and Understanding Cortex Activity) to help fine-tune our LLM.
We’ll be in SF for our 3rd sprint from Feb 24th to March 4th and would love to chat. We have a working v2.0 of our PCB (Trillium), which has 8 channels. By the end of our 3rd sprint, our goal is to have our own PCB working with our ML model to play Tetris and Pong without using EMG signals. We've already successfully used four modalities (jaw clench, blinking, focus, and non-focus), but all of these rely on EMG signals, and our product will only be viable once we can capture thoughts alone; we believe we're really close to getting there.
Let us know if you’d like to chat!
-Monolith MMI (mind-machine interfaces)
Marcus Abramovitch
11 months ago
I reached out to Lisa for a progress update on this:
-There is an important minimum level of funding needed to make this project viable, which hasn't been reached yet, so they haven't started. The money is sitting in Lisa's account.
-She will talk with davidad soon and decide whether to push for more funding, pivot to something lower-cost, or return the funds (expected by end of March)
I like the honesty here and hope we can get Lisa funded to do a large project (whether this one or another in an area she is interested in). I'm still very "bullish" on Lisa. I think she's someone who can and will make something happen that should happen, without needing much permission. I also think she has a rather unique and needed blend of technical understanding plus the ability to do all the little organizing tasks needed to get something off the ground.
Marcus Abramovitch
over 1 year ago
When I talked with Lisa, she was clearly able to articulate why the project is a good idea. Often people struggle to do this.
Lisa is smart and talented and wants to be expanding her impact by leading projects. This seems well worth supporting.
Davidad is fairly well-known to be very insightful and proposed the project before seeing the original results.
Reviewers from Nonlinear Network gave great feedback on funding Lisa for two projects she proposed. She was most excited about this one and, with rare exceptions, when a person has two potential projects, they should do the one they are most excited about.
I think we need to get more tools in our arsenal for attacking alignment. With some early promising results, it seems very good to build out activation vector steering.
I don’t feel I’m a good judge of whether or not this is worth doing. I think I judge talent well, but I don't have nearly enough alignment background or neurotech background to judge this. This is far more of a bet on the people than on the project. I also don't think many people would be qualified to judge the project.
It's expensive.
I somewhat worry that Lisa won't be full-time on the project and/or that this might distract her from her other work. She did say she had broad support from her current workplace to pursue this in tandem.
The project is in discussion with Foresight to see if it's possible to do a scaled-down version that isn't as expensive. My $15k should go towards getting the ball rolling with the expectation of a few more people to get this at least to the scaled-down stage but preferably the full proposed project.
None
Marcus Abramovitch
over 1 year ago
I interviewed Lisa for this grant.
Reasons I am excited about Lisa:
-She is quite articulate and has good people/social skills and is able to simply explain concepts.
-She is already doing some management and wants to expand in this direction. This seems worth supporting, since there appears to be a lack of management experience in the AI safety research space.
-She's quite smart and value-aligned.