Project summary
Recent work has developed methods to automatically discover "circuits" implementing basic functionality in neural networks. These circuits are described in terms of computation involving neurons. However, concurrent work has argued that neurons are often in "superposition", representing multiple features at the same time, and that sparse autoencoders can extract these features from neurons. This raises the question: would circuits be more discoverable and easier to understand if represented at the feature level rather than the neuron level? This project intends to explore automatic circuit discovery over features.
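To illustrate the kind of feature extraction involved, here is a minimal sketch of a sparse autoencoder trained on neuron activations. This is only an illustration of the general technique, not the project's implementation; the class name, dimensions, and penalty coefficient are assumptions.

```python
# Minimal sparse-autoencoder sketch (illustrative; hyperparameters are assumptions).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: more features than neurons, so features
        # stored in superposition across neurons can be disentangled.
        self.encoder = nn.Linear(n_neurons, n_features)
        self.decoder = nn.Linear(n_features, n_neurons)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed neuron activations
        return recon, features

sae = SparseAutoencoder(n_neurons=512, n_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity penalty strength (assumed value)

def training_step(acts: torch.Tensor) -> float:
    recon, features = sae(acts)
    # Reconstruction loss keeps the features faithful to the activations;
    # the L1 penalty pushes most features to zero on any given input.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The key design choice is the combination of an overcomplete feature dictionary with a sparsity penalty: together they encourage each input to be explained by a small number of interpretable features, which are then the candidate nodes for circuit discovery.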
What are this project's goals and how will they be achieved?
At a high level, the project will seek to determine whether circuits over features are easier to understand and more readily discoverable by automatic means than circuits over neurons (the baseline method). Concretely, preliminary results have found circuits for subject-verb agreement. Can intends to extend this to a simple ELK-style problem: identifying a concept represented by the model from a labeled dataset of positive and negative examples of the concept, while disambiguating it from a highly correlated concept the model also represents, based on an understanding of how the respective concepts are computed.
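As a rough illustration of the disambiguation setup (not the project's actual method), one could first rank features by how well they separate the labeled examples, then compare the rankings induced by the target concept and the correlated concept. The function name and scoring rule below are hypothetical.

```python
# Illustrative sketch of the concept-identification setup (names are assumptions).
import torch

def rank_features_by_concept(features_pos: torch.Tensor,
                             features_neg: torch.Tensor) -> torch.Tensor:
    """Rank features by how strongly they separate positive from negative
    examples of a concept.

    features_pos / features_neg: [n_examples, n_features] feature activations
    on positive and negative examples. Returns feature indices sorted by
    mean activation difference.
    """
    diff = features_pos.mean(dim=0) - features_neg.mean(dim=0)
    return torch.argsort(diff, descending=True)

# A feature that ranks highly for the target concept but not for the
# correlated one is a candidate for telling the two apart; understanding
# the circuit that computes that feature is where the proposed method
# goes beyond simple probing.
```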
How will this funding be used?
A research stipend for Can Rager. Can may pay for other expenses, such as office space and compute, out of this stipend.
Who is on the team and what's their track record on similar projects?
Can Rager (Google Scholar) has experience with automatic circuit discovery from his work on the preprint Attribution Patching Outperforms Automated Circuit Discovery.
He will collaborate with Sam Marks, a postdoc in the Bau Lab who specializes in interpretability, and with Prof. David Bau.
What are the most likely causes and outcomes if this project fails? (premortem)
The research idea is high-risk, high-reward: circuits over features may prove difficult to find and understand.
Additionally, the entire field of mechanistic interpretability is making a high-risk bet that neural networks can be reverse-engineered at scale. Although automatic methods such as the one this project investigates help with scalability, many challenges remain. Even if the project succeeds by its own lights, it may turn out to be a research dead end if future work cannot develop the method further.
There is also an execution risk, as Can is relatively new to research. This is mitigated by collaboration with Sam and David; however, David may be time-constrained (as a professor with many PhD students), and Sam's PhD was in math, so he may have limited hands-on engineering experience.