Iván and Jett have been accepted to MATS under Arthur's mentorship. We are very excited about our project on unfaithful chain-of-thought and would love to start right away. Since all three of us are available now, we are seeking one month of funding to begin the project before the MATS research phase.
Chain-of-thought (CoT) is a prompting approach that improves the reasoning abilities of LLMs by directing them to verbalize their reasoning step by step. However, recent work has shown that the CoT produced by models is sometimes unfaithful: it does not accurately represent the internal reasons for a model's prediction (Turpin et al., 2023, Lanham et al., 2023). Other recent work has shown how to decrease this unfaithfulness (Radhakrishnan et al., 2024, Chua et al., 2024). But how common is this behavior? And why does it happen (nostalgebraist, 2024)? We think mechanistic interpretability is a promising way to answer these questions, given our current basic confusion about unfaithful CoT.
During the MATS training program, Iván and Jett performed patching and probing experiments to understand and detect unfaithful CoT on Llama 3.1 8B, using diverse Yes/No questions and biased contexts: few-shot prompts (FSPs) in which every answer is Yes or every answer is No. Our key findings were that:
Existing CoT unfaithfulness datasets (Turpin et al., 2023) are very repetitive and produce patterns that models pick up on, so the resulting unfaithful CoT is often just a slight variation of the questions in the FSP.
The information from the biased context flows through different tokens, and different layers matter more or less, depending on the question and the specific CoT. We believe this is why simple logistic regression and mean-difference probes trained at fixed token positions did not detect unfaithful CoT in our setup.
Attention probes can be used to detect unfaithful CoTs! These probes use an attention mechanism to selectively probe at the relevant tokens (see the sketch after this list). In our experiments, we achieved 81% test accuracy with a probe at layer 15.
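To make the probe architecture concrete, here is a minimal sketch of the kind of attention probe we mean (illustrative PyTorch, not our exact training code; the class and variable names are placeholders). A learned query scores each token's activation at a given layer, and a linear classifier reads from the attention-weighted average, so the probe can pick out whichever tokens are relevant for a given CoT.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Minimal sketch of an attention probe for unfaithful-CoT detection."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, 1, bias=False)   # per-token relevance score
        self.classifier = nn.Linear(d_model, 1)          # faithful vs. unfaithful logit

    def forward(self, acts: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # acts: [batch, seq, d_model] residual-stream activations at one layer
        # mask: [batch, seq], 1 for real tokens, 0 for padding
        scores = self.query(acts).squeeze(-1)                 # [batch, seq]
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)               # attention over tokens
        pooled = torch.einsum("bs,bsd->bd", weights, acts)    # attention-weighted average
        return self.classifier(pooled).squeeze(-1)            # unfaithfulness logit
```

A probe like this can be trained with a standard binary cross-entropy loss on labeled faithful/unfaithful CoTs, in contrast to logistic regression probes that are tied to a fixed token position.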
During the following month, we plan to:
[Week 1] Perform experiments to evaluate what the attention probes are picking up on. This will include analyzing cases of correct and incorrect predictions and comparing performance on CoTs with and without few-shot prompts.
[Week 2] Further analysis of the factors relevant for probe predictions, including training probes on partial sequences. Steering experiments to measure the causal effect of the learned directions (sketched after this list).
[Week 3] Write up the findings up to this point (e.g. as a LessWrong post).
[Week 4] Drill down into findings about models being on "auto-pilot" and unfaithfully producing answers. We hypothesize that models operate in two different modes: "auto-pilot" and "reasoning". On "auto-pilot", the CoT does not matter much for the model's response; when "reasoning", the CoT is causally relevant for the final answer. Although this discrepancy in behavior has been observed before (Ye et al., 2022, Madaan and Yazdanbakhsh, 2022, Wang et al., 2023, Shi et al., 2023, Gao, 2023, Lanham et al., 2023), there is not yet a mechanistic explanation of how or why it happens. Our patching experiments already show mechanistic evidence that different tokens matter for different questions, but we would like to distill this finding even further.
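As an illustration of the Week 2 steering experiments, the rough shape of the intervention is shown below: add a learned probe direction to the residual stream at the probed layer via a forward hook and check whether the model's answer changes. This is a sketch only; the file name, steering scale, and prompt are hypothetical placeholders, not settings from our experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
LAYER, ALPHA = 15, 8.0  # probed layer and an illustrative steering scale

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Unit-normalised direction from a trained probe at the chosen layer (assumed saved earlier).
direction = torch.load("probe_direction_L15.pt")
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # Decoder layers return the residual-stream hidden states (possibly inside a tuple);
    # add the probe direction at every token position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "..."  # a biased-context Yes/No question with its few-shot prompt
    ids = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=200)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Comparing generations with and without the hook (and with the direction subtracted rather than added) is what lets us ask whether the probe direction is causally relevant, not just correlated with unfaithfulness.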
We also plan to perform the following future work if there is extra time during this month, or otherwise during the MATS research phase:
Expand the experiments to models beyond Llama 3.1 8B, such as Gemma 2 9B and 27B. This would let us check whether any SAE latents are interpretable and close to the learned probe directions (see the sketch after this list). This is not currently possible with Llama 3.1 8B, since no good SAEs for it are publicly available. We are leaving this as future work because our understanding of unfaithful CoT on Llama 3.1 8B may still change.
Gather further evidence that existing CoT unfaithfulness datasets (Turpin et al., 2023) often make the model produce biased answers purely by "induction", and are therefore not that interesting for analyzing deception.
Run the learned probes on random chat datasets to see whether they surface ~deceptive behavior more generally.
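If we do get to the SAE comparison, the core analysis is simple: compare the learned probe direction against each SAE latent's decoder direction by cosine similarity and inspect the closest latents. A minimal sketch, assuming the probe weights and an SAE decoder matrix for the same layer have already been saved to disk (hypothetical file names):

```python
import torch
import torch.nn.functional as F

# probe_direction: [d_model] weight vector of a trained probe.
# W_dec: [n_latents, d_model] decoder matrix of a residual-stream SAE at the same layer
# (e.g. a Gemma Scope SAE); both file names below are placeholders.
probe_direction = torch.load("probe_direction_L15.pt")
W_dec = torch.load("sae_decoder_L15.pt")

# Cosine similarity between the probe direction and every latent's decoder direction.
sims = F.cosine_similarity(probe_direction.unsqueeze(0), W_dec, dim=-1)  # [n_latents]

# Inspect the closest latents (e.g. on Neuronpedia) to check whether they are interpretable.
top_vals, top_idx = sims.abs().topk(10)
for latent, val in zip(top_idx.tolist(), top_vals.tolist()):
    print(f"latent {latent}: |cos sim| = {val:.3f}")
```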
We will use this funding for stipends ($5K each for Iván and Jett) and compute ($500 each), for a total of $11K. This is less than the $12K that MATS pays for the 10-week research phase, which also covers housing, food, and compute.
This research will be performed by Iván Arcuschin Moreno and Jett Janiak, with mentorship from Arthur Conmy. Iván recently finished MATS under Adrià Garriga-Alonso's mentorship, leading to a NeurIPS publication. Jett recently finished LASR under Stefan Heimersheim's mentorship, leading to a paper presented at the SciForDL workshop at NeurIPS. Arthur is a Research Engineer on Google DeepMind's Interpretability team, where he has worked and published on many different interpretability projects, including mentoring researchers on several occasions (Gould et al., 2023, Syed et al., 2023, Kissane et al., 2024 [co-supervisor], Kharlapenko et al., 2024 [co-supervisor], Farrell et al., 2024, Chalnev et al., 2024).
This project can fail if we pursue a research direction that is incorrect, either because our current results obtained during the training program are flawed or because we misinterpret them. Iván and Jett have been trying to mitigate this by red-teaming each other's work, and continuously sharing results with Arthur for feedback. This project can also fail if Iván, Jett, and Arthur do not work well together. We feel this has already been mitigated since Iván and Jett have been successfully working together for three weeks (even before the training program's official start), complementing each other's strengths, and they have been mentored by Arthur during the training program without issues.
Both Iván and Jett received a $4.8K stipend from MATS for Neel Nanda's one-month training program. MATS will fund Iván and Jett in the winter for the research phase. Iván has received a grant from the LTFF for the extension of a previous iteration of the MATS program, from April to September 2024. Jett has received stipends for leading an AI Safety Camp project and for participating in LASR. Arthur has not received funds in the last 12 months.
Both Iván and Jett took part in Neel Nanda's training program for MATS. They were TA'd by Arthur but also got occasional feedback from Neel, so there is a potential conflict of interest if Neel ends up funding this proposal. Nevertheless, this proposal was written without Neel's input.