Transformer Circuits Thread
Here’s a timeline of the Circuits Updates and research papers released by Anthropic’s interpretability team.
Circuits Updates — April 2024
A collection of small updates from the Anthropic Interpretability Team. Read more >
Circuits Updates — March 2024
Reflections on Qualitative Research – Some opinionated thoughts on why qualitative methods may be more central to interpretability research than we’re used to in other fields. Read more >
Circuits Updates — February 2024
Circuits Updates – January 2024
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer. Read more >
Circuits Updates — July 2023
Circuits Updates — May 2023
Interpretability Dreams – Our present research aims to create a foundation for mechanistic interpretability research. In doing so, it’s important to keep sight of what we’re trying to lay the foundations for. Read more >
Distributed Representations: Composition & Superposition – An informal note on how “distributed representations” might be understood as two different, competing strategies — “composition” and “superposition” — with quite different properties. Read more >
Privileged Bases in the Transformer Residual Stream
Our mathematical theories of the Transformer architecture suggest that individual coordinates in the residual stream should have no special significance, but recent work has shown that this is not the case in practice. We investigate this phenomenon and provisionally conclude that the per-dimension normalizers in the Adam optimizer are to blame for the effect. Read more >
Superposition, Memorization, and Double Descent
We have little mechanistic understanding of how deep learning models overfit to their training data, despite it being a central problem. Here we extend our previous work on toy models to shed light on how models generalize beyond their training data. Read more >
Toy Models of Superposition
Neural networks often seem to pack many unrelated concepts into a single neuron – a puzzling phenomenon known as ‘polysemanticity’. In our latest interpretability work, we build toy models where the origins and dynamics of polysemanticity can be fully understood. Read more >
Softmax Linear Units
An alternative activation function increases the fraction of neurons which appear to correspond to human-understandable concepts. Read more >
Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases – An informal note on intuitions related to mechanistic interpretability. Read more >
In-context Learning and Induction Heads
An exploration of the hypothesis that induction heads are the primary mechanism behind in-context learning. Read more >
A Mathematical Framework for Transformer Circuits
Early mathematical framework for reverse engineering models, demonstrated by reverse engineering small toy models. Read more >
Original Distill Circuits Thread
What can we learn if we invest heavily in reverse engineering a single neural network? Read more >
What Is the Transformer Circuits Thread Project?
The Transformer Circuits Thread is an ambitious research effort by Anthropic to reverse engineer transformer language models into human-understandable computer programs. Inspired by the Distill Circuits Thread, Anthropic aims to create interactive articles and resources that make the inner workings of transformers more interpretable and accessible.
Transformers are state-of-the-art deep learning models used for natural language processing tasks. However, their complex architectures and billions of parameters make them notoriously difficult to understand and interpret. The Transformer Circuits Thread project seeks to open up this “black box” by systematically studying and explaining the computational patterns and motifs that emerge in trained transformer models.
Anthropic believes that gaining a mechanistic understanding of how transformers work is crucial for building safe and reliable AI systems. By reverse engineering transformers, they hope to:
- Explain current safety problems in language models
- Identify new potential issues
- Anticipate safety challenges in future, more powerful models
Making Transformers Tractable
To make this ambitious goal tractable, the Transformer Circuits Thread starts by analyzing the simplest possible transformer models and gradually works up to larger, more realistic architectures. Its initial focus is on transformers with only one or two layers and only attention blocks, in contrast to modern transformers like GPT-3, which has 96 layers alternating between attention and MLP blocks.
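For readers who want a concrete picture of that difference, here is a rough, generic sketch in PyTorch of an attention-only block of the kind the early Circuits work studies, next to a standard block that also contains an MLP. This is an illustrative textbook-style implementation with arbitrary dimensions, not Anthropic’s code, and it omits layer normalization and causal masking for brevity.

```python
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    """One transformer layer with attention only – the simplified setting analyzed
    in the early Circuits work (illustrative sketch, not Anthropic's code)."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # The block only adds the attention output back into the residual stream.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        return x + attn_out

class StandardBlock(nn.Module):
    """A more realistic block that alternates attention with an MLP, as in GPT-style models."""
    def __init__(self, d_model=128, n_heads=4, d_mlp=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + attn_out
        return x + self.mlp(x)
```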
Early Progress
Despite starting small, Anthropic has already made significant progress in understanding these toy models by developing a new mathematical framework. Key findings include:
- Identifying “induction heads” that implement in-context learning in 2-layer models
- Showing how induction heads operate on specific examples
- Providing an elegant mathematical treatment of attention-only models
While these insights do not yet fully extend to practical transformers, Anthropic plans to show in future work that its framework and concepts like induction heads remain relevant in larger models. Although complete interpretability is still a distant goal, the Transformer Circuits Thread is an important step toward a mechanistic understanding of transformers and, ultimately, safer AI systems.
What Are Circuits Updates?
Circuits Updates are periodic blog posts where the Anthropic interpretability team shares developing research ideas, small-scale experiments, and minor findings that may not warrant a full paper. These informal updates aim to give visibility into Anthropic’s research process and plans to the wider AI community.
The Circuits Updates cover a diverse range of topics related to transformer interpretability, including:
- Experiments with model architectures and training techniques
- Analyses of learned features and attention patterns
- Hypotheses and conceptual frameworks for understanding circuits
- Replications and extensions of previous results
- Negative results and open problems
The common thread is that these are preliminary ideas the Anthropic team is actively thinking about but is not yet ready to write up as full papers. The updates are meant to read like informal lab-meeting discussions rather than polished publications.
Topics covered in Circuits Updates
Below are some examples of topics covered in past Circuits Updates:
- Studying how attention heads learn to “superpose” multiple features
- Improving sparse autoencoders by modifying activation functions
- Analyzing the geometry of learned representations
- Identifying heads that copy, move, or compare tokens
- Scaling laws for interpretability
Anthropic hopes to facilitate discussion and collaboration with other interpretability researchers through sharing these works-in-progress. The Circuits Updates provide a window into the current frontiers of transformer interpretability research at Anthropic.
How Often Does Anthropic Publish Circuits Updates?
Anthropic publishes Circuits Updates on a roughly monthly cadence, with some variability. New updates are released as blog posts on the Transformer Circuits Thread website.
Since the Transformer Circuits Thread project began in late 2021, Anthropic has published papers and Circuits Updates in months including:
- December 2021 (initial framework paper)
- February 2022
- April 2022
- May 2022
- January 2023
- February 2023
- April 2023
- May 2023
- July 2023
- January 2024
- February 2024
- March 2024
- April 2024
The length of each update varies depending on how many new results the team has to share that month. Some updates focus on a single in-depth topic, while others briefly discuss several unrelated ideas.
What is the Anthropic Interpretability Team?
The Anthropic interpretability team is a group of researchers and engineers dedicated to making AI systems more interpretable and understandable. As of April 2024, the team has grown to 17 people, representing a significant fraction of the estimated 50 full-time mechanistic interpretability researchers worldwide.
The interpretability team is part of Anthropic, an AI safety and research company based in San Francisco. Anthropic’s mission is to ensure that transformative AI systems are reliable, interpretable, and beneficial to society. The interpretability team plays a key role in this mission by conducting research to reverse engineer and understand how AI models like transformers work under the hood.
Notable members of the Anthropic interpretability team include:
- Chris Olah: A former Google Brain researcher known for his work on neural network interpretability, including the original Distill Circuits Thread that inspired Anthropic’s Transformer Circuits project.
- Nelson Elhage: A software engineer and researcher who has worked on the interpretability team since its early days.
- Catherine Olsson: An AI safety researcher who collaborates closely with the interpretability team.
The team takes an interdisciplinary approach, combining expertise from machine learning, neuroscience, physics, and software engineering. They aim to treat interpretability as a rigorous science, developing new experimental methods and mathematical frameworks to study AI systems.
Some key research directions for the interpretability team include:
- Analyzing attention heads and computational circuits in transformers
- Improving interpretability with sparse autoencoders and monosemantic feature learning
- Using toy models to study emergent behaviors like superposition and induction heads
- Scaling interpretability techniques to larger models
The team frequently shares updates on their latest experiments, hypotheses, and results through the Circuits Updates series on the Transformer Circuits Thread website. These informal posts provide a window into the current frontiers of mechanistic interpretability research at Anthropic.
What Research Areas Do Anthropic’s Circuits Updates Focus On?
Anthropic’s Circuits Updates span a wide range of transformer interpretability topics, but a few key research themes have emerged.
Analyzing Attention Heads and Circuits
Many Circuits Updates dive deep into the behavior of individual attention heads and the computational “circuits” they form with other heads, MLPs, and skip connections. Anthropic has identified several important head types, including:
- Induction heads that perform dynamic in-context learning
- Heads that copy or move tokens from one position to another
- Heads that compare tokens and attend based on content similarity
To study these heads, Anthropic applies techniques like:
- Deriving mathematical expressions for head computations
- Tracing head activations on specific inputs
- Perturbing or ablating heads and measuring the impact on model outputs (see the sketch after this list)
- Clustering heads based on activation patterns
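Here is a minimal sketch of the zero-ablation idea referenced in the list above. It uses a small stand-in attention module so the example is self-contained; in practice one would hook into the heads of a real model rather than this toy layer, and might replace a head’s output with its mean rather than zero.

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Toy multi-head self-attention that exposes per-head outputs (illustrative only)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, ablate_head=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, d_head).
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        head_out = attn @ v                       # (batch, heads, tokens, d_head)
        if ablate_head is not None:
            head_out[:, ablate_head] = 0.0        # zero out one head's contribution
        return self.out(head_out.transpose(1, 2).reshape(B, T, D))

# Compare outputs with and without head 2 to see how much it mattered for this input.
layer = TinyAttention()
x = torch.randn(1, 10, 64)
with torch.no_grad():
    baseline = layer(x)
    ablated = layer(x, ablate_head=2)
print("effect of ablating head 2:", (baseline - ablated).norm().item())
```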
The main goal is to break down the complex computations performed by transformers into smaller, human-interpretable components or algorithms. Induction heads were an early success, showing how a 2-layer transformer implements in-context learning via specific attention patterns.
More recent updates have started to explore how attention heads interact and compose to form larger circuits. Anthropic is developing methods to trace the flow of information between heads and identify when one head’s output is influencing another head’s computations.
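A common diagnostic for induction heads in the mechanistic interpretability literature is a prefix-matching score: feed the model a random token sequence repeated twice and measure how much attention a head pays, from each position in the second half, to the token that followed the same token in the first half. The sketch below assumes you can extract a head’s attention pattern as a matrix; the tensor name and setup are hypothetical.

```python
import torch

def induction_score(attn_pattern: torch.Tensor, seq_len: int) -> float:
    """Average attention from position t in the repeated half back to position
    t - seq_len + 1, i.e. the token that followed the previous occurrence of the
    token at t. attn_pattern is a (2*seq_len, 2*seq_len) attention matrix for one
    head, on an input made of a random length-seq_len sequence repeated twice."""
    scores = [attn_pattern[t, t - seq_len + 1].item() for t in range(seq_len, 2 * seq_len)]
    return sum(scores) / len(scores)

# A hypothetical "perfect" induction head attends exactly to the token after the
# previous occurrence of the current token, so its score is 1.0.
seq_len = 8
perfect = torch.zeros(2 * seq_len, 2 * seq_len)
for t in range(seq_len, 2 * seq_len):
    perfect[t, t - seq_len + 1] = 1.0
print(induction_score(perfect, seq_len))  # 1.0
```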
Improving Interpretability with Sparse Autoencoders
Another line of research seeks to make transformers more interpretable by changing their architecture or training process. A key idea is using sparse autoencoders to learn more human-interpretable features or “dictionaries”.
Standard transformers often learn entangled, polysemantic features that are difficult to understand because they encode many unrelated concepts. A sparse autoencoder is trained on a model’s internal activations and re-expresses them as a much larger set of sparsely active, more disentangled features; related work modifies the model itself to encourage sparsity. Techniques in this line of research include:
- Imposing an L1 penalty on feature activations to encourage most of them to be zero
- Pruning weights to create a sparser connectivity pattern
- Using activation functions like SoLU that promote sparsity
Anthropic has found that sparse autoencoders can recover more interpretable, “monosemantic” features, where each learned feature activates for a single, human-understandable concept. Some updates have analyzed the training dynamics and geometry of these sparse representations.
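A minimal sketch of this dictionary-learning setup, in the spirit of Towards Monosemanticity but not Anthropic’s actual implementation: an overcomplete autoencoder is trained on a batch of model activations (random data stands in for them here), with an L1 penalty that keeps most feature activations at zero.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with non-negative, sparsely active features
    (a sketch of the dictionary-learning idea, not Anthropic's code)."""
    def __init__(self, d_activation=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_activation, n_features)
        self.decoder = nn.Linear(n_features, d_activation)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the original activations
        return recon, features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Hypothetical usage: random data stands in for activations collected from a model.
sae = SparseAutoencoder()
acts = torch.randn(64, 512)
recon, features = sae(acts)
sae_loss(acts, recon, features).backward()
```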
Anthropic is also exploring other architectural modifications like SoLU and variations on the standard transformer block. The goal is to find model designs that achieve high performance while also being easier to interpret.
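For reference, the SoLU activation described in the Softmax Linear Units paper multiplies each activation vector elementwise by its own softmax, which amplifies the largest pre-activations and suppresses the rest; in the paper it is followed by a LayerNorm. A minimal sketch (the MLP wrapper and dimensions are illustrative):

```python
import torch
import torch.nn as nn

def solu(x: torch.Tensor) -> torch.Tensor:
    """SoLU: elementwise product of the activations with their softmax."""
    return x * torch.softmax(x, dim=-1)

class SoluMLP(nn.Module):
    """Sketch of an MLP layer using SoLU followed by LayerNorm, as in the SoLU paper."""
    def __init__(self, d_model=128, d_mlp=512):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_mlp)
        self.ln = nn.LayerNorm(d_mlp)
        self.w_out = nn.Linear(d_mlp, d_model)

    def forward(self, x):
        return self.w_out(self.ln(solu(self.w_in(x))))
```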
Studying Emergent Behaviors in Toy Models
A third research direction uses simple “toy models” to study emergent behaviors that may be relevant for understanding larger language models. By stripping down transformers to their core components, Anthropic can isolate specific phenomena in a more controlled setting.
For example, the Toy Models of Superposition paper trained tiny ReLU networks to reconstruct a set of sparse “ground truth” features after squeezing them through a hidden layer with fewer dimensions than features. This setup allowed Anthropic to precisely study how and when models learn to combine or “superpose” multiple features into a single dimension.
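A minimal sketch of this kind of toy setup, following the general recipe of the paper but with arbitrary sizes and without the per-feature importance weighting the paper uses: sparse synthetic features are squeezed through a bottleneck smaller than the number of features, and the model is trained to reconstruct them.

```python
import torch
import torch.nn as nn

n_features, d_hidden, sparsity = 20, 5, 0.05  # more features than hidden dimensions

class ToyModel(nn.Module):
    """Toy superposition model: project features down to a small hidden space,
    then reconstruct them with a ReLU (sketch, not Anthropic's exact code)."""
    def __init__(self):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W.T                          # (batch, d_hidden) bottleneck
        return torch.relu(h @ self.W + self.b)    # reconstructed features

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5000):
    # Sparse synthetic "ground truth" features: each is active with low probability.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) < sparsity)
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# After training, columns of W reveal which features share hidden directions (superposition).
```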
Surprisingly, even these minimalist models exhibited rich behaviors like:
- Learning compressed representations that store more ground-truth features than they have dimensions
- Undergoing sharp “phase transitions” in their learning dynamics
- Developing geometrically meaningful feature subspaces
Anthropic hypothesizes that similar superposition effects occur in large language models and contribute to their polysemanticity. Toy models provide a tractable way to explore these ideas.
Other updates have used toy models to study topics like grokking (transformers suddenly generalizing after memorizing the training data) and modularity (models decomposing tasks into reusable subtasks). The hope is that insights from these simplified settings can guide interpretability work on practical transformers.
Scaling Laws and Induction Heads
Finally, several Circuits Updates have investigated how transformer interpretability scales with model size. Anthropic has trained models of varying sizes on synthetic datasets to measure how learning dynamics and circuit formation change.
A key finding is the existence of an “induction bump”: a sharp increase in in-context learning performance that appears abruptly during training. The bump coincides with the emergence of induction heads, suggesting that models need a certain minimum capacity (in particular, more than one attention layer) before they can form in-context learning circuits.
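The in-context learning performance referred to here is often summarized with a simple score: how much lower the model’s per-token loss is late in the context than early in the context. A minimal sketch, where the specific early and late positions are illustrative assumptions rather than Anthropic’s exact choice:

```python
import torch

def in_context_learning_score(token_losses: torch.Tensor, early: int = 50, late: int = 500) -> float:
    """Average loss at a late context position minus the average loss at an early one,
    over a batch of sequences; more negative means the model benefits more from context.
    token_losses has shape (n_sequences, n_tokens); the index choices are illustrative."""
    return (token_losses[:, late] - token_losses[:, early]).mean().item()
```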
Anthropic has also studied how the number and specificity of learned features scale with model size, using centered kernel alignment (CKA) and other similarity measures. They find that larger models learn more features overall, but that the features become less human-interpretable on average.
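Linear CKA, mentioned above, is a standard way to compare two sets of representations. Here is a minimal NumPy sketch of the textbook formulation; it is not necessarily the exact variant Anthropic used.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment between two activation matrices of shape
    (n_examples, n_features). Values near 1 indicate very similar representations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(numerator / denominator)

# Example: a representation compared with a rotated copy of itself gives CKA close to 1.
X = np.random.randn(100, 32)
Q, _ = np.linalg.qr(np.random.randn(32, 32))
print(linear_cka(X, X @ Q))
```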
These scaling-law studies aim to predict how interpretability will change as transformers continue to grow in size. Anthropic hopes to use these insights to design interpretability tools that can keep pace with state-of-the-art models.