Transformer Circuits Thread
Here’s a timeline of the Circuits Updates and research papers released by Anthropic’s interpretability team.
Circuits Updates — April 2024
A collection of small updates from the Anthropic Interpretability Team. Read more >
Circuits Updates — March 2024
Reflections on Qualitative Research – Some opinionated thoughts on why qualitative methods may be more central to interpretability research than we’re used to in other fields. Read more >
Circuits Updates — February 2024
Circuits Updates – January 2024
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer. Read more >
Circuits Updates — July 2023
Circuits Updates — May 2023
Interpretability Dreams – Our present research aims to create a foundation for mechanistic interpretability research. In doing so, it’s important to keep sight of what we’re trying to lay the foundations for. Read more >
Distributed Representations: Composition & Superposition – An informal note on how “distributed representations” might be understood as two different, competing strategies — “composition” and “superposition” — with quite different properties. Read more >
Privileged Bases in the Transformer Residual Stream
Our mathematical theories of the Transformer architecture suggest that individual coordinates in the residual stream should have no special significance, but recent work has shown that this is not the case in practice. We investigate this phenomenon and provisionally conclude that the per-dimension normalizers in the Adam optimizer are to blame for the effect. Read more >
Superposition, Memorization, and Double Descent
We have little mechanistic understanding of how deep learning models overfit to their training data, despite it being a central problem. Here we extend our previous work on toy models to shed light on how models generalize beyond their training data. Read more >
Toy Models of Superposition
Neural networks often seem to pack many unrelated concepts into a single neuron – a puzzling phenomenon known as ‘polysemanticity’. In our latest interpretability work, we build toy models where the origins and dynamics of polysemanticity can be fully understood. Read more >
Softmax Linear Units
An alternative activation function increases the fraction of neurons which appear to correspond to human-understandable concepts. Read more >
Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases – An informal note on intuitions related to mechanistic interpretability. Read more >
In-context Learning and Induction Heads
An exploration of the hypothesis that induction heads are the primary mechanism behind in-context learning. Read more >
A Mathematical Framework for Transformer Circuits
Early mathematical framework for reverse engineering models, demonstrated by reverse engineering small toy models. Read more >
Original Distill Circuits Thread
What can we learn if we invest heavily in reverse engineering a single neural network? Read more >
What Is the Transformer Circuits Thread Project?
The Transformer Circuits Thread is an ambitious research effort by Anthropic to reverse engineer transformer language models into human-understandable computer programs. Inspired by the Distill Circuits Thread, Anthropic aims to create interactive articles and resources that make the inner workings of transformers more interpretable and accessible.
Transformers are state-of-the-art deep learning models used for natural language processing tasks. However, their complex architectures and billions of parameters make them notoriously difficult to understand and interpret. The Transformer Circuits Thread project seeks to open up this “black box” by systematically studying and explaining the computational patterns and motifs that emerge in trained transformer models.
Anthropic believes that gaining a mechanistic understanding of how transformers work is crucial for building safe and reliable AI systems. By reverse engineering transformers, they hope to:
- Explain current safety problems in language models
- Identify new potential issues
- Anticipate safety challenges in future, more powerful models
Making Transformers Tractable
To make this ambitious goal tractable, the Transformer Circuits Thread starts by analyzing the simplest possible transformer models and gradually works up to larger, more realistic architectures. Its initial focus is on transformers with only one or two layers and only attention blocks, in contrast to modern transformers like GPT-3, which has 96 layers alternating between attention and MLP blocks.
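For readers who want a concrete picture of that difference, here is a rough, generic sketch in PyTorch of an attention-only block of the kind the early Circuits work studies, next to a standard block that also contains an MLP. This is an illustrative textbook-style implementation with arbitrary dimensions, not Anthropic’s code, and it omits layer normalization and causal masking for brevity.

```python
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    """One transformer layer with attention only – the simplified setting analyzed
    in the early Circuits work (illustrative sketch, not Anthropic's code)."""
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # The block only adds the attention output back into the residual stream.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        return x + attn_out

class StandardBlock(nn.Module):
    """A more realistic block that alternates attention with an MLP, as in GPT-style models."""
    def __init__(self, d_model=128, n_heads=4, d_mlp=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model)
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + attn_out
        return x + self.mlp(x)
```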
Early Progress
Despite starting small, Anthropic has already made significant progress in understanding these toy models by developing a new mathematical framework. Key findings include:
- Identifying “induction heads” that implement in-context learning in 2-layer models
- Showing how induction heads operate on specific examples
- Providing an elegant mathematical treatment of attention-only models
While these insights do not yet fully extend to practical transformers, Anthropic plans to show in future work that its framework and concepts like induction heads remain relevant in larger models. Although complete interpretability is still a distant goal, the Transformer Circuits Thread is an important step toward a mechanistic understanding of transformers and, ultimately, safer AI systems.
What Are Circuits Updates?
Circuits Updates are periodic blog posts where the Anthropic interpretability team shares developing research ideas, small-scale experiments, and minor findings that may not warrant a full paper. These informal updates aim to give visibility into Anthropic’s research process and plans to the wider AI community.
The Circuits Updates cover a diverse range of topics related to transformer interpretability, including:
- Experiments with model architectures and training techniques
- Analyses of learned features and attention patterns
- Hypotheses and conceptual frameworks for understanding circuits
- Replications and extensions of previous results
- Negative results and open problems
The common thread is that these are preliminary ideas the Anthropic team is actively thinking about but is not yet ready to write up as full papers. The updates are meant to read like informal lab-meeting discussions rather than polished publications.
Topics covered in Circuits Updates
Below are some examples of topics covered in past Circuits Updates:
- Studying how attention heads learn to “superpose” multiple features
- Improving sparse autoencoders by modifying activation functions
- Analyzing the geometry of learned representations
- Identifying heads that copy, move, or compare tokens
- Scaling laws for interpretability
Anthropic hopes to facilitate discussion and collaboration with other interpretability researchers through sharing these works-in-progress. The Circuits Updates provide a window into the current frontiers of transformer interpretability research at Anthropic.
How Often Does Anthropic Publish Circuits Updates?
Anthropic publishes Circuits Updates on a roughly monthly cadence, with some variability. New updates are released as blog posts on the Transformer Circuits Thread website.
Since the Transformer Circuits Thread project began in late 2021, Anthropic has published papers and Circuits Updates in months including:
- December 2021 (initial framework paper)
- February 2022
- April 2022
- May 2022
- January 2023
- February 2023
- April 2023
- May 2023
- July 2023
- January 2024
- February 2024
- March 2024
- April 2024
The length of each update varies depending on how many new results the team has to share that month. Some updates focus on a single in-depth topic, while others briefly discuss several unrelated ideas.
What is the Anthropic Interpretability Team?
The Anthropic interpretability team is a group of researchers and engineers dedicated to making AI systems more interpretable and understandable. As of April 2024, the team has grown to 17 people, representing a significant fraction of the estimated 50 full-time mechanistic interpretability researchers worldwide.
The interpretability team is part of Anthropic, an AI safety and research company based in San Francisco. Anthropic’s mission is to ensure that transformative AI systems are reliable, interpretable, and beneficial to society. The interpretability team plays a key role in this mission by conducting research to reverse engineer and understand how AI models like transformers work under the hood.
Notable members of the Anthropic interpretability team include:
- Chris Olah: A former Google Brain researcher known for his work on neural network interpretability, including the original Distill Circuits Thread that inspired Anthropic’s Transformer Circuits project.
- Nelson Elhage: A software engineer and researcher who has worked on the interpretability team since its early days.
- Catherine Olsson: An AI safety researcher who collaborates closely with the interpretability team.
The team takes an interdisciplinary approach, combining expertise from machine learning, neuroscience, physics, and software engineering. They aim to treat interpretability as a rigorous science, developing new experimental methods and mathematical frameworks to study AI systems.
Some key research directions for the interpretability team include:
- Analyzing attention heads and computational circuits in transformers
- Improving interpretability with sparse autoencoders and monosemantic feature learning
- Using toy models to study emergent behaviors like superposition and induction heads
- Scaling interpretability techniques to larger models
The team frequently shares updates on their latest experiments, hypotheses, and results through the Circuits Updates series on the Transformer Circuits Thread website. These informal posts provide a window into the current frontiers of mechanistic interpretability research at Anthropic.
What Research Areas Do Anthropic’s Circuits Updates Focus On?
Anthropic’s Circuits Updates span a wide range of transformer interpretability topics, but a few key research themes have emerged.
Analyzing Attention Heads and Circuits
Many Circuits Updates dive deep into the behavior of individual attention heads and the computational “circuits” they form with other heads, MLPs, and skip connections. Anthropic has identified several important head types, including:
- Induction heads that perform dynamic in-context learning
- Heads that copy or move tokens from one position to another
- Heads that compare tokens and attend based on content similarity
To study these heads, Anthropic applies techniques like:
- Deriving mathematical expressions for head computations
- Tracing head activations on specific inputs
- Perturbing or ablating heads and measuring the impact on model outputs (see the sketch after this list)
- Clustering heads based on activation patterns
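Here is a minimal sketch of the zero-ablation idea referenced in the list above. It uses a small stand-in attention module so the example is self-contained; in practice one would hook into the heads of a real model rather than this toy layer, and might replace a head’s output with its mean rather than zero.

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Toy multi-head self-attention that exposes per-head outputs (illustrative only)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, ablate_head=None):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, d_head).
        q, k, v = (t.reshape(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        head_out = attn @ v                       # (batch, heads, tokens, d_head)
        if ablate_head is not None:
            head_out[:, ablate_head] = 0.0        # zero out one head's contribution
        return self.out(head_out.transpose(1, 2).reshape(B, T, D))

# Compare outputs with and without head 2 to see how much it mattered for this input.
layer = TinyAttention()
x = torch.randn(1, 10, 64)
with torch.no_grad():
    baseline = layer(x)
    ablated = layer(x, ablate_head=2)
print("effect of ablating head 2:", (baseline - ablated).norm().item())
```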
The main goal is to break down the complex computations performed by transformers into smaller, human-interpretable components or algorithms. Induction heads were an early success, showing how a 2-layer transformer implements in-context learning via specific attention patterns.
More recent updates have started to explore how attention heads interact and compose to form larger circuits. Anthropic is developing methods to trace the flow of information between heads and identify when one head’s output is influencing another head’s computations.
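A common diagnostic for induction heads in the mechanistic interpretability literature is a prefix-matching score: feed the model a random token sequence repeated twice and measure how much attention a head pays, from each position in the second half, to the token that followed the same token in the first half. The sketch below assumes you can extract a head’s attention pattern as a matrix; the tensor name and setup are hypothetical.

```python
import torch

def induction_score(attn_pattern: torch.Tensor, seq_len: int) -> float:
    """Average attention from position t in the repeated half back to position
    t - seq_len + 1, i.e. the token that followed the previous occurrence of the
    token at t. attn_pattern is a (2*seq_len, 2*seq_len) attention matrix for one
    head, on an input made of a random length-seq_len sequence repeated twice."""
    scores = [attn_pattern[t, t - seq_len + 1].item() for t in range(seq_len, 2 * seq_len)]
    return sum(scores) / len(scores)

# A hypothetical "perfect" induction head attends exactly to the token after the
# previous occurrence of the current token, so its score is 1.0.
seq_len = 8
perfect = torch.zeros(2 * seq_len, 2 * seq_len)
for t in range(seq_len, 2 * seq_len):
    perfect[t, t - seq_len + 1] = 1.0
print(induction_score(perfect, seq_len))  # 1.0
```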
Improving Interpretability with Sparse Autoencoders
Another line of research seeks to make transformers more interpretable by changing their architecture or training process. A key idea is using sparse autoencoders to learn more human-interpretable features or “dictionaries”.
Standard transformers often learn entangled, polysemantic features that are difficult to understand because they encode many unrelated concepts. A sparse autoencoder is trained on a model’s internal activations and re-expresses them as a much larger set of sparsely active, more disentangled features; related work modifies the model itself to encourage sparsity. Techniques in this line of research include:
- Imposing an L1 penalty on feature activations to encourage most of them to be zero
- Pruning weights to create a sparser connectivity pattern
- Using activation functions like SoLU that promote sparsity
Anthropic has found that sparse autoencoders can recover more interpretable, “monosemantic” features, where each learned feature activates for a single, human-understandable concept. Some updates have analyzed the training dynamics and geometry of these sparse representations.
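A minimal sketch of this dictionary-learning setup, in the spirit of Towards Monosemanticity but not Anthropic’s actual implementation: an overcomplete autoencoder is trained on a batch of model activations (random data stands in for them here), with an L1 penalty that keeps most feature activations at zero.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder with non-negative, sparsely active features
    (a sketch of the dictionary-learning idea, not Anthropic's code)."""
    def __init__(self, d_activation=512, n_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_activation, n_features)
        self.decoder = nn.Linear(n_features, d_activation)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the original activations
        return recon, features

def sae_loss(acts, recon, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Hypothetical usage: random data stands in for activations collected from a model.
sae = SparseAutoencoder()
acts = torch.randn(64, 512)
recon, features = sae(acts)
sae_loss(acts, recon, features).backward()
```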
Anthropic is also exploring other architectural modifications like SoLU and variations on the standard transformer block. The goal is to find model designs that achieve high performance while also being easier to interpret.
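For reference, the SoLU activation described in the Softmax Linear Units paper multiplies each activation vector elementwise by its own softmax, which amplifies the largest pre-activations and suppresses the rest; in the paper it is followed by a LayerNorm. A minimal sketch (the MLP wrapper and dimensions are illustrative):

```python
import torch
import torch.nn as nn

def solu(x: torch.Tensor) -> torch.Tensor:
    """SoLU: elementwise product of the activations with their softmax."""
    return x * torch.softmax(x, dim=-1)

class SoluMLP(nn.Module):
    """Sketch of an MLP layer using SoLU followed by LayerNorm, as in the SoLU paper."""
    def __init__(self, d_model=128, d_mlp=512):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_mlp)
        self.ln = nn.LayerNorm(d_mlp)
        self.w_out = nn.Linear(d_mlp, d_model)

    def forward(self, x):
        return self.w_out(self.ln(solu(self.w_in(x))))
```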
Studying Emergent Behaviors in Toy Models
A third research direction uses simple “toy models” to study emergent behaviors that may be relevant for understanding larger language models. By stripping down transformers to their core components, Anthropic can isolate specific phenomena in a more controlled setting.
For example, the Toy Models of Superposition paper trained tiny ReLU networks to reconstruct a set of sparse “ground truth” features after squeezing them through a hidden layer with fewer dimensions than features. This setup allowed Anthropic to precisely study how and when models learn to combine or “superpose” multiple features into a single dimension.
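A minimal sketch of this kind of toy setup, following the general recipe of the paper but with arbitrary sizes and without the per-feature importance weighting the paper uses: sparse synthetic features are squeezed through a bottleneck smaller than the number of features, and the model is trained to reconstruct them.

```python
import torch
import torch.nn as nn

n_features, d_hidden, sparsity = 20, 5, 0.05  # more features than hidden dimensions

class ToyModel(nn.Module):
    """Toy superposition model: project features down to a small hidden space,
    then reconstruct them with a ReLU (sketch, not Anthropic's exact code)."""
    def __init__(self):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W.T                          # (batch, d_hidden) bottleneck
        return torch.relu(h @ self.W + self.b)    # reconstructed features

model = ToyModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5000):
    # Sparse synthetic "ground truth" features: each is active with low probability.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) < sparsity)
    loss = ((model(x) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
# After training, columns of W reveal which features share hidden directions (superposition).
```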
Surprisingly, even these minimalist models exhibited rich behaviors like:
- Learning compressed representations that store more ground-truth features than they have dimensions
- Undergoing sharp “phase transitions” in their learning dynamics
- Developing geometrically meaningful feature subspaces
Anthropic hypothesizes that similar superposition effects occur in large language models and contribute to their polysemanticity. Toy models provide a tractable way to explore these ideas.
Other updates have used toy models to study topics like grokking (transformers suddenly generalizing after memorizing the training data) and modularity (models decomposing tasks into reusable subtasks). The hope is that insights from these simplified settings can guide interpretability work on practical transformers.
Scaling Laws and Induction Heads
Finally, several Circuits Updates have investigated how transformer interpretability scales with model size. Anthropic has trained models of varying sizes on synthetic datasets to measure how learning dynamics and circuit formation change.
A key finding is the existence of an “induction bump”: a sharp increase in in-context learning performance that appears abruptly during training. The bump coincides with the emergence of induction heads, suggesting that models need a certain minimum capacity (in particular, more than one attention layer) before they can form in-context learning circuits.
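The in-context learning performance referred to here is often summarized with a simple score: how much lower the model’s per-token loss is late in the context than early in the context. A minimal sketch, where the specific early and late positions are illustrative assumptions rather than Anthropic’s exact choice:

```python
import torch

def in_context_learning_score(token_losses: torch.Tensor, early: int = 50, late: int = 500) -> float:
    """Average loss at a late context position minus the average loss at an early one,
    over a batch of sequences; more negative means the model benefits more from context.
    token_losses has shape (n_sequences, n_tokens); the index choices are illustrative."""
    return (token_losses[:, late] - token_losses[:, early]).mean().item()
```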
Anthropic has also studied how the number and specificity of learned features scale with model size, using centered kernel alignment (CKA) and other similarity measures. They find that larger models learn more features overall, but that the features become less human-interpretable on average.
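Linear CKA, mentioned above, is a standard way to compare two sets of representations. Here is a minimal NumPy sketch of the textbook formulation; it is not necessarily the exact variant Anthropic used.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment between two activation matrices of shape
    (n_examples, n_features). Values near 1 indicate very similar representations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(numerator / denominator)

# Example: a representation compared with a rotated copy of itself gives CKA close to 1.
X = np.random.randn(100, 32)
Q, _ = np.linalg.qr(np.random.randn(32, 32))
print(linear_cka(X, X @ Q))
```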
These scaling-law studies aim to predict how interpretability will change as transformers continue to grow in size. Anthropic hopes to use these insights to design interpretability tools that can keep pace with state-of-the-art models.