Transformer Circuits: Understanding Attention Mechanisms

After replicating several papers from Anthropic's interpretability team and working through the ARENA curriculum, I've developed a deeper appreciation for how transformers actually work. Let me walk you through understanding attention mechanisms from a circuits perspective.

What Are Transformer Circuits?

Think of a neural network like a circuit board. Individual components (attention heads, MLP layers, neurons) are connected in complex ways to perform specific computations. A circuit is a subgraph of the network that implements a coherent algorithm.

The circuits framework asks: What is the algorithm that this part of the network learned? Not just "what does it activate on?" but "what computation does it perform?"

Attention: The Core Mechanism

Attention is how transformers route information between positions. Let me break down what's actually happening mathematically and algorithmically.

The Math

For a sequence of tokens with embeddings $X \in \mathbb{R}^{n \times d}$:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where:
- $Q = XW_Q$ (queries) - "what am I looking for?"
- $K = XW_K$ (keys) - "what do I contain?"
- $V = XW_V$ (values) - "what information do I send?"
- $d_k$ is the dimension of each query and key vector
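
As a minimal sketch of the formula above, here is single-head attention written out in PyTorch. The sizes, the random weights, and the function name are assumptions for illustration. The causal mask is also an addition: GPT-style models apply it, but the formula as written does not include it.

```python
import math
import torch

def single_head_attention(X, W_Q, W_K, W_V, causal=True):
    """Return (output, pattern) for one attention head.

    X: [n, d] token embeddings; W_Q, W_K, W_V: [d, d_k] projections.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)          # [n, n], rows are queries
    if causal:
        # Position i may only attend to positions j <= i.
        n = X.shape[0]
        future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    pattern = torch.softmax(scores, dim=-1)    # each row sums to 1
    return pattern @ V, pattern

torch.manual_seed(0)
n, d, d_k = 5, 16, 4                           # illustrative sizes, not GPT-2's
X = torch.randn(n, d)
W_Q, W_K, W_V = (torch.randn(d, d_k) for _ in range(3))
out, pattern = single_head_attention(X, W_Q, W_K, W_V)
print(out.shape, pattern.sum(dim=-1))
```

The returned `pattern` has the same [query_pos, key_pos] layout as the attention patterns inspected in the rest of this post.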

The Intuition

Each token asks a question (query), and every token it is allowed to see offers information (key-value pairs). In GPT-style models, that means the token itself and all earlier tokens. Attention weights determine how much to "listen" to each one.

But that's the textbook version. Let's dig deeper.

Types of Attention Heads

Through mechanistic interpretability research, we've discovered that attention heads learn specific algorithms:

1. Previous Token Heads

The simplest pattern: always attend to the previous token.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

# Test prompt
prompt = "The cat sat on the mat and"
logits, cache = model.run_with_cache(prompt)

# Check attention pattern for layer 0, head 0 (an arbitrary head; change the indices to scan others)
attention = cache["pattern", 0][0, 0]  # [query_pos, key_pos]

# Previous token heads show strong diagonal (offset by -1)
print("Attention pattern diagonal scores:")
for offset in range(-3, 1):
    score = attention.diagonal(offset).mean().item()
    print(f"Offset {offset}: {score:.3f}")

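Why does this matter? Previous token heads write information about the preceding token into each position. On their own they support simple local statistics ("what typically comes after this token?"), and later heads such as induction heads build on what they write.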
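
The diagonal check above can be packaged into one score per head, so a whole layer can be scanned at once. This is a sketch under assumptions: `prev_token_scores` is a hypothetical helper name, and the two synthetic patterns stand in for real ones. With a real model, `cache["pattern", layer][0]` should have the [n_heads, query_pos, key_pos] shape the function expects.

```python
import torch

def prev_token_scores(patterns):
    """Mean attention from each position i >= 1 to position i - 1.

    patterns: [n_heads, query_pos, key_pos] -> tensor of shape [n_heads].
    """
    return patterns.diagonal(offset=-1, dim1=-2, dim2=-1).mean(dim=-1)

n = 6

# Synthetic head that always attends to the previous token
# (position 0 has no previous token, so it attends to itself).
ideal = torch.zeros(n, n)
ideal[0, 0] = 1.0
ideal[torch.arange(1, n), torch.arange(0, n - 1)] = 1.0

# Baseline head: uniform attention over all causally visible positions.
uniform = torch.tril(torch.ones(n, n))
uniform = uniform / uniform.sum(dim=-1, keepdim=True)

scores = prev_token_scores(torch.stack([uniform, ideal]))
best_head = scores.argmax().item()
print(scores, best_head)
```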

2. Induction Heads

One of the most important discoveries in transformer circuits. Induction heads appear to drive much of in-context learning.

The Pattern: [A][B] ... [A] → predict [B]

# Classic induction test
test_prompts = [
    "The cat sat on the mat. The cat",  # Should predict " sat"
    "Alice went to the store. Alice",    # Should predict " went"
    "print hello world\nprint hello",     # Should predict " world"
]

def find_induction_heads(model, prompts):
    """Find heads that implement induction"""
    # Run each prompt once and reuse its cache for every head
    caches = [model.run_with_cache(prompt)[1] for prompt in prompts]

    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            scores = []

            for cache in caches:
                pattern = cache["pattern", layer][0, head]

                # Induction heads attend to the token that followed an
                # earlier copy of the current token.
                # Check if pattern shows this structure and append a score
                # ... implementation details ...

            # is_induction_head is a placeholder for a threshold test on
            # scores; it is left undefined in this snippet
            if is_induction_head(scores):
                print(f"Induction head: Layer {layer}, Head {head}")

find_induction_heads(model, test_prompts)
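
One simple way to score a single head for induction-like attention is sketched below. It is a heuristic of my own choosing here, not a method taken from the papers. `induction_score`, the token ids, and the synthetic patterns are all hypothetical.

```python
import torch

def induction_score(pattern, token_ids):
    """Mean attention from each repeated token to the position right
    after its most recent earlier occurrence.

    pattern: [query_pos, key_pos]; token_ids: list of ints, same length.
    Returns 0.0 if no token repeats.
    """
    last_seen = {}   # token id -> most recent position
    weights = []
    for q, tok in enumerate(token_ids):
        if tok in last_seen:
            target = last_seen[tok] + 1    # the token that followed last time
            weights.append(pattern[q, target])
        last_seen[tok] = q
    return torch.stack(weights).mean().item() if weights else 0.0

token_ids = [7, 3, 9, 4, 7, 3, 9, 4]   # made-up ids: a block of 4 repeated once
n = len(token_ids)

# Synthetic "perfect" induction head.
ideal = torch.zeros(n, n)
ideal[:4, 0] = 1.0                     # first half: nothing to match yet
for q in range(4, n):
    ideal[q, q - 3] = 1.0              # second half: one past the earlier copy

# Baseline: uniform attention over all causally visible positions.
uniform = torch.tril(torch.ones(n, n))
uniform = uniform / uniform.sum(dim=-1, keepdim=True)

print(f"ideal:   {induction_score(ideal, token_ids):.3f}")
print(f"uniform: {induction_score(uniform, token_ids):.3f}")
```

With a real model, `pattern` would come from `cache["pattern", layer][0, head]` and `token_ids` from `model.to_tokens(prompt)[0].tolist()`, so that both include the prepended BOS token.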

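Discovery: In GPT-2 Small, the induction heads reported in the literature sit in the middle layers (for example, heads 5.5 and 6.9 in Interpretability in the Wild). In the induction heads paper, the point in training at which these heads form coincides with a sharp improvement in in-context learning.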

3. Composition: K-Composition and V-Composition

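Attention heads compose: an earlier head writes to the residual stream, and a later head reads that output through its queries, keys, or values.

K-Composition: Earlier head's output feeds the later head's keys, shaping where the later head attends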

# Layer 0 head identifies "the noun"
# Layer 1 head attends to "the noun" to incorporate its information

V-Composition: Earlier head computes features, later head moves them

# Layer 0 head: "is this word capitalized?"
# Layer 1 head: attends to first token, accumulates "capitalization" info

This is one way transformers can build up hierarchical abstractions.
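
Composition can also be estimated from weights alone. The sketch below follows the Frobenius-norm composition score described in A Mathematical Framework for Transformer Circuits, rewritten for a row-vector convention (activations multiply weights from the left). Random matrices stand in for a trained model, and all names and sizes are assumptions for illustration.

```python
import torch

def composition_score(A, B):
    """||A @ B||_F / (||A||_F * ||B||_F), a value between 0 and 1."""
    num = torch.linalg.matrix_norm(A @ B)
    den = torch.linalg.matrix_norm(A) * torch.linalg.matrix_norm(B)
    return (num / den).item()

torch.manual_seed(0)
d_model, d_head = 64, 8    # illustrative sizes

def random_head():
    """Random weights standing in for a trained head."""
    W_Q, W_K, W_V = (torch.randn(d_model, d_head) for _ in range(3))
    W_O = torch.randn(d_head, d_model)
    # QK and OV circuits, both [d_model, d_model]
    return W_Q @ W_K.T, W_V @ W_O

QK1, OV1 = random_head()   # earlier head
QK2, OV2 = random_head()   # later head

# With x as a row vector, the later head's score is x_q @ QK2 @ x_k.T
# and a head's output is x @ OV.
q_comp = composition_score(OV1, QK2)      # earlier output feeds the query side
k_comp = composition_score(QK2, OV1.T)    # earlier output feeds the key side
v_comp = composition_score(OV1, OV2)      # earlier output feeds the value side
print(q_comp, k_comp, v_comp)
```

Random heads should score low on all three. A pair of trained heads that genuinely compose should stand out against that baseline.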

Deep Dive: The Induction Circuit

Let me walk through the full induction circuit - it's a beautiful example of algorithmic discovery.

Components

- Previous Token Head (an early layer; layer 0 in two-layer toy models)
  - Writes "the token before me was X" into each position
- Induction Head (a later layer; layers 5-6 in GPT-2 Small)
  - Attends to the position whose previous token matches the current token
  - Copies the value from that position

The Algorithm in Detail

For prompt: "The cat sat on the mat. The cat"

Step 1: Previous token head processes "sat" (the token after the first "cat")
- Attends from "sat" back to "cat"
- Writes into the "sat" position: "the token before me was 'cat'"

Step 2: Induction head processes second "cat"
- Query from the second "cat" position: "which position was preceded by 'cat'?"
- Uses K-composition: its keys read the previous-token information written in Step 1
- Attends to the token after the first "cat" (which is "sat")
- Copies the "sat" representation

Result: Model predicts "sat"!
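
The two steps can be mimicked with ordinary list operations. This is a toy, non-neural sketch. `induction_predict` is a hypothetical name, and the token list is written by hand rather than produced by the GPT-2 tokenizer.

```python
def induction_predict(tokens):
    """Toy version of the two-step induction algorithm.

    Returns the predicted next token, or None if the last token
    has not appeared earlier in the sequence.
    """
    # Step 1 (previous token head): each position records which token
    # came immediately before it.
    prev_token = [None] + tokens[:-1]

    # Step 2 (induction head): from the last position, look for the most
    # recent position whose recorded previous token equals the current
    # token, then copy the token found there.
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if prev_token[pos] == current:
            return tokens[pos]
    return None

tokens = ["The", " cat", " sat", " on", " the", " mat", ".", " The", " cat"]
print(repr(induction_predict(tokens)))  # ' sat'
```

Note that a position can only record what came before it, never what comes after. That matches the causal mask in the real model.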

Implementation

def analyze_induction_circuit(model, prompt):
    """Analyze how induction circuit processes this prompt"""

    logits, cache = model.run_with_cache(prompt)
    tokens = model.to_str_tokens(prompt)

    # Find repeated token
    repeated_pos = -1  # Position of second occurrence
    repeated_token = tokens[repeated_pos]
    first_pos = tokens.index(repeated_token)

    # Analyze previous token head (assume layer 4, head 11, one of the
    # previous token heads reported for GPT-2 Small in the IOI paper)
    prev_head_pattern = cache["pattern", 4][0, 11]
    print(f"Previous token head attention from '{repeated_token}':")
    print(f"  Attends to previous position: {prev_head_pattern[repeated_pos, repeated_pos-1]:.3f}")

    # Analyze induction head (assume layer 5, head 5, also from the IOI paper)
    induction_pattern = cache["pattern", 5][0, 5]
    print(f"\nInduction head attention from '{repeated_token}':")
    print(f"  Attends to first '{repeated_token}': {induction_pattern[repeated_pos, first_pos]:.3f}")
    print(f"  Attends after first '{repeated_token}': {induction_pattern[repeated_pos, first_pos+1]:.3f}")

    # What does it predict?
    predicted_token_id = logits[0, repeated_pos].argmax()
    predicted_token = model.to_string(predicted_token_id)
    actual_next = tokens[first_pos + 1]

    print(f"\nPrediction: '{predicted_token}'")
    print(f"Expected (from pattern): '{actual_next}'")
    print(f"Match: {predicted_token == actual_next}")

analyze_induction_circuit(model, "The cat sat on the mat. The cat")

Attention Head Composition Circuits

The real power comes from composition. Let me show you a more complex circuit.

The IOI Circuit (Indirect Object Identification)

For sentences like: "When John and Mary went to the store, John gave a drink to"

The model should predict "Mary" (the indirect object).

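Circuit components:
1. Duplicate Token Heads: Detect that the subject name (John) has already appeared
2. S-Inhibition Heads: Use that signal to steer the Name Mover Heads away from the subject
3. Name Mover Heads: Attend to the remaining name (Mary) and copy it to the final position
4. Negative Name Mover Heads: Write in the opposite direction to the Name Mover Heads, lowering the logit of the name they attend to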

def analyze_ioi_circuit(model):
    """Analyze Indirect Object Identification circuit"""

    prompt = "When John and Mary went to the store, John gave a drink to"
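    logits, cache = model.run_with_cache(prompt)
    tokens = model.to_str_tokens(prompt)  # includes the prepended BOS token

    # Identify key positions from the tokens rather than hard-coding them,
    # since the BOS token shifts every index by one
    # (assumes " Mary" and " John" are single tokens, as in GPT-2)
    io_pos = tokens.index(" Mary")                         # "Mary" (indirect object)
    s_pos = len(tokens) - 1 - tokens[::-1].index(" John")  # second "John" (subject)
    end_pos = -1                                           # Final position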

    # Check Name Mover Heads (move Mary's info to end)
    for layer in [9, 10]:  # Typical NMH layers
        for head in range(model.cfg.n_heads):
            pattern = cache["pattern", layer][0, head]
            io_attn = pattern[end_pos, io_pos].item()

            if io_attn > 0.5:
                print(f"Name Mover Head: Layer {layer}, Head {head}")
                print(f"  Attention to IO (Mary): {io_attn:.3f}")

    # Check Negative Name Mover Heads (suppress John)
    # ... implementation details ...

    # Verify prediction
    predicted = model.to_string(logits[0, end_pos].argmax())
    print(f"\nPredicted: {predicted}")  # Should be "Mary"

analyze_ioi_circuit(model)

Attention as Computation

Let me share a key insight: Attention is not just "which tokens are relevant." It's a programming primitive.

Attention Implements:

- Routing: Move information from position A to position B
- Lookup: Find the token with property X
- Aggregation: Combine information from multiple positions
- Filtering: Select based on keys, extract via values

This is a large part of why transformers are so flexible - they learn to write their own routing logic.
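
To make the "lookup" primitive concrete, here is a hand-built example in plain Python. Nothing in it is learned: the key feature, the query scale, and the values are all chosen by hand as assumptions, purely to show how a query-key match selects a position and the value carries the payload.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["the", "Cat", "sat", "down"]

# Key: one feature, 1.0 if the token is capitalized (hand-written, not learned).
keys = [1.0 if t[0].isupper() else 0.0 for t in tokens]

# Value: the payload each token would send, here simply its position.
values = [float(i) for i in range(len(tokens))]

# Query: "find the capitalized token". A large magnitude makes the
# softmax behave almost like a hard lookup.
query = 20.0
weights = softmax([query * k for k in keys])
output = sum(w * v for w, v in zip(weights, values))

print([round(w, 3) for w in weights], round(output, 3))
```

Lowering `query` toward zero turns the same head from a lookup into an aggregation, since the weights spread out across positions.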

Practical Techniques for Analysis

1. Activation Patching

A standard technique for testing causal claims about a component:

def patch_head(model, clean_prompt, corrupt_prompt, layer, head):
    """
    Run model on corrupted input, but patch in clean activation
    for one specific head. Measure impact on output.

    Both prompts must tokenize to the same number of tokens.
    """

    # Get clean activation. hook_z is the per-head output before W_O,
    # shape [batch, pos, head, d_head]. It is cached by default, whereas
    # hook_result is only computed after model.set_use_attn_result(True).
    _, clean_cache = model.run_with_cache(clean_prompt)
    clean_head_out = clean_cache["z", layer][:, :, head]

    # Patch into corrupted run
    def hook_fn(activation, hook):
        activation[:, :, head] = clean_head_out
        return activation

    patched_logits = model.run_with_hooks(
        corrupt_prompt,
        fwd_hooks=[(f"blocks.{layer}.attn.hook_z", hook_fn)]
    )

    return patched_logits

# If patching restores correct behavior, that head is important!
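
"Restores correct behavior" is easier to judge with a number. One common choice is to normalize the patched result between the corrupted and clean runs, as sketched below. `patching_effect` is a hypothetical helper and the values are made up. In practice each metric might be a logit difference such as `logits[0, -1, correct_id] - logits[0, -1, wrong_id]`.

```python
def patching_effect(patched, clean, corrupt):
    """Fraction of the clean-vs-corrupt gap recovered by a patch.

    Each argument is a scalar metric from one run.
    0.0 means the patch changed nothing; 1.0 means full recovery.
    """
    gap = clean - corrupt
    if gap == 0:
        raise ValueError("clean and corrupt metrics are identical")
    return (patched - corrupt) / gap

# Made-up logit differences, for illustration only.
clean_diff, corrupt_diff, patched_diff = 3.0, -1.0, 2.0
print(patching_effect(patched_diff, clean_diff, corrupt_diff))  # 0.75
```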

2. Attention Pattern Visualization

import circuitsvis as cv

def visualize_attention(model, prompt, layer, head):
    """Interactive attention visualization"""

    logits, cache = model.run_with_cache(prompt)
    tokens = model.to_str_tokens(prompt)
    pattern = cache["pattern", layer][0, head]

    # attention_heads accepts head names; return the result so a
    # notebook cell can display it
    return cv.attention.attention_heads(
        attention=pattern.unsqueeze(0),  # Add head dimension
        tokens=tokens,
        attention_head_names=[f"L{layer}H{head}"]
    )

visualize_attention(model, "The cat sat on the mat", layer=0, head=5)

3. Logit Lens

See what the model "thinks" at each layer:

def logit_lens(model, prompt):
    """Show predicted tokens at each layer"""

    logits, cache = model.run_with_cache(prompt)

    for layer in range(model.cfg.n_layers):
        # Get residual stream at this layer, keeping batch and position dims
        resid = cache["resid_post", layer][:, -1:, :]  # Last position

        # Project to vocabulary
        layer_logits = model.unembed(model.ln_final(resid))
        predicted = model.to_string(layer_logits[0, -1].argmax())

        print(f"Layer {layer}: {predicted}")

logit_lens(model, "The capital of France is")

Current Research Questions

1. Universality

Do different models learn the same circuits? Evidence suggests at least partly: induction heads have been reported across many transformer language models of different sizes, though how universal other circuits are remains open.

2. Superposition in Attention

Can attention heads implement multiple algorithms simultaneously? Preliminary evidence says yes, via superposition.

3. Attention Head Diversity

Why does GPT-2 Small have 12 heads per layer? Would 1 large head work? 100 small heads?

Hypothesis: Different heads specialize in different composition types.

Conclusion

Attention is the fundamental building block of transformers, but it's more sophisticated than "selective focus." Attention heads implement learned algorithms that compose to build complex behaviors.

By studying circuits, we can:
- Understand what models actually do
- Predict when they'll fail
- Design better architectures
- Contribute to AI safety

The field is young, and there's so much left to discover. Every paper I replicate raises more questions than it answers. But that's what makes it exciting!

Key Papers:
- A Mathematical Framework for Transformer Circuits
- In-context Learning and Induction Heads
- Interpretability in the Wild

Want to discuss attention mechanisms or circuits? I'm always up for a deep technical conversation!