If you have worked with Transformers—whether BERT, GPT, or the original Encoder-Decoder architecture—you are intimately familiar with the concept of "Attention." The equation is arguably the most famous in modern NLP:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
However, memorizing the formula often obscures an important architectural nuance: while the mathematical operation is identical, the source of the inputs determines whether you are performing Self-Attention or Cross-Attention.
This article explores the mechanical and semantic differences between these two mechanisms, targeted at practitioners who understand the basics of deep learning.
The Prerequisite: Query, Key, and Value
Before the two mechanisms diverge, let’s briefly ground ourselves in their shared inputs. Both rely on projecting input vectors into three subspaces:
- Query ($Q$): The token currently asking for information.
- Key ($K$): The label or identifier of other tokens in the sequence.
- Value ($V$): The actual content information of those tokens.
The dot product $QK^T$ measures similarity (how much each Query cares about each Key); after scaling by $\sqrt{d_k}$ and applying the softmax, it yields an attention map that dictates how much of each Value is aggregated.
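To make the formula concrete, here is a minimal PyTorch sketch of the operation itself; the helper name `attention` and the tensor shapes are illustrative, not taken from any particular library.

import math
import torch

def attention(Q, K, V):
    # Q: (..., L_q, d_k), K: (..., L_k, d_k), V: (..., L_k, d_v)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # similarity, scaled by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)                   # each row is a distribution over Keys
    return weights @ V                                        # aggregate Values, shape (..., L_q, d_v)

Everything that follows is about where $Q$, $K$, and $V$ come from, not about this computation.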
1. Self-Attention (Intra-Attention)
Self-attention is the mechanism that allows a model to build a contextual understanding of a single sequence. It looks at the correlations between different words within the same sentence.
The Mechanism
In self-attention, the Queries, Keys, and Values all originate from the same source (e.g., the output of the previous layer).
Given an input sequence embedding matrix $X$:
$$ Q = X W_Q \\ K = X W_K \\ V = X W_V$$
Where $W_Q, W_K, W_V$ are learnable weight matrices.
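In code, self-attention boils down to three linear projections of the same tensor, followed by the attention computation from the earlier sketch. The model dimension and sequence length below are illustrative:

import math
import torch
import torch.nn as nn

d_model = 512
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

X = torch.randn(10, d_model)                # one sequence of 10 token embeddings
Q, K, V = W_Q(X), W_K(X), W_V(X)            # all three derive from the same X
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_model)   # (10, 10) attention scores
out = torch.softmax(scores, dim=-1) @ V     # contextualized tokens, shape (10, d_model)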
The Semantic Goal
The goal here is contextualization and disambiguation.
Consider the sentence: "The animal didn't cross the street because it was too tired."
A static word embedding (like Word2Vec) treats "it" generically. However, a Self-Attention layer allows the vector for "it" to query all other words in the sentence. The model learns that "it" has a high attention score with "animal" (the subject) and updates the representation of "it" to reflect that context.
Where it is used
- Encoders (e.g., BERT): To understand the full context of a sentence.
- Decoders (e.g., GPT): To generate the next token based on previous tokens, usually with causal masking to prevent looking ahead (a sketch of the mask follows this list).
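The causal mask is typically applied to the attention scores before the softmax, setting future positions to negative infinity so they receive zero weight. A minimal sketch, with random scores standing in for the scaled $QK^T$ values:

import torch

L = 5                                           # sequence length
scores = torch.randn(L, L)                      # stand-in for the scaled QK^T scores
future = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # True above the diagonal
scores = scores.masked_fill(future, float("-inf"))   # forbid attending to later positions
weights = torch.softmax(scores, dim=-1)         # row i only attends to positions <= i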
2. Cross-Attention (Encoder-Decoder Attention)
Cross-attention (often simply called "attention" in older seq2seq papers) is the mechanism that mixes information from two different sequences. It is the bridge between the Encoder and the Decoder.
The Mechanism
In cross-attention, the Queries come from the destination sequence, while Keys and Values come from the source sequence.
Let $H_{enc}$ be the output of the Encoder (the source) and $H_{dec}$ be the current representation of the Decoder (the target).
$$Q = H_{dec} W_Q \\ K = H_{enc} W_K \\ V = H_{enc} W_V$$
Crucial Distinction: $Q$ is derived from the target sequence (what we are generating), but $K$ and $V$ are derived from the source sequence (what we are translating/summarizing).
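The only change from the self-attention sketch is where the inputs come from; note the two different sequence lengths. The sizes below are illustrative:

import torch
import torch.nn as nn

d_model = 512
W_Q = nn.Linear(d_model, d_model, bias=False)
W_K = nn.Linear(d_model, d_model, bias=False)
W_V = nn.Linear(d_model, d_model, bias=False)

H_enc = torch.randn(12, d_model)   # source: 12 encoder tokens
H_dec = torch.randn(7, d_model)    # target: 7 decoder positions so far

Q = W_Q(H_dec)                     # queries from the target sequence
K, V = W_K(H_enc), W_V(H_enc)      # keys and values from the source sequence
# The attention map has shape (7, 12): one row per target position,
# one weight per source position.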
The Semantic Goal
The goal here is alignment and conditioning.
Imagine translating "I love coffee" to French ("J'aime le café").
- The Encoder processes "I love coffee" into keys ($K$) and values ($V$).
- The Decoder has generated "J'aime". Now it needs to generate the next word.
- The Decoder produces a Query ($Q$) representing "J'aime".
- This Query "searches" the Encoder's Keys. It finds a high match with "love" and "coffee".
- The attention mechanism extracts the corresponding Values from the Encoder to help the Decoder choose the word "le".
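At a single decoding step, this boils down to one query vector attending over every encoder position. A toy sketch of the shapes involved, with random vectors standing in for the real representations:

import math
import torch

d_k = 64
K = torch.randn(3, d_k)    # encoder keys for "I", "love", "coffee"
V = torch.randn(3, d_k)    # encoder values for the same tokens
q = torch.randn(1, d_k)    # query for the current decoder state ("J'aime")

weights = torch.softmax(q @ K.T / math.sqrt(d_k), dim=-1)   # shape (1, 3): distribution over source tokens
context = weights @ V                                       # shape (1, d_k): blended source information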
Where it is used
- Encoder-Decoder models (e.g., T5, BART, Original Transformer): Specifically used in the Decoder blocks to condition generation on the Encoder's output.
- Stable Diffusion: Text-to-Image models use cross-attention where the image latent acts as the Query, and the text prompt embeddings act as Keys and Values.
Side-by-Side Comparison
| Feature | Self-Attention | Cross-Attention |
|---|---|---|
| Input Source | Single sequence ($X$) | Two sequences ($X$ and $Y$) |
| Query ($Q$) Source | Input $X$ | Target Sequence $Y$ (Decoder) |
| Key ($K$) Source | Input $X$ | Context Sequence $X$ (Encoder) |
| Value ($V$) Source | Input $X$ | Context Sequence $X$ (Encoder) |
| Sequence Lengths | $Q$, $K$, $V$ share the same length | $Q$ has length $L_{tgt}$; $K$, $V$ have length $L_{src}$ |
| Primary Function | Understanding relationships within data | Mapping/Aligning two different data modalities |
Implementation Intuition (PyTorch-style)
To make this concrete, look at how the inputs differ in code.
import torch.nn as nn
class AttentionBlock(nn.Module):
def __init__(self, embed_dim):
super().__init__()
self.multihead_attn = nn.MultiheadAttention(embed_dim, num_heads=8)
def forward(self, x, context=None):
"""
x: The input stream (e.g., Decoder state)
context: The external stream (e.g., Encoder output)
"""
if context is None:
# SELF-ATTENTION
# Query, Key, and Value all come from x
output, _ = self.multihead_attn(query=x, key=x, value=x)
else:
# CROSS-ATTENTION
# Query comes from x (target)
# Key and Value come from context (source)
output, _ = self.multihead_attn(query=x, key=context, value=context)
return output
In standard libraries, whether torch.nn's MultiheadAttention or the attention layers inside HuggingFace Transformers models, the attention module itself is agnostic: the type of attention is defined purely by which tensors you pass as the query, key, and value arguments during the forward pass.
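For example, the block above can be driven in either mode purely by its arguments. A quick sketch; note that nn.MultiheadAttention defaults to batch_first=False, so tensors are shaped (seq_len, batch, embed_dim):

import torch

block = AttentionBlock(embed_dim=512)

x = torch.randn(7, 1, 512)               # decoder-side stream: (tgt_len, batch, embed_dim)
memory = torch.randn(12, 1, 512)         # encoder output:      (src_len, batch, embed_dim)

self_out = block(x)                      # self-attention, output shape (7, 1, 512)
cross_out = block(x, context=memory)     # cross-attention, output shape still (7, 1, 512)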
Summary
While the attention computation itself is identical, the flow of information differs significantly:
- Self-Attention is about Refinement. It looks inward to create a sharper, context-aware representation of the data it already has.
- Cross-Attention is about Conditioning. It looks outward to pull relevant information from a separate source to guide the current task.
Understanding this distinction is vital for debugging architectures (checking dimension mismatches) and designing new multimodal systems where different data streams must interact.