If you have worked with Transformers, whether BERT, GPT, or the original encoder-decoder Transformer, you are intimately familiar with the concept of "Attention." Its central equation is arguably the most famous in modern NLP:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
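As a quick refresher, here is a minimal PyTorch sketch of this operation. The function name and tensor shapes are illustrative, and masking, dropout, and multi-head splitting are omitted:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Computes softmax(QK^T / sqrt(d_k)) V for batched inputs.

    q: (..., seq_len_q, d_k)
    k: (..., seq_len_k, d_k)
    v: (..., seq_len_k, d_v)
    """
    d_k = q.size(-1)
    # Similarity scores between every query and every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # Normalize scores into attention weights over the key positions
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    return weights @ v
```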
However, memorizing the formula often masks an important architectural nuance: while the mathematical operation is identical in both cases, the source of the inputs $Q$, $K$, and $V$ determines whether you are performing Self-Attention or Cross-Attention.
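To make that distinction concrete, the sketch below (reusing the function above, with hypothetical projection layers and random tensors standing in for real inputs) shows the only thing that changes: where $Q$, $K$, and $V$ come from.

```python
import torch
import torch.nn as nn

d_model = 64
# Hypothetical projection layers; real Transformer blocks use per-head projections.
W_q, W_k, W_v = (nn.Linear(d_model, d_model) for _ in range(3))

x = torch.randn(2, 10, d_model)       # a single sequence, e.g. decoder-side states
enc_out = torch.randn(2, 7, d_model)  # another sequence, e.g. encoder output

# Self-Attention: Q, K, and V are all projections of the same sequence.
self_out = scaled_dot_product_attention(W_q(x), W_k(x), W_v(x))          # (2, 10, 64)

# Cross-Attention: Q comes from one sequence, while K and V come from another.
cross_out = scaled_dot_product_attention(W_q(x), W_k(enc_out), W_v(enc_out))  # (2, 10, 64)
```

Note that the output length always follows the query sequence, while the keys and values only determine what is being attended to.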
This article explores the mechanical and semantic differences between these two mechanisms, aimed at practitioners who already understand the basics of deep learning.