Machine Learning · Vaswani et al., 2017

Attention Is All You Need

The paper that introduced the Transformer—the architecture behind GPT, BERT, Claude, and every modern language model.

Why Language Is Hard for Computers

Consider this sentence:

"The cat sat on the mat because it was tired."

What does "it" refer to? You instantly know it's the cat, not the mat. Mats don't get tired.

But how did you know? You:

  1. Read "it was tired"
  2. Thought about what in the sentence could be tired
  3. Connected "it" back to "cat" based on meaning

This is attention—relating words to other words based on context. Humans do it effortlessly. For computers, it's the core challenge of understanding language.

How Computers See Words

Computers can't read. They work with numbers. So we convert each word into a list of numbers called an embedding—a point in space where similar words are near each other.

[Figure: a 2-D embedding space in which "cat" and "dog" sit close together near "animal", while "mat", "rug", and "floor" form a separate cluster of objects]

Words with similar meanings end up near each other in embedding space

These embeddings are learned from massive amounts of text. The system sees "cat" and "dog" used in similar contexts ("the ___ ran", "pet the ___") so they end up with similar numbers.
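To make this concrete, here's a minimal sketch with made-up 2-D vectors (real models learn hundreds of dimensions from data, not two numbers I picked by hand). Cosine similarity measures how close two words' directions are:

```python
import numpy as np

# Toy 2-D embeddings, hand-picked so the animals point one way and the
# floor coverings another. Real embeddings are learned, not chosen.
embeddings = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "mat": np.array([0.1, 0.9]),
    "rug": np.array([0.2, 0.8]),
}

def cosine_similarity(a, b):
    # 1.0 means "pointing the same way"; near 0 means unrelated directions.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # ~0.99, close in space
print(cosine_similarity(embeddings["cat"], embeddings["mat"]))  # ~0.22, far apart
```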

The Problem: One Word, One Embedding

Here's the catch: each word gets exactly one embedding, determined only by the word itself. The embedding for "bank" is the same list of numbers regardless of what sentence it appears in.

bank "river bank" "savings bank" Same point, different meanings!

"I deposited money at the bank."

"I sat by the river bank."

Both sentences use the same embedding for "bank"—the system can't tell them apart.

This is the core problem: words have fixed embeddings, but meaning depends on context. "Bank" should mean something different when surrounded by "money" and "deposited" vs. "river" and "sat."
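The same problem in code, as a minimal sketch (toy vectors, invented for illustration): a static lookup table hands back the identical vector for "bank" in both sentences, no matter what surrounds it.

```python
import numpy as np

# A static lookup table: one vector per word, chosen here for illustration.
embedding_table = {
    "bank":      np.array([0.3, 0.7, 0.1]),
    "money":     np.array([0.8, 0.2, 0.6]),
    "deposited": np.array([0.7, 0.1, 0.5]),
    "river":     np.array([0.1, 0.9, 0.4]),
    "sat":       np.array([0.2, 0.4, 0.3]),
}

def embed(words):
    # The lookup sees only the word itself, never its neighbours.
    return [embedding_table.get(w, np.zeros(3)) for w in words]

finance = embed(["deposited", "money", "at", "the", "bank"])
nature  = embed(["sat", "by", "the", "river", "bank"])

# Identical vectors for "bank" in both sentences: context never entered the lookup.
print(np.array_equal(finance[-1], nature[-1]))  # True
```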

We need a way for each word to look at the other words around it and adjust its representation based on what it sees. That's what attention does.

Historical note: This idea didn't appear out of nowhere. Attention mechanisms were first used in 2014 by Bahdanau et al. for machine translation—letting a decoder "look back" at different parts of the input sentence when generating each output word. The Transformer's innovation was making attention the only mechanism, removing recurrence entirely.

The Old Way: Reading Left to Right

Before Transformers, language models used Recurrent Neural Networks (RNNs). They read sentences one word at a time, like this:

Step 1: Read "The" → update memory

Step 2: Read "cat" → update memory

Step 3: Read "sat" → update memory

... and so on
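Here's a minimal sketch of that loop (a bare-bones recurrent cell with random stand-in weights, not any particular RNN from the literature): every update needs the previous memory, which is exactly what blocks parallel processing.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # tiny vector size for illustration
W_in, W_mem = rng.normal(size=(d, d)), rng.normal(size=(d, d))

sentence = ["The", "cat", "sat", "on", "the", "mat"]
embed = {w: rng.normal(size=d) for w in sentence}   # stand-in embeddings

memory = np.zeros(d)
for step, word in enumerate(sentence, start=1):
    # Each new memory depends on the previous one: step 5 must wait for step 4.
    memory = np.tanh(W_in @ embed[word] + W_mem @ memory)
    print(f"Step {step}: read {word!r}, memory updated")
```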

This has two big problems:

🐌 Slow

You can't process word 5 until you've finished words 1-4. No parallelization. Training takes forever.

🧠 Forgetful

By the time you reach word 100, you've forgotten word 1. Information degrades as it passes through each step.

The Transformer's insight: why read sequentially at all? Let every word look at every other word directly, all at once.

The Attention Idea

💡 The Key Insight

Instead of reading words one-by-one, let every word "look at" every other word simultaneously and decide: "How relevant is that word to understanding me?"

For the sentence "The cat sat on the mat because it was tired":

  • "it" looks at all words and pays most attention to "cat" (they're related in meaning)
  • "tired" also pays attention to "cat" (cats can be tired, mats can't)
  • "sat" might attend to both "cat" (who sat?) and "mat" (sat where?)

The result: each word gets a new, context-aware representation. "Bank" in a finance sentence becomes different from "bank" in a nature sentence, because they attend to different surrounding words.

How Do Words "Look At" Each Other?

Here's the mechanism. Each word's embedding gets transformed into three different vectors by multiplying it with three learned weight matrices (WQ, WK, WV):

embedding "cat" ×Wq ×Wk ×Wv Query Key Value "What am I looking for?" "What do I contain?" "What do I contribute?"

The same embedding is projected three different ways

Query — "What am I looking for?"

The word's question to other words. When we compute attention for "it", we use "it"'s Query to search: "Who in this sentence could I refer to?"

Key — "What do I contain?"

The word's advertisement of what it offers. "Cat"'s Key encodes: "I'm a noun, an animal, a subject that does things." Keys are matched against Queries to compute relevance.

Value — "What do I contribute if selected?"

The actual information the word passes along. If "cat" is deemed relevant to "it", then "cat"'s Value gets mixed into "it"'s new representation. Think of V as the actual content, while K is just the label/index used for matching.

Why separate K and V? Think of it like a library:

Query: You ask the librarian for "books about cooking"
Key: Each book has a catalog entry describing its topic
Value: The actual content inside the book
Result: The librarian matches your query against catalog entries (Keys), then gives you the actual books (Values)

The catalog entry (K) and book content (V) are related but not identical—a cookbook's catalog entry says "recipes, food, cooking" but its Value is the actual recipes.

In attention: each word's Query is matched against every word's Key. High match = "these words are related." The output is a weighted blend of all Values, where the weights come from how well Keys matched the Query.
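Here's a minimal NumPy sketch of that matching step (random matrices stand in for the learned WQ, WK, WV, and the toy sizes are mine; the paper uses dmodel = 512 and dk = 64):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, d_k = 6, 8, 4            # toy sizes for a 6-word sentence

X = rng.normal(size=(n_words, d_model))    # one embedding per word
W_Q = rng.normal(size=(d_model, d_k))      # learned in a real model
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # every word gets a query, key, and value
scores = Q @ K.T                           # each word's query against every word's key
print(scores.shape)                        # (6, 6): a relevance score for every word pair
```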

The Formula

Attention(Q, K, V) = softmax(QKᵀ / √dk) × V

Vaswani et al., 2017, Equation 1

Let's break this down step by step:

1. QKᵀ — Match queries to keys

Compute the dot product between every query and every key. A high dot product means the two words are related. This gives you a matrix of "compatibility scores."

2. / √dk — Scale down

Divide by the square root of the key dimension (dk = 64 in the paper). Without this, large dot products would make the softmax too "peaky": one word gets all the attention, the others get nearly zero, and the gradients become extremely small. Scaling counteracts this. [Section 3.2.1]

3. softmax — Convert to probabilities

Turn the scores into probabilities that sum to 1. Now each word has an "attention distribution" over all other words. "It" might put 60% on "cat", 20% on "sat", 10% on "mat", and so on.

4. × V — Weighted sum of values

Multiply the attention weights by the values and sum. Each word's new representation is a blend of all words' values, weighted by relevance. "It" becomes mostly "cat" information, with a bit of context from other words.
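Putting the four steps together, here's a minimal NumPy sketch of Equation 1 (toy shapes and random inputs for illustration; a real model would feed in the learned Q, K, V projections from the previous sketch):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # steps 1-2: match, then scale
    # step 3: softmax each row into an attention distribution that sums to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights              # step 4: blend the values by weight

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))   # 6 words, dk = 4 (paper: 64)
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights[0].round(2), weights[0].sum())   # one word's distribution, sums to 1.0
```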

Try It Yourself

See attention in action. Click any word to see what it attends to. Adjust dk to see why scaling matters—at dk=1, attention becomes very sharp; at dk=64, it's more diffuse.

Try different sentences:

"The cat sat on the mat because it was tired"

Note: This demo uses hand-crafted embeddings to illustrate the attention mechanism. A real transformer learns embeddings that capture linguistic relationships — for example, "their" would attend strongly to the noun it refers to. Here, some patterns work (cat/it), others don't (rainbows/their).


Try this: Notice how 'it' attends strongly to 'cat' — the model resolves the pronoun!

Explore: Set dk=1 for sharp attention, or increase it to see how scaling prevents domination.

Show the math

Attention(Q, K, V) = softmax(QKᵀ / √dk) × V

Q = queries, K = keys, V = values, dk = key dimension

In the original paper: dmodel = 512, dk = dv = 64, heads = 8

Note: This simulation uses simplified random embeddings for visualization. Real Transformers use learned embeddings with dmodel=512 and learned projection matrices WQ, WK, WV that are trained on massive text corpora.

Multi-Head Attention: Looking at Things Differently

One attention pattern isn't enough. Consider:

"The animal didn't cross the street because it was too wide."

What's "it"? The animal or the street? (It's the street—streets are wide, animals aren't.)

To answer this, you need to track multiple kinds of relationships:

  • Syntactic: "it" is a pronoun, what nouns could it refer to?
  • Semantic: "wide" describes physical size, what things have width?
  • Positional: "it" is near "street", maybe they're related?

The solution: run multiple attention "heads" in parallel, each learning to focus on different types of relationships.

MultiHead(Q, K, V) = Concat(head1, ..., headh)WO

The paper uses h=8 heads. Each head has its own WQ, WK, WV projections (dk=dv=64 each), and the outputs are concatenated back to 512 dimensions. [Section 3.2.2]

Research has shown different heads do learn different things—some track syntax, some track coreference, some focus on adjacent words.
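To make the shapes concrete, here is a minimal sketch with the paper's sizes (h = 8 heads, dmodel = 512, dk = dv = 64). The random matrices are stand-ins for learned weights, and real implementations fuse the per-head projections into single large matrices for speed:

```python
import numpy as np

def attention(Q, K, V):
    # softmax(QKᵀ / √dk) × V, as in the single-head sketch above
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, h=8, d_model=512):
    d_k = d_model // h                       # 512 / 8 = 64 per head
    rng = np.random.default_rng(0)           # random stand-ins for learned weights
    heads = []
    for _ in range(h):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # each head, its own view
    W_O = rng.normal(size=(d_model, d_model))
    return np.concatenate(heads, axis=-1) @ W_O   # concat 8 × 64 = 512, then project

X = np.random.default_rng(1).normal(size=(10, 512))   # a 10-word "sentence"
print(multi_head_attention(X).shape)                  # (10, 512): same shape in, same out
```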

Positional Encoding: Teaching Word Order

There's a problem: attention by itself has no sense of word order. It sees the sentence as an unordered set of words, so it can't tell "cat sat" from "sat cat." Word order matters in language!

The solution: add a "position signal" to each word's embedding before attention. The paper uses sine and cosine waves at different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))

PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

Why sines and cosines? They create unique patterns for each position, and the model can learn to compute relative positions (how far apart two words are) from these patterns. [Section 3.5]
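Here is a minimal sketch of that encoding (the function name and toy lengths are mine; the formula follows Section 3.5). Even dimensions get the sine, odd dimensions the cosine, and the result is simply added to the word embeddings before the first layer:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    pos = np.arange(max_len)[:, None]            # positions 0, 1, 2, ...
    i = np.arange(d_model // 2)[None, :]         # index of each (sin, cos) pair
    angles = pos / (10000 ** (2 * i / d_model))  # pos / 10000^(2i/dmodel)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50)
# In the model: inputs = word_embeddings + pe[:sentence_length]
print(pe.shape)       # (50, 512), one position signal per slot in the sentence
```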

Modern models often use different approaches: learned position embeddings, rotary embeddings (RoPE), or relative position encodings. The original sinusoidal encoding was a starting point.

Why This Changed Everything

Before: RNNs/LSTMs

  • Sequential processing (slow)
  • Can't parallelize training
  • Long-range dependencies hard
  • Information bottleneck

After: Transformers

  • All positions processed at once
  • Massively parallelizable (GPU-friendly)
  • Direct word-to-word connections
  • Scales to billions of parameters

🌍 Impact

This paper has been cited over 100,000 times. The Transformer is now the foundation of:

  • GPT, ChatGPT — decoder-only
  • BERT — encoder-only
  • T5, PaLM, Claude — various architectures
  • Vision Transformers (ViT) — images
  • Whisper — speech recognition
  • AlphaFold 2 — protein folding

Limitations

  • Quadratic complexity — Attention is O(n²) in sequence length. A 10,000 word document means 100 million attention scores. This is why models have context length limits. (Variants like Longformer and Flash Attention address this.)
  • No inherent sense of order — Position encodings are added information, not built into the mechanism. This is both a strength (flexibility) and weakness (position information can be fragile).
  • Needs lots of data — Transformers are data-hungry. They underperform on small datasets compared to models with stronger inductive biases.