Rumelhart, Hinton & Williams · 1986 · Nature
Backpropagation
How neural networks learn
Imagine you've built a machine with 10,000 knobs. You feed it an input, it produces an output. The output is wrong. You need to fix it.
Which knobs do you turn? And by how much?
This is the credit assignment problem — the central challenge of training neural networks. You have thousands (or millions) of adjustable parameters. When the network makes an error, you need to figure out which parameters are responsible and how to fix them.
Backpropagation is the elegant algorithm that solves this problem. It's the reason deep learning works. Every neural network trained today — from ChatGPT to image generators to AlphaFold — learns using backpropagation or some variant of it.
What is a neural network, really?
Strip away the hype, and a neural network is just a function with adjustable parameters. It takes an input (numbers), transforms it through layers of simple operations, and produces an output (more numbers).
The connections between its neurons carry the weights — the adjustable knobs that determine behavior.
The magic isn't in the architecture. It's in finding the right settings for all those knobs. Learning is the process of adjusting these weights so the network produces correct outputs.
But here's the problem: modern networks have millions or even billions of weights. How do you know which ones to adjust, and by how much?
The naive approach (and why it fails)
Let's say your network has 10,000 weights and it's making wrong predictions. What's the obvious thing to try?
Idea 1: Random search
Randomly tweak weights. If the error goes down, keep the change. If not, try again.
Problem: With 10,000 weights, the space of possible combinations is astronomical. Random search would take longer than the age of the universe to find good settings.
Idea 2: Test each weight
For each weight, nudge it slightly and see if the error improves. Adjust accordingly.
Problem: This actually works! But testing each weight requires running the network. 10,000 weights = 10,000 network runs. Per update. Training needs millions of updates. Too slow.
The core insight
Both approaches test weights one at a time. We need a way to figure out how all the weights should change simultaneously. That's what backpropagation gives us.
What is a gradient?
Before we can understand backpropagation, we need to understand gradients. Don't worry — the intuition is simple.
A gradient answers this question: "If I nudge this weight up a tiny bit, does the error go up or down? And by how much?"
Positive gradient: Increasing this weight increases the error. So we should decrease it.
Negative gradient: Increasing this weight decreases the error. So we should increase it.
Think of it like standing on a hilly landscape where altitude represents error. The gradient tells you which direction is downhill. If you always step downhill, you'll eventually reach a valley (minimum error).
A worked example: take a one-weight network whose output is just weight × input. With the weight at 0.50 and an input of 1, the output is 0.50 × 1 = 0.50. The target is 2, so the error (loss) is (0.50 − 2)² = 2.25. The gradient at this point is 2 × (0.50 − 2) × 1 = −3.00 — negative, meaning increasing the weight decreases the error, so the right move is to increase the weight.
Plotting error against the weight gives a curve, and the gradient is the slope at your current position — it tells you which direction to move to reduce the error. A gradient of −3.00 means: if you increase the weight by 1, the error will change by approximately −3.00.
The key realization
If we knew the gradient for every weight, we'd know exactly how to adjust all of them in a single step. The question becomes: how do we compute all these gradients efficiently?
What is a "forward pass"?
Before we go further, let's define a term you'll see everywhere: forward pass.
A forward pass is simply running an input through the network to get an output. Data flows forward — from input, through each layer, to the final prediction.
Think of a 3-layer network recognizing a handwritten digit:
- Layer 1: Takes pixel values, computes 128 intermediate numbers
- Layer 2: Takes those 128 numbers, computes 64 new numbers
- Layer 3: Takes those 64 numbers, outputs 10 probabilities (one per digit)
Each layer multiplies its input by weights, adds biases, and applies an activation function. The entire process — input to output — is one forward pass.
Crucially, a forward pass also computes intermediate values at each layer. These are the activation values at each neuron — the 128 numbers after layer 1, the 64 after layer 2. Remember these. They're going to be important.
Every time you want to know what output the network produces for a given input, you run a forward pass. Want to measure the error? Forward pass to get the prediction, then compare to the target.
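Here's a minimal sketch of that three-layer forward pass in NumPy. The shapes are illustrative (784 pixels, as in a 28×28 image), and sigmoid stands in for whatever activation the network uses — a real digit classifier would typically end in a softmax:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes for the digit example: 784 pixels -> 128 -> 64 -> 10.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(128, 784)) * 0.01, np.zeros(128)
W2, b2 = rng.normal(size=(64, 128)) * 0.01, np.zeros(64)
W3, b3 = rng.normal(size=(10, 64)) * 0.01, np.zeros(10)

def forward(x):
    h1 = sigmoid(W1 @ x + b1)    # 128 intermediate values
    h2 = sigmoid(W2 @ h1 + b2)   # 64 intermediate values
    out = sigmoid(W3 @ h2 + b3)  # 10 output values
    return h1, h2, out           # keep the intermediates — they matter later

x = rng.random(784)              # a stand-in "image"
h1, h2, out = forward(x)
```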
The computational tragedy
Here's the problem that stumped researchers for decades.
We know gradients tell us how to adjust weights. The obvious way to compute a gradient is numerical differentiation — the method you might remember from calculus:
To find how changing weight w affects the error:
- Step 1: Set w to (current value + tiny amount), run a forward pass, measure error
- Step 2: Set w to (current value - tiny amount), run a forward pass, measure error
- Step 3: Gradient ≈ (error₁ - error₂) / (2 × tiny amount)
This is literally measuring the slope by checking two nearby points.
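As a sketch of that recipe, assuming a hypothetical loss_fn that runs a forward pass with the given weights and returns the error:

```python
import numpy as np

def numerical_gradient(loss_fn, weights, eps=1e-5):
    """Estimate dError/dWeight for every weight by central differences.

    loss_fn is assumed to run a forward pass and return the error.
    Note the cost: 2 forward passes PER weight.
    """
    grads = np.zeros_like(weights)
    for i in range(weights.size):
        original = weights.flat[i]
        weights.flat[i] = original + eps
        error_plus = loss_fn(weights)      # forward pass #1
        weights.flat[i] = original - eps
        error_minus = loss_fn(weights)     # forward pass #2
        weights.flat[i] = original         # restore the weight
        grads.flat[i] = (error_plus - error_minus) / (2 * eps)
    return grads
```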
The problem? Each gradient requires 2 forward passes. And you have to do this for every single weight.
10,000 weights × 2 forward passes = 20,000 forward passes
Just to compute one update.
Training requires millions of updates...
This computational wall helped stall neural network research for over a decade. In 1969, Minsky and Papert published "Perceptrons," proving that single-layer networks couldn't represent functions as simple as XOR. Without an efficient way to train deeper networks, the limitation stood — and the field entered an "AI winter."
What researchers needed was a miracle: a way to compute all the gradients without testing each weight individually.
The backpropagation miracle
In 1986, Rumelhart, Hinton, and Williams published a paper in Nature that changed everything. Their key insight:
What if you could compute ALL gradients in ONE backward pass?
Not different gradients, not approximate gradients — the exact same gradients as numerical differentiation, just computed cleverly. Instead of O(n) forward passes for n weights, you need just one forward pass and one backward pass. Total.
- • Numerical gradients: 20,000 forward passes per update
- • Backpropagation: 2 passes total (1 forward + 1 backward)
This is a 10,000× speedup. It's what made training deep networks practical. But how does it work?
How it works: reusing what we already computed
Remember those intermediate values from the forward pass? The 128 numbers after layer 1, the 64 after layer 2? Here's the key insight:
To compute a gradient, you need to know how a weight affects the output. But a weight doesn't directly touch the output — it affects the next layer's input, which affects the layer after that, which eventually affects the output.
With numerical differentiation, we recompute the entire chain from scratch for each weight. But here's what backprop realizes: we already computed most of this during the forward pass.
The forward pass computed every intermediate value in the network. If we save those values, we can use them to compute gradients without re-running the network. We just need to work backward through the saved values.
The chain rule (yes, from calculus)
The mathematical tool that makes this work is the chain rule — the same chain rule from Calculus 101. Rumelhart, Hinton, and Williams didn't invent it. Their insight was realizing how perfectly it applies to neural networks, and how to organize the computation efficiently.
The chain rule says: if y depends on x, and z depends on y, then the rate at which z changes with x is:
dz/dx = (dz/dy) × (dy/dx)
"How z changes with x" = "How z changes with y" × "How y changes with x"
An analogy: blame propagation
Imagine a company where the CEO makes a bad decision, but the blame needs to be distributed to everyone who contributed. The chain of responsibility goes:
Employee → wrote report → Manager → made recommendation → VP → advised → CEO → bad decision
How much blame does the employee get? You multiply along the chain:
- • CEO's decision was 100% responsible for the bad outcome
- • VP's advice was 60% responsible for CEO's decision
- • Manager's recommendation was 50% responsible for VP's advice
- • Employee's report was 30% responsible for manager's recommendation
Employee's total blame: 100% × 60% × 50% × 30% = 9%
This is exactly what the chain rule does. Each weight's responsibility for the final error is the product of responsibilities along the path from that weight to the output.
Why backward? Why "backpropagation"?
Here's the clever bit. Notice in the blame analogy that we started from the outcome and worked backward. That's not arbitrary — it's efficient.
At the output, we directly know how wrong we were (the error). From there, we can compute how much each neuron in the last hidden layer contributed. Then, using those contributions, we compute how much the layer before that contributed. And so on.
The key insight
Each layer's gradient depends only on the next layer's gradient (plus the intermediate values we saved from the forward pass). So we can compute all gradients in one backward sweep, reusing results as we go. No need to recompute anything from scratch.
That's why it's called backpropagation — the error signal propagates backward through the network, getting distributed to each weight according to its responsibility.
How to explore this
- 1. Start in Forward Pass view — watch values flow from inputs (left) to output (right)
- 2. Switch to Backward Pass — see gradients (how much each weight should change) flow backward
- 3. Click any weight (the connecting lines) to see the chain rule calculation
- 4. Click Apply Gradients to update all weights — watch the loss decrease!
Interactive demo: a small neural network drawn as circles (neurons) connected by lines (weights). Click any weight to see how its gradient is computed. In the Forward Pass view, input values multiply through weights, get summed at each node, and pass through a sigmoid activation; the numbers on the lines show weight values, the numbers in the circles show activations, and the loss appears at the output.
Why this is so efficient
With numerical differentiation, we did 20,000 forward passes (2 per weight). With backpropagation, we do 1 forward pass (save intermediate values) + 1 backward pass (compute all gradients using saved values). Same gradients, 10,000× faster.
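Here's a minimal sketch of that 1-forward + 1-backward recipe in NumPy, for a two-layer sigmoid network with the squared-error loss used in the math section below. All names and shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_and_backward(x, t, W1, b1, W2, b2):
    # Forward pass: compute AND SAVE the intermediate activations.
    h = sigmoid(W1 @ x + b1)            # hidden activations (saved)
    y = sigmoid(W2 @ h + b2)            # output (saved)
    loss = 0.5 * np.sum((y - t) ** 2)

    # Backward pass: one sweep, reusing h and y instead of re-running the net.
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), so it comes free from y and h.
    delta_out = (y - t) * y * (1 - y)             # error signal at the output
    dW2 = np.outer(delta_out, h)                  # gradient for W2, reusing saved h
    db2 = delta_out
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # propagate the signal backward
    dW1 = np.outer(delta_hid, x)                  # gradient for W1, reusing input x
    db1 = delta_hid
    return loss, dW1, db1, dW2, db2

# Illustrative usage: 2 inputs, 3 hidden units, 1 output, random weights.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
loss, *grads = forward_and_backward(np.array([0.0, 1.0]), np.array([1.0]),
                                    W1, b1, W2, b2)
```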
Watching learning happen
Now let's put it all together. Training a neural network is a loop:
- 1 Forward pass: Run the input through the network, get a prediction
- 2 Compute loss: How wrong was the prediction?
- 3 Backward pass: Compute gradients for all weights
- 4 Update weights: Nudge each weight in the direction that reduces error
- 5 Repeat: Do this thousands or millions of times
With each iteration, the error gets a little smaller. The network gets a little better. Eventually, it learns to make accurate predictions.
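Here's that loop end to end as a minimal NumPy sketch, trained on the XOR task introduced just below. Layer sizes, learning rate, and epoch count are all illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: output 1 if exactly one input is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)    # 2 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # 4 hidden -> 1 output
eta = 0.5                                        # learning rate

for epoch in range(5000):
    for x, t in zip(X, T):
        h = sigmoid(W1 @ x + b1)                 # 1. forward pass
        y = sigmoid(W2 @ h + b2)
        # 2. loss = 0.5 * (y - t)**2, differentiated in the next two lines
        delta_out = (y - t) * y * (1 - y)        # 3. backward pass
        delta_hid = (W2.T @ delta_out) * h * (1 - h)
        W2 -= eta * np.outer(delta_out, h)       # 4. update weights
        b2 -= eta * delta_out
        W1 -= eta * np.outer(delta_hid, x)
        b1 -= eta * delta_hid
                                                 # 5. the outer loop repeats

preds = [sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)[0] for x in X]
print([round(p, 2) for p in preds])  # drifts toward [0, 1, 1, 0] on most seeds
```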
Learning XOR
The task: Learn XOR — output 1 if exactly one input is 1, otherwise 0.
Interactive demo: train the network live, watching the loss over time, the network's predictions versus the true XOR targets, and the running accuracy.
Understanding this simulation
Reading the visualization
- • Each point in the square is an input pair (x, y). Bottom-left = (0,0), top-right = (1,1).
- • The color shows the network's output for that input: orange ≈ 1, blue ≈ 0.
- • The circles are training examples. Their fill shows the correct answer.
- • Success = each circle's background matches its fill color.
What to watch during training
- • Loss curve: Should decrease. Lower = outputs closer to targets.
- • Colored square: Watch colors shift as the network learns different outputs for different regions.
- • Predictions table: Values drift toward 0 or 1. Rounded outputs matching targets get ✓.
Why decimals like 0.472?
The network passes values through a sigmoid function that squishes any number into 0–1. Outputs are naturally continuous (0.23, 0.87). We round to check correctness: > 0.5 means "predicted 1".
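For instance (the raw score 0.112 here is made up):

```python
import numpy as np

raw = 0.112                       # pre-activation score from the last layer
output = 1 / (1 + np.exp(-raw))   # sigmoid squashes it into 0-1: ~0.528
predicted = int(output > 0.5)     # round for the checkmark: "predicted 1"
```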
Why random colors at epoch 0?
Weights start random, so the network computes a random function. Those initial outputs aren't guesses — they're just what random weights produce. Training adjusts weights until outputs match targets.
Why XOR is famous
What success looks like: Mostly orange, with blue at diagonal corners (0,0) and (1,1). XOR outputs 0 when inputs match, 1 when they differ.
Not linearly separable: Try drawing one straight line that puts (0,0) and (1,1) on one side, and (0,1) and (1,0) on the other. You can't — the blue corners are diagonal. Any line separating them cuts through an orange corner. Single-layer networks can only draw straight lines, so they fail at XOR. The hidden layer transforms inputs into a new space where the problem becomes separable. This is why multi-layer networks matter.
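One way to see what the hidden layer buys you: with hand-picked (not learned) weights, two hidden units computing OR and AND turn XOR into a linearly separable problem, since XOR is "OR and not AND." A sketch with hard thresholds standing in for steep sigmoids:

```python
def step(z):                       # hard threshold, stand-in for a steep sigmoid
    return 1 if z > 0 else 0

def xor(x, y):
    h1 = step(x + y - 0.5)         # OR:  fires if at least one input is 1
    h2 = step(x + y - 1.5)         # AND: fires only if both inputs are 1
    return step(h1 - h2 - 0.5)     # in (h1, h2) space, one straight line suffices

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, y, xor(x, y))         # prints 0, 1, 1, 0
```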
The math (for the curious)
Forward Pass
h = σ(W1x + b1)
y = σ(W2h + b2)
Where σ is an activation function (like sigmoid or ReLU), W are weight matrices, and b are biases.
Loss Function
L = ½(y - t)²
Squared error between prediction y and target t; the ½ is a convention that cancels the 2 produced by differentiation.
Chain Rule (Backward Pass)
∂L/∂W2 = ∂L/∂y × ∂y/∂W2
∂L/∂W1 = ∂L/∂y × ∂y/∂h × ∂h/∂W1
Each layer's gradient is computed by multiplying local gradients.
Weight Update
W ← W - η × ∂L/∂W
Where η (eta) is the learning rate — how big a step to take.
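Spelling out those chain-rule factors for the sigmoid network above, using the standard identity σ′(z) = σ(z)(1 − σ(z)):
∂L/∂y = y − t
∂y/∂z₂ = y(1 − y), where z₂ = W2h + b2 is the output layer's pre-activation
∂L/∂W2 = (y − t) × y(1 − y) × hᵀ
Each factor is a cheap local derivative, and h is exactly the intermediate value saved during the forward pass — which is why no recomputation is needed.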
Why this matters
The 1986 backpropagation paper didn't invent the algorithm (variants existed earlier), but it demonstrated conclusively that deep networks could learn useful representations. It ended the AI winter.
Every breakthrough in deep learning since then — convolutional networks for images, recurrent networks for sequences, transformers for language, diffusion models for generation — all rely on backpropagation to train.
GPT & ChatGPT
Trained with backprop on text data
Image Recognition
Backprop through convolutional layers
AlphaFold
Protein structure from backprop-trained networks
Stable Diffusion
Image generation via backprop
Understanding backpropagation is understanding how modern AI learns. It's the heartbeat of the deep learning revolution.
Limitations and extensions
Backpropagation isn't perfect. Researchers have discovered several challenges:
- • Vanishing gradients: In very deep networks, gradients can shrink to nearly zero, making early layers hard to train. Solutions: ReLU activations, skip connections, careful initialization.
- • Local minima: Gradient descent might find a "good enough" solution rather than the best one. In practice, this is less problematic than originally feared.
- • Requires differentiability: Backprop needs smooth, differentiable operations. Some architectures need clever workarounds.
Modern optimizers like Adam, techniques like batch normalization, and architectures like ResNets all build on backpropagation while addressing its limitations.
Original Paper
Learning representations by back-propagating errors
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams
Nature, Vol. 323, pp. 533-536, 1986