Machine Learning · Ho, Jain, Abbeel, 2020

Denoising Diffusion Probabilistic Models

What if we could generate realistic images by learning to reverse the process of destroying them with noise?

The Challenge of Creating from Nothing

Look at a photo of a face. Every pixel is exactly where it needs to be—skin tones blending smoothly, eyes reflecting light, hair strands catching shadow. Now imagine trying to place those millions of pixels correctly, starting from nothing. Where do you even begin?

"Generate a realistic image of anything"

Starting from pure randomness, how do we get structured, meaningful images?

A common misconception: You might think generating images requires understanding what's in them—knowing what a "cat" or "face" looks like. Diffusion models don't work that way. They just learn to remove noise. That's it. And somehow, that's enough.

Before 2020, the best image generators were GANs (Generative Adversarial Networks). But GANs were notoriously difficult to train and often produced distorted or artifact-ridden results.

The Old Way: GANs and Their Problems

GANs work by having two neural networks compete: a generator creates fake images, and a discriminator tries to spot the fakes. It's like a forger competing against a detective.

🎭 Training Instability

The two networks often fail to find balance, leading to training collapse

🎯 Mode Collapse

The generator sometimes learns to produce only a few types of images

🎲 Hard to Control

Difficult to generate specific types of images or interpolate smoothly

🏃 Single Step

Must generate the entire image in one shot, with no way to refine gradually

💡 The Key Insight

Instead of generating images directly, learn to reverse the process of gradually adding noise—turn chaos back into structure.

How It Works: The Sculptor's Method

Think of Michelangelo's approach to sculpture: "I saw the angel in the marble and carved until I set him free." Diffusion models work similarly—they start with pure noise (marble) and gradually carve away randomness to reveal structure (the angel).

🔄 Forward Process (Destroy)

Start with a real image and gradually add Gaussian noise (random static that follows a bell curve—most changes are small, few are large) over many timesteps:

Image → Slightly noisy → More noisy → ... → Pure noise
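Here is a minimal NumPy sketch of this forward process. The linear βₜ schedule (1e-4 to 0.02 over 1,000 steps) matches the one used in the paper; the image here is just a placeholder array.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Apply q(x_t | x_{t-1}) repeatedly, returning every intermediate image."""
    xs = [x0]
    for beta_t in betas:
        noise = rng.standard_normal(x0.shape)
        # One forward step: N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
        xs.append(np.sqrt(1.0 - beta_t) * xs[-1] + np.sqrt(beta_t) * noise)
    return xs

# Linear schedule from the paper: beta_t grows from 1e-4 to 0.02 over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
x0 = np.zeros((64, 64, 3))       # placeholder for a real image scaled to [-1, 1]
trajectory = forward_diffusion(x0, betas, np.random.default_rng(0))
# trajectory[-1] is statistically indistinguishable from pure Gaussian noise
```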

⏪ Reverse Process (Create)

Train a neural network to reverse this process step by step:

Pure noise → Less noisy → ... → Clean image

The Magic: Once trained, you can start from pure random noise and let the model gradually "carve away" the randomness, revealing a completely new image that looks real.
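The reverse loop can be sketched the same way, assuming a trained noise predictor is available. Here predict_noise is a hypothetical stand-in for the network εθ introduced below; the update rule is the ancestral sampling step from the paper.

```python
import numpy as np

def reverse_sample(predict_noise, betas, shape, rng):
    """Ancestral sampling: start from pure noise, denoise one step at a time.

    predict_noise(x_t, t) stands in for the trained network eps_theta;
    with a real model, this loop generates novel images.
    """
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)        # x_T: pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)         # model's guess of the added noise
        # Posterior mean in the epsilon parameterization (Ho et al., Eq. 11)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Add fresh noise at every step except the last
        # (using sigma_t^2 = beta_t, one of the variance choices in the paper)
        x = mean + (np.sqrt(betas[t]) * rng.standard_normal(shape) if t > 0 else 0.0)
    return x
```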

The Mathematics

The forward process adds noise according to a schedule, while the reverse process learns to predict and remove that noise.

q(xₜ | xₜ₋₁) = 𝒩(xₜ; √(1−βₜ) xₜ₋₁, βₜI)

Ho et al., 2020, Equation 2
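One consequence worth writing out: because each step is Gaussian, the steps compose, so you can jump from x₀ to any xₜ in closed form (Ho et al., Equation 4). This is what makes training efficient, since there is no need to simulate the whole chain:

```latex
% With \alpha_t := 1 - \beta_t and \bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s,
% the Gaussian steps compose into a single Gaussian (Ho et al., Eq. 4):
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\big)
% Equivalently, sample x_t in one shot from x_0 and fresh noise:
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, I)
```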

1. Forward Diffusion

At each timestep t, take the previous image xₜ₋₁ and add Gaussian noise. The noise amount is controlled by βₜ (the noise schedule).

2. Noise Prediction

Train a neural network εθ to predict what noise was added at each step. Given a noisy image and the timestep, predict the noise.

3. Reverse Sampling

Start with pure noise and iteratively subtract the predicted noise to gradually reveal a clean image.

4. Loss Function

The model is trained to minimize the difference between the actual noise and the predicted noise: ‖ε − εθ(xₜ, t)‖² (sketched in code below).
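Putting steps 1, 2, and 4 together, a single training step looks roughly like the following sketch. model is a stand-in for any network εθ that takes a noisy image and a timestep; NumPy has no autograd, so this only evaluates the loss that a real framework would backpropagate through.

```python
import numpy as np

def ddpm_training_loss(model, x0, betas, rng):
    """Simplified DDPM objective (Ho et al., Eq. 14): predict the added noise.

    model(x_t, t) stands in for the network eps_theta; a real implementation
    would compute this loss in an autograd framework and take a gradient step.
    """
    alpha_bars = np.cumprod(1.0 - betas)
    t = int(rng.integers(len(betas)))       # random timestep for this sample
    eps = rng.standard_normal(x0.shape)     # the noise the model must predict
    # Jump straight to x_t via the closed-form marginal q(x_t | x_0)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_pred = model(x_t, t)
    return np.mean((eps - eps_pred) ** 2)   # ||eps - eps_theta(x_t, t)||^2
```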

Watch Diffusion in Action

The demo below runs diffusion on the Mona Lisa. In the forward process, Gaussian noise gradually destroys the image until only static remains. In the reverse process, noise is removed step-by-step—watch her enigmatic smile emerge from pure randomness.

[Interactive demo: choose an image, set the process direction (Forward or Reverse), the number of steps (more steps = smoother, more gradual transition), and the per-step noise level, which is the βₜ from the formula above (higher = more aggressive noise per step).]

What You're Seeing

Forward diffusion gradually adds Gaussian noise to each pixel. By step 30, the original image is completely destroyed—just random static.

Try both images—the same process works on any content.

Try This:

1. Watch destruction: In Forward mode, animate to see how recognizable features (eyes, smile, sharp edges) disappear into noise at different rates.
2. Watch creation: Switch to Reverse and animate. This is how diffusion models generate images—starting from random noise!
3. Compare schedules: Set β high (1.5) with few steps (10) vs β low (0.3) with many steps (50). Both destroy the image—one fast and violent, one slow and graceful. The reverse process has to undo whatever you chose.
4. Compare images: Switch between Mona Lisa (organic) and Mondrian (geometric) to see that the same process works on any content—the reverse process is content-agnostic.

Note: Real diffusion models use learned neural networks to predict and remove noise. This demo shows the concept with pre-computed noise—the key insight is that structure can be recovered from chaos through gradual denoising.

The Diffusion Revolution

This paper launched a revolution. Diffusion models now power some of the most impressive AI systems:

System            Company          Capability
DALL-E 2          OpenAI           Text-to-image generation
Midjourney        Midjourney Inc   Artistic image creation
Stable Diffusion  Stability AI     Open-source image generation
Imagen            Google           Photorealistic text-to-image

Why Diffusion Models Won

  • Stable training: No adversarial dynamics to balance
  • High quality: Gradual refinement produces sharp, detailed images
  • Controllable: Easy to condition on text, classes, or other inputs
  • Scalable: Performance improves predictably with model size and data

Limitations

  • Slow generation — Requires many denoising steps (typically 50-1000), though newer methods like DDIM reduce this
  • Computational cost — Training requires enormous compute resources and datasets
  • Fixed noise schedule — The original paper uses predetermined β values, later work makes this learnable
  • Sample diversity — Can sometimes produce similar outputs, though still better than mode collapse in GANs
  • Evaluation challenges — Measuring image quality and diversity remains difficult across different methods