Denoising Diffusion Probabilistic Models
What if we could generate perfect images by learning to reverse the process of destroying them with noise?
The Challenge of Creating from Nothing
Look at a photo of a face. Every pixel is exactly where it needs to be—skin tones blending smoothly, eyes reflecting light, hair strands catching shadow. Now imagine trying to place those millions of pixels correctly, starting from nothing. Where do you even begin?
"Generate a realistic image of anything"
Starting from pure randomness, how do we get structured, meaningful images?
A common misconception: You might think generating images requires understanding what's in them—knowing what a "cat" or "face" looks like. Diffusion models don't work that way. They just learn to remove noise. That's it. And somehow, that's enough.
Before 2020, the best image generators were GANs (Generative Adversarial Networks). But GANs were notoriously difficult to train and often produced blurry or distorted results.
The Old Way: GANs and Their Problems
GANs work by having two neural networks compete: a generator creates fake images, and a discriminator tries to spot the fakes. It's like a forger competing against a detective.
🎭 Training Instability
The two networks often fail to find balance, leading to training collapse
🎯 Mode Collapse
The generator sometimes learns to produce only a few types of images
🎲 Hard to Control
Difficult to generate specific types of images or interpolate smoothly
🏃 Single Step
Must generate the entire image in one shot, with no way to refine gradually
💡 The Key Insight
Instead of generating images directly, learn to reverse the process of gradually adding noise—turn chaos back into structure.
How It Works: The Sculptor's Method
Think of Michelangelo's approach to sculpture: "I saw the angel in the marble and carved until I set him free." Diffusion models work similarly—they start with pure noise (marble) and gradually carve away randomness to reveal structure (the angel).
🔄 Forward Process (Destroy)
Start with a real image and gradually add Gaussian noise (random static that follows a bell curve—most changes are small, few are large) over many timesteps:
Image → Slightly noisy → More noisy → ... → Pure noise
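A minimal sketch of this chain in code makes the idea concrete. Assume a toy 32×32 image and the linear βₜ schedule from the original paper; everything here is illustrative, not a faithful reimplementation:

```python
import torch

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (beta_t)

x = torch.rand(32, 32)                     # stand-in "image" with values in [0, 1]
for t in range(T):
    noise = torch.randn_like(x)            # fresh Gaussian noise each step
    x = torch.sqrt(1 - betas[t]) * x + torch.sqrt(betas[t]) * noise
# After T steps, x is statistically indistinguishable from pure noise.
```

The √(1−βₜ) factor shrinks the image slightly at every step, so the overall variance stays fixed instead of blowing up as noise accumulates.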
⏪ Reverse Process (Create)
Train a neural network to reverse this process step by step:
Pure noise → Less noisy → ... → Clean image
The Magic: Once trained, you can start from pure random noise and let the model gradually "carve away" the randomness, revealing a completely new image that looks real.
The Mathematics
The forward process adds noise according to a schedule, while the reverse process learns to predict and remove that noise.
q(xₜ | xₜ₋₁) = 𝒩(xₜ; √(1−βₜ) xₜ₋₁, βₜI)
Forward Diffusion
At each timestep t, take the previous image xₜ₋₁ and add Gaussian noise. The noise amount is controlled by βₜ (the noise schedule).
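A useful identity from the paper: because sums of Gaussians are Gaussian, you can jump from x₀ straight to any xₜ in one shot, using ᾱₜ = ∏ₛ(1−βₛ). A sketch under the same toy assumptions as above:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1 - betas, dim=0)   # alpha_bar_t = product of (1 - beta_s)

def noisy_at(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly, instead of looping t times."""
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(alphas_bar[t]) * x0 + torch.sqrt(1 - alphas_bar[t]) * eps
    return x_t, eps                            # keep eps: it becomes the training target
```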
Noise Prediction
Train a neural network εθ to predict what noise was added at each step. Given a noisy image and the timestep, predict the noise.
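In the paper, εθ is a U-Net; but any network that maps (noisy image, timestep) to a noise estimate of the same shape fits the interface. A deliberately tiny stand-in (the MLP, its sizes, and the crude timestep encoding are all illustrative assumptions):

```python
import torch
import torch.nn as nn

class TinyEpsTheta(nn.Module):
    """Toy eps_theta: maps a (B, 32, 32) noisy image plus timestep to predicted noise."""
    def __init__(self, dim=32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t):
        # Append the scaled timestep to the flattened image; real models use
        # sinusoidal timestep embeddings instead of this crude scalar.
        h = torch.cat([x_t.flatten(1), t.float().view(-1, 1) / 1000.0], dim=1)
        return self.net(h).view_as(x_t)
```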
Reverse Sampling
Start with pure noise and iteratively subtract the predicted noise to gradually reveal a clean image.
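Concretely, the paper's ancestral sampler computes the mean of the previous step from the predicted noise, then adds a small amount of fresh noise at every step except the last. A sketch, assuming a trained `model` with the εθ interface above:

```python
import torch

@torch.no_grad()
def sample(model, shape=(1, 32, 32), T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                              # start from pure noise
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((shape[0],), t))  # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1 - alphas_bar[t]) * eps_hat) \
               / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else 0.0       # no noise on the final step
        x = mean + torch.sqrt(betas[t]) * z             # sigma_t = sqrt(beta_t) variant
    return x
```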
Loss Function
The model is trained to minimize the difference between the actual noise and the predicted noise: ‖ε − εθ(xₜ, t)‖²
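Putting the pieces together, one training step is remarkably simple: pick a random timestep, noise the image with the closed form, and regress the network's prediction onto the true noise. A sketch under the same assumptions (`model` follows the εθ interface above, `x0` is a batch of clean images):

```python
import torch
import torch.nn.functional as F

def training_loss(model, x0, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1 - betas, dim=0)

    t = torch.randint(0, T, (x0.shape[0],))              # random timestep per image
    eps = torch.randn_like(x0)                           # the noise we will hide
    a = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast to image dims
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * eps   # one-shot forward sample
    return F.mse_loss(model(x_t, t), eps)                # || eps - eps_theta(x_t, t) ||^2
```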
Watch Diffusion in Action
Watch diffusion in action on the Mona Lisa. In the forward process, Gaussian noise gradually destroys the image until only static remains. In the reverse process, the model learns to remove noise step-by-step—watch her enigmatic smile emerge from pure randomness.
Two controls shape the demo: the step count (more steps give a smoother, more gradual transition) and the per-step noise level, which is the βₜ from the formula above (higher values mean more aggressive noise per step).
What You're Seeing
Forward diffusion gradually adds Gaussian noise to each pixel. By step 30, the original image is completely destroyed—just random static.
Try both images—the same process works on any content.
Note: Real diffusion models use learned neural networks to predict and remove noise. This demo shows the concept with pre-computed noise—the key insight is that structure can be recovered from chaos through gradual denoising.
The Diffusion Revolution
This paper launched a revolution. Diffusion models now power some of the most impressive AI systems:
| System | Company | Capability |
|---|---|---|
| DALL-E 2 | OpenAI | Text-to-image generation |
| Midjourney | Midjourney Inc | Artistic image creation |
| Stable Diffusion | Stability AI | Open-source image generation |
| Imagen | Google | Photorealistic text-to-image |
Why Diffusion Models Won
- Stable training: No adversarial dynamics to balance
- High quality: Gradual refinement produces sharp, detailed images
- Controllable: Easy to condition on text, classes, or other inputs
- Scalable: Performance improves predictably with model size and data
Limitations
- Slow generation — Requires many denoising steps (typically 50-1000), though newer methods like DDIM reduce this (see the sketch after this list)
- Computational cost — Training requires enormous compute resources and datasets
- Fixed noise schedule — The original paper uses predetermined β values; later work makes this learnable
- Sample diversity — Can sometimes produce similar outputs, though still better than mode collapse in GANs
- Evaluation challenges — Measuring image quality and diversity remains difficult across different methods
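On the slow-generation point, here is a sketch of DDIM's deterministic update (η = 0), which skips most timesteps by stepping along a coarse subsequence; `model` and the 50-step choice are assumptions carried over from the earlier sketches:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape=(1, 32, 32), T=1000, steps=50):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_bar = torch.cumprod(1 - betas, dim=0)
    ts = torch.linspace(T - 1, 0, steps).long()          # coarse timestep subsequence

    x = torch.randn(shape)
    for i, t in enumerate(ts):
        eps_hat = model(x, torch.full((shape[0],), int(t)))
        # Estimate the clean image, then step to the previous (coarser) timestep.
        x0_hat = (x - torch.sqrt(1 - alphas_bar[t]) * eps_hat) / torch.sqrt(alphas_bar[t])
        a_prev = alphas_bar[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        x = torch.sqrt(a_prev) * x0_hat + torch.sqrt(1 - a_prev) * eps_hat
    return x
```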