I’ve been riffing on the connections between diffusion models (like DALL-E and Midjourney) and autoregressive models (like GPT, Claude, etc.) as a way to deepen my understanding of both paradigms. I used both in my thesis research.

I originally posted some ideas on this here in October 2024.

This page contains my ongoing notes and favourite resources, which should become even more relevant as diffusion models and autoregressors continue to cross-pollinate.


[Diffusion] is a soft version of autoregression in frequency space, or if you want to make it sound fancier, approximate spectral autoregression.

Sander Dieleman, Diffusion is spectral autoregression

wh on X: “A visualization of how I think of diffusion in frequency space. Diffusion often generates the low frequencies in the earlier steps, before generating the higher frequencies in the later steps.”
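
A quick numerical sanity check of the spectral framing: natural images have roughly 1/f amplitude spectra while diffusion noise is white, so as the noise level rises it drowns the high frequencies long before the low ones. A minimal NumPy sketch (the 1/f toy image and the band cutoffs are my own illustrative choices, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "natural image": white noise shaped to a 1/f amplitude spectrum,
# since natural images concentrate their power at low spatial frequencies.
n = 256
fx = np.fft.fftfreq(n)[:, None]
fy = np.fft.fftfreq(n)[None, :]
f = np.sqrt(fx**2 + fy**2)
f[0, 0] = f[0, 1]  # avoid dividing by zero at the DC bin
spec = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / f
image = np.fft.ifft2(spec).real
image /= image.std()

# Forward diffusion x_t = x_0 + sigma * eps adds *white* noise, whose power
# is flat across frequencies, so it swamps the weak high frequencies first.
sig_power = np.abs(np.fft.fft2(image)) ** 2
for sigma in [0.1, 1.0, 10.0]:
    noise_power = sigma**2 * n * n  # expected white-noise power per FFT bin
    low = sig_power[f < 0.05].mean() / noise_power
    high = sig_power[f > 0.4].mean() / noise_power
    print(f"sigma={sigma:5.1f}  low-freq SNR={low:12.2f}  high-freq SNR={high:8.4f}")
```

Reading the printout, the high-frequency SNR collapses at much smaller noise levels than the low-frequency SNR, which is why reverse diffusion effectively commits to coarse structure first and fine detail last.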


This is interesting as a first large diffusion-based LLM.

Most of the LLMs you’ve been seeing are ~clones as far as the core modeling approach goes. They’re all trained “autoregressively”, i.e. predicting tokens from left to right. Diffusion is different - it doesn’t go left to right, but all at once. You start with noise and gradually denoise into a token stream.

Most of the image / video generation AI tools actually work this way and use Diffusion, not Autoregression. It’s only text (and sometimes audio!) that have resisted. So it’s been a bit of a mystery to me and many others why, for some reason, text prefers Autoregression, but images/videos prefer Diffusion. This turns out to be a fairly deep rabbit hole that has to do with the distribution of information and noise and our own perception of them, in these domains. If you look close enough, a lot of interesting connections emerge between the two as well.

All that to say that this model has the potential to be different, and possibly showcase new, unique psychology, or new strengths and weaknesses. I encourage people to try it out!

Andrej Karpathy on X discussing Inception Labs’ diffusion LLMs
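
For concreteness, one common way to instantiate “denoise into a token stream” is masked (discrete) diffusion: start from a fully masked sequence and unmask a few positions in parallel at each step. This is a hedged sketch of that general recipe, not Inception Labs’ actual method; `model` and `mask_id` are hypothetical stand-ins.

```python
import torch

# Masked (discrete) diffusion sampling: begin from pure "noise" (all masks)
# and reveal the most confident positions a few at a time, in parallel.
@torch.no_grad()
def masked_diffusion_sample(model, seq_len, mask_id, steps=8):
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        masked = x == mask_id
        if not masked.any():
            break
        logits = model(x)                        # (1, seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)  # per-position confidence
        # Reveal a proportional share of the remaining masks this step.
        n_reveal = max(1, int(masked.sum().item()) // (steps - step))
        conf = conf.masked_fill(~masked, -1.0)   # never re-pick known slots
        idx = conf.topk(n_reveal, dim=-1).indices
        x[0, idx[0]] = pred[0, idx[0]]
    return x
```

Note there is no left-to-right order here: every step refines the entire sequence at once.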


Why does diffusion work better than auto-regression (highly recommended, particularly from 8:33 to 13:00)

  • Random noise is a (near) optimal way of spreading out a selection of pixels to minimise local dependencies.
  • Denoising diffusion models (DDMs) are just like an autoregressor over the pixels, except instead of predicting one entire pixel at each step, they predict a small amount of noise over all pixels at each step (see the training-step sketch after this list).
  • In autoregression, we mask one whole value/token/pixel at each step.
  • In DDMs, we mask a bit of all values/tokens/pixels at each step.
  • Both autoregression and diffusion are about sequentially subtracting information and having a model learn to sequentially restore the information.
  • “DDMs are autoregressors, but in the steps of the diffusion process”
  • “Diffusion models are autoregressive, but across noise levels rather than time steps.”
  • Side-note on classifier-free guidance:
    • Train the model to generate images both with and without the conditioning text prompt, by alternating between the two during training.
    • To sample: predict with the prompt and without it, take the difference, and you get only the parts of the generation that depend most on the prompt; guidance amplifies exactly those parts (see the guidance sketch after this list).
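
To make the “predict a small amount of noise over all pixels” point concrete, here is a minimal DDPM-style training step in the epsilon-prediction parameterisation; `model` and the `alpha_bar` noise schedule are assumed stand-ins, not something specified in the video:

```python
import torch
import torch.nn.functional as F

# One training step in the "mask a bit of every pixel" style: corrupt all
# pixels slightly at a random noise level t, then train the network to
# predict exactly the noise that was added.
def ddpm_loss(model, x0, alpha_bar):
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)          # cumulative signal fraction
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps  # every pixel, a little noise
    return F.mse_loss(model(x_t, t), eps)       # predict the noise itself
```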
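
And the classifier-free guidance side-note as code, using the standard combination rule at sampling time; the noise predictor `eps_model` and the learned null conditioning `null_cond` are hypothetical stand-ins:

```python
# Classifier-free guidance: the conditional minus the unconditional
# prediction isolates the prompt-dependent direction; the guidance scale
# amplifies it.
def guided_eps(eps_model, x_t, t, cond, null_cond, guidance_scale=7.5):
    eps_uncond = eps_model(x_t, t, null_cond)  # trained with prompt dropped
    eps_cond = eps_model(x_t, t, cond)         # trained with prompt present
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

At guidance_scale = 1 this reduces to ordinary conditional sampling; larger scales push the sample harder along the prompt-dependent direction.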

The crux

  • Self-supervised learning is key to data leverage and to avoiding the “snowy wolf” problem (a supervised classifier latching onto spurious cues, e.g. snowy backgrounds, instead of the subject itself).
  • The predict-then-subtract paradigm (predict the corruption and subtract it from the current sample, rather than regenerating the sample outright) is numerically stable and handy; a sketch follows below.
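
A sketch of what predict-then-subtract looks like under the same assumed `model` and `alpha_bar` as the training step above: the network only ever estimates the (unit-scale) noise, which is then subtracted analytically to recover an estimate of the clean sample.

```python
# Predict-then-subtract: invert x_t = sqrt(a)*x0 + sqrt(1-a)*eps by
# predicting eps and subtracting it out, rather than predicting x0 directly.
def predict_x0(model, x_t, t, alpha_bar):
    a = alpha_bar[t].view(-1, 1, 1, 1)
    eps_hat = model(x_t, t)
    return (x_t - (1 - a).sqrt() * eps_hat) / a.sqrt()
```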