
Image Generation

Generative Adversarial Networks (GANs)

GANs (Goodfellow et al., 2014) frame image generation as a two-player game between a generator \(G\) that produces synthetic images and a discriminator \(D\) that tries to distinguish real from fake.

Minimax Objective

\[ \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \]

where \(z\) is a latent vector sampled from a simple prior (e.g., Gaussian). At the Nash equilibrium, \(G\) produces samples indistinguishable from real data and \(D\) outputs 0.5 everywhere.

In practice, the generator is trained with the non-saturating loss \(-\log D(G(z))\) instead of \(\log(1 - D(G(z)))\) to avoid vanishing gradients early in training when \(D\) easily rejects \(G\)'s outputs.
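The two alternating updates can be sketched in PyTorch on toy 2-D data. This is a minimal illustration, not a real image GAN: `train_step` and the tiny MLPs are hypothetical stand-ins for the convolutional networks used in practice.

```python
import torch
import torch.nn as nn

# Toy setup: tiny MLPs on 2-D data stand in for real conv nets on images.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()  # D outputs raw logits


def train_step(real: torch.Tensor) -> tuple[float, float]:
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    z = torch.randn(batch, 8)
    fake = G(z)
    d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: non-saturating loss -log D(G(z)),
    # i.e. BCE against the "real" label rather than log(1 - D(G(z))).
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Note the `detach()` in the discriminator step: it blocks gradients from flowing into \(G\) while \(D\) is being updated, which is what makes the two updates adversarial rather than cooperative.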

flowchart LR
    Z["Latent z ~ N(0, I)"] --> G["Generator G"]
    G --> Fake["Generated Image"]
    Real["Real Image"] --> D["Discriminator D"]
    Fake --> D
    D --> Loss["Real / Fake"]
    Loss -- "Update D" --> D
    Loss -- "Update G" --> G

GAN Variants

| Variant | Key Innovation | Loss / Regularization |
| --- | --- | --- |
| DCGAN | Convolutional architecture, batch norm, no pooling | Standard GAN loss |
| WGAN | Wasserstein distance instead of JS divergence | \(\mathbb{E}[D(x)] - \mathbb{E}[D(G(z))]\), weight clipping |
| WGAN-GP | Gradient penalty replaces weight clipping | \(\lambda \, \mathbb{E}[(\Vert \nabla_{\hat{x}} D(\hat{x}) \Vert_2 - 1)^2]\) |
| StyleGAN | Mapping network, adaptive instance norm, style mixing | Progressive growing, R1 penalty |
| StyleGAN2 | Weight demodulation, no progressive growing | Path length regularization |

GANs dominated image generation from 2014 to 2021 but suffer from mode collapse (the generator ignores parts of the data distribution) and training instability (adversarial dynamics are hard to balance). Diffusion models, covered next, largely resolved these issues.


Diffusion Models

Diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015) generate images by learning to reverse a gradual noising process. They are now the dominant paradigm for image generation.

Forward Process

Starting from a clean image \(x_0\), the forward process adds Gaussian noise over \(T\) steps:

\[ q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} \, x_{t-1}, \, \beta_t \mathbf{I}) \]

where \(\{\beta_t\}_{t=1}^T\) is the noise schedule. Define \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{s=1}^t \alpha_s\). A key property is that we can sample any \(x_t\) directly from \(x_0\) without iterating:

\[ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon \qquad \text{where } \epsilon \sim \mathcal{N}(0, \mathbf{I}) \]
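This closed-form property is easy to check numerically. A minimal NumPy sketch, assuming the linear noise schedule from Ho et al. (2020); `q_sample` is an illustrative name:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (Ho et al., 2020)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # \bar{alpha}_t = prod_{s<=t} alpha_s


def q_sample(x0: np.ndarray, t: int, rng: np.random.Generator) -> np.ndarray:
    """Draw x_t ~ q(x_t | x_0) in one shot, with no iteration over steps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps


rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))   # stand-in for a clean image
x_early = q_sample(x0, 10, rng)      # mostly signal
x_late = q_sample(x0, T - 1, rng)    # essentially pure noise
```

By \(t = T\) the signal coefficient \(\sqrt{\bar{\alpha}_T}\) is nearly zero, so \(x_T\) is effectively indistinguishable from pure Gaussian noise, which is what makes starting generation from \(\mathcal{N}(0, \mathbf{I})\) valid.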

Reverse Process

The reverse process learns to denoise step by step:

\[ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \sigma_t^2 \mathbf{I}\right) \]

Simplified Training Objective

Rather than predicting \(\mu_\theta\) directly, Ho et al. (2020) showed that a noise-prediction parameterization with a simplified objective works best:

\[ \mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\!\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right] \]

The model \(\epsilon_\theta\) predicts the noise \(\epsilon\) that was added to produce \(x_t\). At inference time, starting from pure noise \(x_T \sim \mathcal{N}(0, \mathbf{I})\), the model iteratively estimates and subtracts the noise.
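A training step for this objective can be sketched as follows. `TinyEpsModel` and `simple_loss` are hypothetical stand-ins: in practice \(\epsilon_\theta\) is a U-Net or DiT, not an MLP.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)


class TinyEpsModel(nn.Module):
    """Toy noise-prediction network: takes (x_t, t), predicts eps."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        t_feat = (t.float() / T).unsqueeze(-1)  # crude timestep embedding
        return self.net(torch.cat([x_t, t_feat], dim=-1))


def simple_loss(model: nn.Module, x0: torch.Tensor) -> torch.Tensor:
    t = torch.randint(0, T, (x0.size(0),))        # random timestep per sample
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # closed-form q(x_t | x_0)
    return ((eps - model(x_t, t)) ** 2).mean()    # L_simple
```

The key practical point visible here: each training step needs only one forward pass at one random timestep, thanks to the closed-form forward process.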

/// details | Simplified loss derivation
The full variational lower bound (VLB) for diffusion decomposes into per-timestep KL divergences between the learned reverse process and the tractable posterior \(q(x_{t-1} \mid x_t, x_0)\). This posterior is Gaussian with known mean and variance:

\[ q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \tilde{\mu}_t(x_t, x_0),\; \tilde{\beta}_t \mathbf{I}\right) \]

where \(\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t} x_t\) and \(\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t\).

Substituting \(x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon)\) and simplifying the KL divergence leads to the noise-prediction loss above, with time-dependent weighting factors dropped for the "simple" variant.
///

flowchart LR
    X0["Clean Image x_0"] -- "Add noise (T steps)" --> XT["Pure Noise x_T"]
    XT -- "Learned denoising (T steps)" --> X0_hat["Generated Image x_0"]
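The per-step reverse update can also be written out directly. A NumPy sketch, assuming the DDPM parameterization with \(\sigma_t^2 = \beta_t\) (one of the two variance choices in Ho et al., 2020); `ddpm_step` is an illustrative name and `eps_pred` stands in for the model output \(\epsilon_\theta(x_t, t)\):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def ddpm_step(x_t: np.ndarray, t: int, eps_pred: np.ndarray,
              rng: np.random.Generator) -> np.ndarray:
    """One reverse step x_t -> x_{t-1}: subtract the scaled predicted
    noise, then (except at t = 0) re-add a small amount of fresh noise."""
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) \
        / np.sqrt(alphas[t])
    if t == 0:
        return mean  # final step is deterministic
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

Full sampling loops this from \(t = T-1\) down to \(0\), calling the trained \(\epsilon_\theta\) at each step, which is why naive DDPM sampling costs \(T\) forward passes.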

Latent Diffusion / Stable Diffusion

Running diffusion in pixel space is computationally expensive for high-resolution images. Latent Diffusion Models (Rombach et al., 2022) solve this by operating in the compressed latent space of a pretrained autoencoder.

Architecture

  1. Encoder \(\mathcal{E}\): Compresses an image \(x \in \mathbb{R}^{H \times W \times 3}\) into a latent \(z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}\), typically with 8x spatial downsampling.
  2. Diffusion model: A U-Net (with cross-attention layers) operates on \(z\), performing the noising/denoising process in latent space.
  3. Decoder \(\mathcal{D}\): Reconstructs the image from the denoised latent: \(\hat{x} = \mathcal{D}(\hat{z})\).
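The compute savings are easy to quantify. A back-of-the-envelope sketch, assuming a 512x512 RGB input and SD-style 8x downsampling to a 64x64x4 latent:

```python
# Number of spatial positions the denoiser must process,
# pixel space vs. latent space.
H, W = 512, 512                  # input image resolution
f, c = 8, 4                      # downsampling factor, latent channels

pixel_positions = H * W                   # 262,144 positions
latent_positions = (H // f) * (W // f)    # 4,096 positions

print(pixel_positions // latent_positions)  # -> 64
```

An 8x spatial downsampling means 64x fewer positions per attention/convolution layer, which is what makes high-resolution diffusion tractable.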

Cross-Attention Conditioning

Text conditioning (or any conditioning signal) is injected via cross-attention layers in the U-Net:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d}}\right) V \qquad Q = W_Q \cdot \phi(z_t), \quad K = W_K \cdot \tau_\theta(y), \quad V = W_V \cdot \tau_\theta(y) \]

where \(\phi(z_t)\) is a spatial feature from the U-Net and \(\tau_\theta(y)\) is the text embedding (e.g., from CLIP). Queries come from the image latent; keys and values come from the text -- exactly the cross-attention pattern from the original Transformer decoder.
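A single-head version of this layer can be sketched in PyTorch. The class is illustrative (real U-Nets use multi-head attention plus projections and residuals), but the Q-from-image, K/V-from-text pattern is exactly the one in the formula above:

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from image features,
    keys/values from text embeddings."""

    def __init__(self, dim_img: int, dim_txt: int, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim_img, dim, bias=False)  # W_Q
        self.w_k = nn.Linear(dim_txt, dim, bias=False)  # W_K
        self.w_v = nn.Linear(dim_txt, dim, bias=False)  # W_V
        self.scale = dim ** -0.5                        # 1 / sqrt(d)

    def forward(self, z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # z: (B, N, dim_img) spatial features; y: (B, M, dim_txt) text tokens
        q, k, v = self.w_q(z), self.w_k(y), self.w_v(y)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                 # (B, N, dim)
```

Each of the \(N\) spatial positions attends over all \(M\) text tokens, so every location in the image can be steered by any word in the prompt.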

Classifier-Free Guidance

To strengthen the influence of the conditioning signal at inference time, classifier-free guidance (Ho & Salimans, 2022) interpolates between conditioned and unconditioned predictions:

\[ \hat{\epsilon} = \epsilon_\theta(z_t, \varnothing) + s \cdot \left(\epsilon_\theta(z_t, y) - \epsilon_\theta(z_t, \varnothing)\right) \]

where \(s\) is the guidance scale and \(\varnothing\) represents the null (empty) conditioning; \(s = 1\) recovers the purely conditional prediction, and values \(s > 1\) amplify the conditioning signal. During training, the text condition is randomly dropped (replaced with \(\varnothing\)) with some probability (e.g., 10%) so the model learns both conditional and unconditional generation.
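The guidance formula itself is a one-liner; `cfg_noise` is an illustrative helper wrapping the equation above:

```python
import torch


def cfg_noise(eps_cond: torch.Tensor, eps_uncond: torch.Tensor,
              s: float) -> torch.Tensor:
    """Classifier-free guidance: move the unconditional prediction
    toward (and past, for s > 1) the conditional one."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```

In a real sampler this is applied at every denoising step, typically by batching the conditional and unconditional forward passes together.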

flowchart LR
    Text["Text Prompt y"] --> CLIP["Text Encoder"]
    CLIP --> CA["Cross-Attention"]
    Noise["z_T ~ N(0,I)"] --> UNet["U-Net Denoiser"]
    CA --> UNet
    UNet -- "T denoising steps" --> Latent["Denoised Latent z_0"]
    Latent --> Dec["Decoder D"]
    Dec --> Image["Output Image"]

Code Example

import torch
from diffusers import StableDiffusionPipeline


def generate_image(
    prompt: str,
    guidance_scale: float = 7.5,
    num_inference_steps: int = 50,
    seed: int = 42,
) -> "PIL.Image.Image":
    """Generate an image from a text prompt using Stable Diffusion.

    Requires a GPU with >= 10 GB VRAM for float16 inference.
    """
    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1",
        torch_dtype=torch.float16,
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(seed)
    result = pipe(
        prompt,
        guidance_scale=guidance_scale,
        num_inference_steps=num_inference_steps,
        generator=generator,
    )
    return result.images[0]

Diffusion Transformers (DiT)

Peebles & Xie (2023) replaced the U-Net backbone of latent diffusion with a Vision Transformer, producing the Diffusion Transformer (DiT) architecture. This is the backbone of modern systems like DALL-E 3, Stable Diffusion 3, and Flux.

Patchification

Instead of operating on spatial feature maps with convolutions, DiT treats the latent \(z \in \mathbb{R}^{h \times w \times c}\) as a sequence of non-overlapping patches (analogous to ViT):

\[ z \rightarrow \{p_1, p_2, \ldots, p_N\} \qquad N = \frac{h \cdot w}{p^2} \]

where \(p\) is the patch size. Each patch is linearly projected into the Transformer's hidden dimension.
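Patchification is pure tensor reshaping. An illustrative `patchify` helper (real DiTs typically fuse this with the linear projection into the hidden dimension):

```python
import torch


def patchify(z: torch.Tensor, p: int) -> torch.Tensor:
    """Split a latent (B, c, h, w) into N = h*w / p^2 non-overlapping
    patches, each flattened to length c * p * p: output (B, N, c*p*p)."""
    B, c, h, w = z.shape
    z = z.unfold(2, p, p).unfold(3, p, p)         # (B, c, h/p, w/p, p, p)
    z = z.permute(0, 2, 3, 1, 4, 5).contiguous()  # (B, h/p, w/p, c, p, p)
    return z.view(B, (h // p) * (w // p), c * p * p)
```

For example, a 64x64x4 Stable Diffusion latent with \(p = 2\) becomes a sequence of \(N = 1024\) tokens, which the Transformer then processes exactly like a sentence of word embeddings.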

AdaLN-Zero Conditioning

DiT conditions on the diffusion timestep \(t\) and (optionally) class/text embeddings via Adaptive Layer Normalization (AdaLN-Zero):

\[ \text{AdaLN}(h, y) = y_s \odot \text{LayerNorm}(h) + y_b \]

where \(y_s\) (scale) and \(y_b\) (shift) are regressed from the conditioning signal. The "Zero" variant initializes the output projection of each block to zero, so the Transformer starts as an identity function -- a critical trick for stable training of deep generative models.
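A sketch of the modulation with its zero-initialized regression head. `AdaLNZero` is an illustrative module; the full DiT block additionally regresses a gating term that scales each residual branch:

```python
import torch
import torch.nn as nn


class AdaLNZero(nn.Module):
    """Adaptive LayerNorm: scale y_s and shift y_b are regressed from the
    conditioning vector y. Zero-initializing the regression head makes the
    modulated output zero at init, so with a residual connection the whole
    block starts as an identity function."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)
        nn.init.zeros_(self.to_scale_shift.weight)  # the "Zero" in AdaLN-Zero
        nn.init.zeros_(self.to_scale_shift.bias)

    def forward(self, h: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # h: (B, N, dim) token features; y: (B, cond_dim) conditioning
        scale, shift = self.to_scale_shift(y).chunk(2, dim=-1)
        return scale.unsqueeze(1) * self.norm(h) + shift.unsqueeze(1)
```

Because the conditioning enters through per-channel scale and shift rather than extra tokens, every layer sees the timestep (and class/text) signal at essentially no sequence-length cost.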

U-Net vs. DiT

| Aspect | U-Net | DiT |
| --- | --- | --- |
| Backbone | Convolutional + attention at low resolutions | Pure Transformer |
| Spatial processing | Multi-scale feature maps with skip connections | Flat sequence of patches |
| Scaling | Limited by architectural complexity | Scales cleanly with parameters (like LLMs) |
| Conditioning | Cross-attention + residual | AdaLN-Zero |
| Compute profile | Mixed conv + attention | Uniform attention (GPU-friendly) |
| Used in | Stable Diffusion 1.x/2.x, Imagen | SD3, Flux, DALL-E 3, Sora |

flowchart TB
    Latent["Noisy Latent z_t"] --> Patch["Patchify + Linear Embed"]
    Patch --> PE["+ Positional Embedding"]
    PE --> Block1["DiT Block 1: AdaLN-Zero + Self-Attention + FFN"]
    Block1 --> Block2["DiT Block 2"]
    Block2 --> BlockN["DiT Block N"]
    BlockN --> Unpatch["Linear + Unpatchify"]
    Unpatch --> Pred["Predicted Noise / v"]
    Timestep["Timestep t + Text Embed"] --> Block1
    Timestep --> Block2
    Timestep --> BlockN

Modern Systems: LLMs + DiT

State-of-the-art image generation systems increasingly combine a large language model for understanding and planning with a DiT for pixel synthesis. The LLM processes the user's prompt, performs reasoning about composition and layout, and produces rich conditioning signals that the DiT consumes.

Orchestration Pattern

flowchart LR
    User["User Prompt"] --> LLM["LLM (Understanding + Planning)"]
    LLM -- "Detailed caption / layout tokens" --> Encoder["Text Encoder (T5 / CLIP)"]
    Encoder -- "Conditioning embeddings" --> DiT["DiT (Latent Denoiser)"]
    DiT -- "Denoised latent" --> VAE["VAE Decoder"]
    VAE --> Image["Output Image"]

This separation of concerns mirrors classical software architecture: the LLM handles the "what" (semantic understanding, spatial reasoning, prompt rewriting) while the DiT handles the "how" (pixel-level synthesis). Systems like DALL-E 3 use GPT-4 to rewrite user prompts into detailed captions before passing them to the diffusion model, significantly improving prompt adherence.

/// details | Autoregressive vs. diffusion approaches to image generation
Two paradigms now compete for image generation:

  • Autoregressive (e.g., Parti, Chameleon): Quantize the image into discrete tokens (via VQ-VAE) and predict them left-to-right like text. Shares infrastructure with LLMs but generates images slowly (one token at a time).
  • Diffusion (e.g., DALL-E 3, Flux): Generate all pixels/latents simultaneously through iterative denoising. Faster at high resolution but requires separate architecture from the LLM.

Hybrid approaches are emerging: some systems use autoregressive models to produce coarse layouts or conditioning signals, then hand off to diffusion for final synthesis -- essentially the LLM + DiT pattern described above.
///


From 2D synthesis to 3D reconstruction -- Image generation produces flat 2D outputs. But for applications like virtual try-on and product visualization, we need full 3D representations that can be rendered from any viewpoint. The next section covers how neural techniques reconstruct 3D scenes from 2D photographs -- using differentiable rendering to bridge the two domains. Continue to 3D Generation.

For learned upscaling techniques (GAN-based and diffusion-based), see Super-Resolution.