Evaluation Metrics¶
Quantitative evaluation is essential for comparing generative models, tuning hyperparameters, and validating perceptual quality against human judgment. Image quality metrics fall into three families based on the information they require: full-reference metrics compare a generated image against a ground-truth target, distributional metrics compare statistics over entire sets of images, and blind (no-reference) metrics score a single image in isolation.
Full-Reference Metrics (Paired)¶
These metrics require a pixel-aligned reference image and are most commonly used in super-resolution, denoising, and image restoration tasks.
PSNR¶
Peak Signal-to-Noise Ratio expresses reconstruction fidelity in decibels. It is derived directly from the Mean Squared Error (MSE) between the generated image \(\hat{I}\) and the reference \(I\):

\[ \text{PSNR} = 10 \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}(I, \hat{I})}\right) \]

where \(\text{MAX}_I\) is the maximum possible pixel value (255 for 8-bit images).
Strengths: Simple, fast, widely reported. Limits: Purely pixel-wise -- two images that look identical to humans can have very different PSNR if shifted by a single pixel. Does not model human perception.
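The two-step computation (MSE, then log scaling) can be sketched directly in NumPy:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels (higher is better)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform error of one gray level on an 8-bit image
ref = np.full((64, 64), 128, dtype=np.uint8)
noisy = ref + 1
print(psnr(ref, noisy))  # ≈ 48.13 dB
```

Note the cast to float64 before subtraction: computing the difference directly on `uint8` arrays would silently wrap around and corrupt the MSE.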
SSIM¶
Structural Similarity Index Measure decomposes the comparison into three perceptually motivated components computed over a sliding window:
| Component | Formula |
|---|---|
| Luminance | \(l(x,y) = \dfrac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\) |
| Contrast | \(c(x,y) = \dfrac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\) |
| Structure | \(s(x,y) = \dfrac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}\) |
The overall SSIM index is the product of the three components:

\[ \text{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y) \]

where \(\mu\), \(\sigma\), and \(\sigma_{xy}\) denote the local mean, standard deviation, and cross-covariance, and \(C_1, C_2, C_3\) are small stabilization constants. With the common choice \(C_3 = C_2/2\), the contrast and structure terms combine into the single factor \(\dfrac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\).
Strengths: More robust than PSNR to uniform brightness changes and mild contrast shifts. Limits: Still sensitive to spatial misalignment; the sliding-window approach assumes local stationarity.
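As an illustration only, here is a single-window version of the index using the \(C_3 = C_2/2\) simplification. The standard metric averages this computation over 11x11 Gaussian-weighted sliding windows rather than one global window:

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    """Simplified SSIM computed over a single global window.
    Uses the common C3 = C2/2 choice, which merges the contrast
    and structure terms into one factor."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    C1 = (0.01 * max_val) ** 2
    C2 = (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    contrast_structure = (2 * cov_xy + C2) / (var_x + var_y + C2)
    return luminance * contrast_structure
```

An image compared against itself scores exactly 1.0; production implementations (e.g. scikit-image's `structural_similarity`) differ mainly in the local windowing.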
LPIPS¶
Learned Perceptual Image Patch Similarity measures distance in the feature space of a pretrained deep network (typically VGG-16 or AlexNet). Given feature maps \(\phi^l\) at layer \(l\), unit-normalized along the channel dimension:

\[ d(x, \hat{x}) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\lVert w_l \odot \left( \phi^l_{hw}(x) - \phi^l_{hw}(\hat{x}) \right) \right\rVert_2^2 \]

where \(w_l\) are learned per-channel weights calibrated against human perceptual judgments.
Strengths: Spatially robust to imperceptible pixel-level shifts; correlates well with human similarity ratings. Limits: Depends on the backbone architecture and its training distribution; heavier compute than PSNR/SSIM.
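A sketch of the distance computation on precomputed feature maps. The feature extraction itself (VGG-16 or AlexNet forward pass) is omitted, and the feature and weight arrays below are stand-ins, not learned LPIPS weights:

```python
import numpy as np

def lpips_distance(feats_x, feats_y, weights):
    """LPIPS-style distance on precomputed feature maps.
    feats_*: list of per-layer arrays with shape (C, H, W);
    weights: list of per-channel weight vectors with shape (C,)."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        # Unit-normalize each spatial position's feature vector (channel axis)
        fx = fx / (np.linalg.norm(fx, axis=0, keepdims=True) + 1e-10)
        fy = fy / (np.linalg.norm(fy, axis=0, keepdims=True) + 1e-10)
        # Weighted squared difference, averaged over spatial positions
        diff = (w[:, None, None] * (fx - fy)) ** 2
        total += diff.sum(axis=0).mean()
    return total
```

In practice one uses the reference `lpips` package, which bundles the calibrated weights; this sketch only shows where the per-channel weighting and spatial averaging enter.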
DISTS¶
Deep Image Structure and Texture Similarity extends the LPIPS idea by explicitly decomposing deep features into structure and texture components, analogous to how SSIM separates luminance, contrast, and structure, but operating in feature space:

\[ \text{DISTS}(x, y) = 1 - \sum_{l}\sum_{c} \left( \alpha_{lc}\, t\!\left(\phi^l_c(x), \phi^l_c(y)\right) + \beta_{lc}\, s\!\left(\phi^l_c(x), \phi^l_c(y)\right) \right) \]

where \(t\) and \(s\) are SSIM-like texture and structure comparison terms computed per channel \(c\) at layer \(l\), and the learned weights \(\alpha_{lc}, \beta_{lc}\) sum to one.
The structure term uses a correlation-based comparison (like SSIM), while the texture term measures statistical distribution distance across channels.
Strengths: Much less sensitive to high-frequency texture misalignment (e.g., grass, fabric, hair) compared to LPIPS. Combines the best intuitions of SSIM and LPIPS. Limits: Slightly higher computational cost; less widely adopted so far.
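To make the decomposition concrete, here is an illustrative sketch of SSIM-like texture and structure terms on one pair of feature maps. Real DISTS computes these on VGG features with learned per-channel weights; the shapes and constant here are placeholders:

```python
import numpy as np

def dists_terms(fx: np.ndarray, fy: np.ndarray, c: float = 1e-6):
    """Texture and structure terms on one feature-map pair of shape (C, H, W),
    in the spirit of DISTS' SSIM-like decomposition (averaged over channels)."""
    mu_x = fx.mean(axis=(1, 2))
    mu_y = fy.mean(axis=(1, 2))
    var_x = fx.var(axis=(1, 2))
    var_y = fy.var(axis=(1, 2))
    cov = ((fx - mu_x[:, None, None]) * (fy - mu_y[:, None, None])).mean(axis=(1, 2))
    # Texture: do the global feature statistics (means) match?
    texture = (2 * mu_x * mu_y + c) / (mu_x ** 2 + mu_y ** 2 + c)
    # Structure: are the feature maps spatially correlated?
    structure = (2 * cov + c) / (var_x + var_y + c)
    return texture.mean(), structure.mean()
```

Because the texture term depends only on channel-wise statistics, shuffling texture details spatially barely changes it, which is exactly why DISTS tolerates texture misalignment better than LPIPS.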
Full-Reference Comparison¶
| Metric | Alignment Sensitive? | Perceptual Correlation | Computational Cost | Scale | Better |
|---|---|---|---|---|---|
| PSNR | High | Low | Very low | dB (typ. 20--50) | Higher |
| SSIM | Moderate | Moderate | Low | 0 -- 1 | Higher |
| LPIPS | Low | High | Moderate | 0 -- 1+ | Lower |
| DISTS | Low | High | Moderate | 0 -- 1 | Lower |
Distributional Metrics (Unpaired / Set-level)¶
These metrics do not require paired images. Instead, they compare the statistical distribution of a generated image set against a real image set. They are the standard for evaluating generative models such as GANs and diffusion models.
FID¶
Fréchet Inception Distance models the activations from the pool-3 layer of Inception v3 (dimensionality 2048) as multivariate Gaussians for both real (\(r\)) and generated (\(g\)) sets, then computes:

\[ \text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right) \]
Strengths: Single scalar summarizing both sample quality and diversity (mode coverage). De facto standard for generative model benchmarks. Limits: Assumes Gaussian feature distributions, which rarely holds exactly. Biased for small sample sizes; at least 2048 images per set are recommended for stable estimates. Sensitive to image preprocessing (resize method, compression).
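Given feature matrices for the two sets, the distance is a few lines of NumPy/SciPy. In practice the features come from Inception v3's pool-3 layer; any `(N, d)` arrays work for illustration:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_r: np.ndarray, feats_g: np.ndarray) -> float:
    """Frechet Inception Distance between two feature sets of shape (N, d)."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    cov_r = np.cov(feats_r, rowvar=False)
    cov_g = np.cov(feats_g, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can leave tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```

The matrix square root is the expensive and numerically delicate step; benchmark implementations (e.g. `pytorch-fid`, `torchmetrics`) wrap exactly this computation with a fixed Inception feature extractor.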
KID¶
Kernel Inception Distance uses the squared Maximum Mean Discrepancy (MMD) with a polynomial kernel instead of assuming Gaussian distributions:

\[ \text{KID} = \frac{1}{m(m-1)} \sum_{i \neq j} k\!\left(f(x_i), f(x_j)\right) + \frac{1}{n(n-1)} \sum_{i \neq j} k\!\left(f(y_i), f(y_j)\right) - \frac{2}{mn} \sum_{i, j} k\!\left(f(x_i), f(y_j)\right) \]

where \(k\) is typically a polynomial kernel \(k(x,y) = \left(\frac{1}{d}x^\top y + 1\right)^3\), \(f\) denotes Inception features, and \(m, n\) are the real and generated sample counts.
Strengths: Unbiased estimator -- works reliably with smaller sample sizes (a few hundred images). No Gaussian assumption. Limits: Higher variance than FID; less commonly reported, making cross-paper comparison harder.
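A minimal NumPy sketch of the unbiased estimator with the default cubic polynomial kernel, again on arbitrary `(N, d)` feature arrays standing in for Inception features:

```python
import numpy as np

def kid(X: np.ndarray, Y: np.ndarray) -> float:
    """Unbiased MMD^2 estimate with the kernel k(x, y) = (x.y / d + 1)^3,
    on feature matrices X of shape (m, d) and Y of shape (n, d)."""
    d = X.shape[1]
    k = lambda A, B: (A @ B.T / d + 1.0) ** 3
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # Exclude diagonal (self-similarity) terms for the unbiased estimator
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * Kxy.mean())
```

Because the estimator is unbiased, small-sample estimates can dip slightly below zero when the two distributions match; reported KID values are usually the mean over several subsampled blocks.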
Distributional Comparison¶
| Metric | Gaussian Assumption | Min. Sample Size | Bias | Adoption |
|---|---|---|---|---|
| FID | Yes | ~2048+ | Biased (small N) | Very high |
| KID | No | ~100+ | Unbiased | Growing |
Blind / No-Reference Metrics¶
These metrics evaluate a single image without any reference. They are used when ground-truth images do not exist, such as evaluating unconditional generation or real-world image quality.
NIQE¶
Natural Image Quality Evaluator fits a multivariate Gaussian model to a corpus of pristine natural images using Natural Scene Statistics (NSS) features extracted from local normalized luminance patches. The quality score is the Mahalanobis distance between the test image's NSS feature statistics \((\nu_t, \Sigma_t)\) and the pristine model \((\nu_p, \Sigma_p)\), with the two covariances pooled:

\[ D = \sqrt{ (\nu_p - \nu_t)^\top \left( \frac{\Sigma_p + \Sigma_t}{2} \right)^{-1} (\nu_p - \nu_t) } \]
Strengths: Completely opinion-free -- no training on human Mean Opinion Scores (MOS). Lower score indicates a more natural-looking image. Limits: Penalizes artistic or stylized content that deviates from natural scene statistics; cannot distinguish semantic quality.
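The scoring step reduces to a pooled-covariance Mahalanobis distance. A sketch on already-fitted Gaussian parameters (the NSS feature extraction and model fitting are omitted; the inputs here are synthetic):

```python
import numpy as np

def niqe_distance(mu_p, cov_p, mu_t, cov_t) -> float:
    """NIQE-style distance between a pristine Gaussian model (mu_p, cov_p)
    and the Gaussian fit to a test image's NSS features (mu_t, cov_t).
    Lower means the test image looks more statistically natural."""
    diff = mu_p - mu_t
    pooled = (cov_p + cov_t) / 2.0  # average the two covariance matrices
    # pinv guards against singular covariances from small feature samples
    return float(np.sqrt(diff @ np.linalg.pinv(pooled) @ diff))
```

With identity covariances and a mean offset of 2 in each of 4 feature dimensions, the distance is exactly \(\sqrt{16} = 4\), which matches the intuition of a per-dimension z-score aggregated in feature space.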
MANIQA¶
Multi-dimension Attention Network for No-Reference Image Quality Assessment uses a Swin Transformer backbone trained on large-scale human MOS datasets (KonIQ-10k, PIPAL). It captures both local distortions and global composition through a dual-branch attention mechanism.
Strengths: State-of-the-art correlation with human quality judgments. Outputs a score in \([0, 1]\). Limits: Requires a trained model checkpoint; quality depends on the training distribution. Computationally heavier than NIQE.
No-Reference Comparison¶
| Metric | Trained on Human Scores? | Score Range | Better | Computational Cost |
|---|---|---|---|---|
| NIQE | No (opinion-free) | 0+ (unbounded) | Lower | Low |
| MANIQA | Yes (MOS-supervised) | 0 -- 1 | Higher | Moderate |
Summary Comparison Table¶
| Metric | Family | Reference Needed? | Alignment Sensitive? | Scale | Better | Computational Cost |
|---|---|---|---|---|---|---|
| PSNR | Full-Reference | Yes (paired) | High | dB (typ. 20--50) | Higher | Very low |
| SSIM | Full-Reference | Yes (paired) | Moderate | 0 -- 1 | Higher | Low |
| LPIPS | Full-Reference | Yes (paired) | Low | 0 -- 1+ | Lower | Moderate |
| DISTS | Full-Reference | Yes (paired) | Low | 0 -- 1 | Lower | Moderate |
| FID | Distributional | Set vs. set | N/A | 0+ (unbounded) | Lower | Moderate |
| KID | Distributional | Set vs. set | N/A | ~0+ (unbiased; can dip below 0) | Lower | Moderate |
| NIQE | No-Reference | None | N/A | 0+ (unbounded) | Lower | Low |
| MANIQA | No-Reference | None | N/A | 0 -- 1 | Higher | Moderate |
For practical applications of these metrics in the context of generative pipelines, see the Image Generation and Super-Resolution pages.