
Evaluation Metrics

Quantitative evaluation is essential for comparing generative models, tuning hyperparameters, and validating perceptual quality against human judgment. Image quality metrics fall into three families based on the information they require: full-reference metrics compare a generated image against a ground-truth target, distributional metrics compare statistics over entire sets of images, and blind (no-reference) metrics score a single image in isolation.


Full-Reference Metrics (Paired)

These metrics require a pixel-aligned reference image and are most commonly used in super-resolution, denoising, and image restoration tasks.

PSNR

Peak Signal-to-Noise Ratio expresses reconstruction fidelity in decibels. It is derived directly from the Mean Squared Error (MSE) between the generated image \(\hat{I}\) and the reference \(I\):

\[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl(I_i - \hat{I}_i\bigr)^2 \]
\[ \text{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right) \]

where \(\text{MAX}_I\) is the maximum possible pixel value (255 for 8-bit images).

Strengths: Simple, fast, widely reported. Limits: Purely pixel-wise -- two images that look identical to humans can have very different PSNR if shifted by a single pixel. Does not model human perception.
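The two formulas above reduce to a few lines of NumPy. The following sketch (the `psnr` helper name is illustrative) assumes 8-bit-range inputs and returns infinity for identical images, where the MSE is zero:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB from the MSE between two images of identical shape."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: zero error
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Note that a uniform error of one gray level already yields roughly 48 dB, which is why PSNR values for lossy reconstructions typically sit in the 20--50 dB band.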

SSIM

Structural Similarity Index Measure decomposes the comparison into three perceptually motivated components computed over a sliding window:

| Component | Formula |
| --- | --- |
| Luminance | \(l(x,y) = \dfrac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\) |
| Contrast | \(c(x,y) = \dfrac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\) |
| Structure | \(s(x,y) = \dfrac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}\) |

The overall SSIM index is:

\[ \text{SSIM}(x, y) = l(x,y) \cdot c(x,y) \cdot s(x,y) \]

where \(\mu\), \(\sigma\), and \(\sigma_{xy}\) denote the local mean, standard deviation, and cross-covariance, and \(C_1, C_2, C_3\) are small stabilization constants. With the common choice \(C_3 = C_2/2\), the contrast and structure terms merge, giving the familiar two-factor form of SSIM.

Strengths: More robust than PSNR to uniform brightness changes and mild contrast shifts. Limits: Still sensitive to spatial misalignment; the sliding-window approach assumes local stationarity.
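To show how the three terms combine, here is a minimal single-window sketch that computes the statistics globally instead of over the standard 11×11 Gaussian sliding window; `ssim_global` is an illustrative name, not a library function, and it uses the merged two-factor form with \(C_3 = C_2/2\):

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    """Single-window SSIM over the whole image (a sketch; production
    implementations average the index over a Gaussian sliding window)."""
    C1 = (0.01 * max_val) ** 2
    C2 = (0.03 * max_val) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    # With C3 = C2 / 2 the contrast and structure factors merge:
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2))
```

Identical images score 1; anti-correlated content drives the covariance term negative and the index toward (or below) zero.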

LPIPS

Learned Perceptual Image Patch Similarity measures distance in the feature space of a pretrained deep network (typically VGG-16 or AlexNet). Given feature maps \(\phi^l\) at layer \(l\):

\[ \text{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\lVert w_l \odot \bigl(\hat{\phi}^l_{hw}(x) - \hat{\phi}^l_{hw}(y)\bigr) \right\rVert_2^2 \]

where \(\hat{\phi}^l\) denotes channel-wise unit-normalized feature maps, the inner sum averages over spatial positions \((h, w)\), and \(w_l\) are learned per-channel weights calibrated against human perceptual judgments.

Strengths: Spatially robust to imperceptible pixel-level shifts; correlates well with human similarity ratings. Limits: Depends on the backbone architecture and its training distribution; heavier compute than PSNR/SSIM.
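The weighted feature-space distance can be sketched on precomputed feature maps. In a real pipeline the `feats_*` lists would be extracted by a pretrained VGG-16/AlexNet and `weights` would be the learned LPIPS calibration; here both are placeholders, and `lpips_distance` is an illustrative name:

```python
import numpy as np

def lpips_distance(feats_x, feats_y, weights):
    """LPIPS-style distance on precomputed feature maps: lists of (C, H, W)
    arrays, one per layer, with a (C,) weight vector per layer."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        # Channel-wise unit normalization at each spatial position.
        fx = fx / (np.linalg.norm(fx, axis=0, keepdims=True) + 1e-10)
        fy = fy / (np.linalg.norm(fy, axis=0, keepdims=True) + 1e-10)
        diff = (w[:, None, None] * (fx - fy)) ** 2  # per-channel weights
        total += diff.sum(axis=0).mean()            # average over H, W
    return total
```

In practice one uses the reference `lpips` package, which bundles the pretrained backbone and calibrated weights.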

DISTS

Deep Image Structure and Texture Similarity extends the LPIPS idea by explicitly decomposing deep features into structure and texture components, analogous to how SSIM separates luminance, contrast, and structure, but operating in feature space:

\[ \text{DISTS}(x, y) = \sum_{l} \Bigl[\alpha_l \cdot d_{\text{structure}}^l + \beta_l \cdot d_{\text{texture}}^l\Bigr] \]

where \(\alpha_l\) and \(\beta_l\) are learned layer-wise weights. The structure term uses a correlation-based comparison of feature maps (analogous to SSIM's structure term), while the texture term compares global channel-wise feature statistics.

Strengths: Much less sensitive to high-frequency texture misalignment (e.g., grass, fabric, hair) compared to LPIPS. Combines the best intuitions of SSIM and LPIPS. Limits: Slightly higher computational cost; less widely adopted so far.
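As a loose single-layer sketch of the decomposition (not the published DISTS implementation, which operates on VGG features with learned per-channel weights), one can form an SSIM-like mean term for texture and a covariance term for structure; `dists_layer`, `alpha`, and `beta` are illustrative:

```python
import numpy as np

def dists_layer(fx, fy, alpha=0.5, beta=0.5, c=1e-6):
    """Sketch of DISTS-style structure/texture terms on (C, H, W) feature maps.
    Returns 1 - similarity so that identical inputs give a distance of 0."""
    mu_x, mu_y = fx.mean(axis=(1, 2)), fy.mean(axis=(1, 2))
    var_x, var_y = fx.var(axis=(1, 2)), fy.var(axis=(1, 2))
    cov = ((fx - mu_x[:, None, None]) * (fy - mu_y[:, None, None])).mean(axis=(1, 2))
    texture = (2 * mu_x * mu_y + c) / (mu_x**2 + mu_y**2 + c)  # global mean match
    structure = (2 * cov + c) / (var_x + var_y + c)            # correlation match
    return 1.0 - (alpha * texture + beta * structure).mean()
```

Because the texture term only compares channel-wise summary statistics, spatially rearranged texture (shuffled grass blades, shifted fabric weave) barely moves it, which is the source of DISTS's robustness.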

Full-Reference Comparison

| Metric | Alignment Sensitive? | Perceptual Correlation | Computational Cost | Scale | Better |
| --- | --- | --- | --- | --- | --- |
| PSNR | High | Low | Very low | dB (typ. 20--50) | Higher |
| SSIM | Moderate | Moderate | Low | 0 -- 1 | Higher |
| LPIPS | Low | High | Moderate | 0 -- 1+ | Lower |
| DISTS | Low | High | Moderate | 0 -- 1 | Lower |

Distributional Metrics (Unpaired / Set-level)

These metrics do not require paired images. Instead, they compare the statistical distribution of a generated image set against a real image set. They are the standard for evaluating generative models such as GANs and diffusion models.

FID

Fréchet Inception Distance models the pool-3 activations of Inception v3 (2048-dimensional) as multivariate Gaussians for the real (\(r\)) and generated (\(g\)) sets, then computes:

\[ \text{FID} = \left\lVert \mu_r - \mu_g \right\rVert_2^2 + \text{Tr}\!\left(\Sigma_r + \Sigma_g - 2\bigl(\Sigma_r \Sigma_g\bigr)^{1/2}\right) \]

Strengths: A single scalar summarizing both fidelity (sample quality) and diversity (mode coverage); the de facto standard for generative model benchmarks. Limits: Assumes Gaussian feature distributions, which rarely holds exactly. Biased for small sample sizes; at least 2048 images per set are recommended for stable estimates. Sensitive to image preprocessing (resize method, compression).
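Given precomputed Inception features, the formula is a short NumPy computation. The sketch below (the `fid` helper name is illustrative) evaluates the matrix-square-root trace via the eigenvalues of \(\Sigma_r \Sigma_g\), which are real and non-negative for covariance matrices:

```python
import numpy as np

def fid(feats_r: np.ndarray, feats_g: np.ndarray) -> float:
    """FID from two (N, d) feature arrays; a real pipeline extracts the
    2048-d pool-3 activations of Inception v3 for each image."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    cov_r = np.cov(feats_r, rowvar=False)
    cov_g = np.cov(feats_g, rowvar=False)
    # Tr((Sr Sg)^{1/2}) = sum of sqrt of the eigenvalues of Sr @ Sg.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

Shifting every feature by a constant leaves the covariances unchanged, so the score reduces to the squared mean displacement, a handy sanity check when validating an implementation.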

KID

Kernel Inception Distance uses the Maximum Mean Discrepancy (MMD) with a polynomial kernel instead of assuming Gaussian distributions:

\[ \text{KID} = \text{MMD}^2\!\bigl(f_r, f_g\bigr) = \mathbb{E}[k(f_r, f_r')] + \mathbb{E}[k(f_g, f_g')] - 2\,\mathbb{E}[k(f_r, f_g)] \]

where \(k\) is typically a polynomial kernel \(k(x,y) = \left(\frac{1}{d}x^\top y + 1\right)^3\) and \(f\) denotes Inception features.

Strengths: Unbiased estimator -- works reliably with smaller sample sizes (a few hundred images). No Gaussian assumption. Limits: Higher variance than FID; less commonly reported, making cross-paper comparison harder.
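The unbiased MMD² estimator with the polynomial kernel above is straightforward in NumPy; `kid` is an illustrative name, and as with FID the Inception features are assumed precomputed:

```python
import numpy as np

def kid(feats_r: np.ndarray, feats_g: np.ndarray) -> float:
    """Unbiased MMD^2 with k(x, y) = (x.y / d + 1)^3 on (N, d) features."""
    d = feats_r.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3
    k_rr, k_gg, k_rg = k(feats_r, feats_r), k(feats_g, feats_g), k(feats_r, feats_g)
    m, n = len(feats_r), len(feats_g)
    # Unbiased estimator: drop the diagonal of the within-set kernel sums.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())
```

Because the estimator is unbiased, its expectation is zero for two samples from the same distribution, and scores can legitimately come out slightly negative at small N.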

Distributional Comparison

| Metric | Gaussian Assumption | Min. Sample Size | Bias | Adoption |
| --- | --- | --- | --- | --- |
| FID | Yes | ~2048+ | Biased (small N) | Very high |
| KID | No | ~100+ | Unbiased | Growing |

Blind / No-Reference Metrics

These metrics evaluate a single image without any reference. They are used when ground-truth images do not exist, such as evaluating unconditional generation or real-world image quality.

NIQE

Natural Image Quality Evaluator fits a multivariate Gaussian model to a corpus of pristine natural images using Natural Scene Statistics (NSS) features extracted from local normalized luminance patches. The quality score is the Mahalanobis distance between the test image's NSS features and the pristine model:

\[ \text{NIQE} = \sqrt{(\nu_1 - \nu_2)^\top \left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{-1} (\nu_1 - \nu_2)} \]

where \(\nu_1, \Sigma_1\) are the mean and covariance of the pristine model and \(\nu_2, \Sigma_2\) those of the Gaussian fitted to the test image's NSS features.

Strengths: Completely opinion-free -- no training on human Mean Opinion Scores (MOS). Lower score indicates a more natural-looking image. Limits: Penalizes artistic or stylized content that deviates from natural scene statistics; cannot distinguish semantic quality.
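The distance itself is a one-liner once the two Gaussian fits are available; the hard part of NIQE, extracting NSS features from normalized luminance patches, is assumed done upstream here, and `niqe_distance` is an illustrative name:

```python
import numpy as np

def niqe_distance(mu_pristine, cov_pristine, mu_test, cov_test):
    """Distance between two multivariate Gaussian fits of NSS features,
    using the pooled covariance from the NIQE formula."""
    pooled = (cov_pristine + cov_test) / 2.0  # average the two covariances
    diff = mu_pristine - mu_test
    return float(np.sqrt(diff @ np.linalg.inv(pooled) @ diff))
```

With identity covariances this reduces to the Euclidean distance between the mean vectors, which makes the formula easy to sanity-check.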

MANIQA

Multi-dimension Attention Network for No-Reference Image Quality Assessment uses a Swin Transformer backbone trained on large-scale human MOS datasets (KonIQ-10k, PIPAL). It captures both local distortions and global composition through a dual-branch attention mechanism.

Strengths: State-of-the-art correlation with human quality judgments. Outputs a score in \([0, 1]\). Limits: Requires a trained model checkpoint; quality depends on the training distribution. Computationally heavier than NIQE.

No-Reference Comparison

| Metric | Trained on Human Scores? | Score Range | Better | Computational Cost |
| --- | --- | --- | --- | --- |
| NIQE | No (opinion-free) | 0+ (unbounded) | Lower | Low |
| MANIQA | Yes (MOS-supervised) | 0 -- 1 | Higher | Moderate |

Summary Comparison Table

| Metric | Family | Reference Needed? | Alignment Sensitive? | Scale | Better | Computational Cost |
| --- | --- | --- | --- | --- | --- | --- |
| PSNR | Full-Reference | Yes (paired) | High | dB (typ. 20--50) | Higher | Very low |
| SSIM | Full-Reference | Yes (paired) | Moderate | 0 -- 1 | Higher | Low |
| LPIPS | Full-Reference | Yes (paired) | Low | 0 -- 1+ | Lower | Moderate |
| DISTS | Full-Reference | Yes (paired) | Low | 0 -- 1 | Lower | Moderate |
| FID | Distributional | Set vs. set | N/A | 0+ (unbounded) | Lower | Moderate |
| KID | Distributional | Set vs. set | N/A | 0+ (unbounded) | Lower | Moderate |
| NIQE | No-Reference | None | N/A | 0+ (unbounded) | Lower | Low |
| MANIQA | No-Reference | None | N/A | 0 -- 1 | Higher | Moderate |

For practical applications of these metrics in the context of generative pipelines, see the Image Generation and Super-Resolution pages.