Evaluation Metrics¶
Quantitative evaluation is essential for comparing generative models, tuning hyperparameters, and validating perceptual quality against human judgment. Image quality metrics fall into three families based on the information they require: full-reference metrics compare a generated image against a ground-truth target, distributional metrics compare statistics over entire sets of images, and blind (no-reference) metrics score a single image in isolation.
Full-Reference Metrics (Paired)¶
These metrics require a pixel-aligned reference image and are most commonly used in super-resolution, denoising, and image restoration tasks.
PSNR¶
Peak Signal-to-Noise Ratio expresses reconstruction fidelity in decibels. It is derived directly from the Mean Squared Error (MSE) between the generated image \(\hat{I}\) and the reference \(I\):

\[ \text{PSNR} = 10 \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}(I, \hat{I})}\right) \]

where \(\text{MAX}_I\) is the maximum possible pixel value (255 for 8-bit images).
Strengths: Simple, fast, widely reported. Limits: Purely pixel-wise -- two images that look identical to humans can have very different PSNR if shifted by a single pixel. Does not model human perception.
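The two-step computation (MSE, then log scaling) can be sketched directly in NumPy:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels (higher is better)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a uniform error of one gray level on an 8-bit image
ref = np.full((64, 64), 128, dtype=np.uint8)
noisy = ref + 1
print(psnr(ref, noisy))  # ≈ 48.13 dB
```

Note the cast to float64 before subtraction: computing the difference directly on `uint8` arrays would silently wrap around and corrupt the MSE.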
SSIM¶
Structural Similarity Index Measure decomposes the comparison into three perceptually motivated components computed over a sliding window:
| Component | Formula |
|---|---|
| Luminance | \(l(x,y) = \dfrac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}\) |
| Contrast | \(c(x,y) = \dfrac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\) |
| Structure | \(s(x,y) = \dfrac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}\) |
The overall SSIM index is the product of the three components:

\[ \text{SSIM}(x, y) = l(x, y) \cdot c(x, y) \cdot s(x, y) \]

where \(\mu\), \(\sigma\), and \(\sigma_{xy}\) denote the local mean, standard deviation, and cross-covariance, and \(C_1, C_2, C_3\) are small stabilization constants. With the common choice \(C_3 = C_2/2\), the contrast and structure terms combine into the single factor \(\dfrac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}\).
Strengths: More robust than PSNR to uniform brightness changes and mild contrast shifts. Limits: Still sensitive to spatial misalignment; the sliding-window approach assumes local stationarity.
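As an illustration only, here is a single-window version of the index using the \(C_3 = C_2/2\) simplification. The standard metric averages this computation over 11x11 Gaussian-weighted sliding windows rather than one global window:

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    """Simplified SSIM computed over a single global window.
    Uses the common C3 = C2/2 choice, which merges the contrast
    and structure terms into one factor."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    C1 = (0.01 * max_val) ** 2
    C2 = (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    contrast_structure = (2 * cov_xy + C2) / (var_x + var_y + C2)
    return luminance * contrast_structure
```

An image compared against itself scores exactly 1.0; production implementations (e.g. scikit-image's `structural_similarity`) differ mainly in the local windowing.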
LPIPS¶
Learned Perceptual Image Patch Similarity measures distance in the feature space of a pretrained deep network (typically VGG-16 or AlexNet). Given feature maps \(\phi^l\) at layer \(l\), unit-normalized along the channel dimension:

\[ d(x, \hat{x}) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \left\lVert w_l \odot \left( \phi^l_{hw}(x) - \phi^l_{hw}(\hat{x}) \right) \right\rVert_2^2 \]

where \(w_l\) are learned per-channel weights calibrated against human perceptual judgments.
Strengths: Spatially robust to imperceptible pixel-level shifts; correlates well with human similarity ratings. Limits: Depends on the backbone architecture and its training distribution; heavier compute than PSNR/SSIM.
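A sketch of the distance computation on precomputed feature maps. The feature extraction itself (VGG-16 or AlexNet forward pass) is omitted, and the feature and weight arrays below are stand-ins, not learned LPIPS weights:

```python
import numpy as np

def lpips_distance(feats_x, feats_y, weights):
    """LPIPS-style distance on precomputed feature maps.
    feats_*: list of per-layer arrays with shape (C, H, W);
    weights: list of per-channel weight vectors with shape (C,)."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        # Unit-normalize each spatial position's feature vector (channel axis)
        fx = fx / (np.linalg.norm(fx, axis=0, keepdims=True) + 1e-10)
        fy = fy / (np.linalg.norm(fy, axis=0, keepdims=True) + 1e-10)
        # Weighted squared difference, averaged over spatial positions
        diff = (w[:, None, None] * (fx - fy)) ** 2
        total += diff.sum(axis=0).mean()
    return total
```

In practice one uses the reference `lpips` package, which bundles the calibrated weights; this sketch only shows where the per-channel weighting and spatial averaging enter.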
DISTS¶
Deep Image Structure and Texture Similarity extends the LPIPS idea by explicitly decomposing deep features into structure and texture components, analogous to how SSIM separates luminance, contrast, and structure, but operating in feature space:

\[ \text{DISTS}(x, y) = 1 - \sum_{l}\sum_{c} \left( \alpha_{lc}\, t\!\left(\phi^l_c(x), \phi^l_c(y)\right) + \beta_{lc}\, s\!\left(\phi^l_c(x), \phi^l_c(y)\right) \right) \]

where \(t\) and \(s\) are SSIM-like texture and structure comparison terms computed per channel \(c\) at layer \(l\), and the learned weights \(\alpha_{lc}, \beta_{lc}\) sum to one.
The structure term uses a correlation-based comparison (like SSIM), while the texture term measures statistical distribution distance across channels.
Strengths: Much less sensitive to high-frequency texture misalignment (e.g., grass, fabric, hair) compared to LPIPS. Combines the best intuitions of SSIM and LPIPS. Limits: Slightly higher computational cost; less widely adopted so far.
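To make the decomposition concrete, here is an illustrative sketch of SSIM-like texture and structure terms on one pair of feature maps. Real DISTS computes these on VGG features with learned per-channel weights; the shapes and constant here are placeholders:

```python
import numpy as np

def dists_terms(fx: np.ndarray, fy: np.ndarray, c: float = 1e-6):
    """Texture and structure terms on one feature-map pair of shape (C, H, W),
    in the spirit of DISTS' SSIM-like decomposition (averaged over channels)."""
    mu_x = fx.mean(axis=(1, 2))
    mu_y = fy.mean(axis=(1, 2))
    var_x = fx.var(axis=(1, 2))
    var_y = fy.var(axis=(1, 2))
    cov = ((fx - mu_x[:, None, None]) * (fy - mu_y[:, None, None])).mean(axis=(1, 2))
    # Texture: do the global feature statistics (means) match?
    texture = (2 * mu_x * mu_y + c) / (mu_x ** 2 + mu_y ** 2 + c)
    # Structure: are the feature maps spatially correlated?
    structure = (2 * cov + c) / (var_x + var_y + c)
    return texture.mean(), structure.mean()
```

Because the texture term depends only on channel-wise statistics, shuffling texture details spatially barely changes it, which is exactly why DISTS tolerates texture misalignment better than LPIPS.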
Full-Reference Comparison¶
| Metric | Alignment Sensitive? | Perceptual Correlation | Computational Cost | Scale | Better |
|---|---|---|---|---|---|
| PSNR | High | Low | Very low | dB (typ. 20--50) | Higher |
| SSIM | Moderate | Moderate | Low | 0 -- 1 | Higher |
| LPIPS | Low | High | Moderate | 0 -- 1+ | Lower |
| DISTS | Low | High | Moderate | 0 -- 1 | Lower |
Distributional Metrics (Unpaired / Set-level)¶
These metrics do not require paired images. Instead, they compare the statistical distribution of a generated image set against a real image set. They are the standard for evaluating generative models such as GANs and diffusion models.
FID¶
Fréchet Inception Distance models the activations from the pool-3 layer of Inception v3 (dimensionality 2048) as multivariate Gaussians for both real (\(r\)) and generated (\(g\)) sets, then computes:

\[ \text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right) \]
Strengths: Single scalar summarizing both sample quality and diversity (mode coverage). De facto standard for generative model benchmarks. Limits: Assumes Gaussian feature distributions, which rarely holds exactly. Biased for small sample sizes; at least 2048 images per set are recommended for stable estimates. Sensitive to image preprocessing (resize method, compression).
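Given feature matrices for the two sets, the distance is a few lines of NumPy/SciPy. In practice the features come from Inception v3's pool-3 layer; any `(N, d)` arrays work for illustration:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_r: np.ndarray, feats_g: np.ndarray) -> float:
    """Frechet Inception Distance between two feature sets of shape (N, d)."""
    mu_r, mu_g = feats_r.mean(axis=0), feats_g.mean(axis=0)
    cov_r = np.cov(feats_r, rowvar=False)
    cov_g = np.cov(feats_g, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can leave tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))
```

The matrix square root is the expensive and numerically delicate step; benchmark implementations (e.g. `pytorch-fid`, `torchmetrics`) wrap exactly this computation with a fixed Inception feature extractor.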
KID¶
Kernel Inception Distance uses the squared Maximum Mean Discrepancy (MMD) with a polynomial kernel instead of assuming Gaussian distributions:

\[ \text{KID} = \frac{1}{m(m-1)} \sum_{i \neq j} k\!\left(f(x_i), f(x_j)\right) + \frac{1}{n(n-1)} \sum_{i \neq j} k\!\left(f(y_i), f(y_j)\right) - \frac{2}{mn} \sum_{i, j} k\!\left(f(x_i), f(y_j)\right) \]

where \(k\) is typically a polynomial kernel \(k(x,y) = \left(\frac{1}{d}x^\top y + 1\right)^3\), \(f\) denotes Inception features, and \(m, n\) are the real and generated sample counts.
Strengths: Unbiased estimator -- works reliably with smaller sample sizes (a few hundred images). No Gaussian assumption. Limits: Higher variance than FID; less commonly reported, making cross-paper comparison harder.
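A minimal NumPy sketch of the unbiased estimator with the default cubic polynomial kernel, again on arbitrary `(N, d)` feature arrays standing in for Inception features:

```python
import numpy as np

def kid(X: np.ndarray, Y: np.ndarray) -> float:
    """Unbiased MMD^2 estimate with the kernel k(x, y) = (x.y / d + 1)^3,
    on feature matrices X of shape (m, d) and Y of shape (n, d)."""
    d = X.shape[1]
    k = lambda A, B: (A @ B.T / d + 1.0) ** 3
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    # Exclude diagonal (self-similarity) terms for the unbiased estimator
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * Kxy.mean())
```

Because the estimator is unbiased, small-sample estimates can dip slightly below zero when the two distributions match; reported KID values are usually the mean over several subsampled blocks.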
Distributional Comparison¶
| Metric | Gaussian Assumption | Min. Sample Size | Bias | Adoption |
|---|---|---|---|---|
| FID | Yes | ~2048+ | Biased (small N) | Very high |
| KID | No | ~100+ | Unbiased | Growing |
Blind / No-Reference Metrics¶
These metrics evaluate a single image without any reference. They are used when ground-truth images do not exist, such as evaluating unconditional generation or real-world image quality.
NIQE¶
Natural Image Quality Evaluator fits a multivariate Gaussian model to a corpus of pristine natural images using Natural Scene Statistics (NSS) features extracted from local normalized luminance patches. The quality score is the Mahalanobis distance between the test image's NSS feature statistics \((\nu_t, \Sigma_t)\) and the pristine model \((\nu_p, \Sigma_p)\), with the two covariances pooled:

\[ D = \sqrt{ (\nu_p - \nu_t)^\top \left( \frac{\Sigma_p + \Sigma_t}{2} \right)^{-1} (\nu_p - \nu_t) } \]
Strengths: Completely opinion-free -- no training on human Mean Opinion Scores (MOS). Lower score indicates a more natural-looking image. Limits: Penalizes artistic or stylized content that deviates from natural scene statistics; cannot distinguish semantic quality.
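The scoring step reduces to a pooled-covariance Mahalanobis distance. A sketch on already-fitted Gaussian parameters (the NSS feature extraction and model fitting are omitted; the inputs here are synthetic):

```python
import numpy as np

def niqe_distance(mu_p, cov_p, mu_t, cov_t) -> float:
    """NIQE-style distance between a pristine Gaussian model (mu_p, cov_p)
    and the Gaussian fit to a test image's NSS features (mu_t, cov_t).
    Lower means the test image looks more statistically natural."""
    diff = mu_p - mu_t
    pooled = (cov_p + cov_t) / 2.0  # average the two covariance matrices
    # pinv guards against singular covariances from small feature samples
    return float(np.sqrt(diff @ np.linalg.pinv(pooled) @ diff))
```

With identity covariances and a mean offset of 2 in each of 4 feature dimensions, the distance is exactly \(\sqrt{16} = 4\), which matches the intuition of a per-dimension z-score aggregated in feature space.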
MANIQA¶
Multi-dimension Attention Network for No-Reference Image Quality Assessment uses a Swin Transformer backbone trained on large-scale human MOS datasets (KonIQ-10k, PIPAL). It captures both local distortions and global composition through a dual-branch attention mechanism.
Strengths: State-of-the-art correlation with human quality judgments. Outputs a score in \([0, 1]\). Limits: Requires a trained model checkpoint; quality depends on the training distribution. Computationally heavier than NIQE.
No-Reference Comparison¶
| Metric | Trained on Human Scores? | Score Range | Better | Computational Cost |
|---|---|---|---|---|
| NIQE | No (opinion-free) | 0+ (unbounded) | Lower | Low |
| MANIQA | Yes (MOS-supervised) | 0 -- 1 | Higher | Moderate |
Summary Comparison Table¶
| Metric | Family | Reference Needed? | Alignment Sensitive? | Scale | Better | Computational Cost |
|---|---|---|---|---|---|---|
| PSNR | Full-Reference | Yes (paired) | High | dB (typ. 20--50) | Higher | Very low |
| SSIM | Full-Reference | Yes (paired) | Moderate | 0 -- 1 | Higher | Low |
| LPIPS | Full-Reference | Yes (paired) | Low | 0 -- 1+ | Lower | Moderate |
| DISTS | Full-Reference | Yes (paired) | Low | 0 -- 1 | Lower | Moderate |
| FID | Distributional | Set vs. set | N/A | 0+ (unbounded) | Lower | Moderate |
| KID | Distributional | Set vs. set | N/A | ~0+ (unbiased; can dip below 0) | Lower | Moderate |
| NIQE | No-Reference | None | N/A | 0+ (unbounded) | Lower | Low |
| MANIQA | No-Reference | None | N/A | 0 -- 1 | Higher | Moderate |
For practical applications of these metrics in the context of generative pipelines, see the Image Generation and Super-Resolution pages.