Super-Resolution (Upscaling)

Modern super-resolution has moved decisively from classical interpolation to learned generative models. The core task is to recover a high-resolution image \(I_{HR}\) from a low-resolution input \(I_{LR}\), where the relationship is modeled as:

\[ I_{LR} = (I_{HR} \ast k) \downarrow_s + n \]

with blur kernel \(k\), downsampling factor \(s\), and noise \(n\). Classical methods (bicubic, Lanczos) treat this as pure interpolation. Learned methods treat it as conditional generation -- synthesizing plausible high-frequency detail that is consistent with the low-resolution input.
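The degradation model above is easy to simulate directly. A minimal NumPy sketch, where a Gaussian blur kernel, stride-\(s\) subsampling, and additive Gaussian noise are illustrative choices for \(k\), \(\downarrow_s\), and \(n\):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Isotropic Gaussian blur kernel k, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def degrade(i_hr, s=4, sigma=1.0, noise_std=0.01, rng=None):
    """I_LR = (I_HR * k) downsampled by s, plus noise n (single channel)."""
    rng = np.random.default_rng(rng)
    k = gaussian_kernel(sigma=sigma)
    pad = k.shape[0] // 2
    padded = np.pad(i_hr, pad, mode="reflect")
    # Direct 2-D convolution via shifted sums (clear, not fast).
    blurred = np.zeros_like(i_hr)
    for dy in range(k.shape[0]):
        for dx in range(k.shape[1]):
            blurred += k[dy, dx] * padded[dy:dy + i_hr.shape[0],
                                          dx:dx + i_hr.shape[1]]
    down = blurred[::s, ::s]                              # downsample by s
    return down + rng.normal(0, noise_std, down.shape)    # + n

hr = np.random.default_rng(0).random((64, 64))
lr = degrade(hr, s=4)
print(lr.shape)  # (16, 16)
```

Training pairs for learned methods are generated exactly this way: apply a known degradation to \(I_{HR}\) and ask the model to invert it.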

GAN-Based: ESRGAN and Real-ESRGAN

ESRGAN (Wang et al., 2018) adapts the GAN framework to super-resolution. A generator \(G\) upscales \(I_{LR}\), and a discriminator \(D\) distinguishes the result from real high-resolution images. The key innovations over the earlier SRGAN:

  • Residual-in-Residual Dense Blocks (RRDB) -- deeper feature extraction without batch normalization (which causes artifacts at high magnification).
  • Relativistic discriminator -- \(D\) predicts whether a real image is more realistic than the generated one, rather than classifying in absolute terms.
  • Perceptual loss -- computed on VGG features before activation (sharper textures than post-activation features).
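The relativistic discriminator in the second bullet can be written compactly. A NumPy sketch of the relativistic average GAN objective, where `d_real`/`d_fake` are raw discriminator logits (the exact ESRGAN training loss adds perceptual and L1 terms on top of this):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ragan_d_loss(d_real, d_fake):
    """Relativistic average discriminator loss: D is trained to judge
    real logits as MORE realistic than the average fake, and vice versa."""
    real_rel = sigmoid(d_real - d_fake.mean())
    fake_rel = sigmoid(d_fake - d_real.mean())
    return (-np.mean(np.log(real_rel + 1e-12))
            - np.mean(np.log(1 - fake_rel + 1e-12)))

def ragan_g_loss(d_real, d_fake):
    """Generator gets the symmetric objective: fakes should look more
    realistic than the average real image."""
    real_rel = sigmoid(d_real - d_fake.mean())
    fake_rel = sigmoid(d_fake - d_real.mean())
    return (-np.mean(np.log(fake_rel + 1e-12))
            - np.mean(np.log(1 - real_rel + 1e-12)))

d_real = np.array([2.0, 1.5, 2.5])     # logits for real HR patches
d_fake = np.array([-1.0, -0.5, -1.5])  # logits for generated patches
print(ragan_d_loss(d_real, d_fake))    # small: D already separates them
```

Because each logit is compared to the *average* of the opposite class, gradients flow to the generator even when the discriminator is confident, which stabilizes training relative to the absolute ("is this real?") formulation.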

Real-ESRGAN (Wang et al., 2021) extends this to real-world images by training with a high-order degradation pipeline that simulates realistic artifacts (JPEG compression, camera noise, blur chains) rather than clean bicubic downsampling only.
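The high-order degradation idea is simply composition: chain several randomized first-order degradations. A toy sketch, where a box blur and nearest subsampling stand in for the full blur/resize/noise/JPEG pool used by Real-ESRGAN:

```python
import numpy as np

rng = np.random.default_rng(42)

def box_blur(img, size):
    """Box blur as a stand-in for Real-ESRGAN's blur-kernel pool."""
    pad = size // 2
    p = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img)
    for dy in range(size):
        for dx in range(size):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def random_first_order(img):
    """One randomized stage: blur -> subsample -> noise.
    (The real pipeline also includes resizing back up and JPEG.)"""
    img = box_blur(img, size=int(rng.choice([3, 5])))
    s = int(rng.choice([1, 2]))
    img = img[::s, ::s]
    return img + rng.normal(0, rng.uniform(0.0, 0.02), img.shape)

def high_order_degrade(img, order=2):
    """'High-order' = the first-order stage applied multiple times."""
    for _ in range(order):
        img = random_first_order(img)
    return img

lr = high_order_degrade(np.ones((32, 32)) * 0.5, order=2)
```

Randomizing the parameters at every stage is what forces the trained model to generalize to unknown real-world degradations instead of memorizing a single bicubic kernel.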

| Model | Approach | Degradation Model | Typical Scale |
| --- | --- | --- | --- |
| SRCNN | CNN regression | Bicubic only | 2x--4x |
| SRGAN | GAN + perceptual loss | Bicubic only | 4x |
| ESRGAN | GAN + RRDB + relativistic D | Bicubic only | 4x |
| Real-ESRGAN | GAN + high-order degradation | Realistic (blur, noise, JPEG) | 2x--4x |
| SD x4 Upscaler | Conditioned latent diffusion | Learned | 4x |
| SUPIR | Diffusion + LLM captioning | Learned | Variable |

Diffusion-Based Super-Resolution

The same latent diffusion framework from the previous section applies directly: condition the denoising process on \(I_{LR}\) (concatenated to the noisy latent or injected via cross-attention) and generate \(I_{HR}\).

Stable Diffusion x4 Upscaler concatenates the low-resolution image with the noisy latent at each denoising step, providing pixel-level guidance. The diffusion approach excels at hallucinating coherent detail (faces, text, fabric texture) that GAN-based methods often distort, but at the cost of slower inference (multiple denoising steps vs. a single forward pass).
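Conditioning by concatenation is mechanically simple. A shape-level sketch of how one denoising step's input is assembled; the dimensions here are illustrative, and the concatenated tensor would feed a U-Net denoiser:

```python
import numpy as np

# Illustrative shapes: a 4-channel noisy latent at 64x64 and the
# 3-channel low-res RGB image resized to the same spatial size.
noisy_latent = np.random.default_rng(0).normal(size=(1, 4, 64, 64))
lr_image = np.random.default_rng(1).random((1, 3, 64, 64))

# SD x4-style conditioning: channel-wise concatenation, so the
# denoiser sees pixel-aligned low-res guidance at every step.
unet_input = np.concatenate([noisy_latent, lr_image], axis=1)
print(unet_input.shape)  # (1, 7, 64, 64)
```

Because the guidance is re-attached at every step, the sample cannot drift far from the low-resolution evidence, which is what keeps diffusion upscaling faithful despite its generative freedom.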

The next section covers modern diffusion-based restoration systems -- including SUPIR, Crystal Clear, LucidFlux, and CoDiff -- that build on this foundation with increasingly sophisticated conditioning strategies.

```mermaid
flowchart LR
    LR["Low-Res Image"] --> Encode["Encode / Concat"]
    LR --> Caption["LLM Caption (SUPIR)"]
    Caption --> Cond["Text Conditioning"]
    Encode --> Denoise["Diffusion Denoising (T steps)"]
    Cond --> Denoise
    Denoise --> Decode["VAE Decode"]
    Decode --> HR["High-Res Output"]
```

Diffusion-Based Systems (Creative)

The following systems push diffusion-based super-resolution beyond simple conditioned denoising, each introducing architectural innovations that improve restoration quality, semantic coherence, or inference efficiency.

Crystal Clear (ClarityAI)

Crystal Clear uses a hybrid multi-stage pipeline combining traditional and diffusion-based techniques. The architecture starts with 4x-UltraSharp ESRGAN for initial upscaling, then applies a fine-tuned Stable Diffusion 1.5 model (specifically Juggernaut) through a controlled image-to-image partial denoise process. To maintain structural integrity, it employs ControlNet Tile with a "resemblance" parameter that prevents excessive hallucination while adding realistic details. For processing large images, MultiDiffusion tiling (896x1152 pixel tiles) ensures seamless results without visible seams. The system is further enhanced with LoRAs like SDXLrender (reduces blur) and Add More Details (increases sharpness).
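The "partial denoise" step is the standard img2img trick: noise the ESRGAN result only part-way, then denoise from that intermediate timestep so global structure survives. A sketch with a deliberately toy noise schedule, where `strength` plays the role of the denoise slider:

```python
import numpy as np

def img2img_start(latent, strength, num_steps=50, rng=None):
    """Pick the starting point for a partial denoise.
    strength=1.0 -> full generation from pure noise;
    strength~0.3 -> light retouch that preserves structure."""
    rng = np.random.default_rng(rng)
    t_start = int(num_steps * strength)   # denoising steps actually run
    alpha = 1.0 - strength                # toy schedule: signal kept
    noised = (np.sqrt(alpha) * latent
              + np.sqrt(1 - alpha) * rng.normal(size=latent.shape))
    return noised, t_start

latent = np.zeros((4, 8, 8))              # stand-in for the encoded ESRGAN output
noised, t_start = img2img_start(latent, strength=0.35)
print(t_start)  # 17 of 50 denoising steps
```

Low strength plus ControlNet Tile is the combination that lets Crystal Clear add texture without repainting the image.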

This approach achieves competitive quality by intelligently combining ESRGAN's speed with controlled diffusion hallucination, making it an effective open-source alternative to commercial tools like Magnific AI.

SUPIR (Scaling-UP Image Restoration)

SUPIR (Yu et al., 2024) is a generative restoration framework that integrates a multimodal large language model (MLLM) directly into the restoration pipeline. Built upon the Stable Diffusion XL (SDXL) backbone, SUPIR distinguishes itself from SD1.5-based approaches (like Crystal Clear) by employing a vision-language model (specifically LLaVA) as a semantic interpreter. Instead of blindly denoising pixels, the MLLM analyzes the degraded input image to generate a highly detailed textual prompt describing the scene's content, lighting, and materials.

This Semantic Guidance forces the SDXL generative prior to hallucinate textures that are logically consistent with the subject (e.g., ensuring "fur" looks like fur, not noise), enabling it to restore severely degraded images with photorealistic fidelity and contextual intelligence that discriminative models (like DRCT) and smaller generative models cannot match.

LucidFlux (Caption-Free Universal Image Restoration)

LucidFlux (late 2025) adapts the Flux.1 Diffusion Transformer backbone for image restoration, abandoning the U-Net architecture used by SUPIR and Crystal Clear. Its primary innovation is a Caption-Free Dual-Branch Architecture: instead of relying on manual text prompts or VLM-generated captions (like SUPIR), it utilizes a SigLIP encoder to extract semantic features directly from the image, paired with a lightweight conditioner that processes both the degraded input and a "lightly restored proxy" to guide the generation.

While theoretically superior in handling mixed resolutions due to its Transformer backbone, LucidFlux currently suffers from Texture Collapse -- a "plastic/waxy" appearance on photographic content. Because the base Flux model is heavily biased toward smooth, synthetic aesthetics, LucidFlux tends to aggressively denoise high-frequency film grain and skin pores, interpreting them as errors rather than features. This makes it inferior to SD1.5/SDXL-based models (Crystal Clear/SUPIR) for photorealistic restoration until specific texture-aware fine-tunes are released.

CoDiff (Compression-Aware One-Step Diffusion)

CoDiff (ICCV 2025) is specifically engineered to solve the "Digital Rubble" problem -- JPEG Quality Factor < 10, where pixels are effectively lost. Built on a distilled One-Step Diffusion (OSD) backbone (leveraging priors from Stable Diffusion 2.1), its core innovation is the Compression-Aware Visual Embedder (CaVE). Unlike standard encoders that treat artifacts as generic noise, CaVE employs a Dual Learning Strategy: an explicit branch trained to predict the exact JPEG Quality Factor (QF) from the input, and an implicit branch optimized for latent reconstruction.

This allows the model to condition the diffusion process on the specific mathematical signature of the compression, enabling it to hallucinate plausible high-frequency textures (grain, fabric) that strictly align with the quantization tables of the destroyed image -- effectively reverse-engineering the compression algorithm in a single inference step.

Comparison

| Model | Backbone | Key Innovation | Strength | GitHub |
| --- | --- | --- | --- | --- |
| Crystal Clear | ESRGAN + SD1.5 | ControlNet Tile resemblance + MultiDiffusion tiling | Balanced speed/quality, open-source Magnific alternative | clarity-upscaler |
| SUPIR | SDXL | LLaVA semantic guidance | Severe degradation, photorealistic hallucination | SUPIR |
| LucidFlux | Flux.1 (DiT) | Caption-free SigLIP dual-branch | Mixed-resolution inputs (texture collapse caveat) | LucidFlux |
| CoDiff | SD2.1 (one-step) | CaVE dual QF prediction | Extreme JPEG compression (QF < 10) | codiff |

Transformer-Based (Highest Fidelity)

Transformer-based super-resolution models take a fundamentally different approach from diffusion systems. Rather than generating stochastic hallucinations, they learn deterministic mappings that minimize pixel-wise reconstruction error (PSNR/SSIM). This makes them the preferred choice when fidelity to the original content matters more than perceptual "creativity."

DRCT (Dense Residual Connected Transformer)

DRCT (2024) addresses the information bottleneck inherent in deep Swin Transformer networks like SwinIR. Its core innovation replaces standard Residual Swin Transformer Blocks with Swin-Dense-Residual-Connected Blocks (SDRCB), organized within Residual Dense Groups (RDG). By employing dense distinct connections between stacked layers and a multi-level residual learning strategy, DRCT ensures that feature maps from shallow layers are concatenated directly with deep features, preserving high-frequency spatial information throughout the forward pass.

This dense wiring scheme prevents the "forgetting" of fine textures (grain, pores) during deep feature extraction, allowing the model to achieve State-of-the-Art fidelity metrics (PSNR/SSIM) by mathematically minimizing pixel-wise reconstruction error rather than generating stochastic hallucinations.
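The dense wiring can be sketched at the tensor level: every block receives the concatenation of all earlier feature maps, so shallow high-frequency features reach the deepest layer intact. A toy stand-in, where `block` is a placeholder for an SDRCB (any map from its input channels to a fixed output width):

```python
import numpy as np

def block(x, out_ch=16, seed=0):
    """Placeholder for a Swin-Dense-Residual-Connected Block:
    any transform from x's channels to out_ch channels."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(x.shape[0], out_ch)) / np.sqrt(x.shape[0])
    return np.einsum("chw,co->ohw", x, w)   # 1x1-conv-like channel mix

def dense_group(x, num_blocks=3):
    """Dense connectivity: each block sees the concat of ALL previous
    features, so shallow maps are never 'forgotten'."""
    feats = [x]
    for i in range(num_blocks):
        out = block(np.concatenate(feats, axis=0), seed=i)
        feats.append(out)
    return np.concatenate(feats, axis=0)    # shallow + deep, preserved

x = np.ones((8, 4, 4))    # (C, H, W) input features
y = dense_group(x)
print(y.shape)  # (8 + 3*16, 4, 4) = (56, 4, 4)
```

The growing channel count is the visible cost of dense connections; DRCT pays it in exchange for the fidelity gains described above.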

SRFormer (Permuted Self-Attention Transformer)

SRFormer is a lightweight transformer architecture designed to overcome the receptive field limitations of SwinIR by introducing Permuted Self-Attention (PSA). Unlike SwinIR, which is restricted to small, non-overlapping 8x8 windows that sever global context, SRFormer utilizes a channel-spatial permutation strategy to enable attention across significantly larger windows (e.g., 24x24) without increasing computational complexity.

By reshaping spatial information into the channel dimension before computing attention, it effectively "tricks" the network into processing broader contextual dependencies -- crucial for recognizing large-scale artifact patterns like JPEG blocking -- while remaining computationally efficient. Additionally, it replaces the standard feed-forward network with a ConvFFN (Convolutional Feed-Forward Network), which re-injects high-frequency local bias, ensuring that while the attention mechanism cleans global noise, fine details like text edges and sharp lines are not over-smoothed.
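The permutation trick can be illustrated with shapes: folding spatial positions into the channel dimension lets a large window be attended as few tokens. This is a simplified space-to-depth view of the idea, not the exact PSA operator:

```python
import numpy as np

def fold_window(win, r=3):
    """Fold an (S, S, C) window into (S/r, S/r, C*r*r):
    spatial detail moves into channels, token count drops r^2-fold."""
    s, _, c = win.shape
    win = win.reshape(s // r, r, s // r, r, c)
    return win.transpose(0, 2, 1, 3, 4).reshape(s // r, s // r, c * r * r)

win24 = np.random.default_rng(0).random((24, 24, 8))
folded = fold_window(win24, r=3)        # 576 tokens -> 64 tokens
tokens = folded.reshape(-1, folded.shape[-1])
attn = tokens @ tokens.T                # a 64x64 attention map: the same
                                        # quadratic size as a plain 8x8 window
print(folded.shape, attn.shape)  # (8, 8, 72) (64, 64)
```

The quadratic term in attention depends on token count, not channel width, which is why moving information from the spatial axis to the channel axis buys a 24x24 receptive field at roughly 8x8 cost.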

GRL (Global, Regional, and Local)

GRL is a specialized Transformer architecture designed to model image hierarchies explicitly, addressing the key limitation of window-based models like SwinIR (restricted receptive fields). Its core innovation is Anchored Stripe Self-Attention, a mechanism that replaces standard square windows with anisotropic horizontal and vertical stripes. By using "anchors" to approximate the attention map, GRL can efficiently capture long-range dependencies across the entire image width or height without the quadratic computational cost of full global attention.

This makes it uniquely superior for reconstructing repetitive geometric patterns (e.g., skyscraper facades, brick walls, fences) and preserving distinct straight lines -- scenarios where standard window-based Transformers often create visible "grid artifacts" or break continuous edges.
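The exact GRL operator differs in detail, but the core anchor idea, factorizing an \(N \times N\) attention map through \(M \ll N\) anchor tokens inside a full-width stripe, can be sketched as follows (the stripe shape, anchor subsampling, and softmax placement here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, M = 64, 64, 16, 16          # M anchors per stripe, M << tokens

def stripe_tokens(x, stripe_h=4):
    """(H, W, C) -> full-width horizontal stripes of stripe_h*W tokens:
    tokens in one stripe can attend across the entire image width."""
    h, w, c = x.shape
    return x.reshape(h // stripe_h, stripe_h, w, c).reshape(h // stripe_h, -1, c)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.random((H, W, C))
stripes = stripe_tokens(x)            # (16, 256, 16): 256 tokens per stripe
q = k = v = stripes[0]
anchors = q[:: q.shape[0] // M]       # subsample M anchor tokens

# Anchored factorization: two skinny (N x M) / (M x N) maps replace
# one dense (N x N) attention matrix.
out = softmax(q @ anchors.T) @ (softmax(anchors @ k.T) @ v)
print(out.shape)  # (256, 16)
```

The cost drops from \(O(N^2)\) to \(O(NM)\) per stripe, which is what makes image-wide horizontal and vertical attention affordable.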

MambaIR Family (v1, v2, MatIR)

MambaIR (2024/2025) abandons both CNNs and Transformers in favor of State Space Models (SSMs). Its primary breakthrough is achieving global receptive fields with linear complexity \(O(N)\), avoiding the quadratic complexity bottleneck that plagues Transformers. Unlike Transformers that partition images into windows, MambaIR treats the entire image as a continuous sequence, using a selective scan mechanism to process pixels in a stream while dynamically "remembering" long-range dependencies across the entire frame.

The v2 upgrade (CVPR 2025) introduces an Attentive State-Space Equation to enable "non-causal" modeling -- allowing the model to access future pixels it hasn't scanned yet -- effectively combining the global context awareness of GRL with the computational efficiency of a CNN. This makes MambaIR the current efficiency leader for tasks requiring massive context (like deblurring or removing large watermark patterns) where window-based models fail, and especially useful for very large images when GPU memory is constrained.
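The linear-complexity claim follows from the recurrent view of an SSM: one pass over the sequence with constant-size state. A scalar-state toy scan; a real selective scan makes `a`, `b`, `c` input-dependent and the state multi-dimensional:

```python
import numpy as np

def ssm_scan(x, a=0.9, b=0.5, c=1.0):
    """h_t = a*h_{t-1} + b*x_t ;  y_t = c*h_t
    A single O(N) pass gives every position a view of the whole past."""
    h, ys = 0.0, []
    for x_t in x:          # N steps, O(1) state -> O(N) total
        h = a * h + b * x_t
        ys.append(c * h)
    return np.array(ys)

# A whole "image" flattened to a pixel stream:
x = np.ones(10_000)
y = ssm_scan(x)
# The state integrates arbitrarily long-range context:
print(y[-1])  # converges to b/(1-a) = 5.0 (geometric-series steady state)
```

Contrast this with full attention, where every position comparing against every other costs \(O(N^2)\); for a 4K image flattened to ~8M tokens, the difference is decisive.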

Comparison

| Model | Architecture | Key Innovation | Best For |
| --- | --- | --- | --- |
| DRCT | Swin Transformer + dense connections | SDRCB dense wiring, multi-level residual learning | Highest fidelity upscaling (PSNR/SSIM) |
| SRFormer | Permuted Self-Attention Transformer | PSA for large windows + ConvFFN local bias | JPEG de-blocking, text/edge preservation |
| GRL | Anchored Stripe Attention Transformer | Anisotropic stripe attention with anchors | Repetitive geometric patterns, straight lines |
| MambaIR | State Space Model (SSM) | Global receptive field at \(O(N)\) complexity | Very large images, memory-constrained environments |

Tiled Inference for Large Images

At production resolutions, upscaling entire images at once exceeds GPU memory. Tiled inference splits the image into overlapping patches, upscales each independently, and blends the results using weighted averaging in the overlap region (typically a cosine or linear ramp). This connects directly to the feathering and blending techniques covered in Computer Vision Techniques.
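A minimal tiling-and-blending sketch, in 1-D for clarity: `upscale` is a placeholder for the model (identity here), a linear ramp stands in for the cosine window, and the signal length is assumed to fit the tiling exactly:

```python
import numpy as np

def blend_tiles(signal, tile=64, overlap=16, upscale=lambda t: t):
    """Split into overlapping tiles, process each independently, and
    blend the overlaps with a linear ramp so weights sum to 1."""
    step = tile - overlap
    out = np.zeros(len(signal))
    wsum = np.zeros(len(signal))
    ramp = np.linspace(0, 1, overlap)
    for start in range(0, len(signal) - overlap, step):
        end = min(start + tile, len(signal))
        tw = np.ones(end - start)
        if start > 0:
            tw[:overlap] = ramp             # feather in (not at left border)
        if end < len(signal):
            tw[-overlap:] = ramp[::-1]      # feather out (not at right border)
        out[start:end] += upscale(signal[start:end]) * tw
        wsum[start:end] += tw
    return out / wsum                       # normalize by total weight

x = np.sin(np.linspace(0, 10, 256))
y = blend_tiles(x)                          # identity "model": output == input
print(np.allclose(y, x))  # True
```

In the overlap region the outgoing ramp of one tile and the incoming ramp of its neighbor sum to exactly 1, so a constant signal passes through unchanged; this is the same weighted-average feathering used in 2-D with per-axis (or cosine) windows.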