3D Generation¶
Traditional Photogrammetry¶
Structure from Motion (SfM) is the classical approach to 3D reconstruction from photographs. It estimates camera poses and a sparse 3D point cloud simultaneously from feature correspondences across multiple views.
SfM Pipeline¶
```mermaid
flowchart LR
Images["Input Photos"] --> Features["Feature Detection (SIFT/SuperPoint)"]
Features --> Match["Feature Matching"]
Match --> BA["Bundle Adjustment"]
BA --> Sparse["Sparse Point Cloud + Camera Poses"]
Sparse --> MVS["Multi-View Stereo"]
MVS --> Dense["Dense Point Cloud / Mesh"]
```
- Feature detection -- Extract keypoints and descriptors (SIFT, ORB, or learned features like SuperPoint).
- Feature matching -- Find correspondences across image pairs.
- Bundle adjustment -- Jointly optimize 3D point positions and camera parameters by minimizing reprojection error.
- Dense reconstruction -- Multi-View Stereo (MVS) algorithms like COLMAP produce dense point clouds or meshes.
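The quantity bundle adjustment minimizes, reprojection error, can be sketched with a simple pinhole camera in NumPy (a minimal illustration; `project` and `reprojection_error` are hypothetical helper names, not COLMAP's API):

```python
import numpy as np

def project(point_3d, R, t, K):
    """Project a 3D world point into pixel coordinates with a pinhole camera."""
    p_cam = R @ point_3d + t          # world -> camera frame
    p_img = K @ p_cam                 # apply intrinsics
    return p_img[:2] / p_img[2]       # perspective divide

def reprojection_error(points_3d, observations, R, t, K):
    """Mean squared distance between observed and reprojected keypoints --
    the quantity bundle adjustment minimizes over points and camera params."""
    residuals = [project(X, R, t, K) - uv for X, uv in zip(points_3d, observations)]
    return np.mean([r @ r for r in residuals])

# Toy example: identity camera at the origin looking down +z
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
X = np.array([[0.1, -0.2, 2.0], [0.0, 0.0, 4.0]])
obs = np.array([project(x, R, t, K) for x in X])    # perfect observations
print(reprojection_error(X, obs, R, t, K))           # -> 0.0
```

Real solvers optimize this error jointly over all points and cameras with sparse Levenberg-Marquardt; here the camera is fixed and the observations are synthetic, so the error is exactly zero.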
Limitations¶
| Challenge | Description |
|---|---|
| Textureless surfaces | Feature detectors fail on uniform regions (white walls, plain fabric) |
| Reflective / transparent surfaces | Specular highlights violate the Lambertian assumption |
| Thin structures | Fine geometry (hair, wire fences) is lost in point cloud representations |
| Lighting variation | Different exposure or white balance across images degrades matching |
| Computation time | Dense reconstruction of large scenes can take hours |
Neural Radiance Fields (NeRF)¶
NeRF (Mildenhall et al., 2020) represents a scene as a continuous volumetric function parameterized by a neural network. It takes a 3D position and viewing direction as input and outputs color and density, enabling photorealistic novel-view synthesis.
Scene Representation¶
A neural network \(F_\theta\) maps a 3D position \(\mathbf{x} = (x, y, z)\) and viewing direction \(\mathbf{d} = (\theta, \phi)\) to an RGB color \(\mathbf{c}\) and volume density \(\sigma\):

\[ F_\theta : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma) \]
Density \(\sigma\) depends only on position (geometry is view-independent), while color depends on both position and direction (to model view-dependent effects like specular highlights).
Volume Rendering¶
To render a pixel, NeRF casts a ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\) from the camera origin \(\mathbf{o}\) through the pixel in direction \(\mathbf{d}\), and integrates color along the ray:

\[ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt \]

where the transmittance \(T(t)\) is the probability that the ray travels from \(t_n\) to \(t\) without hitting anything:

\[ T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right) \]

In practice, the integral is approximated by sampling \(N\) points along the ray and using quadrature:

\[ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \]

where \(\delta_i = t_{i+1} - t_i\) is the distance between adjacent samples.
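The quadrature translates almost directly into NumPy (a sketch, assuming per-sample densities, colors, and spacings are already given):

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """NeRF quadrature: alpha_i = 1 - exp(-sigma_i * delta_i),
    T_i = prod_{j<i} (1 - alpha_j), pixel = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: exclusive cumulative product of (1 - alpha)
    T = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = T * alphas
    return weights @ colors, weights

# Two samples: a nearly opaque red surface with a blue one hidden behind it
sigmas = np.array([5.0, 5.0])
deltas = np.array([1.0, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
pixel, w = composite(sigmas, colors, deltas)
```

Because the first sample is nearly opaque, almost all of the pixel's color comes from it; the blue sample behind contributes almost nothing, exactly as occlusion demands.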
Positional Encoding¶
Raw \((x, y, z)\) coordinates are mapped through a positional encoding before being fed to the network. This allows the MLP to represent high-frequency scene details:

\[ \gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right) \]

applied independently to each coordinate (the original paper uses \(L = 10\) for position and \(L = 4\) for direction).
This is the same idea as the Transformer's sinusoidal positional encoding -- mapping a low-dimensional input into a higher-dimensional space so that the network can learn high-frequency functions. Without it, MLPs exhibit a strong bias toward smooth, low-frequency outputs (the "spectral bias" of neural networks).
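The encoding is a few lines of NumPy (a sketch; the function name and default `L` are illustrative):

```python
import numpy as np

def positional_encoding(p, L=10):
    """gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p),
    cos(2^{L-1} pi p)), applied elementwise: each coordinate -> 2L values."""
    freqs = 2.0 ** np.arange(L) * np.pi            # (L,) frequency bands
    angles = np.outer(np.atleast_1d(p), freqs)     # (dim, L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()

x = np.array([0.25, -0.5, 0.1])
print(positional_encoding(x, L=10).shape)   # (60,) -- 3 coords * 2 * 10
```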
Hierarchical Sampling¶
NeRF uses a coarse-to-fine strategy: a coarse network first samples points uniformly along the ray, then the density estimates from the coarse pass inform a second round of importance sampling that concentrates points near surfaces. This dramatically improves efficiency.
```mermaid
flowchart TB
Ray["Camera Ray r(t)"] --> Coarse["Coarse Sampling (uniform)"]
Coarse --> CoarseNet["Coarse MLP"]
CoarseNet --> Weights["Density Weights"]
Weights --> Fine["Fine Sampling (importance)"]
Fine --> FineNet["Fine MLP"]
FineNet --> Render["Volume Rendering -> Pixel Color"]
Render --> Loss["MSE Loss vs. Ground Truth Pixel"]
```
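The fine-stage importance sampling can be sketched as inverse-transform sampling of the coarse density weights (illustrative NumPy, not the reference implementation):

```python
import numpy as np

def importance_sample(bins, weights, n_samples, rng):
    """Draw fine samples along a ray proportional to coarse density weights
    by inverting the CDF of the piecewise-constant PDF over the bins."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])  # one entry per bin edge
    u = rng.uniform(size=n_samples)
    return np.interp(u, cdf, bins)                 # invert the CDF

rng = np.random.default_rng(0)
bins = np.linspace(0.0, 6.0, 7)                        # 6 coarse intervals
weights = np.array([0.0, 0.0, 10.0, 10.0, 0.0, 0.0])   # surface near t in [2, 4]
t_fine = importance_sample(bins, weights, 128, rng)
```

All 128 fine samples land in \([2, 4]\), where the coarse pass found density, so the fine MLP spends its capacity near the surface.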
3D Gaussian Splatting¶
3D Gaussian Splatting (Kerbl et al., 2023) represents scenes as a collection of explicit 3D Gaussian primitives that are differentiably rasterized onto images. It achieves real-time rendering while matching or exceeding NeRF quality.
Representation¶
Each Gaussian primitive is defined by:
- Position \(\mu \in \mathbb{R}^3\) -- the center
- Covariance \(\Sigma \in \mathbb{R}^{3 \times 3}\) -- the shape and orientation (parameterized as \(\Sigma = RSS^TR^T\), where \(R\) is a rotation matrix stored as a unit quaternion and \(S\) is a diagonal scale matrix, ensuring \(\Sigma\) is always positive semi-definite)
- Opacity \(\alpha \in [0, 1]\)
- Color -- represented via spherical harmonics coefficients for view-dependent appearance
The influence of each Gaussian at a 3D point \(\mathbf{x}\) is:

\[ G(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right) \]
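Putting the covariance parameterization and the influence function together (a sketch; `quat_to_rot` and `gaussian_influence` are illustrative helper names):

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_influence(x, mu, q, scales):
    """G(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu)), Sigma = R S S^T R^T."""
    R = quat_to_rot(q)
    S = np.diag(scales)
    Sigma = R @ S @ S.T @ R.T        # positive semi-definite by construction
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)

mu = np.zeros(3)
q = np.array([1.0, 0.0, 0.0, 0.0])   # identity rotation
scales = np.array([1.0, 0.5, 0.25])  # anisotropic ellipsoid
print(gaussian_influence(mu, mu, q, scales))   # -> 1.0 at the center
```

Optimizing the quaternion and scales directly (rather than the raw \(\Sigma\)) is what keeps the covariance valid throughout gradient descent.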
Differentiable Rasterization¶
Instead of ray marching (NeRF's approach), Gaussian Splatting projects each 3D Gaussian onto the 2D image plane:

\[ \Sigma' = J W \Sigma W^T J^T \]
where \(W\) is the world-to-camera transform and \(J\) is the Jacobian of the projective transformation. The projected 2D Gaussians are then alpha-composited front-to-back using a tile-based rasterizer, which is massively parallelizable on GPUs.
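The covariance projection can be sketched for a pinhole camera with unit focal length (illustrative NumPy, not the paper's CUDA rasterizer):

```python
import numpy as np

def project_covariance(Sigma, W, mean_cam):
    """Sigma' = J W Sigma W^T J^T.  J is the Jacobian of the perspective
    projection (x/z, y/z) evaluated at the Gaussian's camera-space mean."""
    x, y, z = mean_cam
    J = np.array([[1/z, 0, -x/z**2],
                  [0, 1/z, -y/z**2]])
    return J @ W @ Sigma @ W.T @ J.T     # 2x2 image-plane covariance

Sigma = np.diag([0.1, 0.1, 0.1])         # isotropic 3D Gaussian
W = np.eye(3)                            # camera frame = world frame
cov2d = project_covariance(Sigma, W, mean_cam=np.array([0.0, 0.0, 2.0]))
```

Note how depth enters through \(J\): the same 3D Gaussian shrinks quadratically in the image as it moves away from the camera.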
The pixel color is computed by sorted alpha blending:

\[ C = \sum_{i \in \mathcal{N}} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) \]

where \(\mathcal{N}\) is the set of depth-sorted Gaussians overlapping the pixel and \(\alpha_i\) is the learned opacity modulated by the projected 2D Gaussian.
Note the structural similarity to the NeRF volume rendering equation -- both are front-to-back alpha compositing, but Gaussian Splatting operates on explicit primitives rather than sampled points along rays.
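For a single pixel, the blending rule is a few lines of NumPy (a sketch over already depth-sorted Gaussians):

```python
import numpy as np

def blend(colors, alphas):
    """Front-to-back compositing: C = sum_i c_i alpha_i prod_{j<i} (1 - alpha_j).
    Inputs must already be depth-sorted, nearest first."""
    T = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return (T * alphas) @ colors

colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # red in front of green
alphas = np.array([0.6, 1.0])
print(blend(colors, alphas))   # ~ [0.6, 0.4, 0.0]
```

Compare with the NeRF `composite` sketch above: the transmittance product is identical; only the source of the alphas differs (projected primitives versus ray samples).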
Adaptive Densification¶
The optimization starts from a sparse SfM point cloud and iteratively:
- Clones Gaussians in under-reconstructed regions (high positional gradient, small Gaussians)
- Splits over-large Gaussians that cover too much scene geometry
- Prunes Gaussians with near-zero opacity or excessive scale
This adaptive process is reminiscent of the iterative refinement theme -- the representation itself evolves during optimization, not just the parameters.
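The per-Gaussian decision logic can be sketched as follows (thresholds are illustrative, not the paper's exact defaults):

```python
def densify_action(pos_grad_norm, scale, opacity,
                   grad_thresh=0.0002, scale_thresh=0.01, min_opacity=0.005):
    """Per-Gaussian densification rule: prune near-transparent Gaussians,
    clone small high-gradient ones, split large high-gradient ones."""
    if opacity < min_opacity:
        return "prune"
    if pos_grad_norm > grad_thresh:
        return "split" if scale > scale_thresh else "clone"
    return "keep"

print(densify_action(0.001, 0.002, 0.9))    # clone: small, under-reconstructed
print(densify_action(0.001, 0.05, 0.9))     # split: large, under-reconstructed
print(densify_action(0.0001, 0.05, 0.001))  # prune: nearly transparent
```

In the actual optimizer this check runs periodically (not every step), and split Gaussians inherit scaled-down copies of the parent's parameters.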
```mermaid
flowchart TB
SfM["SfM Point Cloud"] --> Init["Initialize Gaussians"]
Init --> Forward["Differentiable Rasterization"]
Forward --> Loss["L1 + D-SSIM Loss vs. GT Image"]
Loss --> Backward["Backprop Gradients"]
Backward --> Update["Update mu, Sigma, alpha, SH"]
Update --> Densify{"Densification Check"}
Densify -- "Under-reconstructed" --> Clone["Clone / Split"]
Densify -- "Transparent" --> Prune["Prune"]
Clone --> Forward
Prune --> Forward
Densify -- "Converged" --> Done["Final Gaussians"]
```
Comparison: Photogrammetry vs. NeRF vs. Gaussian Splatting¶
| Aspect | Photogrammetry | NeRF | 3D Gaussian Splatting |
|---|---|---|---|
| Representation | Explicit mesh / point cloud | Implicit (MLP weights) | Explicit (Gaussian primitives) |
| Training time | Hours (dense MVS) | Hours (per-scene optimization) | Minutes (~15-30 min) |
| Rendering speed | Real-time (rasterization) | Seconds to real-time (original: ~30s; Instant-NGP: real-time) | Real-time (~100+ FPS) |
| Novel view quality | Good (mesh artifacts at edges) | Excellent (continuous field) | Excellent (smooth blending) |
| View-dependent effects | Limited (baked textures) | Yes (direction input) | Yes (spherical harmonics) |
| Editability | High (standard mesh tools) | Low (implicit, entangled) | Medium (explicit primitives) |
| Memory | Large (dense meshes) | Small (network weights) | Medium (millions of Gaussians) |
| Input requirements | Many views, good texture | Many views, known poses | Moderate views, SfM initialization |
/// details | Hybrid 3D approaches
Several recent methods combine the strengths of multiple representations:
- Instant-NGP (Mueller et al., 2022): Replaces the MLP with a multi-resolution hash encoding, cutting per-scene training from hours to seconds. Together with follow-ups such as Nerfacto, it renders in real time, making the original ~30 s-per-frame figure historical context only.
- Neuralangelo (Li et al., 2023): Combines NeRF with surface extraction using multi-resolution hash encoding for high-fidelity mesh recovery.
- SuGaR (Guedon & Lepetit, 2024): Extracts meshes from Gaussian Splatting by regularizing Gaussians to align with surfaces, enabling traditional mesh editing of splat-derived scenes.
- 2DGS (Huang et al., 2024): Uses 2D (flat) Gaussians instead of 3D ellipsoids, improving surface reconstruction quality while maintaining splatting speed.
The field is converging toward methods that are fast to optimize (like Gaussians), produce high-quality novel views (like NeRF), and yield editable geometry (like traditional meshes).
///