3D Generation¶
Traditional Photogrammetry¶
Structure from Motion (SfM) is the classical approach to 3D reconstruction from photographs. It estimates camera poses and a sparse 3D point cloud simultaneously from feature correspondences across multiple views.
SfM Pipeline¶
```mermaid
flowchart LR
Images["Input Photos"] --> Features["Feature Detection (SIFT/SuperPoint)"]
Features --> Match["Feature Matching"]
Match --> BA["Bundle Adjustment"]
BA --> Sparse["Sparse Point Cloud + Camera Poses"]
Sparse --> MVS["Multi-View Stereo"]
MVS --> Dense["Dense Point Cloud / Mesh"]
```
- Feature detection -- Extract keypoints and descriptors (SIFT, ORB, or learned features like SuperPoint).
- Feature matching -- Find correspondences across image pairs.
- Bundle adjustment -- Jointly optimize 3D point positions and camera parameters by minimizing reprojection error.
- Dense reconstruction -- Multi-View Stereo (MVS) algorithms like COLMAP produce dense point clouds or meshes.
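The quantity bundle adjustment minimizes, reprojection error, can be sketched with a simple pinhole camera in NumPy (a minimal illustration; `project` and `reprojection_error` are hypothetical helper names, not COLMAP's API):

```python
import numpy as np

def project(point_3d, R, t, K):
    """Project a 3D world point into pixel coordinates with a pinhole camera."""
    p_cam = R @ point_3d + t          # world -> camera frame
    p_img = K @ p_cam                 # apply intrinsics
    return p_img[:2] / p_img[2]       # perspective divide

def reprojection_error(points_3d, observations, R, t, K):
    """Mean squared distance between observed and reprojected keypoints --
    the quantity bundle adjustment minimizes over points and camera params."""
    residuals = [project(X, R, t, K) - uv for X, uv in zip(points_3d, observations)]
    return np.mean([r @ r for r in residuals])

# Toy example: identity camera at the origin looking down +z
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
X = np.array([[0.1, -0.2, 2.0], [0.0, 0.0, 4.0]])
obs = np.array([project(x, R, t, K) for x in X])    # perfect observations
print(reprojection_error(X, obs, R, t, K))           # -> 0.0
```

Real solvers optimize this error jointly over all points and cameras with sparse Levenberg-Marquardt; here the camera is fixed and the observations are synthetic, so the error is exactly zero.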
Limitations¶
| Challenge | Description |
|---|---|
| Textureless surfaces | Feature detectors fail on uniform regions (white walls, plain fabric) |
| Reflective / transparent surfaces | Specular highlights violate the Lambertian assumption |
| Thin structures | Fine geometry (hair, wire fences) is lost in point cloud representations |
| Lighting variation | Different exposure or white balance across images degrades matching |
| Computation time | Dense reconstruction of large scenes can take hours |
Neural Radiance Fields (NeRF)¶
NeRF (Mildenhall et al., 2020) represents a scene as a continuous volumetric function parameterized by a neural network. It takes a 3D position and viewing direction as input and outputs color and density, enabling photorealistic novel-view synthesis.
Scene Representation¶
A neural network \(F_\theta\) maps a 3D position \(\mathbf{x} = (x, y, z)\) and viewing direction \(\mathbf{d} = (\theta, \phi)\) to an RGB color \(\mathbf{c}\) and volume density \(\sigma\):

\[ F_\theta : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma) \]
Density \(\sigma\) depends only on position (geometry is view-independent), while color depends on both position and direction (to model view-dependent effects like specular highlights).
Volume Rendering¶
To render a pixel, NeRF casts a ray \(\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}\) from the camera origin \(\mathbf{o}\) through the pixel in direction \(\mathbf{d}\), and integrates color along the ray:

\[ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt \]

where the transmittance \(T(t)\) is the probability that the ray travels from \(t_n\) to \(t\) without hitting anything:

\[ T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right) \]

In practice, the integral is approximated by sampling \(N\) points along the ray and using quadrature:

\[ \hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \]

where \(\delta_i = t_{i+1} - t_i\) is the distance between adjacent samples.
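The quadrature translates almost directly into NumPy (a sketch, assuming per-sample densities, colors, and spacings are already given):

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """NeRF quadrature: alpha_i = 1 - exp(-sigma_i * delta_i),
    T_i = prod_{j<i} (1 - alpha_j), pixel = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: exclusive cumulative product of (1 - alpha)
    T = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = T * alphas
    return weights @ colors, weights

# Two samples: a nearly opaque red surface with a blue one hidden behind it
sigmas = np.array([5.0, 5.0])
deltas = np.array([1.0, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
pixel, w = composite(sigmas, colors, deltas)
```

Because the first sample is nearly opaque, almost all of the pixel's color comes from it; the blue sample behind contributes almost nothing, exactly as occlusion demands.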
Positional Encoding¶
Raw \((x, y, z)\) coordinates are mapped through a positional encoding before being fed to the network. This allows the MLP to represent high-frequency scene details:

\[ \gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right) \]

applied independently to each coordinate (the original paper uses \(L = 10\) for position and \(L = 4\) for direction).
This is the same idea as the Transformer's sinusoidal positional encoding -- mapping a low-dimensional input into a higher-dimensional space so that the network can learn high-frequency functions. Without it, MLPs exhibit a strong bias toward smooth, low-frequency outputs (the "spectral bias" of neural networks).
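The encoding is a few lines of NumPy (a sketch; the function name and default `L` are illustrative):

```python
import numpy as np

def positional_encoding(p, L=10):
    """gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p),
    cos(2^{L-1} pi p)), applied elementwise: each coordinate -> 2L values."""
    freqs = 2.0 ** np.arange(L) * np.pi            # (L,) frequency bands
    angles = np.outer(np.atleast_1d(p), freqs)     # (dim, L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()

x = np.array([0.25, -0.5, 0.1])
print(positional_encoding(x, L=10).shape)   # (60,) -- 3 coords * 2 * 10
```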
Hierarchical Sampling¶
NeRF uses a coarse-to-fine strategy: a coarse network first samples points uniformly along the ray, then the density estimates from the coarse pass inform a second round of importance sampling that concentrates points near surfaces. This dramatically improves efficiency.
```mermaid
flowchart TB
Ray["Camera Ray r(t)"] --> Coarse["Coarse Sampling (uniform)"]
Coarse --> CoarseNet["Coarse MLP"]
CoarseNet --> Weights["Density Weights"]
Weights --> Fine["Fine Sampling (importance)"]
Fine --> FineNet["Fine MLP"]
FineNet --> Render["Volume Rendering -> Pixel Color"]
Render --> Loss["MSE Loss vs. Ground Truth Pixel"]
```
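The fine-stage importance sampling can be sketched as inverse-transform sampling of the coarse density weights (illustrative NumPy, not the reference implementation):

```python
import numpy as np

def importance_sample(bins, weights, n_samples, rng):
    """Draw fine samples along a ray proportional to coarse density weights
    by inverting the CDF of the piecewise-constant PDF over the bins."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])  # one entry per bin edge
    u = rng.uniform(size=n_samples)
    return np.interp(u, cdf, bins)                 # invert the CDF

rng = np.random.default_rng(0)
bins = np.linspace(0.0, 6.0, 7)                        # 6 coarse intervals
weights = np.array([0.0, 0.0, 10.0, 10.0, 0.0, 0.0])   # surface near t in [2, 4]
t_fine = importance_sample(bins, weights, 128, rng)
```

All 128 fine samples land in \([2, 4]\), where the coarse pass found density, so the fine MLP spends its capacity near the surface.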
3D Gaussian Splatting¶
3D Gaussian Splatting (Kerbl et al., 2023) represents scenes as a collection of explicit 3D Gaussian primitives that are differentiably rasterized onto images. It achieves real-time rendering while matching or exceeding NeRF quality.
Representation¶
Each Gaussian primitive is defined by:
- Position \(\mu \in \mathbb{R}^3\) -- the center
- Covariance \(\Sigma \in \mathbb{R}^{3 \times 3}\) -- the shape and orientation (parameterized as \(\Sigma = RSS^TR^T\), where \(R\) is a rotation matrix stored as a unit quaternion and \(S\) is a diagonal scale matrix, ensuring \(\Sigma\) is always positive semi-definite)
- Opacity \(\alpha \in [0, 1]\)
- Color -- represented via spherical harmonics coefficients for view-dependent appearance
The influence of each Gaussian at a 3D point \(\mathbf{x}\) is:

\[ G(\mathbf{x}) = \exp\!\left(-\tfrac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right) \]
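Putting the covariance parameterization and the influence function together (a sketch; `quat_to_rot` and `gaussian_influence` are illustrative helper names):

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a unit quaternion (w, x, y, z)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def gaussian_influence(x, mu, q, scales):
    """G(x) = exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu)), Sigma = R S S^T R^T."""
    R = quat_to_rot(q)
    S = np.diag(scales)
    Sigma = R @ S @ S.T @ R.T        # positive semi-definite by construction
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)

mu = np.zeros(3)
q = np.array([1.0, 0.0, 0.0, 0.0])   # identity rotation
scales = np.array([1.0, 0.5, 0.25])  # anisotropic ellipsoid
print(gaussian_influence(mu, mu, q, scales))   # -> 1.0 at the center
```

Optimizing the quaternion and scales directly (rather than the raw \(\Sigma\)) is what keeps the covariance valid throughout gradient descent.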
Differentiable Rasterization¶
Instead of ray marching (NeRF's approach), Gaussian Splatting projects each 3D Gaussian onto the 2D image plane:

\[ \Sigma' = J W \Sigma W^T J^T \]
where \(W\) is the world-to-camera transform and \(J\) is the Jacobian of the projective transformation. The projected 2D Gaussians are then alpha-composited front-to-back using a tile-based rasterizer, which is massively parallelizable on GPUs.
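The covariance projection can be sketched for a pinhole camera with unit focal length (illustrative NumPy, not the paper's CUDA rasterizer):

```python
import numpy as np

def project_covariance(Sigma, W, mean_cam):
    """Sigma' = J W Sigma W^T J^T.  J is the Jacobian of the perspective
    projection (x/z, y/z) evaluated at the Gaussian's camera-space mean."""
    x, y, z = mean_cam
    J = np.array([[1/z, 0, -x/z**2],
                  [0, 1/z, -y/z**2]])
    return J @ W @ Sigma @ W.T @ J.T     # 2x2 image-plane covariance

Sigma = np.diag([0.1, 0.1, 0.1])         # isotropic 3D Gaussian
W = np.eye(3)                            # camera frame = world frame
cov2d = project_covariance(Sigma, W, mean_cam=np.array([0.0, 0.0, 2.0]))
```

Note how depth enters through \(J\): the same 3D Gaussian shrinks quadratically in the image as it moves away from the camera.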
The pixel color is computed by sorted alpha blending:

\[ C = \sum_{i \in \mathcal{N}} \mathbf{c}_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j) \]

where \(\mathcal{N}\) is the set of depth-sorted Gaussians overlapping the pixel and \(\alpha_i\) is the learned opacity modulated by the projected 2D Gaussian.
Note the structural similarity to the NeRF volume rendering equation -- both are front-to-back alpha compositing, but Gaussian Splatting operates on explicit primitives rather than sampled points along rays.
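For a single pixel, the blending rule is a few lines of NumPy (a sketch over already depth-sorted Gaussians):

```python
import numpy as np

def blend(colors, alphas):
    """Front-to-back compositing: C = sum_i c_i alpha_i prod_{j<i} (1 - alpha_j).
    Inputs must already be depth-sorted, nearest first."""
    T = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    return (T * alphas) @ colors

colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # red in front of green
alphas = np.array([0.6, 1.0])
print(blend(colors, alphas))   # ~ [0.6, 0.4, 0.0]
```

Compare with the NeRF `composite` sketch above: the transmittance product is identical; only the source of the alphas differs (projected primitives versus ray samples).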
Adaptive Densification¶
The optimization starts from a sparse SfM point cloud and iteratively:
- Clones Gaussians in under-reconstructed regions (high positional gradient, small Gaussians)
- Splits over-large Gaussians that cover too much scene geometry
- Prunes Gaussians with near-zero opacity or excessive scale
This adaptive process is reminiscent of the iterative refinement theme -- the representation itself evolves during optimization, not just the parameters.
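The per-Gaussian decision logic can be sketched as follows (thresholds are illustrative, not the paper's exact defaults):

```python
def densify_action(pos_grad_norm, scale, opacity,
                   grad_thresh=0.0002, scale_thresh=0.01, min_opacity=0.005):
    """Per-Gaussian densification rule: prune near-transparent Gaussians,
    clone small high-gradient ones, split large high-gradient ones."""
    if opacity < min_opacity:
        return "prune"
    if pos_grad_norm > grad_thresh:
        return "split" if scale > scale_thresh else "clone"
    return "keep"

print(densify_action(0.001, 0.002, 0.9))    # clone: small, under-reconstructed
print(densify_action(0.001, 0.05, 0.9))     # split: large, under-reconstructed
print(densify_action(0.0001, 0.05, 0.001))  # prune: nearly transparent
```

In the actual optimizer this check runs periodically (not every step), and split Gaussians inherit scaled-down copies of the parent's parameters.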
```mermaid
flowchart TB
SfM["SfM Point Cloud"] --> Init["Initialize Gaussians"]
Init --> Forward["Differentiable Rasterization"]
Forward --> Loss["L1 + D-SSIM Loss vs. GT Image"]
Loss --> Backward["Backprop Gradients"]
Backward --> Update["Update mu, Sigma, alpha, SH"]
Update --> Densify{"Densification Check"}
Densify -- "Under-reconstructed" --> Clone["Clone / Split"]
Densify -- "Transparent" --> Prune["Prune"]
Clone --> Forward
Prune --> Forward
Densify -- "Converged" --> Done["Final Gaussians"]
```
Comparison: Photogrammetry vs. NeRF vs. Gaussian Splatting¶
| Aspect | Photogrammetry | NeRF | 3D Gaussian Splatting |
|---|---|---|---|
| Representation | Explicit mesh / point cloud | Implicit (MLP weights) | Explicit (Gaussian primitives) |
| Training time | Hours (dense MVS) | Hours (per-scene optimization) | Minutes (~15-30 min) |
| Rendering speed | Real-time (rasterization) | Seconds to real-time (original: ~30s; Instant-NGP: real-time) | Real-time (~100+ FPS) |
| Novel view quality | Good (mesh artifacts at edges) | Excellent (continuous field) | Excellent (smooth blending) |
| View-dependent effects | Limited (baked textures) | Yes (direction input) | Yes (spherical harmonics) |
| Editability | High (standard mesh tools) | Low (implicit, entangled) | Medium (explicit primitives) |
| Memory | Large (dense meshes) | Small (network weights) | Medium (millions of Gaussians) |
| Input requirements | Many views, good texture | Many views, known poses | Moderate views, SfM initialization |
/// details | Hybrid 3D approaches
Several recent methods combine the strengths of multiple representations:
- Instant-NGP (Mueller et al., 2022): Replaces the MLP with a multi-resolution hash encoding, cutting per-scene training from hours to seconds. Together with follow-ups such as Nerfacto, it renders in real time, making the original ~30 s-per-frame figure historical context only.
- Neuralangelo (Li et al., 2023): Combines NeRF with surface extraction using multi-resolution hash encoding for high-fidelity mesh recovery.
- SuGaR (Guedon & Lepetit, 2024): Extracts meshes from Gaussian Splatting by regularizing Gaussians to align with surfaces, enabling traditional mesh editing of splat-derived scenes.
- 2DGS (Huang et al., 2024): Uses 2D (flat) Gaussians instead of 3D ellipsoids, improving surface reconstruction quality while maintaining splatting speed.
The field is converging toward methods that are fast to optimize (like Gaussians), produce high-quality novel views (like NeRF), and yield editable geometry (like traditional meshes).
///