Generative AI

Generative AI encompasses a family of techniques that learn to produce new data -- text, images, 3D scenes -- rather than merely classify or cluster existing data. While the modalities differ, a handful of recurring ideas appear across all of them:

  • Embedding spaces. Raw inputs (words, pixels, point clouds) are mapped into continuous vector spaces where arithmetic operations carry semantic meaning. Word2Vec showed that "king - man + woman ≈ queen" in embedding space (the result is the vector nearest to "queen", not an exact equality); the same principle resurfaces in CLIP embeddings that bridge text and images, and in the latent spaces of autoencoders that compress images for diffusion.

  • Attention mechanisms. The ability to let every element of a sequence dynamically weight its relationship to every other element -- introduced for machine translation -- now underpins language models, image generators, and 3D reconstructors alike.

  • Iterative refinement. Rather than producing an output in a single forward pass, many generative systems refine an initial guess over multiple steps. Autoregressive decoders extend text one token at a time; diffusion models denoise a random sample over dozens of steps; 3D Gaussian Splatting iteratively densifies and prunes a set of primitives to match observed images.

  • Differentiable rendering. The bridge between 2D observations and 3D representations requires rendering operations whose gradients can flow back into the model. Neural Radiance Fields use differentiable volume rendering; Gaussian Splatting uses differentiable rasterization. Both allow 3D scene parameters to be optimized directly from photographs.
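The embedding-arithmetic idea can be made concrete with a toy example. The 2-D vectors below are hand-crafted (one axis for royalty, one for gender) as stand-ins for learned Word2Vec embeddings; real models use hundreds of dimensions, but the analogy-by-nearest-neighbor mechanics are the same:

```python
import numpy as np

# Hand-crafted toy embeddings: axis 0 = royalty, axis 1 = maleness.
# A real Word2Vec model learns such structure from co-occurrence statistics.
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
    "apple": np.array([0.0, 0.5]),  # unrelated distractor word
}

def nearest(v, exclude):
    # Return the vocabulary word whose embedding has the highest
    # cosine similarity to v, skipping the query words themselves.
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom > 0 else 0.0
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(v, emb[w]))

v = emb["king"] - emb["man"] + emb["woman"]          # -> [1.0, 0.0]
print(nearest(v, exclude={"king", "man", "woman"}))  # queen
```

The analogy query works here only because the toy axes were chosen to make it work; in a trained model the same query succeeds because the training objective induces approximately linear structure.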
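The attention mechanism itself fits in a few lines of NumPy. This is the generic scaled dot-product formulation, not any particular library's API; the random Q, K, V matrices stand in for learned projections of the input sequence:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each query scores every key, the scores are normalized into a
    # distribution, and the output is the corresponding mix of values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) affinities
    weights = softmax(scores, axis=-1)   # rows sum to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 query vectors, dimension 8
K = rng.normal(size=(5, 8))   # 5 key vectors
V = rng.normal(size=(5, 8))   # 5 value vectors
out, w = attention(Q, K, V)
print(out.shape)              # (3, 8): one output per query
```

Because the weights are recomputed for every input, each element's influence on every other is dynamic, which is exactly what distinguishes attention from a fixed convolution or linear layer.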
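As a concrete instance of differentiable rendering, here is a sketch of the NeRF-style volume-rendering quadrature along a single ray. The densities and colors are made-up sample values rather than outputs of a trained network; the point is that every operation is smooth, so gradients can flow from the rendered pixel back to the scene parameters:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    # Per-sample opacity: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance T_i: probability the ray survives all samples before i.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas                 # compositing weight per sample
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

sigmas = np.array([0.1, 0.5, 2.0, 4.0])      # toy densities along the ray
colors = np.array([[1, 0, 0], [0, 1, 0],
                   [0, 0, 1], [1, 1, 1]], dtype=float)
deltas = np.full(4, 0.25)                    # spacing between samples
rgb, w = render_ray(sigmas, colors, deltas)
print(rgb, w.sum())   # weights sum to < 1; the remainder is background
```

Gaussian Splatting replaces this per-ray quadrature with differentiable rasterization of anisotropic Gaussians, but the optimization loop is the same: render, compare to a photograph, backpropagate.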

This chapter traces these ideas across four domains:

  • Natural Language Processing -- where embeddings and attention were first developed
  • Image Generation -- where diffusion models have largely superseded GANs, and where the Transformer architecture reappears in the form of Diffusion Transformers
  • Super-Resolution -- where learned upscaling recovers high-frequency detail from low-resolution inputs
  • 3D Generation -- where neural scene representations enable photorealistic novel-view synthesis from sparse photographs

The progression is deliberate: each section builds on concepts introduced in the previous one, and cross-references highlight the shared mathematical machinery.


Relevance to the Sartiq Pipeline

The techniques covered in this chapter connect directly to the Sartiq product pipeline. Natural language understanding powers prompt interpretation for image generation tasks. Latent diffusion models drive the image synthesis and editing stages (inpainting, outpainting, virtual try-on). And 3D reconstruction from product photographs enables novel-view rendering for catalog generation.

```mermaid
flowchart TB
    subgraph NLP["NLP Layer"]
        Prompt["User Prompt"] --> LLM["LLM: Parse Intent + Generate Captions"]
    end

    subgraph ImageGen["Image Generation Layer"]
        LLM -- "Conditioning text" --> Diffusion["Latent Diffusion (Inpainting / Generation)"]
        Diffusion --> Composite["Compositing Pipeline"]
    end

    subgraph ThreeD["3D Reconstruction Layer"]
        Photos["Product Photos"] --> GS["Gaussian Splatting / NeRF"]
        GS --> NovelView["Novel View Synthesis"]
        NovelView --> Composite
    end

    subgraph Output["Output"]
        Composite --> Final["Final Product Images"]
    end
```

  • NLP layer -- LLMs interpret editing instructions, generate detailed captions for diffusion conditioning, and handle prompt-based workflows (see APIs > Integrations for the Anthropic integration).
  • Image generation layer -- Latent diffusion models perform inpainting, background replacement, and virtual try-on. Classifier-free guidance strengthens prompt adherence. The compositing pipeline (see Computer Vision Techniques) handles mask refinement and alpha blending.
  • 3D reconstruction layer -- Gaussian Splatting reconstructs 3D garment geometry from studio photographs, enabling novel-view rendering for angles not captured during the shoot. SfM provides the initial camera poses and sparse point cloud.
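The classifier-free guidance step mentioned above amounts to a one-line extrapolation between two noise predictions, made with and without the text condition. The arrays below are placeholders for real denoiser outputs, and the scale 7.5 is a commonly used default rather than a Sartiq-specific setting:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, s=7.5):
    # s = 1 recovers the plain conditional prediction;
    # s > 1 pushes each denoising step further toward the prompt.
    return eps_uncond + s * (eps_cond - eps_uncond)

# Placeholder noise predictions standing in for two U-Net forward passes.
eps_u = np.zeros(4)
eps_c = np.ones(4)
print(cfg(eps_u, eps_c, s=7.5))   # [7.5 7.5 7.5 7.5]
```

The cost is one extra model evaluation per step (the unconditional pass), which is why guidance scale and step count are the usual latency knobs in a production diffusion pipeline.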

These layers do not operate in isolation. The LLM can reason about 3D composition (e.g., "show the garment from a 45-degree angle") and route to the appropriate reconstruction or generation pipeline. The output of 3D rendering feeds back into the 2D compositing stage, where diffusion-based refinement ensures photorealistic final output.