Natural Language Processing

Word Embeddings

Before neural approaches, words were represented as one-hot vectors -- sparse, high-dimensional, and devoid of semantic structure. Word2Vec (Mikolov et al., 2013) changed this by learning dense, low-dimensional vectors where geometric relationships encode meaning.

Skip-Gram and CBOW

Word2Vec offers two training objectives:

  • Skip-gram -- Given a center word, predict the surrounding context words. Works well for rare words.
  • CBOW (Continuous Bag of Words) -- Given surrounding context words, predict the center word. Faster to train, slightly better for frequent words.

For skip-gram, the objective maximizes the probability of observing context word \(w_O\) given center word \(w_I\):

\[ p(w_O \mid w_I) = \frac{\exp(\mathbf{v}_{w_O}' \cdot \mathbf{v}_{w_I})}{\sum_{w=1}^{V} \exp(\mathbf{v}_w' \cdot \mathbf{v}_{w_I})} \]

where \(\mathbf{v}\) and \(\mathbf{v}'\) are the input and output embedding vectors, and \(V\) is the vocabulary size. The softmax denominator is prohibitively expensive for large vocabularies.

Negative Sampling

Instead of computing the full softmax, negative sampling approximates the objective by contrasting the true context pair against \(k\) randomly sampled negative pairs:

\[ \mathcal{L} = -\log \sigma(\mathbf{v}_{w_O}' \cdot \mathbf{v}_{w_I}) - \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{v}_{w_i}' \cdot \mathbf{v}_{w_I}) \right] \]

where \(\sigma\) is the sigmoid function and \(P_n(w) \propto f(w)^{3/4}\) is the noise distribution (the 3/4 exponent down-weights frequent words relative to their raw frequency).
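A minimal sketch of this noise distribution (the word counts below are invented for illustration):

```python
from collections import Counter


def noise_distribution(freqs: Counter[str], power: float = 0.75) -> dict[str, float]:
    """Unigram frequencies raised to the 3/4 power, renormalized to sum to 1."""
    weighted = {w: count ** power for w, count in freqs.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}


freqs = Counter({"the": 1000, "cat": 50, "sat": 40})
p_n = noise_distribution(freqs)
# "the" occurs 20x as often as "cat", but is only ~9.5x as likely to be
# drawn as a negative sample: the 3/4 power compresses the frequency gap
```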

/// details | Embedding arithmetic
The geometric structure of trained embeddings supports vector arithmetic that captures analogies:

\[ \mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}} \]

This works because the training objective implicitly factorizes a word co-occurrence matrix (Levy & Goldberg, 2014). Directions in the embedding space correspond to semantic or syntactic relationships: gender, tense, plurality, and more. The same principle -- meaningful directions in learned vector spaces -- reappears in the latent spaces of image generators, where vector arithmetic can interpolate between visual concepts.
///

Code Example

```python
from gensim.models import Word2Vec


def train_word2vec(
    sentences: list[list[str]],
    vector_size: int = 100,
    window: int = 5,
    negative: int = 5,
    min_count: int = 2,
) -> Word2Vec:
    """Train a Word2Vec skip-gram model on tokenized sentences.

    Returns a trained model whose .wv attribute provides the embedding lookup.
    """
    model = Word2Vec(
        sentences=sentences,
        vector_size=vector_size,
        window=window,
        sg=1,  # skip-gram (sg=0 would select CBOW)
        negative=negative,
        min_count=min_count,
        workers=4,
    )
    return model


def analogy(
    model: Word2Vec,
    positive: list[str],
    negative: list[str],
    topn: int = 3,
) -> list[tuple[str, float]]:
    """Solve an analogy using vector arithmetic.

    Example: analogy(model, ["king", "woman"], ["man"]) -> [("queen", 0.87), ...]
    """
    return model.wv.most_similar(positive=positive, negative=negative, topn=topn)
```

From static embeddings to contextual representations -- Word2Vec assigns a single vector to each word regardless of context ("bank" has the same embedding whether it refers to a river bank or a financial institution). The attention mechanism, introduced next, allows representations to vary dynamically based on the surrounding sequence -- the foundation of all modern language models.

The Attention Mechanism

Attention (Bahdanau et al., 2014; Vaswani et al., 2017) allows a model to dynamically focus on different parts of the input when producing each element of the output. The scaled dot-product attention formulation from "Attention Is All You Need" is:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \]

where:

  • \(Q \in \mathbb{R}^{n \times d_k}\) -- queries (what we are looking for)
  • \(K \in \mathbb{R}^{m \times d_k}\) -- keys (what is available to attend to)
  • \(V \in \mathbb{R}^{m \times d_v}\) -- values (the information to retrieve)
  • \(d_k\) -- key dimension

The \(\sqrt{d_k}\) scaling prevents the dot products from growing large in magnitude as \(d_k\) increases, which would push the softmax into regions of extremely small gradients.

Multi-Head Attention

Rather than performing a single attention operation, the Transformer projects \(Q\), \(K\), \(V\) into \(h\) different subspaces and computes attention in parallel:

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O \]
\[ \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \]

Each head can specialize -- one head might attend to syntactic structure, another to semantic similarity, another to positional proximity.

```mermaid
flowchart TD
    QKV["Input: Q, K, V"] --> Split["Split into h heads"]
    Split --> H1["Head 1: Attention"]
    Split --> H2["Head 2: Attention"]
    Split --> Hh["Head h: Attention"]
    H1 --> Cat["Concat"]
    H2 --> Cat
    Hh --> Cat
    Cat --> WO["Linear W^O"]
    WO --> Out["Output"]
```

Code Example

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention as described in Vaswani et al. (2017)."""

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        """Compute attention output.

        Args:
            q: Queries of shape (batch, heads, seq_q, d_k).
            k: Keys of shape (batch, heads, seq_k, d_k).
            v: Values of shape (batch, heads, seq_k, d_v).
            mask: Optional mask broadcastable to (batch, heads, seq_q, seq_k).

        Returns:
            Attention output of shape (batch, heads, seq_q, d_v).
        """
        d_k = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, v)
```

Transformer Architecture

The Transformer (Vaswani et al., 2017) replaced recurrence entirely with attention, enabling full parallelism over the sequence dimension. The original architecture is an encoder-decoder model.

```mermaid
flowchart TB
    subgraph Encoder["Encoder (x N)"]
        direction TB
        E_SA["Self-Attention"] --> E_FF["Feed-Forward"]
        E_FF --> E_LN["Add & LayerNorm"]
    end
    subgraph Decoder["Decoder (x N)"]
        direction TB
        D_MSA["Masked Self-Attention"] --> D_CA["Cross-Attention"]
        D_CA --> D_FF["Feed-Forward"]
        D_FF --> D_LN["Add & LayerNorm"]
    end
    Input["Input Embeddings + Positional Encoding"] --> Encoder
    Encoder --> Decoder
    Target["Target Embeddings + Positional Encoding"] --> Decoder
    Decoder --> Linear["Linear + Softmax"]
```

Key Architectural Choices

| Component | Purpose | Detail |
| --- | --- | --- |
| Residual connections | Enable gradient flow through deep stacks | Output = LayerNorm(x + Sublayer(x)) |
| Layer normalization | Stabilize activations | Applied after each sub-layer (post-norm) or before (pre-norm, more common now) |
| Positional encoding | Inject sequence order (attention is permutation-invariant) | Sinusoidal or learned embeddings |
| Feed-forward network | Per-position nonlinear transformation | Two linear layers with ReLU/GELU; typically \(d_{ff} = 4 \times d_{model}\) |

Sinusoidal Positional Encoding

Since attention has no inherent notion of order, position must be explicitly injected. The original Transformer uses fixed sinusoidal functions:

\[ PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]

/// details | Properties of sinusoidal positional encoding
The sinusoidal encoding has two useful properties:

  1. Relative position via linear projection. For any fixed offset \(k\), \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\). This means the model can learn to attend to relative positions without needing explicit relative position representations.

  2. Bounded magnitude. Sine and cosine values lie in \([-1, 1]\), so positional signals do not dominate the content embeddings regardless of sequence length. Each dimension oscillates at a different frequency, creating a unique "fingerprint" for every position.

In practice, most modern models use learned positional embeddings (GPT-2, GPT-3) or Rotary Position Embeddings (RoPE, used by LLaMA), which encode relative positions directly in the attention computation. RoPE rotates query and key vectors by angles proportional to their positions, making the dot product naturally depend on the distance between tokens.
///
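The two sinusoidal formulas above can be sketched directly (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np


def sinusoidal_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Build the (max_len, d_model) sinusoidal positional encoding matrix."""
    pos = np.arange(max_len)[:, None]          # positions: (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices: (1, d_model/2)
    angles = pos / (10000.0 ** (i / d_model))  # one frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe


pe = sinusoidal_encoding(max_len=50, d_model=8)
# Every entry lies in [-1, 1], and row 0 alternates sin(0)=0, cos(0)=1
```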


GPT-Style Autoregressive Decoders

GPT (Radford et al., 2018) simplified the Transformer by discarding the encoder entirely and using only the decoder stack with causal (left-to-right) masking. This decoder-only architecture is now the dominant paradigm for large language models.

Autoregressive Factorization

The model factors the probability of a sequence as a product of conditional distributions:

\[ p(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} p(x_t \mid x_1, \ldots, x_{t-1}) \]

At each step, a causal mask prevents attention to future positions, ensuring the model can only condition on past tokens. This enables efficient teacher-forced training on entire sequences in parallel.

```mermaid
flowchart LR
    subgraph CausalMask["Causal Attention Matrix"]
        direction TB
        M["
            1  0  0  0
            1  1  0  0
            1  1  1  0
            1  1  1  1
        "]
    end
    Tokens["x1, x2, x3, x4"] --> CausalMask
    CausalMask --> Predictions["p(x2|x1), p(x3|x1,x2), p(x4|x1..x3)"]
```
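The 4x4 mask in the diagram can be built as a lower-triangular boolean matrix (a small PyTorch sketch; the helper name is ours):

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask where position t may attend only to positions <= t."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


mask = causal_mask(4)
# mask[2] is [True, True, True, False]: token 3 sees tokens 1-3, never token 4
```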

Scaling Laws

Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") established that language model loss follows a power-law relationship with compute, parameters, and data:

\[ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} \]

where \(N\) is parameter count and \(\alpha_N \approx 0.076\). The Chinchilla result refined the compute-optimal allocation: for a given compute budget \(C\), parameters and training tokens should scale roughly equally (\(N \propto C^{0.5}\), \(D \propto C^{0.5}\)).

| Model | Parameters | Training Tokens | Compute-Optimal? |
| --- | --- | --- | --- |
| GPT-3 | 175B | 300B | Under-trained |
| Chinchilla | 70B | 1.4T | Yes (for its budget) |
| LLaMA 2 | 70B | 2T | Over-trained (for inference efficiency) |
| LLaMA 3 | 70B | 15T | Heavily over-trained |
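The table can also be read in tokens-per-parameter terms; the Chinchilla-optimal allocation is commonly summarized as roughly 20 training tokens per parameter (a heuristic reading of the paper, not an exact law). A quick check using the counts above:

```python
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA 2 70B": (70e9, 2e12),
    "LLaMA 3 70B": (70e9, 15e12),
}

for name, (params, tokens) in models.items():
    print(f"{name}: {tokens / params:.0f} tokens/parameter")
# GPT-3 lands near 2 (under-trained); Chinchilla at 20; the LLaMA models far above
```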

Tokenization: Byte Pair Encoding (BPE)

Language models do not operate on raw characters or whitespace-delimited words. Byte Pair Encoding (Sennrich et al., 2016) constructs a subword vocabulary by iteratively merging the most frequent adjacent token pair.

Algorithm

  1. Start with a vocabulary of individual characters (or bytes).
  2. Count all adjacent pairs in the training corpus.
  3. Merge the most frequent pair into a new token.
  4. Repeat from step 2 until the desired vocabulary size is reached.

This produces a vocabulary that strikes a balance between character-level tokenization (handles any input, but yields very long sequences) and word-level tokenization (semantic units and shorter sequences, but an unbounded vocabulary).

| Tokenizer | Algorithm | Used By |
| --- | --- | --- |
| BPE | Greedy merge by frequency | GPT-2, GPT-3, GPT-4 |
| WordPiece | Greedy merge by likelihood gain | BERT |
| SentencePiece | Unigram LM or BPE on raw text (no pre-tokenization) | T5, LLaMA |
| tiktoken | BPE with byte-level fallback | OpenAI models |
Code Example

```python
from collections import Counter


def learn_bpe_merges(
    corpus: list[str],
    num_merges: int,
) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a whitespace-tokenized corpus.

    Each word is represented as a tuple of characters plus an end-of-word marker.
    Returns the ordered list of merge pairs.
    """
    # Initialize: split every word into characters + end-of-word marker
    vocab: dict[tuple[str, ...], int] = {}
    for word in corpus:
        symbols = tuple(word) + ("</w>",)
        vocab[symbols] = vocab.get(symbols, 0) + 1

    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        pairs: Counter[tuple[str, str]] = Counter()
        for symbols, count in vocab.items():
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += count

        if not pairs:
            break

        best, _ = pairs.most_common(1)[0]
        merges.append(best)

        # Apply the merge to the vocabulary
        new_vocab: dict[tuple[str, ...], int] = {}
        for symbols, count in vocab.items():
            new_symbols: list[str] = []
            i = 0
            while i < len(symbols):
                if (
                    i < len(symbols) - 1
                    and symbols[i] == best[0]
                    and symbols[i + 1] == best[1]
                ):
                    new_symbols.append(best[0] + best[1])
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            new_vocab[tuple(new_symbols)] = count
        vocab = new_vocab

    return merges
```
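Encoding a new word with learned merges is the mirror image of training: replay each merge rule, in the order it was learned, wherever its pair occurs (a minimal sketch; bpe_encode and the example merge list are ours):

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment a word into subwords by replaying merge rules in training order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged: list[str] = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Merges as they might be learned from a corpus dominated by "low", "lower", ...
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>")]
print(bpe_encode("lower", merges))  # ['low', 'er</w>']
```

Unseen words fall back to smaller units automatically: any character pair with no merge rule simply stays split, which is how BPE avoids out-of-vocabulary failures.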

From text to pixels -- The Transformer architecture and attention mechanism were designed for sequences of text tokens, but the same principles transfer directly to images. The next section shows how generative models produce images -- first through the adversarial framework of GANs, then through the iterative denoising of diffusion models, and finally through Transformers applied directly to image patches. Continue to Image Generation.