Natural Language Processing¶
Word Embeddings¶
Before neural approaches, words were represented as one-hot vectors -- sparse, high-dimensional, and devoid of semantic structure. Word2Vec (Mikolov et al., 2013) changed this by learning dense, low-dimensional vectors where geometric relationships encode meaning.
Skip-Gram and CBOW¶
Word2Vec offers two training objectives:
- Skip-gram -- Given a center word, predict the surrounding context words. Works well for rare words.
- CBOW (Continuous Bag of Words) -- Given surrounding context words, predict the center word. Faster to train, slightly better for frequent words.
For skip-gram, the objective maximizes the probability of observing context word \(w_O\) given center word \(w_I\):

$$
p(w_O \mid w_I) = \frac{\exp\!\left(\mathbf{v}'^{\top}_{w_O} \mathbf{v}_{w_I}\right)}{\sum_{w=1}^{V} \exp\!\left(\mathbf{v}'^{\top}_{w} \mathbf{v}_{w_I}\right)}
$$

where \(\mathbf{v}\) and \(\mathbf{v}'\) are the input and output embedding vectors, and \(V\) is the vocabulary size. The softmax denominator is prohibitively expensive for large vocabularies.
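To make the \(O(V)\) cost concrete, here is a toy full-softmax computation in plain Python. The four-word vocabulary and all vector values are invented for illustration:

```python
import math

# Toy setup: a 4-word vocabulary with 2-dimensional embeddings (invented values).
# v_in[w] is the input ("center") vector, v_out[w] the output ("context") vector.
v_in = {"king": [1.0, 0.5], "man": [0.8, 0.1], "woman": [0.2, 0.9], "queen": [0.4, 1.0]}
v_out = {"king": [0.9, 0.4], "man": [0.7, 0.2], "woman": [0.1, 0.8], "queen": [0.3, 1.1]}


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def p_context_given_center(context: str, center: str) -> float:
    """Full-softmax skip-gram probability p(w_O | w_I).

    The denominator sums over the entire vocabulary -- this is the O(V)
    cost per training pair that negative sampling avoids.
    """
    scores = {w: dot(v_out[w], v_in[center]) for w in v_out}
    denom = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[context]) / denom


probs = {w: p_context_given_center(w, "king") for w in v_out}
assert abs(sum(probs.values()) - 1.0) < 1e-9  # the softmax normalizes over V
```

With a real vocabulary of 100k+ words, that denominator is recomputed for every training pair, which is what motivates the approximation below.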
Negative Sampling¶
Instead of computing the full softmax, negative sampling approximates the objective by contrasting the true context pair against \(k\) randomly sampled negative pairs:

$$
\log \sigma\!\left(\mathbf{v}'^{\top}_{w_O} \mathbf{v}_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\!\left[\log \sigma\!\left(-\mathbf{v}'^{\top}_{w_i} \mathbf{v}_{w_I}\right)\right]
$$

where \(\sigma\) is the sigmoid function and \(P_n(w) \propto f(w)^{3/4}\) is the noise distribution (the 3/4 exponent down-weights frequent words relative to their raw frequency).
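The noise distribution is straightforward to implement. A sketch with invented toy frequencies:

```python
import random

# Hypothetical corpus word counts, invented for illustration.
freq = {"the": 1000, "of": 600, "bank": 50, "river": 20}

# P_n(w) proportional to f(w)^(3/4): down-weights very frequent words.
weights = {w: f ** 0.75 for w, f in freq.items()}
total = sum(weights.values())
noise_dist = {w: wgt / total for w, wgt in weights.items()}


def sample_negatives(k: int, rng: random.Random) -> list[str]:
    """Draw k negative words from the noise distribution."""
    words = list(noise_dist)
    probs = [noise_dist[w] for w in words]
    return rng.choices(words, weights=probs, k=k)


# "the" occurs 50x more often than "river", but is sampled only about
# 19x more often: (1000/20)^0.75 = 18.8. The exponent compresses the gap.
ratio = noise_dist["the"] / noise_dist["river"]
```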
/// details | Embedding arithmetic
The geometric structure of trained embeddings supports vector arithmetic that captures analogies:

$$
\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}
$$

This works because the training objective implicitly factorizes a word co-occurrence matrix (Levy & Goldberg, 2014). Directions in the embedding space correspond to semantic or syntactic relationships: gender, tense, plurality, and more. The same principle -- meaningful directions in learned vector spaces -- reappears in the latent spaces of image generators, where vector arithmetic can interpolate between visual concepts.
///
Code Example¶
```python
from gensim.models import Word2Vec


def train_word2vec(
    sentences: list[list[str]],
    vector_size: int = 100,
    window: int = 5,
    negative: int = 5,
    min_count: int = 2,
) -> Word2Vec:
    """Train a Word2Vec skip-gram model on tokenized sentences.

    Returns a trained model whose .wv attribute provides the embedding lookup.
    """
    model = Word2Vec(
        sentences=sentences,
        vector_size=vector_size,
        window=window,
        sg=1,  # skip-gram
        negative=negative,
        min_count=min_count,
        workers=4,
    )
    return model


def analogy(
    model: Word2Vec,
    positive: list[str],
    negative: list[str],
    topn: int = 3,
) -> list[tuple[str, float]]:
    """Solve an analogy using vector arithmetic.

    Example: analogy(model, ["king", "woman"], ["man"]) -> [("queen", 0.87), ...]
    """
    return model.wv.most_similar(positive=positive, negative=negative, topn=topn)
```
From static embeddings to contextual representations -- Word2Vec assigns a single vector to each word regardless of context ("bank" has the same embedding whether it refers to a river bank or a financial institution). The attention mechanism, introduced next, allows representations to vary dynamically based on the surrounding sequence -- the foundation of all modern language models.
The Attention Mechanism¶
Attention (Bahdanau et al., 2014; Vaswani et al., 2017) allows a model to dynamically focus on different parts of the input when producing each element of the output. The scaled dot-product attention formulation from "Attention Is All You Need" is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

where:
- \(Q \in \mathbb{R}^{n \times d_k}\) -- queries (what we are looking for)
- \(K \in \mathbb{R}^{m \times d_k}\) -- keys (what is available to attend to)
- \(V \in \mathbb{R}^{m \times d_v}\) -- values (the information to retrieve)
- \(d_k\) -- key dimension
The \(\sqrt{d_k}\) scaling prevents the dot products from growing large in magnitude as \(d_k\) increases, which would push the softmax into regions of extremely small gradients.
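This effect is easy to verify empirically: for query and key components with zero mean and unit variance, the dot product \(q \cdot k\) has variance \(d_k\), so its standard deviation grows as \(\sqrt{d_k}\). A quick pure-Python check (the function name and trial counts are illustrative):

```python
import math
import random


def dot_product_std(d_k: int, trials: int = 2000, seed: int = 0) -> float:
    """Empirical std of q . k for random vectors with unit-variance components."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / trials)


# The std grows like sqrt(d_k): roughly 8 for d_k = 64, roughly 16 for d_k = 256.
# Dividing the scores by sqrt(d_k) restores unit scale before the softmax.
```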
Multi-Head Attention¶
Rather than performing a single attention operation, the Transformer projects \(Q\), \(K\), \(V\) into \(h\) different subspaces and computes attention in parallel:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
\qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
$$
Each head can specialize -- one head might attend to syntactic structure, another to semantic similarity, another to positional proximity.
```mermaid
flowchart TD
    QKV["Input: Q, K, V"] --> Split["Split into h heads"]
    Split --> H1["Head 1: Attention"]
    Split --> H2["Head 2: Attention"]
    Split --> Hh["Head h: Attention"]
    H1 --> Cat["Concat"]
    H2 --> Cat
    Hh --> Cat
    Cat --> WO["Linear W^O"]
    WO --> Out["Output"]
```
Code Example¶
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class ScaledDotProductAttention(nn.Module):
    """Scaled dot-product attention as described in Vaswani et al. (2017)."""

    def forward(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        """Compute attention output.

        Args:
            q: Queries of shape (batch, heads, seq_q, d_k).
            k: Keys of shape (batch, heads, seq_k, d_k).
            v: Values of shape (batch, heads, seq_k, d_v).
            mask: Optional mask broadcastable to (batch, heads, seq_q, seq_k).

        Returns:
            Attention output of shape (batch, heads, seq_q, d_v).
        """
        d_k = q.size(-1)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return torch.matmul(weights, v)
```
Transformer Architecture¶
The Transformer (Vaswani et al., 2017) replaced recurrence entirely with attention, enabling full parallelism over the sequence dimension. The original architecture is an encoder-decoder model.
```mermaid
flowchart TB
    subgraph Encoder["Encoder (x N)"]
        direction TB
        E_SA["Self-Attention"] --> E_FF["Feed-Forward"]
        E_FF --> E_LN["Add & LayerNorm"]
    end
    subgraph Decoder["Decoder (x N)"]
        direction TB
        D_MSA["Masked Self-Attention"] --> D_CA["Cross-Attention"]
        D_CA --> D_FF["Feed-Forward"]
        D_FF --> D_LN["Add & LayerNorm"]
    end
    Input["Input Embeddings + Positional Encoding"] --> Encoder
    Encoder --> Decoder
    Target["Target Embeddings + Positional Encoding"] --> Decoder
    Decoder --> Linear["Linear + Softmax"]
```
Key Architectural Choices¶
| Component | Purpose | Detail |
|---|---|---|
| Residual connections | Enable gradient flow through deep stacks | Output = LayerNorm(x + Sublayer(x)) |
| Layer normalization | Stabilize activations | Applied after each sub-layer (post-norm) or before (pre-norm, more common now) |
| Positional encoding | Inject sequence order (attention is permutation-invariant) | Sinusoidal or learned embeddings |
| Feed-forward network | Per-position nonlinear transformation | Two linear layers with ReLU/GELU; typically \(d_{ff} = 4 \times d_{model}\) |
Sinusoidal Positional Encoding¶
Since attention has no inherent notion of order, position must be explicitly injected. The original Transformer uses fixed sinusoidal functions:

$$
PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),
\qquad
PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
/// details | Properties of sinusoidal positional encoding
The sinusoidal encoding has two useful properties:

- Relative position via linear projection -- For any fixed offset \(k\), \(PE_{pos+k}\) can be expressed as a linear function of \(PE_{pos}\). This means the model can learn to attend to relative positions without needing explicit relative position representations.
- Bounded magnitude -- Sine and cosine values lie in \([-1, 1]\), so positional signals do not dominate the content embeddings regardless of sequence length. Each dimension oscillates at a different frequency, creating a unique "fingerprint" for every position.

In practice, most modern models (GPT, LLaMA) use learned positional embeddings or Rotary Position Embeddings (RoPE), which encode relative positions directly in the attention computation. RoPE rotates query and key vectors by angles proportional to their positions, making the dot product naturally depend on the distance between tokens.
///
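The sinusoidal encoding itself takes only a few lines to compute. A plain-Python sketch (the function name is illustrative):

```python
import math


def sinusoidal_positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Return a (max_len, d_model) table of sinusoidal position encodings.

    Even dimensions use sine, odd dimensions cosine; wavelengths form a
    geometric progression, so each position gets a unique fingerprint.
    """
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
# All values are bounded in [-1, 1]; position 0 encodes as (0, 1, 0, 1, ...).
```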
GPT-Style Autoregressive Decoders¶
GPT (Radford et al., 2018) simplified the Transformer by discarding the encoder entirely and using only the decoder stack with causal (left-to-right) masking. This decoder-only architecture is now the dominant paradigm for large language models.
Autoregressive Factorization¶
The model factors the probability of a sequence as a product of conditional distributions:

$$
p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})
$$
At each step, a causal mask prevents attention to future positions, ensuring the model can only condition on past tokens. This enables efficient teacher-forced training on entire sequences in parallel.
```mermaid
flowchart LR
    subgraph CausalMask["Causal Attention Matrix"]
        direction TB
        M["
        1 0 0 0
        1 1 0 0
        1 1 1 0
        1 1 1 1
        "]
    end
    Tokens["x1, x2, x3, x4"] --> CausalMask
    CausalMask --> Predictions["p(x2|x1), p(x3|x1,x2), p(x4|x1..x3)"]
```
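The lower-triangular mask in the diagram can be built with `torch.tril` and applied the same way as in the attention code earlier. A short sketch:

```python
import torch


def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position t may attend to positions <= t."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


mask = causal_mask(4)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])

# Masked positions get -inf before the softmax, so they receive zero weight.
scores = torch.randn(4, 4)
masked = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked, dim=-1)
# Row t now assigns zero probability to every future position.
```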
Scaling Laws¶
Kaplan et al. (2020) and Hoffmann et al. (2022, "Chinchilla") established that language model loss follows a power-law relationship with compute, parameters, and data:

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
$$
where \(N\) is parameter count and \(\alpha_N \approx 0.076\). The Chinchilla result refined the compute-optimal allocation: for a given compute budget \(C\), parameters and training tokens should scale roughly equally (\(N \propto C^{0.5}\), \(D \propto C^{0.5}\)).
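The allocation can be sketched numerically using the common approximation \(C \approx 6ND\) (training FLOPs for \(N\) parameters over \(D\) tokens) and a rough ratio of about 20 tokens per parameter. Both constants are approximations, not exact values from the papers:

```python
def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a compute budget C ~= 6 * N * D into parameters N and tokens D.

    With D = r * N, solving 6 * N * (r * N) = C gives N = sqrt(C / (6 * r)).
    Both N and D then scale as sqrt(C), matching the Chinchilla result.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Under these assumptions, a 1e24-FLOP budget suggests a model of roughly
# 90B parameters trained on roughly 1.8T tokens.
n, d = chinchilla_allocation(1e24)
```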
| Model | Parameters | Training Tokens | Compute-Optimal? |
|---|---|---|---|
| GPT-3 | 175B | 300B | Under-trained |
| Chinchilla | 70B | 1.4T | Yes (for its budget) |
| LLaMA 2 | 70B | 2T | Over-trained (for inference efficiency) |
| LLaMA 3 | 70B | 15T | Heavily over-trained |
Tokenization: Byte Pair Encoding (BPE)¶
Language models do not operate on raw characters or whitespace-delimited words. Byte Pair Encoding (Sennrich et al., 2016) constructs a subword vocabulary by iteratively merging the most frequent adjacent token pair.
Algorithm¶
- Start with a vocabulary of individual characters (or bytes).
- Count all adjacent pairs in the training corpus.
- Merge the most frequent pair into a new token.
- Repeat from step 2 until the desired vocabulary size is reached.
This produces a vocabulary that balances the character-level extreme (handles any input, but yields very long sequences) against the word-level extreme (semantic units and shorter sequences, but fails on unseen words).
| Tokenizer | Algorithm | Used By |
|---|---|---|
| BPE | Greedy merge by frequency | GPT-2, GPT-3, GPT-4 |
| WordPiece | Greedy merge by likelihood gain | BERT |
| SentencePiece | Unigram LM or BPE on raw text (no pre-tokenization) | T5, LLaMA |
| tiktoken | BPE with byte-level fallback | OpenAI models |
```python
from collections import Counter
from typing import Counter as CounterType


def learn_bpe_merges(
    corpus: list[str],
    num_merges: int,
) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a whitespace-tokenized corpus.

    Each word is represented as a tuple of characters plus an end-of-word marker.
    Returns the ordered list of merge pairs.
    """
    # Initialize: split every word into characters + end-of-word marker
    vocab: dict[tuple[str, ...], int] = {}
    for word in corpus:
        symbols = tuple(word) + ("</w>",)
        vocab[symbols] = vocab.get(symbols, 0) + 1

    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs: CounterType[tuple[str, str]] = Counter()
        for symbols, count in vocab.items():
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += count
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)

        # Apply the merge to the vocabulary
        new_vocab: dict[tuple[str, ...], int] = {}
        for symbols, count in vocab.items():
            new_symbols: list[str] = []
            i = 0
            while i < len(symbols):
                if (
                    i < len(symbols) - 1
                    and symbols[i] == best[0]
                    and symbols[i + 1] == best[1]
                ):
                    new_symbols.append(best[0] + best[1])
                    i += 2
                else:
                    new_symbols.append(symbols[i])
                    i += 1
            new_vocab[tuple(new_symbols)] = count
        vocab = new_vocab

    return merges
```
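To tokenize new text, the learned merges are replayed in order against each word. A minimal sketch (the merge list shown is hypothetical, not computed from a real corpus; real tokenizers also cache segmentations):

```python
def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Segment a single word using learned merge rules, in learned order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        i = 0
        merged: list[str] = []
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges, in the order a learner might produce them:
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]
tokens = apply_bpe("lowest", merges)  # -> ["low", "est</w>"]
```

Note how "lowest", never seen as a whole, still segments into meaningful subwords learned from other words -- the property that lets BPE handle any input.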
From text to pixels -- The Transformer architecture and attention mechanism were designed for sequences of text tokens, but the same principles transfer directly to images. The next section shows how generative models produce images -- first through the adversarial framework of GANs, then through the iterative denoising of diffusion models, and finally through Transformers applied directly to image patches. Continue to Image Generation.