Information Theory
“Entropy, cross-entropy, KL divergence — the math behind why loss functions work”
Entropy, cross-entropy loss, KL divergence, and mutual information — the mathematical backbone behind why cross-entropy works as a loss function, how VAEs work, and why transformers use attention.
Key Formulas
Entropy
H(p) = −Σₓ p(x) log₂ p(x)
Average "surprise" in bits: maximal when all outcomes are equally likely, zero when the outcome is deterministic.
Cross-Entropy Loss
H(p, q) = −Σₓ p(x) log q(x)
Expected code length for samples from p under a code designed for q (bits with log base 2, nats with the natural log). This is the standard classification loss.
KL Divergence
KL(p‖q) = Σₓ p(x) log(p(x) / q(x))
Extra bits needed to encode samples from p with a code optimized for q. Always ≥ 0, and equal to 0 iff p = q.
Mutual Information
I(X; Y) = H(X) − H(X | Y)
How much knowing Y reduces uncertainty about X. Used in feature selection and representation learning.
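As a minimal sketch of the mutual information entry above, the snippet below computes I(X; Y) from a joint probability table using the equivalent identity I(X; Y) = H(X) + H(Y) − H(X, Y). The helper names `entropy_bits` and `mutual_information` and the two example tables are illustrative assumptions, not part of the original material.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # 0 * log(0) = 0 by convention
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X;Y) in bits, where joint[i, j] = P(X=i, Y=j)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)  # marginal of X
    p_y = joint.sum(axis=0)  # marginal of Y
    return entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(joint)

# Two fair coins flipped independently: Y says nothing about X.
independent = np.outer([0.5, 0.5], [0.5, 0.5])
# Y is an exact copy of X: Y removes all uncertainty about X.
copied = np.array([[0.5, 0.0],
                   [0.0, 0.5]])

mi_independent = mutual_information(independent)
mi_copied = mutual_information(copied)
print(f"Independent coins: {mi_independent:.4f} bits")
print(f"Y copies X:        {mi_copied:.4f} bits")
```

The two cases bracket the possible values for a pair of fair binary variables: no shared information when they are independent, and the full entropy of X when Y determines X.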
Why Information Theory Underpins ML Loss Functions
When you train a classifier with cross-entropy loss, you are minimizing the average number of bits (or nats) needed to communicate the ground-truth labels using a code built from the model's predicted distribution. When a VAE maximizes the ELBO, the regularization term is a KL divergence between the learned latent distribution and a prior. When you score a decision tree split by information gain, you are computing the reduction in label entropy. These connections are not accidental: information theory gives a single principled framework that explains why loss functions that look ad hoc are well-motivated choices for their respective goals.
Cross-entropy decomposes as H(p, q) = H(p) + KL(p‖q). Since the entropy H(p) of the data distribution does not depend on the model, minimizing cross-entropy over q is equivalent to minimizing KL(p‖q), the divergence of the model q from the truth p.
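The decomposition can be checked numerically. The sketch below uses two arbitrary three-class distributions (chosen only for illustration) and natural logarithms, so all quantities are in nats.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])  # "true" distribution (illustrative values)
q = np.array([0.3, 0.5, 0.2])  # model distribution (illustrative values)

h_p = -np.sum(p * np.log(p))            # entropy H(p)
cross_ent = -np.sum(p * np.log(q))      # cross-entropy H(p, q)
kl_pq = np.sum(p * np.log(p / q))       # KL(p || q)

# The gap between the two sides of the identity should be ~0.
gap = cross_ent - (h_p + kl_pq)
print(f"H(p)            = {h_p:.6f}")
print(f"KL(p||q)        = {kl_pq:.6f}")
print(f"H(p, q)         = {cross_ent:.6f}")
print(f"H(p,q) - (H+KL) = {gap:.2e}")
```

Because H(p) stays constant as q changes, any change in q that lowers the cross-entropy lowers KL(p‖q) by the same amount.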
Entropy: Measuring Surprise
Think of entropy as the average surprise in a probability distribution. A fair coin (50/50) has entropy H = 1 bit — you gain exactly 1 bit of information on each flip. A biased coin (99/1) has near-zero entropy — you're rarely surprised. A uniform distribution over 256 outcomes has entropy H = 8 bits — you need 8 bits to describe each outcome. ML application: a well-calibrated model's predictions on a class boundary have high entropy (uncertain), and its predictions on clear examples have near-zero entropy (confident). Entropy-regularized RL (Soft Actor-Critic) maximizes expected reward PLUS entropy to encourage exploration.
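To illustrate the point about model confidence, here is a small sketch that turns logits into probabilities and measures the entropy of each prediction. The logit values are hypothetical, picked to mimic one clear example and one example near a boundary between two classes.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def predictive_entropy_bits(probs):
    """Entropy in bits of a single predicted class distribution."""
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]
    return float(-np.sum(nz * np.log2(nz)))

clear_logits = [8.0, 0.0, 0.0]     # hypothetical: one class dominates
boundary_logits = [1.0, 1.0, 0.0]  # hypothetical: two classes tie

h_clear = predictive_entropy_bits(softmax(clear_logits))
h_boundary = predictive_entropy_bits(softmax(boundary_logits))
h_max = float(np.log2(3))          # upper bound for 3 classes

print(f"Clear example:    {h_clear:.4f} bits")
print(f"Boundary example: {h_boundary:.4f} bits")
print(f"Maximum possible: {h_max:.4f} bits")
```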
Maximum entropy principle: given a set of constraints, choose the distribution that maximizes entropy subject to them. With only the mean and variance fixed (over the real line), this yields the Normal distribution, the choice that assumes the least beyond those constraints.
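As a rough illustration (a comparison of three candidates, not a proof), the sketch below uses `scipy.stats` to compute the differential entropy of a Normal, a Laplace, and a uniform distribution that all share mean 0 and variance 1. The choice of these three families is an assumption made for the example.

```python
import numpy as np
from scipy import stats

variance = 1.0

# Parameters chosen so that every candidate has mean 0 and the same variance:
#   Laplace(scale=b) has variance 2*b**2; Uniform of width w has variance w**2/12.
half_width = np.sqrt(3.0 * variance)
candidates = {
    "normal": stats.norm(loc=0.0, scale=np.sqrt(variance)),
    "laplace": stats.laplace(loc=0.0, scale=np.sqrt(variance / 2.0)),
    "uniform": stats.uniform(loc=-half_width, scale=2.0 * half_width),
}

variances = {name: float(dist.var()) for name, dist in candidates.items()}
entropies = {name: float(dist.entropy()) for name, dist in candidates.items()}

for name in candidates:
    print(f"{name:8s} variance={variances[name]:.3f} "
          f"entropy={entropies[name]:.4f} nats")
```

Among these three, the Normal distribution has the largest differential entropy, which is consistent with the maximum entropy principle for a fixed mean and variance.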
Entropy, Cross-Entropy & KL Divergence in Practice
import numpy as np
from scipy.stats import entropy as scipy_entropy


def entropy(p: np.ndarray, base: float = 2) -> float:
    """Shannon entropy H(p) in bits (base=2) or nats (base=e)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) = 0 by convention
    return -np.sum(p * np.log(p) / np.log(base))


def cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """H(p, q) = -sum p * log(q), in nats. eps guards against log(0)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p||q) in nats. NOT symmetric."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps)))


# ── 1. Entropy of various distributions ──────────────────────────────────────
print("Entropy examples (bits):")
print(f"  Fair coin [0.5, 0.5]:     {entropy([0.5, 0.5]):.4f}")    # 1.0 bit
print(f"  Biased coin [0.99, 0.01]: {entropy([0.99, 0.01]):.4f}")  # ≈ 0.08 bits
print(f"  Uniform 8 classes:        {entropy([1/8]*8):.4f}")       # 3.0 bits
print(f"  Certain [1.0, 0.0]:       {entropy([1.0, 0.0]):.4f}")    # 0.0 bits

# ── 2. Cross-entropy loss (classification) ────────────────────────────────────
# Ground truth (one-hot): cat
p_true = np.array([1., 0., 0.])
# Model predictions:
q_good = np.array([0.8, 0.1, 0.1])       # confident & correct
q_bad = np.array([0.1, 0.8, 0.1])        # confident & wrong
q_uncertain = np.array([0.4, 0.3, 0.3])  # uncertain & correct lean

print("\nCross-entropy losses (nats):")
print(f"  Good prediction:  {cross_entropy(p_true, q_good):.4f}")  # low
print(f"  Bad prediction:   {cross_entropy(p_true, q_bad):.4f}")   # high
print(f"  Uncertain but ok: {cross_entropy(p_true, q_uncertain):.4f}")
# H(p,q) = H(p) + KL(p||q). Since H(p)=0 for one-hot: CE = KL(p||q)
print(f"  KL(p_true||q_good) = {kl_divergence(p_true, q_good):.4f}")

# ── 3. KL divergence: asymmetry ───────────────────────────────────────────────
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.5, 0.2])
print(f"\nKL(p||q) = {kl_divergence(p, q):.4f}")
print(f"KL(q||p) = {kl_divergence(q, p):.4f}")  # different, so not a distance

# ── 4. Information gain in decision trees ─────────────────────────────────────
def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children (bits)."""
    n = len(parent)
    n_l, n_r = len(left), len(right)
    h_p = scipy_entropy(np.bincount(parent) / n, base=2)
    h_l = scipy_entropy(np.bincount(left) / n_l, base=2) if n_l > 0 else 0
    h_r = scipy_entropy(np.bincount(right) / n_r, base=2) if n_r > 0 else 0
    return h_p - (n_l/n * h_l + n_r/n * h_r)

# 10 samples: 6 class-0, 4 class-1. Split: left=[0,0,0,0,1], right=[0,0,1,1,1]
parent = np.array([0,0,0,0,0,0,1,1,1,1])
left = np.array([0,0,0,0,1])
right = np.array([0,0,1,1,1])
print(f"\nInformation gain: {information_gain(parent, left, right):.4f} bits")
KL Divergence in Modern ML
KL divergence appears throughout modern ML:
(1) VAE loss = reconstruction loss + KL(q(z|x) ‖ p(z)). The KL term regularizes the latent space toward the prior.
(2) Policy gradient RL. TRPO constrains the KL between the old and new policy, and PPO approximates that constraint with a clipped objective or a KL penalty, to avoid catastrophic updates.
(3) Knowledge distillation. The student network is trained to minimize the KL between its outputs and the teacher's soft predictions.
(4) RLHF (ChatGPT-style training). A KL penalty prevents the fine-tuned model from diverging too far from the base model during reward optimization.
The asymmetry of KL matters when q is the distribution being fit: KL(p‖q) penalizes q for assigning near-zero probability where p has mass (mode-covering), while KL(q‖p) penalizes q for placing mass where p is near zero (mode-seeking).
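For the VAE case, the KL term has a closed form under a common setup. The sketch below assumes a diagonal Gaussian encoder q(z|x) = N(μ, diag(σ²)) and a standard normal prior p(z) = N(0, I). The text above does not specify these choices, so treat them as assumptions. The function name and the log-variance parameterization are likewise illustrative.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over dimensions.

    Per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2)).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# Encoder output equal to the prior: no penalty.
kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Mean shifted by 1 in one dimension, unit variance: 0.5 * 1^2 = 0.5 nats.
kl_shifted = gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
# Variance inflated to 4 in one dimension: 0.5 * (4 - 1 - ln 4) nats.
kl_wide = gaussian_kl_to_standard_normal([0.0], [np.log(4.0)])

print(f"At the prior:  {kl_at_prior:.4f}")
print(f"Shifted mean:  {kl_shifted:.4f}")
print(f"Wide variance: {kl_wide:.4f}")
```

The penalty is zero only when the encoder output matches the prior exactly, and it grows as the mean moves away from zero or the variance moves away from one.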
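Forward KL (mode-covering) versus reverse KL (mode-seeking) is a fundamental design choice in generative modeling. Maximum-likelihood training, which VAEs approximate through the ELBO, corresponds to forward KL from the data distribution to the model and tends to cover all modes. The original GAN objective instead corresponds to a Jensen-Shannon divergence, and GANs are often observed to behave in a more mode-seeking way.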
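The mode-covering versus mode-seeking behavior can be seen in a toy experiment: fit a single Gaussian q to a bimodal target p by brute-force grid search, once under KL(p‖q) and once under KL(q‖p). Everything here (the grid, the mixture with modes at ±3, the search ranges) is an assumption chosen for illustration, and the densities are discretized on a grid rather than handled analytically.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)  # evaluation grid

def gaussian_pmf(mean, std):
    """Gaussian evaluated on the grid and normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mean) / std) ** 2)
    return w / w.sum()

def kl(a, b, eps=1e-12):
    """Discrete KL(a || b) in nats; eps guards against division by zero."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log((a[mask] + eps) / (b[mask] + eps))))

# Bimodal target: equal mixture of two narrow Gaussians at -3 and +3.
p = 0.5 * gaussian_pmf(-3.0, 0.5) + 0.5 * gaussian_pmf(3.0, 0.5)

# Brute-force search over the mean and standard deviation of q.
results = []
for m in np.linspace(-5.0, 5.0, 101):
    for s in np.linspace(0.25, 4.0, 76):
        q = gaussian_pmf(m, s)
        results.append((kl(p, q), kl(q, p), m, s))

_, _, mu_fwd, sigma_fwd = min(results, key=lambda r: r[0])  # forward KL(p||q)
_, _, mu_rev, sigma_rev = min(results, key=lambda r: r[1])  # reverse KL(q||p)

print(f"Forward KL fit: mean={mu_fwd:+.2f}, std={sigma_fwd:.2f}")
print(f"Reverse KL fit: mean={mu_rev:+.2f}, std={sigma_rev:.2f}")
```

Under these assumptions the forward-KL fit is a wide Gaussian centered between the two modes, while the reverse-KL fit is a narrow Gaussian sitting on one of them. Which mode it picks is arbitrary because the target is symmetric.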