Foundations · intermediate

Information Theory

“Entropy, cross-entropy, KL divergence — the math behind why loss functions work”

Entropy, cross-entropy loss, KL divergence, and mutual information — the mathematical backbone behind why cross-entropy works as a loss function, how VAEs work, and why transformers use attention.

35 min

7 diagrams

7 Concepts Covered

Prerequisites

→Probability & Statistics

Concepts Covered

Entropy, Cross-Entropy, KL Divergence, Mutual Information, Information Gain, Log Loss, Bits

Previous: Probability & Statistics Next: Linear & Logistic Regression

∑Key Formulas

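Entropy

H(p) = −Σ_x p(x) log₂ p(x)

Average 'surprise' in bits: maximum when all outcomes are equally likely, zero when the outcome is deterministic

Cross-Entropy Loss

H(p,q) = −Σ_x p(x) log₂ q(x)

Expected bits needed to encode samples from p using a code designed for q. This is the classification loss (ML libraries typically use the natural log, which gives nats instead of bits)

KL Divergence

KL(p‖q) = Σ_x p(x) log₂ [p(x) / q(x)] = H(p,q) − H(p)

Extra bits needed to encode p with a code optimized for q. Always ≥ 0, equals 0 iff p=q

Mutual Information

I(X;Y) = H(X) − H(X|Y) = Σ_{x,y} p(x,y) log₂ [p(x,y) / (p(x) p(y))]

How much knowing Y reduces uncertainty about X. Used in feature selection and representation learning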
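The mutual information entry above can be made concrete with a small joint probability table. The following is a minimal sketch: the helper name `mutual_information` and the three toy joint tables are assumptions chosen for illustration, not part of the original lesson.

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """I(X;Y) in bits from a joint probability table p(x, y) (hypothetical helper)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x), shape (n_x, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, n_y)
    mask = joint > 0                        # 0 * log(0) = 0 by convention
    ratio = joint[mask] / (px @ py)[mask]   # p(x,y) / (p(x) p(y))
    return float(np.sum(joint[mask] * np.log2(ratio)))

# Toy joint tables (assumed): rows index X, columns index Y.
independent = np.array([[0.25, 0.25], [0.25, 0.25]])  # Y tells you nothing about X
copy = np.array([[0.5, 0.0], [0.0, 0.5]])             # Y is an exact copy of X
noisy = np.array([[0.4, 0.1], [0.1, 0.4]])            # Y agrees with X 80% of the time

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
mi_noisy = mutual_information(noisy)

print(f"independent: {mi_independent:.4f} bits")
print(f"exact copy:  {mi_copy:.4f} bits")
print(f"noisy copy:  {mi_noisy:.4f} bits")
```

Independent variables share no information, an exact copy of a fair binary variable shares the full 1 bit, and the noisy copy lands in between.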


🎯

Why Information Theory Underpins ML Loss Functions

motivation

When you train a classifier with cross-entropy loss, you are minimizing the average number of bits needed to communicate the ground-truth labels using a code built from the model's predicted distribution. When a VAE maximizes the ELBO (equivalently, minimizes the negative ELBO as its loss), the regularization term is a KL divergence between the learned latent distribution and a prior. When you score a decision tree split with information gain, you are computing the reduction in entropy. The connection to information theory is not an accident: it provides a principled, unified framework that explains why these seemingly ad hoc loss functions are natural choices for their respective goals.

Cross-entropy H(p,q) = Entropy H(p) + KL(p‖q). Since H(p) depends only on the data and not on the model, minimizing cross-entropy over q is the same as minimizing KL(p‖q), the KL divergence between the truth p and the model q.
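The decomposition can be checked numerically. This is a small sketch with made-up distributions: `p`, `q_far`, and `q_near` are assumed values picked for illustration, and everything is computed in nats.

```python
import numpy as np

# Assumed toy distributions: p plays the role of the (soft) ground truth,
# q_far and q_near are two candidate model distributions.
p = np.array([0.7, 0.2, 0.1])
q_far = np.array([0.5, 0.3, 0.2])
q_near = np.array([0.65, 0.23, 0.12])

def H(p):
    """Entropy H(p) in nats (assumes all entries of p are > 0)."""
    return float(-np.sum(p * np.log(p)))

def CE(p, q):
    """Cross-entropy H(p, q) in nats."""
    return float(-np.sum(p * np.log(q)))

def KL(p, q):
    """KL(p || q) in nats."""
    return float(np.sum(p * np.log(p / q)))

for name, q in [("q_far", q_far), ("q_near", q_near)]:
    print(f"{name}: CE={CE(p, q):.4f}  H(p)+KL={H(p) + KL(p, q):.4f}  KL={KL(p, q):.4f}")

# H(p) is the same in both rows, so the gap in cross-entropy equals the gap in KL.
ce_gap = CE(p, q_far) - CE(p, q_near)
kl_gap = KL(p, q_far) - KL(p, q_near)
```

Because H(p) is a constant offset, any improvement in cross-entropy is exactly an improvement in KL(p‖q).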

💡

Entropy: Measuring Surprise

intuition

Think of entropy as the average surprise in a probability distribution. A fair coin (50/50) has entropy H = 1 bit: you gain exactly 1 bit of information on each flip. A biased coin (99/1) has low entropy (about 0.08 bits): you are rarely surprised. A uniform distribution over 256 outcomes has entropy H = 8 bits, so you need 8 bits to describe each outcome. ML application: a well-calibrated model's predictions near a class boundary have high entropy (uncertain), while its predictions on clear examples have near-zero entropy (confident). Entropy-regularized RL such as Soft Actor-Critic maximizes expected reward plus an entropy bonus on the policy to encourage exploration.
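To see the "confident vs uncertain" picture on model outputs, you can compute the entropy of softmax predictions. This is a rough sketch: the logits below are invented for illustration, and `softmax` and `predictive_entropy_bits` are hypothetical helper names.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy_bits(probs: np.ndarray) -> np.ndarray:
    """Entropy in bits of each probability vector along the last axis."""
    safe = np.where(probs > 0, probs, 1.0)   # log2(1) = 0, so zero-probability terms vanish
    return -np.sum(probs * np.log2(safe), axis=-1)

# Invented logits for three inputs of a 3-class classifier.
logits = np.array([
    [8.0, 0.0, 0.0],    # clear example: one class dominates
    [1.0, 1.0, 1.0],    # no preference at all: entropy is log2(3)
    [2.0, 1.9, -3.0],   # near the boundary between class 0 and class 1
])
h = predictive_entropy_bits(softmax(logits))
for row, value in zip(logits, h):
    print(f"logits={row}  entropy={value:.4f} bits")

# The 256-outcome uniform distribution from the text: exactly 8 bits.
h_uniform_256 = float(predictive_entropy_bits(np.full(256, 1 / 256)))
print(f"uniform over 256 outcomes: {h_uniform_256:.4f} bits")
```

Predictive entropy computed this way is a common, simple uncertainty score, although it is only as trustworthy as the model's calibration.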

Maximum entropy principle: given constraints, choose the distribution that maximizes entropy. Among all distributions on the real line with a fixed mean and variance, this yields the Normal distribution, which makes it the least informative (least assumptive) choice under those constraints.
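A quick sanity check of the maximum entropy claim: compare the differential entropy of three distributions that share the same variance. The sketch below uses the standard closed-form entropies (in nats); the choice of Laplace and uniform as comparison distributions and the value of `sigma` are assumptions made for illustration.

```python
import math

sigma = 1.0  # common standard deviation for all three distributions (assumed)

# Normal(mu, sigma^2): h = 0.5 * ln(2 * pi * e * sigma^2)
h_normal = 0.5 * math.log(2 * math.pi * math.e * sigma**2)

# Laplace with scale b has variance 2 * b^2, so b = sigma / sqrt(2); h = 1 + ln(2b)
b = sigma / math.sqrt(2)
h_laplace = 1.0 + math.log(2 * b)

# Uniform of width w has variance w^2 / 12, so w = sigma * sqrt(12); h = ln(w)
w = sigma * math.sqrt(12)
h_uniform = math.log(w)

print(f"Normal : {h_normal:.4f} nats")
print(f"Laplace: {h_laplace:.4f} nats")
print(f"Uniform: {h_uniform:.4f} nats")
```

With the variance held equal, the Normal comes out on top, which is what the maximum entropy principle predicts for mean and variance constraints.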

</>

Entropy, Cross-Entropy & KL Divergence in Practice

code

python

import numpy as np
from scipy.special import xlogy    # xlogy(x, y) = x * log(y), with 0 * log(0) = 0
from scipy.stats import entropy as scipy_entropy

def entropy(p: np.ndarray, base: float = 2) -> float:
    """Shannon entropy H(p) in bits (base=2) or nats (base=e)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # 0 * log(0) = 0 by convention
    h = -np.sum(p * np.log(p)) / np.log(base)
    return float(h) + 0.0         # + 0.0 turns -0.0 into 0.0 so a certain outcome prints as 0.0000

def cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """H(p, q) = -sum p * log(q), in nats (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # eps keeps the log finite if q is exactly 0 where p > 0
    return float(-np.sum(xlogy(p, q + eps)))

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p||q) in nats. NOT symmetric: KL(p||q) != KL(q||p) in general."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log((p[mask] + eps) / (q[mask] + eps))))

# ── 1. Entropy of various distributions ──────────────────────────────────────
print("Entropy examples (bits):")
print(f"  Fair coin [0.5, 0.5]:        {entropy([0.5, 0.5]):.4f}")  # 1.0 bit
print(f"  Biased coin [0.99, 0.01]:    {entropy([0.99, 0.01]):.4f}")  # ≈ 0.08 bits
print(f"  Uniform 8 classes:           {entropy([1/8]*8):.4f}")  # 3.0 bits
print(f"  Certain [1.0, 0.0]:          {entropy([1.0, 0.0]):.4f}")  # 0.0 bits

# ── 2. Cross-entropy loss (classification) ────────────────────────────────────
# Ground truth (one-hot): cat
p_true = np.array([1., 0., 0.])       # cat
# Model predictions:
q_good = np.array([0.8, 0.1, 0.1])   # confident & correct
q_bad  = np.array([0.1, 0.8, 0.1])   # confident & wrong
q_uncertain = np.array([0.4, 0.3, 0.3])  # uncertain & correct lean

print("\nCross-entropy losses (nats):")
print(f"  Good prediction:    {cross_entropy(p_true, q_good):.4f}")   # low
print(f"  Bad prediction:     {cross_entropy(p_true, q_bad):.4f}")    # high
print(f"  Uncertain but ok:   {cross_entropy(p_true, q_uncertain):.4f}")

# H(p,q) = H(p) + KL(p||q) when all terms use the same log base.
# Since H(p)=0 for one-hot labels: CE = KL(p||q)
print(f"  KL(p_true||q_good) = {kl_divergence(p_true, q_good):.4f}")

# ── 3. KL divergence: asymmetry ───────────────────────────────────────────────
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.5, 0.2])
print(f"\nKL(p||q) = {kl_divergence(p,q):.4f}")
print(f"KL(q||p) = {kl_divergence(q,p):.4f}")  # different — not a distance

# ── 4. Information gain in decision trees ─────────────────────────────────────
def information_gain(parent, left, right):
    n = len(parent)
    n_l, n_r = len(left), len(right)
    h_p = scipy_entropy(np.bincount(parent) / n, base=2)
    h_l = scipy_entropy(np.bincount(left)   / n_l, base=2) if n_l > 0 else 0
    h_r = scipy_entropy(np.bincount(right)  / n_r, base=2) if n_r > 0 else 0
    return h_p - (n_l/n * h_l + n_r/n * h_r)

# 10 samples: 6 class-0, 4 class-1. Split: left=[0,0,0,0,1], right=[0,0,1,1,1]
parent = np.array([0,0,0,0,0,0,1,1,1,1])
left   = np.array([0,0,0,0,1])
right  = np.array([0,0,1,1,1])
print(f"\nInformation gain: {information_gain(parent, left, right):.4f} bits")

🔭

KL Divergence in Modern ML

insight

KL divergence appears throughout modern ML: (1) VAE loss = reconstruction loss + KL(q(z|x) ‖ p(z)), where the KL term regularizes the latent space toward the prior. (2) Policy gradient RL: TRPO constrains the KL between the old and new policy, and PPO limits the same kind of policy change with a clipped objective or a KL penalty, to avoid catastrophic updates. (3) Knowledge distillation: train a student network to minimize the KL between the teacher's soft predictions and its own outputs. (4) RLHF (ChatGPT-style training): a KL penalty keeps the fine-tuned model from drifting too far from the base model during reward optimization. The asymmetry of KL matters: KL(p‖q) penalizes q for assigning near-zero probability where p has mass (mode-covering), while KL(q‖p) penalizes q for placing mass where p is near zero (mode-seeking).
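For the VAE case, the KL term has a well-known closed form when q(z|x) is a diagonal Gaussian and the prior p(z) is a standard normal. The sketch below assumes exactly that setup (the common choice, though not the only one); the function name and the example `mu` and `log_var` values are made up for illustration, and a Monte Carlo estimate is included only as a cross-check.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over latent dims."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# Assumed encoder outputs for one input x, with a 2-dimensional latent space.
mu = np.array([1.0, -0.5])
log_var = np.array([np.log(0.25), 0.0])

kl_at_prior = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # q equals the prior
kl_closed = gaussian_kl_to_standard_normal(mu, log_var)

# Monte Carlo cross-check: average of log q(z) - log p(z) over samples z ~ q.
rng = np.random.default_rng(0)
std = np.exp(0.5 * log_var)
z = mu + std * rng.standard_normal((200_000, 2))
# The 2*pi normalizing constants cancel between log q and log p.
log_ratio = np.sum(-0.5 * log_var - 0.5 * ((z - mu) / std) ** 2 + 0.5 * z**2, axis=1)
kl_mc = float(np.mean(log_ratio))

print(f"KL at prior:    {kl_at_prior:.4f}")
print(f"KL closed form: {kl_closed:.4f}")
print(f"KL Monte Carlo: {kl_mc:.4f}")
```

The closed form is why this term is cheap to compute during training: it needs only the encoder's mean and log-variance, with no sampling.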

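Forward KL (mode-covering) vs reverse KL (mode-seeking) is a fundamental design choice in generative models. Maximum-likelihood training, which a VAE approximates through the ELBO, minimizes forward KL from the data distribution to the model and tends to cover all modes. The VAE's encoder, by contrast, is fitted with a reverse KL, since the learned q(z|x) appears first in the KL term. The original GAN objective corresponds to the Jensen-Shannon divergence rather than either KL direction, but GANs tend to behave in a mode-seeking way in practice (mode collapse).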
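The mode-covering vs mode-seeking behaviour is easy to reproduce on a toy problem: fit a single Gaussian q to a bimodal target p, once by minimizing KL(p‖q) and once by minimizing KL(q‖p). The following is a sketch under stated assumptions: the target mixture, the grid, and the brute-force search over (mu, sigma) are all illustrative choices, and distributions are discretized on a grid rather than handled analytically.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 801)  # evaluation grid (assumed range and resolution)

def log_gauss(mu: float, sigma: float) -> np.ndarray:
    """Log-probabilities of a Gaussian discretized on x and renormalized to sum to 1."""
    logits = -0.5 * ((x - mu) / sigma) ** 2
    return logits - np.log(np.sum(np.exp(logits)))

def kl(log_a: np.ndarray, log_b: np.ndarray) -> float:
    """Discrete KL(a || b) in nats, computed from log-probabilities."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)))

# Bimodal target p: equal mixture of two narrow Gaussians centred at -2 and +2.
log_p = np.logaddexp(log_gauss(-2.0, 0.5), log_gauss(2.0, 0.5)) + np.log(0.5)

# Brute-force search over single-Gaussian candidates q.
mus = np.linspace(-3.0, 3.0, 61)
sigmas = np.linspace(0.3, 3.0, 28)

best_fwd = min((kl(log_p, log_gauss(m, s)), m, s) for m in mus for s in sigmas)  # KL(p || q)
best_rev = min((kl(log_gauss(m, s), log_p), m, s) for m in mus for s in sigmas)  # KL(q || p)

_, mu_fwd, sigma_fwd = best_fwd
_, mu_rev, sigma_rev = best_rev
print(f"forward KL(p||q): mu={mu_fwd:+.2f}, sigma={sigma_fwd:.2f}")
print(f"reverse KL(q||p): mu={mu_rev:+.2f}, sigma={sigma_rev:.2f}")
```

The forward-KL fit sits between the two modes with a wide sigma so that it covers both, while the reverse-KL fit locks onto one mode with a narrow sigma. Which of the two symmetric modes the reverse fit picks is arbitrary here.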
