Calculus & Optimization
“From derivatives to gradient descent — the engine that trains every neural network”
Derivatives, partial derivatives, the chain rule (= backpropagation), and gradient descent. Then Adam, momentum, learning rate scheduling — the full story of how neural networks actually learn.
Key Formulas
Gradient
Vector of partial derivatives — points in the direction of steepest ascent
Chain Rule
The backbone of backpropagation — compose derivatives through layers
Gradient Descent
Iteratively move opposite to the gradient to minimize loss L
Adam Update
Gradient descent with adaptive per-parameter learning rates (bias-corrected 1st & 2nd moments)
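In standard notation, the four formulas named above can be written as follows. This is a sketch that reuses the symbols from the Adam steps later in this section (θ for parameters, η for the learning rate, m̂ and v̂ for the bias-corrected moments); y = f(u) and u = g(x) are illustrative names chosen here for the chain rule.

```latex
\begin{align*}
\text{Gradient:} \quad
  & \nabla L(\theta) = \left[ \frac{\partial L}{\partial \theta_1}, \ldots,
    \frac{\partial L}{\partial \theta_n} \right]^{\top} \\
\text{Chain rule:} \quad
  & \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
    \quad \text{for } y = f(u),\; u = g(x) \\
\text{Gradient descent:} \quad
  & \theta_t = \theta_{t-1} - \eta \, \nabla_{\theta} L(\theta_{t-1}) \\
\text{Adam update:} \quad
  & \theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}
\end{align*}
```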
Optimization Is What Makes Models Learn
Training a machine learning model is an optimization problem: find the parameters θ that minimize the loss function L(θ). Gradient descent is the workhorse algorithm that solves this for problems with millions of parameters where closed-form solutions don't exist. The chain rule makes it possible to compute gradients through arbitrarily deep compositions of functions — that's backpropagation. Without calculus, there is no learning: every weight update in every neural network, every boosted tree fitted to residuals, every SVM soft-margin solution — all of it is optimization.
GPT-3 has ~175 billion parameters. A single backward pass computes the gradient for all of them via the chain rule, and the optimizer then updates every parameter in one step.
The Gradient as a Direction in Parameter Space
Imagine the loss function as a hilly landscape and your parameters as your position. The gradient ∇L(θ) is an arrow pointing uphill. Moving in the opposite direction (−η∇L) goes downhill, toward lower loss. The learning rate η controls step size: too large and the iterates overshoot and can diverge; too small and training takes forever. The classic problem: an elongated bowl (ill-conditioned loss surface) makes vanilla gradient descent zigzag across the valley instead of heading straight to the minimum. Adam mitigates this by scaling each parameter's step by a running estimate of its gradient magnitude, which gives each parameter its own effective learning rate.
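The zigzag on an elongated bowl can be reproduced in a few lines. The sketch below assumes a hypothetical quadratic loss with curvature 1 along x and 50 along y; the function names, starting point, and learning rates are illustrative choices, not values from the text.

```python
import numpy as np

# Hypothetical ill-conditioned bowl: L(x, y) = 0.5 * (x^2 + 50 * y^2)
def grad(theta):
    return np.array([1.0 * theta[0], 50.0 * theta[1]])

def run_gd(lr, n_steps=100):
    """Vanilla gradient descent from (10, 1); returns the full path."""
    theta = np.array([10.0, 1.0])
    path = [theta.copy()]
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)
        path.append(theta.copy())
    return np.array(path)

# lr just below the stability limit 2/50 = 0.04 for the steep y direction
path = run_gd(lr=0.035)
y = path[:, 1]
sign_flips = int(np.sum(np.sign(y[1:]) != np.sign(y[:-1])))
print(f"y changed sign {sign_flips} times in 100 steps (zigzag)")
print(f"x after 100 steps: {path[-1, 0]:.4f} (still far from 0)")

# lr above the limit: the steep direction blows up
diverged = run_gd(lr=0.05)
print(f"|y| after 100 steps with lr=0.05: {abs(diverged[-1, 1]):.3e}")
```

The steep direction forces a small learning rate, and that same small rate makes progress along the shallow direction slow. This is the trade-off that per-parameter scaling is meant to ease.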
Intuition for the chain rule: if temperature change affects pressure, and pressure affects volume, how does temperature affect volume? Multiply the individual sensitivities.
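The same multiply-the-sensitivities idea extends to a small network. The sketch below assumes a hypothetical two-layer model (tanh hidden layer, scalar output, squared-error loss) with random weights, and checks the hand-derived backward pass against central differences. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input vector
W1 = rng.normal(size=(4, 3))    # first-layer weights
w2 = rng.normal(size=4)         # second-layer weights
target = 1.0

def forward(W1, w2):
    h = np.tanh(W1 @ x)         # hidden activations
    y = w2 @ h                  # scalar output
    return 0.5 * (y - target) ** 2, h, y

# Backward pass: multiply local sensitivities from the loss back to W1
loss, h, y = forward(W1, w2)
dL_dy = y - target              # d(0.5 * (y - t)^2) / dy
dL_dw2 = dL_dy * h              # y = w2 . h
dL_dh = dL_dy * w2
dL_dz = dL_dh * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
dL_dW1 = np.outer(dL_dz, x)     # z = W1 @ x

# Independent check with central differences on every entry of W1
def numerical_grad_W1(eps=1e-6):
    G = np.zeros_like(W1)
    for i in range(W1.shape[0]):
        for j in range(W1.shape[1]):
            Wp = W1.copy(); Wp[i, j] += eps
            Wm = W1.copy(); Wm[i, j] -= eps
            G[i, j] = (forward(Wp, w2)[0] - forward(Wm, w2)[0]) / (2 * eps)
    return G

max_err = np.max(np.abs(dL_dW1 - numerical_grad_W1()))
print(f"max |analytic - numerical| = {max_err:.2e}")
```

The numerical check needs two forward passes per weight, while the backward pass yields every entry of the gradient at roughly the cost of one extra pass.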
Adam Optimizer — Step by Step
Initialize: θ, m₀=0 (1st moment), v₀=0 (2nd moment), t=0, β₁=0.9, β₂=0.999, ε=1e-8
Increment t ← t + 1 and compute gradient: g_t = ∇_θ L(θ_{t-1})
Update biased 1st moment (momentum): m_t = β₁·m_{t-1} + (1-β₁)·g_t
Update biased 2nd moment (adaptive scale): v_t = β₂·v_{t-1} + (1-β₂)·g_t² (element-wise square)
Bias correction: m̂_t = m_t/(1-β₁ᵗ), v̂_t = v_t/(1-β₂ᵗ)
Parameter update: θ_t = θ_{t-1} - η·m̂_t / (√v̂_t + ε)
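Intuition: m̂_t is a bias-corrected running average of gradients (momentum). Dividing by √v̂_t normalizes each coordinate by its recent gradient magnitude, so parameters with consistently large gradients take smaller effective steps and parameters with small gradients take larger ones.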
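One consequence of the steps above is easiest to see on the very first update: with m₀ = v₀ = 0, bias correction gives m̂₁ = g₁ and v̂₁ = g₁², so each parameter moves by roughly η regardless of how large its gradient is. The sketch below works through that arithmetic; `adam_first_step` is a hypothetical helper name and the gradient values are made up for illustration.

```python
import numpy as np

def adam_first_step(g, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Size of the first Adam update (t = 1), starting from m_0 = v_0 = 0."""
    m = (1 - b1) * g              # m_1
    v = (1 - b2) * g ** 2         # v_1
    m_hat = m / (1 - b1 ** 1)     # bias correction recovers g
    v_hat = v / (1 - b2 ** 1)     # bias correction recovers g^2
    return lr * m_hat / (np.sqrt(v_hat) + eps)

# Gradients spanning six orders of magnitude
g = np.array([1e-3, 1.0, -1e3])
step = adam_first_step(g)
print(step)   # each entry has magnitude close to lr, with the sign of g
```

Without bias correction, the first step would be scaled by (1-β₁)/√(1-β₂) ≈ 3.16 instead of 1, which is why the correction matters most in the early iterations.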
Gradient Descent from Scratch
import numpy as np
import matplotlib.pyplot as plt

# ── Numerical derivatives (educational) ──────────────────────────────────────
def numerical_grad(f, x, h=1e-5):
    """Central difference approximation: (f(x+h) - f(x-h)) / (2h)"""
    x = np.asarray(x, dtype=float)   # avoid integer truncation when adding h
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# ── 1. Gradient Descent on simple quadratic ───────────────────────────────────
def loss(theta):
    return (theta[0] - 3)**2 + (theta[1] + 1)**2   # minimum at (3,-1)

def grad_loss(theta):
    return np.array([2*(theta[0]-3), 2*(theta[1]+1)])

theta = np.array([0., 0.])
lr = 0.1
history = [theta.copy()]
for step in range(50):
    g = grad_loss(theta)
    theta -= lr * g
    history.append(theta.copy())
    if np.linalg.norm(g) < 1e-6:
        print(f"Converged at step {step}")
        break
print(f"Final θ: {theta.round(4)}")   # ≈ [3, -1]

# ── 2. Adam optimizer ────────────────────────────────────────────────────────
def adam(grad_fn, theta_init, lr=0.01, n_steps=100, b1=0.9, b2=0.999, eps=1e-8):
    theta = theta_init.copy().astype(float)
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    history = [theta.copy()]
    for t in range(1, n_steps+1):
        g = grad_fn(theta)
        m = b1*m + (1-b1)*g            # biased 1st moment
        v = b2*v + (1-b2)*g**2         # biased 2nd moment
        m_hat = m / (1 - b1**t)        # bias correction
        v_hat = v / (1 - b2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(theta.copy())
    return theta, history

theta_adam, hist_adam = adam(grad_loss, np.array([0., 0.]), lr=0.1)
print(f"Adam θ: {theta_adam.round(4)}")

# ── 3. Chain rule in action (manual backprop) ─────────────────────────────────
# f(x) = (2x + 1)^2.  df/dx = 2 * (2x+1) * 2 = 4*(2x+1)
x = 3.0
# Forward pass
u = 2*x + 1            # u = 7
f = u**2               # f = 49
# Backward pass (chain rule)
df_du = 2*u            # = 14
du_dx = 2              # constant
df_dx = df_du * du_dx  # = 28
print(f"df/dx at x=3: {df_dx}")   # analytical: 4*(2*3+1) = 28 ✓

# ── 4. Learning rate sensitivity ─────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(12,3))
for ax, lr_val in zip(axes, [0.01, 0.1, 0.9]):
    theta = np.array([0.])
    losses = []
    for _ in range(100):
        g = 2*(theta[0] - 5)
        theta[0] -= lr_val * g
        losses.append((theta[0]-5)**2)
    ax.semilogy(losses)
    ax.set_title(f"lr = {lr_val}")
    ax.set_xlabel("Steps")
    ax.set_ylabel("Loss")
plt.tight_layout()
plt.show()
# lr=0.01: slow. lr=0.1: fast, smooth convergence.
# lr=0.9: θ overshoots and flips sides of the minimum every step. On this
# quadratic the loss curve matches lr=0.1 because |1 - 2*lr| = 0.8 in both
# cases; any lr above 1.0 would diverge.
Local Minima vs Saddle Points — What Actually Slows Training
In high-dimensional loss landscapes (modern neural networks have millions to billions of parameters), poor local minima appear to be rare; most points where training stalls are saddle points, where the gradient is zero but the point is a minimum along some directions and a maximum along others. The gradient noise in SGD usually helps the iterates escape saddle points. The bigger practical problems are: (1) Exploding gradients in deep networks: use gradient clipping. (2) Vanishing gradients in RNNs: use LSTM/GRU. (3) Poor conditioning: use batch normalization or careful weight initialization (He init for ReLU, Xavier for tanh/sigmoid).
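The saddle-point behavior can be seen on a toy problem. The sketch below assumes a hypothetical loss L(x, y) = x² + y⁴/4 − y²/2, which has a saddle at (0, 0) and minima at (0, ±1). Gaussian noise added to the gradient stands in for minibatch noise, and the noise scale, learning rate, and step count are illustrative choices.

```python
import numpy as np

# Hypothetical loss with a saddle at (0, 0) and minima at (0, +1), (0, -1):
# L(x, y) = x^2 + y^4/4 - y^2/2
def grad(theta):
    x, y = theta
    return np.array([2.0 * x, y ** 3 - y])

def descend(noise_std, n_steps=500, lr=0.1, seed=0):
    """Gradient descent with optional Gaussian gradient noise."""
    rng = np.random.default_rng(seed)
    theta = np.array([1.0, 0.0])   # starts exactly on the line y = 0
    for _ in range(n_steps):
        g = grad(theta) + noise_std * rng.normal(size=2)
        theta = theta - lr * g
    return theta

theta_plain = descend(noise_std=0.0)    # slides along y = 0 into the saddle
theta_noisy = descend(noise_std=0.01)   # noise pushes y off 0, then it grows

print(f"no noise:   {theta_plain}")
print(f"with noise: {theta_noisy}")
```

Without noise the y-coordinate never leaves zero, so the run ends at the saddle. A small perturbation is amplified by the negative curvature along y until the iterate settles near one of the two minima.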
For convex problems (linear regression, logistic regression, SVMs), gradient descent with a suitably small learning rate converges to the global minimum (for the non-smooth SVM hinge loss, via subgradients). For neural networks, it finds a 'good enough' basin.
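For linear regression the convex case can be checked directly, because a closed-form least-squares solution exists for comparison. The sketch below uses synthetic data; `true_w`, the noise level, and the learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])            # hypothetical ground truth
y = X @ true_w + 0.1 * rng.normal(size=200)    # noisy targets

# Closed-form least-squares solution for reference
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient descent on the mean squared error L(w) = mean((Xw - y)^2)
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)
    w -= lr * grad

gap = np.max(np.abs(w - w_closed))
print(f"closed form:      {w_closed}")
print(f"gradient descent: {w}")
print(f"max difference:   {gap:.2e}")
```

Starting from a different initial `w` leads to the same solution, which is what convexity guarantees here; a neural network trained from different initializations generally ends at different parameters.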