Foundations · beginner

Calculus & Optimization

“From derivatives to gradient descent — the engine that trains every neural network”

This lesson covers derivatives, partial derivatives, the chain rule (the basis of backpropagation), and gradient descent, then moves on to Adam, momentum, and learning rate scheduling: the full story of how neural networks actually learn.

45 min

8 diagrams

Prerequisites

→Linear Algebra

Concepts Covered

Derivatives, Chain Rule, Gradient, Gradient Descent, Adam, Momentum, Learning Rate, Convexity

∑Key Formulas

Gradient

Vector of partial derivatives — points in the direction of steepest ascent

Chain Rule

The backbone of backpropagation — compose derivatives through layers

Gradient Descent

Iteratively move opposite to the gradient to minimize loss L

Adam Update

Gradient descent with adaptive per-parameter learning rates (bias-corrected 1st & 2nd moments)
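
The four entries above name each formula without writing it out. In the notation used by the Adam section of this lesson, their standard forms would look roughly as follows. The parameter count n and the intermediate variable y are labels introduced here for illustration.

```latex
% Gradient: vector of partial derivatives of the loss
\nabla_\theta L(\theta) =
  \left( \frac{\partial L}{\partial \theta_1}, \dots,
         \frac{\partial L}{\partial \theta_n} \right)^{\top}

% Chain rule for a composition L = f(y), y = g(x)
\frac{dL}{dx} = \frac{dL}{dy} \cdot \frac{dy}{dx}

% Gradient descent with learning rate \eta
\theta_t = \theta_{t-1} - \eta \, \nabla_\theta L(\theta_{t-1})

% Adam update with bias-corrected moments \hat{m}_t and \hat{v}_t
\theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```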

🎯

Optimization Is What Makes Models Learn

motivation

Training a machine learning model is an optimization problem: find the parameters θ that minimize the loss function L(θ). Gradient descent is the workhorse algorithm for problems with millions of parameters, where closed-form solutions don't exist. The chain rule makes it possible to compute gradients through arbitrarily deep compositions of functions; applied efficiently from the output backward, that is backpropagation. Without calculus, there is no learning: every weight update in every neural network, every boosted tree fitted to residuals, and every soft-margin SVM solution is the result of an optimization procedure.

GPT-3 has about 175 billion parameters. Thanks to the chain rule, a single backward pass computes the gradient for ALL of them, and the optimizer then updates them together in one step.

💡

The Gradient as a Direction in Parameter Space

intuition

Imagine the loss function as a hilly landscape and your parameters as your position. The gradient ∇L(θ) is an arrow pointing uphill. Moving in the OPPOSITE direction (−η∇L) goes downhill, toward lower loss. The learning rate η controls step size: too large and the iterates overshoot, bouncing around or even diverging; too small and training takes forever. The classic problem: an elongated bowl (ill-conditioned loss surface) makes vanilla gradient descent zigzag across the valley instead of going straight to the minimum. Adam mitigates this by scaling each parameter's step by a running estimate of its gradient magnitude, which acts like a separate learning rate for each parameter.
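
To make the zigzag concrete, here is a minimal sketch on an assumed test function, 0.5·(x² + 100·y²), which is not from the lesson itself. The learning rate has to stay below 2/100 to keep the steep y direction stable, and that same small rate makes progress along the shallow x direction slow.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: curvature 1 along x, 100 along y.
def bowl_grad(theta):
    return np.array([1.0 * theta[0], 100.0 * theta[1]])

theta = np.array([1.0, 1.0])
lr = 0.019  # just under the stability limit 2/100 set by the steep direction
path = [theta.copy()]
for _ in range(50):
    theta = theta - lr * bowl_grad(theta)
    path.append(theta.copy())
path = np.array(path)

# y changes sign on every step (the zigzag), while x creeps toward 0 slowly.
y_sign_flips = int(np.sum(np.sign(path[1:, 1]) != np.sign(path[:-1, 1])))
print("sign flips in y:", y_sign_flips, "of", len(path) - 1, "steps")
print("remaining x after 50 steps:", path[-1, 0])
```

Each step multiplies y by (1 − 100·lr) = −0.9 and x by (1 − lr) = 0.981, so the iterate overshoots the valley floor every time while barely advancing along it.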

Intuition for the chain rule: if temperature change affects pressure, and pressure affects volume, how does temperature affect volume? Multiply the individual sensitivities.
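
The "multiply the sensitivities" idea can be checked numerically. The two functions below are made-up stand-ins for the temperature, pressure, and volume relationships (they are not physical laws), chosen only so that the derivatives are easy to write by hand.

```python
# Made-up relationships for illustration only (not a physical law):
def pressure(T):          # pressure as a function of temperature
    return 2.0 * T

def volume(P):            # volume as a function of pressure
    return 100.0 / P

T = 5.0
P = pressure(T)

dP_dT = 2.0                    # derivative of 2T with respect to T
dV_dP = -100.0 / P**2          # derivative of 100/P with respect to P
dV_dT_chain = dV_dP * dP_dT    # chain rule: multiply the sensitivities

# Independent check: central difference on the composed function
h = 1e-6
dV_dT_numeric = (volume(pressure(T + h)) - volume(pressure(T - h))) / (2 * h)
print(dV_dT_chain, dV_dT_numeric)
```

The two numbers should agree to several decimal places, which is the same kind of check commonly used to validate a hand-written backward pass.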

⚙️

Adam Optimizer — Step by Step

algorithm

Initialize: θ₀, m₀=0 (1st moment), v₀=0 (2nd moment), t=0; hyperparameters η (learning rate), β₁=0.9, β₂=0.999, ε=1e-8

Increment t ← t+1 and compute gradient: g_t = ∇_θ L(θ_{t-1})

Update biased 1st moment (momentum): m_t = β₁·m_{t-1} + (1-β₁)·g_t

Update biased 2nd moment (adaptive scale): v_t = β₂·v_{t-1} + (1-β₂)·g_t² (element-wise square)

Bias correction: m̂_t = m_t/(1-β₁ᵗ), v̂_t = v_t/(1-β₂ᵗ)

Parameter update: θ_t = θ_{t-1} - η·m̂_t / (√v̂_t + ε), then repeat from the gradient step until converged

Intuition: m̂_t is a bias-corrected running average of gradients (momentum). Dividing by √v̂_t normalizes each coordinate by its recent gradient magnitude, so parameters with consistently large gradients take smaller effective steps.
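
One consequence of the steps above is worth working through: at t = 1 the bias correction recovers m̂₁ = g₁ and v̂₁ = g₁², so the first update has size close to η whatever the scale of the gradient. A small sketch for a scalar gradient, using an assumed learning rate of 0.01:

```python
import numpy as np

lr, b1, b2, eps = 0.01, 0.9, 0.999, 1e-8

def first_adam_step(g):
    """Size of the very first Adam update (t = 1) for a scalar gradient g."""
    m = (1 - b1) * g          # m_1, starting from m_0 = 0
    v = (1 - b2) * g**2       # v_1, starting from v_0 = 0
    m_hat = m / (1 - b1**1)   # bias correction recovers g
    v_hat = v / (1 - b2**1)   # bias correction recovers g**2
    return lr * m_hat / (np.sqrt(v_hat) + eps)

grads = [1e-3, 1.0, 1e3]
steps = [first_adam_step(g) for g in grads]
for g, s in zip(grads, steps):
    print(f"g = {g:g}: step = {s:.6f}")
```

The gradients span six orders of magnitude, yet every step comes out at roughly lr. This scale invariance is what "adaptive per-parameter learning rate" means in practice.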

</>

Gradient Descent from Scratch

code

python

import numpy as np
import matplotlib.pyplot as plt

# ── Numerical derivatives (educational) ──────────────────────────────────────
def numerical_grad(f, x, h=1e-5):
    """Central difference approximation: (f(x+h) - f(x-h)) / (2h), one coordinate at a time."""
    x = np.asarray(x, dtype=float)  # avoid silent truncation if x holds integers
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus  = x.copy(); x_plus[i]  += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# ── 1. Gradient Descent on simple quadratic ───────────────────────────────────
def loss(theta):
    return (theta[0] - 3)**2 + (theta[1] + 1)**2  # minimum at (3,-1)

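def grad_loss(theta):
    return np.array([2*(theta[0]-3), 2*(theta[1]+1)])

# Sanity check: the analytic gradient should match the numerical one
assert np.allclose(grad_loss(np.array([0., 0.])), numerical_grad(loss, np.array([0., 0.])))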

theta = np.array([0., 0.])
lr = 0.1
history = [theta.copy()]

for step in range(50):
    g = grad_loss(theta)
    theta -= lr * g
    history.append(theta.copy())
    if np.linalg.norm(g) < 1e-6:
        print(f"Converged at step {step}")
        break

print(f"Final θ: {theta.round(4)}")  # ≈ [3, -1]

# ── 2. Adam optimizer ────────────────────────────────────────────────────────
def adam(grad_fn, theta_init, lr=0.01, n_steps=100, b1=0.9, b2=0.999, eps=1e-8):
    theta = theta_init.copy().astype(float)
    m, v = np.zeros_like(theta), np.zeros_like(theta)
    history = [theta.copy()]
    for t in range(1, n_steps+1):
        g = grad_fn(theta)
        m = b1*m + (1-b1)*g
        v = b2*v + (1-b2)*g**2
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
        history.append(theta.copy())
    return theta, history

theta_adam, hist_adam = adam(grad_loss, np.array([0., 0.]), lr=0.1)
print(f"Adam θ: {theta_adam.round(4)}")

# ── 3. Chain rule in action (manual backprop) ─────────────────────────────────
# f(x) = (2x + 1)^2. df/dx = 2 * (2x+1) * 2 = 4*(2x+1)
x = 3.0
# Forward pass
u = 2*x + 1    # u = 7
f = u**2       # f = 49

# Backward pass (chain rule)
df_du = 2*u    # = 14
du_dx = 2      # constant
df_dx = df_du * du_dx   # = 28
print(f"df/dx at x=3: {df_dx}")  # analytical: 4*(2*3+1) = 28 ✓

# ── 4. Learning rate sensitivity ─────────────────────────────────────────────
# For L(θ) = (θ-5)², each step multiplies the error by (1 - 2·lr).
# lr = 0.9 gives a factor of -0.8, so its loss curve would match lr = 0.1;
# any lr > 1 makes the factor exceed 1 in magnitude and the loss diverges.
fig, axes = plt.subplots(1, 3, figsize=(12,3))
for ax, lr_val in zip(axes, [0.01, 0.1, 1.05]):
    theta = np.array([0.])
    losses = []
    for _ in range(100):
        g = 2*(theta[0] - 5)
        theta[0] -= lr_val * g
        losses.append((theta[0]-5)**2)
    ax.semilogy(losses)
    ax.set_title(f"lr = {lr_val}")
    ax.set_xlabel("Steps")
    ax.set_ylabel("Loss")
plt.tight_layout()
plt.show()  # lr=0.01: slow, lr=0.1: steady convergence, lr=1.05: diverges
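
Momentum on its own, without Adam's adaptive scaling, can be sketched in a few lines. This is a heavy-ball variant run on an assumed ill-conditioned test loss, 0.5·(x² + 100·y²). The function, the momentum coefficient mu = 0.9, and the step count are illustrative choices, not values from the lesson.

```python
import numpy as np

def bowl_grad(theta):
    # Gradient of the assumed test loss 0.5 * (x**2 + 100 * y**2)
    return np.array([theta[0], 100.0 * theta[1]])

def run(lr=0.01, mu=0.0, n_steps=200):
    """Heavy-ball momentum; mu = 0 reduces to plain gradient descent."""
    theta = np.array([1.0, 1.0])
    velocity = np.zeros_like(theta)
    for _ in range(n_steps):
        velocity = mu * velocity - lr * bowl_grad(theta)
        theta = theta + velocity
    return np.linalg.norm(theta)  # distance from the minimum at (0, 0)

dist_gd = run(mu=0.0)
dist_momentum = run(mu=0.9)
print(f"plain GD: {dist_gd:.2e}, with momentum: {dist_momentum:.2e}")
```

With the same learning rate, the accumulated velocity carries the iterate along the shallow direction far faster than plain gradient descent manages, so the momentum run should end much closer to the minimum.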

⚠️

Local Minima vs Saddle Points — What Actually Slows Training

pitfall

In high-dimensional loss landscapes (modern neural networks have millions to billions of parameters), true local minima are rare. Most 'stuck' points are saddle points, where the gradient is zero but the point is a minimum in some directions and a maximum in others. The noise in stochastic gradient descent (SGD) tends to push the iterates off saddle points. The bigger practical problems are: (1) Exploding gradients in deep networks: use gradient clipping. (2) Vanishing gradients in RNNs: use gated architectures such as LSTM/GRU. (3) Poor conditioning: use batch normalization and careful weight initialization (He init for ReLU, Xavier for tanh/sigmoid).
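
A toy saddle shows why a small perturbation is enough to escape. The function f(x, y) = x² − y² is an assumption chosen for illustration, and a fixed offset of 1e-6 stands in for SGD noise. The function is unbounded below, so the sketch only shows that the saddle is unstable, not where training would end up.

```python
import numpy as np

def saddle_grad(theta):
    # f(x, y) = x**2 - y**2 has a saddle at (0, 0): minimum in x, maximum in y
    return np.array([2.0 * theta[0], -2.0 * theta[1]])

def descend(theta0, lr=0.1, n_steps=100):
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * saddle_grad(theta)
    return theta

stuck = descend([1.0, 0.0])       # exactly on the ridge: y stays 0 forever
escaped = descend([1.0, 1e-6])    # tiny perturbation in y grows every step
print("on the ridge: ", stuck)
print("perturbed:    ", escaped)
```

Started exactly on the ridge, gradient descent slides into the saddle and stays there. With the perturbation, the y coordinate is multiplied by 1.2 on every step and leaves the saddle region.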

For convex problems (linear regression, logistic regression, SVMs), gradient descent with a suitably small learning rate converges to a global minimum (for the non-smooth hinge loss of SVMs, via subgradients). For neural networks, it typically finds a 'good enough' basin.
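
The convex case can be illustrated with linear regression on synthetic data. The data, the weights in true_w, the learning rate, and the three starting points below are all assumptions made for the sketch. The point is only that every start reaches the same solution as the closed-form least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -3.0])            # hypothetical ground-truth weights
y = X @ true_w + 0.1 * rng.normal(size=100)

def mse_grad(w):
    # Gradient of the mean squared error (1/n) * ||Xw - y||^2
    return 2.0 / len(y) * X.T @ (X @ w - y)

w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form reference

finals = []
for w0 in ([0.0, 0.0], [10.0, -10.0], [-5.0, 8.0]):
    w = np.array(w0)
    for _ in range(500):
        w = w - 0.1 * mse_grad(w)
    finals.append(w)
    print(w0, "->", w)
print("least squares:", w_closed)
```

Because the mean squared error is convex in w, the starting point affects only how long convergence takes, not where it ends.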
