ML Learning Hub
Foundations · beginner

Linear Algebra for ML

The geometry behind every model — dot products, matrix transforms, and eigendecomposition

Vectors, dot products, matrix multiplication, eigendecomposition and SVD — with visual intuition for how matrices transform space. The language every neural network is written in.

45 min
10 diagrams
8 Concepts Covered

Prerequisites

Python ML Stack

Concepts Covered

Dot Product, Matrix Multiply, Eigenvalues, Eigenvectors, SVD, Rank, Determinant, PCA Connection

Key Formulas

Dot Product

a·b = Σᵢ aᵢbᵢ = ‖a‖‖b‖cos(θ). Measures how aligned two vectors are: zero means orthogonal, and for fixed lengths the value is largest when the vectors are parallel.

Matrix Multiply

(AB)x = A(Bx). Composition of two linear transformations: apply B first, then A.

Eigendecomposition

Av = λv. Eigenvectors v stay on their span under transformation A; λ is the scaling factor.

SVD

M = UΣVᵀ. Any real matrix decomposes into rotation (or reflection) × axis-aligned scaling × rotation (or reflection). Used in PCA, LSA, and recommender systems.

🎯

Why Linear Algebra IS Machine Learning

motivation

A neural network layer is y = Wx + b: a matrix-vector product plus a bias. Backpropagation applies the chain rule by multiplying Jacobian matrices layer by layer to obtain the gradient of the loss. PCA finds the leading eigenvectors of the covariance matrix. SVMs maximize a margin defined through dot products. Attention in Transformers is softmax(QKᵀ/√d)·V: two matrix multiplications around a row-wise softmax. Nearly every forward pass, backward pass, and optimization step is dominated by linear algebra. Understanding the geometric intuition, what matrices DO to vectors in space, is what separates engineers who debug by understanding from engineers who debug by trial and error.

The dot product a·b = ‖a‖‖b‖cos(θ) is the foundation of cosine similarity (used in NLP), the kernel trick (SVMs), and attention mechanisms (Transformers).
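To make the "it is all matrix products" claim concrete, here is a minimal sketch of a dense layer and of scaled dot-product attention in NumPy. The shapes, the random inputs, and the helper name softmax are illustrative assumptions, not part of any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Dense layer: y = W x + b maps a 4-dim input to a 3-dim output.
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = rng.normal(size=4)
y = W @ x + b

def softmax(z, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
n_tokens, d = 5, 8
Q = rng.normal(size=(n_tokens, d))
K = rng.normal(size=(n_tokens, d))
V = rng.normal(size=(n_tokens, d))

scores = Q @ K.T / np.sqrt(d)   # pairwise dot products, shape (5, 5)
weights = softmax(scores)       # each row sums to 1
out = weights @ V               # weighted average of value rows, shape (5, 8)

print(y.shape, scores.shape, out.shape)
```

Each entry of scores is a dot product between one query row and one key row, so attention is the same alignment measure as cosine similarity without the length normalization.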

💡

Matrices as Space Transformers

intuition

Every m×n matrix A represents a linear transformation from ℝⁿ to ℝᵐ. Multiplying a vector v by A can stretch, shear, rotate, reflect, or project it. The key insight: a matrix completely describes what happens to EVERY vector in the space, because you only need to know where it sends the standard basis vectors, and those images are exactly the columns of A. For a square matrix, the determinant is the signed volume scaling factor: |det(A)| = 2 means every region doubles in area (in 2D) or volume (in higher dimensions), and a negative sign means orientation is flipped. det(A) = 0 means the matrix collapses space onto a lower-dimensional subspace (rank-deficient, non-invertible).

Visualize any 2×2 matrix by watching where the unit square [0,1]×[0,1] gets sent. The four corners go to (0,0), the first column, the second column, and their sum.
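The unit-square picture can be checked numerically. The sketch below pushes the four corners through an example matrix and compares the area of the resulting parallelogram with |det(A)|. The matrix values and the helper polygon_area are assumptions chosen for illustration.

```python
import numpy as np

A = np.array([[2., 1.],
              [0., 3.]])

# Corners of the unit square, one per column: (0,0), (1,0), (1,1), (0,1).
square = np.array([[0., 1., 1., 0.],
                   [0., 0., 1., 1.]])

image = A @ square  # each column is the image of one corner

def polygon_area(pts):
    # Shoelace formula for a polygon whose vertices are the columns of pts.
    x, y = pts
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

print(image)
print(polygon_area(square), polygon_area(image), abs(np.linalg.det(A)))
```

The corner (1,0) lands on the first column of A, (0,1) on the second column, and (1,1) on their sum, so the image is the parallelogram spanned by the columns.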

⚙️

Eigendecomposition Step by Step

algorithm
1. Find eigenvalues: solve det(A − λI) = 0 (the characteristic polynomial). For 2×2: λ = (tr(A) ± √(tr(A)² − 4·det(A))) / 2.

2. For each eigenvalue λᵢ: solve (A − λᵢI)v = 0 to find the eigenvector vᵢ. Normalize so that ‖vᵢ‖ = 1.

3. Stack eigenvectors as columns of Q. If A is diagonalizable (it has a full set of linearly independent eigenvectors), then A = QΛQ⁻¹ where Λ = diag(λ₁, λ₂, …).

4. For real symmetric matrices (such as covariance matrices): eigenvalues are real and Q can be chosen orthogonal (Q⁻¹ = Qᵀ).

5. Aⁿ = QΛⁿQ⁻¹, so the eigenvalues of largest magnitude dominate repeated application (the basis of power iteration).

6. PCA: center X, compute the covariance C = XᵀX/(n − 1) where n is the number of samples (rows of X), eigendecompose, and take the top-k eigenvectors as the projection matrix.
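Steps 1 and 5 can be sketched in a few lines. The example below checks the 2×2 closed form against NumPy and then runs a basic power iteration. The symmetric example matrix, the iteration count, and the function name power_iteration are assumptions made for this sketch, not a production eigen-solver.

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])  # symmetric example matrix (an assumption for this sketch)

# Step 1 for a 2x2 matrix: roots of lambda^2 - tr(A)*lambda + det(A) = 0.
tr, det = np.trace(A), np.linalg.det(A)
disc = np.sqrt(tr**2 - 4 * det)
closed_form = np.array([(tr - disc) / 2, (tr + disc) / 2])
print(closed_form, np.linalg.eigvalsh(A))  # both in ascending order

# Step 5: power iteration. Repeatedly applying A pulls a generic start
# vector toward the eigenvector with the largest |lambda|.
def power_iteration(M, n_iter=200):
    v = np.ones(M.shape[0])
    for _ in range(n_iter):
        v = M @ v
        v /= np.linalg.norm(v)   # renormalize to avoid overflow
    return v @ M @ v, v          # Rayleigh quotient, unit eigenvector estimate

lam, v = power_iteration(A)
print(lam, v)
```

Convergence speed depends on the ratio between the two largest eigenvalue magnitudes, so a matrix with nearly equal leading eigenvalues would need many more iterations than this one.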

</>

Linear Algebra with NumPy

code
import numpy as np

# ── Vectors and dot products ──────────────────────────────────────────────────
a = np.array([3., 4.])
b = np.array([1., 0.])

print(f"a·b = {np.dot(a, b):.2f}")             # 3.0
print(f"‖a‖ = {np.linalg.norm(a):.2f}")        # 5.0
print(f"cos(θ) = {np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b)):.3f}")  # 0.6

# Cosine similarity (NLP/recommendation)
def cosine_sim(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# ── Matrix operations ─────────────────────────────────────────────────────────
A = np.array([[2., 1.],
              [0., 3.]])

B = np.array([[1., 0.],
              [2., 1.]])

print("A @ B =")
print(A @ B)                     # matrix multiply (composition)
print(f"det(A) = {np.linalg.det(A):.2f}")   # 6.0 — volume scaling
print(f"rank(A) = {np.linalg.matrix_rank(A)}")   # 2 — full rank

A_inv = np.linalg.inv(A)
print("A @ A_inv ≈ I:", np.allclose(A @ A_inv, np.eye(2)))

# ── Eigendecomposition ────────────────────────────────────────────────────────
eigenvalues, eigenvectors = np.linalg.eig(A)
print(f"Eigenvalues: {eigenvalues}")          # [2. 3.]
print(f"Eigenvectors (columns):\n{eigenvectors.round(3)}")

# Verify: A @ v = λ * v
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    lam = eigenvalues[i]
    print(f"λ{i+1}={lam:.2f}, A@v = {(A @ v).round(3)}, λ*v = {(lam * v).round(3)}")

# Reconstruct A from eigendecomposition
Q = eigenvectors
Lambda = np.diag(eigenvalues)
A_reconstructed = Q @ Lambda @ np.linalg.inv(Q)
print("Reconstruction error:", np.linalg.norm(A - A_reconstructed))

# ── SVD ───────────────────────────────────────────────────────────────────────
M = np.random.randn(4, 3)               # 4×3 rectangular matrix
U, S, Vt = np.linalg.svd(M, full_matrices=False)
print(f"U: {U.shape}, S: {S.shape}, Vt: {Vt.shape}")

# Low-rank approximation (keep top-k singular values)
k = 2
M_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
print(f"Rank-{k} approx error: {np.linalg.norm(M - M_approx):.4f}")

# ── PCA from scratch ─────────────────────────────────────────────────────────
X = np.random.randn(200, 5)
X -= X.mean(axis=0)                     # center
C = (X.T @ X) / (len(X) - 1)           # covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh for symmetric matrices
idx = np.argsort(eigenvalues)[::-1]     # sort descending
PC = eigenvectors[:, idx[:2]]           # top-2 principal components
X_proj = X @ PC                         # project to 2D
print(f"Explained variance: {eigenvalues[idx[:2]] / eigenvalues.sum() * 100}")
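The link between SVD and PCA can be verified directly. This sketch, which assumes random Gaussian data and a fixed seed for repeatability, compares the covariance eigendecomposition with the SVD of the centered data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)                      # center, as in the PCA snippet above

# Route 1: eigendecomposition of the covariance matrix.
C = (X.T @ X) / (len(X) - 1)
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]   # descending order

# Route 2: SVD of the centered data itself.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values relate to covariance eigenvalues by lambda = sigma^2 / (n - 1).
same_values = np.allclose(S**2 / (len(X) - 1), evals)

# Rows of Vt match the eigenvectors up to sign.
same_vectors = np.allclose(np.abs(Vt), np.abs(evecs.T))

print(same_values, same_vectors)
```

Working from the SVD of X avoids forming XᵀX explicitly, which squares the condition number of the problem.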
⚠️

Numerical Stability and Ill-Conditioning

pitfall

The condition number κ(A) = σ_max/σ_min (ratio of largest to smallest singular value) measures how sensitive the solution of Ax = b is to perturbations in A or b. High condition number → ill-conditioned → rounding and measurement errors are amplified. The same idea applies to optimization: gradient descent converges slowly when the Hessian of the loss is ill-conditioned (an elongated bowl), which is one reason feature scaling matters and why optimizers such as Adam adapt the step size per parameter. When solving Ax = b, avoid forming the inverse with np.linalg.inv(A); use np.linalg.solve(A, b), which is faster and more stable (it uses LU factorization).

np.linalg.cond(A) returns the condition number. As a rule of thumb you lose about log₁₀(κ) significant digits when solving a linear system: κ ≈ 10⁶ costs roughly 6 digits, which leaves about 10 in float64 but almost none in float32.
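A small experiment shows ill-conditioning in practice. The sketch below uses an 8×8 Hilbert matrix, a standard ill-conditioned test case chosen here as an assumption, and solves a system whose exact answer is known to be all ones.

```python
import numpy as np

n = 8
i, j = np.indices((n, n))
H = 1.0 / (i + j + 1)          # Hilbert matrix, a classic ill-conditioned example

x_true = np.ones(n)
b = H @ x_true

kappa = np.linalg.cond(H)
x_solve = np.linalg.solve(H, b)      # LU-based solve
x_inv = np.linalg.inv(H) @ b         # explicit inverse, shown only for comparison

print(f"cond(H) = {kappa:.2e}")
print(f"digits lost (rule of thumb) ≈ {np.log10(kappa):.1f}")
print(f"solve error: {np.linalg.norm(x_solve - x_true):.2e}")
print(f"inv error:   {np.linalg.norm(x_inv - x_true):.2e}")
```

The exact error values vary by platform and LAPACK build, so treat the printed numbers as indicative. The point is that both errors sit far above machine epsilon because κ is large.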
