ML Learning Hub
Unsupervised · Intermediate

Anomaly & Outlier Detection

"Finding the one-in-a-thousand data point that doesn't belong"

Statistical (Z-Score, IQR fences) and algorithmic (Isolation Forest, LOF, One-Class SVM) approaches to finding rare abnormal observations, with applications in fraud detection, manufacturing defects, and network intrusion.

35 min
8 diagrams
7 Concepts Covered

Prerequisites

→ Probability & Statistics
→ Model Evaluation

Concepts Covered

Z-Score, IQR, Isolation Forest, LOF, One-Class SVM, Contamination, AUC-PR

Key Formulas

Z-Score

Standard deviations from the mean; |z| > 3 is conventionally anomalous

IQR Fence

Tukey fences: points outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR] are outliers (IQR = Q3 - Q1)

Isolation Score

Isolation Forest: anomalies have shorter average path lengths h(x)

LOF Score

Local Outlier Factor: ratio of the neighbours' average local density to the point's own density; values well above 1 indicate an outlier
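
For reference, these four quantities are usually written as below. The notation is an assumption, since the page does not define its symbols: μ and σ are a feature's mean and standard deviation, E[h(x)] is the average path length of x over the trees, n is the number of points each tree is built on, c(n) is the average path length of an unsuccessful binary-search-tree lookup among n points, N_k(x) is the set of k nearest neighbours of x, and lrd_k is the local reachability density.

```latex
z = \frac{x - \mu}{\sigma}

\text{fences} = \left[\, Q_1 - 1.5\,\mathrm{IQR},\; Q_3 + 1.5\,\mathrm{IQR} \,\right],
\qquad \mathrm{IQR} = Q_3 - Q_1

s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}},
\qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n},
\quad H(i) \approx \ln i + 0.5772

\mathrm{LOF}_k(x) = \frac{1}{|N_k(x)|} \sum_{o \in N_k(x)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(x)}
```

An isolation score s close to 1 points to an anomaly, while scores well below 0.5 point to normal data. A LOF value near 1 means a point is about as dense as its neighbours.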

Why Anomaly Detection Matters

motivation

Credit card fraud costs $32 billion annually. Network intrusion attacks cause trillions in damage. Industrial equipment failures cost $50 billion per year. Anomaly detection is the critical first line of defense in all these systems. The core challenge: you rarely have labeled examples of anomalies (they're rare by definition), so most anomaly detection is unsupervised. You only learn what 'normal' looks like, then flag deviations.

In medical diagnosis, a false negative (missing a cancer) is catastrophic; in fraud detection, false positives (blocking real customers) destroy revenue. Choosing the threshold is therefore a business decision about which kind of error costs more.

The Statistical Viewpoint

intuition

The simplest intuition: normal data concentrates in high-density regions, while anomalies live in low-density regions. Z-Score flags points more than k standard deviations from the mean, but it assumes a roughly Gaussian distribution. IQR fences are non-parametric: they flag points more than 1.5×IQR below Q1 or above Q3, which makes them robust to non-Gaussian data. Both are univariate: they check each feature independently and miss multivariate anomalies (a temperature of 20°C is normal; a pressure of 5 bar is normal; but temperature=20 AND pressure=5 together may be anomalous).
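
A minimal sketch of both rules may make the difference concrete. The function names and the sample values are hypothetical, chosen only for illustration. In the ten-reading sample, the extreme value inflates the standard deviation enough that its |z| stays just under 3, while the quartile-based fences still flag it. The second part builds two perfectly correlated features and adds one point that breaks the correlation. Each of its coordinates is in range, so neither per-feature rule notices it.

```python
import numpy as np

def zscore_outliers(x, k=3.0):
    """Flag values more than k standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > k

def iqr_outliers(x, factor=1.5):
    """Flag values outside the Tukey fences [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

# One feature, ten readings, one obvious outlier (hypothetical values).
x = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 50], dtype=float)
z_flags = zscore_outliers(x)    # the outlier inflates the std, so nothing exceeds |z| = 3
iqr_flags = iqr_outliers(x)     # the quartiles ignore the extreme value, so 50 is flagged

# Two perfectly correlated features plus one point that breaks the correlation.
a = np.linspace(-2, 2, 41)
X = np.vstack([np.column_stack([a, a]), [1.5, -1.5]])
per_feature = np.column_stack(
    [iqr_outliers(X[:, j]) | zscore_outliers(X[:, j]) for j in range(2)]
)

print("z-score flags:", int(z_flags.sum()))
print("IQR flags:", int(iqr_flags.sum()))
print("joint anomaly caught per feature:", bool(per_feature[-1].any()))
```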

Statistical vs Algorithmic Methods

comparison

Z-Score and IQR are fast and interpretable, but both treat each feature independently, and Z-Score additionally assumes roughly Gaussian data. Isolation Forest builds random trees and measures how quickly each point can be isolated: anomalies isolate fast because they sit in sparse regions. Local Outlier Factor (LOF) compares each point's local density to its neighbors' density: if your neighbors are much denser than you, you're an outlier. One-Class SVM learns a boundary in kernel feature space that encloses most of the normal points (with an RBF kernel this is equivalent to finding a minimal enclosing hypersphere). Autoencoder anomaly detection trains a neural network to reconstruct normal data, so a high reconstruction error signals an anomaly.

Isolation Forest scales to millions of points and handles high-dimensional data well. LOF is better for clustered data with varying densities. Autoencoders excel at anomaly detection in images and time series.
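
The varying-density case can be sketched with scikit-learn's `LocalOutlierFactor`. The data here is synthetic and hypothetical: one tight cluster, one loose cluster, and a single point that sits close to the tight cluster in absolute terms but far from it relative to that cluster's spread. Because LOF compares each point against its own neighbourhood, points in both clusters should score near 1, while the odd point scores much higher.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(200, 2))    # tight cluster
sparse = rng.normal(loc=10.0, scale=2.0, size=(200, 2))  # loose cluster
odd = np.array([[0.0, 1.0]])                             # about 10 std from the tight cluster
X = np.vstack([dense, sparse, odd])

lof = LocalOutlierFactor(n_neighbors=20)     # contamination="auto" by default
labels = lof.fit_predict(X)                  # 1 = inlier, -1 = outlier
lof_scores = -lof.negative_outlier_factor_   # ~1 = as dense as its neighbours, >1 = sparser

print(f"median LOF, tight cluster: {np.median(lof_scores[:200]):.2f}")
print(f"median LOF, loose cluster: {np.median(lof_scores[200:400]):.2f}")
print(f"LOF of the odd point:      {lof_scores[-1]:.2f} (label {labels[-1]})")
```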

Isolation Forest Algorithm

algorithm
1. Build an ensemble of isolation trees (random binary trees), each grown on a random subsample of the data
2. At each node: randomly select a feature, then a random split value between that feature's minimum and maximum
3. Recurse until each point is isolated (alone in a leaf) or a depth limit is reached
4. Anomaly score is computed from the average path length across all trees
5. Short path → point isolated quickly → anomaly
6. Normal points need more splits → longer average path
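
The steps above can be sketched from scratch in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the helper names (`c`, `path_length`, `isolation_score`) are hypothetical, and instead of building and storing full trees it grows only the branch that contains the point being scored, using fresh random splits for every call. The normalisation term `c(n)` and the score `2 ** (-E[h(x)] / c(n))` follow the usual formulation of the isolation score.

```python
import numpy as np

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def path_length(x, X, rng, depth, max_depth):
    """Count the random splits needed to isolate x within the sample X."""
    n = len(X)
    if n <= 1 or depth >= max_depth:
        return depth + c(n)                  # adjust for the unsplit remainder
    f = rng.integers(X.shape[1])             # random feature
    lo, hi = X[:, f].min(), X[:, f].max()
    if lo == hi:                             # feature is constant here, cannot split
        return depth + c(n)
    split = rng.uniform(lo, hi)              # random split value
    below = X[:, f] < split
    side = X[below] if x[f] < split else X[~below]   # follow x down the tree
    return path_length(x, side, rng, depth + 1, max_depth)

def isolation_score(x, X, n_trees=100, sample_size=256, seed=0):
    """Score in (0, 1); values near 1 mean x is isolated after few splits."""
    rng = np.random.default_rng(seed)
    psi = min(sample_size, len(X))
    max_depth = int(np.ceil(np.log2(psi)))
    lengths = []
    for _ in range(n_trees):
        sample = X[rng.choice(len(X), size=psi, replace=False)]
        lengths.append(path_length(x, sample, rng, 0, max_depth))
    return 2.0 ** (-np.mean(lengths) / c(psi))

# Hypothetical data: 500 standard-normal points plus one far-away point.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(size=(500, 2)), [[6.0, 6.0]]])
score_outlier = isolation_score(X[-1], X)
score_inlier = isolation_score(np.array([0.0, 0.0]), X)
print(f"outlier score: {score_outlier:.3f}, inlier score: {score_inlier:.3f}")
```

The far-away point should receive the higher score, because a single split on either feature often separates it from everything else.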

scikit-learn Anomaly Detection

code
python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# ── Sample data (5% anomalies) ─────────────────────────────────────
rng = np.random.default_rng(42)              # seeded so the outliers are reproducible
X_normal, _ = make_classification(n_samples=475, n_features=10, random_state=42)
X_anom = rng.standard_normal((25, 10)) * 4   # 25 outliers with 4x the spread
X = np.vstack([X_normal, X_anom])
y_true = np.array([0]*475 + [1]*25)          # 0=normal, 1=anomaly

X_scaled = StandardScaler().fit_transform(X)

# ── Isolation Forest ───────────────────────────────────────────────
iso = IsolationForest(
    n_estimators=200,
    contamination=0.05,   # expected fraction of outliers
    random_state=42
)
labels_iso = iso.fit_predict(X_scaled)    # 1=inlier, -1=outlier
scores_iso = iso.score_samples(X_scaled)  # lower = more anomalous

# ── Local Outlier Factor ───────────────────────────────────────────
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_lof = lof.fit_predict(X_scaled)

# ── One-Class SVM ──────────────────────────────────────────────────
# nu is an upper bound on the fraction of training points treated as outliers
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
labels_svm = ocsvm.fit_predict(X_scaled)

# ── Z-Score (univariate, per-feature) ──────────────────────────────
z_scores = np.abs(stats.zscore(X))
outlier_mask = (z_scores > 3).any(axis=1)  # flag a row if any feature has |z| > 3

# ── Evaluate with known labels ─────────────────────────────────────
# Convert: 1=inlier → 0=normal,  -1=outlier → 1=anomaly
y_pred = (labels_iso == -1).astype(int)
# Negate the scores so that higher = more anomalous, matching y_true
print(f"AUC-ROC: {roc_auc_score(y_true, -scores_iso):.3f}")
print(f"AP:      {average_precision_score(y_true, -scores_iso):.3f}")
print(f"F1:      {f1_score(y_true, y_pred):.3f}")

Anomaly Detection Pitfalls

pitfall

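The contamination parameter in Isolation Forest and LOF directly controls the decision threshold. If you set contamination=0.05 but your actual anomaly rate is 0.1%, the model will still flag about 5% of the training points, so most of what it flags will be normal. Always calibrate this with domain knowledge or holdout labeled data. Second pitfall: high dimensionality degrades distance- and density-based methods such as LOF (the curse of dimensionality makes distances between points nearly uniform), and per-feature Z-Score rules flag more points by chance as features are added. Consider reducing dimensionality first, for example with PCA, when you have more than about 20 features. Third: concept drift, where 'normal' changes over time. Retrain periodically or use online anomaly detection for streaming data.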
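
A short sketch of the first pitfall, on hypothetical data that contains no injected anomalies at all: the number of points flagged tracks the `contamination` setting rather than anything in the data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))   # purely "normal" synthetic data

flagged = {}
for contamination in (0.05, 0.001):
    iso = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=0
    )
    # fit_predict returns -1 for points below the contamination-derived threshold
    flagged[contamination] = int((iso.fit_predict(X) == -1).sum())

for contamination, count in flagged.items():
    print(f"contamination={contamination}: {count} of {len(X)} points flagged")
```

With 2,000 points, the first setting flags roughly 100 points and the second only a handful, even though the data is identical.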

Never evaluate anomaly detection with accuracy: class imbalance makes it meaningless. Use Precision@k, AUC-PR (area under precision-recall curve), or F1 at the chosen threshold.
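
To see why accuracy misleads, consider a hypothetical dataset with 1% anomalies and a "detector" that labels everything normal. The sketch below also includes a small `precision_at_k` helper (a hypothetical name, since scikit-learn has no function by that name) and a made-up score vector that ranks half of the anomalies at the top and the other half at the bottom.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, recall_score

def precision_at_k(y_true, scores, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(y_true[top_k]))

# 1,000 points, the first 10 are anomalies (1% anomaly rate).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A useless detector that calls everything normal still looks accurate.
y_all_normal = np.zeros_like(y_true)
accuracy = accuracy_score(y_true, y_all_normal)
recall = recall_score(y_true, y_all_normal)

# Made-up anomaly scores: 5 anomalies ranked first, the other 5 ranked last.
scores = np.linspace(0.0, 0.5, 1000)
scores[:5] = 1.0
p_at_10 = precision_at_k(y_true, scores, k=10)
ap = average_precision_score(y_true, scores)

print(f"accuracy of 'all normal': {accuracy:.2f}, recall: {recall:.2f}")
print(f"Precision@10: {p_at_10:.2f}, AUC-PR (average precision): {ap:.3f}")
```

The all-normal detector reaches 99% accuracy while finding none of the anomalies. Precision@10 and average precision reflect how many anomalies the ranking actually surfaces.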
