Anomaly & Outlier Detection
"Finding the one-in-a-thousand data point that doesn't belong"
Statistical (Z-Score, IQR fences) and algorithmic (Isolation Forest, LOF, One-Class SVM) approaches to finding rare abnormal observations, with applications in fraud detection, manufacturing defects, and network intrusion.
Key Formulas
Z-Score
$z = (x - \mu) / \sigma$: the number of standard deviations from the mean; $|z| > 3$ is conventionally treated as anomalous
IQR Fence
Tukey fences: points outside the interval $[Q_1 - 1.5 \cdot \mathrm{IQR},\ Q_3 + 1.5 \cdot \mathrm{IQR}]$ are outliers, where $\mathrm{IQR} = Q_3 - Q_1$
Isolation Score
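Isolation Forest: $s(x, n) = 2^{-E[h(x)] / c(n)}$, where $h(x)$ is the path length of $x$ in a tree and $c(n)$ is the average path length for $n$ points; anomalies have shorter average path lengths, so their score approaches 1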
LOF Score
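Local Outlier Factor: ratio of the neighbours' average local density to the point's own local density; values near 1 are typical, values well above 1 indicate an outlier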
Why Anomaly Detection Matters
Credit card fraud costs $32 billion annually. Network intrusion attacks cause trillions in damage. Industrial equipment failures cost $50 billion per year. Anomaly detection is the critical first line of defense in all these systems. The core challenge: you rarely have labeled examples of anomalies (they're rare by definition), so most anomaly detection is unsupervised. You only learn what 'normal' looks like, then flag deviations.
In medical diagnosis, a false negative (missing cancer) is catastrophic; in fraud detection, false positives (blocking real customers) destroy revenue. Choosing the right threshold is a business decision.
The Statistical Viewpoint
The simplest intuition: normal data concentrates in high-density regions, while anomalies live in low-density regions. Z-Score flags points more than k standard deviations from the mean, but it assumes a roughly Gaussian distribution. IQR fences are non-parametric: they flag points more than 1.5×IQR below Q1 or above Q3, which makes them robust to non-Gaussian data. Both are univariate: they check each feature independently and miss multivariate anomalies (a temperature of 20°C is normal; a pressure of 5 bar is normal; but temperature=20 AND pressure=5 together may be anomalous).
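The two univariate rules can be sketched in a few lines of NumPy. This is a minimal illustration, not a production detector: the function names, the default thresholds (k=3, factor=1.5) and the sensor readings are all assumptions made up for the example.

```python
import numpy as np

def zscore_outliers(x, k=3.0):
    """Flag values more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > k

def iqr_outliers(x, factor=1.5):
    """Flag values outside the Tukey fences [Q1 - factor*IQR, Q3 + factor*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

# Hypothetical sensor readings: 19 values near 10 and one obvious spike.
normal = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0] * 2 + [10.0]
readings = np.array(normal + [25.0])

z_flags = zscore_outliers(readings)
iqr_flags = iqr_outliers(readings)
print("Z-Score flags index:", np.flatnonzero(z_flags))
print("IQR flags index:   ", np.flatnonzero(iqr_flags))
```

Both rules look at one feature at a time, so applying them column by column to a table still cannot catch a row whose individual values are each in range but jointly unusual.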
Statistical vs Algorithmic Methods
Z-Score and IQR are fast and interpretable, but they treat each feature independently, and Z-Score additionally assumes roughly Gaussian data. Isolation Forest builds random trees and measures how quickly each point can be isolated: anomalies isolate fast because they sit in sparse regions. Local Outlier Factor (LOF) compares each point's local density to its neighbors' density: if your neighbors are much denser than you, you're an outlier. One-Class SVM learns a boundary that encloses the normal points; the closely related SVDD formulation describes this as the minimal enclosing hypersphere. Autoencoder anomaly detection trains a neural network to reconstruct normal data, so a high reconstruction error signals an anomaly.
Isolation Forest scales to millions of points and handles high-dimensional data well. LOF is better for clustered data with varying densities. Autoencoders excel at anomaly detection in images and time series.
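To make the point about varying densities concrete, here is a small sketch using scikit-learn's LocalOutlierFactor. The data is synthetic and chosen for illustration only: two grid-shaped clusters with different spacing, plus a hypothetical probe point that sits just outside the tight cluster. A global nearest-neighbour distance is included as a simple baseline for comparison; it is not one of the methods discussed above.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Two hypothetical clusters on regular grids: one tight, one loose.
grid = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)
dense = grid * 0.1                 # spacing 0.1, near the origin
sparse = grid * 2.0 + 10.0         # spacing 2.0, far from the dense cluster
probe = np.array([[1.0, 0.2]])     # 0.6 away from the edge of the dense cluster
X = np.vstack([dense, sparse, probe])
probe_idx = len(X) - 1

lof = LocalOutlierFactor(n_neighbors=5)
lof.fit(X)
lof_scores = -lof.negative_outlier_factor_   # larger = more anomalous

# Global baseline: distance from each point to its nearest other point.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
np.fill_diagonal(dists, np.inf)
nn_dist = dists.min(axis=1)

print("LOF of probe:", round(float(lof_scores[probe_idx]), 2))
print("Probe ranked first by LOF:", int(np.argmax(lof_scores)) == probe_idx)
print("Probe ranked first by nearest-neighbour distance:",
      int(np.argmax(nn_dist)) == probe_idx)
```

The probe is closer to its nearest neighbour (0.6) than any point in the loose cluster is to its own (2.0), so a single global distance threshold would rank the loose cluster as more anomalous. LOF ranks the probe first because it compares the probe only with the density of its own neighbourhood.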
Isolation Forest Algorithm
Build an ensemble of isolation trees (random binary trees)
For each tree: randomly select a feature, then a random split value
Recurse until each point is isolated (alone in a leaf)
Anomaly score is derived from the average path length across all trees
Short path → point isolated quickly → anomaly
Normal points need more splits β longer average path
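The steps above can be sketched in plain Python. This is a simplified illustration rather than the scikit-learn implementation: instead of building each tree once and scoring every point on it, it follows a single point through a fresh random partition per "tree", and it omits subsampling and the normalisation of path lengths into a final score. All function names and the toy data are assumptions for the example.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))          # randomly select a feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:
        return depth                             # cannot split here; stop (simplification)
    split = rng.uniform(lo, hi)                  # random split value in the feature's range
    # Keep only the points that fall on the same side of the split as `point`.
    if point[feature] < split:
        subset = [row for row in data if row[feature] < split]
    else:
        subset = [row for row in data if row[feature] >= split]
    return path_length(point, subset, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=100, seed=0):
    """Average the path length over many independent random partitions."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# Toy data: a 2-D Gaussian cluster plus one far-away point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point nearest the centre

outlier_avg = average_path_length(outlier, data)
inlier_avg = average_path_length(inlier, data)
print(f"average path length, outlier: {outlier_avg:.2f}")
print(f"average path length, inlier:  {inlier_avg:.2f}")
```

The far-away point is usually cut off within a couple of splits, while the central point needs many more, which is the gap the anomaly score is built on.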
scikit-learn Anomaly Detection
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# --- Sample data (5% anomalies) ---
rng = np.random.default_rng(42)  # seeded so the injected outliers are reproducible
X_normal, _ = make_classification(n_samples=475, n_features=10, random_state=42)
X_anom = rng.standard_normal((25, 10)) * 4  # 25 clear outliers
X = np.vstack([X_normal, X_anom])
y_true = np.array([0] * 475 + [1] * 25)  # 0=normal, 1=anomaly
X_scaled = StandardScaler().fit_transform(X)

# --- Isolation Forest ---
iso = IsolationForest(
    n_estimators=200,
    contamination=0.05,  # expected fraction of outliers
    random_state=42,
)
labels_iso = iso.fit_predict(X_scaled)    # 1=inlier, -1=outlier
scores_iso = iso.score_samples(X_scaled)  # lower = more anomalous

# --- Local Outlier Factor ---
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_lof = lof.fit_predict(X_scaled)

# --- One-Class SVM ---
ocsvm = OneClassSVM(kernel="rbf", nu=0.05)  # nu bounds the fraction of training points treated as outliers
labels_svm = ocsvm.fit_predict(X_scaled)

# --- Z-Score (univariate, per-feature) ---
z_scores = np.abs(stats.zscore(X))
outlier_mask = (z_scores > 3).any(axis=1)  # flag a row if any feature has |z| > 3

# --- Evaluate with known labels ---
# Convert: 1=inlier -> 0=normal, -1=outlier -> 1=anomaly
y_pred = (labels_iso == -1).astype(int)
# Negate the scores so that higher means more anomalous, as the metrics expect
print(f"AUC-ROC: {roc_auc_score(y_true, -scores_iso):.3f}")
print(f"AP: {average_precision_score(y_true, -scores_iso):.3f}")
print(f"F1: {f1_score(y_true, y_pred):.3f}")
Anomaly Detection Pitfalls
The contamination parameter in Isolation Forest and LOF directly controls the decision threshold. If you set contamination=0.05 but your actual anomaly rate is 0.1%, the model still flags roughly 5% of points, so almost everything it flags is normal. Always calibrate this with domain knowledge or holdout labeled data. Second pitfall: high dimensionality degrades distance- and density-based methods such as LOF (curse of dimensionality), and per-feature Z-Score checks raise more false alarms as the number of features grows. Consider dimensionality reduction such as PCA first when there are more than about 20 features. Third: concept drift, where 'normal' changes over time. Retrain or use online anomaly detection for streaming data.
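A quick way to see the first pitfall is to fit the same model with two contamination values on data that contains no injected anomalies at all. The sketch below uses synthetic Gaussian data and arbitrary parameter choices for illustration; the exact counts may vary slightly between scikit-learn versions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # hypothetical data: every point is "normal"

flagged = {}
for contamination in (0.001, 0.05):
    iso = IsolationForest(
        n_estimators=100, contamination=contamination, random_state=0
    )
    labels = iso.fit_predict(X)  # 1=inlier, -1=outlier
    flagged[contamination] = int((labels == -1).sum())
    print(f"contamination={contamination}: {flagged[contamination]} of {len(X)} flagged")
```

The number of flagged points tracks the contamination value rather than anything in the data, which is why the setting needs to come from domain knowledge or labeled holdout data.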
Never evaluate anomaly detection with accuracy: class imbalance makes it meaningless. Use Precision@k, AUC-PR (area under precision-recall curve), or F1 at the chosen threshold.
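The following sketch shows why accuracy misleads and how the alternatives behave. The labels and detector scores are fabricated purely for illustration (1% anomalies, of which a hypothetical detector ranks eight highly and misses two), and precision_at_k is a helper written for this example rather than a scikit-learn function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

def precision_at_k(y_true, scores, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(y_true)[top_k]))

# Hypothetical ground truth: 1000 points, the first 10 are anomalies.
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "detector" that never flags anything still looks excellent on accuracy.
always_normal = np.zeros(1000, dtype=int)
acc = accuracy_score(y_true, always_normal)

# Hypothetical anomaly scores (higher = more anomalous):
# eight anomalies score high, two are missed and score near the bottom.
scores = np.linspace(0.0, 0.5, 1000)
scores[:8] = np.linspace(0.9, 1.0, 8)

p_at_10 = precision_at_k(y_true, scores, k=10)
ap = average_precision_score(y_true, scores)  # a common estimate of AUC-PR

print(f"accuracy of the do-nothing detector: {acc:.3f}")
print(f"Precision@10: {p_at_10:.2f}")
print(f"Average precision: {ap:.3f}")
```

Precision@k is a natural fit when analysts can only review a fixed number of alerts, while average precision summarises ranking quality across all thresholds.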