Feature Importance & Selection
“Know which features your model actually relies on — then trust it more (or less)”
Permutation importance vs impurity (Gini) importance, SHAP unified attribution, drop-column importance, and how correlated features split scores unfairly.
Key Formulas
Permutation Importance
Accuracy drop when feature j is randomly shuffled — model-agnostic, works post-training
Gini Impurity Importance
Weighted impurity decrease across all splits on feature j; fast but biased toward high-cardinality features
SHAP (kernel)
Shapley value: each feature's average marginal contribution over all feature coalitions
Drop-Column Importance
Score drop after retraining the model without feature j; often treated as the gold standard, but expensive because it retrains once per feature
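In standard notation, the four quantities above can be written roughly as follows. The symbols are assumptions introduced here for illustration: s is the baseline validation score, s_{k,j} the score after the k-th shuffle of feature j, K the number of repeats, t a tree node that splits on feature v(t), N_t the number of samples reaching node t out of N, Δi(t) the impurity decrease at node t, p_{t,c} the proportion of class c at node t, F the set of all p features, v(S) the expected model output when only the features in S are known, and f_{-j} the model retrained without feature j. For a forest, the Gini expression is averaged over trees.

```latex
\begin{align*}
\text{Permutation:}\quad & I_j^{\text{perm}} = s - \frac{1}{K}\sum_{k=1}^{K} s_{k,j} \\
\text{Gini impurity:}\quad & I_j^{\text{gini}} = \sum_{t \,:\, v(t) = j} \frac{N_t}{N}\,\Delta i(t),
  \qquad i(t) = 1 - \sum_{c} p_{t,c}^{2} \\
\text{Shapley value:}\quad & \phi_j = \sum_{S \subseteq F \setminus \{j\}}
  \frac{|S|!\,(p - |S| - 1)!}{p!}\,\bigl[v(S \cup \{j\}) - v(S)\bigr] \\
\text{Drop-column:}\quad & I_j^{\text{drop}} = s(f) - s(f_{-j})
\end{align*}
```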
Why Feature Importance Is Non-Negotiable
Machine learning models are often black boxes: they produce outputs but hide their reasoning. Feature importance methods peel back that opacity by answering one question: which inputs does the model lean on most? This matters for three reasons. (1) Debugging: if your model leans heavily on 'random_noise', it is fitting noise rather than signal; if it leans on a feature that encodes the target, you have a data leakage problem. (2) Trust: regulators, doctors, and loan officers must understand model decisions. GDPR Article 22 restricts decisions based solely on automated processing, and the regulation's transparency provisions call for meaningful information about the logic involved. (3) Feature selection: importance scores guide dimensionality reduction. Dropping truly unimportant features reduces inference cost and can reduce overfitting to noise.
A credit scoring model that relies heavily on zip_code might look fair on training data yet act as a proxy for race; importance analysis can surface this before deployment.
Two Philosophies: What Does 'Important' Mean?
There are fundamentally two schools. (A) Structural importance asks 'how much did this feature help build the model?' Tree-based impurity importance is the canonical example, computed from split statistics during training. It's fast (no extra computation) but has a known bias: it inflates importance for high-cardinality or continuous features such as zip_code, because more distinct values offer more candidate split points. It is also computed on training data, so it can credit features the model merely overfit to. (B) Functional importance asks 'how much do the model's predictions degrade if I break this feature?' Permutation importance shuffles each feature independently and measures the accuracy drop. It's model-agnostic, works with any estimator, and, when computed on held-out data, assigns near-zero importance to pure-noise features such as random_noise. The two approaches often disagree, and that disagreement is informative.
If impurity importance says zip_code is important but permutation importance on held-out data says near-zero, the trees most likely split on its many distinct values to fit training noise rather than real signal.
Permutation Importance: Step by Step
Train your model on (X_train, y_train). Compute baseline metric (e.g., accuracy) on X_val.
For feature j in {1, …, p}: shuffle column j in X_val (replace with random permutation), compute metric on shuffled data, restore column j.
Importance of j = baseline metric − shuffled metric. High drop = important feature.
Repeat K times (default K=5 in sklearn) and average to reduce variance from random shuffles.
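Sort features by importance score. Small negative scores (the metric improves slightly when the feature is shuffled) usually mean the feature is uninformative and the difference is due to chance; such features are candidates for removal, not evidence of leakage.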
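The steps above can be sketched by hand in a few lines. This is a minimal illustration, not sklearn's implementation: the function name `manual_permutation_importance` and the `make_classification` dataset are assumptions made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X_val, y_val, n_repeats=5, seed=0):
    """Return accuracy drops with shape (n_features, n_repeats)."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y_val, model.predict(X_val))  # baseline metric
    drops = np.zeros((X_val.shape[1], n_repeats))
    for j in range(X_val.shape[1]):          # one feature at a time
        for k in range(n_repeats):           # repeat to reduce shuffle variance
            X_shuffled = X_val.copy()        # the original column stays intact
            X_shuffled[:, j] = rng.permutation(X_val[:, j])
            shuffled = accuracy_score(y_val, model.predict(X_shuffled))
            drops[j, k] = baseline - shuffled
    return drops


# Hypothetical data: with shuffle=False, columns 0-2 are informative
# and columns 3-5 are pure noise.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    n_repeated=0, shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

drops = manual_permutation_importance(model, X_val, y_val)
importance = drops.mean(axis=1)

# Sort by mean importance, highest first.
for j in np.argsort(importance)[::-1]:
    print(f"feature {j}: {importance[j]:+.3f} ± {drops[j].std():.3f}")
```

Copying the validation matrix on every repeat keeps the sketch simple; shuffling in place and restoring the column afterwards, as the steps describe, would save memory on wide datasets.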
Feature Importance: Permutation & Impurity
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ── Synthetic tabular dataset ──────────────────────────────────────────────────
np.random.seed(42)
n = 1000
X = pd.DataFrame({
    "income":         np.random.normal(50, 15, n),
    "age":            np.random.randint(18, 70, n),
    "credit_score":   np.random.normal(650, 80, n),
    "loan_amount":    np.random.normal(20, 8, n),
    "employment_yrs": np.random.exponential(5, n),
    "num_accounts":   np.random.poisson(3, n),
    "random_noise":   np.random.randn(n),              # truly useless
    "zip_code":       np.random.randint(0, 10000, n),  # high-cardinality noise
})
y = (
    0.4 * (X["income"] > 55)
    + 0.3 * (X["credit_score"] > 660)
    + 0.2 * (X["age"] > 35)
    + 0.1 * np.random.rand(n)
) > 0.5

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# ── 1. Train Random Forest ─────────────────────────────────────────────────────
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# ── 2. Impurity (Gini) importance: fast, built-in ─────────────────────────────
impurity_imp = pd.Series(rf.feature_importances_, index=X.columns)
print("Impurity importance:")
print(impurity_imp.sort_values(ascending=False).round(3))
# WARNING: zip_code (high cardinality) may appear inflated here

# ── 3. Permutation importance: model-agnostic, honest ─────────────────────────
perm = permutation_importance(
    rf, X_val, y_val,
    n_repeats=10,        # shuffle 10 times, take mean ± std
    scoring="accuracy",
    random_state=42,
    n_jobs=-1
)
perm_imp = pd.DataFrame({
    "mean": perm.importances_mean,
    "std": perm.importances_std,
}, index=X.columns).sort_values("mean", ascending=False)
print("\nPermutation importance:")
print(perm_imp.round(3))
# random_noise and zip_code will be near zero or negative

# ── 4. Compare the two methods ────────────────────────────────────────────────
comparison = pd.DataFrame({
    "impurity": impurity_imp,
    "permutation": perm.importances_mean,
}).sort_values("permutation", ascending=False)
print("\nComparison (sorted by permutation):")
print(comparison.round(3))

# ── 5. Feature selection using permutation importance ─────────────────────────
selected = perm_imp[perm_imp["mean"] > 0.01].index.tolist()
print(f"\nSelected features ({len(selected)}): {selected}")
```
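Drop-column importance, the third method named at the top of this page, can be sketched in the same spirit: retrain once per feature and record how far the validation score falls. The helper `drop_column_importance` and the toy data are assumptions for illustration, not a library API.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def drop_column_importance(model, X_train, y_train, X_val, y_val):
    """Baseline validation accuracy minus accuracy after retraining without column j."""
    baseline = clone(model).fit(X_train, y_train).score(X_val, y_val)
    importances = np.zeros(X_train.shape[1])
    for j in range(X_train.shape[1]):
        keep = [c for c in range(X_train.shape[1]) if c != j]
        # clone() gives a fresh, unfitted copy with the same hyperparameters
        reduced = clone(model).fit(X_train[:, keep], y_train)
        importances[j] = baseline - reduced.score(X_val[:, keep], y_val)
    return baseline, importances


# Hypothetical data: column 0 is the main driver, column 1 a weaker one,
# columns 2 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
baseline, importances = drop_column_importance(rf, X_train, y_train, X_val, y_val)

print(f"baseline accuracy: {baseline:.3f}")
for j, imp in enumerate(importances):
    print(f"drop column {j}: {imp:+.3f}")
```

The cost is one full retrain per feature, which is why this approach is usually reserved for small feature sets or cheap models.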
SHAP: Unified Feature Attribution
SHAP (SHapley Additive exPlanations) unifies several additive attribution methods, including LIME and DeepLIFT, under a single axiomatic framework based on Shapley values. Every prediction is decomposed into a sum of per-feature contributions (ϕ_j) plus a base value, the expected model output. Unlike permutation importance, which is global, SHAP is local: it explains individual predictions, and a global ranking is commonly obtained by averaging |ϕ_j| across samples. TreeSHAP computes exact Shapley values for tree ensembles in polynomial time using a path-based algorithm, making it practical for production Random Forests and XGBoost models.
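To make the additive decomposition concrete, here is a brute-force sketch that enumerates every coalition for a tiny hypothetical model. It is neither TreeSHAP nor the shap library: `model`, `coalition_value`, and `exact_shapley` are made-up names, and the value of a coalition is assumed to be the mean prediction when the coalition's features are fixed to the instance and the rest are drawn from a background sample.

```python
from itertools import combinations
from math import factorial

import numpy as np


def model(X):
    """Hypothetical fitted model: two linear terms plus one interaction.
    Column 3 is deliberately ignored."""
    return 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 0] * X[:, 2]


def coalition_value(predict, x, background, subset):
    """v(S): mean prediction with features in S fixed to x, others from background."""
    X_mix = background.copy()
    idx = list(subset)
    if idx:
        X_mix[:, idx] = x[idx]
    return predict(X_mix).mean()


def exact_shapley(predict, x, background):
    """Exact Shapley values by enumerating all coalitions (exponential in p)."""
    p = x.shape[0]
    phi = np.zeros(p)
    for j in range(p):
        others = [f for f in range(p) if f != j]
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for S in combinations(others, size):
                with_j = coalition_value(predict, x, background, S + (j,))
                without_j = coalition_value(predict, x, background, S)
                phi[j] += weight * (with_j - without_j)
    return phi


rng = np.random.default_rng(0)
background = rng.normal(size=(200, 4))   # hypothetical reference sample
x = np.array([1.0, -0.5, 2.0, 0.3])      # the instance to explain

phi = exact_shapley(model, x, background)
base_value = model(background).mean()
prediction = model(x[None, :])[0]

print("phi:", np.round(phi, 3))
print("base value + sum(phi):", round(base_value + phi.sum(), 6))
print("model prediction:     ", round(prediction, 6))
```

The last two printed numbers should agree, which is the additivity property described above, and the ignored column receives zero attribution. Enumeration costs on the order of 2^p model evaluations per feature, which is the reason specialised algorithms such as TreeSHAP matter in practice.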
Correlated Features Split Importance Unfairly
When two features are highly correlated (e.g., income and credit_score), permutation importance underestimates both. Shuffling income still leaves credit_score intact, so the model recovers most of the signal. The true joint importance is shared between them, but each individual importance looks small. Mitigation: group correlated features and permute or drop each group together, or use SHAP with correlation-aware grouping. Single-feature drop-column importance does not escape the problem, because the retrained model simply leans on the remaining correlated feature. Also be aware that permutation importance depends on the validation set: scores change if you use different splits.
Never interpret near-zero permutation importance as 'useless' for correlated features without checking pairwise correlations first.
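A small experiment can illustrate the effect. The sketch below builds a hypothetical dataset with a near-duplicate feature, checks the pairwise correlation, and compares shuffling each column alone against shuffling the pair together. The helper `permutation_drop` and all data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def permutation_drop(model, X_val, y_val, cols, n_repeats=10, seed=0):
    """Mean accuracy drop when the columns in `cols` are shuffled together."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)
    drops = []
    for _ in range(n_repeats):
        order = rng.permutation(len(X_val))
        X_perm = X_val.copy()
        # Use the same row order for every column in the group, so the
        # relationship inside the group is preserved.
        X_perm[:, cols] = X_val[order][:, cols]
        drops.append(baseline - model.score(X_perm, y_val))
    return float(np.mean(drops))


# Hypothetical data: column 1 is a noisy copy of column 0.
rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
X = np.column_stack([
    signal,                                  # column 0: the real driver
    signal + rng.normal(scale=0.1, size=n),  # column 1: near-duplicate
    rng.normal(size=n),                      # column 2: pure noise
])
y = (signal + rng.normal(scale=0.3, size=n) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

corr = np.corrcoef(X_val[:, 0], X_val[:, 1])[0, 1]
single_0 = permutation_drop(model, X_val, y_val, [0])
single_1 = permutation_drop(model, X_val, y_val, [1])
grouped = permutation_drop(model, X_val, y_val, [0, 1])

print(f"correlation(col 0, col 1): {corr:.3f}")
print(f"shuffle col 0 alone:       {single_0:+.3f}")
print(f"shuffle col 1 alone:       {single_1:+.3f}")
print(f"shuffle cols 0 and 1:      {grouped:+.3f}")
```

Shuffling the pair together removes the signal entirely, so the grouped drop comes out larger than either single-column drop; the gap is the importance that single-feature permutation fails to attribute.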