Applied ML · intermediate

Feature Importance & Selection

“Know which features your model actually relies on — then trust it more (or less)”

Permutation importance vs impurity (Gini) importance, SHAP unified attribution, drop-column importance, and how correlated features split scores unfairly — with interactive bar chart toggling between methods.

35 min

6 diagrams

6 Concepts Covered

Prerequisites

→Random Forest

→Gradient Boosting

Concepts Covered

Permutation Importance · Gini Importance · SHAP · Drop-Column · Feature Selection · Correlation Bias

∑Key Formulas

Permutation Importance

Accuracy drop when feature j is randomly shuffled — model-agnostic, works post-training

Gini Impurity Importance

Weighted impurity decrease across all splits on feature j, averaged over trees. Fast, but biased toward high-cardinality features.

SHAP (kernel)

Shapley value: each feature's average marginal contribution over all feature coalitions

Drop-Column Importance

Drop in validation score when the model is retrained without feature j. Often treated as the gold standard, but expensive: one retraining per feature.
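
Written out under standard definitions, the four scores above take roughly the following form. The notation is an assumption made for illustration, not taken from the page: s is the validation score, f the trained model, X^{π_k(j)} the validation matrix with column j permuted on repeat k, F the full set of p features, Δi(n) the impurity decrease at node n, N_n the number of training samples reaching that node, and v(S) the expected model output when only the features in S are known. The Gini line is stated up to the final normalization that makes the scores sum to 1.

```latex
\begin{align*}
\mathrm{PI}_j   &= s(f, X, y) - \frac{1}{K}\sum_{k=1}^{K} s\bigl(f, X^{\pi_k(j)}, y\bigr) \\
\mathrm{Gini}_j &= \frac{1}{T}\sum_{t=1}^{T}\ \sum_{n \in t:\ \mathrm{split}(n) = j} \frac{N_n}{N}\,\Delta i(n) \\
\phi_j          &= \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!}\,\bigl[v(S \cup \{j\}) - v(S)\bigr] \\
\mathrm{DC}_j   &= s\bigl(f_{F}, X, y\bigr) - s\bigl(f_{F \setminus \{j\}}, X_{-j}, y\bigr)
\end{align*}
```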

🎯

Why Feature Importance Is Non-Negotiable

motivation

Machine learning models are often black boxes: they produce outputs but hide their reasoning. Feature importance methods peel back that opacity by answering one question: which inputs does the model lean on most? This matters for three reasons. (1) Debugging: if your model leans heavily on 'random_noise', it is overfitting to noise; if it leans heavily on a feature that would not be available at prediction time, you have a data leakage problem. (2) Trust: regulators, doctors, and loan officers must understand model decisions. GDPR Article 22 restricts decisions based solely on automated processing, and the regulation's transparency provisions call for meaningful information about the logic involved. (3) Feature selection: importance scores guide dimensionality reduction. Dropping truly unimportant features reduces inference cost and can reduce overfitting to noise.

A credit scoring model that relies heavily on zip_code might score well on training data while acting as a proxy for race. Importance analysis surfaces this before deployment.

💡

Two Philosophies: What Does 'Important' Mean?

intuition

There are fundamentally two schools. (A) Structural importance asks 'how much did this feature help build the model?' Tree-based impurity importance is the canonical example, computed from split statistics during training. It's fast (no extra computation) but has a known bias: it inflates importance for high-cardinality features like zip_code, because more distinct values mean more candidate split points, and because it is computed on training data, where splits that merely fit noise still count. (B) Functional importance asks 'how much do the model's predictions degrade if I break this feature?' Permutation importance shuffles one feature at a time and measures the drop in the chosen metric. It's model-agnostic, works with any fitted estimator, and, when computed on held-out data, assigns near-zero importance to features like random_noise. The two approaches often disagree, and that disagreement is informative.

If impurity importance says zip_code is important but permutation importance on held-out data says near-zero, the trees most likely split on zip_code to fit noise in the training set rather than to capture signal.

⚙️

Permutation Importance: Step by Step

algorithm

Train your model on (X_train, y_train). Compute baseline metric (e.g., accuracy) on X_val.

For feature j in {1, …, p}: shuffle column j in X_val (replace with random permutation), compute metric on shuffled data, restore column j.

Importance of j = baseline metric − shuffled metric. High drop = important feature.

Repeat K times (default K=5 in sklearn) and average to reduce variance from random shuffles.

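Sort features by importance score. A negative score means the model scored better with the feature shuffled. Small negative values are usually just shuffle noise around zero for an uninformative feature; a consistently negative score suggests the model has overfit to that feature.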
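
The steps above can be sketched from scratch in a few lines. This is a minimal illustration, not sklearn's implementation: the helper name `manual_permutation_importance`, the synthetic data, and the choice of logistic regression are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X_val, y_val, n_repeats=5, seed=0):
    """Return (baseline accuracy, mean accuracy drop per feature)."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y_val, model.predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X_val.copy()                 # original column j is restored each repeat
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            shuffled = accuracy_score(y_val, model.predict(X_shuffled))
            drops.append(baseline - shuffled)         # importance = baseline - shuffled metric
        importances[j] = np.mean(drops)               # average over K repeats
    return baseline, importances


# Hypothetical data: the label depends strongly on column 0, weakly on
# column 1, and not at all on columns 2 and 3.
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 4))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline, importances = manual_permutation_importance(model, X_val, y_val)
ranking = np.argsort(importances)[::-1]               # most important feature first
print("baseline accuracy:", round(baseline, 3))
print("importances:", importances.round(3))
```

Because the model is never retrained, the cost is one prediction pass per feature per repeat, which is why this method scales to any fitted estimator.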

</>

Feature Importance: Permutation & Impurity

code

python · 66 lines

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ── Synthetic tabular dataset ──────────────────────────────────────────────────
np.random.seed(42)
n = 1000
X = pd.DataFrame({
    "income":         np.random.normal(50, 15, n),
    "age":            np.random.randint(18, 70, n),
    "credit_score":   np.random.normal(650, 80, n),
    "loan_amount":    np.random.normal(20, 8, n),
    "employment_yrs": np.random.exponential(5, n),
    "num_accounts":   np.random.poisson(3, n),
    "random_noise":   np.random.randn(n),           # truly useless
    "zip_code":       np.random.randint(0, 10000, n), # high-cardinality noise
})
y = (
    0.4 * (X["income"] > 55)
    + 0.3 * (X["credit_score"] > 660)
    + 0.2 * (X["age"] > 35)
    + 0.1 * np.random.rand(n)
) > 0.5

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# ── 1. Train Random Forest ─────────────────────────────────────────────────────
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# ── 2. Impurity (Gini) importance — fast, built-in ────────────────────────────
impurity_imp = pd.Series(rf.feature_importances_, index=X.columns)
print("Impurity importance:")
print(impurity_imp.sort_values(ascending=False).round(3))
# WARNING: zip_code (high cardinality) may appear inflated here

# ── 3. Permutation importance — model-agnostic, honest ───────────────────────
perm = permutation_importance(
    rf, X_val, y_val,
    n_repeats=10,          # shuffle 10 times, take mean ± std
    scoring="accuracy",
    random_state=42,
    n_jobs=-1
)
perm_imp = pd.DataFrame({
    "mean": perm.importances_mean,
    "std":  perm.importances_std,
}, index=X.columns).sort_values("mean", ascending=False)

print("\nPermutation importance:")
print(perm_imp.round(3))
# random_noise and zip_code should land near zero (small negative values are just shuffle noise)

# ── 4. Compare the two methods ────────────────────────────────────────────────
comparison = pd.DataFrame({
    "impurity": impurity_imp,
    "permutation": perm.importances_mean,
}).sort_values("permutation", ascending=False)
print("\nComparison (sorted by permutation):")
print(comparison.round(3))

# ── 5. Feature selection using permutation importance ────────────────────────
selected = perm_imp[perm_imp["mean"] > 0.01].index.tolist()
print(f"\nSelected features ({len(selected)}): {selected}")
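
Drop-column importance, listed under Key Formulas, is not covered by the listing above. A minimal sketch follows, assuming a hypothetical helper `drop_column_importance` and a small synthetic dataset in which only one column carries signal. It retrains a fresh clone of the estimator once per feature, so the cost grows linearly with the number of features.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def drop_column_importance(estimator, X_train, y_train, X_val, y_val):
    """Retrain once per feature and record the validation-accuracy drop."""
    baseline = clone(estimator).fit(X_train, y_train).score(X_val, y_val)
    drops = {}
    for col in X_train.columns:
        # Fit a fresh, unfitted copy on every column except `col`.
        reduced = clone(estimator).fit(X_train.drop(columns=col), y_train)
        drops[col] = baseline - reduced.score(X_val.drop(columns=col), y_val)
    return pd.Series(drops).sort_values(ascending=False)


# Hypothetical data: the label depends on "signal" only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=1500),
    "noise_a": rng.normal(size=1500),
    "noise_b": rng.normal(size=1500),
})
y = (X["signal"] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0)
dc_imp = drop_column_importance(rf, X_train, y_train, X_val, y_val)
print(dc_imp.round(3))
```

Keeping `random_state` fixed on the estimator means every retrained clone uses the same seed, so differences between runs reflect the dropped column rather than forest randomness.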

∑

SHAP: Unified Feature Attribution

math

SHAP (SHapley Additive exPlanations) unifies several additive attribution methods, including LIME, DeepLIFT, and classic Shapley value estimators, under a single axiomatic framework. Every prediction is decomposed into a sum of per-feature contributions (ϕ_j) plus a base value, the expected model output over a background dataset. Unlike permutation importance, which is global, SHAP is local: it explains individual predictions, and a global ranking is obtained by averaging |ϕ_j| across samples. TreeSHAP computes exact Shapley values for tree ensembles in polynomial time using a path-based algorithm, making it practical for production Random Forests and XGBoost models.
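
To make the decomposition concrete, the sketch below computes exact Shapley values for one prediction by enumerating every feature coalition. It is a brute-force illustration, not TreeSHAP and not the shap library: the helper `exact_shapley`, the synthetic data, and the choice of value function (features outside the coalition are filled in from background rows) are assumptions made for the example. The cost is exponential in the number of features, so it only suits toy problems.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def exact_shapley(predict, x, background):
    """Exact Shapley values for one row x, enumerating all coalitions.

    v(S) is the mean prediction when the features in S are fixed to their
    values in x and the remaining features come from the background rows.
    """
    p = x.shape[0]

    def value(subset):
        X_mix = background.copy()
        idx = list(subset)
        if idx:
            X_mix[:, idx] = x[idx]
        return predict(X_mix).mean()

    phi = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for S in combinations(others, size):
                phi[j] += weight * (value(S + (j,)) - value(S))
    return phi


# Hypothetical data: column 3 does not enter the target at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=300)
rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

background = X[:50]
x = X[100]
phi = exact_shapley(rf.predict, x, background)
base_value = rf.predict(background).mean()
prediction = rf.predict(x.reshape(1, -1))[0]

print("phi:", phi.round(3))
print("base + sum(phi):", round(base_value + phi.sum(), 6), "prediction:", round(prediction, 6))
```

The last line checks the additivity property described above: the base value plus the per-feature contributions reproduces the model's prediction for that row.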

⚠️

Correlated Features Split Importance Unfairly

pitfall

When two features are highly correlated (e.g., income and credit_score), permutation importance underestimates both. Shuffling income still leaves credit_score intact, so the model recovers most of the signal. The true joint importance is shared between them, but each individual importance looks small. Dropping one column at a time has the same blind spot, because the retrained model simply leans on the remaining twin. Solution: treat correlated features as a group, permuting or dropping them together, or use SHAP with correlation-aware grouping. Also be aware that permutation importance is validation-set dependent: scores change if you use different splits.

Never interpret near-zero permutation importance as 'useless' for correlated features without checking pairwise correlations first.
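
The effect is easy to reproduce on synthetic data. In the sketch below, two columns are noisy copies of the same latent signal, and the hypothetical helper `permutation_drop` shuffles either one column or a whole group of columns. All names and data here are assumptions made for the example. Grouped columns share one row permutation, which breaks their link to the target while keeping their correlation with each other intact.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def permutation_drop(model, X_val, y_val, cols, n_repeats=10, seed=0):
    """Mean accuracy drop when `cols` are shuffled together (same row order)."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_val, y_val)
    drops = []
    for _ in range(n_repeats):
        X_perm = X_val.copy()
        order = rng.permutation(len(X_val))
        X_perm[:, cols] = X_val[order][:, cols]
        drops.append(baseline - model.score(X_perm, y_val))
    return float(np.mean(drops))


rng = np.random.default_rng(0)
n = 2000
latent = rng.normal(size=n)
X = np.column_stack([
    latent + 0.05 * rng.normal(size=n),   # column 0: noisy copy of the latent signal
    latent + 0.05 * rng.normal(size=n),   # column 1: second, highly correlated copy
    rng.normal(size=n),                   # column 2: pure noise
])
y = (latent > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

drop_0 = permutation_drop(rf, X_val, y_val, cols=[0])
drop_1 = permutation_drop(rf, X_val, y_val, cols=[1])
drop_joint = permutation_drop(rf, X_val, y_val, cols=[0, 1])

print("column 0 alone:", round(drop_0, 3))
print("column 1 alone:", round(drop_1, 3))
print("columns 0 and 1 together:", round(drop_joint, 3))
```

Each column alone looks less important than the pair, because the intact twin still carries the signal when only one of them is shuffled.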
