Python ML Stack: NumPy, Pandas & Matplotlib
“Your data science toolkit — NumPy, Pandas, Matplotlib and the Jupyter workflow”
Master the tools every ML engineer uses daily — NumPy vectorized operations, Pandas DataFrames for real-world data, and Matplotlib/Seaborn for exploratory visualization. The foundation everything else builds on.
Key Formulas
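Vectorized Mean
np.mean(X): NumPy computes this in compiled C code, typically orders of magnitude faster than an equivalent Python loop
Broadcasting
NumPy virtually stretches the smaller array along missing or size-1 dimensions without copying data, which avoids explicit loops
Pearson Correlation
np.corrcoef(X, Y): returns the correlation matrix; the off-diagonal entry is Pearson's r, which measures linear dependence between two features on a scale from -1 to 1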
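As a rough illustration of the Pearson entry above, the sketch below writes r out with vectorized operations and compares it against np.corrcoef. The synthetic arrays x and y, the noise level, and the fixed seed are assumptions made for the example, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed (assumption, for repeatability)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)      # y depends linearly on x, plus noise

# Pearson r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum_sq_x * sum_sq_y)
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# np.corrcoef returns the full 2x2 correlation matrix; r is the off-diagonal entry
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"manual r = {r_manual:.4f}, np.corrcoef r = {r_numpy:.4f}")
```

The two values should agree to floating-point precision. Indexing with [0, 1] is needed because np.corrcoef returns a matrix rather than a single number.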
Why This Stack Before Anything Else
Every major ML framework, including scikit-learn, PyTorch, TensorFlow, and JAX, either builds on NumPy arrays or mirrors their API. Understanding how arrays work in memory (contiguous C-order layout, dtype, strides) is the difference between writing slow, interpreted Python loops and vectorized NumPy operations that do the same O(n) work at C speed. Vectorization does not change the asymptotic complexity; it removes the per-element interpreter overhead. Pandas gives you labeled DataFrames for real-world messy data, and Matplotlib/Seaborn let you see what's happening before you model it. The entire ML ecosystem speaks NumPy, so mastering it is mastering the lingua franca.
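To make the memory-layout terms above concrete, here is a minimal sketch that inspects dtype, contiguity, and strides on a small array. The array X and its shape are chosen purely for illustration.

```python
import numpy as np

X = np.arange(12, dtype=np.float64).reshape(3, 4)   # 3 rows, 4 columns

print(X.dtype, X.itemsize)        # float64, 8 bytes per element
print(X.flags["C_CONTIGUOUS"])    # True: rows are stored one after another
print(X.strides)                  # (32, 8): 32 bytes to the next row, 8 to the next column

# Transposing swaps the strides instead of copying any data
Xt = X.T
print(Xt.strides)                 # (8, 32)
print(Xt.flags["C_CONTIGUOUS"])   # False: the transpose is no longer row-major
print(np.shares_memory(X, Xt))    # True: same buffer, different view
```

Strides are why operations such as .T and basic slicing are cheap: NumPy only changes how it steps through the existing buffer.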
A Python for-loop over 10M numbers takes ~4 seconds; np.sum() takes ~8 ms, roughly 500× faster. Exact timings vary by machine, but a gap of this size matters when you're computing gradients over a neural network.
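One way to check this kind of claim on your own machine is a small timing sketch like the one below. The helper loop_sum, the array size (1M rather than 10M, so it runs quickly), and the repeat counts are assumptions for the example; the measured ratio will differ from the figures quoted above.

```python
import timeit
import numpy as np

n = 1_000_000                      # smaller than 10M so the sketch runs quickly
values = np.random.default_rng(0).random(n)

def loop_sum(arr):
    """Sum element by element in the Python interpreter (hypothetical helper)."""
    total = 0.0
    for v in arr:
        total += v
    return total

# Time one pass of the loop and the average of ten np.sum calls
t_loop = timeit.timeit(lambda: loop_sum(values), number=1)
t_numpy = timeit.timeit(lambda: np.sum(values), number=10) / 10

print(f"Python loop: {t_loop * 1e3:.1f} ms")
print(f"np.sum:      {t_numpy * 1e3:.3f} ms")
print(f"speed-up:    {t_loop / t_numpy:.0f}x")
```

Inside Jupyter, %timeit gives more careful measurements than a single timeit call, since it repeats the statement and reports the spread.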
NumPy Essentials — What You Actually Need
Array creation: np.array(), np.zeros(), np.ones(), np.linspace(), np.arange(), np.random.randn()
Shape manipulation: .reshape(), .T (transpose), np.concatenate(), np.stack(), np.squeeze()
Vectorized math: +, -, *, / broadcast element-wise; np.dot() / @ for matrix multiplication
Indexing: arr[2:5], arr[arr > 0] (boolean mask), arr[:, 0] (column slice)
Aggregations: .sum(), .mean(), .std(), .max(), .argmax() — all accept axis= parameter
Broadcasting rule: align shapes from the right, dimensions must match or be 1
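The essentials listed above can be sketched in a few lines. The arrays A, row, and col are small made-up examples chosen so the results are easy to verify by hand.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)        # [[0 1 2], [3 4 5]], shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
col = np.array([[100], [200]])        # shape (2, 1)

# Shapes align from the right: (2, 3) with (3,) -> (2, 3)
print(A + row)
# (2, 3) with (2, 1): the size-1 axis is stretched -> (2, 3)
print(A + col)

# A (2,) array does not align with (2, 3): trailing dimensions are 3 vs 2
try:
    A + np.array([1, 2])
except ValueError as err:
    print("broadcast error:", err)

# axis=0 collapses rows (one value per column); axis=1 collapses columns
print(A.sum(axis=0))   # [3 5 7]
print(A.sum(axis=1))   # [ 3 12]

# Boolean mask and column slice
print(A[A > 2])        # [3 4 5]
print(A[:, 0])         # [0 3]
```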
NumPy, Pandas & Matplotlib — Full Workflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ── NumPy: arrays, broadcasting, vectorized ops ───────────────────────────────
X = np.random.randn(1000, 5)   # 1000 samples, 5 features
y = 2*X[:,0] - X[:,1] + 0.5*np.random.randn(1000)

print(X.shape, X.dtype)           # (1000, 5) float64
print(X.mean(axis=0).round(3))    # per-feature means ≈ 0
print(X.std(axis=0).round(3))     # per-feature stds ≈ 1

# Broadcasting: subtract mean and divide by std (manual StandardScaler)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Matrix multiply: X @ W where W is 5×2
W = np.random.randn(5, 2)
Z = X_scaled @ W                  # shape (1000, 2)

# Boolean indexing
high_income = X[X[:,0] > 1.0]     # rows where feature 0 > 1σ
print(f"High income rows: {len(high_income)}")

# ── Pandas: DataFrames, EDA ───────────────────────────────────────────────────
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])
df["target"] = y

# Quick EDA
print(df.describe().round(2))     # count, mean, std, quartiles
print(df.isnull().sum())          # check for missing values
print(df.dtypes)

# Groupby example
df["group"] = np.where(df["feat_0"] > 0, "high", "low")
print(df.groupby("group")["target"].agg(["mean","std"]).round(3))

# Correlations
corr = df.drop(columns="group").corr()
print(corr["target"].sort_values(ascending=False).round(3))

# ── Matplotlib / Seaborn: visualization ──────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Distribution plot
axes[0].hist(df["target"], bins=50, color="#6c63ff", alpha=0.8, edgecolor="white")
axes[0].set_title("Target distribution")
axes[0].set_xlabel("y")

# 2. Scatter + regression line
axes[1].scatter(df["feat_0"], df["target"], alpha=0.3, s=10, color="#06b6d4")
m, b = np.polyfit(df["feat_0"], df["target"], 1)
x_line = np.linspace(-3, 3, 100)
axes[1].plot(x_line, m*x_line + b, color="#ff6b6b", lw=2, label=f"slope={m:.2f}")
axes[1].set_title("Feature 0 vs Target")
axes[1].legend()

# 3. Correlation heatmap
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0,
            ax=axes[2], cbar=False)
axes[2].set_title("Correlation matrix")

plt.tight_layout()
plt.show()

# ── Jupyter tips ──────────────────────────────────────────────────────────────
# %timeit np.dot(X, W)                          # benchmark one statement (%%timeit for a whole cell)
# %matplotlib inline                            # show plots in notebook
# df.head()                                     # preview first 5 rows
# df.info()                                     # dtypes + non-null counts
# pd.set_option('display.max_columns', None)    # show all columns
The Most Common NumPy Bugs
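1) Shape mismatch: (100,) ≠ (100,1). Combining the two broadcasts silently to (100,100) instead of raising an error. Always check .shape before matrix ops. Use .reshape(-1,1) to add a dimension.
2) Integer dtypes: np.array([3])/2 gives 1.5 (true division), but an array with dtype=int truncates any float assigned into it, and in-place arr /= 2 on an integer array raises a casting error. Use // for floor division, or convert with .astype(float) first.
3) Copying vs views: arr[0:5] returns a VIEW, so modifying it modifies the original. Boolean and integer-array indexing return copies instead. Use .copy() to be safe.
4) In-place vs out-of-place: X *= 2 modifies X (and every view of it) in-place; Y = X * 2 creates a new array.
5) NaN propagation: np.mean([1,2,np.nan]) = NaN. Use np.nanmean() for NaN-safe aggregations.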
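A minimal sketch reproducing pitfalls 1, 2, and 5 from the list above. The arrays y_true, y_pred, counts, and data are invented for the example.

```python
import numpy as np

# 1) Shape mismatch: (3,) minus (3, 1) silently broadcasts to (3, 3)
y_true = np.array([1.0, 2.0, 3.0])          # shape (3,)
y_pred = np.array([[1.0], [2.0], [3.0]])    # shape (3, 1)
print((y_true - y_pred).shape)               # (3, 3), not (3,)
print((y_true - y_pred.ravel()).shape)       # (3,) after flattening y_pred

# 2) Integer dtype: assigning a float into an int array truncates it
counts = np.array([1, 2, 3])                 # integer dtype
counts[0] = 2.9                              # stored as 2, with no warning
print(counts)                                # [2 2 3]
print(counts / 2)                            # true division returns floats
print(counts // 2)                           # floor division stays integer

# 5) NaN propagation
data = np.array([1.0, 2.0, np.nan])
print(np.mean(data))                         # nan
print(np.nanmean(data))                      # 1.5
```

The first case is the easiest to miss in practice, because a loss computed on the (3, 3) result still returns a number and nothing fails.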
np.shares_memory(a, b) tells you if two arrays share underlying data — crucial to know when you're 'copying' slices.
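As a short sketch of how np.shares_memory separates views from copies, the example below compares a basic slice, an explicit copy, and a boolean-mask selection. The array arr and the variable names are illustrative only.

```python
import numpy as np

arr = np.arange(10)

view = arr[0:5]            # basic slice: a view onto the same buffer
copy = arr[0:5].copy()     # explicit copy: independent data
masked = arr[arr < 5]      # boolean indexing: always a copy

print(np.shares_memory(arr, view))    # True
print(np.shares_memory(arr, copy))    # False
print(np.shares_memory(arr, masked))  # False

view[0] = 99               # writes through to arr
copy[1] = -1               # does not touch arr
print(arr[:3])             # [99  1  2]
```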