ML Learning Hub
Foundations · beginner

Python ML Stack: NumPy, Pandas & Matplotlib

Your data science toolkit — NumPy, Pandas, Matplotlib and the Jupyter workflow

Master the tools every ML engineer uses daily — NumPy vectorized operations, Pandas DataFrames for real-world data, and Matplotlib/Seaborn for exploratory visualization. The foundation everything else builds on.

Concepts Covered

NumPy Arrays, Broadcasting, Pandas DataFrame, EDA, Matplotlib, Seaborn, Vectorization

Key Formulas

Vectorized Mean

np.mean(X) — NumPy computes this in C, orders of magnitude faster than a Python loop

Broadcasting

NumPy virtually stretches the smaller array along any missing or size-1 dimension, without copying data, so element-wise operations need no explicit loops

Pearson Correlation

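np.corrcoef(X, Y)[0, 1]: the Pearson coefficient r = cov(X, Y) / (σ_X σ_Y), a value in [-1, 1] that measures linear dependence between two features. np.corrcoef itself returns the full 2×2 correlation matrix.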
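
To make the correlation formula concrete, here is a minimal sketch that compares np.corrcoef with the coefficient computed by hand from its definition. The synthetic data, the seed and the variable names are illustrative assumptions, not part of the lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, chosen only for repeatability
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)      # y depends linearly on x, plus noise

# np.corrcoef returns the full 2x2 correlation matrix; the off-diagonal
# entry is the Pearson coefficient between x and y.
r_numpy = np.corrcoef(x, y)[0, 1]

# Same quantity from the definition: covariance over the product of the
# standard deviations (the 1/n factors cancel).
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

print(r_numpy, r_manual)                # the two values should agree
```

Both routes should give the same number up to floating-point error, which is a quick sanity check whenever you are unsure which entry of the matrix you need.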

🎯

Why This Stack Before Anything Else

motivation

Most of the Python ML ecosystem is built on or interoperates with NumPy arrays: scikit-learn consumes them directly, and PyTorch, TensorFlow and JAX expose tensor types that mirror NumPy's API and convert to and from it. Understanding how arrays work in memory (contiguous C-order layout, dtype, strides) is the difference between writing element-by-element Python loops and vectorized NumPy operations that do the same O(n) work in compiled C, typically orders of magnitude faster. Pandas gives you labeled DataFrames for messy real-world data, and Matplotlib/Seaborn let you see what's happening before you model it. The ecosystem speaks NumPy, so mastering it means mastering the lingua franca.

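As a rough illustration, a Python for-loop over 10M numbers takes ~4 seconds, while np.sum() takes ~8 ms, about 500× faster. Exact timings vary with hardware and Python version, but a gap of two or more orders of magnitude is typical. This matters when you're computing gradients over a neural network.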
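
If you want to measure the gap on your own machine, a minimal sketch along these lines should do. The helper loop_sum and the array size (1M values rather than 10M, to keep the run short) are assumptions made for this example, and the numbers you get will differ from the ones quoted above.

```python
import timeit
import numpy as np

data = np.random.default_rng(0).normal(size=1_000_000)  # 1M values keeps the run short

def loop_sum(values):
    """Sum with an explicit Python loop (one interpreter step per element)."""
    total = 0.0
    for v in values:
        total += v
    return total

# Average wall-clock time over 3 runs of each approach.
t_loop = timeit.timeit(lambda: loop_sum(data), number=3) / 3
t_numpy = timeit.timeit(lambda: np.sum(data), number=3) / 3

print(f"loop: {t_loop * 1e3:.1f} ms   np.sum: {t_numpy * 1e3:.3f} ms   "
      f"ratio: {t_loop / t_numpy:.0f}x")
```

In a notebook, %timeit loop_sum(data) and %timeit np.sum(data) give the same comparison with less code.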

⚙️

NumPy Essentials — What You Actually Need

algorithm
1. Array creation: np.array(), np.zeros(), np.ones(), np.linspace(), np.arange(), np.random.randn()
2. Shape manipulation: .reshape(), .T (transpose), np.concatenate(), np.stack(), np.squeeze()
3. Vectorized math: +, -, *, / apply element-wise with broadcasting; np.dot() or @ for matrix multiplication
4. Indexing: arr[2:5], arr[arr > 0] (boolean mask), arr[:, 0] (column slice)
5. Aggregations: .sum(), .mean(), .std(), .max(), .argmax(), all of which accept an axis= parameter
6. Broadcasting rule: align shapes from the right; each pair of dimensions must match or one of them must be 1 (missing leading dimensions count as 1)
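
The six items above can be walked through on one small array. This is a sketch on toy data; the array m and the other variable names are chosen for illustration only.

```python
import numpy as np

# 1. Creation
a = np.arange(12)                 # 0..11, shape (12,)
grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced points from 0 to 1 inclusive

# 2. Shape manipulation
m = a.reshape(3, 4)               # 3 rows, 4 columns
print(m.T.shape)                  # (4, 3)

# 3. Vectorized math and matrix multiplication
v = np.ones(4)
print((m * 2).shape, (m @ v).shape)   # (3, 4) (3,)

# 4. Indexing
print(m[:, 0])                    # first column: [0 4 8]
print(m[m > 8])                   # boolean mask: [ 9 10 11]

# 5. Aggregations with axis=
print(m.sum(axis=0))              # column sums: [12 15 18 21]
print(m.sum(axis=1))              # row sums: [ 6 22 38]

# 6. Broadcasting: shapes (3, 4) and (4,) align from the right -> (3, 4)
col_means = m.mean(axis=0)        # shape (4,)
centered = m - col_means          # each column now has mean 0
print(centered.mean(axis=0))      # [0. 0. 0. 0.]

# (3, 1) and (4,) -> (3, 4): every size-1 or missing axis is stretched
outer = m[:, :1] + np.arange(4)
print(outer.shape)                # (3, 4)

# (3, 4) and (3,) do NOT broadcast: trailing dimensions 4 and 3 differ
try:
    m + np.arange(3)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)           # True
```

A useful habit is to read axis=0 as "collapse the rows" (one result per column) and axis=1 as "collapse the columns" (one result per row).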

</>

NumPy, Pandas & Matplotlib — Full Workflow

code
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ── NumPy: arrays, broadcasting, vectorized ops ───────────────────────────────
X = np.random.randn(1000, 5)          # 1000 samples, 5 features
y = 2*X[:,0] - X[:,1] + 0.5*np.random.randn(1000)

print(X.shape, X.dtype)               # (1000, 5) float64
print(X.mean(axis=0).round(3))        # per-feature means ≈ 0
print(X.std(axis=0).round(3))         # per-feature stds ≈ 1

# Broadcasting: subtract mean and divide by std (manual StandardScaler)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Matrix multiply: X @ W where W is 5×2
W = np.random.randn(5, 2)
Z = X_scaled @ W                       # shape (1000, 2)

# Boolean indexing
high_feat0 = X[X[:,0] > 1.0]          # rows where feature 0 > 1 (about 1σ, since std ≈ 1)
print(f"Rows with feat_0 > 1: {len(high_feat0)}")

# ── Pandas: DataFrames, EDA ───────────────────────────────────────────────────
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])
df["target"] = y

# Quick EDA
print(df.describe().round(2))          # count, mean, std, quartiles
print(df.isnull().sum())               # check for missing values
print(df.dtypes)

# Groupby example
df["group"] = np.where(df["feat_0"] > 0, "high", "low")
print(df.groupby("group")["target"].agg(["mean","std"]).round(3))

# Correlations
corr = df.drop(columns="group").corr()
print(corr["target"].sort_values(ascending=False).round(3))

# ── Matplotlib / Seaborn: visualization ──────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Distribution plot
axes[0].hist(df["target"], bins=50, color="#6c63ff", alpha=0.8, edgecolor="white")
axes[0].set_title("Target distribution")
axes[0].set_xlabel("y")

# 2. Scatter + regression line
axes[1].scatter(df["feat_0"], df["target"], alpha=0.3, s=10, color="#06b6d4")
m, b = np.polyfit(df["feat_0"], df["target"], 1)
x_line = np.linspace(-3, 3, 100)
axes[1].plot(x_line, m*x_line + b, color="#ff6b6b", lw=2, label=f"slope={m:.2f}")
axes[1].set_title("Feature 0 vs Target")
axes[1].legend()

# 3. Correlation heatmap
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            center=0, ax=axes[2], cbar=False)
axes[2].set_title("Correlation matrix")

plt.tight_layout()
plt.show()

# ── Jupyter tips ──────────────────────────────────────────────────────────────
# %timeit np.dot(X, W)       # benchmark one statement (%%timeit for a whole cell)
# %matplotlib inline          # show plots in notebook
# df.head()                   # preview first 5 rows
# df.info()                   # dtypes + non-null counts
# pd.set_option('display.max_columns', None)  # show all columns
⚠️

The Most Common NumPy Bugs

pitfall

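1) Shape mismatch: (100,) ≠ (100,1). Combining the two broadcasts silently to (100,100) instead of raising an error, so always check .shape before matrix ops. Use .reshape(-1,1) to add a dimension.
2) Integer dtype: np.array([3])/2 gives 1.5 because / is true division, but // floors, floats assigned into a dtype=int array are truncated, and in-place arr /= 2 on an integer array raises a casting error.
3) Copies vs views: a basic slice such as arr[0:5] returns a VIEW, so modifying it modifies the original. Boolean and integer-array indexing return copies. Use .copy() when you need independent data.
4) In-place vs out-of-place: X *= 2 modifies X in-place; Y = X * 2 creates a new array.
5) NaN propagation: np.mean([1,2,np.nan]) = NaN. Use np.nanmean() for NaN-safe aggregations.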
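
Several of these pitfalls are easy to reproduce in a few lines. The sketch below covers shape mismatch, integer dtypes, in-place updates and NaN propagation; the array names (y_true, y_pred, counts and so on) are made up for the example.

```python
import numpy as np

# 1) Shape mismatch: (100,) vs (100, 1) broadcasts to (100, 100), no error raised
y_true = np.zeros(100)            # shape (100,)
y_pred = np.zeros((100, 1))       # shape (100, 1)
print((y_true - y_pred).shape)                    # (100, 100)
print((y_true.reshape(-1, 1) - y_pred).shape)     # (100, 1)

# 2) Integer dtype: true division returns floats,
#    but a float stored into an int array is truncated
counts = np.array([3, 5])
print(counts / 2)                 # [1.5 2.5]
counts[0] = 1.9                   # stored as 1
print(counts)                     # [1 5]

# 4) In-place vs out-of-place
x = np.ones(3)
alias = x                         # second name for the same array
x *= 2                            # in-place: alias sees the change
print(alias)                      # [2. 2. 2.]
x = x * 2                         # new array bound to x; alias is unchanged
print(alias)                      # [2. 2. 2.]

# 5) NaN propagation
vals = np.array([1.0, 2.0, np.nan])
print(np.mean(vals), np.nanmean(vals))   # nan 1.5
```

The first case is the one most likely to bite in practice, because a loss computed on a (100, 100) array still returns a number and nothing fails.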

np.shares_memory(a, b) tells you if two arrays share underlying data — crucial to know when you're 'copying' slices.
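
Here is a short sketch of np.shares_memory in action, contrasting a basic slice, an explicit copy and a boolean-mask selection. The variable names are illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[0:5]            # basic slice: a view onto arr's buffer
copy = arr[0:5].copy()     # explicit copy: independent data
masked = arr[arr < 5]      # boolean indexing: returns a copy

print(np.shares_memory(arr, view))     # True
print(np.shares_memory(arr, copy))     # False
print(np.shares_memory(arr, masked))   # False

view[0] = 99               # writes through to arr
copy[1] = -1               # does not touch arr
print(arr[:3])             # [99  1  2]
```

When a function receives an array and mutates it, this check is a quick way to find out whether the caller's data is at risk.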
