ML Learning Hub
Applied ML · intermediate

NLP: Text Classification Pipeline

Teaching machines to read — from bag-of-words to transformers

The classical NLP pipeline end to end: tokenization, TF-IDF vectorization, classification with Naïve Bayes, logistic regression, and SVMs, evaluation with macro-F1, word embeddings versus TF-IDF, and sentence-transformers for semantic search.

45 min
10 diagrams
7 Concepts Covered

Prerequisites

Neural Networks
Naïve Bayes

Concepts Covered

Tokenization · TF-IDF · Bag of Words · N-grams · Cosine Similarity · Word2Vec · Sentence Transformers

Key Formulas

TF-IDF

Term frequency × inverse document frequency — high when word is frequent in doc but rare globally

Cosine Similarity

Document similarity measure independent of document length

Perplexity

Language model quality — lower perplexity = better next-word prediction
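
The three quantities above are usually written as follows. This is a sketch in common textbook notation, not notation taken from this page: N (number of documents), df_t (number of documents containing term t), and T (number of tokens in the evaluated sequence) are assumed symbols.

```latex
\begin{align*}
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}_t} \\
\cos(\mathbf{a}, \mathbf{b}) &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} \\
\mathrm{PPL} &= \exp\!\left(-\frac{1}{T} \sum_{i=1}^{T} \log p(w_i \mid w_{<i})\right)
\end{align*}
```

Scaling either vector by a positive constant leaves the cosine unchanged, which is why it does not depend on document length.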

🎯

The NLP Revolution

motivation

In 2017, GPT-3 did not exist. By 2023, LLMs were writing code, passing medical exams, and summarizing legal documents. The foundation of all NLP, from bag-of-words spam filters to transformer-based LLMs, is the same: represent text numerically so that models can process it. Understanding the classic NLP pipeline (tokenize → vectorize → model → evaluate) gives you the mental model for why modern transformers work and where they differ.

Even the largest language models still rely on classic preprocessing steps such as tokenization, deduplication, and data-quality filtering before training. Fundamentals matter at scale.

💡

The NLP Pipeline: 5 Stages

intuition

To a model, raw text is just a sequence of bytes with no inherent meaning. The NLP pipeline converts it to numbers in five stages: (1) Tokenization (split text into words, subwords, or characters), (2) Vocabulary building (assign an integer ID to each unique token), (3) Vectorization (turn tokens into numeric vectors: sparse one-hot or TF-IDF vectors, or dense word embeddings), (4) Model training (classify, cluster, generate, or retrieve), (5) Evaluation (accuracy, F1, BLEU, or perplexity, depending on the task).
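
The first three stages can be sketched in a few lines of plain Python. This is a minimal illustration that assumes a simple regex tokenizer and count-based vectors; the function names `tokenize`, `build_vocab`, and `bag_of_words` are hypothetical helpers, not part of any library.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"\w+", text.lower())

def build_vocab(tokenized_docs):
    """Assign an integer ID to each unique token, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def bag_of_words(tokens, vocab):
    """Return a count vector of length len(vocab); unknown tokens are skipped."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

docs = ["The cat sat on the mat.", "The dog sat."]
tokenized = [tokenize(d) for d in docs]
vocab = build_vocab(tokenized)
vectors = [bag_of_words(t, vocab) for t in tokenized]

print(vocab)    # token -> integer ID
print(vectors)  # one count vector per document
```

Real pipelines typically swap the regex for a library tokenizer and the count vectors for TF-IDF or embeddings, but the token → ID → vector flow stays the same.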

TF-IDF: The Classic Vectorizer

math

Term Frequency (TF): how often does word t appear in document d? Document Frequency (DF): how many documents contain t? Inverse Document Frequency (IDF): log(N / DF_t), where N is the total number of documents. Words that appear in every document (the, is, of) get an IDF near zero, so they contribute almost nothing. Words that are specific to a few documents get a high IDF. TF-IDF = TF × IDF. The result is a sparse matrix of shape (n_docs × vocab_size) in which each entry reflects how characteristic that word is of that document.
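
To make the definition concrete, here is a small sketch that computes TF-IDF by hand with the unsmoothed formula log(N / DF_t) given above. The function name `tfidf_matrix` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_matrix(tokenized_docs):
    """Compute raw-count TF times log(N / DF) for every document and term."""
    n_docs = len(tokenized_docs)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: count each term at most once per document
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        rows.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, idf, rows

docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
vocab, idf, rows = tfidf_matrix(docs)

for tok in vocab:
    print(f"{tok:>4}  idf={idf[tok]:.3f}")
```

In this toy corpus "the" appears in every document and gets an IDF of exactly zero, while "dog" and "ran" appear in one document each and get the highest IDF. Note that scikit-learn's TfidfVectorizer does not use this exact formula by default: it smooths the IDF and L2-normalizes each row, so its numbers will differ from this sketch.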

🔬

From Bag-of-Words to Word Embeddings

deepdive

TF-IDF treats every word as an independent feature, so 'bank' and 'financial institution' share nothing in its representation. Word embeddings (Word2Vec, GloVe, FastText) learn dense vector representations in which similar words lie close together in vector space: king - man + woman ≈ queen. These vectors, typically a few hundred dimensions, capture semantic relationships that TF-IDF cannot. Modern sentence transformers (SBERT models such as all-MiniLM-L6-v2) produce fixed-length vectors for entire sentences, enabling semantic search, clustering, and zero-shot classification.
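
The analogy arithmetic can be illustrated with cosine similarity on a few hand-made vectors. This is only a sketch: the three-dimensional vectors below are invented for the example (their dimensions loosely stand for royalty, male, and female) and are not learned embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumption for illustration, NOT real embeddings).
# Hypothetical dimensions: royalty, male, female
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Rank every word by cosine similarity to the target vector
ranked = sorted(toy, key=lambda w: cosine(target, toy[w]), reverse=True)
nearest = ranked[0]
print(nearest, [round(cosine(target, toy[w]), 3) for w in ranked])
```

With real Word2Vec or GloVe vectors the same lookup is run over the whole vocabulary, and the query words themselves are usually excluded from the candidates.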

For production text classification in 2025: start with TF-IDF + LogisticRegression as baseline, then try sentence-transformers embeddings + classifier, then fine-tune a pre-trained BERT/DistilBERT if quality is still insufficient.

⚙️

Text Classification Pipeline

algorithm

1. Preprocessing: lowercasing, punctuation removal, optional stopword removal

2. Tokenization: word-level (e.g. word_tokenize) or subword (BPE/WordPiece for transformers)

3. Vectorization, in increasing order of sophistication: CountVectorizer → TfidfVectorizer → word2vec → BERT embeddings

4. Model: MultinomialNB (fast baseline), LogisticRegression (strong linear), SVM, fine-tuned BERT

5. Evaluation: macro-F1 when every class should count equally (the usual choice under class imbalance), weighted-F1 when scores should be weighted by class support, plus AUC-ROC

6. Error analysis: inspect misclassified samples → improve features or labeling
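
The choice of averaging in the evaluation step matters most when classes are imbalanced. The sketch below computes per-class, macro, and support-weighted F1 by hand on a made-up example in which a classifier always predicts the majority class; the function names are hypothetical.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each class."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)

# Made-up imbalanced example: 9 majority samples, 1 minority sample,
# and a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}  macro-F1={macro_f1(y_true, y_pred):.3f}  "
      f"weighted-F1={weighted_f1(y_true, y_pred):.3f}")
```

Accuracy and weighted-F1 both look healthy here even though the minority class is never predicted; macro-F1 exposes the failure because the zero score for that class counts as much as the majority class.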

</>

Complete NLP Classification Pipeline

code
python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
import numpy as np

# ── Sample text data ───────────────────────────────────────────────────
corpus = [
    "machine learning algorithms data science python",
    "neural network deep learning pytorch tensorflow",
    "natural language processing text classification bert",
    "computer vision image recognition convolutional",
    "reinforcement learning reward policy agent",
    "data preprocessing feature engineering pipeline",
] * 40   # 240 samples, 6 classes (toy data: exact duplicates, so scores will be near-perfect)
labels = list(range(6)) * 40
X_text = corpus
y = np.array(labels)
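# Split by index so that the same train/test partition can be reused
# for the sentence-embedding section further down
indices = np.arange(len(X_text))
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=y, random_state=42)
X_train = [X_text[i] for i in train_idx]
X_test  = [X_text[i] for i in test_idx]
y_train, y_test = y[train_idx], y[test_idx]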

# ── Baseline: TF-IDF + Logistic Regression ────────────────────────
pipe_lr = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1,2),
        max_features=100_000,
        sublinear_tf=True,          # log(1+tf) dampens high frequencies
        strip_accents='unicode',
        analyzer='word',
        token_pattern=r'\w{2,}',  # ignore single-char tokens
        min_df=2,                   # ignore very rare words
    )),
    ('clf', LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')),
])

# ── Alternative: TF-IDF + LinearSVC (fast, great for text) ────────
pipe_svm = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), max_features=100_000, sublinear_tf=True)),
    ('clf', LinearSVC(C=0.5, class_weight='balanced', max_iter=2000)),
])

# ── Evaluate both with cross-validation ───────────────────────────
for name, pipe in [('LR', pipe_lr), ('SVM', pipe_svm)]:
    scores = cross_val_score(pipe, X_text, y, cv=5, scoring='f1_macro', n_jobs=-1)
    print(f"{name}: macro-F1 = {scores.mean():.3f} ± {scores.std():.3f}")

# ── Modern approach: sentence embeddings ──────────────────────────
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
X_emb = encoder.encode(X_text, batch_size=256, show_progress_bar=True)
clf = LogisticRegression(max_iter=1000).fit(X_emb[train_idx], y[train_idx])
print(f"Sentence-BERT accuracy: {clf.score(X_emb[test_idx], y[test_idx]):.3f}")
⚠️

NLP Pipeline Pitfalls

pitfall

First: fitting TfidfVectorizer on the full dataset before splitting leaks test information into training, because the vocabulary and IDF values are computed with test document frequencies. Always fit the vectorizer inside a Pipeline so that it only ever sees training data. Second: using max_features without min_df. Very rare words (appearing in only one or two documents) are noisy but can still make it into the vocabulary. Set min_df=2 (an absolute document count) or min_df=0.001 (a fraction of documents). Third: ignoring class imbalance. With a 95% majority class, accuracy is useless; use macro-F1 and class_weight='balanced'. Fourth: not stemming or lemmatizing on small datasets, where 'run', 'running', and 'ran' should map to the same feature.
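
The first pitfall is easy to see on a tiny example. The sketch below assumes scikit-learn is installed and uses made-up documents; it contrasts a vectorizer fitted on the training split only with one that has also seen the test documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration only
train_docs = [
    "spam offer win money",
    "meeting agenda project notes",
    "win a free offer now",
]
test_docs = ["free crypto offer", "project meeting tomorrow"]

# Correct: vocabulary and IDF statistics come from the training split only
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)  # reuses the training vocabulary and IDF

# Leaky: vocabulary and IDF are also computed from the test documents
leaky = TfidfVectorizer().fit(train_docs + test_docs)

print("crypto" in vec.vocabulary_)    # test-only word is unknown to the clean vectorizer
print("crypto" in leaky.vocabulary_)  # but known to the leaky one
print(X_train.shape, X_test.shape)    # same number of columns in both splits
```

Wrapping the vectorizer and classifier in a Pipeline and passing that to cross_val_score gives the correct behaviour automatically, because the vectorizer is refitted on the training portion of each fold.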

For non-English text, use language-specific tokenizers and pre-trained multilingual models (mBERT, XLM-RoBERTa) rather than English-centric pipelines. Many NLP libraries default to English-only behavior silently.
