NLP: Text Classification Pipeline
“Teaching machines to read — from bag-of-words to transformers”
The full classical NLP pipeline: tokenization, TF-IDF vectorization, Naïve Bayes / Logistic Regression / SVM classification, evaluation (macro-F1), word embeddings vs TF-IDF, and sentence-transformers for semantic search.
Key Formulas
TF-IDF
Term frequency × inverse document frequency — high when word is frequent in doc but rare globally
Cosine Similarity
Document similarity measure independent of document length
Perplexity
Language model quality — lower perplexity = better next-word prediction
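The formulas themselves are not written out above. The standard textbook forms are given here as an assumption about what this section intends; the TF-IDF line follows the log(N/DF) definition used later in this lesson. Here N is the number of documents, df(t) the number of documents containing term t, and p the language model's predicted probability.

```latex
\begin{align*}
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)} \\[4pt]
\cos(\mathbf{a}, \mathbf{b}) &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} \\[4pt]
\mathrm{PPL}(w_1, \dots, w_n) &= \exp\!\left( -\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \dots, w_{i-1}) \right)
\end{align*}
```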
The NLP Revolution
In 2017, GPT-3 didn't exist. In 2023, LLMs write code, pass medical exams, and summarize legal documents. The foundation of all NLP — from bag-of-words spam filters to transformer-based LLMs — is the same: represent text numerically so models can process it. Understanding the classic NLP pipeline (tokenize → vectorize → model → evaluate) gives you the mental model to understand why modern transformers work and where they differ.
Even the largest language models still depend on classic NLP preprocessing: tokenization, deduplication, and data quality filtering. Fundamentals matter at scale.
The NLP Pipeline: 5 Stages
Raw text arrives as a sequence of Unicode characters, which a model cannot use directly. The NLP pipeline converts it to numbers in five stages: (1) Tokenization: split text into tokens (words, subwords, or characters). (2) Vocabulary building: assign an integer ID to each unique token. (3) Vectorization: convert token IDs to numeric vectors, either sparse (one-hot, TF-IDF) or dense (word embeddings). (4) Model training: classify, cluster, generate, or retrieve. (5) Evaluation: accuracy, F1, BLEU, or perplexity, depending on the task.
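The first three stages can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: the helper names `tokenize`, `build_vocab`, and `count_vector` are hypothetical, and the regex-based word splitting is a simplification of what real tokenizers do.

```python
import re

def tokenize(text):
    """Stage 1: lowercase and split into word tokens (simplified)."""
    return re.findall(r"\w+", text.lower())

def build_vocab(token_lists):
    """Stage 2: assign an integer ID to each unique token, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def count_vector(tokens, vocab):
    """Stage 3: bag-of-words vector with one count per vocabulary entry."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:          # tokens outside the vocabulary are ignored
            vec[vocab[tok]] += 1
    return vec

docs = ["The cat sat.", "The dog sat on the cat!"]
token_lists = [tokenize(d) for d in docs]
vocab = build_vocab(token_lists)
vectors = [count_vector(t, vocab) for t in token_lists]

print(vocab)    # token -> integer ID
print(vectors)  # one count vector per document
```

Each document becomes a vector whose length equals the vocabulary size, which is why these representations are sparse for realistic vocabularies.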
TF-IDF: The Classic Vectorizer
Term Frequency (TF): how often does term t appear in document d? Document Frequency (DF): how many documents contain t? Inverse Document Frequency (IDF): log(N / DF_t), where N is the total number of documents. Words that appear in every document (the, is, of) get an IDF at or near zero, so they contribute almost nothing; words specific to a few documents get a high IDF. TF-IDF = TF × IDF. Vectorizing a corpus yields a sparse matrix of shape (n_docs × vocab_size) in which each entry reflects how characteristic that word is of that document.
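As a worked example, the definition above can be computed by hand on a three-document toy corpus. This sketch uses raw counts for TF and the unsmoothed log(N / DF) for IDF; the corpus and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, already tokenized
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
N = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw-count TF times log(N / DF) for every term in one document."""
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

weights = [tfidf(doc) for doc in docs]
for w in weights:
    print({term: round(score, 3) for term, score in w.items()})
```

"the" appears in all three documents, so its weight is log(3/3) = 0, while "sat" appears in only one and gets the highest weight in its document. Note that scikit-learn's TfidfVectorizer will not reproduce these exact numbers, since by default it smooths the IDF and L2-normalizes each row.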
From Bag-of-Words to Word Embeddings
TF-IDF treats each word as an independent feature, so 'bank' and 'financial institution' share nothing and look completely unrelated. Word embeddings (Word2Vec, GloVe, FastText) learn dense vector representations in which similar words lie close together in vector space: king - man + woman ≈ queen. These vectors, typically a few hundred dimensions, capture semantic relationships that TF-IDF cannot. Modern sentence transformers (SBERT models such as all-MiniLM-L6-v2) produce fixed-length vectors for entire sentences, enabling semantic search, clustering, and zero-shot classification.
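The difference shows up directly in cosine similarity. In the sketch below, the bag-of-words vectors are exact for the stated three-word vocabulary, but the dense "embeddings" are hand-picked toy numbers, not output from any trained model; they only illustrate what nearby vectors look like.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Bag-of-words over the vocabulary ["bank", "financial", "institution"]
bow_bank = [1, 0, 0]
bow_fin_inst = [0, 1, 1]
bow_sim = cosine(bow_bank, bow_fin_inst)

# Hand-picked 3-d toy vectors standing in for learned embeddings (assumption)
emb_bank = [0.9, 0.8, 0.1]
emb_fin_inst = [0.8, 0.9, 0.2]
emb_sim = cosine(emb_bank, emb_fin_inst)

# Scaling a vector (e.g. a longer document) leaves the cosine unchanged
scaled_sim = cosine(emb_bank, [2 * x for x in emb_fin_inst])

print(f"bag-of-words similarity: {bow_sim:.3f}")
print(f"toy embedding similarity: {emb_sim:.3f}")
print(f"after scaling one vector: {scaled_sim:.3f}")
```

With no shared words the bag-of-words similarity is exactly zero, while dense vectors can express that two phrases are related. The last line illustrates why cosine similarity is independent of document length.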
For production text classification in 2025: start with TF-IDF + LogisticRegression as baseline, then try sentence-transformers embeddings + classifier, then fine-tune a pre-trained BERT/DistilBERT if quality is still insufficient.
Text Classification Pipeline
Preprocessing: lowercasing, punctuation removal, optional stopword removal
Tokenization: word_tokenize or subword (BPE/WordPiece for transformers)
Vectorization: CountVectorizer → TfidfVectorizer → word2vec → BERT embeddings
Model: MultinomialNB (fast baseline), LogisticRegression (strong linear), SVM, fine-tuned BERT
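Evaluation: macro-F1 when every class matters equally (including rare ones), weighted-F1 when classes should count in proportion to their support, AUC-ROC for threshold-independent ranking quality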
Error analysis: inspect misclassified samples → improve features or labeling
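To make the evaluation step concrete, here is a small sketch that computes accuracy, macro-F1, and weighted-F1 by hand for a classifier that always predicts the majority class. The 95/5 label split and the `per_class_f1` helper are illustrative assumptions; in practice `sklearn.metrics.f1_score` with `average='macro'` or `average='weighted'` does the same job.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 per class from true positives, false positives, and false negatives."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

# 95 negatives, 5 positives; the "model" always predicts the majority class
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

f1 = per_class_f1(y_true, y_pred)
support = Counter(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1.values()) / len(f1)                            # every class counts equally
weighted_f1 = sum(f1[c] * support[c] for c in f1) / len(y_true)  # weighted by class size

print(f"accuracy:    {accuracy:.3f}")
print(f"macro-F1:    {macro_f1:.3f}")
print(f"weighted-F1: {weighted_f1:.3f}")
```

Accuracy and weighted-F1 both stay above 0.9 even though the minority class is never detected, while macro-F1 drops below 0.5 because the zero F1 of the rare class counts as much as the majority class.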
Complete NLP Classification Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report
import numpy as np

# ── Sample text data ───────────────────────────────────────────────────
# Toy corpus: 6 distinct documents repeated 40 times, so every score below
# will be near-perfect. Swap in a real labeled dataset for meaningful numbers.
corpus = [
    "machine learning algorithms data science python",
    "neural network deep learning pytorch tensorflow",
    "natural language processing text classification bert",
    "computer vision image recognition convolutional",
    "reinforcement learning reward policy agent",
    "data preprocessing feature engineering pipeline",
] * 40  # 240 samples, 6 classes
labels = list(range(6)) * 40

X_text = corpus
y = np.array(labels)

# Split indices (not texts) so the same split can be reused for the
# sentence-transformer embeddings further down.
indices = np.arange(len(X_text))
train_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=y, random_state=42)
X_train = [X_text[i] for i in train_idx]
X_test = [X_text[i] for i in test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# ── Baseline: TF-IDF + Logistic Regression ────────────────────────
pipe_lr = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=100_000,
        sublinear_tf=True,          # 1 + log(tf) dampens high frequencies
        strip_accents='unicode',
        analyzer='word',
        token_pattern=r'\w{2,}',    # ignore single-char tokens
        min_df=2,                   # ignore very rare words
    )),
    ('clf', LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced')),
])

# ── Alternative: TF-IDF + LinearSVC (fast, great for text) ────────
pipe_svm = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=100_000,
                              sublinear_tf=True)),
    ('clf', LinearSVC(C=0.5, class_weight='balanced', max_iter=2000)),
])

# ── Evaluate both with cross-validation ───────────────────────────
for name, pipe in [('LR', pipe_lr), ('SVM', pipe_svm)]:
    scores = cross_val_score(pipe, X_text, y, cv=5,
                             scoring='f1_macro', n_jobs=-1)
    print(f"{name}: macro-F1 = {scores.mean():.3f} ± {scores.std():.3f}")

# ── Per-class report on the held-out split ────────────────────────
pipe_lr.fit(X_train, y_train)
print(classification_report(y_test, pipe_lr.predict(X_test)))

# ── Modern approach: sentence embeddings ──────────────────────────
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
X_emb = encoder.encode(X_text, batch_size=256, show_progress_bar=True)

clf = LogisticRegression(max_iter=1000).fit(X_emb[train_idx], y[train_idx])
print(f"Sentence-BERT accuracy: {clf.score(X_emb[test_idx], y[test_idx]):.3f}")
```
NLP Pipeline Pitfalls
First: fitting TfidfVectorizer on the full dataset (before splitting) leaks test vocabulary into training, because the IDF values are computed with test document frequencies. Always fit inside a Pipeline applied to training data only. Second: using max_features without min_df. Very rare words (appearing in 1-2 documents) are noisy but still included; set min_df=2 (an absolute document count) or min_df=0.001 (a fraction of documents). Third: ignoring class imbalance. With a 95% majority class, accuracy is misleading; use macro-F1 and class_weight='balanced'. Fourth: not stemming or lemmatizing on small datasets, where 'run', 'running', and 'ran' should map to the same feature.
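The first pitfall can be seen without any library. The sketch below computes unsmoothed IDF values twice, once on training documents only and once on train plus test; the tiny corpus and the `idf_table` helper are hypothetical, chosen only to make the leak visible.

```python
import math

def idf_table(docs):
    """Unsmoothed IDF, log(N / DF), for every term seen in docs."""
    n = len(docs)
    vocab = {term for doc in docs for term in doc}
    return {t: math.log(n / sum(t in doc for doc in docs)) for t in vocab}

train_docs = [["cheap", "pills"], ["meeting", "notes"], ["cheap", "flights"]]
test_docs = [["crypto", "pills"], ["crypto", "offer"]]

idf_leaky = idf_table(train_docs + test_docs)  # fitted before splitting
idf_clean = idf_table(train_docs)              # fitted on training data only

# Test-only vocabulary appears in the leaky table but not the clean one
print("crypto" in idf_leaky, "crypto" in idf_clean)

# Weights of shared terms also shift once test documents are counted
print(round(idf_leaky["pills"], 3), round(idf_clean["pills"], 3))
```

The leaky table already knows the test-only word "crypto" and assigns a different weight to "pills", so any model trained on it has indirectly seen the test set. Wrapping the vectorizer in a Pipeline, as in the code above, keeps the fit restricted to each training fold.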
For non-English text, use language-specific tokenizers and pre-trained multilingual models (mBERT, XLM-RoBERTa) rather than English-centric pipelines. Many NLP libraries default to English-only behavior silently.
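A small illustration of how English-only assumptions fail silently, using only Python's `re` module. The ASCII pattern below is a hypothetical stand-in for an English-centric tokenizer, not the behavior of any particular library.

```python
import re

text = "Na\u00efve caf\u00e9"   # "Naïve café" with precomposed accented characters

# An ASCII-only pattern silently cuts words at every accented letter
ascii_tokens = re.findall(r"[a-z]+", text.lower())

# Python's \w is Unicode-aware for str patterns, so accented words survive
unicode_tokens = re.findall(r"\w+", text.lower())

print(ascii_tokens)
print(unicode_tokens)

# Whitespace splitting assumes spaces separate words; Chinese text has none,
# so the whole phrase comes back as a single "token"
chinese_tokens = "\u81ea\u7136\u8bed\u8a00\u5904\u7406".split()
print(len(chinese_tokens))
```

No error is raised in either failing case, which is what makes these defaults dangerous: the pipeline runs and simply produces poor features.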