Object Detection: YOLO & Faster-RCNN
"From image classification to locating and labelling every object in the scene"
From sliding windows to single-shot detectors: IoU, anchor boxes, NMS, mAP, and the two-stage vs one-stage architecture trade-off. How YOLO detects 80 object categories in real time at 30 FPS.
Key Formulas
IoU
Intersection over Union: a measure of bounding box quality; IoU > 0.5 is conventionally counted as a correct detection
mAP
Mean Average Precision: area under the Precision-Recall curve, averaged over all classes
YOLO Loss
Weighted sum: box regression + objectness confidence + class probabilities
NMS
Keep only the most confident box when multiple boxes heavily overlap the same object
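Written out, the first three definitions above take roughly the following form. This is a sketch under common conventions: the symbols p_c(r) (precision of class c as a function of recall), C (number of classes), and the loss weights λ are notation assumed here, and the exact loss terms differ between YOLO versions.

```latex
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}

\mathrm{AP}_c = \int_0^1 p_c(r)\, dr,
\qquad
\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c

\mathcal{L}_{\mathrm{YOLO}} =
  \lambda_{\mathrm{box}}\, \mathcal{L}_{\mathrm{box}}
+ \lambda_{\mathrm{obj}}\, \mathcal{L}_{\mathrm{obj}}
+ \lambda_{\mathrm{cls}}\, \mathcal{L}_{\mathrm{cls}}
```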
Beyond Classification: Where and What?
Image classification answers 'is there a cat?' Detection answers 'where are the cats, and are there also dogs?' This shift from a single label to a variable number of (class, bounding-box) outputs is what makes object detection the core task in autonomous driving, medical imaging, retail checkout, and surveillance. Camera-based self-driving systems typically run a real-time detector at 30 or more frames per second. The progression from sliding-window classifiers (DPM, 2010) to two-stage detectors (RCNN, 2014; Faster-RCNN, 2015) to single-stage detectors (YOLO v1, 2016 through v8, 2023) has made detection one of the fastest-moving areas in computer vision.
Tesla's Autopilot runs 8 cameras through a custom detection network at 36 FPS on a 72 TOPS custom chip. The entire model must fit in a tight latency budget while detecting objects 200m away.
Two-Stage vs One-Stage: The Fundamental Trade-off
**Two-stage detectors (Faster-RCNN):** Stage 1: a Region Proposal Network (RPN) suggests ~300 candidate regions that might contain objects. Stage 2: a classification + regression head refines each proposal. Pro: high accuracy (a cropped region is easier to classify). Con: slow (the stages run sequentially).

**One-stage detectors (YOLO, SSD):** Divide the image into a grid. Each cell directly predicts bounding box offsets, objectness score, and class probabilities in a single forward pass. Pro: fast (real-time capable). Con: harder to train, and more likely to miss small or overlapping objects.

**Anchor-based vs anchor-free:** YOLO v2 and v3 used anchor boxes (predefined box shapes that predictions are expressed relative to); YOLO v1 had none. YOLO v8, FCOS, and CenterNet are anchor-free: they predict box geometry directly at each location, which is simpler and often better.
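To make "anchor boxes" concrete, here is a minimal sketch of how a set of anchor shapes can be generated from a few scales and aspect ratios. The function name `make_anchors` and the parameter values are illustrative assumptions (3 scales × 3 ratios, in the style commonly described for Faster-RCNN), not something this page specifies.

```python
import math

def make_anchors(base_size, scales, aspect_ratios):
    """Return a list of (width, height) anchor shapes.

    Each anchor has area (base_size * scale)**2 and width / height == ratio.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in aspect_ratios:
            w = math.sqrt(area * ratio)   # wider when ratio > 1
            h = math.sqrt(area / ratio)   # taller when ratio < 1
            anchors.append((w, h))
    return anchors

# Hypothetical settings: 3 scales x 3 aspect ratios = 9 anchor shapes.
anchors = make_anchors(base_size=16, scales=[8, 16, 32], aspect_ratios=[0.5, 1.0, 2.0])
for w, h in anchors:
    print(f"{w:7.1f} x {h:7.1f}")
```

An anchor-based detector then predicts small corrections to these shapes at every location, whereas an anchor-free detector skips this step and regresses the box directly.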
YOLO = 'You Only Look Once.' The insight: instead of running a classifier at thousands of sliding window positions, predict all boxes simultaneously in one pass of the network.
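A rough count shows why sliding windows are expensive. The sketch below assumes a 640×480 image, three square window sizes, and a 16-pixel stride; all of these numbers are illustrative choices, not taken from any particular detector.

```python
def count_windows(image_w, image_h, window, stride):
    """Number of positions a square window can take inside the image."""
    if window > image_w or window > image_h:
        return 0
    cols = (image_w - window) // stride + 1
    rows = (image_h - window) // stride + 1
    return cols * rows

# Hypothetical setup: three window sizes, 16-pixel stride, 640x480 image.
total = sum(count_windows(640, 480, size, stride=16) for size in (64, 128, 256))
print(total)  # 2133 classifier evaluations, versus one forward pass for YOLO
```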
YOLO Inference Pipeline
Divide the input image into an S×S grid (e.g., 13×13 for 416px input in YOLO v2).
For each cell: predict B bounding boxes (each: x, y, w, h relative to cell, + objectness score) and C class probabilities.
Box coordinates: x, y are the offsets of the box center from the cell's top-left corner (0 to 1, in cell units); w/h are log-scale offsets from anchor sizes.
Objectness × class probability = class-specific confidence score for each box.
Apply Non-Maximum Suppression (NMS): for each class, sort boxes by confidence, keep highest-confidence box, suppress boxes with IoU > 0.5 with the kept box, repeat.
Final output: variable-length list of (class, confidence, x1, y1, x2, y2) tuples.
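The decoding and scoring steps above can be sketched as follows. This assumes a YOLO v2-style parameterisation (sigmoid for the center offsets, exponential for the size) and, for simplicity, anchors given in pixels; the function names and the example numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(t, cell_col, cell_row, anchor_wh, grid_size, image_size):
    """Turn raw outputs (tx, ty, tw, th) into pixel-space (x1, y1, x2, y2)."""
    tx, ty, tw, th = t
    stride = image_size / grid_size            # pixels per grid cell
    # Center: sigmoid keeps the offset inside the cell, measured from its top-left corner.
    cx = (cell_col + sigmoid(tx)) * stride
    cy = (cell_row + sigmoid(ty)) * stride
    # Size: log-scale offset applied to the anchor (assumed here to be in pixels).
    w = anchor_wh[0] * math.exp(tw)
    h = anchor_wh[1] * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def class_scores(objectness, class_probs):
    """Class-specific confidence = objectness x class probability."""
    return [objectness * p for p in class_probs]

# All-zero raw outputs put the box at the middle of cell (6, 6) with the anchor's own size.
box = decode_box((0.0, 0.0, 0.0, 0.0), cell_col=6, cell_row=6,
                 anchor_wh=(100.0, 80.0), grid_size=13, image_size=416)
print(box)  # (158.0, 168.0, 258.0, 248.0)

scores = class_scores(0.9, [0.1, 0.8, 0.1])
print([f"{s:.2f}" for s in scores])  # ['0.09', '0.72', '0.09']
```

The decoded boxes and their scores are what NMS then filters, class by class.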
Object Detection with YOLOv8 (Ultralytics)
```python
# pip install ultralytics
from ultralytics import YOLO
import numpy as np

# --- 1. Load pretrained YOLOv8 ---
model = YOLO("yolov8n.pt")  # nano model (3.2M params, fastest)
# Other sizes: yolov8s.pt, yolov8m.pt, yolov8l.pt, yolov8x.pt

# --- 2. Inference on a single image ---
# conf = minimum confidence to keep a box, iou = NMS overlap threshold
results = model("path/to/image.jpg", conf=0.25, iou=0.5)
for r in results:
    boxes = r.boxes  # Boxes object
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # absolute pixel coords
        conf = box.conf[0].item()              # confidence score
        cls = int(box.cls[0].item())           # class index
        label = model.names[cls]
        print(f"{label}: {conf:.2f} at ({x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f})")

# --- 3. Fine-tuning on a custom dataset ---
# Dataset format: YOLO txt format
# data.yaml:
#   train: /path/to/train/images
#   val: /path/to/val/images
#   nc: 3  # number of classes
#   names: ['cat', 'dog', 'car']
model = YOLO("yolov8s.pt")  # start from COCO-pretrained detection weights
model.train(
    data="data.yaml",
    epochs=50,
    imgsz=640,
    batch=16,
    lr0=0.01,     # initial learning rate
    lrf=0.01,     # final lr as a fraction of lr0
    mosaic=1.0,   # mosaic augmentation probability
    fliplr=0.5,   # horizontal flip probability
    device=0,     # GPU 0
)
metrics = model.val()  # evaluate on the val split from data.yaml
print(f"mAP50: {metrics.box.map50:.4f}")

# --- 4. IoU calculation from scratch ---
def iou(box1, box2):
    """box = [x1, y1, x2, y2]"""
    x1 = max(box1[0], box2[0]); y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2]); y2 = min(box1[3], box2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (area1 + area2 - inter + 1e-6)  # epsilon avoids division by zero

gt = [100, 50, 250, 200]
pred = [110, 60, 260, 210]
print(f"\nIoU = {iou(gt, pred):.4f}")

# --- 5. Manual NMS ---
def nms(boxes, scores, iou_threshold=0.5):
    """Boxes: (N,4) xyxy, Scores: (N,)"""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))  # plain int so the printed list is clean
        ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]])
        order = order[1:][ious < iou_threshold]  # drop boxes overlapping the kept one
    return keep

boxes = np.array([[100, 50, 250, 200], [105, 55, 255, 205], [200, 100, 350, 250]])
scores = np.array([0.95, 0.87, 0.72])
kept = nms(boxes, scores)
print(f"Kept boxes: {kept}")  # [0, 2]: box 1 suppressed (overlaps box 0)
```
mAP and the IoU Threshold Trap
mAP@0.5 (IoU threshold 0.5) and mAP@0.5:0.95 (average over IoU thresholds from 0.5 to 0.95 in 0.05 steps) tell very different stories. A model with great mAP@0.5 but poor mAP@0.5:0.95 localises objects loosely: fine for coarse tasks, bad for robotic grasping. Also, mAP weights every class equally in a single average, so a strong overall number can hide poor performance on rare classes. For imbalanced datasets (e.g., rare traffic signs), report per-class AP separately.

Common pitfalls: (1) Forgetting to normalize bounding box coordinates to image size (the YOLO label format expects values between 0 and 1). (2) Setting the confidence threshold too low at inference, which floods NMS and the final output with weak boxes; conf_threshold ≈ 0.25 is a reasonable starting point. (3) Overfitting on small datasets; use strong augmentation (mosaic, random crop, color jitter).
A 1% mAP improvement on the COCO benchmark (an 80-class, 330k-image dataset) can represent months of research, so context matters when comparing models in your domain.
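The gap between the two metrics is easy to reproduce on a toy case. The sketch below computes AP for a single class on a single image with greedy matching and all-point interpolation; this is a simplification (COCO's official evaluation matches across a whole dataset and uses 101 recall points), and the boxes are made-up examples. mAP would average this value over classes.

```python
def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thr):
    """preds: list of (score, box); gts: list of boxes. Single class, single image."""
    matched, tp_flags = set(), []
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        # Greedily match each prediction to the best still-unmatched ground truth.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j not in matched and box_iou(box, gt) > best_iou:
                best_iou, best_j = box_iou(box, gt), j
        is_tp = best_j >= 0 and best_iou >= iou_thr
        if is_tp:
            matched.add(best_j)
        tp_flags.append(is_tp)
    # Precision and recall after each detection, in descending score order.
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

gts = [[100, 50, 250, 200], [300, 100, 400, 220]]
preds = [(0.9, [110, 60, 260, 210]),   # loosely localised (IoU about 0.77)
         (0.8, [300, 100, 400, 220])]  # exact match

thresholds = [0.5 + 0.05 * i for i in range(10)]
ap50 = average_precision(preds, gts, 0.5)
ap_range = sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
print(f"AP@0.5 = {ap50:.2f}, AP@0.5:0.95 = {ap_range:.2f}")  # AP@0.5 = 1.00, AP@0.5:0.95 = 0.70
```

The loosely placed box counts as correct up to a threshold of 0.75 and as a false positive beyond it, which is exactly the behaviour the stricter metric is designed to expose.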