Object Detection
Object detection is a computer vision task that involves identifying the location and category of one or more objects within an image or video frame, producing bounding boxes and class labels for each detected instance.
Object detection is a core computer vision task that combines object classification — determining what category an object belongs to — with object localisation — determining where in an image the object is located. Unlike image classification, which assigns a single label to an entire image, object detection systems must identify and bound all object instances of interest within a scene, even when multiple objects of different classes overlap or occlude one another. The global computer vision market, driven substantially by object detection applications, reached USD 19.82 billion in 2024 and is projected to surpass USD 58 billion by 2030.[^1]
Problem Formulation
Given an input image, an object detection model produces a set of detections, where each detection consists of:
- A bounding box — typically represented as (x_min, y_min, x_max, y_max) or (centre_x, centre_y, width, height) — enclosing the detected object.
- A class label identifying the object's category (e.g., "car", "person", "defect").
- A confidence score representing the model's certainty about the detection.
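The two box representations listed above carry the same information, so converting between them is plain arithmetic. The sketch below is illustrative only: the `Detection` container and the function names are hypothetical and are not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Detection:
    """One detection: a box in corner format, a class label, and a score."""
    box: Box          # (x_min, y_min, x_max, y_max)
    label: str
    score: float      # assumed here to lie in [0, 1]


def corners_to_centre(box: Box) -> Box:
    """Convert (x_min, y_min, x_max, y_max) to (centre_x, centre_y, width, height)."""
    x_min, y_min, x_max, y_max = box
    width = x_max - x_min
    height = y_max - y_min
    return (x_min + width / 2, y_min + height / 2, width, height)


def centre_to_corners(box: Box) -> Box:
    """Convert (centre_x, centre_y, width, height) to (x_min, y_min, x_max, y_max)."""
    cx, cy, width, height = box
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


det = Detection(box=(10, 20, 50, 80), label="car", score=0.92)
centre_box = corners_to_centre(det.box)
```

Datasets and frameworks differ in which convention they store, and in whether coordinates are in pixels or normalised to the image size, so it is worth checking the convention before mixing annotations from different sources.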
Post-processing steps such as Non-Maximum Suppression (NMS) remove redundant overlapping detections by retaining only the box with the highest confidence score where multiple predictions overlap significantly (measured by Intersection over Union, or IoU).
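A minimal sketch of IoU and greedy NMS in plain Python may make the procedure concrete. It assumes boxes in (x_min, y_min, x_max, y_max) format and a single class; the function names and the default threshold of 0.5 are illustrative choices, and production systems normally use a vectorised library routine instead.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is dropped
```

In multi-class settings NMS is usually applied per class, so that overlapping boxes of different categories do not suppress one another.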
Architectural Approaches
Two-Stage Detectors
Two-stage detectors, pioneered by the R-CNN family, first generate a set of candidate regions of interest (RoIs), then classify each region and refine its box. The original R-CNN relied on an external proposal algorithm (selective search); Faster R-CNN replaced it with a learned region proposal network (RPN), and Mask R-CNN added a branch that predicts a segmentation mask for each instance. These detectors tend to be highly accurate but slower than single-stage models, making them better suited to offline analysis than to real-time applications.
Single-Stage Detectors
Single-stage detectors perform detection in a single pass over the image without a separate proposal stage, making them significantly faster. The YOLO (You Only Look Once) family, introduced by Joseph Redmon et al. in 2015, is the most prominent example.[^2] The original YOLO divides the image into a grid and predicts bounding boxes and class probabilities directly from each grid cell, with each cell responsible for objects whose centre falls inside it. Successive versions have progressively improved accuracy and speed: YOLOv8 (Ultralytics, 2023) became a widely used community default, while YOLOv12 (February 2025) introduced an attention-centric architecture that integrates efficient attention mechanisms alongside convolutional operations to capture global context while maintaining real-time speeds.
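The grid idea can be sketched as follows. This is a simplified illustration in the style of the first YOLO version only: it assumes a 7 x 7 grid, box centres predicted as offsets within a cell, and width and height predicted relative to the whole image. Later YOLO versions use different parameterisations, and the function names here are hypothetical.

```python
from typing import Tuple


def responsible_cell(cx: float, cy: float, img_w: float, img_h: float,
                     grid_size: int = 7) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains the box centre."""
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col


def decode_cell_box(row: int, col: int, tx: float, ty: float,
                    tw: float, th: float, img_w: float, img_h: float,
                    grid_size: int = 7) -> Tuple[float, float, float, float]:
    """Turn cell-relative predictions into (centre_x, centre_y, width, height) in pixels.

    tx, ty: centre offsets in [0, 1] measured from the cell's top-left corner.
    tw, th: width and height as fractions of the full image.
    """
    cx = (col + tx) / grid_size * img_w
    cy = (row + ty) / grid_size * img_h
    return (cx, cy, tw * img_w, th * img_h)


cell = responsible_cell(224, 224, img_w=448, img_h=448)
box = decode_cell_box(3, 3, 0.5, 0.5, 0.25, 0.5, img_w=448, img_h=448)
```

Because every cell emits its predictions in the same forward pass, the whole image is processed at once, which is where the speed advantage over proposal-based pipelines comes from.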
Transformer-Based Detectors
DETR (Detection Transformer), introduced by Facebook AI Research in 2020, treats detection as a direct set prediction problem: a transformer predicts a fixed-size set of boxes that is matched one-to-one with the ground-truth objects during training, which removes the need for hand-engineered NMS post-processing.[^3] RT-DETR and RF-DETR (2024–2025) have extended this approach to real-time performance while maintaining accuracy competitive with YOLO models, placing transformer-based models among the current frontier of detection architectures.
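Set prediction in DETR relies on a one-to-one matching between predicted and ground-truth objects during training. The sketch below illustrates that idea under strong simplifying assumptions: it uses only an L1 distance between boxes as the matching cost (the actual DETR cost also involves class probabilities and a generalised IoU term), and it finds the optimum by brute force over permutations rather than with the Hungarian algorithm, which is only practical for very small sets. All names are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]


def l1_cost(pred: Box, target: Box) -> float:
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(pred_boxes: Sequence[Box],
                      gt_boxes: Sequence[Box]) -> List[Tuple[int, int]]:
    """Return (pred_index, gt_index) pairs with minimum total cost.

    Each ground-truth box is assigned exactly one prediction; predictions
    left unmatched would be trained to output a "no object" class.
    """
    if len(gt_boxes) > len(pred_boxes):
        raise ValueError("need at least as many predictions as ground-truth boxes")
    best_cost = float("inf")
    best_pairs: List[Tuple[int, int]] = []
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, g) for g, p in enumerate(perm)]
    return best_pairs


preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.1, 0.1)]
targets = [(0.1, 0.1, 0.12, 0.1), (0.5, 0.5, 0.2, 0.25)]
pairs = match_predictions(preds, targets)
```

Because each ground-truth object is paired with exactly one prediction, the model is penalised for emitting duplicates, which is why no suppression step is needed at inference time.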
| Model Family | Stage | Speed | Accuracy | Notable for |
|---|---|---|---|---|
| Faster R-CNN | Two-stage | Moderate | High | Accuracy-focused tasks |
| YOLOv8 | Single-stage | Fast | High | Balance of speed and accuracy |
| YOLOv12 | Single-stage | Very fast | Very high | Attention-centric, 2025 standard |
| DETR / RT-DETR | Single-stage | Fast | Very high | Transformer-based, no NMS |
Training and Evaluation
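Object detection models are typically trained on large annotated datasets such as COCO (Common Objects in Context, 330,000 images, 80 categories) and Open Images. The standard evaluation metric is mean Average Precision (mAP). For a single category and a fixed IoU threshold, Average Precision (AP) is the area under the precision-recall curve, where a detection counts as a true positive only if its IoU with a not-yet-matched ground-truth box meets the threshold. mAP averages AP across all object categories; the COCO benchmark additionally averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05.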
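As a rough illustration of the metric, the sketch below computes AP for one category at one IoU threshold and then averages over categories. It assumes each detection has already been marked as a true or false positive by IoU matching, and it integrates the precision-recall curve using all points with a monotone precision envelope. Official benchmark tools differ in detail (COCO, for example, samples precision at fixed recall levels), so the numbers will not match them exactly. Function names are hypothetical.

```python
from typing import Dict, Sequence


def average_precision(scores: Sequence[float],
                      is_true_positive: Sequence[bool],
                      num_ground_truth: int) -> float:
    """Area under the precision-recall curve for one class and one IoU threshold."""
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (the usual "envelope" step).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap: Dict[str, float]) -> float:
    """Average the per-class AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


ap_car = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
m_ap = mean_average_precision({"car": ap_car, "person": 1.0})
```

To approximate the COCO-style figure, the same computation would be repeated for each IoU threshold and the results averaged.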
Data annotation is a significant cost and bottleneck. Labelling tools (e.g., Label Studio, CVAT, Roboflow) allow annotators to draw bounding boxes, while semi-automatic approaches use pre-trained models to propose initial annotations that humans then verify.
Applications
Object detection underpins a wide range of deployed systems. In autonomous driving, vehicles use real-time detection of pedestrians, vehicles, cyclists, and traffic signs from camera and lidar feeds. In industrial quality control, cameras on production lines detect surface defects, misaligned components, or foreign objects. In retail, shelf-monitoring systems track stock levels. In healthcare, detection models identify anatomical structures or lesions in radiology images. In surveillance and public safety, detection identifies people, vehicles, and prohibited items.
References
- MarketsandMarkets. (2024). Computer Vision Market — Global Forecast to 2030. MarketsandMarkets Research.
- Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. CVPR 2016, 779–788.
- Carion, N., Massa, F., Synnaeve, G., et al. (2020). End-to-end object detection with transformers. ECCV 2020.
- Ultralytics. (2025). YOLOv12: Attention-centric real-time object detectors. Ultralytics Documentation.