YOLO (You Only Look Once)
YOLO is a family of real-time object detection models that frame detection as a single regression problem, predicting bounding boxes and class probabilities directly from an image in one network pass.
YOLO, short for You Only Look Once, is a family of object detection models that locate and classify multiple objects in an image in a single pass through a neural network. Introduced in 2015 by Joseph Redmon and colleagues, YOLO reframed object detection, previously a slow multi-stage pipeline, as a single regression problem in which a convolutional neural network predicts bounding box coordinates and class probabilities simultaneously. This design made real-time detection practical and has kept the YOLO line among the most widely used computer vision systems.
How YOLO works
A YOLO model divides the input image into a grid. Each grid cell predicts a fixed number of bounding boxes and is responsible for objects whose centre falls inside it. Every box carries a confidence score that reflects both how likely it is to contain an object and how well it fits that object, and the network also outputs class probabilities. Because the whole image is processed at once, the network reasons globally about context rather than examining isolated regions, which reduces false detections on background regions.
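The grid scheme can be made concrete with a little arithmetic. The sketch below assumes the configuration reported in the original YOLOv1 paper (a 7 x 7 grid, 2 boxes per cell, 20 classes, 448 x 448 input); those numbers are not stated in this article, and later versions lay out their outputs differently. The function names are illustrative only and do not come from any YOLO library.

```python
def output_shape(grid_size, boxes_per_cell, num_classes):
    """Shape of a YOLOv1-style prediction tensor: S x S x (B * 5 + C).

    Each box contributes 5 numbers (x, y, w, h, confidence); the C class
    probabilities are shared by all boxes in a cell (a YOLOv1 assumption).
    """
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)


def responsible_cell(cx, cy, image_w, image_h, grid_size):
    """Return (row, col) of the grid cell that contains the box centre.

    cx, cy are the centre coordinates in pixels. The result is clamped so a
    centre lying exactly on the right or bottom edge still maps to a valid cell.
    """
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


def class_scores(box_confidence, class_probs):
    """Class-specific scores: box confidence times each class probability."""
    return [box_confidence * p for p in class_probs]


if __name__ == "__main__":
    # Assumed YOLOv1 settings: S=7, B=2, C=20.
    print(output_shape(7, 2, 20))
    # An object centred at pixel (224, 100) in a 448 x 448 image.
    print(responsible_cell(224, 100, 448, 448, 7))
    # A box with confidence 0.8 in a cell with two class probabilities.
    print(class_scores(0.8, [0.5, 0.25]))
```

Multiplying the box confidence by the class probabilities gives one score per class for each box, which is the quantity that is thresholded and passed on to post-processing in this style of detector.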
The many candidate boxes a network produces are filtered by non-maximum suppression, which removes heavily overlapping detections and keeps the most confident box for each object. The result is a clean set of labelled boxes. The single-stage nature of this pipeline is what distinguishes YOLO from earlier two-stage detectors that first proposed regions and then classified them; YOLO trades a small amount of accuracy on small or crowded objects for a large gain in speed.
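Non-maximum suppression is simple enough to sketch in a few lines. The version below is a minimal greedy implementation in plain Python, assuming boxes in `(x1, y1, x2, y2)` corner format and an overlap threshold of 0.5; both choices are assumptions made for illustration, and production YOLO implementations typically use optimised, per-class variants.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first.

    Boxes are visited in order of decreasing score; a box is kept only if it
    does not overlap any already-kept box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    scores = [0.9, 0.8, 0.7]
    # The second box overlaps the first (IoU of about 0.68) and is suppressed.
    print(non_max_suppression(boxes, scores))  # [0, 2]
```

A lower threshold suppresses more aggressively, which helps with duplicate detections but can remove genuine neighbours in crowded scenes, one reason crowded objects are harder for this kind of pipeline.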
Evolution of the family
YOLO has advanced through many versions, each refining architecture and training. YOLOv2 added anchor boxes and batch normalisation to improve localisation. YOLOv3 introduced a deeper backbone and multi-scale predictions that markedly improved detection of small objects. Later releases, developed by different research groups and companies, brought anchor-free designs, improved feature aggregation and broader task support. From YOLOv8 onward, the models from Ultralytics adopted an anchor-free mechanism and unified detection, segmentation, classification, pose estimation and oriented bounding box tasks in a single framework. Later versions, including YOLOv11 (named YOLO11 by Ultralytics) and YOLOv12, have continued to improve the trade-off between speed and accuracy, with some configurations rivalling slower two-stage methods.
| Version | Year | Notable change |
| --- | --- | --- |
| YOLOv1 | 2015 | Single-stage detection introduced |
| YOLOv3 | 2018 | Multi-scale prediction, deeper backbone |
| YOLOv8 | 2023 | Anchor-free, multi-task framework |
| YOLOv11 | 2024 | Improved feature extraction modules |
Applications
YOLO is deployed wherever fast, on-the-fly detection matters: video surveillance, autonomous vehicles and driver assistance, retail analytics, manufacturing quality inspection, agriculture, medical imaging and robotics. Its modest computational footprint relative to its accuracy lets it run on edge devices and embedded hardware, extending object detection beyond the data centre.
References
- Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Redmon, J. and Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv:1804.02767.
- Ultralytics. (2024). YOLO11 Documentation. Ultralytics.