YOLO Object Detection Explained: Models, Tools, Use Cases

YOLO (You Only Look Once) is a real-time object detection model known for its speed and accuracy. Learn how YOLO works, explore the different model versions and tools, and discover real-world use cases from autonomous driving to surveillance.

Ideal For:
ML/CV Engineers
Reading time:
12 mins
Category:
Models


TL;DR
  • What is YOLO in object detection? 

YOLO (You Only Look Once) is a real-time object detection algorithm that treats detection as a single regression problem. A single neural network predicts multiple bounding boxes and class probabilities for objects in one pass over the image. This one-stage approach makes YOLO extremely fast compared to traditional two-stage detectors.

  • How does the YOLO algorithm work? 

YOLO divides the input image into a grid and predicts bounding boxes (with coordinates for each box) and confidence scores for objects in those grid cells. If an object's center falls in a grid cell, that cell is responsible for detecting it. The network outputs the box coordinates, objectness score, and class probabilities for each predicted box, then uses Non-Maximum Suppression to filter overlapping detections. Unlike region proposal methods, YOLO processes the entire image in one forward pass – hence "you only look once".

  • What are the different YOLO models (v1–v12)? 

The YOLO family has evolved from YOLOv1 (2016) to YOLOv8 (2023) and beyond, each version improving accuracy and speed. For example, YOLOv2 introduced anchor boxes and batch normalization for better localization. YOLOv3 added a deeper backbone (Darknet-53) and multi-scale predictions (detecting small objects better). YOLOv4 incorporated CSPNet and mosaic data augmentation to further boost performance. Modern versions from YOLOv5 through YOLOv12 focus on lighter models, new neural network layers, and easier training, keeping YOLO state-of-the-art in real-time detection.

  • What are common use cases of YOLO? 

YOLO is used in any application requiring fast object detection. Notable examples include autonomous driving (detecting cars and pedestrians in real time), video surveillance (people or package detection on security cameras), robotics (vision for drones and industrial robots), and even medical imagery (e.g., detecting anomalies in scans). Its ability to detect objects in live video at high FPS makes it ideal for embedded vision systems and edge applications.

  • How can you start using YOLO? 

YOLO is available in open-source implementations. The original C/C++ Darknet framework (by Joseph Redmon) provides pre-trained YOLOv1–v4 models. For easier use, Python-based libraries like Ultralytics YOLOv5/YOLOv8 offer pretrained models on COCO and simple APIs to detect objects in images or video. You can fine-tune YOLO on a custom dataset by annotating images with bounding boxes and training the network (many tutorials and GitHub repos guide this). Because YOLO is open-source, a large community has built tools, extensions, and improvements around it, making it accessible even if you're not training from scratch.

YOLO (You Only Look Once) is one of the most popular object detection models, known for its speed and accuracy. It processes images in real time, making it useful for applications like autonomous driving, surveillance, and robotics.

Here we will cover:

  • What is YOLO?
  • How does YOLO work?
  • Evolution of YOLO: From v1 to v12
  • How to implement YOLO
  • Use cases and applications


By the end, you'll understand how YOLO works, its strengths and trade-offs, and how to use it for various object detection tasks.

What is YOLO for Object Detection?

YOLO is a real-time object detection model that processes an entire image in a single pass. Introduced by Joseph Redmon et al. in 2015, YOLO reframed object detection as a single end-to-end regression problem: it directly maps image pixels to bounding box coordinates and class probabilities. This design made YOLO significantly faster than previous approaches.

Fig 1: Real-time object detection.

Why is YOLO Revolutionary?

Previously, popular object detection models like Fast R-CNN used a two-stage approach: they would first generate region proposals and then classify them. This made object detection complex and too slow for real-time processing.

YOLO used a single convolutional neural network (CNN) and eliminated the region proposal step. This was revolutionary as it made the process simple and enabled real-time detection with competitive accuracy.

The first version of YOLO had lower localization accuracy compared to two-stage methods, but later versions (YOLOv2, v3, etc.) closed this gap. The ability to process at 30+ FPS with high mean Average Precision (mAP) made YOLO practical for real-time applications like video analysis, drone vision, and mobile object detection.

One-Stage vs. Two-Stage Detectors

Object detection models are typically categorized into two groups: two-stage and one-stage detectors. The key difference lies in how they process an image to detect objects.

Two-Stage Detectors: High Accuracy, Slower Speed

Two-stage detectors, like Faster R-CNN, break object detection into two separate steps:

  1. Region Proposal: A Region Proposal Network (RPN) scans the image and suggests potential object locations.

  2. Classification & Refinement: Each proposed region is classified and refined to improve accuracy.
Fig 2: Architecture of the region proposal network.

This method is highly accurate because the deep learning model focuses on likely object regions before classifying the potential objects. However, it also adds computation and makes detection slower, typically achieving 5-7 FPS on a high-end GPU.

One-Stage Detectors: Faster, Real-Time Performance

One-stage detectors, like YOLO and the Single Shot Detector (SSD), skip the region proposal step and predict bounding boxes and class labels in a single network pass. This direct approach makes them significantly faster.

Fig 3: General architecture of single stage object detection.

YOLO was one of the first one-stage detectors to achieve high accuracy, outperforming earlier single-shot models like SSD. 

YOLO vs. Other Object Detection Algorithms

While YOLO dominates in speed, it’s useful to understand how it compares with other detection frameworks.

YOLO vs. Faster R-CNN

Faster R-CNN uses a Region Proposal Network (RPN) to generate ~300 object regions before classification. It achieves high accuracy but runs at roughly 5-7 FPS with a ResNet-101 backbone. YOLOv3, in contrast, runs at 20-45 FPS with slightly lower accuracy. While two-stage models historically had better localization for small objects, YOLOv7 has surpassed many two-stage models in accuracy.

Fig 4: Faster R-CNN framework showing two shot object detection.

YOLO vs. SSD

The Single Shot MultiBox Detector (SSD) also uses a single-stage approach like YOLO, with multi-scale feature maps and anchor boxes. YOLOv4, however, brought a significant improvement in accuracy over SSD.

Fig 5: Single shot detector.

YOLO vs. RetinaNet

RetinaNet used focal loss to handle class imbalance in classification and achieved accuracy comparable to two-stage detectors. It improved the detection of small objects, but at the cost of speed. Later YOLO versions (v4, v5) outperformed RetinaNet in both speed and accuracy, making YOLO the better choice for real-time tasks.

Fig 6: Illustration of RetinaNet architecture.

YOLO vs. EfficientDet

EfficientDet uses a pretrained EfficientNet backbone followed by a BiFPN feature network. This improved accuracy, but at lower speed: EfficientDet-D4 matched YOLOv4's accuracy but ran at ~8-11 FPS, while YOLOv4 achieved 62 FPS. Even EfficientDet-D7X, the most accurate variant, was slower than YOLOv7, and YOLOv7 outperformed it in accuracy as well.

Fig 7: Architecture of EfficientDet.


How YOLO Object Detection Works (Single-Shot Detection)

Here are some of the key components involved in the YOLO object detection algorithm:

Grid Division and Object Localization

YOLO first divides the input image into an S×S grid. Each grid cell is then responsible for detecting objects whose centers fall within it.

Class and Bounding Box Prediction

Each bounding box is defined by its center coordinates, width, height, and a confidence score that indicates the likelihood of an object being present. The model also assigns class probabilities to each grid cell, allowing it to identify different objects in a single inference step.
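
As a concrete illustration, with the original YOLOv1 settings (a 7×7 grid, 2 boxes per cell, and 20 PASCAL VOC classes), the output tensor works out as follows. These numbers are specific to v1; later versions use different layouts:

import numpy as np

# YOLOv1-style output layout: S x S grid, B boxes per cell, C classes.
S, B, C = 7, 2, 20

# Each cell predicts B boxes * (x, y, w, h, confidence) plus C class probabilities.
values_per_cell = B * 5 + C            # 30
output = np.zeros((S, S, values_per_cell))

print(output.shape)                    # (7, 7, 30)
print(S * S * B)                       # 98 candidate boxes per image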

Non-Maximum Suppression (NMS)

The algorithm often predicts multiple overlapping boxes for the same object. To eliminate duplicates, Non-Maximum Suppression (NMS) keeps the highest-confidence box and filters out overlapping boxes with lower confidence. This ensures that the final detections are not redundant.
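
To make the idea concrete, here is a minimal greedy NMS sketch in NumPy, with boxes given as [x1, y1, x2, y2] and one score each (production code typically uses an optimized routine such as torchvision.ops.nms):

import numpy as np

def iou(box, boxes):
    # Intersection-over-union between one box and an array of boxes.
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it, repeat.
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep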

Confidence Score

This score represents the probability that a bounding box contains an object, weighted by how well the predicted box fits it. At test time, it is combined with the grid cell's class probabilities to produce a class-specific confidence for each box.
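
In the original paper, box confidence is defined as Pr(Object) × IoU between the predicted and ground-truth boxes, and the class-specific score multiplies in the conditional class probability. A worked example with illustrative numbers:

# Box confidence: Pr(Object) * IoU(predicted box, ground truth).
p_object, iou_pred_truth = 0.9, 0.8
box_confidence = p_object * iou_pred_truth           # 0.72

# Class-specific confidence at test time: Pr(Class|Object) * box confidence.
p_dog_given_object = 0.95
class_score = p_dog_given_object * box_confidence    # ~0.68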

Multi-Scale Detection in Later Versions

Earlier versions of YOLO struggled with small-object detection. YOLOv2 introduced multi-scale training, and YOLOv3 added an FPN-style head to detect objects at different resolutions.

YOLO's single-shot architecture enables real-time performance, even on local machines. Its balance of speed, accuracy, and accessibility has made it widely adopted in various applications.

Evolution of YOLO Models: From v1 to v12

Since its introduction in 2015, YOLO has evolved through multiple versions, with each iteration improving the architecture's accuracy, efficiency, and adaptability. Here is an overview of each iteration:

YOLOv1 (2015): The Original YOLO

YOLOv1 was the first model to unify object detection into a single neural network. It used a 24-layer CNN similar to GoogLeNet and predicted two bounding boxes per cell across 20 classes.

Key Features

  • Single-stage object detection using a 7×7 grid.
  • Introduced grid-cell responsibility for object localization.
  • Predicts bounding boxes and class probabilities in a single forward pass.
  • Real-time performance: processes images at 45 frames per second.
Fig 8: YOLOv1 network pipeline.

Performance

YOLOv1 achieved 63.4% mAP on PASCAL VOC 2007 at 45 FPS on a GPU. However, it had lower accuracy than region-based methods like Faster R-CNN, particularly for small and overlapping objects.

Impact

YOLOv1 proved real-time object detection was feasible on a single GPU.

YOLOv2 (2016): Better and Faster

YOLOv2, also called YOLO9000, improved upon v1 by incorporating anchor boxes, a new backbone (Darknet-19), and batch normalization, allowing detection of over 9,000 object categories via joint training on detection and classification (ImageNet) data.

Fig 9: YOLOv2 demo.

Key Features

  • Introduced anchor boxes for better localization.
  • Also introduced batch normalization and high-resolution classification.
  • Used multi-scale training for better generalization across different object sizes.
  • High-resolution classifiers improved detection accuracy.

Performance

YOLOv2 ran at 67 FPS with 76.8% mAP on VOC 2007 (78.6% mAP at 40 FPS) and reached 21.6% AP on COCO. It surpassed YOLOv1 in both speed and accuracy.

Impact

YOLOv2 bridged the performance gap with state-of-the-art detectors while maintaining real-time speed, making it practical for industry applications.

YOLOv3 (2018): Multi-Scale Predictions

YOLOv3 introduced a deeper backbone (Darknet-53) compared to the Darknet-19 used in YOLOv2. The backbone uses residual connections, and the network adds a feature pyramid network (FPN) for multi-scale object detection.

Key Features

  • Multi-scale predictions at 3 different resolutions (13×13, 26×26, 52×52).
  • Uses logistic classifiers for predicting object classes instead of softmax, allowing for multi-label classification.
  • Uses anchor boxes with different scales and aspect ratios to better match the size and shape of the objects being detected.
Fig 10: YOLOv3 architecture.

Performance

YOLOv3 achieved 30 FPS with 33% mAP on COCO while significantly improving detection accuracy, especially for small objects. It was a strong competitor to Faster R-CNN, SSD, and RetinaNet while running 3-4 times faster, which made it the preferred choice for practical applications.

Impact

YOLOv3 became a widely used real-time detector, balancing speed and accuracy. However, in 2020, Redmon ceased research on YOLO, leaving further development to the community.

YOLOv4 (2020): Community-Driven Enhancements

Developed by Bochkovskiy, Wang, and Liao, YOLOv4 improved both speed and accuracy using CSPDarknet-53 as a backbone and numerous architectural optimizations.

Key Features

  • Introduced CSPNet for reduced computation.
  • Used Mish activation for better feature learning.
  • Added Mosaic data augmentation and Self-Adversarial Training (SAT).
  • Improved loss functions (CIOU loss) and regularization techniques.
Fig 11: Overall structure of YOLOv4 object detector.

Performance

YOLOv4 reached 62 FPS on a Tesla V100, offering a superior speed-accuracy balance. It surpassed YOLOv3 in both mAP and efficiency.

Impact

YOLOv4 established itself as the top choice for real-time object detection in 2020, gaining widespread adoption in research and industry.

YOLOv5 (2020): PyTorch Implementation

YOLOv5, released by Ultralytics, was the first major YOLO version implemented in PyTorch. While it was not accompanied by an official research paper, it became extremely popular due to its ease of use and modular framework.

Key Features

  • PyTorch implementation for easy training and deployment.
  • Smaller and faster deep learning models (YOLOv5s, YOLOv5m, etc.).
  • Augmentation techniques (Mosaic, MixUp, etc.) and improved anchor selection.

Performance

It trained and ran inference faster than YOLOv4, with competitive accuracy across benchmarks.

Fig 12: Performance of YOLOv5 variants.

Impact

YOLOv5 became widely adopted in computer vision applications because of its ease of use and performance. Also, it was optimized for mobile deployment.

YOLOv6 (2022): Industry-Focused Efficiency

YOLOv6 was developed by Meituan and optimized for industrial applications. It focused on efficiency and introduced an anchor-free architecture to improve detection accuracy and speed.

Key Features

  • Introduces an anchor-free design that simplifies training and enhances speed.
  • Uses RepVGG-based structures for optimized feature extraction.
  • Applies knowledge distillation and quantization to further improve performance.

Performance

It achieves a higher FPS than YOLOv5 while maintaining competitive accuracy.

Fig 13: Comparison of speed against other YOLO versions.

Impact

It was optimized for edge deployment in industrial scenarios, and was widely used in manufacturing and automation thanks to its low-latency, high-speed inference.

YOLOv7 (2022)

YOLOv7 was developed by WongKinYiu and AlexeyAB as an independent research effort, focusing on balancing speed and accuracy. It introduced efficient reparameterization techniques.

Key Features

  • Introduced Extended Efficient Layer Aggregation Networks (E-ELAN) which improves gradient flow for better training.
  • Enhances inference speed by using reparameterized convolutions.
  • Has multiple model variants applicable for various applications.

Performance

It was faster and more accurate than YOLOv5 and YOLOv6, achieving higher mAP at lower latency than previous YOLO versions.

Impact

It was used in real-time video analytics and robotics due to its high accuracy and efficiency.

YOLOv8 (2023)

YOLOv8 refined previous improvements with a more flexible architecture, optimized for various real-world applications.

Key Features

  • Used a new backbone and head which optimized the model for high accuracy and fast inference.
  • Supports instance segmentation and object tracking.
  • Further improved feature extraction and detection accuracy.
  • Uses an anchor-free detection head, removing the need for anchor-box tuning.
  • Designed for better adaptability across different deployment environments.

Performance

YOLOv8 achieved higher accuracy and better generalization while keeping real-time performance intact. It remains one of the most widely used single-shot object detection models today.

Fig 14: Performance of YOLOv8 against other versions of YOLO.

Impact

It became a popular version of YOLO due to its ease of use and high accuracy, and is commonly used in autonomous vehicles, surveillance, and retail analytics.

YOLOv9 (2024)

YOLOv9 introduced a hybrid anchor-free detection approach, optimizing speed and accuracy. It refined feature aggregation and backbone efficiency for improved small-object detection in real-time applications.

Key Features

  • Used generalized efficient layer aggregation network (GELAN) architecture to improve feature extraction and gradient flow.
  • Optimizes the path aggregation network for better feature fusion across scales.
  • Explores the functionality of multi-level auxiliary information, using different feature pyramids for varied tasks in object detection.

Performance

YOLOv9 demonstrates improved mAP over YOLOv8, with reduced latency, making it suitable for applications requiring swift and accurate object detection. YOLOv9 is also capable of performing object detection, segmentation, and classification tasks.

Fig 15: Comparison of the real-time object detectors.

Impact

Its improved accuracy-efficiency trade-off broadened YOLO's adoption across industries.

YOLOv10 (2024)

YOLOv10 integrated partial self-attention into the YOLO design, boosting performance in complex real-world scenarios. It improved generalization across diverse datasets while reducing computational overhead.

Fig 16: YOLOv10 bounding box predictions.

Key Features

  • Introduces a dual-label assignment system to improve the model's ability to detect and classify objects in real time.
  • Eliminates NMS during inference through consistent dual assignments, reducing end-to-end latency.
  • Lightweight classification head is used to balance accuracy and computational efficiency.
  • Uses partial self attention or PSA modules to improve performance without significantly increasing computational cost.

Performance

YOLOv10 variants exhibit significant improvements over previous versions, achieving up to 54.4% APval with reduced latency. It is optimized for real-time edge computing applications, processing images at up to 1,000 FPS.

Impact

YOLOv10 offers a range of model sizes to accommodate different computational resources and accuracy needs. This efficiency-driven design set new benchmarks for real-time object detection, making it ideal for applications in resource-constrained environments.

YOLOv11 (2024)

YOLOv11 shifts from a purely CNN-based architecture to a transformer-based backbone. It introduces a dynamic head design that improves accuracy with fewer parameters, and it supports tasks such as object detection, segmentation, classification, keypoint detection, and oriented bounding box detection.

Key Features

  • The dynamic head design adapts based on image complexity and optimizes resource allocation.
  • Eliminates the need for Non-Maximum Suppression, reducing inference time.
  • Uses dual label assignment to improve detection in overlapping and densely packed objects.

Performance

YOLOv11 outperforms previous versions in speed and accuracy on the COCO dataset. It processes at 60 FPS with a mean Average Precision (mAP) of 61.5% and fewer parameters, making it suitable for a wide range of applications.

Impact

It utilizes a better neck and backbone architecture, enhancing feature extraction capabilities for more precise object detection. YOLOv11 also expanded object detection use cases, particularly for dense scenes and complex environments.

YOLOv12 (2025)

YOLOv12 integrates attention mechanisms into the YOLO framework. This design combines CNN speed with transformer-based enhancements. 

Key Features

  • Uses an Area Attention module, which divides the feature map into segments to preserve a large receptive field while reducing computational complexity.
  • Addresses optimization challenges introduced by attention mechanisms with residual efficient layer aggregation network (R-ELAN).
  • Introduces FlashAttention into the network to optimize memory access.

Performance

It shows a 25% improvement in detection accuracy in poor lighting, and its multiple-object tracking improves performance in motion-heavy scenarios.

Fig 17: Comparison of YOLOv12 with popular methods.

Impact

Sets a new benchmark in object detection with improved speed and accuracy. This makes YOLOv12 particularly effective in applications such as autonomous driving, security surveillance, and industrial automation.

YOLO Series - Comparison

Take a look at this comparison table.

Table 1: Comparison of YOLO versions (By Author)
| Version | Release Year | Key Features | Performance | Impact |
|---------|--------------|--------------|-------------|--------|
| YOLOv1 | 2015 | Unified architecture for real-time object detection | 63.4% mAP at 45 FPS on PASCAL VOC 2007 | Pioneered real-time object detection with a single neural network |
| YOLOv2 | 2016 | Batch normalization, high-resolution classifiers, and anchor boxes | 76.8% mAP at 67 FPS on PASCAL VOC 2007 | Improved accuracy and speed; expanded applicability |
| YOLOv3 | 2018 | Darknet-53 backbone; multi-scale predictions; feature pyramid networks | 57.9% AP on COCO | Enhanced detection of small objects and improved accuracy |
| YOLOv4 | 2020 | CSPDarknet53 backbone; mosaic data augmentation; self-adversarial training | 43.5% AP at 65 FPS on COCO | Balanced speed and accuracy; widely adopted in industry |
| YOLOv5 | 2020 | Focused on ease of use; auto-learning bounding box anchors | 50.4% AP at 140 FPS on COCO | User-friendly; facilitated deployment in various applications |
| YOLOv6 | 2022 | Optimized for mobile devices; efficient backbone and neck designs | 43.1% AP at 120 FPS on COCO | Enabled real-time detection on edge devices |
| YOLOv7 | 2022 | Extended efficient layer aggregation networks; model scaling techniques | 51.4% AP at 150 FPS on COCO | Achieved state-of-the-art performance; efficient for various tasks |
| YOLOv8 | 2023 | Incorporated transformer layers; adaptive computation for dynamic scenes | 53.9% AP at 160 FPS on COCO | Improved handling of complex scenes and occlusions |
| YOLOv9 | 2024 | Generalized Efficient Layer Aggregation Network (GELAN); Programmable Gradient Information (PGI) | YOLOv9e: 55.6% mAP with 58.1M parameters | Enhanced accuracy and efficiency; suitable for diverse applications |
| YOLOv10 | 2024 | Advanced loss function; variants from nano to extra-large | YOLOv10-S: 46.3% APval with 2.49 ms latency | Reduced latency and parameter count; adaptable to various computational needs |
| YOLOv11 | 2024 | Transformer-based backbone; dynamic head design; NMS-free training | 61.5% mAP at 60 FPS with 40M parameters | Improved speed and accuracy; efficient for real-time applications |
| YOLOv12 | 2025 | Area Attention Module (A2); Residual Efficient Layer Aggregation Networks (R-ELAN); FlashAttention | YOLOv12-Nano: 40.6% mAP with 1.64 ms latency | Combined attention mechanisms with speed; effective in real-time scenarios |

Tools and Frameworks for Implementing YOLO

To train, fine-tune, or run inference with a YOLO model, you will need the right tools. Here are the key libraries, frameworks, and deployment solutions:

Frameworks

PyTorch

It is a fan favorite for good reason: it is flexible, easy to debug, and has great support for GPU acceleration. Most modern YOLO versions are built on PyTorch as well.

You can install it with:

pip install torch torchvision
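
With PyTorch installed, a quick way to try a YOLO model is the Ultralytics YOLOv5 torch.hub entry point, sketched below (it downloads the repo and pretrained COCO weights on first run):

import torch

# Load a pretrained YOLOv5s model via torch.hub (downloads on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Run inference on an image path or URL and print the detections.
results = model("https://ultralytics.com/images/zidane.jpg")
results.print()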

TensorFlow and Keras

YOLOv3 and YOLOv4 have TensorFlow implementations, and TensorFlow Lite (TFLite) makes it easy to deploy on mobile devices.

To get started:

pip install tensorflow
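
As a sketch, running a YOLO model that has already been converted to TFLite might look like this (the file name yolov4.tflite is a placeholder; check your model's actual input shape):

import numpy as np
import tensorflow as tf

# Load a converted YOLO TFLite model (file name is a placeholder).
interpreter = tf.lite.Interpreter(model_path="yolov4.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape; swap in a real image.
dummy = np.random.rand(*input_details[0]["shape"]).astype(np.float32)
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
raw_predictions = interpreter.get_tensor(output_details[0]["index"])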

Darknet

Darknet is where YOLO started. It’s a C-based framework built for speed. While newer YOLO versions have moved to PyTorch, Darknet still supports YOLOv1 through YOLOv4, as well as YOLOv7.

You can start with:

git clone https://github.com/pjreddie/darknet
cd darknet
make
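
Once compiled, you can run detection on a sample image using the commands from the Darknet README (the pretrained weights are downloaded separately):

wget https://pjreddie.com/media/files/yolov3.weights
./darknet detect cfg/yolov3.cfg yolov3.weights data/dog.jpg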

Model Training and Deployment Tools

Ultralytics

It is a PyTorch-based implementation that simplifies training, fine-tuning, and deployment of YOLO models.

pip install ultralytics

To run inference on an image:

from ultralytics import YOLO

# Load a pretrained YOLOv8 nano model (downloads weights on first use).
model = YOLO("yolov8n.pt")

# Run detection; results contain boxes, classes, and confidence scores.
results = model("image.jpg")
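
Fine-tuning on your own data uses the same API. The sketch below trains on coco128.yaml, the small sample dataset that ships with Ultralytics; point it at your own dataset YAML instead:

from ultralytics import YOLO

# Start from pretrained COCO weights and fine-tune briefly.
model = YOLO("yolov8n.pt")
model.train(data="coco128.yaml", epochs=10, imgsz=640)

# Inspect detections: coordinates, confidence, and class id per box.
results = model("image.jpg")
for box in results[0].boxes:
    print(box.xyxy, box.conf, box.cls)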

MMDetection

A PyTorch-based object detection framework developed by OpenMMLab. If you want more customization, then it is a solid choice. It’s modular and great for large-scale training.

pip install mmdet

ONNX Runtime

If you need to run YOLO models on different hardware, ONNX Runtime helps you convert your model so it can run on CPUs, GPUs, or even dedicated AI chips.

import onnxruntime as ort

session = ort.InferenceSession("yolov8.onnx")
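
Expanding that into a runnable sketch: the 640×640 NCHW input below matches a typical Ultralytics export, but read the actual input name and shape from the session rather than assuming them:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("yolov8.onnx")

# Read the input name and shape from the model instead of hard-coding them.
inp = session.get_inputs()[0]
print(inp.name, inp.shape)    # e.g. 'images', [1, 3, 640, 640]

# Dummy NCHW float32 input; replace with a real preprocessed frame.
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = session.run(None, {inp.name: dummy})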

Optimization Tools

  • NVIDIA TensorRT: If you’re using an NVIDIA GPU, TensorRT is a must. It significantly reduces YOLO’s latency.

  • OpenVINO: When using Intel hardware, OpenVINO helps optimize the model for CPUs and edge applications, reducing latency and power consumption (see the export sketch below).
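
If you are working with Ultralytics models, both targets can be produced with a one-line export, sketched here (format names follow the Ultralytics export docs; the TensorRT target requires TensorRT to be installed):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")

# Export for NVIDIA TensorRT (produces a .engine file).
model.export(format="engine")

# Export for Intel OpenVINO (produces an openvino_model directory).
model.export(format="openvino")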


When not to use YOLO

  • High-precision tasks: YOLO trades accuracy for speed; models like Faster R-CNN work better for detailed detections.
  • Small or overlapping objects: Struggles with tiny objects in dense scenes.
  • Complex relationships: Not ideal for tasks needing multi-stage processing or object tracking.
  • Limited hardware: Requires GPUs for real-time performance; MobileNet SSD is better for low-power devices.

Use Cases and Applications of YOLO

YOLO’s combination of speed and accuracy has led to its adoption in a wide range of fields. Here are some prominent use cases:

  • Autonomous Vehicles: Detects pedestrians, traffic signs, and other vehicles in real-time.
  • Surveillance & Security: Enables real-time threat detection in CCTV footage.
  • Retail & Inventory Management: Tracks products and automates checkout systems.
  • Healthcare & Medical Imaging: Assists in detecting abnormalities in X-rays and MRIs.
  • Robotics: Helps robots recognize and interact with objects in dynamic environments.
  • Sports Analytics: Tracks player movements and ball trajectories in live games.
  • Augmented Reality (AR): Enhances AR applications by detecting objects for interactive overlays.

Conclusion

YOLO (You Only Look Once) has come a long way, evolving into one of the fastest and most efficient single-shot object detection models out there. From YOLOv1 to the latest YOLOv12, each version has pushed the boundaries of speed, accuracy, and efficiency.

While YOLO remains a top choice for real-time vision tasks, selecting the right version ensures optimal performance for specific use cases.

Get Started with Lightly

Talk to Lightly’s computer vision team about your use case.
Book a Demo
