YOLOv12: Attention-Centric Real-Time Object Detectors
Yunjie Tian, Qixiang Ye, David Doermann
2025-02-19

Summary
This paper introduces YOLOv12, a new version of the YOLO object detection system that uses attention mechanisms to improve accuracy while maintaining fast processing speeds. It's like giving the AI a better way to focus on the important parts of an image without slowing it down.
What's the problem?
Previous versions of YOLO relied on convolutional neural networks (CNNs) to detect objects in images. CNNs are fast, but they mainly capture local patterns and are not as good at modeling the whole image as newer attention-based models. Attention models, however, have usually been too slow for real-time object detection, which is crucial for applications like self-driving cars or surveillance cameras.
What's the solution?
The researchers created YOLOv12, which combines the speed of CNNs with the modeling power of attention mechanisms. They designed an 'attention-centric' architecture that can focus on important parts of the image without the usual speed penalty. This allows YOLOv12 to run as fast as earlier versions while being more accurate at detecting objects.
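The core idea of "focusing" on relevant regions can be illustrated with plain scaled dot-product attention over flattened image features, where every spatial position weighs its similarity to every other position. This is a generic, unoptimized sketch for intuition only, not YOLOv12's actual speed-engineered attention design; all names here are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Generic attention: each spatial token attends to all others.

    q, k, v: arrays of shape (num_tokens, dim). Illustrative only --
    not the paper's optimized attention architecture.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (N, N) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                            # each token = weighted mix of all tokens

# Toy example: an 8x8 feature map with 16 channels, flattened to 64 tokens.
rng = np.random.default_rng(0)
feats = rng.standard_normal((64, 16))
out = scaled_dot_product_attention(feats, feats, feats)
print(out.shape)  # (64, 16): same shape, but each position now "sees" the whole map
```

The quadratic (N, N) score matrix is exactly why naive attention is slow on high-resolution feature maps; the paper's contribution is restructuring this computation so it stays real-time.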
Why it matters?
This matters because it makes real-time object detection both faster and more accurate. YOLOv12 outperforms other popular object detectors, which could lead to improvements in many areas that use computer vision. For example, self-driving cars could better identify obstacles, security cameras could more accurately detect suspicious activities, and augmented reality apps could interact with the real world more precisely. All of this could happen without needing more powerful computers, making advanced AI more accessible and efficient.
Abstract
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.