YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
Xu Lin, Jinlong Peng, Zhenye Gan, Jiawen Zhu, Jun Liu
2025-12-30
Summary
This paper introduces YOLO-Master, a new approach to real-time object detection that aims to improve both accuracy and speed by intelligently allocating computing power based on how complex an image is.
What's the problem?
Current real-time object detection systems, like those based on YOLO, process every part of an image with the same amount of computation, even when some regions are simple and need very little. This is inefficient: resources are wasted on easy areas while complex scenes are under-served, which hurts both accuracy and overall speed.
What's the solution?
YOLO-Master solves this with a system called an Efficient Sparse Mixture-of-Experts (ES-MoE). Think of it as a team of specialists: the system dynamically decides which specialists (experts) to use for each part of an image, based on how difficult that part is. A 'routing network' learns which experts are best at handling different types of image features and activates only the most relevant ones, saving computational effort. During training, the system is encouraged to develop experts that each have unique skills, making the team as a whole more effective.
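The paper does not include code, so as a rough illustration, here is a minimal sketch of the general top-k sparse Mixture-of-Experts idea described above: a router scores every expert for each input token, and only the k highest-scoring experts are actually run, with their outputs blended by renormalized router weights. All names and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def top_k_moe(x, router_w, expert_ws, k=2):
    """Sketch of sparse top-k MoE routing (illustrative, not the paper's code).

    x:         (tokens, d) input features
    router_w:  (d, n_experts) routing weights
    expert_ws: list of (d, d) per-expert weight matrices
    """
    logits = x @ router_w                        # (tokens, n_experts) router scores
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-k:]         # indices of the k best experts
        gates = np.exp(logits[t, top] - logits[t, top].max())
        gates /= gates.sum()                     # softmax renormalized over top-k
        for g, e in zip(gates, top):
            out[t] += g * (x[t] @ expert_ws[e])  # run only the selected experts
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(8, 4))
experts = [rng.normal(size=(8, 8)) for _ in range(4)]
y = top_k_moe(x, router_w, experts, k=2)
```

Because only k of the experts are evaluated per token, the per-input compute cost stays roughly constant even as the total number of experts (and thus model capacity) grows, which is what makes this kind of conditional computation attractive for real-time detection.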
Why it matters?
This research is important because it demonstrates a way to make object detection systems more efficient and accurate, especially in challenging situations like crowded scenes. By focusing computing power where it's needed most, YOLO-Master achieves better performance than existing methods while still maintaining real-time speed, which is crucial for applications like self-driving cars and robotics.
Abstract
Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources: over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62 ms latency, outperforming YOLOv13-N by +0.8% mAP while running 17.8% faster. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.
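The abstract's diversity-enhancing objective is not specified here; a widely used proxy for keeping a sparse router from collapsing onto a few experts is the load-balancing auxiliary loss from Shazeer-style MoE training, sketched below under that assumption. All names are illustrative; this is not the paper's actual objective.

```python
import numpy as np

def load_balance_loss(gate_probs, top_k_idx, n_experts):
    """Sketch of a standard MoE load-balancing auxiliary loss (assumed, not the
    paper's diversity objective).

    gate_probs: (tokens, n_experts) softmax router probabilities
    top_k_idx:  (tokens, k) expert indices actually selected per token
    """
    # Fraction of routing decisions dispatched to each expert.
    dispatch = np.zeros(n_experts)
    for idx in top_k_idx.ravel():
        dispatch[idx] += 1.0
    dispatch /= top_k_idx.size
    # Mean routing probability assigned to each expert.
    importance = gate_probs.mean(axis=0)
    # Scaled dot product; minimized when both distributions are uniform,
    # i.e. when tokens are spread evenly across experts.
    return n_experts * float(np.dot(dispatch, importance))

# Perfectly balanced case over 2 experts: loss equals 1.0.
probs = np.full((4, 2), 0.5)
picks = np.array([[0], [1], [0], [1]])
loss = load_balance_loss(probs, picks, n_experts=2)  # → 1.0
```

Adding a term like this to the detection loss gives the router a gradient signal to distribute inputs across experts, which is one plausible mechanism for the "complementary expertise" the abstract describes.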